Drill in 10 Minutes

Objective

Use Apache Drill to query sample data in 10 minutes. For simplicity, you’ll run Drill in embedded mode rather than distributed mode to try out Drill without having to perform any setup tasks.

A Few Bits About Apache Drill

Drill is a clustered, powerful MPP (Massively Parallel Processing) query engine for Hadoop that can process petabytes of data, fast. Drill is useful for short, interactive ad-hoc queries on large-scale data sets. Drill is capable of querying nested data in formats like JSON and Parquet and performing dynamic schema discovery. Drill does not require a centralized metadata repository.

Dynamic schema discovery

Drill does not require schema or type specification for data in order to start the query execution process. Drill starts data processing in record-batches and discovers the schema during processing. Self-describing data formats such as Parquet, JSON, AVRO, and NoSQL databases have schema specified as part of the data itself, which Drill leverages dynamically at query time. Because schema can change over the course of a Drill query, all Drill operators are designed to reconfigure themselves when schemas change.

Flexible data model

Drill allows access to nested data attributes, just like SQL columns, and provides intuitive extensions to easily operate on them. From an architectural point of view, Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models. Drill allows for efficient processing of these models without the need to flatten or materialize them at design time or at execution time. Relational data in Drill is treated as a special or simplified case of complex/multi-structured data.

De-centralized metadata

Drill does not have a centralized metadata requirement. You do not need to create and manage tables and views in a metadata repository, or rely on a database administrator group for such a function. Drill metadata is derived from the storage plugins that correspond to data sources. Storage plugins provide a spectrum of metadata ranging from full metadata (Hive), partial metadata (HBase), or no central metadata (files). De-centralized metadata means that Drill is NOT tied to a single Hive repository. You can query multiple Hive repositories at once and then combine the data with information from HBase tables or with a file in a distributed file system. You can also use SQL DDL syntax to create metadata within Drill, which gets organized just like a traditional database. Drill metadata is accessible through the ANSI standard INFORMATION_SCHEMA database.

Extensibility

Drill provides an extensible architecture at all layers, including the storage plugin, query, query optimization/execution, and client API layers. You can customize any layer for the specific needs of an organization or you can extend the layer to a broader array of use cases. Drill provides a built in classpath scanning and plugin concept to add additional storage plugins, functions, and operators with minimal configuration.

Installation Overview

You can install Drill in embedded mode on a machine running Linux, Mac OS X, or Windows. For information about running Drill in distributed mode, see Deploying Drill in a Cluster.

This installation procedure includes how to download the Apache Drill archive and extract the contents to a directory on your machine. The Apache Drill archive contains sample JSON and Parquet files that you can query immediately.

After installing Drill, you start SQLLine. SQLLine is a pure-Java console-based utility for connecting to relational databases and executing SQL commands. SQLLine is used as the shell for Drill. Drill follows the ANSI SQL: 2011 standard with extensions for nested data formats and other capabilities.

Embedded Mode Installation Prerequisites

You need to meet the following prerequisites to run Drill:

Linux, Mac OS X, and Windows: Oracle Java SE Development (JDK) Kit 7 installation
Windows only:
- A JAVA_HOME environment variable set up that points to to the JDK installation
- A PATH environment variable that includes a pointer to the JDK installation
- A third-party utility for unzipping a tar.gz file

Java Installation Prerequisite Check

Run the following command in a terminal (Linux and Mac OS X) or Command Prompt (Windows) to verify that Java 7 is the version in effect:

java -version

The output looks something like this:

java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_7965-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

Install Drill on Linux or Mac OS X

Complete the following steps to install Drill:

Issue the following command in a terminal to download the latest, stable version of Apache Drill to a directory on your machine, or download Drill from the Drill web site:
```
wget http://getdrill.org/drill/download/apache-drill-0.8.0.tar.gz  
```
Copy the downloaded file to the directory where you want to install Drill.
Extract the contents of the Drill tar.gz file. Use sudo if necessary:
```
sudo tar -xvzf apache-drill-0.8.0..tar.gz  
```

The extraction process creates the installation directory named apache-drill-0.8.0 containing the Drill software.

At this point, you can start Drill.

Start Drill on Linux and Mac OS X

Launch SQLLine using the sqlline command to start to Drill in embedded mode. The command directs SQLLine to connect to Drill. The zk=local means the local node is the zookeeper node. Complete the following steps to launch SQLLine and start Drill:

Navigate to the Drill installation directory. For example:
```
cd apache-drill-0.8.0  
```
Issue the following command to launch SQLLine:
```
bin/sqlline -u jdbc:drill:zk=local  
```

The 0: jdbc:drill:zk=local> prompt appears.

At this point, you can submit queries to Drill.

Install Drill on Windows

You can install Drill on Windows 7 or 8. First, set the JAVA_HOME environment variable, and then install Drill. Complete the following steps to install Drill:

Click the following link to download the latest, stable version of Apache Drill: http://getdrill.org/drill/download/apache-drill-0.8.0.tar.gz
Move the apache-drill-0.8.0.tar.gz file to a directory where you want to install Drill.
Unzip the TAR.GZ file using a third-party tool. If the tool you use does not unzip the TAR file as well as the TAR.GZ file, unzip the apache-drill-0.8.0.tar to extract the Drill software. The extraction process creates the installation directory named apache-drill-0.8.0 containing the Drill software. For example: At this point, you can start Drill.

Start Drill on Windows

Launch SQLLine using the sqlline command to start to Drill in embedded mode. The command directs SQLLine to connect to Drill. The zk=local means the local node is the zookeeper node. Complete the following steps to launch SQLLine and start Drill:

Open the apache-drill-0.8.0 folder.
Open the bin folder, and double-click the sqlline.bat file: The Windows command prompt opens.
At the sqlline> prompt, type !connect jdbc:drill:zk=local and then press Enter:
Enter the username, admin, and password, also admin when prompted. The 0: jdbc:drill:zk=local> prompt appears. At this point, you can submit queries to Drill.

Stopping Drill

Issue the following command when you want to exit SQLLine:

!quit

Query Sample Data

Your Drill installation includes a sample-date directory with JSON and Parquet files that you can query. The local file system on your machine is configured as the dfs storage plugin instance by default when you install Drill in embedded mode. For more information about storage plugin configuration, refer to Storage Plugin Registration.

Use SQL syntax to query the sample JSON and Parquet files in the sample-data directory on your local file system.

Querying a JSON File

A sample JSON file, employee.json, contains fictitious employee data.

To view the data in the employee.json file, submit the following SQL query to Drill:

0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json`;

The query returns the following results:

Example of partial output

+-------------+------------+------------+------------+-------------+-----------+
| employee_id | full_name  | first_name | last_name  | position_id | position_ |
+-------------+------------+------------+------------+-------------+-----------+
| 1101        | Steve Eurich | Steve      | Eurich         | 16          | Store T |
| 1102        | Mary Pierson | Mary       | Pierson    | 16          | Store T |
| 1103        | Leo Jones  | Leo        | Jones      | 16          | Store Tem |
| 1104        | Nancy Beatty | Nancy      | Beatty     | 16          | Store T |
| 1105        | Clara McNight | Clara      | McNight    | 16          | Store  |
| 1106        | Marcella Isaacs | Marcella   | Isaacs     | 17          | Stor |
| 1107        | Charlotte Yonce | Charlotte  | Yonce      | 17          | Stor |
| 1108        | Benjamin Foster | Benjamin   | Foster     | 17          | Stor |
| 1109        | John Reed  | John       | Reed       | 17          | Store Per |
| 1110        | Lynn Kwiatkowski | Lynn       | Kwiatkowski | 17          | St |
| 1111        | Donald Vann | Donald     | Vann       | 17          | Store Pe |
| 1112        | William Smith | William    | Smith      | 17          | Store  |
| 1113        | Amy Hensley | Amy        | Hensley    | 17          | Store Pe |
| 1114        | Judy Owens | Judy       | Owens      | 17          | Store Per |
| 1115        | Frederick Castillo | Frederick  | Castillo   | 17          | S |
| 1116        | Phil Munoz | Phil       | Munoz      | 17          | Store Per |
| 1117        | Lori Lightfoot | Lori       | Lightfoot  | 17          | Store |
+-------------+------------+------------+------------+-------------+-----------+
1,155 rows selected (0.762 seconds)
0: jdbc:drill:zk=local>

Querying a Parquet File

Query the region.parquet and nation.parquet files in the sample-data directory on your local file system.

Region File

If you followed the Apache Drill in 10 Minutes instructions to install Drill in embedded mode, the path to the parquet file varies between operating systems.

Note

When you enter the query, include the version of Drill that you are currently running.

To view the data in the region.parquet file, issue the query appropriate for your operating system:

Linux

SELECT * FROM dfs.`/opt/drill/apache-drill-<version>/sample-data/region.parquet`;

Mac OS X

SELECT * FROM dfs.`/Users/max/drill/apache-drill-<version>/sample-data/region.parquet`;

Windows

SELECT * FROM dfs.`C:\drill\apache-drill-<version>\sample-data\region.parquet`;

The query returns the following results:

+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| AFRICA     | lar deposits. blithely final packages cajole. regular waters ar |
| AMERICA    | hs use ironic, even requests. s |
| ASIA       | ges. thinly even pinto beans ca |
| EUROPE     | ly final courts cajole furiously final excuse |
| MIDDLE EAST | uickly special accounts cajole carefully blithely close reques |
+------------+------------+
5 rows selected (0.165 seconds)
0: jdbc:drill:zk=local>

Nation File

If you followed the Apache Drill in 10 Minutes instructions to install Drill in embedded mode, the path to the parquet file varies between operating systems.

Note: When you enter the query, include the version of Drill that you are currently running.

To view the data in the nation.parquet file, issue the query appropriate for your operating system:

Linux

  SELECT * FROM dfs.`/opt/drill/apache-drill-<version>/sample-data/nation.parquet`;

Mac OS X

  SELECT * FROM dfs.`/Users/max/drill/apache-drill-<version>/sample-data/nation.parquet`;

Windows

  SELECT * FROM dfs.`C:\drill\apache-drill-<version>\sample-data\nation.parquet`;

The query returns the following results:

Summary

Now you know a bit about Apache Drill. To summarize, you have completed the following tasks:

Learned that Apache Drill supports nested data, schema-less execution, and decentralized metadata.
Downloaded and installed Apache Drill.
Invoked SQLLine with Drill in embedded mode.
Queried the sample JSON file, employee.json, to view its data.
Queried the sample region.parquet file to view its data.
Queried the sample nation.parquet file to view its data.

Next Steps

Now that you have an idea about what Drill can do, you might want to:

More Information

For more information about Apache Drill, explore the Apache Drill web site.

← Tutorials IntroductionAnalyzing the Yelp Academic Dataset →