Use Apache Drill to query sample data in 10 minutes. For simplicity, you’ll run Drill in embedded mode rather than distributed mode to try out Drill without having to perform any setup tasks.
Drill is a clustered, powerful MPP (Massively Parallel Processing) query engine for Hadoop that can process petabytes of data, fast. Drill is useful for short, interactive ad-hoc queries on large-scale data sets. Drill is capable of querying nested data in formats like JSON and Parquet and performing dynamic schema discovery. Drill does not require a centralized metadata repository.
Drill does not require schema or type specification for data in order to start the query execution process. Drill starts data processing in record-batches and discovers the schema during processing. Self-describing data formats such as Parquet, JSON, AVRO, and NoSQL databases have schema specified as part of the data itself, which Drill leverages dynamically at query time. Because schema can change over the course of a Drill query, all Drill operators are designed to reconfigure themselves when schemas change.
Drill allows access to nested data attributes, just like SQL columns, and provides intuitive extensions to easily operate on them. From an architectural point of view, Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models. Drill allows for efficient processing of these models without the need to flatten or materialize them at design time or at execution time. Relational data in Drill is treated as a special or simplified case of complex/multi-structured data.
Drill does not have a centralized metadata requirement. You do not need to create and manage tables and views in a metadata repository, or rely on a database administrator group for such a function. Drill metadata is derived from the storage plugins that correspond to data sources. Storage plugins provide a spectrum of metadata, ranging from full metadata (Hive) to partial metadata (HBase) to no central metadata (files). Decentralized metadata means that Drill is not tied to a single Hive repository. You can query multiple Hive repositories at once and then combine the data with information from HBase tables or with a file in a distributed file system. You can also use SQL DDL syntax to create metadata within Drill, which is organized just like a traditional database. Drill metadata is accessible through the ANSI standard INFORMATION_SCHEMA database.
Drill provides an extensible architecture at all layers, including the storage plugin, query, query optimization/execution, and client API layers. You can customize any layer for the specific needs of an organization, or you can extend the layer to a broader array of use cases. Drill provides a built-in classpath scanning and plugin concept to add storage plugins, functions, and operators with minimal configuration.
Download the Apache Drill archive and extract the contents to a directory on your machine. The Apache Drill archive contains sample JSON and Parquet files that you can query immediately.
Query the sample JSON and Parquet files using SQLLine. SQLLine is a pure-Java console-based utility for connecting to relational databases and executing SQL commands. SQLLine is used as the shell for Drill. Drill follows the ANSI SQL:2011 standard with a few extensions for nested data formats.
You must have the following software installed on your machine to run Drill:
Software | Description |
Oracle JDK version 7 | A set of programming tools for developing Java applications. |
Run the following command to verify that the system meets the software prerequisite:
Command | Example Output |
java -version | java version "1.7.0_65" Java(TM) SE Runtime Environment (build 1.7.0_65-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) |
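The prerequisite check above can also be scripted. The following is a minimal sketch, not part of the Drill distribution: the function names java_version_prefix and meets_drill_prereq are hypothetical, and the sketch assumes the first version line has the same shape as the example output shown in the table.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Drill.

java_version_prefix() {
  # Reads `java -version` style output on stdin and prints the
  # major.minor part of the first quoted version string, e.g. 1.7
  sed -n 's/.*version "\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

meets_drill_prereq() {
  # Succeeds only when the version read from stdin is 1.7 (JDK 7)
  [ "$(java_version_prefix)" = "1.7" ]
}

# Usage (java -version writes to stderr, so redirect it):
#   java -version 2>&1 | meets_drill_prereq && echo "JDK 7 found"
```

The check is deliberately strict about JDK 7 because that is the only version this document lists as a prerequisite.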
You can install Drill on a machine running Linux, Mac OS X, or Windows.
Complete the following steps to install Drill:
Issue the following command to download the latest, stable version of Apache Drill to a directory on your machine:
wget http://getdrill.org/drill/download/apache-drill-0.8.0.tar.gz
Issue the following command to create a new directory to which you can extract the contents of the Drill tar.gz file:
sudo mkdir -p /opt/drill
Navigate to the directory where you downloaded the Drill tar.gz file.
Issue the following command to extract the contents of the Drill tar.gz file:
sudo tar -xvzf apache-drill-<version>.tar.gz -C /opt/drill
Issue the following command to navigate to the Drill installation directory:
cd /opt/drill/apache-drill-<version>
At this point, you can start Drill.
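The Linux steps above can be collected into one script. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes the download URL follows the same pattern as the 0.8.0 link for whatever version you pass in. Nothing runs until you call install_drill_linux yourself.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the URL pattern is
# assumed to match the 0.8.0 download link given in this document.

drill_download_url() {
  # $1: Drill version, for example 0.8.0
  printf 'http://getdrill.org/drill/download/apache-drill-%s.tar.gz\n' "$1"
}

install_drill_linux() {
  # $1: Drill version; $2: install root (defaults to /opt/drill)
  version="$1"
  root="${2:-/opt/drill}"
  archive="apache-drill-${version}.tar.gz"

  # Download into the current directory, then extract into the install root
  wget "$(drill_download_url "$version")" || return 1
  sudo mkdir -p "$root" || return 1
  sudo tar -xvzf "$archive" -C "$root" || return 1
  echo "Drill extracted to ${root}/apache-drill-${version}"
}

# Usage (not run automatically):
#   install_drill_linux 0.8.0 && cd /opt/drill/apache-drill-0.8.0
```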
Complete the following steps to install Drill:
Open a Terminal window, and create a drill directory inside your home directory (or in some other location if you prefer).
Example
$ pwd
/Users/max
$ mkdir drill
$ cd drill
$ pwd
/Users/max/drill
Click the following link to download the latest, stable version of Apache Drill:
http://getdrill.org/drill/download/apache-drill-0.8.0.tar.gz
Open the downloaded TAR file with the Mac Archive utility or a similar tool for unzipping files.
Move the resulting apache-drill-<version> folder into the drill directory that you created.
Issue the following command to navigate to the apache-drill-<version> directory:
cd /Users/max/drill/apache-drill-<version>
At this point, you can start Drill.
You can install Drill on Windows 7 or 8. To install Drill on Windows, you must have JDK 7, and you must set the JAVA_HOME path in the Windows Environment Variables. You must also have a utility, such as 7-zip, installed on your machine. These instructions assume that the 7-zip decompression utility is installed to extract a Drill archive file that you download.
Complete the following steps to set JAVA_HOME:
Go to Control Panel\All Control Panel Items\System, and select Advanced System Settings. The System Properties window appears.
Add or edit JAVA_HOME to point to the location where the JDK software is located.
Example
C:\Program Files\Java\jdk1.7.0_65
Click OK to exit the windows.
Complete the following steps to install Drill:
Create a drill directory on your C:\ drive (or in some other location if you prefer).
Example
C:\drill
Do not include spaces in your directory path. If you include spaces in the directory path, Drill fails to run.
Click the following link to download the latest, stable version of Apache Drill: http://getdrill.org/drill/download/apache-drill-0.8.0.tar.gz
Move the apache-drill-<version>.tar.gz file to the drill directory that you created on your C:\ drive.
Unzip the TAR.GZ file and the resulting TAR file.
Right-click apache-drill-<version>.tar.gz, and select 7-Zip>Extract Here. The utility extracts the apache-drill-<version>.tar file.
Right-click apache-drill-<version>.tar, and select 7-Zip>Extract Here. The utility extracts the apache-drill-<version> folder.
Open the apache-drill-<version> folder.
Open the bin folder, and double-click the sqlline.bat file. The Windows command prompt opens.
At the sqlline> prompt, type !connect jdbc:drill:zk=local and then press Enter.
Enter the username and password.
When prompted for the username, type admin and then press Enter.
When prompted for the password, type admin and then press Enter. The cursor blinks for a few seconds and then 0: jdbc:drill:zk=local> displays in the prompt.
At this point, you can submit queries to Drill. Refer to the Query Sample Data section of this document.
Launch SQLLine, the Drill shell, to start and run Drill in embedded mode. Launching SQLLine automatically starts a new Drillbit within the shell. In a production environment, Drillbits are the daemon processes that run on each node in a Drill cluster.
Complete the following steps to launch SQLLine and start Drill:
Verify that you are in the Drill installation directory.
Example: ~/apache-drill-<version>
Issue the following command to launch SQLLine:
bin/sqlline -u jdbc:drill:zk=local
The -u option passes a JDBC connection string that directs SQLLine to connect to Drill. It also starts a local Drillbit. If you are connecting to an Apache Drill cluster, the value of zk= would be a list of ZooKeeper quorum nodes. For more information about how to run Drill in clustered mode, go to Deploying Apache Drill in a Clustered Environment.
When SQLLine starts, the system displays the following prompt:
0: jdbc:drill:zk=local>
Issue the following command when you want to exit SQLLine:
!quit
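The connection string used in the launch step has two forms: zk=local for embedded mode, and a list of ZooKeeper quorum nodes for a cluster. The sketch below builds either form. The function name drill_jdbc_url and the host names in the usage note are hypothetical, and a real cluster may need additional path components that are not covered in this document.

```shell
#!/bin/sh
# Sketch only: drill_jdbc_url is a hypothetical helper, not part of Drill.

drill_jdbc_url() {
  # With no arguments, return the embedded-mode connection string.
  if [ "$#" -eq 0 ]; then
    echo "jdbc:drill:zk=local"
    return 0
  fi
  # Otherwise, join the given ZooKeeper host:port arguments with commas.
  quorum="$1"
  shift
  for node in "$@"; do
    quorum="${quorum},${node}"
  done
  echo "jdbc:drill:zk=${quorum}"
}

# Usage (host names are placeholders):
#   bin/sqlline -u "$(drill_jdbc_url)"
#   bin/sqlline -u "$(drill_jdbc_url zk1.example.com:2181 zk2.example.com:2181)"
```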
Your Drill installation includes a sample-data directory with JSON and Parquet files that you can query. The local file system on your machine is configured as the dfs storage plugin instance by default when you install Drill in embedded mode. For more information about storage plugin configuration, refer to Storage Plugin Registration.
Use SQL syntax to query the sample JSON and Parquet files in the sample-data directory on your local file system.
A sample JSON file, employee.json, contains fictitious employee data.
To view the data in the employee.json file, submit the following SQL query to Drill:
0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json`;
The query returns the following results:
Example of partial output
+-------------+------------+------------+------------+-------------+-----------+
| employee_id | full_name | first_name | last_name | position_id | position_ |
+-------------+------------+------------+------------+-------------+-----------+
| 1101 | Steve Eurich | Steve | Eurich | 16 | Store T |
| 1102 | Mary Pierson | Mary | Pierson | 16 | Store T |
| 1103 | Leo Jones | Leo | Jones | 16 | Store Tem |
| 1104 | Nancy Beatty | Nancy | Beatty | 16 | Store T |
| 1105 | Clara McNight | Clara | McNight | 16 | Store |
| 1106 | Marcella Isaacs | Marcella | Isaacs | 17 | Stor |
| 1107 | Charlotte Yonce | Charlotte | Yonce | 17 | Stor |
| 1108 | Benjamin Foster | Benjamin | Foster | 17 | Stor |
| 1109 | John Reed | John | Reed | 17 | Store Per |
| 1110 | Lynn Kwiatkowski | Lynn | Kwiatkowski | 17 | St |
| 1111 | Donald Vann | Donald | Vann | 17 | Store Pe |
| 1112 | William Smith | William | Smith | 17 | Store |
| 1113 | Amy Hensley | Amy | Hensley | 17 | Store Pe |
| 1114 | Judy Owens | Judy | Owens | 17 | Store Per |
| 1115 | Frederick Castillo | Frederick | Castillo | 17 | S |
| 1116 | Phil Munoz | Phil | Munoz | 17 | Store Per |
| 1117 | Lori Lightfoot | Lori | Lightfoot | 17 | Store |
+-------------+------------+------------+------------+-------------+-----------+
1,155 rows selected (0.762 seconds)
0: jdbc:drill:zk=local>
Query the region.parquet and nation.parquet files in the sample-data directory on your local file system.
If you followed the Apache Drill in 10 Minutes instructions to install Drill in embedded mode, the path to the parquet file varies between operating systems.
Note: When you enter the query, include the version of Drill that you are currently running.
To view the data in the region.parquet file, issue the query appropriate for your operating system:
Linux
SELECT * FROM dfs.`/opt/drill/apache-drill-<version>/sample-data/region.parquet`;
Mac OS X
SELECT * FROM dfs.`/Users/max/drill/apache-drill-<version>/sample-data/region.parquet`;
Windows
SELECT * FROM dfs.`C:\drill\apache-drill-<version>\sample-data\region.parquet`;
The query returns the following results:
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| AFRICA | lar deposits. blithely final packages cajole. regular waters ar |
| AMERICA | hs use ironic, even requests. s |
| ASIA | ges. thinly even pinto beans ca |
| EUROPE | ly final courts cajole furiously final excuse |
| MIDDLE EAST | uickly special accounts cajole carefully blithely close reques |
+------------+------------+
5 rows selected (0.165 seconds)
0: jdbc:drill:zk=local>
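Because the path in the query depends on both the operating system and the Drill version, it can help to generate the statement rather than type it. The following sketch prints the query for a given platform, version, and sample file. The function name sample_data_query is hypothetical, and the install locations are the example paths used in this document (/opt/drill, /Users/max/drill, and C:\drill); adjust them if you installed Drill elsewhere.

```shell
#!/bin/sh
# Sketch only: sample_data_query is a hypothetical helper. The paths
# below are the example install locations used in this tutorial.

sample_data_query() {
  # $1: linux | mac | windows
  # $2: Drill version, for example 0.8.0
  # $3: sample file name, for example region.parquet
  case "$1" in
    linux)   path="/opt/drill/apache-drill-$2/sample-data/$3" ;;
    mac)     path="/Users/max/drill/apache-drill-$2/sample-data/$3" ;;
    windows) path="C:\\drill\\apache-drill-$2\\sample-data\\$3" ;;
    *)       echo "unknown platform: $1" >&2; return 1 ;;
  esac
  # Wrap the path in backticks, as the dfs queries in this document do
  printf 'SELECT * FROM dfs.`%s`;\n' "$path"
}

# Usage:
#   sample_data_query linux 0.8.0 region.parquet
#   sample_data_query mac 0.8.0 nation.parquet
```

You can paste the printed statement at the 0: jdbc:drill:zk=local> prompt.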
If you followed the Apache Drill in 10 Minutes instructions to install Drill in embedded mode, the path to the parquet file varies between operating systems.
Note: When you enter the query, include the version of Drill that you are currently running.
To view the data in the nation.parquet file, issue the query appropriate for your operating system:
Linux
SELECT * FROM dfs.`/opt/drill/apache-drill-<version>/sample-data/nation.parquet`;
Mac OS X
SELECT * FROM dfs.`/Users/max/drill/apache-drill-<version>/sample-data/nation.parquet`;
Windows
SELECT * FROM dfs.`C:\drill\apache-drill-<version>\sample-data\nation.parquet`;
The query returns the following results:
Now you know a bit about Apache Drill. To summarize, you have completed the following tasks:
Queried the sample JSON file, employee.json, to view its data.
Queried the region.parquet file to view its data.
Queried the nation.parquet file to view its data.
Now that you have an idea about what Drill can do, you might want to explore the Apache Drill web site for more information about Apache Drill.