How to Run the Drill Demo
This section describes how to get started by running the Drill demo.
Pre-requisites
Maven 2 or higher
On Ubuntu, you can do this as root:
apt-get install maven2
On the Mac, maven is pre-installed.
Note that installing maven can result in installing java 1.6 and setting that to your default version. Make sure you check java version before compiling or running.
Java 1.7
You will need java 1.7 to compile and run the Drill demo.
On Ubuntu you can get the right version of Java by doing this as root:
apt-get install openjdk-7-jdk sudo update-alternatives --set java $(update-alternatives --list java | grep 7 | head -1)
On a Mac, go to Oracle's web- site to download and install java 7. You will also need to set JAVA_HOME in order to use the right version of java.
Drill will not compile correctly using java 6. There is also a subtle problem that can occur if you have both
java 6 and java 7 with the default version set to 6. In that case, you may be able to compile, but execution may not work correctly.Send email to the dev list if this is a problem for you.
Protobuf
Drill requires Protobuf 2.5. Install this on Ubuntu using:
apt-get install protobuf-compiler
On Centos 6.4, OEL or RHEL you will need to compile protobuf-compiler:
wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2 tar xfj protobuf-2.5.0.tar.bz2 pushd protobuf-2.5.0 ./configure make sudo make install
git
On Ubuntu you can install git by doing this as root:
apt-get install git-all
On the Mac or Windows, go to this site to download and install git.
Check your installation versions
Run
java -version
mvn -version
Verify that your default java and maven versions are correct, and that maven runs the right version of java. On my Mac, you see something like this:
ted:apache-drill-1.0.0-m1$ java -version
java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
ted:apache-drill-1.0.0-m1$ mvn -version
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.7.0_11, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.7.5", arch: "x86_64", family: "mac"
ted:apache-drill-1.0.0-m1$
Get the Source
git clone https://git-wip-us.apache.org/repos/asf/incubator-drill.git
Compile the Code
cd incubator-drill/sandbox/prototype
mvn clean install -DskipTests
rm .classpath
This takes about a minute on a not-terribly-current MacBook.
Run the interactive Drill shell
./sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
The first time you run this program, you will get reams of output. What is
happening is that the program is running maven in order to build a
(voluminous) class path for the actual program and stores this classpath into
the file called .classpath
. When you run this program again, it will note
that this file already exists and avoid re-creating it. You should delete this
file every time the dependencies of Drill are changed. If you start getting
"class not found" errors, that is a good hint that .classpath
is out of date
and needs to be deleted and recreated.
The -u
argument to sqlline is a
JDBC connection string that directs sqlline to connect to drill. The Drill
JDBC driver currently includes enough smarts to run Drill in embedded mode so
this command also effectively starts a local drill bit. The schema=
part of
the JDBC connection string causes Drill to consider the "parquet-local"
storage engine to be default. Other storage engines can be specified. The list
of supported storage engines can be found in the file
./sqlparser/src/main/resources/storage-engines.json
. Each storage engine
specifies the format of the data and how to get the data. See the section
below on "Storage Engines" for more detail.
When you run sqlline, you should see something like this after quite a lot of log messages:
Connected to: Drill (version 1.0)
Driver: Apache Drill JDBC Driver (version 1.0)
Autocommit status: true
Transaction isolation: TRANSACTION_REPEATABLE_READ
sqlline version ??? by Marc Prud'hommeaux
0: jdbc:drill:schema=parquet-local>
Tip: To quit sqlline at any time, type "!quit" at the prompt.
0: jdbc:drill:schema=parquet-local> !quit
Run a Query
Once you have sqlline running you can now try out some queries:
select * from "sample-data/region.parquet";
You should see a number of debug messages and then something like this:
+--------------------------------------------------------------------------------------------------------------------------------------------+
| _MAP |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| {"R_REGIONKEY":0,"R_NAME":"AFRICA","R_COMMENT":"lar deposits. blithely final packages cajole. regular waters are final requests. regular a |
| {"R_REGIONKEY":1,"R_NAME":"AMERICA","R_COMMENT":"hs use ironic, even requests. s"} |
| {"R_REGIONKEY":2,"R_NAME":"ASIA","R_COMMENT":"ges. thinly even pinto beans ca"} |
| {"R_REGIONKEY":3,"R_NAME":"EUROPE","R_COMMENT":"ly final courts cajole furiously final excuse"} |
| {"R_REGIONKEY":4,"R_NAME":"MIDDLE EAST","R_COMMENT":"uickly special accounts cajole carefully blithely close requests. carefully final asy |
+--------------------------------------------------------------------------------------------------------------------------------------------+
5 rows selected (1.103 seconds)
Drill has no idea what the structure of this
file is in terms of what fields exist, but Drill does know that every record
has a pseudo-field called _MAP
. This field contains a map of all of the
actual fields to values. When returned via JDBC, this fields is rendered as
JSON since JDBC doesn't really understand maps.
This can be made more readable by using a query like this:
select _MAP['R_REGIONKEY'] as region_key, _MAP['R_NAME'] as name, _MAP['R_COMMENT'] as comment
from "sample-data/region.parquet";
The output will look something like this:
+-------------+--------------+---------------------------------------------------------------------------------------------------------------+
| REGION_KEY | NAME | COMMENT |
+-------------+--------------+---------------------------------------------------------------------------------------------------------------+
| 0 | AFRICA | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are accordi |
| 1 | AMERICA | hs use ironic, even requests. s |
| 2 | ASIA | ges. thinly even pinto beans ca |
| 3 | EUROPE | ly final courts cajole furiously final excuse |
| 4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl |
+-------------+--------------+---------------------------------------------------------------------------------------------------------------+
In upcoming versions, Drill will insert the _MAP[ ... ]
goo and will also
unwrap the contents of _MAP
in results so that things seem much more like
ordinary SQL. The reason that things work this way now is that SQL itself
requires considerable type information for queries to be parsed and that
information doesn't necessarily exist for all kinds of files, especially those
with very flexible schemas. To avoid all these problems, Drill adopts the
convention of the _MAP fields for all kinds of input.
A Note Before You Continue
Drill currently supports a wide variety of queries. It currently also has a fair number of deficiencies in terms of the number of operators that are actually supported and exactly which expressions are passed through to the execution engine in the correct form for execution.
These problems fall into roughly three categories,
missing operators. Many operators have been implemented for only a subset of the types available in Drill. This will cause queries to work for some types of data, but not for others. This is particularly true for operators with many possible type signatures such as comparisons. This lack is being remedied at a fast pace so check back in frequently if you suspect this might be a problem.
Missing operators will result in error messages like this:
UnsupportedOperationException:[ Missing function implementation: compare_to (BIT-OPTIONAL, BIT-OPTIONAL) ]
missing casts. The SQL parser currently has trouble producing a valid logical plan without sufficient type information. Ironically, this type information is often not necessary to the execution engine because Drill generates the code on the fly based on the types of the data it encounters as data are processed. Currently, the work-around is to cast fields in certain situations to give the parser enough information to proceed. This problem will be remedied soon, but probably not quite as quickly as the missing operators.
The typical error message that indicates you need an additional cast looks like
Cannot apply '>' to arguments of type '<ANY> > <CHAR(1)>'. Supported form(s): '<COMPARABLE_TYPE> > <COMPARABLE_TYPE>'
- weak optimizer. The current optimizer that transforms the logical plan into a physical plan is not the fully-featured cost based optimizer that Optiq normally uses. This is because some of the transformations that are needed for Drill are not yet fully supported by Optiq. In order to allow end-to-end execution of queries, a deterministic peep-hole optimizer has been used instead. This optimizer cannot handle large plan transformations and so some queries cannot be transformed correctly from logical to physical plan. We expect that the necessary changes to the cost-based optimizer will allow it to be used in an upcoming release, but didn't want to delay the current release waiting for that to happen.
Try Fancier Queries
This query does a join between two files:
SELECT nations.name, regions.name FROM (
SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name
FROM "sample-data/nation.parquet") nations
join (
SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name
FROM "sample-data/region.parquet") regions
on nations.regionKey = regions.regionKey
order by nations.name;
Notice the use of sub-queries to avoid the spread of the _MAP
idiom.
This query illustrates how a cast is currently necessary to make the parser happy:
SELECT
_MAP['N_REGIONKEY'] as regionKey,
_MAP['N_NAME'] as name
FROM
"sample-data/nation.parquet"
WHERE
cast(_MAP['N_NAME'] as varchar) IN ('MOROCCO', 'MOZAMBIQUE');
Here are more queries that you can try.
// count distinct
SELECT count(distinct _MAP['N_REGIONKEY']) FROM "sample-data/nation.parquet";
// aliases
SELECT
_MAP['N_REGIONKEY'] as regionKey,
_MAP['N_NAME'] as name
FROM "sample-data/nation.parquet";
// order by
SELECT
_MAP['N_REGIONKEY'] as regionKey,
_MAP['N_NAME'] as name
FROM
"sample-data/nation.parquet"
ORDER BY
_MAP['N_NAME'] DESC;
// subquery order by
select * from (
SELECT
_MAP['N_REGIONKEY'] as regionKey,
_MAP['N_NAME'] as name
FROM
"sample-data/nation.parquet"
) as x
ORDER BY
name DESC;
// String where
SELECT
_MAP['N_REGIONKEY'] as regionKey,
_MAP['N_NAME'] as name
FROM
"sample-data/nation.parquet"
WHERE
cast(_MAP['N_NAME'] as varchar) > 'M';
// INNER Join + Order (parquet)
SELECT n.name, r.name FROM
(SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "sample-data/nation.parquet")n
join (SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM "sample-data/region.parquet")r
using (regionKey);
// INNER Join + Order (parquet)
SELECT n.name, r.name FROM
(SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "sample-data/nation.parquet")n
join (SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM "sample-data/region.parquet")r
on n.regionKey = r.regionKey
order by n.name;
Analyze the Execution of Queries
Drill sends log events to a logback socket appender. This makes it easy to catch and filter these log events using a tool called Lilith. You can download Lilith and install it easily. A tutorial can be found here. This is especially important if you find errors that you want to report back to the mailing list since Lilith will help you isolate the stack trace of interest.
By default, Lilith uses a slightly lurid splash page based on a pre-Raphaelite image of the mythical Lilith. This is easily disabled if the image is not to your taste (or if your work-mates are not well-versed in Victorian views of Sumerian mythology).