Querying Directories
You can store multiple files in a directory and query them as if they were a single entity. You do not have to explicitly join the files. The files must be compatible, in the sense that they must have comparable data types and columns in the same order. This type of query is not limited to text files; you can also query directories of JSON files, for example.
For example, assume that a testdata
directory contains two files with the
same structure: plays.csv
and moreplays.csv
. The first file contains 7
records and the second file contains 3 records. The following query returns
the "union" of the two files, ordered by the first column:
0: jdbc:drill:zk=local> select columns[0] as `Year`, columns[1] as Play
from dfs.`/Users/brumsby/drill/testdata` order by 1;
+------------+------------+
| Year | Play |
+------------+------------+
| 1594 | Comedy of Errors |
| 1595 | Romeo and Juliet |
| 1596 | The Merchant of Venice |
| 1599 | As You Like It |
| 1599 | Hamlet |
| 1601 | Twelfth Night |
| 1606 | Macbeth |
| 1606 | King Lear |
| 1609 | The Winter's Tale |
| 1610 | The Tempest |
+------------+------------+
10 rows selected (0.296 seconds)
You can drill down further and automatically query subdirectories as well. For example, assume that you have a logs directory that contains a subdirectory for each year and subdirectories for each month (1 through 12). The month directories contain JSON files.
[root@ip-172-16-1-200 logs]# pwd
/mapr/drilldemo/labs/clicks/logs
[root@ip-172-16-1-200 logs]# ls
2012 2013 2014
[root@ip-172-16-1-200 logs]# cd 2013
[root@ip-172-16-1-200 2013]# ls
1 10 11 12 2 3 4 5 6 7 8 9
You can query all of these files, or a subset, by referencing the file system
once in a Drill query. For example, the following query counts the number of
records in all of the files inside the 2013
directory:
0: jdbc:drill:> select count(*) from MFS.`/mapr/drilldemo/labs/clicks/logs/2013` ;
+------------+
| EXPR$0 |
+------------+
| 24000 |
+------------+
1 row selected (2.607 seconds)
You can also use variables dir0
, dir1
, and so on, to refer to
subdirectories in your workspace path. For example, assume that bob.logdata
is a workspace that points to the logs
directory, which contains multiple
subdirectories: 2012
, 2013
, and 2014
. The following query constrains
files inside the subdirectory named 2013
. The variable dir0
refers to the
first level down from logs, dir1
to the next level, and so on.
0: jdbc:drill:> use bob.logdata;
+------------+------------+
| ok | summary |
+------------+------------+
| true | Default schema changed to 'bob.logdata' |
+------------+------------+
1 row selected (0.305 seconds)
0: jdbc:drill:> select * from logs where dir0='2013' limit 10;
+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
| dir0 | dir1 | trans_id | date | time | cust_id | device | state | camp_id | keywords |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
| 2013 | 2 | 12115 | 02/23/2013 | 19:48:24 | 3 | IOS5 | az | 5 | who's |
| 2013 | 2 | 12127 | 02/26/2013 | 19:42:03 | 11459 | IOS5 | wa | 10 | for |
| 2013 | 2 | 12138 | 02/09/2013 | 05:49:01 | 1 | IOS6 | ca | 7 | minutes |
| 2013 | 2 | 12139 | 02/23/2013 | 06:58:20 | 1 | AOS4.4 | ms | 7 | i |
| 2013 | 2 | 12145 | 02/10/2013 | 10:14:56 | 10 | IOS5 | mi | 6 | wrong |
| 2013 | 2 | 12157 | 02/15/2013 | 02:49:22 | 102 | IOS5 | ny | 5 | want |
| 2013 | 2 | 12176 | 02/19/2013 | 08:39:02 | 28 | IOS5 | or | 0 | and |
| 2013 | 2 | 12194 | 02/24/2013 | 08:26:17 | 125445 | IOS5 | ar | 0 | say |
| 2013 | 2 | 12236 | 02/05/2013 | 01:40:05 | 10 | IOS5 | nj | 2 | sir |
| 2013 | 2 | 12249 | 02/03/2013 | 04:45:47 | 21725 | IOS5 | nj | 5 | no |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
10 rows selected (0.583 seconds)
For more information about querying directories, see the section, "Query Directory Functions".