The lesson shows you how to query a plain text file. Drill handles plain text files and directories like standard SQL tables and can infer knowledge about the schema of the data. No setup is required. For example, you do not need to perform extract, transform, and load (ETL) operations on the data source. Exercises in the tutorial demonstrate the general guidelines for querying a plain text file:
COLUMNS[n]
syntax in lieu of column names, which do not exist in a plain text file. The first column is column 0
.This lesson uses a tab-separated value (TSV) files that you download from a Google internet site. The data in the file consists of phrases from books that Google scans and generates for its Google Books Ngram Viewer. You use the data to find the relative frequencies of Ngrams.
Each line in the TSV file has the following structure:
ngram TAB year TAB match_count TAB volume_count NEWLINE
For example, lines 1722089 and 1722090 in the file contain this data:
ngram | year | match_count | volume_count |
---|---|---|---|
Zoological Journal of the Linnean | 2007 | 284 | 101 |
Zoological Journal of the Linnean | 2008 | 257 | 87 |
In 2007, "Zoological Journal of the Linnean" occurred 284 times overall in 101 distinct books of the Google sample.
After downloading the file, you use the dfs
storage plugin, and then select
data from the file as you would a table. In the SELECT statement, enclose the
path and name of the file in backticks.
Download the compressed Google Ngram data from this location:
http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-zo.gz)
Unzip the file.
A file named googlebooks-eng-all-5gram-20120701-zo appears.
Change the file name to add a .tsv
extension.
The Drill dfs
storage plugin definition includes a TSV format that requires
a file to have this extension.
Get data about "Zoological Journal of the Linnean" that appears more than 250 times a year in the books that Google scans.
Switch back to using the dfs
storage plugin.
USE dfs;
Issue a SELECT statement to get the first three columns in the file. In the FROM clause of the example, substitute your path to the TSV file. In the WHERE clause, enclose the string literal "Zoological Journal of the Linnean" in single quotation marks. Limit the output to 10 rows.
SELECT COLUMNS[0], COLUMNS[1], COLUMNS[2]
FROM `/Users/drilluser/Downloads/googlebooks-eng-all-5gram-20120701-zo.tsv`
WHERE ((columns[0] = 'Zoological Journal of the Linnean')
AND (columns[2] > 250)) LIMIT 10;
The output is:
+------------+------------+------------+
| EXPR$0 | EXPR$1 | EXPR$2 |
+------------+------------+------------+
| Zoological Journal of the Linnean | 1993 | 297 |
| Zoological Journal of the Linnean | 1997 | 255 |
| Zoological Journal of the Linnean | 2003 | 254 |
| Zoological Journal of the Linnean | 2007 | 284 |
| Zoological Journal of the Linnean | 2008 | 257 |
+------------+------------+------------+
5 rows selected (1.599 seconds)
Repeat the query using aliases to replace the column headers, such as EXPR$0, with user-friendly column headers, Ngram, Publication Date, and Frequency. In the FROM clause of the example, substitute your path to the TSV file.
SELECT COLUMNS[0] AS Ngram,
COLUMNS[1] AS Publication_Date,
COLUMNS[2] AS Frequency
FROM `/Users/drilluser/Downloads/googlebooks-eng-all-5gram-20120701-zo.tsv`
WHERE ((columns[0] = 'Zoological Journal of the Linnean')
AND (columns[2] > 250)) LIMIT 10;
The improved output is:
+------------+------------------+------------+
| Ngram | Publication_Date | Frequency |
+------------+------------------+------------+
| Zoological Journal of the Linnean | 1993 | 297 |
| Zoological Journal of the Linnean | 1997 | 255 |
| Zoological Journal of the Linnean | 2003 | 254 |
| Zoological Journal of the Linnean | 2007 | 284 |
| Zoological Journal of the Linnean | 2008 | 257 |
+------------+------------------+------------+
5 rows selected (1.628 seconds)