Querying Plain Text Files
You can use Drill to access both structured file types and plain text files (flat files). This section shows a few simple examples that work on flat files:
- CSV files (comma-separated values)
- TSV files (tab-separated values)
- PSV files (pipe-separated values)
The examples here show CSV files, but queries against TSV and PSV files return
equivalent results. However, make sure that your registered storage plugins
recognize the appropriate file types and extensions. For example, the
following configuration expects PSV files (files with a pipe delimiter) to
have a tbl
extension, not a psv
extension. Drill returns a "file not
found" error if references to files in queries do not match these conditions.
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
}
SELECT * FROM a CSV File
The first query selects rows from a .csv
text file. The file contains seven
records:
$ more plays.csv
1599,As You Like It
1601,Twelfth Night
1594,Comedy of Errors
1595,Romeo and Juliet
1596,The Merchant of Venice
1610,The Tempest
1599,Hamlet
Drill recognizes each row as an array of values and returns one column for each row.
0: jdbc:drill:zk=local> select * from dfs.`/Users/brumsby/drill/plays.csv`;
+------------+
| columns |
+------------+
| ["1599","As You Like It"] |
| ["1601","Twelfth Night"] |
| ["1594","Comedy of Errors"] |
| ["1595","Romeo and Juliet"] |
| ["1596","The Merchant of Venice"] |
| ["1610","The Tempest"] |
| ["1599","Hamlet"] |
+------------+
7 rows selected (0.089 seconds)
Columns[n] Syntax
You can use the COLUMNS[n]
syntax in the SELECT list to return these CSV
rows in a more readable, column by column, format. (This syntax uses a zero-
based index, so the first column is column 0
.)
0: jdbc:drill:zk=local> select columns[0], columns[1] from dfs.`/Users/brumsby/drill/plays.csv`;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 1599 | As You Like It |
| 1601 | Twelfth Night |
| 1594 | Comedy of Errors |
| 1595 | Romeo and Juliet |
| 1596 | The Merchant of Venice |
| 1610 | The Tempest |
| 1599 | Hamlet |
+------------+------------+
7 rows selected (0.137 seconds)
You can use aliases to return meaningful column names. Note that YEAR
is a
reserved word, so the Year
alias must be enclosed by back ticks.
0: jdbc:drill:zk=local> select columns[0] as `Year`, columns[1] as Play
from dfs.`/Users/brumsby/drill/plays.csv`;
+------------+------------+
| Year | Play |
+------------+------------+
| 1599 | As You Like It |
| 1601 | Twelfth Night |
| 1594 | Comedy of Errors |
| 1595 | Romeo and Juliet |
| 1596 | The Merchant of Venice |
| 1610 | The Tempest |
| 1599 | Hamlet |
+------------+------------+
7 rows selected (0.113 seconds)
You cannot refer to the aliases in subsequent clauses of the query. Use the
original columns[n]
syntax, as shown in the WHERE clause for the following
example:
0: jdbc:drill:zk=local> select columns[0] as `Year`, columns[1] as Play
from dfs.`/Users/brumsby/drill/plays.csv` where columns[0]>1599;
+------------+------------+
| Year | Play |
+------------+------------+
| 1601 | Twelfth Night |
| 1610 | The Tempest |
+------------+------------+
2 rows selected (0.201 seconds)
Note that the restriction with the use of aliases applies to queries against all data sources.