Tajo is a big data warehouse system on Hadoop that provides low-latency, scalable ad-hoc queries and ETL on large data sets stored in HDFS and other data sources.
Download the source code from http://tajo.apache.org/downloads.html.
You can compile source code and get a binary archive as follows:
$ cd tajo
$ mvn clean package -DskipTests -Pdist -Dtar
$ ls tajo-dist/target/tajo-x.y.z-SNAPSHOT.tar.gz
To launch the tajo master, execute start-tajo.sh.
$ $TAJO_HOME/bin/start-tajo.sh
After that, you can use tajo-cli to access the command line interface of Tajo. If you want to learn how to use tsql, read the Tajo Interactive Shell document.
$ $TAJO_HOME/bin/tsql
If you type \? in tsql, you can see the help documentation.
First of all, we need to prepare some data for query execution. For example, you can make a simple text-based table as follows:
$ mkdir /home/x/table1
$ cd /home/x/table1
$ cat > data.csv
1|abc|1.1|a
2|def|2.3|b
3|ghi|3.4|c
4|jkl|4.5|d
5|mno|5.6|e
<CTRL + D>
The schema of this table is (int, text, float, text).
$ $TAJO_HOME/bin/tsql
tajo> create external table table1 (id int, name text, score float, type text) using csv with ('csvfile.delimiter'='|') location 'file:/home/x/table1';
In order to load an external table, you need to use the ‘create external table’ statement. In the location clause, you should use an absolute directory path with an appropriate scheme. If the table resides in HDFS, you should use ‘hdfs’ instead of ‘file’.
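For example, the same table could be registered from an HDFS location as sketched below; the namenode hostname, port, and HDFS path are placeholders, not values from this guide:

```sql
-- a sketch, assuming the same data.csv has been copied to HDFS;
-- replace namenode:9010 and the path with your cluster's values
create external table table1_hdfs (id int, name text, score float, type text)
  using csv with ('csvfile.delimiter'='|')
  location 'hdfs://namenode:9010/user/x/table1';
```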
If you want to know DDL statements in more detail, please see Query Language.
tajo> \d
table1

The ‘\d’ command shows the list of tables.
tajo> \d table1

table name: table1
table path: file:/home/x/table1
store type: CSV
number of rows: 0
volume (bytes): 78 B
schema:
id	INT
name	TEXT
score	FLOAT
type	TEXT
The ‘\d [table name]’ command shows the description of a given table.
Also, you can execute SQL queries as follows:
tajo> select * from table1 where id > 2;
final state: QUERY_SUCCEEDED, init time: 0.069 sec, response time: 0.397 sec
result: file:/tmp/tajo-hadoop/staging/q_1363768615503_0001_000001/RESULT, 3 rows ( 35B)
id,  name,  score,  type
- - - - - - - - - - - - -
3,  ghi,  3.4,  c
4,  jkl,  4.5,  d
5,  mno,  5.6,  e

tajo>
Add the following configs to tajo-site.xml file.
<property>
  <name>tajo.rootdir</name>
  <value>hdfs://hostname:port/tajo</value>
</property>
<property>
  <name>tajo.master.umbilical-rpc.address</name>
  <value>hostname:26001</value>
</property>
<property>
  <name>tajo.catalog.client-rpc.address</name>
  <value>hostname:26005</value>
</property>
If you want to know Tajo’s configuration in more detail, see Configuration page.
Before launching Tajo, you should create the Tajo root directory and set its permissions as follows:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tajo
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tajo
Then, execute start-tajo.sh
$ $TAJO_HOME/bin/start-tajo.sh
Enjoy Apache Tajo!
Tajo’s configuration is based on Hadoop’s configuration system. Tajo uses two config files: tajo-site.xml and tajo-env.sh.
Each config consists of a pair of a name and a value. If you want to set the config name a.b.c with the value 123, add the following element to an appropriate file.
<property>
  <name>a.b.c</name>
  <value>123</value>
</property>
Tajo has a variety of internal configs. If you don’t set a config explicitly, its default value is used. Tajo is designed to require only a few configs in usual cases, so you usually do not need to be concerned with the configuration.
By default, there is no tajo-site.xml in the ${TAJO}/conf directory. If you want to set some configs, first copy $TAJO_HOME/conf/tajo-site.xml.template to tajo-site.xml. Then, add the configs to your tajo-site.xml.
tajo-env.sh is a shell script file. The main purpose of this file is to set shell environment variables for the TajoMaster and TajoWorker Java programs. So, you can set a variable as follows:
VARIABLE=value
If a value is a literal string, type it as follows:
VARIABLE='value'
Tajo uses HDFS as its primary storage layer, and one Tajo cluster instance has one Tajo rootdir. You can specify your Tajo rootdir as follows:
<property>
  <name>tajo.rootdir</name>
  <value>hdfs://namenode_hostname:port/path</value>
</property>
The Tajo rootdir must be a URL of the form scheme://hostname:port/path. The current implementation only supports the hdfs:// and file:// schemes. The default value is file:///tmp/tajo-${user.name}/.
The environment variable TAJO_MASTER_HEAPSIZE in conf/tajo-env.sh allows the Tajo master to use the specified heap memory size.
If you want to adjust heap memory size, set TAJO_MASTER_HEAPSIZE variable in conf/tajo-env.sh with a proper size as follows:
TAJO_MASTER_HEAPSIZE=2000
The default size is 1000 (1GB).
The environment variable TAJO_WORKER_HEAPSIZE in conf/tajo-env.sh allows the Tajo worker to use the specified heap memory size.
If you want to adjust heap memory size, set TAJO_WORKER_HEAPSIZE variable in conf/tajo-env.sh with a proper size as follows:
TAJO_WORKER_HEAPSIZE=8000
The default size is 1000 (1GB).
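Both heap sizes can be set together in conf/tajo-env.sh. A minimal sketch follows; the values are illustrative examples, not recommendations:

```shell
# conf/tajo-env.sh fragment -- heap sizes in MB (example values only)
export TAJO_MASTER_HEAPSIZE=2000
export TAJO_WORKER_HEAPSIZE=8000
```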
TajoWorker stores temporary data on the local file system because of its out-of-core algorithms. You can specify one or more temporary data directories where this data will be stored.
tajo-site.xml
<property>
  <name>tajo.worker.tmpdir.locations</name>
  <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value>
</property>
Each worker can execute multiple tasks at a time. Tajo allows users to specify the maximum number of parallel running tasks for each worker.
tajo-site.xml
<property>
  <name>tajo.worker.parallel-execution.max-num</name>
  <value>12</value>
</property>
If you want to customize the catalog service, copy $TAJO_HOME/conf/catalog-site.xml.template to catalog-site.xml. Then, add the following configs to catalog-site.xml. Note that the default configs are enough to launch a Tajo cluster in most cases.
Service Name | Config Property Name | Default Address
---|---|---
Tajo Master Umbilical Rpc | tajo.master.umbilical-rpc.address | localhost:26001
Tajo Master Client Rpc | tajo.master.client-rpc.address | localhost:26002
Tajo Master Info Http | tajo.master.info-http.address | 0.0.0.0:26080
Tajo Catalog Client Rpc | tajo.catalog.client-rpc.address | localhost:26005
Service Name | Config Property Name | Default Address
---|---|---
Tajo Worker Peer Rpc | tajo.worker.peer-rpc.address | 0.0.0.0:28091
Tajo Worker Client Rpc | tajo.worker.client-rpc.address | 0.0.0.0:28092
Tajo Worker Info Http | tajo.worker.info-http.address | 0.0.0.0:28080
Synopsis
bin/tsql [options]
Options
-c "quoted sql" : Execute the quoted SQL statements, and then the shell will exit.
-f filename (--file filename) : Use the file named filename as the source of commands instead of the interactive shell.
-h hostname (--host hostname) : Specifies the host name of the machine on which the Tajo master is running.
-p port (--port port) : Specifies the TCP port. If it is not set, the port will be 26002 by default.
If the hostname and the port number are not given, tsql will try to connect to the Tajo master specified in ${TAJO_HOME}/conf/tajo-site.xml.
$ bin/tsql
tajo>
If you want to connect to a specific TajoMaster, you should use the ‘-h’ and/or ‘-p’ options as follows:
$ bin/tsql -h localhost -p 9004
tajo>
In tsql, any command that begins with an unquoted backslash (‘\’) is a tsql meta-command that is processed by tsql itself.
In the current implementation, the meta-commands are as follows:
tajo> \?

General
  \copyright    show Apache License 2.0
  \version      show Tajo version
  \?            show help
  \q            quit tsql

Informational
  \d            list tables
  \d NAME       describe table

Documentations
  tsql guide        http://wiki.apache.org/tajo/tsql
  Query language    http://wiki.apache.org/tajo/QueryLanguage
  Functions         http://wiki.apache.org/tajo/Functions
  Backup & restore  http://wiki.apache.org/tajo/BackupAndRestore
  Configuration     http://wiki.apache.org/tajo/Configuration
If you want to list all table names, use the ‘\d’ meta-command as follows:
tajo> \d
customer
lineitem
nation
orders
part
partsupp
region
supplier
Now look at the table description:
tajo> \d orders

table name: orders
table path: hdfs:/xxx/xxx/tpch/orders
store type: CSV
number of rows: 0
volume (bytes): 172.0 MB
schema:
o_orderkey	INT8
o_custkey	INT8
o_orderstatus	TEXT
o_totalprice	FLOAT8
o_orderdate	TEXT
o_orderpriority	TEXT
o_clerk	TEXT
o_shippriority	INT4
o_comment	TEXT
Supported | SQL Type Name | Alias | Size (byte) | Description | Range
---|---|---|---|---|---
O | boolean | bool | 1 | true/false |
 | bit | | 1 | 1/0 |
 | varbit | bit varying | | |
O | smallint | tinyint, int2 | 2 | small-range integer value | -2^15 (-32,768) to 2^15-1 (32,767)
O | integer | int, int4 | 4 | integer value | -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
O | bigint | int8 | 8 | larger-range integer value | -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)
O | real | float4 | 4 | variable-precision, inexact, real number value | -3.4028235E+38 to 3.4028235E+38 (6 decimal digits precision)
O | float[(n)] | float4 or float8 | 4 or 8 | variable-precision, inexact, real number value |
O | double | float8, double precision | 8 | variable-precision, inexact, real number value | 1.7E-308 to 1.7E+308 (15 decimal digits precision)
 | number | decimal | | |
 | char[(n)] | character | | |
 | varchar[(n)] | character varying | | |
O | text | | | variable-length unicode text |
 | binary | | | |
 | varbinary[(n)] | binary varying | | |
O | blob | bytea | | variable-length binary string |
 | date | | | |
 | time | | | |
 | timetz | time with time zone | | |
 | timestamp | | | |
 | timestamptz | | | |
O | inet4 | | 4 | IPv4 address |
The real and double data types are mapped to the float and double Java primitives, respectively. The Java primitives float and double follow the IEEE 754 specification, so these types correctly match the SQL standard data types.
Do not use approximate real number columns in a WHERE clause to compare exact matches, especially with the = and <> operators. The > and < comparisons work well.
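As an illustrative sketch of this advice, a small tolerance range can stand in for an exact match on an approximate column; this reuses the table1 table created earlier, and the tolerance value is an arbitrary example:

```sql
-- avoid: WHERE score = 3.4   (exact match on an approximate float column)
-- prefer a narrow range instead:
SELECT id, score FROM table1 WHERE score > 3.39 AND score < 3.41;
```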
Synopsis
CREATE TABLE <table_name> [(<column_name> <data_type>, ... )]
  [using <storage_type> [with (<key> = <value>, ...)]]
  [AS <select_statement>]

CREATE EXTERNAL TABLE

CREATE EXTERNAL TABLE <table_name> (<column_name> <data_type>, ... )
  using <storage_type> [with (<key> = <value>, ...)]
  LOCATION '<path>'
If you want to add an external table that contains compressed data, you should give the ‘compression.codec’ parameter to the CREATE TABLE statement.
create EXTERNAL table lineitem (
  L_ORDERKEY bigint,
  L_PARTKEY bigint,
  ...
  L_COMMENT text)
USING csv WITH ('csvfile.delimiter'='|',
  'compression.codec'='org.apache.hadoop.io.compress.DeflateCodec')
LOCATION 'hdfs://localhost:9010/tajo/warehouse/lineitem_100_snappy';
The ‘compression.codec’ parameter can have one of the following compression codecs:
* org.apache.hadoop.io.compress.BZip2Codec
* org.apache.hadoop.io.compress.DeflateCodec
* org.apache.hadoop.io.compress.GzipCodec
* org.apache.hadoop.io.compress.SnappyCodec
Synopsis
SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...]
  [FROM <table name> [[AS] <table alias name>] [, ...]]
  [WHERE <condition>]
  [GROUP BY <expression> [, ...]]
  [HAVING <condition>]
  [ORDER BY <expression> [ASC|DESC] [NULL FIRST|NULL LAST] [, ...]]
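For example, these clauses can be combined in a single query; the sketch below reuses the table1 table created earlier:

```sql
-- group rows of table1 by type, filter, aggregate, and sort
SELECT type, count(*) AS cnt, avg(score) AS avg_score
FROM table1
WHERE id > 1
GROUP BY type
HAVING count(*) >= 1
ORDER BY avg_score DESC;
```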
The IN predicate provides row and array comparison.
Synopsis
column_reference IN (val1, val2, ..., valN)
column_reference NOT IN (val1, val2, ..., valN)
Examples are as follows:
-- this statement lists all the records where the col1 value is 1, 2 or 3:
SELECT col1, col2 FROM table1 WHERE col1 IN (1, 2, 3);

-- this statement lists all the records where the col1 value is neither 1, 2 nor 3:
SELECT col1, col2 FROM table1 WHERE col1 NOT IN (1, 2, 3);
You can use the IN clause on the text data domain as follows:
SELECT col1, col2 FROM table1 WHERE col2 IN ('tajo', 'hadoop');
SELECT col1, col2 FROM table1 WHERE col2 NOT IN ('tajo', 'hadoop');
The LIKE operator returns true or false depending on whether its pattern matches the given string. An underscore (_) in the pattern matches any single character. A percent sign (%) matches any sequence of zero or more characters.
Synopsis
string LIKE pattern
string NOT LIKE pattern
ILIKE is the same as LIKE, but it is a case-insensitive operator. It is not in the SQL standard; this operator is borrowed from PostgreSQL.
Synopsis
string ILIKE pattern
string NOT ILIKE pattern
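For example, ILIKE matches regardless of letter case; this sketch reuses the table1 table created earlier:

```sql
-- matches 'abc' even though the pattern starts with an upper-case letter
SELECT id, name FROM table1 WHERE name ILIKE 'A%';
```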
Synopsis
string SIMILAR TO pattern
string NOT SIMILAR TO pattern
It returns true or false depending on whether its pattern matches the given string. Like LIKE, ‘SIMILAR TO’ uses ‘_’ and ‘%’ as metacharacters denoting any single character and any string, respectively.
In addition to these metacharacters borrowed from LIKE, ‘SIMILAR TO’ supports more powerful pattern-matching metacharacters borrowed from regular expressions:
metacharacter | description |
---|---|
| | denotes alternation (either of two alternatives). |
* | denotes repetition of the previous item zero or more times. |
+ | denotes repetition of the previous item one or more times. |
? | denotes repetition of the previous item zero or one time. |
{m} | denotes repetition of the previous item exactly m times. |
{m,} | denotes repetition of the previous item m or more times. |
{m,n} | denotes repetition of the previous item at least m and not more than n times. |
[] | A bracket expression specifies a character class, just as in POSIX regular expressions. |
() | Parentheses can be used to group items into a single logical item. |
Note that ‘.’ is not used as a metacharacter in ‘SIMILAR TO’ operator.
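As a short illustration of these metacharacters, the following sketch applies ‘SIMILAR TO’ to the table1 table created earlier:

```sql
-- [a-z]+ : the name consists of one or more lower-case letters
SELECT id, name FROM table1 WHERE name SIMILAR TO '[a-z]+';

-- | : alternation; matches rows whose name is exactly 'abc' or 'def'
SELECT id, name FROM table1 WHERE name SIMILAR TO 'abc|def';
```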
Regular expressions provide a very powerful means for string pattern matching. In the current Tajo, regular expressions are based on Java-style regular expressions instead of POSIX regular expressions. The main difference between the Java style and the POSIX style is the character classes.
Synopsis
string ~ pattern
string !~ pattern
string ~* pattern
string !~* pattern
operator | Description |
---|---|
~ | It returns true if the given regular expression matches the string. Otherwise, it returns false. |
!~ | It returns false if the given regular expression matches the string. Otherwise, it returns true. |
~* | It is the same as ‘~’, but case insensitive. |
!~* | It is the same as ‘!~’, but case insensitive. |
Here are examples:
'abc' ~ '.*c'               true
'abc' ~ 'c'                 false
'aaabc' ~ '([a-z]){3}bc'    true
'abc' ~* '.*C'              true
'abc' !~* 'B.*'             true
The regular expression operators are not in the SQL standard; they are borrowed from PostgreSQL.
Synopsis for REGEXP and RLIKE operators
string REGEXP pattern
string NOT REGEXP pattern
string RLIKE pattern
string NOT RLIKE pattern
However, they do not support case-insensitive variants.
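For example, REGEXP and RLIKE behave like the ‘~’ operator above; this sketch reuses the table1 table created earlier:

```sql
-- true for lower-case names such as 'abc'
SELECT id, name FROM table1 WHERE name REGEXP '[a-z]+';

-- excludes names starting with 'z'
SELECT id, name FROM table1 WHERE name NOT RLIKE '^z.*';
```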
The INSERT OVERWRITE statement overwrites the data of an existing table or the data in a given directory. Tajo’s INSERT OVERWRITE statement follows the ‘INSERT INTO SELECT’ statement of SQL. The examples are as follows:
create table t1 (col1 int8, col2 int4, col3 float8);

-- when a target table schema and output schema are equivalent to each other
INSERT OVERWRITE INTO t1 SELECT l_orderkey, l_partkey, l_quantity FROM lineitem;
-- or
INSERT OVERWRITE INTO t1 SELECT * FROM lineitem;

-- when the output schema is smaller than the target table schema
INSERT OVERWRITE INTO t1 SELECT l_orderkey FROM lineitem;

-- when you want to specify certain target columns
INSERT OVERWRITE INTO t1 (col1, col3) SELECT l_orderkey, l_quantity FROM lineitem;
In addition, the INSERT OVERWRITE statement can write to a specific directory as well as to a table.
INSERT OVERWRITE INTO LOCATION '/dir/subdir' SELECT l_orderkey, l_quantity FROM lineitem;
function definition | return type | description | example | result |
---|---|---|---|---|
count(*) | int8 | |||
count(expr) | int8 | |||
avg(expr) | depending on expr | |||
sum(expr) | depending on expr | |||
min(expr) | depending on expr | |||
max(expr) | depending on expr |
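For example, the aggregate functions above can be combined in one query; this sketch reuses the table1 table created earlier:

```sql
-- counts the rows of table1 and summarizes its score column
SELECT count(*), avg(score), sum(score), min(score), max(score)
FROM table1;
```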
function definition | return type | description | example | result |
---|---|---|---|---|
string || string | text | string concatenate | ‘Ta’ || ‘jo’ | Tajo |
char_length(string text) or character_length(string text) | int | Number of characters in string | char_length(‘Tajo’) | 4 |
trim([leading | trailing | both] [characters] from string) | text | Remove the characters (a space by default) from the start/end/both ends of the string | trim(both ‘x’ from ‘xTajoxx’) | Tajo |
btrim(string text [, characters text]) | text | Remove the characters (a space by default) from both ends of the string | btrim(‘xTajoxx’, ‘x’) | Tajo |
ltrim(string text [, characters text]) | text | Remove the characters (a space by default) from the start of the string | ltrim(‘xxTajo’, ‘x’) | Tajo |
rtrim(string text [, characters text]) | text | Remove the characters (a space by default) from the end of the string | rtrim(‘Tajoxx’, ‘x’) | Tajo |
split_part(string text, delimiter text, field int) | text | Split a string on delimiter and return the given field (counting from one) | split_part(‘ab_bc_cd’,‘_’,2) | bc |
regexp_replace(string text, pattern text, replacement text) | text | Replace substrings matched to a given regular expression pattern | regexp_replace(‘abcdef’, ‘(^ab|ef$)’, ‘--’) | --cd-- |
upper(string text) | text | converts the input text to upper case | upper(‘tajo’) | TAJO |
lower(string text) | text | converts the input text to lower case | lower(‘TAJO’) | tajo |
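As a small sketch, several of these string functions can be combined in one query; this reuses the table1 table created earlier (for the row with name = 'abc', the results would be ABC, 3, and a, since split_part counts fields from one):

```sql
SELECT upper(name), char_length(name), split_part(name, 'b', 1)
FROM table1;
```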
Now, Tajo supports two backup methods for the catalog.
SQL dump is an easy and robust way. If you use this approach, you don’t need to worry about database-level compatibilities. If you want to back up your catalog, just use the bin/tajo_dump command. The basic usage of this command is:
$ tajo_dump table_name > outfile
For example, if you want to back up a table customer, you should type a command as follows:
$ bin/tajo_dump customer > table_backup.sql
$ cat table_backup.sql

-- Tajo database dump
-- Dump date: 10/04/2013 16:28:03
--
--
-- Name: customer; Type: TABLE; Storage: CSV
-- Path: file:/home/hyunsik/tpch/customer
--
CREATE EXTERNAL TABLE customer (c_custkey INT8, c_name TEXT, c_address TEXT, c_nationkey INT8, c_phone TEXT, c_acctbal FLOAT8, c_mktsegment TEXT, c_comment TEXT) USING CSV LOCATION 'file:/home/hyunsik/tpch/customer';
If you want to restore the catalog from the SQL dump file, type the command below:
$ bin/tsql -f table_backup.sql
If you use the ‘-a’ option, tajo_dump will dump all table DDLs.
$ bin/tajo_dump -a > all_backup.sql