How to Hive
-----------

DISCLAIMER: This is a prototype version of Hive and is NOT production 
quality. This is provided mainly as a way of illustrating the capabilities 
of Hive and is provided as-is. However - we are working hard to make 
Hive a production quality system. Hive has only been tested on unix(linux) 
and mac systems using Java 1.6 for now - although it may very well work 
on other similar platforms. It does not work on Cygwin right now. Most of 
our testing has been on Hadoop 0.17 - so we would advise running it against 
this version of hadoop - even though it may compile/work against other versions


Downloading and building
------------------------

You can either build hive jar files for your environment:
- apply patch
- cd hadoop/src/contrib/hive
- src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package

Or you can use the pre-packaged jars that is available at the following location
and untar it: 
http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz

$ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
$ tar zxvf dist.tar.gz
$ cd dist
$ ls
README  bin conf  lib 

bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)

In the rest of the README, we use dist and <install-dir> interchangeably.

Running Hive
------------

Hive uses hadoop  that means:
- you must have hadoop in your path OR
- export HADOOP=<hadoop-install-dir>/bin/hadoop

To use hive command line interface (cli) from the shell:
$ bin/hive

Using Hive
----------

Configuration management overview
---------------------------------

- hive configuration is stored in <install-dir>/conf/hive-default.xml 
  and log4j in hive-log4j.properties
- hive configuration is an overlay on top of hadoop - meaning the 
  hadoop configuration variables are inherited by default.
- hive configuration can be manipulated by:
  o editing hive-default.xml and defining any desired variables 
    (including hadoop variables) in it
  o from the cli using the set command (see below) 
  o by invoking hive using the syntax:
     * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
    this sets the variables x1 and x2 to y1 and y2


Error Logs
----------
Hive uses log4j for logging. By default logs are not emitted to the 
console by the cli. They are stored in the file:
- /tmp/{user.name}/hive.log

If the user wishes - the logs can be emitted to the console by adding 
the arguments shown below:
- bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the 'set' command does not 
change logging properties since they are determined at initialization time.

Error logs are very useful to debug problems. Please send them with 
any bugs (of which there are many!) to athusoo@facebok.com.


DDL Operations
--------------

Creating Hive tables and browsing through them

hive> CREATE TABLE pokes (foo INT, bar STRING);  

Creates a table called pokes with two columns, first being an 
integer and other a string columns

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);  

Creates a table called invites with two columns and a partition column 
called ds. The partition column is a virtual column  it is not part 
of the data itself  but is derived from the partition that a 
particular dataset is loaded into.


By default tables are assumed to be of text input format and the 
delimiters are assumed to be ^A(ctrl-a). We will be soon publish additional 
commands/recipes to add binary (sequencefiles) data and configurable 
delimiters etc.

hive> SHOW TABLES;

lists all the tables

hive> SHOW TABLES '.*s';

lists all the table that end with 's'. The pattern matching follows Java regular 
expressions. Check out this link for documentation 
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

shows the list of columns

Altering tables. Table name can be changed and additional columns can be dropped

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables
hive> DROP TABLE pokes;


Metadata Store
--------------

Metadata is in an embedded Derby database whose location is determined by the 
hive configuration variable named javax.jdo.option.ConnectionURL. By default 
(see conf/hive-default.xml) - this location is ./metastore_db

Right now - in the default configuration, this metadata can only be seen by 
one user at a time. 

Metastore can be stored in any database that is supported by JPOX. The 
location and the type of the RDBMS can be controlled by the two variables 
'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'. 
Refer to JDO (or JPOX) documentation for more details on supported databases. 
The database schema is defined in JDO metadata annotations file package.jdo 
at src/contrib/hive/metastore/src/model.

In the future - the metastore itself can be a standalone server.


DML Operations
--------------

Loading data from flat files into Hive

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 

Loads a file that contains two columns separated by ctrl-a into pokes table. 
'local' signifies that the input file is on the local system. If 'local' 
is omitted then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted. 
If the 'overwrite' keyword is omitted - then data files are appended to existing data sets.

NOTES:
- NO verification of data against the schema
- if the file is in hdfs it is moved into hive controlled file system namespace. 
  The root of the hive directory is specified by the option hive.metastore.warehouse.dir 
  in hive-default.xml. We would advise that this directory be pre-existing before 
  trying to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table
invites. Table invites must be created as partitioned by the key ds for this to succeed.

Loading/Extracting data using Queries
-------------------------------------

Runtime configuration
---------------------

- Hives queries are executed using map-reduce queries and as such the behavior 
  of such queries can be controlled by the hadoop configuration variables

- The cli can be used to set any hadoop (or hive) configuration variable. For example:
   o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
   o hive> SET - v 
  The latter shows all the current settings. Without the v option only the 
  variables that differ from the base hadoop configuration are displayed
- In particular the number of reducers should be set to a reasonable number 
  to get good performance (the default is 1!)


EXAMPLE QUERIES
---------------

Some example queries are shown below. They are available in examples/queries.
More are available in the hive contrib sources src/test/queries/positive

SELECTS and FILTERS
-------------------

hive> SELECT a.foo FROM invites a;

select column 'foo' from all rows of invites table. The results are not
stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local 
directory or HDFS directory) is optional. 

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;

select all rows from invites table into an HDFS  directory. The result data 
is in files (depending on the number of mappers) in that directory.
NOTE: partition columns if any are selected by the use of *. They can also 
be specified in the projection clauses.

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Select all rows from pokes table into a local directory

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min, max can also be used

NOTE: there are some flaws with the type system that cause doubles to be 
returned with integer types would be expected. We expect to fix these in the coming week.

GROUP BY
--------

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

NOTE: Currently Hive always uses two stage map-reduce for groupby operation. This is 
to handle skews in input data. We will be optimizing this in the coming weeks.

JOIN
----

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo

MULTITABLE INSERT
-----------------
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300

STREAMING
---------
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like hadoop streaming). 
Similarly - streaming can be used on the reduce side (please see the Hive Tutorial or examples)

KNOWN BUGS/ISSUES
-----------------
* hive cli may hang for a couple of minutes because of a bug in getting metadata
  from the derby database. let it run and you'll be fine!
* hive cli does not support multi-line queries.
* hive cli creates derby.log in the directory from which it has been invoked.
* DESCRIBE table currently only shows columns in a table. Other metadata like
  partitions, buckets etc are not shown.
* LOAD FILE or INSERT INTO TABLE do not validate schemas of the destination tables.
* COUNT(*) does not work for now. Use COUNT(1) instead.
* String literals are indicated by single quotes(double quotes are not supported). 
  So 'is a valid string' while "is not a valid string" in the query language. Hive
  does support escaping quotes and semi-colon similar to MySQL.
* Multiple GROUP BYs are not supported in the multi-table table INSERT queries.
* ORDER BY not supported.
* Only string and thrift types (http://developers.facebook.com/thrift) have been tested.


FUTURE FEATURES
---------------