Sqoop

Overview

Sqoop is a tool designed to help users import data from existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema of each table, and auto-generate the classes necessary to import the data into HDFS. It then instantiates a MapReduce job that reads the table from the database via DBInputFormat, a JDBC-based InputFormat. The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.

Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data files directly into the Hive warehouse directory, and also to inject the appropriate table definition into the metastore.

Getting Started

Getting Sqoop

Sqoop is distributed as a "contrib" jar with Hadoop. It is built in the contrib/sqoop/ directory.

You can run Sqoop as follows:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop (options)

This does nothing of interest without any options. The --help option displays the full usage instructions.
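
For example, to display the full usage instructions:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --help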

Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. As such, it requires a connect string that describes how to connect to the database. The connect string looks like a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. e.g.:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It's important that you do not use localhost as the host name if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on all the TaskTracker nodes in your MapReduce cluster; if they're told to connect to the literal name localhost, they'll each reach a different database (or more likely, no database at all)! Instead, use the full DNS name or IP address of the database host, which must be reachable from all of your remote nodes.

You may need to authenticate against the database before you can access it. The --username and --password parameters can be used to supply a username and a password to the database. (Note: password access currently requires passing the password on the command-line, which is insecure.) e.g.:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --username aaron --password 12345

Sqoop automatically supports MySQL and HSQLDB. Connect strings beginning with jdbc:mysql:// and jdbc:hsqldb:hsql:// automatically inform Sqoop of the correct JDBC driver class to load. HSQLDB's JDBC driver is bundled with Hadoop, and so will work "out of the box." If you install MySQL's Connector/J driver in Hadoop's lib/ directory, Sqoop will also automatically take advantage of this for any jdbc:mysql:// connect strings you use.

You can use Sqoop with any other JDBC-compliant database as well. First, download the appropriate JDBC driver for the database you want to import from, and install the .jar file in the $HADOOP_HOME/lib directory on all machines in your Hadoop cluster, or some other directory which is on the classpath of all nodes. Each driver jar also has a specific driver class which defines the entry point to the driver. For example, MySQL's Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor's documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a postgres database, first download the driver from http://jdbc.postgresql.org and install it in your Hadoop lib path. Then run Sqoop with something like:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://postgres-server.example.com/employees --driver org.postgresql.Driver

Note: Sqoop uses the JDBC specification to connect to databases; this should provide a versatile client that interoperates with many different databases. That having been said, we have only thoroughly tested this tool with HSQLDB and MySQL.

Listing Available Databases

Once connected to a database server, you can list the available databases with the --list-databases parameter. This is currently supported only by HSQLDB and MySQL. Note that in this case, the connect string does not include a database name, just a server address.

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/ --list-databases
information_schema
employees

A vendor-agnostic implementation of this function is not yet available.

Listing Available Tables

Within a database, you can list the tables available for import with the --list-tables command. The following example shows four tables available within the "employees" example database:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --list-tables

employee_names
payroll_checks
job_descriptions
office_supplies

Automatic Full-database Import

If you want to import all the tables in a database, you can use the --all-tables command to do so:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --all-tables

This will query the database for the available tables, generate an ORM class for each table, and run a MapReduce job to import each one. Hadoop uses the DBInputFormat to read from a database into a Mapper instance. To read a table into a MapReduce program requires creating a class to hold the fields of one row of the table. One of the benefits of Sqoop is that it generates this class definition for you, based on the table definition in the database.
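
To make this concrete, here is a minimal, hypothetical sketch of such a record class. It is not the literal output of Sqoop's code generator (the real generated code also handles SQL NULL values, covers every column, and varies between Sqoop and Hadoop versions); it simply illustrates the shape of a class that holds one row and can be used with DBInputFormat, borrowing two column names from the employee_names examples later in this document:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable; // package may differ by Hadoop version

// Illustrative sketch only; not the literal code Sqoop generates.
public class employee_names implements Writable, DBWritable {
  private int employee_id;     // mapped from an INTEGER column
  private String first_name;   // mapped from a VARCHAR column

  // DBWritable: populate the fields from one row of a JDBC ResultSet.
  public void readFields(ResultSet rs) throws SQLException {
    this.employee_id = rs.getInt(1);
    this.first_name = rs.getString(2);
  }

  // DBWritable: bind the fields to a PreparedStatement (required by the interface).
  public void write(PreparedStatement stmt) throws SQLException {
    stmt.setInt(1, this.employee_id);
    stmt.setString(2, this.first_name);
  }

  // Writable: deserialize the record from a stream (e.g., a SequenceFile).
  public void readFields(DataInput in) throws IOException {
    this.employee_id = in.readInt();
    this.first_name = Text.readString(in);
  }

  // Writable: serialize the record to a stream.
  public void write(DataOutput out) throws IOException {
    out.writeInt(this.employee_id);
    Text.writeString(out, this.first_name);
  }

  public int getEmployeeId() { return employee_id; }
  public String getFirstName() { return first_name; }
}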

The generated .java files are, by default, placed in the current directory. You can supply a different directory with the --outdir parameter. The .java files are then compiled into .class and .jar files for use by the MapReduce job that Sqoop launches; these compiled files are created in a temporary directory, and you can redirect this target with the --bindir parameter.
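
For example (the directories shown here are arbitrary illustrations):

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --all-tables --outdir /path/to/generated/src --bindir /path/to/generated/classes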

Each table will be imported into a separate directory in HDFS, with the same name as the table. For instance, if your Hadoop username is aaron, the above command would generate the following directories in HDFS:

/user/aaron/employee_names
/user/aaron/payroll_checks
/user/aaron/job_descriptions
/user/aaron/office_supplies

You can change the base directory under which the tables are loaded with the --warehouse-dir parameter. For example:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --all-tables --warehouse-dir /common/warehouse

This would create the following directories instead:

/common/warehouse/employee_names
/common/warehouse/payroll_checks
/common/warehouse/job_descriptions
/common/warehouse/office_supplies

By default the data will be read into text files in HDFS, with each row of the table on its own line and the column values within a row separated by commas. There is currently no mechanism to quote or escape commas or newlines that occur inside CHAR or VARCHAR columns of the database, so applications which depend on comma-delimited parsing of the output files must be careful if commas or newlines may be present in the data.
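
As a purely hypothetical illustration, a row whose three columns hold the values 3, Alice, and Accounting would appear in the output file as the single line:

3,Alice,Accounting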

If you expect commas or newlines to appear in text columns of the database, or you want to leverage compression and binary file formats, the --as-sequencefile argument to Sqoop will import the table to a set of SequenceFiles instead. As this uses a separate object for each field of each database record, no quoting or escaping of values is necessary. This representation is also likely to be higher performance when used as an input to subsequent MapReduce programs. For completeness, Sqoop provides an --as-textfile option, which is implied by default. An --as-textfile on the command-line will override a previous --as-sequencefile argument.
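
For example:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --all-tables --as-sequencefile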

The SequenceFile format will embed the records from the database as objects using the code generated by Sqoop. It is important that you retain the .java file for this class, as you will need to be able to instantiate the same type to read the objects back later, in other user-defined applications.
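
As a rough sketch of such a user-defined application, the following reads imported records back with Hadoop's standard SequenceFile.Reader API. The class name, the file path, and the assumption that the records are stored as the SequenceFile's values are placeholders here; the generated class (compiled from its .java file) must be on the application's classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReadImportedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path to one of the files produced by the import.
    Path path = new Path("/user/aaron/employee_names/part-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    // Instantiate the key and value types recorded in the file's header;
    // the value class is the one Sqoop generated for the table.
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    while (reader.next(key, value)) {
      // Use the generated class's accessors (or its toString()) to work with each record.
      System.out.println(value);
    }
    reader.close();
  }
}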

Importing Individual Tables

In addition to full-database imports, Sqoop will allow you to import individual tables. Instead of using --all-tables, specify the name of a particular table with the --table argument:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --table employee_names 

You can further specify a subset of the columns in a table by using the --columns argument. This takes a list of column names, delimited by commas, with no spaces in between. e.g.:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --table employee_names --columns employee_id,first_name,last_name,dept_id

Sqoop will use a MapReduce job to read sections of the table in parallel. For the MapReduce tasks to divide the table space, the results returned by the database must be orderable. Sqoop will automatically detect the primary key for a table and use that to order the results. If no primary key is available, or (less likely) you want to order the results along a different column, you can specify the column name with --order-by. Important: To guarantee correctness of your input, you must select an ordering column for which each row has a unique value. If duplicate values appear in the ordering column, the results of the import are undefined, and Sqoop will not be able to detect the error.
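
For example, to order the import of employee_names by its employee_id column (assuming employee_id holds a unique value for each row):

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --table employee_names --order-by employee_id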

The --columns and --order-by arguments are incompatible with --all-tables. If you require special handling for some of the tables, then you must manually run a separate import job for each table.

Miscellaneous Additional Arguments

If you want to generate the Java classes to represent tables without actually performing an import, supply a connect string and (optionally) credentials as above, as well as --all-tables or --table, but also use the --generate-only argument. This will generate the classes and cease further operation.
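
For example:

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --table employee_names --generate-only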

You can override the $HADOOP_HOME environment variable within Sqoop with the --hadoop-home argument.
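
For example (the Hadoop installation path shown here is just an illustration):

$ hadoop jar /path/to/sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://database.example.com/employees --table employee_names --hadoop-home /path/to/hadoop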