////
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to you under the Apache License, Version
  2.0 (the "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
  implied. See the License for the specific language governing
  permissions and limitations under the License.
////

+sqoop-export+
--------------

Purpose
~~~~~~~

include::export-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
----

Although the Hadoop generic arguments must precede any export
arguments, the export arguments can be entered in any order with
respect to one another.

include::common-args.txt[]

.Export control arguments:
[grid="all"]
`------------------------------------------`-----------------------------------
Argument                                    Description
--------------------------------------------------------------------------------
+\--direct+                                 Use direct export fast path
+\--export-dir <dir>+                       HDFS source path for the export
+-m,\--num-mappers <n>+                     Use 'n' map tasks to export in\
                                            parallel
+\--table <table-name>+                     Table to populate
+\--update-key <col-name>+                  Anchor column to use for updates.\
                                            Use a comma separated list of\
                                            columns if there is more than one\
                                            column.
+\--update-mode <mode>+                     Specify how updates are performed\
                                            when new rows are found with\
                                            non-matching keys in the database.\
                                            Legal values for +mode+ include\
                                            +updateonly+ (default) and\
                                            +allowinsert+.
+\--input-null-string <null-string>+        The string to be interpreted as\
                                            null for string columns
+\--input-null-non-string <null-string>+    The string to be interpreted as\
                                            null for non-string columns
+\--staging-table <staging-table-name>+     The table in which data will be\
                                            staged before being inserted into\
                                            the destination table.
+\--clear-staging-table+                    Indicates that any data present in\
                                            the staging table can be deleted.
+\--batch+                                  Use batch mode for underlying\
                                            statement execution.
--------------------------------------------------------------------------------

The +\--table+ and +\--export-dir+ arguments are required. These
specify the table to populate in the database, and the directory in
HDFS that contains the source data.

You can control the number of mappers independently from the number of
files present in the directory. Export performance depends on the
degree of parallelism. By default, Sqoop will use four tasks in
parallel for the export process. This may not be optimal; you will
need to experiment with your own particular setup. Additional tasks
may offer better concurrency, but if the database is already
bottlenecked on updating indices, invoking triggers, and so on, then
additional load may decrease performance. The +\--num-mappers+ or +-m+
arguments control the number of map tasks, which is the degree of
parallelism used.

MySQL provides a direct mode for exports as well, using the
+mysqlimport+ tool. When exporting to MySQL, use the +\--direct+
argument to specify this codepath. This may be higher-performance than
the standard JDBC codepath.

NOTE: When using export in direct mode with MySQL, the MySQL bulk
utility +mysqlimport+ must be available in the shell path of the task
process.
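
For example, a direct-mode export to MySQL that uses eight parallel
map tasks might be invoked as follows; the connect string, table name,
and HDFS path shown here are placeholders:

----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --direct --num-mappers 8
----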

The +\--input-null-string+ and +\--input-null-non-string+ arguments
are optional. If +\--input-null-string+ is not specified, then the
string "null" will be interpreted as null for string-type columns. If
+\--input-null-non-string+ is not specified, then both the string
"null" and the empty string will be interpreted as null for non-string
columns. Note that the empty string will always be interpreted as null
for non-string columns, in addition to any other string specified by
+\--input-null-non-string+.

Since Sqoop breaks down the export process into multiple transactions,
it is possible that a failed export job may result in partial data
being committed to the database. This can further lead to subsequent
jobs failing due to insert collisions in some cases, or lead to
duplicated data in others. You can overcome this problem by specifying
a staging table via the +\--staging-table+ option which acts as an
auxiliary table that is used to stage exported data. The staged data
is finally moved to the destination table in a single transaction.

In order to use the staging facility, you must create the staging
table prior to running the export job. This table must be structurally
identical to the target table. This table should either be empty
before the export job runs, or the +\--clear-staging-table+ option
must be specified. If the staging table contains data and the
+\--clear-staging-table+ option is specified, Sqoop will delete all of
the data before starting the export job.

NOTE: Support for staging data prior to pushing it into the
destination table is not available for +\--direct+ exports. It is also
not available when export is invoked using the +\--update-key+ option
for updating existing data.

Inserts vs. Updates
~~~~~~~~~~~~~~~~~~~

By default, +sqoop-export+ appends new rows to a table; each input
record is transformed into an +INSERT+ statement that adds a row to
the target database table. If your table has constraints (e.g., a
primary key column whose values must be unique) and already contains
data, you must take care to avoid inserting records that violate these
constraints. The export process will fail if an +INSERT+ statement
fails. This mode is primarily intended for exporting records to a new,
empty table intended to receive these results.

If you specify the +\--update-key+ argument, Sqoop will instead modify
an existing dataset in the database. Each input record is treated as
an +UPDATE+ statement that modifies an existing row. The row a
statement modifies is determined by the column name(s) specified with
+\--update-key+. For example, consider the following table definition:

----
CREATE TABLE foo(
    id INT NOT NULL PRIMARY KEY,
    msg VARCHAR(32),
    bar INT);
----

Consider also a dataset in HDFS containing records like these:

----
0,this is a test,42
1,some more data,100
...
----

Running +sqoop-export \--table foo \--update-key id \--export-dir
/path/to/data \--connect ...+ will run an export job that executes SQL
statements based on the data like so:

----
UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
...
----

If an +UPDATE+ statement modifies no rows, this is not considered an
error; the export will silently continue. (In effect, this means that
an update-based export will not insert new rows into the database.)
Likewise, if the column specified with +\--update-key+ does not
uniquely identify rows and multiple rows are updated by a single
statement, this condition is also undetected.

The argument +\--update-key+ can also be given a comma separated list
of column names, in which case Sqoop will match all keys from this
list before updating any existing record.

Depending on the target database, you may also specify the
+\--update-mode+ argument with +allowinsert+ mode if you want to
update rows if they exist in the database already or insert rows if
they do not exist yet.
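
For example, an update-or-insert export of the +foo+ table described
above could hypothetically be invoked as follows; the connect string
is a placeholder:

----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table foo \
    --export-dir /path/to/data \
    --update-key id --update-mode allowinsert
----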

include::input-args.txt[]

include::output-args.txt[]

Sqoop automatically generates code to parse and interpret records of
the files containing the data to be exported back to the database. If
these files were created with non-default delimiters (the defaults are
comma-separated fields with newline-separated records), you should
specify the same delimiters again so that Sqoop can parse your files.

If you specify incorrect delimiters, Sqoop will fail to find enough
columns per line. This will cause export map tasks to fail by throwing
+ParseExceptions+.

include::codegen-args.txt[]

If the records to be exported were generated as the result of a
previous import, then the original generated class can be used to read
the data back. Specifying +\--jar-file+ and +\--class-name+ obviates
the need to specify delimiters in this case.

The use of existing generated code is incompatible with
+\--update-key+; an update-mode export requires new code generation to
perform the update. You cannot use +\--jar-file+, and must fully
specify any non-default delimiters.
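
For example, assuming an earlier import generated a class named +bar+
and compiled it into a jar, an export that reuses that code might look
like the following; the class name and jar path are hypothetical:

----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --class-name bar --jar-file /tmp/sqoop-compile/bar.jar
----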

Exports and Transactions
~~~~~~~~~~~~~~~~~~~~~~~~

Exports are performed by multiple writers in parallel. Each writer
uses a separate connection to the database; these have separate
transactions from one another. Sqoop uses the multi-row +INSERT+
syntax to insert up to 100 records per statement. Every 100
statements, the current transaction within a writer task is committed,
causing a commit every 10,000 rows. This ensures that transaction
buffers do not grow without bound and cause out-of-memory conditions.
Therefore, an export is not an atomic process. Partial results from
the export will become visible before the export is complete.

Failed Exports
~~~~~~~~~~~~~~

Exports may fail for a number of reasons:

- Loss of connectivity from the Hadoop cluster to the database (either
  due to hardware fault, or server software crashes)
- Attempting to +INSERT+ a row which violates a consistency constraint
  (for example, inserting a duplicate primary key value)
- Attempting to parse an incomplete or malformed record from the HDFS
  source data
- Attempting to parse records using incorrect delimiters
- Capacity issues (such as insufficient RAM or disk space)

If an export map task fails due to these or other reasons, it will
cause the export job to fail. The results of a failed export are
undefined. Each export map task operates in a separate transaction.
Furthermore, individual map tasks +commit+ their current transaction
periodically. If a task fails, the current transaction will be rolled
back. Any previously-committed transactions will remain durable in the
database, leading to a partially-complete export.

Example Invocations
~~~~~~~~~~~~~~~~~~~

A basic export to populate a table named +bar+:

----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data
----

This example takes the files in +/results/bar_data+ and injects their
contents into the +bar+ table in the +foo+ database on
+db.example.com+. The target table must already exist in the database.
Sqoop performs a set of +INSERT INTO+ operations, without regard for
existing content. If Sqoop attempts to insert rows which violate
constraints in the database (for example, a particular primary key
value already exists), then the export fails.
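
As a further illustration, if the source files had been written with
tab-separated fields and used the string \N to represent null, the
same export could hypothetically be invoked with explicit input
parsing and null-string arguments; the values shown are placeholders:

----
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --input-fields-terminated-by '\t' \
    --input-null-string '\\N' --input-null-non-string '\\N'
----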