////
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements. See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to you under the Apache License, Version
   2.0 (the "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
   implied. See the License for the specific language governing
   permissions and limitations under the License.
////

Saved Jobs
----------

Imports and exports can be repeatedly performed by issuing the same command
multiple times. Especially when using the incremental import capability,
this is an expected scenario.

Sqoop allows you to define _saved jobs_ which make this process easier. A
saved job records the configuration information required to execute a
Sqoop command at a later time. The section on the +sqoop-job+ tool
describes how to create and work with saved jobs.

By default, job descriptions are saved to a private repository stored in
+$HOME/.sqoop/+. You can configure Sqoop to instead use a shared
_metastore_, which makes saved jobs available to multiple users across a
shared cluster. Starting the metastore is covered by the section on the
+sqoop-metastore+ tool.

+sqoop-job+
-----------

Purpose
~~~~~~~

include::job-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
----

Although the Hadoop generic arguments must precede any job arguments,
the job arguments can be entered in any order with respect to one another.

.Job management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
------------------------------------------------------------------------
+\--create <job-id>+         Define a new saved job with the specified \
                             job-id (name). A second Sqoop \
                             command-line, separated by a +\--+, should \
                             be specified; this defines the saved job.
+\--delete <job-id>+         Delete a saved job.
+\--exec <job-id>+           Given a job defined with +\--create+, run \
                             the saved job.
+\--show <job-id>+           Show the parameters for a saved job.
+\--list+                    List all saved jobs.
------------------------------------------------------------------------

Creating saved jobs is done with the +\--create+ action. This operation
requires a +\--+ followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job. Consider:

----
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
----

This creates a job named +myjob+ which can be executed later. The job is
not run. This job is now available in the list of saved jobs:

----
$ sqoop job --list
Available jobs:
  myjob
----

We can inspect the configuration of a job with the +show+ action:

----
$ sqoop job --show myjob
 Job: myjob
 Tool: import
 Options:
 ----------------------------
 direct.import = false
 codegen.input.delimiters.record = 0
 hdfs.append.dir = false
 db.table = mytable
 ...
----

And if we are satisfied with it, we can run the job with +exec+:

----
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
----
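A saved job that is no longer needed can be removed with the +\--delete+
action. As a minimal illustration (the job name +oldjob+ is hypothetical):

----
$ sqoop job --delete oldjob
----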
The +exec+ action allows you to override arguments of the saved job by
supplying them after a +\--+. For example, if the database were changed
to require a username, we could specify the username and password with:

----
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
----

.Metastore connection options:
[grid="all"]
`----------------------------`-----------------------------------------
Argument                      Description
------------------------------------------------------------------------
+\--meta-connect <jdbc-uri>+  Specifies the JDBC connect string used \
                              to connect to the metastore
------------------------------------------------------------------------

By default, a private metastore is instantiated in +$HOME/.sqoop+. If you
have configured a hosted metastore with the +sqoop-metastore+ tool, you
can connect to it by specifying the +\--meta-connect+ argument. This is a
JDBC connect string just like the ones used to connect to databases for
import.

In +conf/sqoop-site.xml+, you can configure
+sqoop.metastore.client.autoconnect.url+ with this address, so you do not
have to supply +\--meta-connect+ to use a remote metastore. This parameter
can also be modified to move the private metastore to a location on your
filesystem other than your home directory.

If you configure +sqoop.metastore.client.enable.autoconnect+ with the
value +false+, then you must explicitly supply +\--meta-connect+.

.Common options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
------------------------------------------------------------------------
+\--help+                    Print usage instructions
+\--verbose+                 Print more information while working
------------------------------------------------------------------------

Saved jobs and passwords
~~~~~~~~~~~~~~~~~~~~~~~~

The Sqoop metastore is not a secure resource. Multiple users can access
its contents. For this reason, Sqoop does not store passwords in the
metastore. If you create a job that requires a password, you will be
prompted for that password each time you execute the job.

You can enable passwords in the metastore by setting
+sqoop.metastore.client.record.password+ to +true+ in the configuration.

Note that you have to set +sqoop.metastore.client.record.password+ to
+true+ if you are executing saved jobs via Oozie because Sqoop cannot
prompt the user to enter passwords while being executed as Oozie tasks.

Saved jobs and incremental imports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Incremental imports are performed by comparing the values in a _check
column_ against a reference value for the most recent import. For
example, if the +\--incremental append+ argument was specified, along
with +\--check-column id+ and +\--last-value 100+, all rows with
+id > 100+ will be imported. If an incremental import is run from the
command line, the value which should be specified as +\--last-value+ in a
subsequent incremental import will be printed to the screen for your
reference. If an incremental import is run from a saved job, this value
will be retained in the saved job. Subsequent runs of
+sqoop job \--exec someIncrementalJob+ will continue to import only newer
rows than those previously imported.
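To tie the pieces together, a saved incremental-import job such as the
+someIncrementalJob+ mentioned above might be created and run as follows.
This is a sketch only: the connect string and the +id+ check column reuse
the earlier +myjob+ example, and the initial +\--last-value+ of 0 is an
assumption.

----
$ sqoop job --create someIncrementalJob -- import \
    --connect jdbc:mysql://example.com/db --table mytable \
    --incremental append --check-column id --last-value 0
$ sqoop job --exec someIncrementalJob
----

After each successful run, the +\--last-value+ retained in the saved job
is updated, so the next +\--exec+ imports only rows added since the
previous run.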
+sqoop-metastore+
-----------------

Purpose
~~~~~~~

include::metastore-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
----

Although the Hadoop generic arguments must precede any metastore
arguments, the metastore arguments can be entered in any order with
respect to one another.

.Metastore management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
------------------------------------------------------------------------
+\--shutdown+                Shuts down a running metastore instance \
                             on the same machine.
------------------------------------------------------------------------

Running +sqoop-metastore+ launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create
jobs which can be shared between users for execution.

The location of the metastore's files on disk is controlled by the
+sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+.
This should point to a directory on the local filesystem.

The metastore is available over TCP/IP. The port is controlled by the
+sqoop.metastore.server.port+ configuration parameter, and defaults to
16000.

Clients should connect to the metastore by specifying
+sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with the
value +jdbc:hsqldb:hsql://<server-name>:<port>/sqoop+. For example,
+jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+.

This metastore may be hosted on a machine within the Hadoop cluster, or
elsewhere on the network.

+sqoop-merge+
-------------

Purpose
~~~~~~~

include::merge-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
----

Although the Hadoop generic arguments must precede any merge arguments,
the merge arguments can be entered in any order with respect to one
another.

.Merge options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
------------------------------------------------------------------------
+\--class-name <class>+      Specify the name of the record-specific \
                             class to use during the merge job.
+\--jar-file <file>+         Specify the name of the jar to load the \
                             record class from.
+\--merge-key <col>+         Specify the name of a column to use as \
                             the merge key.
+\--new-data <path>+         Specify the path of the newer dataset.
+\--onto <path>+             Specify the path of the older dataset.
+\--target-dir <path>+       Specify the target path for the output \
                             of the merge job.
------------------------------------------------------------------------

The +merge+ tool runs a MapReduce job that takes two directories as
input: a newer dataset, and an older one. These are specified with
+\--new-data+ and +\--onto+ respectively. The output of the MapReduce job
will be placed in the directory in HDFS specified by +\--target-dir+.

When merging the datasets, it is assumed that there is a unique primary
key value in each record. The column for the primary key is specified
with +\--merge-key+. Multiple rows in the same dataset should not have
the same primary key, or else data loss may occur.

To parse the dataset and extract the key column, the auto-generated class
from a previous import must be used. You should specify the class name
and jar file with +\--class-name+ and +\--jar-file+. If this is not
available, you can recreate the class using the +codegen+ tool.
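For instance, if the generated jar from the original import is no longer
available, the record class could be regenerated with the +codegen+ tool
along these lines (a sketch; the connect string and table name reuse the
earlier example):

----
$ sqoop codegen --connect jdbc:mysql://example.com/db --table mytable
----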
The merge tool is typically run after an incremental import with the
date-last-modified mode (+sqoop import --incremental lastmodified ...+).

Supposing two incremental imports were performed, where some older data
is in an HDFS directory named +older+ and newer data is in an HDFS
directory named +newer+, these could be merged like so:

----
$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
----

This would run a MapReduce job where the value in the +id+ column of each
row is used to join rows; rows in the +newer+ dataset will be used in
preference to rows in the +older+ dataset.

This can be used with SequenceFile-, Avro-, and text-based incremental
imports. The file types of the newer and older datasets must be the same.
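For context, the +newer+ dataset in the example above could have been
produced by a lastmodified-mode incremental import along these lines (a
sketch; the check column +last_update_date+ and the timestamp are
hypothetical):

----
$ sqoop import --connect jdbc:mysql://example.com/db --table mytable \
    --incremental lastmodified --check-column last_update_date \
    --last-value "2010-08-19 00:00:00" --target-dir newer
----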