////
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to you under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
////

Notes for specific connectors
-----------------------------

pg_bulkload connector
~~~~~~~~~~~~~~~~~~~~~

Purpose
^^^^^^^

The pg_bulkload connector is a direct connector for exporting data into
PostgreSQL. This connector uses
http://pgbulkload.projects.postgresql.org/index.html[pg_bulkload].
Users benefit from pg_bulkload functionality such as fast exports that
bypass shared buffers and WAL, flexible handling of error records, and
an ETL feature with filter functions.

Requirements
^^^^^^^^^^^^

The pg_bulkload connector requires the following conditions for export
job execution (see the sketch after this list for a quick way to check
them):

* link:http://pgbulkload.projects.postgresql.org/index.html[pg_bulkload]
  must be installed on the DB server and on all slave nodes. An RPM for
  RedHat or CentOS is available on the
  link:http://pgfoundry.org/frs/?group_id=1000261[download page].
* The link:http://jdbc.postgresql.org/index.html[PostgreSQL JDBC] driver
  is required on the client node.
* The superuser role of the PostgreSQL database is required for
  execution of pg_bulkload.
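One way to verify these conditions is from a shell, as in the following
minimal sketch. The host, database, and role names are the ones used in
the example later in this section, and the JDBC driver jar is assumed to
have been copied into Sqoop's +lib+ directory; adjust all of these to
your own setup.

----
$ which pg_bulkload                      # run on the DB server and each slave node
$ ls $SQOOP_HOME/lib/postgresql-*.jar    # JDBC driver on the client node (assumed location)
$ psql -h pgsql.example.net -U sqooptest -d sqooptest \
    -c "SELECT rolsuper FROM pg_roles WHERE rolname = current_user"
----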
Syntax
^^^^^^

Use the +--connection-manager+ option to specify the connection manager
class name.

----
$ sqoop export (generic-args) --connection-manager org.apache.sqoop.manager.PGBulkloadManager (export-args)
$ sqoop-export (generic-args) --connection-manager org.apache.sqoop.manager.PGBulkloadManager (export-args)
----

This connector supports the export arguments shown below.

.Supported export control arguments:
[grid="all"]
`----------------------------------------`---------------------------------------
Argument                                 Description
---------------------------------------------------------------------------------
+\--export-dir <dir>+                    HDFS source path for the export
+-m,\--num-mappers <n>+                  Use 'n' map tasks to export in\
                                         parallel
+\--table <table-name>+                  Table to populate
+\--input-null-string <null-string>+     The string to be interpreted as\
                                         null for string columns
+\--clear-staging-table+                 Indicates that any data present in\
                                         the staging table can be deleted.
---------------------------------------------------------------------------------

There is additional configuration for pg_bulkload execution, specified
via Hadoop Configuration properties, which can be given with the
+-D <property=value>+ option. Because Hadoop Configuration properties
are generic arguments of Sqoop, they must precede any export control
arguments.

.Supported export control properties:
[grid="all"]
`-----------------------------`----------------------------------------------
Property                      Description
-----------------------------------------------------------------------------
mapred.reduce.tasks           Number of reduce tasks for staging. \
                              The default value is 1. \
                              Each task does staging in a single transaction.
pgbulkload.bin                Path of the pg_bulkload binary \
                              installed on each slave node.
pgbulkload.check.constraints  Specify whether CHECK constraints are checked \
                              during the loading. \
                              The default value is YES.
pgbulkload.parse.errors       The maximum number of ignored records \
                              that cause errors during parsing, \
                              encoding, filtering, constraints checking, \
                              and data type conversion. \
                              Error records are recorded \
                              in the PARSE BADFILE. \
                              The default value is INFINITE.
pgbulkload.duplicate.errors   Number of ignored records \
                              that violate unique constraints. \
                              Duplicated records are recorded in the \
                              DUPLICATE BADFILE on the DB server. \
                              The default value is INFINITE.
pgbulkload.filter             Specify the filter function \
                              to convert each row in the input file. \
                              See the pg_bulkload documentation to learn \
                              how to write FILTER functions.
-----------------------------------------------------------------------------

Here is an example of a complete command line:

----
$ sqoop export \
    -Dmapred.reduce.tasks=2 \
    -Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
    -Dpgbulkload.input.field.delim=$'\t' \
    -Dpgbulkload.check.constraints="YES" \
    -Dpgbulkload.parse.errors="INFINITE" \
    -Dpgbulkload.duplicate.errors="INFINITE" \
    --connect jdbc:postgresql://pgsql.example.net:5432/sqooptest \
    --connection-manager org.apache.sqoop.manager.PGBulkloadManager \
    --table test --username sqooptest --export-dir=/test -m 2
----

Data Staging
^^^^^^^^^^^^

Each map task of the pg_bulkload connector's export job creates its own
staging table on the fly. The name of a staging table is derived from
the destination table name and the task attempt ID. For example, the
staging table for the "test" table might be named
+test_attempt_1345021837431_0001_m_000000_0+.

Staging tables are dropped automatically when tasks complete
successfully or when map tasks fail. When a reduce task fails, the
staging table for that task is left behind for manual retry, and users
must clean it up themselves.
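For example, you could finish the failed load by hand and then drop the
leftover staging table. This is a minimal sketch, assuming the staging
table name shown above and that its rows never reached the destination
table; verify what the staging table actually contains before copying
anything:

----
$ psql -h pgsql.example.net -U sqooptest -d sqooptest
sqooptest=# -- copy the staged rows into the destination, then drop the staging table
sqooptest=# INSERT INTO test SELECT * FROM test_attempt_1345021837431_0001_m_000000_0;
sqooptest=# DROP TABLE test_attempt_1345021837431_0001_m_000000_0;
----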