|SIP | 4 |
|Title | Public API for Sqoop v1.0.0 |
|Author | Aaron Kimball (aaron at cloudera dot com) |
|Created | May 14, 2010 |
|Status | Accepted |
|Discussion | "http://github.com/cloudera/sqoop/issues/issue/13":http://github.com/cloudera/sqoop/issues/issue/13 |
|Implementation| "http://review.hbase.org/r/73/":http://review.hbase.org/r/73/ |

h2. Abstract

This SIP defines the public API to be exposed in the first release of Sqoop. The @org.apache.hadoop.sqoop.lib@ package contains the public API relied upon by external clients of Sqoop. Generated code produced by Sqoop depends on these modules. Clients of imported data may also rely on additional modules specified here.

h2. Problem statement

To deal with the unique table schemas of each database a Sqoop user imports, Sqoop's current design requires that it generate a per-table class. This class is used to interact with the data after it is imported to Hadoop; data can be stored in SequenceFiles, requiring this class to deserialize records. Subsequent re-exports of the data rely on this class to push records back to the RDBMS. The generated class also includes support for parsing text-based representations of the data.

This class, however, relies on reusable code modules provided with Sqoop. These code modules are all placed in the @org.apache.hadoop.sqoop.lib@ package. Clients of generated code must be able to rely on previously-generated code to work with later versions of Sqoop. While code regeneration is possible, Sqoop users should see the @lib@ package as the most stable API provided by Sqoop.

Sqoop also provides a file format for large object data; while large objects can be manipulated in the context of their encapsulating records (e.g., through @BlobRef@ or @ClobRef@ references to the data), the large object file store may be inspected directly.

This SIP defines the official "surface area" of the public packages which will be maintained.
In order to ensure that future versions remain backwards compatible, some existing class definitions must be modified. It is hoped that these sorts of "breaking changes" will occur only before incrementing the major version number (1.0, 2.0, etc.), and are thus infrequent disruptions to Sqoop users. Sqoop clients who target only the APIs specified may be confident that their programs will work properly with all subsequent Sqoop releases in the 1.0 series (in accordance with the compatibility and deprecation policy specified in [[SIP-2]]).

h2. Specification

h3. lib package

As of May 14, 2010, the lib package contains the following classes:

* @BigDecimalSerializer@
* @BlobRef@
* @ClobRef@
* @FieldFormatter@
* @JdbcWritableBridge@
* @LargeObjectLoader@
* @LobRef@
* @LobSerializer@
* @RecordParser@
* @TaskId@

and the following interface:

* @SqoopRecord@

Classes generated by Sqoop fulfill the interface of @SqoopRecord@.

The first change necessary in this package is to transform @SqoopRecord@ from an interface into an abstract class. This way, subsequent releases in the 1.0 series can introduce additional methods required by SqoopRecords along with a default implementation for previously-generated clients.

The @TaskId@ class is improperly placed in this package. This class is Sqoop-internal and should be moved to the @util@ package.

We should add a class called @DelimiterSet@ which encapsulates the parameters regarding formatting of delimiters around fields: the field terminator, the record terminator, the escape character, the enclosing character, and whether the enclosing character is optional. This would allow sets of delimiters to be manipulated easily. The @SqoopRecord@ class could then be extended with a @toString(DelimiterSet)@ method that allows users to format output with delimiters other than the ones specified at codegen time.

@LobRef@ is an abstract base class that encapsulates common code in @BlobRef@ and @ClobRef@.
The constructors for @LobRef@ are marked as @protected@. Clients of Sqoop should not subclass @LobRef@ directly.

Classes in the lib package may depend on classes elsewhere in Sqoop's implementation. Clients should not do so directly.

h3. io package

Clients of Sqoop who have imported large objects into HDFS may have large object files holding their data; this file format is defined in [[SIP-3]]. The large objects may be manipulated by iterating over their encapsulating records and calling @{B,C}lobRef.getDataStream()@, which will retrieve the data for a large object from its underlying store. However, the large objects may also be directly retrieved from their underlying LobFile storage.

The @org.apache.hadoop.sqoop.io.LobFile@ class is considered part of the public API. Clients of Sqoop may depend on the @LobFile.Writer@ and @LobFile.Reader@ APIs. Clients should never instantiate subclasses of @Writer@ and @Reader@ directly; instead they should use the static methods @LobFile.create()@ and @LobFile.open()@ respectively. The underlying concrete Writer and Reader implementation classes are considered private.

To allow users to verify the compression formats available in LobFiles, the @CodecMap.getCodecNames()@ method is also public.

h3. Entry-points to Sqoop

A well-defined programmatic entry-point to Sqoop is *not* defined by this specification. The only method of @org.apache.hadoop.sqoop.Sqoop@ considered stable is its @main()@ method; all others are currently internal. This restriction will be relaxed in a future specification, allowing programmatic client interaction with Sqoop.

h3. Base package

The base package in Sqoop is currently @org.apache.hadoop.sqoop@. To reflect Sqoop's migration from an Apache Hadoop subproject to its own project, the class hierarchy should be moved to @com.cloudera.sqoop@.

h2. Compatibility Issues

The modification of @SqoopRecord@ from interface to class will cause existing generated code to break.
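
The reason for moving @SqoopRecord@ to an abstract class can be sketched in Java. This is only an illustration under stated assumptions: @SketchRecord@, @SketchDelimiters@, @EmployeeRecord@ and @fieldValues()@ are hypothetical stand-ins, not Sqoop's actual classes or signatures, and escape and enclosing characters are ignored. It shows a base class gaining a delimiter-aware @toString@ overload with a default body while a subclass written earlier keeps working unchanged.

```java
import java.util.Arrays;
import java.util.List;

class AbstractRecordSketch {

    /** Hypothetical, reduced stand-in for the proposed DelimiterSet (two parameters only). */
    static final class SketchDelimiters {
        final char fieldTerminator;
        final char recordTerminator;

        SketchDelimiters(char fieldTerminator, char recordTerminator) {
            this.fieldTerminator = fieldTerminator;
            this.recordTerminator = recordTerminator;
        }
    }

    /** Hypothetical stand-in for SqoopRecord expressed as an abstract class. */
    abstract static class SketchRecord {
        /** Assumed to exist from the first release; generated classes implement it. */
        abstract List<String> fieldValues();

        /**
         * Imagined as added in a later release. Because it ships with a default
         * body, subclasses generated before it existed need no changes.
         */
        String toString(SketchDelimiters delims) {
            StringBuilder sb = new StringBuilder();
            List<String> values = fieldValues();
            for (int i = 0; i < values.size(); i++) {
                if (i > 0) {
                    sb.append(delims.fieldTerminator);
                }
                sb.append(values.get(i));
            }
            sb.append(delims.recordTerminator);
            return sb.toString();
        }
    }

    /** Plays the role of a class generated before the new overload was introduced. */
    static final class EmployeeRecord extends SketchRecord {
        @Override
        List<String> fieldValues() {
            return Arrays.asList("42", "alice");
        }
    }

    /** Formats the older record with tab and newline delimiters. */
    static String demo() {
        return new EmployeeRecord().toString(new SketchDelimiters('\t', '\n'));
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Here @EmployeeRecord@ never mentions the new overload, yet callers can still format it with alternate delimiters through the inherited default.
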
Such a change is expected prior to the 1.0.0 release. This is the last interface in Sqoop; once it is transitioned to an abstract class, subsequent changes to the SqoopRecord API should be backwards-compatible.

h2. Test Plan

The changes required to implement this specification are minimal; the existing unit test suite should cover all necessary testing.

h2. Discussion

Please provide feedback and comments at "http://github.com/cloudera/sqoop/issues/issue/13":http://github.com/cloudera/sqoop/issues/issue/13