| Interface | Description |
|---|---|
| ConnectionInfo | Helper interface to get connection-related information. |
| PartitionHandler | |
| RecordWriter | |
| StreamingConnection | |
| Class | Description |
|---|---|
| AbstractRecordWriter | |
| AbstractRecordWriter.OrcMemoryPressureMonitor | |
| ConnectionStats | Stores statistics about a streaming connection. |
| HiveStreamingConnection | Streaming connection implementation for Hive. |
| HiveStreamingConnection.Builder | |
| PartitionInfo | Simple wrapper class for the minimal partition-related information used by streaming ingest. |
| StrictDelimitedInputWriter | Streaming writer that handles delimited input (e.g. CSV). |
| StrictDelimitedInputWriter.Builder | |
| StrictJsonWriter | Streaming writer that handles UTF-8 encoded JSON (strict syntax). |
| StrictJsonWriter.Builder | |
| StrictRegexWriter | Streaming writer that handles text input data with a regex. |
| StrictRegexWriter.Builder | |
| Enum | Description |
|---|---|
| HiveStreamingConnection.TxnState | |
| Exception | Description |
|---|---|
| ConnectionError | |
| InvalidTable | |
| InvalidTransactionState | |
| PartitionCreationFailed | |
| SerializationError | |
| StreamingException | |
| StreamingIOFailure | |
| TransactionError | |
Traditionally, adding new data to Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially batch insertion; inserting new data into an existing partition or table is not done in a way that gives consistent results to readers. The Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be committed continuously in small batches of records into a Hive partition, and once data is committed it becomes immediately visible to all Hive queries initiated subsequently.
This API is intended for streaming clients such as NiFi, Flume and Storm, which continuously generate data. Streaming support is built on top of ACID based insert/update support in Hive.
The classes and interfaces that make up the Hive streaming API fall broadly into two groups. The first provides support for connection and transaction management, while the second provides I/O support. Transactions are managed by the Hive MetaStore. Writes are performed to HDFS via Hive wrapper APIs that bypass the MetaStore.
Note on packaging: The APIs are defined in the org.apache.hive.streaming Java package and included as the hive-streaming jar.
A few things are currently required to use streaming.
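The individual requirements are not reproduced here, but ACID-based streaming ingest generally requires a transactional table and the database transaction manager to be enabled. A typical hive-site.xml fragment might look like the following (a sketch, not an exhaustive list; verify the exact settings against your Hive version):

```xml
<!-- Enable the ACID transaction manager required for streaming ingest. -->
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<!-- Run the compactor so the delta files produced by streaming get merged. -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```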
Note: Streaming to unpartitioned tables is also supported.
The class HiveEndPoint describes a Hive endpoint to connect to: either a Hive table or a partition. An endpoint is cheap to create and does not internally hold on to any network connections. Invoking the newConnection method on it creates a new connection to the Hive MetaStore for streaming purposes and returns a StreamingConnection object. Multiple connections can be established on the same endpoint. The StreamingConnection can then be used to initiate new transactions for performing I/O.
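In the org.apache.hive.streaming package summarized in the tables above, a streaming connection is built via HiveStreamingConnection.Builder rather than the older HiveEndPoint entry point. A minimal sketch (the metastore URI, database, and table names are placeholders, and the target table must be transactional):

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.streaming.HiveStreamingConnection;
import org.apache.hive.streaming.StreamingConnection;
import org.apache.hive.streaming.StreamingException;
import org.apache.hive.streaming.StrictDelimitedInputWriter;

public class ConnectExample {
  public static void main(String[] args) throws StreamingException {
    HiveConf conf = new HiveConf();
    // Placeholder URI; point this at your Hive MetaStore.
    conf.set("hive.metastore.uris", "thrift://localhost:9083");

    // Writer for comma-delimited records.
    StrictDelimitedInputWriter writer = StrictDelimitedInputWriter.newBuilder()
        .withFieldDelimiter(',')
        .build();

    // "mydb" and "mytable" are hypothetical names.
    StreamingConnection connection = HiveStreamingConnection.newBuilder()
        .withDatabase("mydb")
        .withTable("mytable")
        .withAgentInfo("example-agent") // identifies this client to the metastore
        .withRecordWriter(writer)
        .withHiveConf(conf)
        .connect();

    // ... begin transactions and write records here ...

    connection.close();
  }
}
```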
Concurrency Note: I/O can be performed on multiple TransactionBatch objects concurrently. However, the transactions within a transaction batch must be consumed sequentially.
These classes and interfaces provide support for writing the data to Hive within a transaction. RecordWriter is the interface implemented by all writers. A writer is responsible for taking a record in the form of a byte[] containing data in a known format (e.g. CSV) and writing it out in the format supported by Hive streaming. A RecordWriter may reorder or drop fields from the incoming record if necessary to map them to the corresponding columns in the Hive table. A streaming client will instantiate an appropriate RecordWriter type and pass it to StreamingConnection.fetchTransactionBatch(). The streaming client does not directly interact with the RecordWriter thereafter, but relies on the TransactionBatch to do so.
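With the newer HiveStreamingConnection API, the transaction cycle is driven directly on the connection rather than through an explicit TransactionBatch. A sketch of the write path, assuming `connection` is an already-open StreamingConnection and the table has two comma-delimited columns (the record contents are hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Each begin/commit pair makes one small batch of records visible atomically.
connection.beginTransaction();
connection.write("1,alice".getBytes(StandardCharsets.UTF_8)); // bytes in the writer's format
connection.write("2,bob".getBytes(StandardCharsets.UTF_8));
connection.commitTransaction(); // rows become visible to subsequent Hive queries

// On error, connection.abortTransaction() discards the uncommitted records.
connection.close(); // release metastore resources when done
```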
Currently, out of the box, the streaming API provides two implementations of the RecordWriter interface: one handles delimited input data (such as CSV or tab-separated values) and the other handles JSON (strict syntax). Support for other input formats can be provided by additional implementations of the RecordWriter interface.
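Both bundled writers are constructed through their builders; for example (the delimiter choice here is an assumption for illustration):

```java
import org.apache.hive.streaming.StrictDelimitedInputWriter;
import org.apache.hive.streaming.StrictJsonWriter;

// Delimited writer for records such as "1,alice,engineering".
StrictDelimitedInputWriter delimitedWriter = StrictDelimitedInputWriter.newBuilder()
    .withFieldDelimiter(',')
    .build();

// JSON writer for UTF-8 encoded, strict-syntax JSON records.
StrictJsonWriter jsonWriter = StrictJsonWriter.newBuilder()
    .build();
```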
Each StreamingConnection writes data at the rate the underlying FileSystem can accept it. If that is not sufficient, multiple StreamingConnection objects can be created concurrently.
Each StreamingConnection can have at most one outstanding TransactionBatch, and each TransactionBatch may have at most two threads operating on it. See TransactionBatch.
Copyright © 2022 The Apache Software Foundation. All rights reserved.