```
public interface AcidInputFormat<KEY extends org.apache.hadoop.io.WritableComparable,VALUE>
    extends org.apache.hadoop.mapred.InputFormat<KEY,VALUE>, InputFormatChecker
```

Type Parameters: `VALUE` - the row type
The goal is to provide ACID transactions to Hive, supporting primary use cases such as streaming ingest of data and row-level updates and deletes to existing tables.
The design changes the layout of data within a partition from being in files at the top level to having base and delta directories. Each write operation will be assigned a sequential global transaction id and each read operation will request the list of valid transaction ids.
The flat pre-ACID layout

```
$partition/$bucket
```

becomes

```
$partition/base_$tid/$bucket
$partition/delta_$tid_$tid/$bucket
```
With each new write operation a new delta directory is created with events that correspond to inserted, updated, or deleted rows. Each of the files is stored sorted by the original transaction id (ascending), bucket (ascending), row id (ascending), and current transaction id (descending). Thus the files can be merged by advancing through the files in parallel.
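The sort order above can be sketched as a comparator. This is a self-contained simulation; `Event` and its fields are hypothetical stand-ins, not Hive's actual classes:

```java
import java.util.Arrays;
import java.util.Comparator;

public class AcidSortOrder {
    // Hypothetical stand-in for an ACID row event; not Hive's actual class.
    public static class Event {
        public final long originalTxn, rowId, currentTxn;
        public final int bucket;
        public Event(long originalTxn, int bucket, long rowId, long currentTxn) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
            this.currentTxn = currentTxn;
        }
        @Override public String toString() {
            return "(" + originalTxn + "," + bucket + "," + rowId + "," + currentTxn + ")";
        }
    }

    // original txn asc, bucket asc, row id asc, current txn desc:
    // the newest version of each row sorts first among its versions.
    public static final Comparator<Event> SORT_ORDER =
        Comparator.<Event>comparingLong(e -> e.originalTxn)
                  .thenComparingInt(e -> e.bucket)
                  .thenComparingLong(e -> e.rowId)
                  .thenComparing(Comparator.<Event>comparingLong(e -> e.currentTxn).reversed());

    public static void main(String[] args) {
        Event[] events = {
            new Event(5, 0, 2, 5),   // older version of row (5, 0, 2)
            new Event(5, 0, 2, 9),   // newer update of the same row
            new Event(3, 1, 0, 3),
            new Event(3, 0, 7, 3),
        };
        Arrays.sort(events, SORT_ORDER);
        // A merged reader can emit the first event it sees for each
        // (originalTxn, bucket, rowId) and skip the older versions behind it.
        System.out.println(Arrays.toString(events));
        // prints [(3,0,7,3), (3,1,0,3), (5,0,2,9), (5,0,2,5)]
    }
}
```

Because every file is sorted on the same key, readers can merge any number of base and delta files with a single pass, the way a merge sort combines sorted runs.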
The base files include all transactions from the beginning of time (transaction id 0) up to the transaction id in the directory name. Delta directories include the transactions between the two transaction ids in their name, inclusive.
Because read operations get the list of valid transactions when they start, all reads are performed on that snapshot, regardless of any transactions that are committed afterwards.
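This snapshot behavior can be illustrated with a rough simulation (a hypothetical model, not Hive's actual `ValidTxnList` API): the reader captures the transaction state once, and commits after that point stay invisible to it.

```java
import java.util.HashSet;
import java.util.Set;

public class TxnSnapshot {
    // Hypothetical model of a valid-transaction snapshot: the highest txn id
    // allocated when the read began, plus the ids at or below it that had not
    // committed yet (open or aborted).
    final long highWatermark;
    final Set<Long> notCommitted;

    public TxnSnapshot(long highWatermark, Set<Long> notCommitted) {
        this.highWatermark = highWatermark;
        this.notCommitted = new HashSet<>(notCommitted);
    }

    // An event is visible only if its transaction had committed at snapshot time.
    public boolean isVisible(long txnId) {
        return txnId <= highWatermark && !notCommitted.contains(txnId);
    }

    public static void main(String[] args) {
        TxnSnapshot snap = new TxnSnapshot(10, Set.of(7L));
        System.out.println(snap.isVisible(5));   // true: committed before the read
        System.out.println(snap.isVisible(7));   // false: still open at snapshot time
        System.out.println(snap.isVisible(11));  // false: committed after the read began
    }
}
```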
The base and the delta directories have the transaction ids so that major (merge all deltas into the base) and minor (merge several deltas together) compactions can happen while readers continue their processing.
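A minimal sketch of why the ids in the names make this safe (a hypothetical helper over directory names only, not Hive's actual selection logic): each reader takes the largest base it can see plus the deltas that start after it, so an old base and a freshly compacted one can coexist in the same partition.

```java
import java.util.ArrayList;
import java.util.List;

public class DirectoryChooser {
    // Hypothetical selection over directory names: use the newest base, plus
    // every delta whose transaction range starts after that base.
    public static List<String> choose(List<String> dirs) {
        long bestBase = -1;
        for (String d : dirs) {
            if (d.startsWith("base_")) {
                bestBase = Math.max(bestBase, Long.parseLong(d.substring("base_".length())));
            }
        }
        List<String> chosen = new ArrayList<>();
        if (bestBase >= 0) {
            chosen.add("base_" + bestBase);
        }
        for (String d : dirs) {
            if (d.startsWith("delta_")) {
                long minTxn = Long.parseLong(d.split("_")[1]);
                if (minTxn > bestBase) {
                    chosen.add(d);  // not yet absorbed by the chosen base
                }
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // After a major compaction writes base_10, new readers pick it directly,
        // while the old base_5 stays on disk for readers that started earlier.
        System.out.println(choose(List.of(
            "base_5", "base_10", "delta_6_8", "delta_11_11", "delta_12_14")));
        // prints [base_10, delta_11_11, delta_12_14]
    }
}
```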
To support the transition from non-ACID layouts to ACID layouts, the input formats are expected to support both layouts and detect the correct one.
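The detection can be sketched as follows (a hypothetical helper over a directory listing; real input formats inspect the filesystem): `base_*` or `delta_*` entries mark a partition as ACID, otherwise the flat pre-ACID layout is assumed.

```java
import java.util.List;

public class LayoutDetector {
    // Hypothetical check over a partition's directory listing: ACID partitions
    // contain base_*/delta_* subdirectories, while pre-ACID partitions hold the
    // bucket files directly at the top level.
    public static boolean isAcidLayout(List<String> partitionEntries) {
        for (String name : partitionEntries) {
            if (name.startsWith("base_") || name.startsWith("delta_")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAcidLayout(List.of("base_10", "delta_11_11"))); // true
        System.out.println(isAcidLayout(List.of("000000_0", "000001_0")));   // false
    }
}
```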
A note on the KEY of this InputFormat: for row-at-a-time processing, the KEY can conveniently pass the RowId into the operator pipeline; for vectorized execution the KEY could perhaps represent a range in the batch. Since OrcInputFormat is declared to return a NullWritable key, org.apache.hadoop.hive.ql.io.AcidRecordReader is defined to provide access to the RowId. Other implementations of AcidInputFormat can use either mechanism.
| Modifier and Type | Interface and Description |
|---|---|
| `static interface` | `AcidInputFormat.AcidRecordReader<K,V>` - A RecordReader returned by an AcidInputFormat working in row-at-a-time mode should implement AcidRecordReader. |
| `static class` | `AcidInputFormat.Options` - Options for controlling the record readers. |
| `static interface` | `AcidInputFormat.RawReader<V>` |
| `static interface` | `AcidInputFormat.RowReader<V>` |
| Modifier and Type | Method and Description |
|---|---|
| `AcidInputFormat.RawReader<VALUE>` | `getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, ValidTxnList validTxnList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory)` - Get a reader that returns the raw ACID events (insert, update, delete). |
| `AcidInputFormat.RowReader<VALUE>` | `getReader(org.apache.hadoop.mapred.InputSplit split, AcidInputFormat.Options options)` - Get a record reader that provides the user-facing view of the data after it has been merged together. |
Methods inherited from interface `InputFormatChecker`: `validateInput`
```
AcidInputFormat.RowReader<VALUE> getReader(org.apache.hadoop.mapred.InputSplit split,
                                           AcidInputFormat.Options options)
                                    throws IOException
```

Parameters:

- `split` - the split to read
- `options` - the options to read with

Throws:

- `IOException`
```
AcidInputFormat.RawReader<VALUE> getRawReader(org.apache.hadoop.conf.Configuration conf,
                                              boolean collapseEvents,
                                              int bucket,
                                              ValidTxnList validTxnList,
                                              org.apache.hadoop.fs.Path baseDirectory,
                                              org.apache.hadoop.fs.Path[] deltaDirectory)
                                       throws IOException
```

Parameters:

- `conf` - the configuration
- `collapseEvents` - should the ACID events be collapsed so that only the last version of the row is kept
- `bucket` - the bucket to read
- `validTxnList` - the list of valid transactions to use
- `baseDirectory` - the base directory to read, or the root directory for old-style files
- `deltaDirectory` - a list of delta files to include in the merge

Throws:

- `IOException`
Copyright © 2017 The Apache Software Foundation. All rights reserved.