public class OrcInputFormat extends Object implements org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>, InputFormatChecker, VectorizedInputFormatInterface, LlapWrappableInputFormatInterface, SelfDescribingInputFormatInterface, AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>, CombineHiveInputFormat.AvoidSplitCombination, BatchToRowInputFormat
This class implements both the classic InputFormat, which stores the rows directly, and AcidInputFormat, which stores a series of events with the following schema:

    class AcidEvent<ROW> {
      enum ACTION {INSERT, UPDATE, DELETE}
      ACTION operation;
      long originalWriteId;
      int bucket;
      long rowId;
      long currentWriteId;
      ROW row;
    }

Each AcidEvent object corresponds to an update event. The originalWriteId, bucket, and rowId together uniquely identify the row. The operation and currentWriteId record the operation and the table write id, within the current transaction, that added this event. Insert and update events include the entire row, while delete events have null for row.
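The event schema above can be modeled as a small self-contained Java class. This is an illustrative sketch only (the real events are ORC structs, not this class); it shows that insert and update events carry the full row while a delete event carries null:

```java
// Illustrative model of an ACID event; not the actual Hive class.
public class AcidEventSketch {
    enum Action { INSERT, UPDATE, DELETE }

    static final class AcidEvent<R> {
        final Action operation;
        final long originalWriteId;  // with bucket and rowId: the unique row identifier
        final int bucket;
        final long rowId;
        final long currentWriteId;   // write id of the transaction that added this event
        final R row;                 // full row for INSERT/UPDATE, null for DELETE

        AcidEvent(Action op, long origWid, int bucket, long rowId, long curWid, R row) {
            this.operation = op;
            this.originalWriteId = origWid;
            this.bucket = bucket;
            this.rowId = rowId;
            this.currentWriteId = curWid;
            this.row = row;
        }
    }

    public static void main(String[] args) {
        // An insert carries the full row; a later delete of the same row carries null.
        AcidEvent<String> insert = new AcidEvent<>(Action.INSERT, 1L, 0, 0L, 1L, "alice");
        AcidEvent<String> delete = new AcidEvent<>(Action.DELETE, 1L, 0, 0L, 2L, null);
        System.out.println(insert.row + " " + (delete.row == null)); // prints "alice true"
    }
}
```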
Modifier and Type | Class and Description
---|---
`static class` | `OrcInputFormat.ContextFactory`
`static interface` | `OrcInputFormat.FooterCache` Represents the footer cache.
`static class` | `OrcInputFormat.FooterCacheKey`
`static class` | `OrcInputFormat.NullKeyRecordReader` Return a RecordReader that is compatible with the Hive 0.12 reader, with NullWritable for the key instead of RecordIdentifier.

Nested classes/interfaces inherited from interface AcidInputFormat: `AcidInputFormat.AcidRecordReader<K,V>`, `AcidInputFormat.DeltaMetaData`, `AcidInputFormat.Options`, `AcidInputFormat.RawReader<V>`, `AcidInputFormat.RowReader<V>`

Constructor and Description
---
`OrcInputFormat()`
Modifier and Type | Method and Description
---|---
`static org.apache.orc.TypeDescription` | `convertTypeInfo(TypeInfo info)`
`protected ExternalCache.ExternalFooterCachesByConf` | `createExternalCaches()`
`static RecordReader` | `createReaderFromFile(Reader file, org.apache.hadoop.conf.Configuration conf, long offset, long length)`
`static boolean[]` | `genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included)`
`static boolean[]` | `genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included, Integer recursiveStruct)`
`static List<Integer>` | `genIncludedColumnsReverse(org.apache.orc.TypeDescription readerSchema, boolean[] included, boolean isFullColumnMatch)` Reverses genIncludedColumns; produces the table column indexes from ORC included columns.
`static org.apache.orc.TypeDescription[]` | `genIncludedTypes(org.apache.orc.TypeDescription fileSchema, List<Integer> included, Integer recursiveStruct)`
`static org.apache.orc.TypeDescription` | `getDesiredRowTypeDescr(org.apache.hadoop.conf.Configuration conf, boolean isAcidRead, int dataColumns)` Generate the desired schema for reading the file.
`AcidInputFormat.RawReader<OrcStruct>` | `getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, org.apache.hadoop.hive.common.ValidWriteIdList validWriteIdList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory)` Get a reader that returns the raw ACID events (insert, update, delete).
`AcidInputFormat.RowReader<OrcStruct>` | `getReader(org.apache.hadoop.mapred.InputSplit inputSplit, AcidInputFormat.Options options)` Get a record reader that provides the user-facing view of the data after it has been merged together.
`org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct>` | `getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit, org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.mapred.Reporter reporter)`
`static int` | `getRootColumn(boolean isOriginal)` Get the root column for the row.
`org.apache.hadoop.mapred.InputSplit[]` | `getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)`
`VectorizedSupport.Support[]` | `getSupportedFeatures()`
`BatchToRowReader<?,?>` | `getWrapper(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch> vrr, VectorizedRowBatchCtx vrbCtx, List<Integer> includedCols)`
`boolean` | `isFullAcidRead(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapred.InputSplit inputSplit)` We can derive whether a split is ACID or not from the flags encoded in OrcSplit.
`static boolean` | `isOriginal(org.apache.orc.OrcProto.Footer footer)`
`static boolean` | `isOriginal(Reader file)`
`static boolean[]` | `pickStripesViaTranslatedSarg(org.apache.hadoop.hive.ql.io.sarg.SearchArgument sarg, org.apache.orc.OrcFile.WriterVersion writerVersion, List<org.apache.orc.OrcProto.Type> types, List<org.apache.orc.StripeStatistics> stripeStats, int stripeCount)`
`static void` | `raiseAcidTablesMustBeReadWithAcidReaderException(org.apache.hadoop.conf.Configuration conf)`
`static boolean[]` | `shiftReaderIncludedForAcid(boolean[] included)`
`boolean` | `shouldSkipCombine(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf)`
`static ArrayList<org.apache.orc.TypeDescription>` | `typeDescriptionsFromHiveTypeProperty(String hiveTypeProperty, int maxColumns)` Convert a Hive type property string that contains separated type names into a list of TypeDescription objects.
`boolean` | `validateInput(org.apache.hadoop.fs.FileSystem fs, HiveConf conf, List<org.apache.hadoop.fs.FileStatus> files)` This method is used to validate the input files.
public VectorizedSupport.Support[] getSupportedFeatures()
Specified by: `getSupportedFeatures` in interface `VectorizedInputFormatInterface`

public boolean shouldSkipCombine(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) throws IOException
Specified by: `shouldSkipCombine` in interface `CombineHiveInputFormat.AvoidSplitCombination`
Throws: `IOException`
public boolean isFullAcidRead(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapred.InputSplit inputSplit)
Parameters:
- `conf`
- `inputSplit`

public static int getRootColumn(boolean isOriginal)
Parameters:
- `isOriginal` - is the file in the original format?

public static void raiseAcidTablesMustBeReadWithAcidReaderException(org.apache.hadoop.conf.Configuration conf) throws IOException
Throws: `IOException`
public static RecordReader createReaderFromFile(Reader file, org.apache.hadoop.conf.Configuration conf, long offset, long length) throws IOException
Throws: `IOException`
public static boolean isOriginal(Reader file)
public static boolean isOriginal(org.apache.orc.OrcProto.Footer footer)
public static boolean[] genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included)
public static boolean[] genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included, Integer recursiveStruct)
public static org.apache.orc.TypeDescription[] genIncludedTypes(org.apache.orc.TypeDescription fileSchema, List<Integer> included, Integer recursiveStruct)
public static List<Integer> genIncludedColumnsReverse(org.apache.orc.TypeDescription readerSchema, boolean[] included, boolean isFullColumnMatch)
Parameters:
- `readerSchema` - The ORC reader schema for the table.
- `included` - The included ORC columns.
- `isFullColumnMatch` - Whether full column match should be enforced, i.e. whether all the sub-columns of a complex type column are expected to be included or excluded together in the included array. If false, any sub-column being included for a complex type is sufficient for the entire complex column to be included in the result.

public boolean validateInput(org.apache.hadoop.fs.FileSystem fs, HiveConf conf, List<org.apache.hadoop.fs.FileStatus> files) throws IOException
Description copied from interface: `InputFormatChecker`
Specified by: `validateInput` in interface `InputFormatChecker`
Throws: `IOException`
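The genIncludedColumns/genIncludedColumnsReverse pair described above can be illustrated with a simplified, self-contained model. This sketch assumes a flat schema of primitive columns, where ORC assigns id 0 to the root struct and ids 1..n to the top-level columns; the real methods also handle nested types and the recursiveStruct/isFullColumnMatch options, so this is a conceptual round-trip only, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the included-columns mapping for a flat schema:
// ORC id 0 is the root struct, ids 1..n are the top-level table columns.
public class IncludedColumnsSketch {
    // Forward: table column indexes -> ORC included flags (root always included).
    static boolean[] genIncludedColumns(int columnCount, List<Integer> included) {
        boolean[] flags = new boolean[columnCount + 1];
        flags[0] = true;                       // the root struct
        for (int col : included) {
            flags[col + 1] = true;             // shift by one past the root
        }
        return flags;
    }

    // Reverse: ORC included flags -> table column indexes (skipping the root at id 0).
    static List<Integer> genIncludedColumnsReverse(boolean[] flags) {
        List<Integer> cols = new ArrayList<>();
        for (int id = 1; id < flags.length; id++) {
            if (flags[id]) {
                cols.add(id - 1);
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        boolean[] flags = genIncludedColumns(4, List.of(1, 3));
        System.out.println(genIncludedColumnsReverse(flags)); // prints [1, 3]
    }
}
```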
public static boolean[] shiftReaderIncludedForAcid(boolean[] included)
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException
Specified by: `getSplits` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`

public org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct> getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit, org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by: `getRecordReader` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`
public AcidInputFormat.RowReader<OrcStruct> getReader(org.apache.hadoop.mapred.InputSplit inputSplit, AcidInputFormat.Options options) throws IOException
Description copied from interface: `AcidInputFormat`
Specified by: `getReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
- `inputSplit` - the split to read
- `options` - the options to read with
Throws: `IOException`
public static boolean[] pickStripesViaTranslatedSarg(org.apache.hadoop.hive.ql.io.sarg.SearchArgument sarg, org.apache.orc.OrcFile.WriterVersion writerVersion, List<org.apache.orc.OrcProto.Type> types, List<org.apache.orc.StripeStatistics> stripeStats, int stripeCount) throws org.apache.orc.FileFormatException
Throws: `org.apache.orc.FileFormatException`
public AcidInputFormat.RawReader<OrcStruct> getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, org.apache.hadoop.hive.common.ValidWriteIdList validWriteIdList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory) throws IOException
Description copied from interface: `AcidInputFormat`
Specified by: `getRawReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
- `conf` - the configuration
- `collapseEvents` - should the ACID events be collapsed so that only the last version of the row is kept
- `bucket` - the bucket/writer ID for this split of the compaction job
- `validWriteIdList` - the list of valid write ids to use
- `baseDirectory` - the base directory to read, or the root directory for old-style files
- `deltaDirectory` - a list of delta files to include in the merge
Throws: `IOException`
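The collapseEvents behavior described above can be illustrated with a small self-contained model: collapsing keeps only the latest event per unique row key (originalWriteId, bucket, rowId). This is a conceptual sketch, not the actual Hive compaction code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual illustration of ACID event collapsing: for each unique row key
// (originalWriteId, bucket, rowId), keep only the event with the highest
// currentWriteId. Not the actual Hive implementation.
public class CollapseEventsSketch {
    record Event(long originalWriteId, int bucket, long rowId,
                 long currentWriteId, String row) {}

    static List<Event> collapse(List<Event> events) {
        Map<String, Event> latest = new LinkedHashMap<>();
        for (Event e : events) {
            String key = e.originalWriteId() + ":" + e.bucket() + ":" + e.rowId();
            Event prev = latest.get(key);
            if (prev == null || e.currentWriteId() > prev.currentWriteId()) {
                latest.put(key, e);   // newer version of the row wins
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Event> collapsed = collapse(List.of(
            new Event(1, 0, 0, 1, "v1"),    // insert
            new Event(1, 0, 0, 2, "v2"),    // update of the same row
            new Event(1, 0, 1, 1, "x")));   // a different row
        System.out.println(collapsed.size()); // prints 2
    }
}
```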
public static ArrayList<org.apache.orc.TypeDescription> typeDescriptionsFromHiveTypeProperty(String hiveTypeProperty, int maxColumns)
Parameters:
- `hiveTypeProperty` - the desired types from hive
- `maxColumns` - the maximum number of desired columns

public static org.apache.orc.TypeDescription convertTypeInfo(TypeInfo info)
public static org.apache.orc.TypeDescription getDesiredRowTypeDescr(org.apache.hadoop.conf.Configuration conf, boolean isAcidRead, int dataColumns)
Parameters:
- `conf` - the configuration
- `isAcidRead` - is this an acid format?
- `dataColumns` - the desired number of data columns for a vectorized read
Throws: `IllegalArgumentException`
protected ExternalCache.ExternalFooterCachesByConf createExternalCaches()
public BatchToRowReader<?,?> getWrapper(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch> vrr, VectorizedRowBatchCtx vrbCtx, List<Integer> includedCols)
Specified by: `getWrapper` in interface `BatchToRowInputFormat`
Copyright © 2022 The Apache Software Foundation. All rights reserved.