public class OrcInputFormat extends Object implements org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>, InputFormatChecker, VectorizedInputFormatInterface, LlapWrappableInputFormatInterface, SelfDescribingInputFormatInterface, AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>, CombineHiveInputFormat.AvoidSplitCombination, BatchToRowInputFormat
This class implements both the classic InputFormat, which stores the rows directly, and AcidInputFormat, which stores a series of events with the following schema:

    class AcidEvent<ROW> {
      enum ACTION {INSERT, UPDATE, DELETE}
      ACTION operation;
      long originalWriteId;
      int bucket;
      long rowId;
      long currentWriteId;
      ROW row;
    }

Each AcidEvent object corresponds to an update event. The originalWriteId, bucket, and rowId together uniquely identify the row. The operation and currentWriteId record the operation and the table write id, within the current transaction, that added this event. Insert and update events include the entire row, while delete events have null for row.
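The event schema above can be modeled as a small self-contained Java class. This is an illustrative sketch only (the real events are ORC structs, not this class); it shows that insert and update events carry the full row while a delete event carries null:

```java
// Illustrative model of an ACID event; not the actual Hive class.
public class AcidEventSketch {
    enum Action { INSERT, UPDATE, DELETE }

    static final class AcidEvent<R> {
        final Action operation;
        final long originalWriteId;  // with bucket and rowId: the unique row identifier
        final int bucket;
        final long rowId;
        final long currentWriteId;   // write id of the transaction that added this event
        final R row;                 // full row for INSERT/UPDATE, null for DELETE

        AcidEvent(Action op, long origWid, int bucket, long rowId, long curWid, R row) {
            this.operation = op;
            this.originalWriteId = origWid;
            this.bucket = bucket;
            this.rowId = rowId;
            this.currentWriteId = curWid;
            this.row = row;
        }
    }

    public static void main(String[] args) {
        // An insert carries the full row; a later delete of the same row carries null.
        AcidEvent<String> insert = new AcidEvent<>(Action.INSERT, 1L, 0, 0L, 1L, "alice");
        AcidEvent<String> delete = new AcidEvent<>(Action.DELETE, 1L, 0, 0L, 2L, null);
        System.out.println(insert.row + " " + (delete.row == null)); // prints "alice true"
    }
}
```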
Modifier and Type | Class and Description
---|---
`static class` | `OrcInputFormat.ContextFactory`
`static interface` | `OrcInputFormat.FooterCache` Represents the footer cache.
`static class` | `OrcInputFormat.FooterCacheKey`
`static class` | `OrcInputFormat.NullKeyRecordReader` Return a RecordReader that is compatible with the Hive 0.12 reader, with NullWritable for the key instead of RecordIdentifier.

Nested classes/interfaces inherited from interface AcidInputFormat: `AcidInputFormat.AcidRecordReader<K,V>`, `AcidInputFormat.DeltaMetaData`, `AcidInputFormat.Options`, `AcidInputFormat.RawReader<V>`, `AcidInputFormat.RowReader<V>`

Constructor and Description
---
`OrcInputFormat()`
Modifier and Type | Method and Description
---|---
`static org.apache.orc.TypeDescription` | `convertTypeInfo(TypeInfo info)`
`protected ExternalCache.ExternalFooterCachesByConf` | `createExternalCaches()`
`static RecordReader` | `createReaderFromFile(Reader file, org.apache.hadoop.conf.Configuration conf, long offset, long length)`
`static boolean[]` | `genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included)`
`static boolean[]` | `genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included, Integer recursiveStruct)`
`static List<Integer>` | `genIncludedColumnsReverse(org.apache.orc.TypeDescription readerSchema, boolean[] included, boolean isFullColumnMatch)` Reverses genIncludedColumns; produces the table column indexes from ORC included columns.
`static org.apache.orc.TypeDescription[]` | `genIncludedTypes(org.apache.orc.TypeDescription fileSchema, List<Integer> included, Integer recursiveStruct)`
`static org.apache.orc.TypeDescription` | `getDesiredRowTypeDescr(org.apache.hadoop.conf.Configuration conf, boolean isAcidRead, int dataColumns)` Generate the desired schema for reading the file.
`AcidInputFormat.RawReader<OrcStruct>` | `getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, org.apache.hadoop.hive.common.ValidWriteIdList validWriteIdList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory)` Get a reader that returns the raw ACID events (insert, update, delete).
`AcidInputFormat.RowReader<OrcStruct>` | `getReader(org.apache.hadoop.mapred.InputSplit inputSplit, AcidInputFormat.Options options)` Get a record reader that provides the user-facing view of the data after it has been merged together.
`org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct>` | `getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit, org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.mapred.Reporter reporter)`
`static int` | `getRootColumn(boolean isOriginal)` Get the root column for the row.
`org.apache.hadoop.mapred.InputSplit[]` | `getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)`
`VectorizedSupport.Support[]` | `getSupportedFeatures()`
`BatchToRowReader<?,?>` | `getWrapper(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch> vrr, VectorizedRowBatchCtx vrbCtx, List<Integer> includedCols)`
`boolean` | `isFullAcidRead(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapred.InputSplit inputSplit)` We can derive whether a split is ACID or not from the flags encoded in OrcSplit.
`static boolean` | `isOriginal(org.apache.orc.OrcProto.Footer footer)`
`static boolean` | `isOriginal(Reader file)`
`static boolean[]` | `pickStripesViaTranslatedSarg(org.apache.hadoop.hive.ql.io.sarg.SearchArgument sarg, org.apache.orc.OrcFile.WriterVersion writerVersion, List<org.apache.orc.OrcProto.Type> types, List<org.apache.orc.StripeStatistics> stripeStats, int stripeCount)`
`static void` | `raiseAcidTablesMustBeReadWithAcidReaderException(org.apache.hadoop.conf.Configuration conf)`
`static boolean[]` | `shiftReaderIncludedForAcid(boolean[] included)`
`boolean` | `shouldSkipCombine(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf)`
`static ArrayList<org.apache.orc.TypeDescription>` | `typeDescriptionsFromHiveTypeProperty(String hiveTypeProperty, int maxColumns)` Convert a Hive type property string that contains separated type names into a list of TypeDescription objects.
`boolean` | `validateInput(org.apache.hadoop.fs.FileSystem fs, HiveConf conf, List<org.apache.hadoop.fs.FileStatus> files)` This method is used to validate the input files.
public VectorizedSupport.Support[] getSupportedFeatures()
Specified by: `getSupportedFeatures` in interface `VectorizedInputFormatInterface`

public boolean shouldSkipCombine(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) throws IOException
Specified by: `shouldSkipCombine` in interface `CombineHiveInputFormat.AvoidSplitCombination`
Throws: `IOException`
public boolean isFullAcidRead(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapred.InputSplit inputSplit)
Parameters:
- `conf`
- `inputSplit`

public static int getRootColumn(boolean isOriginal)
Parameters:
- `isOriginal` - is the file in the original format?

public static void raiseAcidTablesMustBeReadWithAcidReaderException(org.apache.hadoop.conf.Configuration conf) throws IOException
Throws: `IOException`
public static RecordReader createReaderFromFile(Reader file, org.apache.hadoop.conf.Configuration conf, long offset, long length) throws IOException
Throws: `IOException`
public static boolean isOriginal(Reader file)
public static boolean isOriginal(org.apache.orc.OrcProto.Footer footer)
public static boolean[] genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included)
public static boolean[] genIncludedColumns(org.apache.orc.TypeDescription readerSchema, List<Integer> included, Integer recursiveStruct)
public static org.apache.orc.TypeDescription[] genIncludedTypes(org.apache.orc.TypeDescription fileSchema, List<Integer> included, Integer recursiveStruct)
public static List<Integer> genIncludedColumnsReverse(org.apache.orc.TypeDescription readerSchema, boolean[] included, boolean isFullColumnMatch)
Parameters:
- `readerSchema` - The ORC reader schema for the table.
- `included` - The included ORC columns.
- `isFullColumnMatch` - Whether full column match should be enforced, i.e. whether all the sub-columns of a complex type column are expected to be included or excluded together in the included array. If false, any sub-column being included for a complex type is sufficient for the entire complex column to be included in the result.

public boolean validateInput(org.apache.hadoop.fs.FileSystem fs, HiveConf conf, List<org.apache.hadoop.fs.FileStatus> files) throws IOException
Description copied from interface: `InputFormatChecker`
Specified by: `validateInput` in interface `InputFormatChecker`
Throws: `IOException`
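The genIncludedColumns/genIncludedColumnsReverse pair described above can be illustrated with a simplified, self-contained model. This sketch assumes a flat schema of primitive columns, where ORC assigns id 0 to the root struct and ids 1..n to the top-level columns; the real methods also handle nested types and the recursiveStruct/isFullColumnMatch options, so this is a conceptual round-trip only, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the included-columns mapping for a flat schema:
// ORC id 0 is the root struct, ids 1..n are the top-level table columns.
public class IncludedColumnsSketch {
    // Forward: table column indexes -> ORC included flags (root always included).
    static boolean[] genIncludedColumns(int columnCount, List<Integer> included) {
        boolean[] flags = new boolean[columnCount + 1];
        flags[0] = true;                       // the root struct
        for (int col : included) {
            flags[col + 1] = true;             // shift by one past the root
        }
        return flags;
    }

    // Reverse: ORC included flags -> table column indexes (skipping the root at id 0).
    static List<Integer> genIncludedColumnsReverse(boolean[] flags) {
        List<Integer> cols = new ArrayList<>();
        for (int id = 1; id < flags.length; id++) {
            if (flags[id]) {
                cols.add(id - 1);
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        boolean[] flags = genIncludedColumns(4, List.of(1, 3));
        System.out.println(genIncludedColumnsReverse(flags)); // prints [1, 3]
    }
}
```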
public static boolean[] shiftReaderIncludedForAcid(boolean[] included)
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException
Specified by: `getSplits` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`

public org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct> getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit, org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by: `getRecordReader` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`
public AcidInputFormat.RowReader<OrcStruct> getReader(org.apache.hadoop.mapred.InputSplit inputSplit, AcidInputFormat.Options options) throws IOException
Description copied from interface: `AcidInputFormat`
Specified by: `getReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
- `inputSplit` - the split to read
- `options` - the options to read with
Throws: `IOException`
public static boolean[] pickStripesViaTranslatedSarg(org.apache.hadoop.hive.ql.io.sarg.SearchArgument sarg, org.apache.orc.OrcFile.WriterVersion writerVersion, List<org.apache.orc.OrcProto.Type> types, List<org.apache.orc.StripeStatistics> stripeStats, int stripeCount) throws org.apache.orc.FileFormatException
Throws: `org.apache.orc.FileFormatException`
public AcidInputFormat.RawReader<OrcStruct> getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, org.apache.hadoop.hive.common.ValidWriteIdList validWriteIdList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory) throws IOException
Description copied from interface: `AcidInputFormat`
Specified by: `getRawReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
- `conf` - the configuration
- `collapseEvents` - should the ACID events be collapsed so that only the last version of the row is kept
- `bucket` - the bucket/writer ID for this split of the compaction job
- `validWriteIdList` - the list of valid write ids to use
- `baseDirectory` - the base directory to read, or the root directory for old-style files
- `deltaDirectory` - a list of delta files to include in the merge
Throws: `IOException`
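The collapseEvents behavior described above can be illustrated with a small self-contained model: collapsing keeps only the latest event per unique row key (originalWriteId, bucket, rowId). This is a conceptual sketch, not the actual Hive compaction code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual illustration of ACID event collapsing: for each unique row key
// (originalWriteId, bucket, rowId), keep only the event with the highest
// currentWriteId. Not the actual Hive implementation.
public class CollapseEventsSketch {
    record Event(long originalWriteId, int bucket, long rowId,
                 long currentWriteId, String row) {}

    static List<Event> collapse(List<Event> events) {
        Map<String, Event> latest = new LinkedHashMap<>();
        for (Event e : events) {
            String key = e.originalWriteId() + ":" + e.bucket() + ":" + e.rowId();
            Event prev = latest.get(key);
            if (prev == null || e.currentWriteId() > prev.currentWriteId()) {
                latest.put(key, e);   // newer version of the row wins
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Event> collapsed = collapse(List.of(
            new Event(1, 0, 0, 1, "v1"),    // insert
            new Event(1, 0, 0, 2, "v2"),    // update of the same row
            new Event(1, 0, 1, 1, "x")));   // a different row
        System.out.println(collapsed.size()); // prints 2
    }
}
```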
public static ArrayList<org.apache.orc.TypeDescription> typeDescriptionsFromHiveTypeProperty(String hiveTypeProperty, int maxColumns)
Parameters:
- `hiveTypeProperty` - the desired types from hive
- `maxColumns` - the maximum number of desired columns

public static org.apache.orc.TypeDescription convertTypeInfo(TypeInfo info)
public static org.apache.orc.TypeDescription getDesiredRowTypeDescr(org.apache.hadoop.conf.Configuration conf, boolean isAcidRead, int dataColumns)
Parameters:
- `conf` - the configuration
- `isAcidRead` - is this an acid format?
- `dataColumns` - the desired number of data columns for a vectorized read
Throws: `IllegalArgumentException`
protected ExternalCache.ExternalFooterCachesByConf createExternalCaches()
public BatchToRowReader<?,?> getWrapper(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch> vrr, VectorizedRowBatchCtx vrbCtx, List<Integer> includedCols)
Specified by: `getWrapper` in interface `BatchToRowInputFormat`
Copyright © 2022 The Apache Software Foundation. All rights reserved.