public class RCFile extends Object
RCFile
s, short of Record Columnar File, are flat files
consisting of binary key/value pairs, which shares much similarity with
SequenceFile
.
RCFile stores columns of a table in a record columnar way. It first
partitions rows horizontally into row splits. and then it vertically
partitions each row split in a columnar way. RCFile first stores the meta
data of a row split, as the key part of a record, and all the data of a row
split as the value part. When writing, RCFile.Writer first holds records'
value bytes in memory, and determines a row split if the raw bytes size of
buffered records overflow a given parameterWriter.columnsBufferSize,
which can be set like: conf.setInt(COLUMNS_BUFFER_SIZE_CONF_STR,
4 * 1024 * 1024)
.
RCFile
provides RCFile.Writer
, RCFile.Reader
and classes for
writing, reading respectively.
RCFile stores columns of a table in a record columnar way. It first partitions rows horizontally into row splits. and then it vertically partitions each row split in a columnar way. RCFile first stores the meta data of a row split, as the key part of a record, and all the data of a row split as the value part.
RCFile compresses values in a more fine-grained manner then record level
compression. However, It currently does not support compress the key part
yet. The actual compression algorithm used to compress key and/or values can
be specified by using the appropriate CompressionCodec
.
The RCFile.Reader
is used to read and explain the bytes of RCFile.
CompressionCodec
class which is used
for compression of keys and/or values (if compression is enabled).SequenceFile.Metadata
for this file.
The following is a pseudo-BNF grammar for RCFile. Comments are prefixed
with dashes:
rcfile ::=
<file-header>
<rcfile-rowgroup>+
file-header ::=
<file-version-header>
<file-key-class-name> (only exists if version is seq6)
<file-value-class-name> (only exists if version is seq6)
<file-is-compressed>
<file-is-block-compressed> (only exists if version is seq6)
[<file-compression-codec-class>]
<file-header-metadata>
<file-sync-field>
-- The normative RCFile implementation included with Hive is actually
-- based on a modified version of Hadoop's SequenceFile code. Some
-- things which should have been modified were not, including the code
-- that writes out the file version header. Consequently, RCFile and
-- SequenceFile originally shared the same version header. A newer
-- release has created a unique version string.
file-version-header ::= Byte[4] {'S', 'E', 'Q', 6}
| Byte[4] {'R', 'C', 'F', 1}
-- The name of the Java class responsible for reading the key buffer
-- component of the rowgroup.
file-key-class-name ::=
Text {"org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer"}
-- The name of the Java class responsible for reading the value buffer
-- component of the rowgroup.
file-value-class-name ::=
Text {"org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer"}
-- Boolean variable indicating whether or not the file uses compression
-- for the key and column buffer sections.
file-is-compressed ::= Byte[1]
-- A boolean field indicating whether or not the file is block compressed.
-- This field is *always* false. According to comments in the original
-- RCFile implementation this field was retained for backwards
-- compatability with the SequenceFile format.
file-is-block-compressed ::= Byte[1] {false}
-- The Java class name of the compression codec iff <file-is-compressed>
-- is true. The named class must implement
-- org.apache.hadoop.io.compress.CompressionCodec.
-- The expected value is org.apache.hadoop.io.compress.GzipCodec.
file-compression-codec-class ::= Text
-- A collection of key-value pairs defining metadata values for the
-- file. The Map is serialized using standard JDK serialization, i.e.
-- an Int corresponding to the number of key-value pairs, followed by
-- Text key and value pairs. The following metadata properties are
-- mandatory for all RCFiles:
--
-- hive.io.rcfile.column.number: the number of columns in the RCFile
file-header-metadata ::= Map<Text, Text>
-- A 16 byte marker that is generated by the writer. This marker appears
-- at regular intervals at the beginning of rowgroup-headers, and is
-- intended to enable readers to skip over corrupted rowgroups.
file-sync-hash ::= Byte[16]
-- Each row group is split into three sections: a header, a set of
-- key buffers, and a set of column buffers. The header section includes
-- an optional sync hash, information about the size of the row group, and
-- the total number of rows in the row group. Each key buffer
-- consists of run-length encoding data which is used to decode
-- the length and offsets of individual fields in the corresponding column
-- buffer.
rcfile-rowgroup ::=
<rowgroup-header>
<rowgroup-key-data>
<rowgroup-column-buffers>
rowgroup-header ::=
[<rowgroup-sync-marker>, <rowgroup-sync-hash>]
<rowgroup-record-length>
<rowgroup-key-length>
<rowgroup-compressed-key-length>
-- rowgroup-key-data is compressed if the column data is compressed.
rowgroup-key-data ::=
<rowgroup-num-rows>
<rowgroup-key-buffers>
-- An integer (always -1) signaling the beginning of a sync-hash
-- field.
rowgroup-sync-marker ::= Int
-- A 16 byte sync field. This must match the <file-sync-hash> value read
-- in the file header.
rowgroup-sync-hash ::= Byte[16]
-- The record-length is the sum of the number of bytes used to store
-- the key and column parts, i.e. it is the total length of the current
-- rowgroup.
rowgroup-record-length ::= Int
-- Total length in bytes of the rowgroup's key sections.
rowgroup-key-length ::= Int
-- Total compressed length in bytes of the rowgroup's key sections.
rowgroup-compressed-key-length ::= Int
-- Number of rows in the current rowgroup.
rowgroup-num-rows ::= VInt
-- One or more column key buffers corresponding to each column
-- in the RCFile.
rowgroup-key-buffers ::= <rowgroup-key-buffer>+
-- Data in each column buffer is stored using a run-length
-- encoding scheme that is intended to reduce the cost of
-- repeated column field values. This mechanism is described
-- in more detail in the following entries.
rowgroup-key-buffer ::=
<column-buffer-length>
<column-buffer-uncompressed-length>
<column-key-buffer-length>
<column-key-buffer>
-- The serialized length on disk of the corresponding column buffer.
column-buffer-length ::= VInt
-- The uncompressed length of the corresponding column buffer. This
-- is equivalent to column-buffer-length if the RCFile is not compressed.
column-buffer-uncompressed-length ::= VInt
-- The length in bytes of the current column key buffer
column-key-buffer-length ::= VInt
-- The column-key-buffer contains a sequence of serialized VInt values
-- corresponding to the byte lengths of the serialized column fields
-- in the corresponding rowgroup-column-buffer. For example, consider
-- an integer column that contains the consecutive values 1, 2, 3, 44.
-- The RCFile format stores these values as strings in the column buffer,
-- e.g. "12344". The length of each column field is recorded in
-- the column-key-buffer as a sequence of VInts: 1,1,1,2. However,
-- if the same length occurs repeatedly, then we replace repeated
-- run lengths with the complement (i.e. negative) of the number of
-- repetitions, so 1,1,1,2 becomes 1,~2,2.
column-key-buffer ::= Byte[column-key-buffer-length]
rowgroup-column-buffers ::= <rowgroup-value-buffer>+
-- RCFile stores all column data as strings regardless of the
-- underlying column type. The strings are neither length-prefixed or
-- null-terminated, and decoding them into individual fields requires
-- the use of the run-length information contained in the corresponding
-- column-key-buffer.
rowgroup-column-buffer ::= Byte[column-buffer-length]
Byte ::= An eight-bit byte
VInt ::= Variable length integer. The high-order bit of each byte
indicates whether more bytes remain to be read. The low-order seven
bits are appended as increasingly more significant bits in the
resulting integer value.
Int ::= A four-byte integer in big-endian format.
Text ::= VInt, Chars (Length prefixed UTF-8 characters)
Modifier and Type | Class and Description |
---|---|
static class |
RCFile.KeyBuffer
KeyBuffer is the key of each record in RCFile.
|
static class |
RCFile.Reader
Read KeyBuffer/ValueBuffer pairs from a RCFile.
|
static class |
RCFile.ValueBuffer
ValueBuffer is the value of each record in RCFile.
|
static class |
RCFile.Writer
Write KeyBuffer/ValueBuffer pairs to a RCFile.
|
Modifier and Type | Field and Description |
---|---|
static String |
BLOCK_MISSING_MESSAGE |
static String |
COLUMN_NUMBER_CONF_STR |
static String |
COLUMN_NUMBER_METADATA_STR |
static String |
RECORD_INTERVAL_CONF_STR |
static int |
SYNC_INTERVAL
The number of bytes between sync points.
|
static String |
TOLERATE_CORRUPTIONS_CONF_STR |
Constructor and Description |
---|
RCFile() |
Modifier and Type | Method and Description |
---|---|
static org.apache.hadoop.io.SequenceFile.Metadata |
createMetadata(org.apache.hadoop.io.Text... values)
Create a metadata object with alternating key-value pairs.
|
public static final String COLUMN_NUMBER_METADATA_STR
public static final String RECORD_INTERVAL_CONF_STR
public static final String COLUMN_NUMBER_CONF_STR
public static final String TOLERATE_CORRUPTIONS_CONF_STR
public static final String BLOCK_MISSING_MESSAGE
public static final int SYNC_INTERVAL
Copyright © 2017 The Apache Software Foundation. All rights reserved.