Version 0.1
DRAFT
This document is the authoritative specification of a file format. Its intent is to permit compatible, independent implementations that read and/or write files in this format.
A data set is often described as a table composed of rows and columns. Each record in the data set is a row, with each field of the record occupying a different column. Writing records to a file one by one as they are created results in a row-major format, like Hadoop’s SequenceFile or Avro data files.
In many cases higher query performance may be achieved if the data is instead organized in a column-major format, where multiple values of a given column are stored adjacently. This document defines such a column-major file format for datasets.
To permit scalable, distributed query evaluation, datasets are partitioned into row groups, each containing a distinct subset of the rows. Each row group is organized in column-major order, while the row groups together form a row-major partitioning of the entire dataset.
The format is meant to satisfy the following goals:
To meet these goals we propose the following design.
The use of a single block per file achieves the same effect as the custom block placement policy described in the CIF paper, while still permitting HDFS rebalancing and not increasing the number of files in the namespace.
This section formally describes the proposed column file format.
We assume a simple data model, where a record is a set of named fields, and the value of each field is a sequence of untyped bytes. A type system may be layered on top of this, as specified in the Type Mapping section below.
We define the following primitive value types:
| decimal value | hex bytes |
|---------------|-----------|
| 0             | 00        |
| -1            | 01        |
| 1             | 02        |
| ...           | ...       |
| -64           | 7f        |
| 64            | 80 01     |
| ...           | ...       |
For example, the three-character string "foo" would be encoded as the long value 3 (hex 06) followed by the UTF-8 encoding of 'f', 'o', and 'o' (the hex bytes 66 6f 6f): 06 66 6f 6f
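As an illustration, the following Java sketch writes a long using the zig-zag, variable-length representation implied by the table above (seven bits per byte, least-significant group first, with the high bit of each byte marking that more bytes follow), and writes a string as a length-prefixed sequence of UTF-8 bytes. The class and method names are hypothetical, not part of any implementation:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PrimitiveEncoding {

  /** Zig-zag encode a long (0->0, -1->1, 1->2, -2->3, ...), then write it as a
   *  variable-length integer, seven bits per byte, least-significant group first;
   *  the high bit of each byte indicates that more bytes follow. */
  public static void writeLong(long n, ByteArrayOutputStream out) {
    long z = (n << 1) ^ (n >> 63);            // zig-zag encoding
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80));   // more bytes follow
      z >>>= 7;
    }
    out.write((int) z);                       // final byte, high bit clear
  }

  /** A string is written as a long length prefix followed by its UTF-8 bytes. */
  public static void writeString(String s, ByteArrayOutputStream out) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeLong(utf8.length, out);
    out.write(utf8, 0, utf8.length);
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeString("foo", out);                  // expected bytes: 06 66 6f 6f
    for (byte b : out.toByteArray()) {
      System.out.printf("%02x ", b);
    }
  }
}
```

Running it prints 06 66 6f 6f, matching the example above; writeLong(-64, out) and writeLong(64, out) likewise produce 7f and 80 01, matching the table.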
The following type names are used to describe column values:
Type names are represented as strings (UTF-8 encoded, length-prefixed).
Metadata consists of:
All metadata properties that start with "trevni." are reserved.
The following file metadata properties are defined:
The following column metadata properties are defined:
For example, consider the following row, as JSON, where all values are of primitive types, but one field has multiple values.
{"id"=566, "date"=23423234234 "from"="foo@bar.com", "to"=["bar@baz.com", "bang@foo.com"], "content"="Hi!"}
The columns for this might be specified as:
name=id type=int
name=date type=long
name=from type=string
name=to type=string array=true
name=content type=string
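Because each column specification is just a set of name/value properties, the list above can be assembled programmatically. The following minimal Java sketch uses plain string maps; the class and helper names are hypothetical and not tied to any implementation's API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MailColumns {

  /** One column specification as simple name/value properties. */
  static Map<String, String> column(String name, String type) {
    Map<String, String> spec = new LinkedHashMap<>();
    spec.put("name", name);
    spec.put("type", type);
    return spec;
  }

  public static void main(String[] args) {
    List<Map<String, String>> columns = new ArrayList<>();
    columns.add(column("id", "int"));
    columns.add(column("date", "long"));
    columns.add(column("from", "string"));

    Map<String, String> to = column("to", "string");
    to.put("array", "true");                  // "to" holds multiple values per row
    columns.add(to);

    columns.add(column("content", "string"));
    columns.forEach(System.out::println);     // {name=id, type=int}, ...
  }
}
```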
If a row contains an array of records, e.g. "received" in the following:
{"id"=566, "date"=23423234234 "from"="foo@bar.com", "to"=["bar@baz.com", "bang@foo.com"], "content"="Hi!" "received"=[{"date"=234234234234, "host"="192.168.0.0.1"}, {"date"=234234545645, "host"="192.168.0.0.2"}] }
Then one can define a parent column followed by a column for each field in the record, adding the following columns:
name=received type=null array=true
name=date type=long parent=received
name=host type=string parent=received
If an array value itself contains an array, e.g. the "sigs" below:
{"id"=566, "date"=23423234234 "from"="foo@bar.com", "to"=["bar@baz.com", "bang@foo.com"], "content"="Hi!" "received"=[{"date"=234234234234, "host"="192.168.0.0.1", "sigs"=[{"algo"="weak", "value"="0af345de"}]}, {"date"=234234545645, "host"="192.168.0.0.2", "sigs"=[]}] }
Then a parent column may be defined that itself has a parent column:
name=sigs type=null array=true parent=received
name=algo type=string parent=sigs
name=value type=string parent=sigs
No block metadata properties are currently defined.
A file consists of:
A file header consists of:
A column consists of:
A block descriptor consists of:
A block consists of:
We define a standard mapping for how types defined in various serialization systems are represented in a column file. Records from these systems are shredded into columns. When records are nested, a depth-first recursive walk can assign a separate column for each primitive value.
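As an illustration of such a walk, the hypothetical Java sketch below shreds a simplified version of the mail record from the earlier examples (represented with plain maps and lists) into per-column value lists. It omits the array-length and parent-column bookkeeping that array and nested columns require; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Shredder {

  // One list of values per column, keyed by column name.
  final Map<String, List<Object>> columns = new LinkedHashMap<>();

  /** Shred one row: walk its fields depth-first. */
  void shredRow(Map<String, Object> row) {
    row.forEach(this::shred);
  }

  /** Depth-first walk: primitives are appended to the column named after their
   *  field; arrays contribute one value per element; nested records recurse so
   *  that each of their primitive fields lands in its own column. */
  @SuppressWarnings("unchecked")
  private void shred(String name, Object value) {
    if (value instanceof Map) {                       // nested record
      ((Map<String, Object>) value).forEach(this::shred);
    } else if (value instanceof List) {               // array field
      for (Object element : (List<Object>) value) {
        shred(name, element);
      }
    } else {                                          // primitive value
      columns.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> received = new LinkedHashMap<>();
    received.put("date", 234234234234L);
    received.put("host", "192.168.0.0.1");

    Map<String, Object> row = new LinkedHashMap<>();
    row.put("id", 566);
    row.put("from", "foo@bar.com");
    row.put("to", List.of("bar@baz.com", "bang@foo.com"));
    row.put("received", List.of(received));

    Shredder s = new Shredder();
    s.shredRow(row);
    s.columns.forEach((col, vals) -> System.out.println(col + " -> " + vals));
  }
}
```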
Some possible techniques for writing column files include:
(1) is the simplest to implement and most implementations should start with it.
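One simple approach, for instance, is to buffer each column's encoded values in memory and write them out contiguously when the file is closed. The hypothetical Java sketch below illustrates only that buffering idea; it omits the file header, metadata, and block structure defined above:

```java
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class BufferingColumnWriter {

  // One in-memory buffer per column; encoded values are appended row by row.
  private final Map<String, ByteArrayOutputStream> buffers = new LinkedHashMap<>();

  void writeValue(String column, byte[] encodedValue) {
    buffers.computeIfAbsent(column, k -> new ByteArrayOutputStream())
           .write(encodedValue, 0, encodedValue.length);
  }

  /** On close, each column's buffered bytes are written out contiguously,
   *  giving a column-major layout without temporary files.  The file header,
   *  metadata, and block descriptors defined above are omitted from this sketch. */
  void close(FileOutputStream out) throws IOException {
    for (ByteArrayOutputStream column : buffers.values()) {
      column.writeTo(out);
    }
    out.close();
  }

  public static void main(String[] args) throws IOException {
    BufferingColumnWriter writer = new BufferingColumnWriter();
    writer.writeValue("id", new byte[] {0x06});                        // a zig-zag encoded long
    writer.writeValue("content", new byte[] {0x06, 0x48, 0x69, 0x21}); // "Hi!", length-prefixed
    writer.close(new FileOutputStream("columns.bin"));
  }
}
```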
CIF: "Column-Oriented Storage Techniques for MapReduce", Floratou, Patel, Shekita, & Tata, VLDB 2011.
DREMEL: "Dremel: Interactive Analysis of Web-Scale Datasets", Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010.