~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
  ---
  Trevni: A Column File Format
  ---

Trevni: A Column File Format

  Version 0.1

  DRAFT

  This document is the authoritative specification of a file format.
  Its intent is to permit compatible, independent implementations that
  read and/or write files in this format.

Introduction

  Data sets are often described as a <table> composed of <rows> and
  <columns>.  Each record in the dataset is considered a row, with
  each field of the record occupying a different column.  Writing
  records to a file one-by-one as they are created results in a
  <row-major> format, like Hadoop’s SequenceFile or Avro data files.

  In many cases higher query performance may be achieved if the data
  is instead organized in a <column-major> format, where multiple
  values of a given column are stored adjacently.  This document
  defines such a column-major file format for datasets.

  To permit scalable, distributed query evaluation, datasets are
  partitioned into row groups, containing distinct collections of
  rows.  Each row group is organized in column-major order, while row
  groups form a row-major partitioning of the entire dataset.

Rationale

* Goals

  The format is meant satisfy the following goals:

  [[1]] Maximize the size of row groups.  Disc drives are used most
  efficiently when sequentially accessing data.  Consider a drive that
  takes 10ms to seek and transfers at 100MB/second.  If a 10-column
  dataset whose values are all the same size is split into 10MB row
  groups, then accessing a single column will require a sequence of
  seek+1MB reads, for a cost of 20ms/MB processed.  If the same
  dataset is split into 100MB row groups then this drops to 11ms/MB
  processed.  This effect is exaggerated for datasets with larger
  numbers of columns and with columns whose values are smaller than
  average.  So we’d prefer row groups that are 100MB or greater.

  [[1]] Permit random access within a row group.  Some queries will
  first examine one column, and, only when certain relatively rare
  criteria are met, examine other columns.  Rather than iterating
  through selected columns of the row-group in parallel, one might
  iterate through one column and randomly access another.  This is
  called support for WHERE clauses, after the SQL operator of that
  name.

  [[1]] Minimize the number of files per dataset.  HDFS is a primary
  intended deployment platform for these files.  The HDFS Namenode
  requires memory for each file in the filesystem, thus for a format
  to be HDFS-friendly it should strive to require the minimum number
  of distinct files.

  [[1]] Support co-location of columns within row-groups.  Row groups
  are the unit of parallel operation on a column dataset.  For
  efficient file i/o, the entirety of a row-group should ideally
  reside on the host that is evaluating the query in order to avoid
  network latencies and bottlenecks.

  [[1]] Data integrity.  The format should permit applications to
  detect data corruption.  Many file systems may prevent corruption,
  but files may be moved between filesystems and be subject to
  corruption at points in that process.  It is best if the data in a
  file can be validated independently.

  [[1]] Extensibility.  The format should permit applications to store
  additional annotations about a datasets in the files, such as type
  information, origin, etc.  Some environments may have metadata
  stores for such information, but not all do, and files might be
  moved among systems with different metadata systems.  The ability to
  keep such information within the file simplifies the coordination of
  such information.

  [[1]] Minimal overhead.  The column format should not make datasets
  appreciably larger.  Storage is a primary cost and a choice to use
  this format should not require additional storage.

  [[1]] Primary format.  The column format should be usable as a
  primary format for datasets, not as an auxiliary, accelerated
  format. Applications that process a dataset in row-major order
  should be able to easily consume column files and applications that
  produce datasets in row-major order should be able to easily
  generate column files.

* Design

  To meet these goals we propose the following design.

  [[1]] Each row group is a separate file.  All values of a column in
  a file are written contiguously.  This maximizes the row group size,
  optimizing performance when querying few and small columns.

  [[1]] Each file occupies a single HDFS block.  A larger than normal
  block size may be specified, e.g., ~1GB instead of the typical
  ~100MB.  This guarantees co-location and eliminates network use when
  query processing can be co-located with the file.  This also
  moderates the memory impact on the HDFS Namenode since no small
  files are written.

  [[1]] Each column in a file is written as a sequence of ~64kB
  compressed blocks.  The sequence is prefixed by a table describing
  all of the blocks in the column to permit random access within the
  column.

  [[1]] Application-specific metadata may be added at the file,
  column, and block levels.

  [[1]] Checksums are included with each block, providing data integrity.

* Discussion

  The use of a single block per file achieves the same effect as the
  custom block placement policy described in the {{CIF}} paper,
  but while still permitting HDFS rebalancing and not increasing the
  number of files in the namespace.

Format Specification

  This section formally describes the proposed column file format.

* Data Model

  We assume a simple data model, where a record is a set of named
  fields, and the value of each field is a sequence of untyped bytes.
  A type system may be layered on top of this, as specified in the
  Type Mapping section below.

* Primitive Values

  We define the following primitive value types:

  * Signed 64-bit <<long>> values are written using a variable-length
zig-zag coding, where the high-order bit in each byte determines
whether subsequent bytes are present.  For example:

*--------------*------*
 decimal value | hex bytes
*--------------*------*
0              | 00
*--------------*------*
-1             | 01
*--------------*------*
1              | 02
*--------------*------*
...             
*--------------*------*
-64            | 7f
*--------------*------*
64             | 80 01
*--------------*------*
...             
*--------------*------*

  * <<bytes>> are encoded as a <long> followed by that many bytes of data.

  * a <<string>> is encoded as a <long> followed by that many bytes of
    UTF-8 encoded character data.

  For example, the three-character string "foo" would be encoded as
  the <long> value 3 (encoded as hex 06) followed by the UTF-8
  encoding of 'f', 'o', and 'o' (the hex bytes 66 6f 6f): 06 66 6f 6f

* Type Names

  The following type names are used to describe column values:

  * <<null>>, requires zero bytes.  Sometimes used in array columns.

  * <<boolean>>, one bit, packed into bytes, little-endian;

  * <<int>>, like <long>, but restricted to 32-bit signed values

  * <<long>> 64-bit signed values, represented as above

  * <<fixed32>> 32-bit values stored as four bytes, little-endian.

  * <<fixed64>> 64-bit values stored as eight bytes, little-endian.

  * <<float>> 32-bit IEEE floating point value, little-endian

  * <<double>> 64-bit IEEE floating point value, little-endian

  * <<string>> as above

  * <<bytes>> as above, may be used to encapsulate more complex objects

  []

  Type names are represented as <strings> (UTF-8 encoded, length-prefixed).

* Metadata

  <<Metadata>> consists of:

  * A <long> indicating the number of metadata key/value pairs.

  * For each pair, a <string> key and <bytes> value.

  []

  All metadata properties that start with "trevni." are reserved.

** File Metadata

  The following file metadata properties are defined:

  * <<trevni.codec>> the name of the default compression codec used to
    compress blocks, as a <string>. Implementations are required to
    support the "null" codec.  Optional.  If absent, it is assumed to
    be "null".  Codecs are described in more detail below.

  * <<trevni.checksum>> the name of the checksum algorithm used in this
    file, as a <string>.  Implementations are required to support the
    "crc-32” checksum.  Optional.  If absent, it is assumed to be
    "null".  Checksums are described in more detail below.

  []

** Column Metadata

  The following column metadata properties are defined:

  * <<trevni.codec>> the name of the compression codec used to compress
    the blocks of this column, as a <string>. Implementations are
    required to support the "null" codec.  Optional.  If absent, it is
    assumed to be "null".  Codecs are described in more detail below.

  * <<trevni.name>> the name of the column, as a <string>.  Required.

  * <<trevni.type>> the type of data in the column.  One of the type names
    above.  Required.

  * <<trevni.values>> if present, indicates that the initial value of each
    block in this column will be stored in the block’s descriptor.
    Not permitted for array columns or columns that specify a parent.

  * <<trevni.array>> if present, indicates that each row in this column
    contains a sequence of values of the named type rather than just a
    single value.  An integer length precedes each sequence of values
    indicating the count of values in the sequence.  If the length is
    negative then it indicates a sequence of zero or one lengths, where -1
    indicates two zeros, -2 two ones, -3 three zeros, -4 three ones, etc.

  * <<trevni.parent>> if present, the name of an <array> column whose
    lengths are also used by this column.  Thus values of this column
    are sequences but no lengths are stored in this column.

  []

  For example, consider the following row, as JSON, where all values
  are primitive types, but one has multiple values.

---
{"id"=566, "date"=23423234234
 "from"="foo@bar.com",
 "to"=["bar@baz.com", "bang@foo.com"],
 "content"="Hi!"}
---

  The columns for this might be specified as:

---
name=id       type=int
name=date     type=long
name=from     type=string
name=to       type=string  array=true
name=content  type=string 
---

  If a row contains an array of records, e.g. "received" in the following:

---
{"id"=566, "date"=23423234234
 "from"="foo@bar.com",
 "to"=["bar@baz.com", "bang@foo.com"],
 "content"="Hi!"
 "received"=[{"date"=234234234234, "host"="192.168.0.0.1"},
             {"date"=234234545645, "host"="192.168.0.0.2"}]
}
---

  Then one can define a parent column followed by a column for each
  field in the record, adding the following columns:

---
name=received  type=null    array=true
name=date      type=long    parent=received
name=host      type=string  parent=received
---

  If an array value itself contains an array, e.g. the "sigs" below:

---
{"id"=566, "date"=23423234234
 "from"="foo@bar.com",
 "to"=["bar@baz.com", "bang@foo.com"],
 "content"="Hi!"
 "received"=[{"date"=234234234234, "host"="192.168.0.0.1",
              "sigs"=[{"algo"="weak", "value"="0af345de"}]},
             {"date"=234234545645, "host"="192.168.0.0.2",
              "sigs"=[]}]
}
---

  Then a parent column may be defined that itself has a parent column.

---
name=sigs   type=null    array=true  parent=received
name=algo   type=string              parent=sigs
name=value  type=string              parent=sigs
---

** Block Metadata

  No block metadata properties are currently defined.

* File Format

  A <<file>> consists of:

  * A <file header>, followed by

  * one or more <columns>.

  []

  A <<file header>> consists of:

  * Four bytes, ASCII 'T', 'r', 'v', followed by 0x02.

  * a <fixed64> indicating the number of rows in the file

  * a <fixed32> indicating the number of columns in the file

  * file <metadata>.

  * for each column, its <column metadata>

  * for each column, its starting position in the file as a <fixed64>.

  []

  A <<column>> consists of:

  * A <fixed32> indicating the number of blocks in this column.

  * For each block, a <block descriptor>

  * One or more <blocks>.

  []

  A <<block descriptor>> consists of:

  * A <fixed32> indicating the number of rows in the block

  * A <fixed32> indicating the size in bytes of the block before the
    codec is applied (excluding checksum).

  * A <fixed32> indicating the size in bytes of the block after the
    codec is applied (excluding checksum).

  * If this column’s metadata declares it to include values, the first
    value in the column, serialized according to this column's type.

  []

  A <<block>> consists of:

  * The serialized column values.  If a column is an array column then
    value sequences are preceded by their length, as an <int>.  If a
    codec is specified, the values and lengths are compressed by that
    codec.

  * The checksum, as determined by the file metadata.

  []

* Codecs

  [null] The "null" codec simply passes data through uncompressed.

  [deflate] The "deflate" codec writes the data block using the
  deflate algorithm as specified in RFC 1951.

  [snappy] The "snappy" codec uses Google's Snappy compression library.

* Checksum algorithms

  [null] The "null" checksum contains zero bytes.

  [crc-32] Each "crc-32" checksum contains the four bytes of an ISO
  3309 CRC-32 checksum of the uncompressed block data as a fixed32.

* Type Mappings

  We define a standard mapping for how types defined in various
  serialization systems are represented in a column file.  Records
  from these systems are <shredded> into columns.  When records are
  nested, a depth-first recursive walk can assign a separate column
  for each primitive value.

** Avro

** Protocol Buffers

** Thrift

Implementation Notes

  Some possible techniques for writing column files include:

  [[1]] Use a standard ~100MB block, buffer in memory up to the block
  size, then flush the file directly to HDFS.  A single reduce task
  might create multiple output files.  The namenode requires memory
  proportional to the number of names and blocks*replication.  This
  would increase the number of names but not blocks, so this should
  still be much better than a file per column.

  [[1]] Spill each column to a separate local, temporary file then,
  when the file is closed, append these files, writing a single file
  to HDFS whose block size is set to be that of the entire file.  This
  would be a bit slower than and may have trouble when the local disk
  is full, but it would better use HDFS namespace and further reduce
  seeks when processing columns whose values are small.

  [[1]] Use a separate mapreduce job to convert row-major files to
  column-major.  The map output would output a by (row#, column#,
  value) tuple, partitioned by row# but sorted by column# then row#.
  The reducer could directly write the column file.  But the column
  file format would need to be changed to write counts, descriptors,
  etc. at the end of files rather than at the front.

  []

  (1) is the simplest to implement and most implementations should
  start with it.

* References

  {CIF} {{{http://arxiv.org/pdf/1105.4252.pdf}<Column-Oriented Storage
  Techniques for MapReduce>}}, Floratou, Patel, Shekita, & Tata, VLDB
  2011.

  {DREMEL} {{{http://research.google.com/pubs/archive/36632.pdf}<Dremel:
  Interactive Analysis of Web-Scale Datasets>}}, Melnik, Gubarev, Long,
  Romer, Shivakumar, & Tolton, VLDB 2010.