Title: Data Formats
Mahout's job implementations rely heavily on a small set of file formats,
which are described below.
* [File formats](#DataFormats-Fileformats)
* [Raw formats for import](#DataFormats-Rawformatsforimport)
* [Raw formats for export](#DataFormats-Rawformatsforexport)
* [Who Stores What in a SequenceFile?](#DataFormats-WhoStoresWhatinaSequenceFile?)
* ["Simple" Text Vectors](#DataFormats-"Simple"TextVectors)
* [Encoded Text Vectors](#DataFormats-EncodedTextVectors)
* [Directories](#DataFormats-Directories)
* [Matrices](#DataFormats-Matrices)
* [Clusters](#DataFormats-Clusters)
* [FPGrowth Clusters](#DataFormats-FPGrowthClusters)
* [Life cycle](#DataFormats-Lifecycle)
## File formats
### Raw formats for import
* Text files
    * can be parsed into SequenceFiles of:
        * (line number, text of line)
        * (file name, full contents of file)
        * (line number, parts of line extracted with regex patterns)
    * can also be parsed into Lucene indexes:
        * _precise index design ???_
* ARFF files
    * Weka project text data format
    * Parsed into SequenceFile of
* Mailbox files
    * can be parsed into SequenceFiles of:
        * (mail message id, text body of mail message)
        * no html or attachment support
* CSV files
    * generally without column or row headers
    * no "multiple values per column" options
* Hadoop SequenceFile
    * canonical, no variations. Currently no use of metadata.
* Lucene indexes
    * translated into SequenceFiles
        * _precise index design ???_
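As a rough illustration of the text-file conversions listed above, the sketch below builds (line number, text of line) pairs and (line number, extracted part) pairs in memory. It uses only the Java standard library; the class name `TextToPairs`, its method names, and the choice to start line numbers at 0 are assumptions made for this example and are not Mahout APIs. A real job would write the pairs to a SequenceFile rather than a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: in-memory stand-in for the text-to-SequenceFile conversions. */
final class TextToPairs {

    /** Produces (line number, text of line) pairs; numbering starts at 0 in this sketch. */
    static List<Map.Entry<Long, String>> linePairs(String text) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long lineNumber = 0;
        for (String line : text.split("\\R")) {
            pairs.add(Map.entry(lineNumber++, line));
        }
        return pairs;
    }

    /**
     * Produces (line number, extracted part) pairs. The pattern must contain at
     * least one capturing group; group 1 of the first match on each line is kept,
     * and lines without a match are skipped.
     */
    static List<Map.Entry<Long, String>> regexPairs(String text, Pattern pattern) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long lineNumber = 0;
        for (String line : text.split("\\R")) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                String part = m.group(1);
                if (part != null) {
                    pairs.add(Map.entry(lineNumber, part));
                }
            }
            lineNumber++;
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "user=ann\nnoise\nuser=bob";
        System.out.println(linePairs(text));
        System.out.println(regexPairs(text, Pattern.compile("user=(\\w+)")));
    }
}
```

Keeping the original line number as the key in the regex variant means the extracted parts can still be traced back to their source lines.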
### Raw formats for export
* SequenceFiles
* Text lines, mostly of the toString() variety
* MatrixWritable for ConfusionMatrix
* CSV for MatrixWritable
* A special CSV format for Clusters
* [GraphML XML](http://graphml.graphdrawing.org/) for Clusters
## Who Stores What in a SequenceFile?
### "Simple" Text Vectors
Simple text vectors represent documents. The dimensions are the set of
terms in the entire document set. Each document vector stores a number in
the position of each term it contains. This number may be derived from the
count of the term inside the document.
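The following is a minimal sketch of that idea, assuming raw term counts as the stored number and plain `double[]` arrays in place of Mahout's vector classes. The class `SimpleTextVectors` and its methods are hypothetical names introduced for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: plain arrays stand in for Mahout's vector classes. */
final class SimpleTextVectors {

    /** Assigns every distinct term in the document set a dimension index (sorted order). */
    static Map<String, Integer> buildDictionary(List<List<String>> corpus) {
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        for (List<String> doc : corpus) {
            for (String term : doc) {
                dictionary.put(term, 0);
            }
        }
        int next = 0;
        for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
            entry.setValue(next++);
        }
        return dictionary;
    }

    /** Stores the raw count of each term at that term's position in the vector. */
    static double[] vectorize(List<String> doc, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String term : doc) {
            Integer index = dictionary.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("apple", "banana", "apple"),
                List.of("banana", "cherry"));
        Map<String, Integer> dictionary = buildDictionary(corpus);
        // First document: prints [2.0, 1.0, 0.0]
        System.out.println(Arrays.toString(vectorize(corpus.get(0), dictionary)));
    }
}
```

Because the dictionary covers the entire document set, every document vector has the same length, and most of its positions are zero for any single document.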
### Encoded Text Vectors
As with simple text vectors, each vector represents a document. However,
the term dimensions are "collapsed" stochastically: each term in the full
term set is mapped randomly to several indexes in a smaller index space.
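One way to picture this is the sketch below, which hashes each term to a fixed number of positions ("probes") in a vector much smaller than the term set. It is an illustration under stated assumptions: `String.hashCode()` stands in for whatever hash function Mahout's encoders use, and the class `EncodedTextVectors`, the `probes` parameter, and the `#` separator are hypothetical choices made for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of collapsing term dimensions by hashing. */
final class EncodedTextVectors {

    /** Maps a (term, probe) pair to a position in [0, cardinality). */
    static int position(String term, int probe, int cardinality) {
        // Assumption: String.hashCode() as a simple, deterministic hash.
        int hash = (term + "#" + probe).hashCode();
        return Math.floorMod(hash, cardinality);
    }

    /** Adds 1.0 at each of the term's probe positions, for every term occurrence. */
    static double[] encode(List<String> doc, int cardinality, int probes) {
        double[] vector = new double[cardinality];
        for (String term : doc) {
            for (int probe = 0; probe < probes; probe++) {
                vector[position(term, probe, cardinality)] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("apple", "banana", "apple");
        System.out.println(Arrays.toString(encode(doc, 8, 2)));
    }
}
```

No dictionary is needed, so the vector size is fixed in advance, at the cost that unrelated terms can collide in the same position. Using several positions per term reduces the chance that two terms collide everywhere.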
### Directories
Directories are SequenceFiles of pairs that map matrix rows to input text
keys such as movie names or document file names. They are created by
RowIdJob.
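The sketch below shows the idea in memory: text-keyed vectors are assigned consecutive row numbers, giving a matrix plus a directory that maps each row number back to its key. It is not RowIdJob itself; the class `RowDirectory` and its members are hypothetical, and lists stand in for the SequenceFiles a real job would write.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory analogue of a directory plus its matrix rows. */
final class RowDirectory {
    /** The "directory": position i holds the text key of matrix row i. */
    final List<String> rowToKey = new ArrayList<>();
    /** Matrix rows, in the same order as the directory. */
    final List<double[]> rows = new ArrayList<>();

    /** Assigns consecutive row numbers to the keyed vectors, in iteration order. */
    static RowDirectory fromKeyedVectors(Map<String, double[]> keyedVectors) {
        RowDirectory directory = new RowDirectory();
        for (Map.Entry<String, double[]> entry : keyedVectors.entrySet()) {
            directory.rowToKey.add(entry.getKey());
            directory.rows.add(entry.getValue());
        }
        return directory;
    }

    /** Looks up the original text key for a matrix row. */
    String keyForRow(int row) {
        return rowToKey.get(row);
    }

    public static void main(String[] args) {
        Map<String, double[]> keyed = new LinkedHashMap<>();
        keyed.put("Casablanca", new double[]{5.0, 0.0});
        keyed.put("Vertigo", new double[]{0.0, 4.0});
        RowDirectory directory = fromKeyedVectors(keyed);
        System.out.println("row 1 -> " + directory.keyForRow(1));
    }
}
```

After a job has produced results in terms of row numbers, the directory is what lets those results be reported with the original names.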
### Matrices
Matrices are almost universally stored as LongWritable/VectorWritable
pairs, where VectorWritable can be sparse or dense.
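To illustrate the sparse versus dense distinction, here is a small sketch that converts one matrix row between the two layouts. It assumes a `double[]` for the dense form and a map from column index to value for the sparse form; the class `SparseDenseRows` is a hypothetical name, and Mahout's own vector classes are not used.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the two row layouts. */
final class SparseDenseRows {

    /** Sparse form: only non-zero columns are stored, keyed by column index. */
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (int column = 0; column < dense.length; column++) {
            if (dense[column] != 0.0) {
                sparse.put(column, dense[column]);
            }
        }
        return sparse;
    }

    /** Dense form: one slot per column, zeros included. */
    static double[] toDense(Map<Integer, Double> sparse, int columns) {
        double[] dense = new double[columns];
        for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
            dense[entry.getKey()] = entry.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        double[] row = {0.0, 3.0, 0.0, 0.0, 5.0};
        // Prints {1=3.0, 4=5.0}
        System.out.println(toSparse(row));
    }
}
```

The sparse layout pays off when most entries are zero, as with document vectors over a large term set; the dense layout is simpler and more compact when most entries are non-zero.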
### Clusters
Clusters are stored in complex data structures.
### FPGrowth Clusters
These are stored in a custom data structure.
## Life cycle
Mahout jobs generally assume that the files they generate have no long-term
lifespan. Any Writable format may change, and some may disappear. There are
no file compatibility requirements.