Title: Data Formats

Mahout uses a few file formats quite a bit in its various job implementations.

* [File formats](#DataFormats-Fileformats)
    * [Raw formats for import](#DataFormats-Rawformatsforimport)
    * [Raw formats for export](#DataFormats-Rawformatsforexport)
* [Who Stores What in a SequenceFile?](#DataFormats-WhoStoresWhatinaSequenceFile?)
    * ["Simple" Text Vectors](#DataFormats-"Simple"TextVectors)
    * [Encoded Text Vectors](#DataFormats-EncodedTextVectors)
    * [Directories](#DataFormats-Directories)
    * [Matrices](#DataFormats-Matrices)
    * [Clusters](#DataFormats-Clusters)
    * [FPGrowth Clusters](#DataFormats-FPGrowthClusters)
* [Life cycle](#DataFormats-Lifecycle)

## File formats

### Raw formats for import

* Text files
    * can be parsed into SequenceFiles of:
        * (line number, text of line)
        * (file name, full contents of file)
        * (line number, parts of line extracted with regex patterns)
    * can also be parsed into Lucene indexes:
        * _precise index design ???_
* ARFF files
    * Weka project text data format
    * Parsed into SequenceFile of
* Mailbox files
    * can be parsed into SequenceFiles of:
        * (mail message id, text body of mail message)
        * no HTML or attachment support
* CSV files
    * generally without column or row headers
    * no "multiple values per column" options
* Hadoop SequenceFile
    * canonical, no variations. Currently no use of metadata.
* Lucene indexes
    * translated into SequenceFiles
        * _precise index design ???_

### Raw formats for export

* SequenceFiles
* Text lines, mostly of the toString() variety
* MatrixWritable for ConfusionMatrix
* CSV for MatrixWritable
* A special CSV format for Clusters
* [GraphML XML](http://graphml.graphdrawing.org/) for Clusters

## Who Stores What in a SequenceFile?

### "Simple" Text Vectors

Simple text vectors represent documents. The dimensions are the set of terms in the entire document set. Each document vector stores a number in the position of each term it contains. This number may be derived from the count of the term inside the document.

### Encoded Text Vectors

Each vector represents a document. However, term dimensions are "collapsed" stochastically, meaning each term in the full term set is mapped randomly to several indexes in a smaller index space.

### Directories

These are pairs that match matrix rows to input text keys such as movie names or document file names. They are made by RowIdJob.

### Matrices

Matrices are almost universally stored as LongWritable/VectorWritable pairs, where the VectorWritable can be sparse or dense.

### Clusters

Clusters are stored in complex data structures.

### FPGrowth Clusters

These are stored in a custom data structure.

## Life cycle

Mahout jobs generally assume that the files they generate have no lifespan. All Writable formats may change, and some may disappear. There are no file compatibility requirements.
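
To make the difference between the "simple" and encoded text vectors described under "Who Stores What in a SequenceFile?" more concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation: the function names `simple_vector` and `encoded_vector`, the `probes` parameter, and the use of CRC32 as the hash are all assumptions made for illustration only. The simple vector keeps one dimension per term in the document set, while the encoded vector collapses every term onto several indexes of a much smaller fixed-size vector.

```python
import zlib
from collections import Counter


def simple_vector(tokens, vocabulary):
    """Sparse term-count vector: one dimension per term in the whole document set.

    `vocabulary` maps each term to its dimension index (hypothetical helper input).
    """
    counts = Counter(tokens)
    return {vocabulary[term]: float(n) for term, n in counts.items() if term in vocabulary}


def encoded_vector(tokens, size, probes=2):
    """Hashed vector: each term is added to `probes` indexes of a `size`-wide vector.

    CRC32 stands in for whatever hash a real encoder would use (an assumption).
    """
    vector = [0.0] * size
    for term in tokens:
        for probe in range(probes):
            index = zlib.crc32(f"{probe}:{term}".encode("utf-8")) % size
            vector[index] += 1.0
    return vector


docs = [["apple", "banana", "apple"], ["banana", "cherry"]]

# Dimensions of the simple vectors are the sorted set of all terms in the document set.
vocabulary = {term: i for i, term in enumerate(sorted({t for d in docs for t in d}))}

simple = [simple_vector(d, vocabulary) for d in docs]
encoded = [encoded_vector(d, size=8) for d in docs]
```

In this sketch the simple vectors grow with the vocabulary, whereas the encoded vectors always have `size` entries and may let unrelated terms collide on the same index, which is the trade-off behind the "collapsed" dimensions.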