Data Formats
Mahout relies heavily on a handful of file formats across its job implementations.
Simple text vectors represent documents. The dimensions correspond to the set of terms in the entire document set, one dimension per term. Each document vector stores a number at the position of each term the document contains. This number may be derived from the count of the term inside the document.
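As a rough illustration of this layout, the sketch below builds a term dictionary over a tiny document set and fills one count vector per document. It uses only the Java standard library: the class name TermVectorSketch, the whitespace tokenizer, and the use of raw counts as weights are assumptions made for the example, not Mahout's own vectorization code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: plain arrays stand in for vectors.
class TermVectorSketch {
    // Lower-case whitespace tokenizer; a stand-in for a real text analyzer.
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String t : doc.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // One dimension per distinct term in the whole document set (sorted order).
    static Map<String, Integer> buildDictionary(List<String> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (String doc : docs) terms.addAll(tokenize(doc));
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String term : terms) dict.put(term, dict.size());
        return dict;
    }

    // Store the term count at the position of each term the document contains.
    static double[] vectorize(String doc, Map<String, Integer> dict) {
        double[] vector = new double[dict.size()];
        for (String term : tokenize(doc)) {
            Integer index = dict.get(term);
            if (index != null) vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the cat saw the dog");
        Map<String, Integer> dict = buildDictionary(docs);
        System.out.println(dict); // {cat=0, dog=1, sat=2, saw=3, the=4}
        for (String doc : docs) {
            System.out.println(Arrays.toString(vectorize(doc, dict)));
        }
        // [1.0, 0.0, 1.0, 0.0, 1.0]
        // [1.0, 1.0, 0.0, 1.0, 2.0]
    }
}
```

The vector length equals the dictionary size, so it grows with the vocabulary of the whole document set.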
Here too, each vector represents a document. However, the term dimensions are "collapsed" stochastically: each term in the full term set is mapped at random to several indexes in a much smaller index range.
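The collapsing idea can be sketched in a few lines. In the example below, each term is mapped to a fixed number of positions ("probes") in a vector of a chosen small size. The class name HashedVectorSketch, the probe count, and the use of String.hashCode as the pseudo-random mapping are all assumptions for illustration. Mahout's own encoders may use a different hash and weighting.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: shows terms collapsing into a
// fixed, smaller index range via several hashed probes per term.
class HashedVectorSketch {
    // Map one term to numProbes indexes in [0, cardinality).
    // String.hashCode is only a stand-in for a proper hash function.
    static int[] probes(String term, int numProbes, int cardinality) {
        int[] indexes = new int[numProbes];
        for (int p = 0; p < numProbes; p++) {
            indexes[p] = Math.floorMod((term + "#" + p).hashCode(), cardinality);
        }
        return indexes;
    }

    // Add 1.0 at every probe position of every term in the document.
    static double[] encode(String doc, int numProbes, int cardinality) {
        double[] vector = new double[cardinality];
        for (String term : doc.toLowerCase().trim().split("\\s+")) {
            if (term.isEmpty()) continue;
            for (int index : probes(term, numProbes, cardinality)) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // The vector stays at 16 dimensions however large the vocabulary is.
        double[] v = encode("the cat saw the dog", 2, 16);
        System.out.println(Arrays.toString(v));
    }
}
```

Unlike the dictionary-based layout, no term dictionary is needed and different terms can collide on the same index. Using several probes per term spreads the effect of any single collision.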
<Integer,Text> pairs map matrix rows to the input text keys, such as movie names or document file names. These pairs are produced by RowIdJob.
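The mapping itself is simple to picture. The sketch below assigns consecutive row numbers to text keys in input order and keeps the reverse lookup from row number to key. It uses plain Java collections rather than Hadoop Writables, and the class name RowIndexSketch and the sample keys are hypothetical. It is not RowIdJob's code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: an in-memory analogue of an
// <Integer,Text> index from matrix row numbers back to text keys.
class RowIndexSketch {
    // Assign row numbers 0, 1, 2, ... to the keys in the order given.
    static Map<Integer, String> buildIndex(List<String> keys) {
        Map<Integer, String> index = new LinkedHashMap<>();
        for (String key : keys) {
            index.put(index.size(), key);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> movieNames = Arrays.asList("Alien", "Brazil", "Casablanca");
        Map<Integer, String> index = buildIndex(movieNames);
        // After a matrix computation, row 1 can be traced back to its key.
        System.out.println(index.get(1)); // Brazil
    }
}
```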
Matrices are almost universally stored as LongWritable/VectorWritable pairs, where VectorWritable can be sparse or dense.
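To make the sparse/dense distinction concrete, the sketch below stores the same matrix row both ways: as a full array and as a map holding only the non-zero entries. It uses plain Java collections, and the class name MatrixRowSketch and its methods are hypothetical. Mahout's actual vector classes are not used here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: contrasts dense and sparse
// storage of one matrix row.
class MatrixRowSketch {
    // Keep only the non-zero entries, keyed by column index.
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) sparse.put(i, dense[i]);
        }
        return sparse;
    }

    // Rebuild the full row; cardinality is the number of columns.
    static double[] toDense(Map<Integer, Double> sparse, int cardinality) {
        double[] dense = new double[cardinality];
        for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
            dense[entry.getKey()] = entry.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        // A matrix as (row key, row vector) pairs, here with a long row key.
        Map<Long, double[]> matrix = new TreeMap<>();
        matrix.put(0L, new double[] {0, 3, 0, 0, 1.5});
        matrix.put(1L, new double[] {2, 0, 0, 0, 0});
        for (Map.Entry<Long, double[]> row : matrix.entrySet()) {
            Map<Integer, Double> sparse = toSparse(row.getValue());
            System.out.println(row.getKey() + " -> " + sparse
                + " / " + Arrays.toString(toDense(sparse, 5)));
        }
        // 0 -> {1=3.0, 4=1.5} / [0.0, 3.0, 0.0, 0.0, 1.5]
        // 1 -> {0=2.0} / [2.0, 0.0, 0.0, 0.0, 0.0]
    }
}
```

Both forms carry the same values. The sparse form pays off when most entries in a row are zero, as is typical for term vectors.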
Clusters are stored in complex data structures.
These are stored in a custom data structure.
Mahout jobs generally assume that the files they generate have no guaranteed lifespan. Any Writable format may change, and some may disappear. There are no file compatibility requirements.