File Format Integrations

There are several importers and exporters for common file formats.

General-purpose converters

Importer 'bin/mahout' jobs

Run these with --help to see their options.

  • bin/mahout arff.vector
  • bin/mahout lucene.vector
  • bin/mahout seqdirectory
    • converts text files into SequenceFiles, one key/value pair per input file
  • bin/mahout SequenceFilesFromMailArchives
    • parses mailboxes and emits one text body per mail message
  • bin/mahout regexconverter
    • reads lines of text, applies a regular expression to each, and writes the matches to SequenceFiles

Exporter 'bin/mahout' jobs

Some programs exist to dump text versions of SequenceFiles for perusal. Run these with --help to see options.

  • bin/mahout clusterdump
  • bin/mahout cmdump
  • bin/mahout matrixdump
  • bin/mahout seqdumper
  • bin/mahout vectordump

Note: all classes with a 'main' method can be used as a bin/mahout job name.

Importer classes

These classes have no main() method, so they cannot be run as bin/mahout jobs; call them from your own code.

  • CSVVectorIterator imports CSV files into vectors.

Exporter classes

  • GraphMLClusterWriter saves cluster data in the GraphML format.
  • CSVClusterWriter saves clusters in a CSV-based format.

Both of these formats are read by the Gephi program, an interactive graph explorer.

There are many file importers which are custom-made for particular algorithms:

  • The various text -> Lucene index converters

Examples

Regex Converter

For example, the following command extracts queries from logs of HTTP requests to Solr and prepares them for use by Frequent Itemset Mining.

bin/mahout regexconverter --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
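To see what the expression passed to --regex selects, the following is a minimal sketch that applies the same pattern to a single line using only java.util.regex. The QueryExtractor class and the sample log line are made up for illustration and are not part of Mahout; the sketch covers only the matching step, not the url transformer or the fpg formatter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not part of Mahout.
final class QueryExtractor {
    // Same expression as the --regex argument in the command above:
    // a lookbehind for "?q=" or "&q=", then the shortest run of characters
    // that is followed by "&" or the end of the line.
    private static final Pattern QUERY =
            Pattern.compile("(?<=(\\?|&)q=).*?(?=&|$)");

    // Returns every substring of the line that the pattern matches.
    static List<String> extract(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUERY.matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up log line, not taken from a real Solr log.
        String line = "GET /solr/select?q=mahout+clustering&rows=10 HTTP/1.1";
        System.out.println(extract(line)); // prints [mahout+clustering]
    }
}
```

Because the match stops only at the next & or at the end of the line, a q parameter that comes last in the request captures everything up to the end of that line.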

See a regular-expression tutorial and cheat sheet for help with this marvelously opaque toolkit.