Converting Content

Intro

Mahout includes several tools for converting content into formats that Mahout can consume more easily. They should not be mistaken for a full ETL layer, but they are useful for tasks such as converting text files and log files. All of them are available through the $MAHOUT_HOME/bin/mahout command-line driver.

SequenceFilesFrom*

  • SequenceFilesFromDirectory – Converts a directory of text files to a SequenceFile in which the key is the name of the file and the value is the file's full text.
  • SequenceFilesFromMailArchives – Similar to SequenceFilesFromDirectory, but converts mbox files.
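The key/value layout described for SequenceFilesFromDirectory can be pictured with a minimal Python sketch. It only illustrates the mapping (file name as key, full text as value); it does not write Hadoop's SequenceFile format and is not Mahout's implementation. The function name directory_to_pairs, the use of the path relative to the input directory as the key, and the sample file names are assumptions made for illustration.

```python
import os
import tempfile


def directory_to_pairs(root, encoding="utf-8"):
    """Walk root and return sorted (relative file name, full text) pairs.

    Illustrative only: mirrors the key/value layout, not the on-disk format.
    """
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding=encoding) as handle:
                text = handle.read()
            # Assumption: the key is the path relative to the input directory.
            pairs.append((os.path.relpath(path, root), text))
    return sorted(pairs)


# Demonstration on a throwaway directory with two hypothetical files.
with tempfile.TemporaryDirectory() as root:
    for file_name, body in [("a.txt", "first document"), ("b.txt", "second document")]:
        with open(os.path.join(root, file_name), "w", encoding="utf-8") as out:
            out.write(body)
    demo_pairs = directory_to_pairs(root)

for key, value in demo_pairs:
    print(key, "=>", value)
```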

RegexConverterDriver

Useful for converting content such as log files from one format to another. For instance, you could convert Solr log files containing query requests into a format consumable by Frequent Itemset Mining.

The following command extracts the queries from Solr HTTP request logs and prepares them for use by Frequent Itemset Mining.

bin/mahout regexconverter --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
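
To see what the --regex value in the command above selects, here is a small Python sketch that applies the same pattern to a few request strings. The pattern uses a lookbehind for "?q=" or "&q=" and a lazy match up to the next "&" or the end of the line; these constructs are also supported by Python's re module, so the sketch should behave like the original for these inputs. The sample lines are hypothetical, not taken from a real Solr log, and the URL-decoding step is an assumption about what the url transformer does, based only on its name.

```python
import re
from urllib.parse import unquote_plus

# The same pattern that is passed to --regex in the command above.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_query(line):
    """Return the raw value of the q parameter in line, or None if absent."""
    match = QUERY_PATTERN.search(line)
    return match.group(0) if match else None


# Hypothetical request strings used only for illustration.
sample_lines = [
    "/solr/select?q=apache+mahout&rows=10",
    "/solr/select?rows=10&q=frequent%20itemset",
    "/solr/select?fq=type:doc&q=solr",   # "fq=" is skipped, "&q=" is matched
    "/solr/admin/ping",                  # no q parameter at all
]

for line in sample_lines:
    raw = extract_query(line)
    # Assumption: the url transformer URL-decodes the matched text.
    decoded = unquote_plus(raw) if raw is not None else None
    print(repr(raw), "->", repr(decoded))
```

Note that the match runs to the end of the line when no "&" follows the query, so any trailing text on a log line (for example a protocol or status field) would be included in the extracted value.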