Creating Vectors from Weka's ARFF Format

Introduction

Mahout can convert files in Weka's ARFF (2.1) format to Mahout's Vector format.
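For readers who have not seen ARFF before, the following is a minimal sketch of what an input file might look like. The directory name sample, the file name weather.arff, and the attribute names and values are all hypothetical and are not part of the Mahout distribution; only the general ARFF layout (a relation name, attribute declarations, then comma-separated data rows) is standard.

```shell
# Create a small, hypothetical ARFF file to experiment with.
# The path sample/weather.arff and its contents are illustrative assumptions.
mkdir -p sample
cat > sample/weather.arff <<'EOF'
% Hypothetical example data, not taken from the Mahout distribution.
@relation weather

@attribute temperature numeric
@attribute humidity numeric
@attribute outlook {sunny,overcast,rainy}

@data
85,85,sunny
83,86,overcast
70,96,rainy
EOF
```

Each row in the data section supplies one value per declared attribute, in declaration order.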

Running the Converter

ARFF files are converted with the org.apache.mahout.utils.arff.Driver program. To list its input arguments, run it with the --help argument, which produces output similar to:

Usage:
 [--input <input> --output <output> --max <max> --help --dictOut <dictOut>
--outputWriter <outputWriter> --delimiter <delimiter>]
Options
  --input (-d) input                  The file or directory containing the ARFF
                                      files.  If it is a directory, all .arff
                                      files will be converted. (Mandatory parameter)
  --output (-o) output                The output directory.  Files will have
                                      the same name as the input, but with the
                                      extension .mvc (Mandatory parameter)
  --max (-m) max                      The maximum number of vectors to output.
                                      If not specified, then it will loop over
                                      all docs (Optional parameter)
  --help (-h)                         Print out help (Optional parameter)
  --dictOut (-t) dictOut              The file to output the label bindings
                                      (Mandatory parameter)
  --outputWriter (-e) outputWriter    The VectorWriter to use, either seq
                                      (SequenceFileVectorWriter - default) or
                                      file (Writes to a File using JSON format)
                                      (Optional parameter)
  --delimiter (-l) delimiter          The delimiter for outputing the
                                      dictionary (Optional parameter)

Each parameter can be given either in its long form, such as --input, or with the equivalent short name, such as -d. From here, running the Driver is as simple as pointing it at an ARFF file or a directory of ARFF files:

$MAHOUT_HOME/bin/mahout arff.vector -d ./content/reuters-modapte/ \
      -t ./content/reuters-modapte/output/dict.txt -o ./content/reuters-modapte/output/convert
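
The same invocation can also be written with the long option names from the help output above, which may read more clearly in scripts. The sketch below wraps it in a small shell function; the function name convert_arff and the MAHOUT_BIN override variable are hypothetical conveniences introduced here, not part of Mahout.

```shell
# convert_arff INPUT OUTPUT_DIR DICT_FILE MAX_VECTORS
# Hypothetical wrapper around the arff.vector command, using long option names.
# MAHOUT_BIN is an assumed override; by default it resolves to $MAHOUT_HOME/bin/mahout.
convert_arff() {
  mahout_bin="${MAHOUT_BIN:-${MAHOUT_HOME:-}/bin/mahout}"
  "$mahout_bin" arff.vector \
    --input "$1" \
    --output "$2" \
    --dictOut "$3" \
    --max "$4"
}
```

For example, calling convert_arff ./content/reuters-modapte/ ./content/reuters-modapte/output/convert ./content/reuters-modapte/output/dict.txt 100 would limit the conversion to at most 100 vectors; the value 100 is an arbitrary illustration.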