Creating Vectors from Weka's ARFF Format
Mahout can convert Weka's ARFF (2.1) format to Mahout's Vector format.
ARFF files are converted using the org.apache.mahout.utils.arff.Driver program. To list its input arguments, run it with the --help argument, which produces output similar to:
Usage:
 [--input <input> --output <output> --max <max> --help --dictOut <dictOut>
 --outputWriter <outputWriter> --delimiter <delimiter>]
Options
  --input (-d) input                 The file or directory containing the ARFF
                                     files. If it is a directory, all .arff
                                     files will be converted. (Mandatory
                                     parameter)
  --output (-o) output               The output directory. Files will have the
                                     same name as the input, but with the
                                     extension .mvc (Mandatory parameter)
  --max (-m) max                     The maximum number of vectors to output.
                                     If not specified, then it will loop over
                                     all docs (Optional parameter)
  --help (-h)                        Print out help (Optional parameter)
  --dictOut (-t) dictOut             The file to output the label bindings
                                     (Mandatory parameter)
  --outputWriter (-e) outputWriter   The VectorWriter to use, either seq
                                     (SequenceFileVectorWriter - default) or
                                     file (Writes to a File using JSON format)
                                     (Optional parameter)
  --delimiter (-l) delimiter         The delimiter for outputing the dictionary
                                     (Optional parameter)
Each parameter can be given in its long form, such as --input, or by its equivalent short name, such as -d. From here, running the Driver is as simple as pointing it at an ARFF file or a directory of ARFF files:
$MAHOUT_HOME/bin/mahout arff.vector -d ./content/reuters-modapte/ \
  -t ./content/reuters-modapte/output/dict.txt -o ./content/reuters-modapte/output/convert
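To try the conversion without the Reuters data, a small hand-written ARFF file is enough. The script below is only a sketch: the arff-demo directory, the weather.arff file name and its contents are made up for illustration, and creating the output directory in advance is a precaution rather than a documented requirement. It uses only the -d, -t and -o options listed in the help output above, and it skips the conversion step when MAHOUT_HOME does not point at a Mahout installation.

```shell
#!/bin/sh
# Sketch: build a tiny ARFF input and convert it with the arff.vector driver.
# WORK_DIR and the sample file are illustrative assumptions, not Mahout defaults.
set -eu

WORK_DIR="${WORK_DIR:-./arff-demo}"
mkdir -p "$WORK_DIR/input" "$WORK_DIR/output"

# A minimal ARFF file: a relation name, three attributes, three data rows.
cat > "$WORK_DIR/input/weather.arff" <<'EOF'
@relation weather
@attribute temperature numeric
@attribute humidity numeric
@attribute outlook {sunny,overcast,rainy}
@data
85,85,sunny
83,86,overcast
70,96,rainy
EOF

# Run the driver only if a Mahout installation is available.
if [ -n "${MAHOUT_HOME:-}" ] && [ -x "$MAHOUT_HOME/bin/mahout" ]; then
  "$MAHOUT_HOME/bin/mahout" arff.vector \
    -d "$WORK_DIR/input" \
    -t "$WORK_DIR/output/dict.txt" \
    -o "$WORK_DIR/output/convert"
else
  echo "MAHOUT_HOME is not set to a Mahout installation; skipping conversion."
fi
```

Because -d is given a directory here, every .arff file placed in arff-demo/input would be converted in one run, as described for the --input option.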