Title: k-means-commandline

# Introduction

This quick start page describes how to run the k-Means clustering algorithm on a Hadoop cluster.

# Steps

Mahout's k-Means clustering can be launched from the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to an operating Hadoop cluster on the target machine, then the invocation will run k-Means on that cluster. If either of the environment variables is missing, then the stand-alone Hadoop configuration will be invoked instead.

    ./bin/mahout kmeans

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, when using the Mahout 0.3 release, the job will be mahout-core-0.3.job

## Testing it on one single machine w/o cluster

* Put the data: cp testdata
* Run the Job:

        ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

## Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
* Run the Job:

        export HADOOP_HOME=
        export HADOOP_CONF_DIR=$HADOOP_HOME/conf
        ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.

# Command line options

    --input (-i) input                        Path to job input directory. Must be a
                                              SequenceFile of VectorWritable
    --clusters (-c) clusters                  The input centroids, as Vectors. Must be a
                                              SequenceFile of Writable, Cluster/Canopy.
                                              If k is also specified, then a random set
                                              of vectors will be selected and written
                                              out to this path first
    --output (-o) output                      The directory pathname for output.
    --distanceMeasure (-dm) distanceMeasure   The classname of the DistanceMeasure.
                                              Default is SquaredEuclidean
    --convergenceDelta (-cd) convergenceDelta The convergence delta value. Default is 0.5
    --maxIter (-x) maxIter                    The maximum number of iterations.
    --maxRed (-r) maxRed                      The number of reduce tasks. Defaults to 2
    --k (-k) k                                The k in k-Means. If specified, then a
                                              random selection of k Vectors will be
                                              chosen as the Centroid and written to the
                                              clusters input path.
    --overwrite (-ow)                         If present, overwrite the output directory
                                              before running job
    --help (-h)                               Print out help
    --clustering (-cl)                        If present, run clustering after the
                                              iterations have taken place
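
The choice between stand-alone and cluster mode described under Steps can be checked before launching a job. The following is a minimal sketch, not part of Mahout: the function names `mahout_run_mode` and `kmeans_command` are hypothetical, and the check only tests whether both variables are non-empty. It does not verify that they point at an operating Hadoop cluster. The second function simply prints the example invocation from this page for a given input and output path, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with Mahout.

# Report the mode the launcher is expected to pick, based only on whether
# HADOOP_HOME and HADOOP_CONF_DIR are both non-empty.
# Assumption: an empty value is treated the same as a missing variable.
mahout_run_mode() {
    if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
        echo "cluster"
    else
        echo "standalone"
    fi
}

# Print (do not execute) the k-means invocation used on this page.
#   $1 = input directory, $2 = output directory
kmeans_command() {
    input=$1
    output=$2
    echo "./bin/mahout kmeans -i $input -o $output -c clusters" \
         "-dm org.apache.mahout.common.distance.CosineDistanceMeasure" \
         "-x 5 -ow -cd 1 -k 25"
}

echo "Expected mode: $(mahout_run_mode)"
kmeans_command testdata output
```

Printing the command rather than running it keeps the sketch usable on a machine without Mahout installed. To launch the job for real, run the printed line from $MAHOUT_HOME.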