kMeans commandline introduction

This quick start page describes how to run the kMeans clustering algorithm on a Hadoop cluster.

Steps

Mahout’s k-Means clustering can be launched from the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to an operating Hadoop cluster on the target machine then the invocation will run k-Means on that cluster. If either of the environment variables are missing then the stand-alone Hadoop configuration will be invoked instead.

./bin/mahout kmeans <OPTIONS>

In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job will be generated in $MAHOUT_HOME/core/target/ and it’s name will contain the Mahout version number. For example, when using Mahout 0.3 release, the job will be mahout-core-0.3.job

Testing it on one single machine w/o cluster

  • Put the data: cp testdata
  • Run the Job:

    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

Running it on the cluster

  • (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
  • Run the Job:

    export HADOOP_HOME= export HADOOP_CONF_DIR=$HADOOP_HOME/conf ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

  • Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.

Command line options

  --input (-i) input			       Path to job input directory. 
					       Must be a SequenceFile of    
					       VectorWritable		    
  --clusters (-c) clusters		       The input centroids, as Vectors. 
					       Must be a SequenceFile of    
					       Writable, Cluster/Canopy. If k  
					       is also specified, then a random 
					       set of vectors will be selected  
					       and written out to this path 
					       first			    
  --output (-o) output			       The directory pathname for   
					       output.			    
  --distanceMeasure (-dm) distanceMeasure      The classname of the	    
					       DistanceMeasure. Default is  
					       SquaredEuclidean 	    
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value. 
					       Default is 0.5		    
  --maxIter (-x) maxIter		       The maximum number of	    
					       iterations.		    
  --maxRed (-r) maxRed			       The number of reduce tasks.  
					       Defaults to 2		    
  --k (-k) k				       The k in k-Means.  If specified, 
					       then a random selection of k 
					       Vectors will be chosen as the    
					       Centroid and written to the  
					       clusters input path.	    
  --overwrite (-ow)			       If present, overwrite the output 
					       directory before running job 
  --help (-h)				       Print out help		    
  --clustering (-cl)			       If present, run clustering after 
					       the iterations have taken place