Running Fuzzy k-Means Clustering from the Command Line

Mahout’s Fuzzy k-Means clustering can be launched from the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to an operating Hadoop cluster on the target machine then the invocation will run FuzzyK on that cluster. If either of the environment variables are missing then the stand-alone Hadoop configuration will be invoked instead.

./bin/mahout fkmeans <OPTIONS>
  • In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job will be generated in $MAHOUT_HOME/core/target/ and it’s name will contain the Mahout version number. For example, when using Mahout 0.3 release, the job will be mahout-core-0.3.job

Testing it on one single machine w/o cluster

  • Put the data: cp testdata
  • Run the Job:

    ./bin/mahout fkmeans -i testdata

Running it on the cluster

  • (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
  • Run the Job:

    export HADOOP_HOME= export HADOOP_CONF_DIR=$HADOOP_HOME/conf ./bin/mahout fkmeans -i testdata

  • Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.

Command line options

  --input (-i) input			       Path to job input directory. 
					       Must be a SequenceFile of    
					       VectorWritable		    
  --clusters (-c) clusters		       The input centroids, as Vectors. 
					       Must be a SequenceFile of    
					       Writable, Cluster/Canopy. If k  
					       is also specified, then a random 
					       set of vectors will be selected  
					       and written out to this path 
					       first			    
  --output (-o) output			       The directory pathname for   
					       output.			    
  --distanceMeasure (-dm) distanceMeasure      The classname of the	    
					       DistanceMeasure. Default is  
					       SquaredEuclidean 	    
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value. 
					       Default is 0.5		    
  --maxIter (-x) maxIter		       The maximum number of	    
					       iterations.		    
  --k (-k) k				       The k in k-Means.  If specified, 
					       then a random selection of k 
					       Vectors will be chosen as the
    					       Centroid and written to the  
					       clusters input path.	    
  --m (-m) m				       coefficient normalization    
					       factor, must be greater than 1   
  --overwrite (-ow)			       If present, overwrite the output 
					       directory before running job 
  --help (-h)				       Print out help		    
  --numMap (-u) numMap			       The number of map tasks.     
					       Defaults to 10		    
  --maxRed (-r) maxRed			       The number of reduce tasks.  
					       Defaults to 2		    
  --emitMostLikely (-e) emitMostLikely	       True if clustering should emit   
					       the most likely point only,  
					       false for threshold clustering.  
					       Default is true		    
  --threshold (-t) threshold		       The pdf threshold used for   
					       cluster determination. Default   
					       is 0 
  --clustering (-cl)			       If present, run clustering after 
					       the iterations have taken place