Running Mean Shift Canopy Clustering from the Command Line

Mahout's Mean Shift clustering is launched with the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and point to an operating Hadoop cluster on the target machine, the invocation runs Mean Shift on that cluster. If either variable is missing, the stand-alone Hadoop configuration is used instead.

./bin/mahout meanshift <OPTIONS>
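
To check in advance which mode an invocation would select, the rule described above can be sketched as a small shell function. This is a rough illustration only: mahout_mode is a hypothetical helper, not part of Mahout, and it tests only whether the two variables are set, not whether they point to a working cluster.

```shell
# Hypothetical helper, not part of Mahout: reports the mode that the
# environment-variable rule described above would select.
# It only checks that both variables are non-empty; it does not verify
# that they point to an operating Hadoop cluster.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

mahout_mode
```

Running the function before ./bin/mahout makes it easier to notice a missing export that would otherwise silently fall back to stand-alone mode.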

In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, when using the Mahout 0.3 release, the job will be mahout-core-0.3.job.

Testing it on a single machine without a cluster

Put the data: cp <PATH TO DATA> testdata

Run the Job:

./bin/mahout meanshift -i testdata <OTHER OPTIONS>

Running it on the cluster

(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata

Run the Job:

export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout meanshift -i testdata <OTHER OPTIONS>

Get the data out of HDFS and have a look. Use $HADOOP_HOME/bin/hadoop fs -lsr output to list all outputs.

Command line options

  --input (-i) input                           Path to job input directory.     
                                               Must be a SequenceFile of        
                                               VectorWritable                   
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --help (-h)                                  Print out help                   
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --t1 (-t1) t1                                T1 threshold value               
  --t2 (-t2) t2                                T2 threshold value               
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --inputIsCanopies (-ic) inputIsCanopies      If present, the input directory  
                                               already contains                 
                                               MeanShiftCanopies
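
As an illustration of how the options above combine, the following sketch assembles a full invocation and prints it rather than running it. The function name meanshift_cmd is hypothetical, the threshold and iteration values are placeholders rather than recommendations, and the fully qualified distance measure class name is an assumption about the package layout of your Mahout release.

```shell
# Hypothetical wrapper, not part of Mahout: builds a Mean Shift command
# line from the documented options and prints it (a dry run).
# Arguments: input dir, output dir, T1 threshold, T2 threshold.
# The -cd and -x values are illustrative placeholders; the distance
# measure class name is an assumption and may differ between releases.
meanshift_cmd() {
  printf '%s' "./bin/mahout meanshift"
  printf ' %s' \
    -i "$1" \
    -o "$2" \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -cd 0.5 \
    -t1 "$3" -t2 "$4" \
    -x 10 \
    -ow -cl
  printf '\n'
}

meanshift_cmd testdata output 47.6 1.0
```

Printing the command first lets you review the option values before launching a job. Once it looks right, the printed line can be pasted into a shell in $MAHOUT_HOME. Paths containing spaces would need additional quoting.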