Running Canopy Clustering from the Command Line
Mahout’s Canopy clustering can be launched with the same command-line
invocation whether you are running on a single machine in stand-alone mode
or on a larger Hadoop cluster. The mode is determined by the
$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and
point to an operating Hadoop cluster on the target machine, the invocation
runs Canopy on that cluster. If either environment variable is missing, the
stand-alone Hadoop configuration is used instead.
./bin/mahout canopy <OPTIONS>
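The mode selection described above can be checked before launching a job. The following is a minimal sketch, not what bin/mahout does internally: the function name mahout_mode is hypothetical, and it assumes that "set" simply means the variable is non-empty. It does not verify that the Hadoop cluster is actually operating.

```shell
#!/bin/sh
# Sketch: report which mode the documented rule would select.
# Assumption: a variable counts as "set" when it is non-empty.
# This does NOT check that the cluster itself is reachable.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

echo "Canopy would run in $(mahout_mode) mode"
```

Running this in the same shell session that will launch ./bin/mahout shows which of the two cases applies to the current environment.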
- In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, with the Mahout 0.3 release, the
job will be mahout-core-0.3.job.
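The naming rule above can be turned into a quick check that the build produced the job file. This is only a sketch: the helper job_path is hypothetical, and the fallback location /opt/mahout is an assumed install path used when $MAHOUT_HOME is not already set.

```shell
#!/bin/sh
# Hypothetical helper: expected job file path for a given Mahout version,
# following the mahout-core-<version>.job pattern described above.
job_path() {
  echo "${MAHOUT_HOME}/core/target/mahout-core-$1.job"
}

# Assumed install location, used only if MAHOUT_HOME is unset.
MAHOUT_HOME="${MAHOUT_HOME:-/opt/mahout}"

# Check for the job file of the 0.3 release used in the example.
JOB="$(job_path 0.3)"
if [ -f "$JOB" ]; then
  echo "found $JOB"
else
  echo "missing $JOB (run 'mvn install' in $MAHOUT_HOME first)"
fi
```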
Testing it on a single machine without a cluster
Running it on the cluster
Command line options
--input (-i) input                       Path to job input directory. Must be
                                         a SequenceFile of VectorWritable.
--output (-o) output                     The directory pathname for output.
--overwrite (-ow)                        If present, overwrite the output
                                         directory before running the job.
--distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
                                         Default is SquaredEuclidean.
--t1 (-t1) t1                            T1 threshold value.
--t2 (-t2) t2                            T2 threshold value.
--clustering (-cl)                       If present, run clustering after the
                                         iterations have taken place.
--help (-h)                              Print out help.
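As an illustration of how the options above combine, here is a sketch of a complete invocation. The input and output paths and the T1 and T2 values are hypothetical placeholders, not recommended settings; the only constraint assumed is that T1 is larger than T2, as Canopy clustering expects. The --distanceMeasure option is omitted so that the default applies. The script prints the command by default and runs it only when RUN=1 is set.

```shell
#!/bin/sh
# All values below are illustrative assumptions, not defaults.
INPUT="testdata/vectors"    # hypothetical directory of SequenceFile/VectorWritable input
OUTPUT="output/canopies"    # hypothetical output directory
T1="3.0"                    # hypothetical T1 threshold (T1 > T2)
T2="1.5"                    # hypothetical T2 threshold

# Build the command using only the options listed above.
CMD="./bin/mahout canopy --input $INPUT --output $OUTPUT --overwrite --t1 $T1 --t2 $T2 --clustering"

# Print the command by default; set RUN=1 to execute it from $MAHOUT_HOME.
if [ "${RUN:-0}" = "1" ]; then
  $CMD
else
  echo "$CMD"
fi
```

Printing first makes it easy to review the thresholds and paths before a job overwrites the output directory.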