k-Means command line introduction
This quick start page describes how to run the k-Means clustering algorithm
on a Hadoop cluster.
Steps
Mahout’s k-Means clustering is launched with the same command line
invocation whether you are running on a single machine in stand-alone mode
or on a larger Hadoop cluster. The mode is determined by the
$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to
an operating Hadoop cluster on the target machine, the invocation runs
k-Means on that cluster. If either environment variable is missing, the
stand-alone Hadoop configuration is used instead.
./bin/mahout kmeans <OPTIONS>
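The mode selection described above can be sketched as a small shell check. This is only an illustration: mahout_mode is a hypothetical helper, not part of Mahout, and it tests only whether the two variables are non-empty, not whether they point at a working cluster.

```shell
# Hypothetical helper (not part of Mahout): reports which mode the
# invocation would be expected to use, based on the two variables.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    # Either variable missing or empty: stand-alone configuration.
    echo "stand-alone"
  fi
}

echo "k-Means would run in $(mahout_mode) mode"
```

Running the check before launching a long job is a cheap way to avoid clustering locally by accident when a cluster run was intended.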
In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
file is generated in $MAHOUT_HOME/core/target/ and its name contains the
Mahout version number. For example, with the Mahout 0.3 release, the job
file is mahout-core-0.3.job.
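As a rough sketch of the build step, the snippet below derives the expected job file path from a version number. The helper name mahout_job_path and the fallback location /opt/mahout are assumptions made for illustration; the build command itself is left commented out so the sketch has no side effects.

```shell
# Assumed install location, used only if MAHOUT_HOME is not already set.
MAHOUT_HOME="${MAHOUT_HOME:-/opt/mahout}"

# Hypothetical helper: prints where the job file is expected for a given
# Mahout version, following the naming pattern described above.
mahout_job_path() {
  echo "${MAHOUT_HOME}/core/target/mahout-core-$1.job"
}

# Build step from the text; uncomment to actually build.
# (cd "$MAHOUT_HOME" && mvn install)

mahout_job_path 0.3
```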
Testing it on a single machine without a cluster
Running it on the cluster
Command line options
--input (-i) input                         Path to job input directory. Must be a
                                           SequenceFile of VectorWritable.
--clusters (-c) clusters                   The input centroids, as Vectors. Must be
                                           a SequenceFile of Writable,
                                           Cluster/Canopy. If k is also specified,
                                           then a random set of vectors will be
                                           selected and written out to this path
                                           first.
--output (-o) output                       The directory pathname for output.
--distanceMeasure (-dm) distanceMeasure    The classname of the DistanceMeasure.
                                           Default is SquaredEuclidean.
--convergenceDelta (-cd) convergenceDelta  The convergence delta value. Default is
                                           0.5.
--maxIter (-x) maxIter                     The maximum number of iterations.
--maxRed (-r) maxRed                       The number of reduce tasks. Defaults to
                                           2.
--k (-k) k                                 The k in k-Means. If specified, then a
                                           random selection of k Vectors will be
                                           chosen as the Centroid and written to
                                           the clusters input path.
--overwrite (-ow)                          If present, overwrite the output
                                           directory before running job.
--help (-h)                                Print out help.
--clustering (-cl)                         If present, run clustering after the
                                           iterations have taken place.
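To show how the options above fit together, here is a minimal dry-run sketch that assembles an invocation and prints it instead of executing it. The paths and the values for --maxIter and --k are placeholders chosen for illustration, kmeans_cmd is a hypothetical helper, and the cosine distance class is just one possible choice of DistanceMeasure.

```shell
# Hypothetical input/output locations; replace with your own paths.
INPUT="testdata/vectors"
CLUSTERS="testdata/clusters"
OUTPUT="output"

# Hypothetical helper: prints the assembled command rather than running
# it, so nothing is launched or overwritten by accident.
kmeans_cmd() {
  echo "./bin/mahout kmeans" \
    "--input $INPUT" \
    "--clusters $CLUSTERS" \
    "--output $OUTPUT" \
    "--distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure" \
    "--convergenceDelta 0.5" \
    "--maxIter 10" \
    "--k 20" \
    "--overwrite" \
    "--clustering"
}

kmeans_cmd
```

Because --k is given, the listing above implies that 20 randomly selected vectors are written to the --clusters path before the iterations start, so that path does not need to hold centroids beforehand.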