Title: k-means-commandline
# Introduction
This quick start page describes how to run the k-Means clustering algorithm
on a Hadoop cluster.
# Steps
Mahout's k-Means clustering can be launched from the same command line
invocation whether you are running on a single machine in stand-alone mode
or on a larger Hadoop cluster. The difference is determined by the
$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to
a working Hadoop cluster on the target machine, the invocation runs
k-Means on that cluster. If either environment variable is missing, the
stand-alone Hadoop configuration is used instead.
    ./bin/mahout kmeans
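The selection rule described above can be sketched as a small shell check. This only illustrates the rule as stated on this page and is not the logic of the bin/mahout script itself; the function name `mahout_mode` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reports which mode the rule described above would select.
# It only checks that both variables are non-empty; it does not verify that
# the Hadoop cluster they point to is actually running.
mahout_mode() {
    if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
        echo "cluster"
    else
        echo "stand-alone"
    fi
}

echo "k-Means would run in $(mahout_mode) mode"
```

Running this before launching a job is a cheap way to confirm that the environment matches the mode you intend to use.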
* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, with the Mahout 0.3 release, the
job will be mahout-core-0.3.job.
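As a rough sketch, the expected job file location can be derived from the version number and checked before running anything. The function name `job_path` and the fallback to the current directory when $MAHOUT_HOME is unset are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: builds the expected job file path from a version string.
job_path() {
    # $1: Mahout version, e.g. 0.3
    # Assumption: fall back to the current directory if MAHOUT_HOME is unset.
    echo "${MAHOUT_HOME:-.}/core/target/mahout-core-$1.job"
}

job=$(job_path 0.3)
if [ -f "$job" ]; then
    echo "Found job file: $job"
else
    echo "Job file not found (has mvn install been run?): $job"
fi
```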
## Testing it on a single machine without a cluster
* Put the data: cp testdata
* Run the Job:
    ./bin/mahout kmeans -i testdata -o output -c clusters \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -x 5 -ow -cd 1 -k 25
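The invocation above packs several options onto one line. A minimal sketch that names each value and prints the resulting command for review might look like the following; the variable names and the `kmeans_command` function are hypothetical, while the option values are the ones used in the example above.

```shell
#!/bin/sh
# Hypothetical wrapper: names each k-Means setting, then prints the command.
INPUT=testdata       # -i  job input directory
OUTPUT=output        # -o  output directory
CLUSTERS=clusters    # -c  path for the input centroids
DISTANCE=org.apache.mahout.common.distance.CosineDistanceMeasure  # -dm
MAX_ITER=5           # -x  maximum number of iterations
DELTA=1              # -cd convergence delta
K=25                 # -k  number of clusters

kmeans_command() {
    echo "./bin/mahout kmeans -i $INPUT -o $OUTPUT -c $CLUSTERS -dm $DISTANCE -x $MAX_ITER -ow -cd $DELTA -k $K"
}

# Dry run: print the command instead of executing it.
kmeans_command
```

Printing the command first makes it easy to check the values before launching a job; to run it, the printed line can be executed from $MAHOUT_HOME.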
## Running it on the cluster
* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
* Run the Job:
    export HADOOP_HOME=
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    ./bin/mahout kmeans -i testdata -o output -c clusters \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -x 5 -ow -cd 1 -k 25
* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
to view all outputs.
# Command line options
    --input (-i) input                          Path to job input directory.
                                                Must be a SequenceFile of
                                                VectorWritable
    --clusters (-c) clusters                    The input centroids, as Vectors.
                                                Must be a SequenceFile of
                                                Writable, Cluster/Canopy. If k is
                                                also specified, then a random set
                                                of vectors will be selected and
                                                written out to this path first
    --output (-o) output                        The directory pathname for output.
    --distanceMeasure (-dm) distanceMeasure     The classname of the
                                                DistanceMeasure. Default is
                                                SquaredEuclidean
    --convergenceDelta (-cd) convergenceDelta   The convergence delta value.
                                                Default is 0.5
    --maxIter (-x) maxIter                      The maximum number of iterations.
    --maxRed (-r) maxRed                        The number of reduce tasks.
                                                Defaults to 2
    --k (-k) k                                  The k in k-Means. If specified,
                                                then a random selection of k
                                                Vectors will be chosen as the
                                                Centroid and written to the
                                                clusters input path.
    --overwrite (-ow)                           If present, overwrite the output
                                                directory before running job
    --help (-h)                                 Print out help
    --clustering (-cl)                          If present, run clustering after
                                                the iterations have taken place