Running Fuzzy k-Means Clustering from the Command Line
Mahout’s Fuzzy k-Means clustering is launched with the same command-line
invocation whether you are running on a single machine in stand-alone
mode or on a larger Hadoop cluster. The mode is determined by the
$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set and
point to an operating Hadoop cluster on the target machine, the invocation
runs Fuzzy k-Means on that cluster. If either environment variable is
missing, the stand-alone Hadoop configuration is used instead.
./bin/mahout fkmeans <OPTIONS>
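As a rough illustration of the rule described above, the following sketch reports which mode to expect before launching the job. The function name mahout_mode is a hypothetical helper and not part of Mahout. It only checks whether the two variables are set; it does not verify that they point to an operating cluster.

```shell
# mahout_mode is a hypothetical helper (not shipped with Mahout).
# It mirrors the rule in the text: both variables set -> cluster,
# otherwise -> stand-alone. It does not check that the cluster is running.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

echo "fkmeans would be expected to run in $(mahout_mode) mode"
```

Running this in the same shell session that will launch ./bin/mahout gives a quick sanity check of the environment.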
- In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, when using the Mahout 0.3 release, the
job will be mahout-core-0.3.job
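Because the job file name depends on the Mahout version, it can be convenient to look it up rather than hard-code it. The sketch below assumes the layout described above ($MAHOUT_HOME/core/target/ and a .job suffix). The function name find_job_file is hypothetical.

```shell
# find_job_file is a hypothetical helper: it prints the first *.job file
# found under <mahout home>/core/target/ and returns non-zero if none exists.
find_job_file() {
  for f in "$1"/core/target/*.job; do
    if [ -e "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

# Typical use after building (left as comments so the sketch has no side effects):
#   (cd "$MAHOUT_HOME" && mvn install)
#   find_job_file "$MAHOUT_HOME"
```

With the Mahout 0.3 release, the lookup would be expected to print a path ending in core/target/mahout-core-0.3.job.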
Testing it on a single machine without a cluster
Running it on the cluster
Command line options
--input (-i) input                         Path to job input directory.
                                           Must be a SequenceFile of
                                           VectorWritable
--clusters (-c) clusters                   The input centroids, as Vectors.
                                           Must be a SequenceFile of
                                           Writable, Cluster/Canopy. If k
                                           is also specified, then a random
                                           set of vectors will be selected
                                           and written out to this path
                                           first
--output (-o) output                       The directory pathname for
                                           output.
--distanceMeasure (-dm) distanceMeasure    The classname of the
                                           DistanceMeasure. Default is
                                           SquaredEuclidean
--convergenceDelta (-cd) convergenceDelta  The convergence delta value.
                                           Default is 0.5
--maxIter (-x) maxIter                     The maximum number of
                                           iterations.
--k (-k) k                                 The k in k-Means. If specified,
                                           then a random selection of k
                                           Vectors will be chosen as the
                                           Centroid and written to the
                                           clusters input path.
--m (-m) m                                 coefficient normalization
                                           factor, must be greater than 1
--overwrite (-ow)                          If present, overwrite the output
                                           directory before running job
--help (-h)                                Print out help
--numMap (-u) numMap                       The number of map tasks.
                                           Defaults to 10
--maxRed (-r) maxRed                       The number of reduce tasks.
                                           Defaults to 2
--emitMostLikely (-e) emitMostLikely       True if clustering should emit
                                           the most likely point only,
                                           false for threshold clustering.
                                           Default is true
--threshold (-t) threshold                 The pdf threshold used for
                                           cluster determination. Default
                                           is 0
--clustering (-cl)                         If present, run clustering after
                                           the iterations have taken place
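To show how the options listed above fit together, here is a sketch that assembles a full fkmeans command line and prints it instead of running it. All paths and numeric values are placeholders chosen for illustration, and build_fkmeans_cmd is a hypothetical helper. The distance measure class shown is one of Mahout's DistanceMeasure implementations; substitute whichever suits your data.

```shell
# Placeholder paths and values; adjust them to your own data.
INPUT=testdata/vectors
CLUSTERS=testdata/initial-clusters
OUTPUT=output
M=2

# build_fkmeans_cmd is a hypothetical helper. It prints the command line
# rather than executing it, so the arguments can be reviewed first.
build_fkmeans_cmd() {
  # The option list states that m must be greater than 1.
  if ! awk -v m="$M" 'BEGIN { exit !(m > 1) }'; then
    echo "m must be greater than 1" >&2
    return 1
  fi
  echo ./bin/mahout fkmeans \
    -i "$INPUT" -c "$CLUSTERS" -o "$OUTPUT" \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -cd 0.5 -x 10 -k 5 -m "$M" -ow -cl
}

build_fkmeans_cmd
```

Because -k is given together with -c, the option list indicates that k randomly selected vectors are written to the clusters path before the iterations start. Once the printed command looks right, it can be run from $MAHOUT_HOME.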