Title: Fuzzy K-Means
# Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of [K-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html),
the popular simple clustering technique. While K-Means discovers hard
clusters (a point belongs to only one cluster), Fuzzy K-Means is a more
statistically formalized method that discovers soft clusters, where a
particular point can belong to more than one cluster with a certain
probability.
#### Algorithm
Like K-Means, Fuzzy K-Means works on objects that can be represented
in an n-dimensional vector space and for which a distance measure is defined.
The algorithm is similar to K-Means:
* Initialize k clusters
* Until converged
    * Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
    * Recompute the cluster centers using the above membership probabilities of points to clusters
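
The two steps inside the loop can be sketched in plain Python as follows. This is a minimal illustration, not Mahout's code: it assumes the standard Fuzzy C-Means membership formula with a fuzziness exponent `m` (a parameter this page does not describe) and a Euclidean distance measure, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point, centers, m=2.0, distance=euclidean):
    """Degree of membership of `point` in each cluster (values sum to 1).

    Assumes the standard Fuzzy C-Means formula; `m` > 1 is the fuzziness.
    """
    dists = [distance(point, c) for c in centers]
    zeros = dists.count(0.0)
    if zeros:
        # The point sits exactly on a center: it belongs fully to it
        # (shared equally if several centers coincide with the point).
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


def update_centers(points, centers, m=2.0, distance=euclidean):
    """Recompute each center as the membership-weighted mean of all points."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    weights = [0.0] * len(centers)
    for p in points:
        for i, u in enumerate(memberships(p, centers, m, distance)):
            w = u ** m
            weights[i] += w
            for d in range(dim):
                sums[i][d] += w * p[d]
    # A center that received no weight at all is left where it was.
    return [
        [s / weights[i] for s in sums[i]] if weights[i] > 0 else list(centers[i])
        for i in range(len(centers))
    ]
```

Calling `update_centers` repeatedly, and stopping once the centers no longer move by more than a chosen threshold, gives the "until converged" loop from the list above.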
#### Design Implementation
The design is similar to the K-Means implementation in Mahout. It accepts an input
file containing vector points. The user can either provide the cluster centers
as input or let the canopy algorithm run and create the initial clusters.
As with K-Means, the program doesn't modify the input directories, and
for every iteration the cluster output is stored in a directory cluster-N.
The code sets the number of reduce tasks equal to the number of map tasks, so
that many part-0 files are created in the clusterN directory. The code uses a
driver/mapper/combiner/reducer as follows:

FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over the
input points and cluster points for a specified number of iterations or until
it converges. During every iteration i, a new cluster-i directory is
created, which contains the modified cluster centers obtained during that
FuzzyKMeans iteration. These are fed as input clusters to the next
iteration. Once Fuzzy KMeans has run for the specified number of
iterations or has converged, a map task is run to output "the point
and the cluster membership to each cluster" pairs as the final output to a
directory named "points".
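
The driver's control flow can be sketched as below. This is an in-memory illustration under stated assumptions, not the actual FuzzyKMeansDriver: a dictionary stands in for the output directories, convergence is assumed to mean that no center moved farther than a threshold (here called `convergence_delta`), and the `update_centers` and `memberships` callables are hypothetical placeholders for the per-iteration job and the final membership pass.

```python
import math


def run_driver(points, initial_centers, update_centers, memberships,
               max_iterations=10, convergence_delta=0.5):
    """Iterate until converged or out of iterations, then emit memberships.

    Keys of the returned dict mimic the output directories described above.
    """
    output = {}
    centers = initial_centers
    for i in range(1, max_iterations + 1):
        new_centers = update_centers(points, centers)
        # The centers written in iteration i are the input of iteration i + 1.
        output[f"cluster-{i}"] = new_centers
        # Assumed convergence test: the largest center movement is small enough.
        moved = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    # Final pass: pair every point with its membership in each cluster.
    output["points"] = [(p, memberships(p, centers)) for p in points]
    return output
```

Any implementation of the center-update and membership steps can be passed in, as long as `update_centers(points, centers)` returns the new centers and `memberships(point, centers)` returns one value per cluster.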

FuzzyKMeansMapper - Reads the input clusters during its configure() method,
then computes the cluster membership probability of a point for each
cluster. Cluster membership is inversely proportional to the distance,
which is computed using the user-supplied distance measure. The output key
is the encoded clusterId. Output values are ClusterObservations containing
observation statistics.

FuzzyKMeansCombiner - Receives all key:value pairs from the mapper and
produces partial sums of the cluster membership probability times the input
vectors for each cluster. The output key is the encoded cluster identifier. Output
values are ClusterObservations containing observation statistics.

FuzzyKMeansReducer - Multiple reducers each receive certain keys and all values
associated with those keys. The reducer sums the values to produce a new
centroid for the cluster, which is output. The output key is the encoded cluster
identifier (e.g. "C14"). The output value is the formatted cluster identifier (e.g.
"C14"). The reducer encodes unconverged clusters with a 'Cn' clusterId and
converged clusters with a 'Vn' clusterId.
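
The data flow through the mapper, combiner, and reducer can be illustrated with the following sketch. It is not the Mahout implementation: plain `(weight, weighted_vector)` tuples stand in for ClusterObservations, integer indices stand in for encoded cluster identifiers, the membership formula is the standard Fuzzy C-Means one with an assumed fuzziness `m`, and the convergence test (center movement at most `convergence_delta`) is likewise an assumption.

```python
import math


def point_memberships(point, centers, m=2.0):
    """Membership of `point` in each cluster; closer centers get more weight."""
    dists = [math.dist(point, c) for c in centers]
    zeros = dists.count(0.0)
    if zeros:
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


def map_point(point, centers, m=2.0):
    """Mapper: emit (cluster_key, (weight, weight * point)) for every cluster."""
    for key, u in enumerate(point_memberships(point, centers, m)):
        w = u ** m
        yield key, (w, [w * x for x in point])


def combine(pairs):
    """Combiner: sum the weights and weighted vectors per cluster key."""
    partial = {}
    for key, (w, vec) in pairs:
        if key in partial:
            pw, pvec = partial[key]
            partial[key] = (pw + w, [a + b for a, b in zip(pvec, vec)])
        else:
            partial[key] = (w, list(vec))
    return list(partial.items())


def reduce_cluster(key, values, old_center, convergence_delta=0.5):
    """Reducer: merge partial sums into a new centroid and label the cluster.

    The label starts with 'V' if the cluster converged and 'C' otherwise.
    """
    total_w = sum(w for w, _ in values)
    if total_w == 0:
        center = list(old_center)  # no weight received: keep the old center
    else:
        center = [sum(vec[d] for _, vec in values) / total_w
                  for d in range(len(old_center))]
    converged = math.dist(old_center, center) <= convergence_delta
    return ("V" if converged else "C") + str(key), center
```

Because the combiner only adds partial sums, the reducer produces the same centroid whether it receives one combined value per key or several partial ones from different map tasks.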
## Running Fuzzy k-Means Clustering
The Fuzzy k-Means clustering algorithm may be run using a command-line
invocation on FuzzyKMeansDriver.main or by making a Java call to
FuzzyKMeansDriver.run().
Invocation using the command line takes the form:

    bin/mahout fkmeans \
        -i \
        -c \
        -o