Apache Mahout > Mahout Wiki > Algorithms > Fuzzy K-Means |
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. While K-Means discovers hard clusters (a point belong to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers soft clusters where a particular point can belong to more than one cluster with certain probability.
Like K-Means, Fuzzy K-Means works on those objects which can be represented in n-dimensional vector space and a distance measure is defined.
The algorithm is similar to k-means.
The design is similar to K-Means present in Mahout. It accepts an input file containing vector points. User can either provide the cluster centers as input or can allow canopy algorithm to run and create initial clusters.
Similar to K-Means, the program doesn't modify the input directories. And for every iteration, the cluster output is stored in a directory cluster-N. The code has set number of reduce tasks equal to number of map tasks. So, those many part-0
* files are created in clusterN directory. The code uses driver/mapper/combiner/reducer as follows:
FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over input points and cluster points for specified number of iterations or until it is converged.During every iteration i, a new cluster-i directory is created which contains the modified cluster centers obtained during FuzzyKMeans iteration. This will be feeded as input clusters in the next iteration. Once Fuzzy KMeans is run for specified number of iterations or until it is converged, a map task is run to output "the point and the cluster membership to each cluster" pair as final output to a directory named "points".
FuzzyKMeansMapper - reads the input cluster during its configure() method, then computes cluster membership probability of a point to each cluster.Cluster membership is inversely propotional to the distance. Distance is computed using user supplied distance measure. Output key is encoded clusterId. Output values are ClusterObservations containing observation statistics.
FuzzyKMeansCombiner - receives all key:value pairs from the mapper and produces partial sums of the cluster membership probability times input vectors for each cluster. Output key is: encoded cluster identifier. Output values are ClusterObservations containing observation statistics.
FuzzyKMeansReducer - Multiple reducers receives certain keys and all values associated with those keys. The reducer sums the values to produce a new centroid for the cluster which is output. Output key is: encoded cluster identifier (e.g. "C14". Output value is: formatted cluster identifier (e.g. "C14"). The reducer encodes unconverged clusters with a 'Cn' cluster Id and converged clusters with 'Vn' clusterId.
The Fuzzy k-Means clustering algorithm may be run using a command-line invocation on FuzzyKMeansDriver.main or by making a Java call to FuzzyKMeansDriver.run().
Invocation using the command line takes the form:
bin/mahout fkmeans \ -i <input vectors directory> \ -c <input clusters directory> \ -o <output working directory> \ -dm <DistanceMeasure> \ -m <fuzziness argument >1> \ -x <maximum number of iterations> \ -k <optional number of initial clusters to sample from input vectors> \ -cd <optional convergence delta. Default is 0.5> \ -ow <overwrite output directory if present> -cl <run input vector clustering after computing Clusters> -e <emit vectors to most likely cluster during clustering> -t <threshold to use for clustering if -e is false> -xm <execution method: sequential or mapreduce>
Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.
Invocation using Java involves supplying the following arguments:
After running the algorithm, the output directory will contain:
The following images illustrate Fuzzy k-Means clustering applied to a set of randomly-generated 2-d data points. The points are generated using a normal distribution centered at a mean location and with a constant standard deviation. See the README file in the /examples/src/main/java/org/apache/mahout/clustering/display/README.txt for details on running similar examples.
The points are generated as follows:
In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed.
In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of the clusters in each recent iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in [orange, yellow, green, blue, violet and gray]. Although it misses a lot of the points and cannot capture the original, superimposed cluster centers, it does a decent job of clustering this data.
The third image shows the results of running Fuzzy k-Means on a different data set (see Dirichlet Process Clustering for details) which is generated using asymmetrical standard deviations. Fuzzy k-Means does a fair job handling this data set as well.