Title: Visualizing Sample Clusters
# Introduction
Mahout provides examples that visualize the sample clusters created by
various clustering algorithms, such as:
* Canopy Clustering
* Dirichlet Process
* KMeans
* Fuzzy KMeans
* MeanShift Canopy
* Spectral KMeans
* MinHash
##### Note
These are Swing programs. You must run them from a window system on the
same machine, or while logged in through a "remote desktop" or VNC program.
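
Before launching one of the display programs, it can help to check whether the JVM sees a usable display at all. The following is a minimal sketch, not part of Mahout: the class name DisplayCheckSketch and its messages are hypothetical, and it relies only on the standard JDK call GraphicsEnvironment.isHeadless().

```java
import java.awt.GraphicsEnvironment;

public class DisplayCheckSketch {

  /** Turns the headless flag into a human-readable hint (hypothetical wording). */
  public static String describe(boolean headless) {
    return headless
        ? "No display available: Swing windows cannot be opened from this session."
        : "Display available: Swing windows can be opened.";
  }

  public static void main(String[] args) {
    // isHeadless() reports true when no display, keyboard, or mouse is supported.
    System.out.println(describe(GraphicsEnvironment.isHeadless()));
  }
}
```

If the check reports that no display is available, log in through a window system, remote desktop, or VNC session before running the display classes.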
# Pre-Prep
To visualize the clusters, run the Java classes in the
org.apache.mahout.clustering.display package of the mahout-examples module.
If you are using Eclipse, set up mahout-examples as a project as described
in [Working with Maven in Eclipse](buildingmahout#mahout_maven_eclipse.html).
# Visualizing clusters
The following classes in org.apache.mahout.clustering.display can be run
without parameters to generate a sample data set and run the reference
clustering implementations over it:
1. DisplayClustering - generates 1000 samples from three symmetric
distributions and displays the initial areas of the generated points. This
is the same data set that is used by the following clustering programs. It
displays the points on the screen and superimposes the model parameters
that were used to generate the points. You can edit the generateSamples()
method to change the sample points used by these programs.
1. DisplayDirichlet - uses Dirichlet Process clustering
1. DisplayCanopy - uses Canopy clustering
1. DisplayKMeans - uses k-Means clustering
1. DisplayFuzzyKMeans - uses Fuzzy k-Means clustering
1. DisplayMeanShift - uses MeanShift clustering
1. DisplaySpectralKMeans - uses Spectral k-Means clustering, run as a MapReduce job
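
To make the shared data set described for DisplayClustering more concrete, here is a minimal sketch of drawing 1000 two-dimensional points from three symmetric normal distributions. It is not Mahout's generateSamples() code: the class name, the per-distribution counts, the centers, and the standard deviations are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SampleGeneratorSketch {

  /** Appends num points drawn from a symmetric normal centered at (mx, my) with std dev sd. */
  public static void generateSamples(List<double[]> out, Random rng,
                                     int num, double mx, double my, double sd) {
    for (int i = 0; i < num; i++) {
      // The same sd on both axes makes the distribution symmetric (circular).
      out.add(new double[] {mx + rng.nextGaussian() * sd, my + rng.nextGaussian() * sd});
    }
  }

  /** Builds 1000 points from three distributions; all parameters are assumed values. */
  public static List<double[]> generateSamples(long seed) {
    Random rng = new Random(seed);
    List<double[]> points = new ArrayList<>();
    generateSamples(points, rng, 400, 1.0, 1.0, 3.0);
    generateSamples(points, rng, 300, 1.0, 0.0, 0.5);
    generateSamples(points, rng, 300, 0.0, 2.0, 0.1);
    return points;
  }

  public static void main(String[] args) {
    List<double[]> points = generateSamples(42L);
    System.out.println("generated " + points.size() + " points");
  }
}
```

Changing the counts, centers, or standard deviations in a method like this is the kind of edit that alters the sample points every display program works with.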
If you are using Eclipse and have set it up as specified in Pre-Prep, just
right-click on each of the classes mentioned above and choose "Run As -
Java Application". To run these directly from the command line:

    cd $MAHOUT_HOME/examples
    # substitute other names above for DisplayClustering
    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering

Note: the DisplaySpectralKMeans program runs a Hadoop job that takes about 3
minutes on a laptop. Give the program enough memory through the Maven JVM
options, for example MAVEN_OPTS=-Xmx300m. You may find that some of the
other programs also need more memory.

Note:
* Some of these programs display the sample points and then superimpose all
of the clusters from each iteration. The last iteration's clusters are in
bold red and the previous several are colored (orange, yellow, green, blue,
magenta) in that order, after which all earlier clusters are in light grey.
This helps to visualize how the clusters converge upon a solution over
multiple iterations.
* By changing the parameter values (k, ALPHA_0, numIterations) and the
display SIGNIFICANCE you can obtain different results.
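
The color scheme described in the first note can be pictured as a lookup from how old an iteration is to the color its clusters are drawn in. The sketch below is an assumption-laden illustration, not the code in the display classes: the class and method names and the stroke widths are hypothetical, and only the color order comes from the note above.

```java
import java.awt.Color;

public class IterationColorsSketch {

  // Newest iteration first: red, then orange, yellow, green, blue, magenta.
  private static final Color[] RECENT = {
    Color.RED, Color.ORANGE, Color.YELLOW, Color.GREEN, Color.BLUE, Color.MAGENTA
  };

  /** age 0 is the last iteration; larger ages are earlier iterations. */
  public static Color colorForAge(int age) {
    if (age < 0) {
      throw new IllegalArgumentException("age must be non-negative");
    }
    // Anything older than the named colors falls back to light grey.
    return age < RECENT.length ? RECENT[age] : Color.LIGHT_GRAY;
  }

  /** Only the last iteration is drawn bold; the widths are assumed values. */
  public static float strokeWidthForAge(int age) {
    return age == 0 ? 3.0f : 1.0f;
  }

  public static void main(String[] args) {
    for (int age = 0; age < 8; age++) {
      System.out.println(age + " -> " + colorForAge(age) + ", width " + strokeWidthForAge(age));
    }
  }
}
```

Drawing the oldest iterations first and the newest last keeps the bold red clusters of the final iteration on top.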
# Screen Capture Animation
See [Sample Clusters Animation](sample-clusters-animation.html)
for screen captures of all of the above programs and an animated GIF.