Title: Canopy Clustering
# Canopy Clustering
[Canopy Clustering](http://www.kamalnigam.com/papers/canopy-kdd00.pdf)
is a very simple, fast and surprisingly accurate method for grouping
objects into clusters. Each object is represented as a point in a
multidimensional feature space. The algorithm uses a fast approximate
distance metric and two distance thresholds T1 > T2 for processing. The
basic algorithm is to begin with a set of points and remove one at random.
Create a Canopy containing this point and iterate through the remainder of
the point set. At each point, if its distance from the first point is < T1,
then add the point to the canopy. If, in addition, the distance is < T2,
then remove the point from the set. This way, points that are very close to
the original will avoid all further processing. The algorithm loops until
the initial set is empty, accumulating a set of Canopies, each containing
one or more points. A given point may occur in more than one Canopy.
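
The basic algorithm described above can be sketched in plain Java as follows. This is a minimal illustration, not Mahout's implementation: the class name `CanopySketch` is hypothetical, Euclidean distance stands in for the "fast approximate" metric, and the next remaining point is taken instead of a random one so that the result is reproducible.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class CanopySketch {

    /** Euclidean distance (assumption: stands in for the cheap approximate metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds canopies with thresholds t1 > t2; each canopy is a list of points. */
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Simplification: take the first remaining point instead of a random one.
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d < t1) {
                    canopy.add(p);   // within T1: member of this canopy
                }
                if (d < t2) {
                    it.remove();     // within T2: excluded from further processing
                }
            }
            result.add(canopy);
        }
        return result;
    }
}
```

With the points (0,0), (1,0), (2.5,0) and (10,0), T1 = 3 and T2 = 1.5, the sketch yields three canopies, and (2.5,0) appears in two of them because it lies within T1 but not within T2 of the first center.
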
Canopy Clustering is often used as an initial step in more rigorous
clustering techniques, such as [K-Means Clustering](k-means-clustering.html).
By starting with an initial clustering, the number of more expensive
distance measurements can be significantly reduced by ignoring points
outside of the initial canopies.
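
One way to picture this saving is a nearest-center search that only measures the expensive distance when a point and a center share at least one canopy. The sketch below is an illustration under assumptions: the class `CanopyPruningSketch`, the representation of canopy membership as sets of integer ids, and the use of Euclidean distance as the "expensive" measure are all hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class CanopyPruningSketch {

    /** Stand-in for the expensive distance measure (assumption: Euclidean). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the nearest center among those sharing a canopy
     * with the point, or -1 if no center shares a canopy with it.
     */
    static int nearestCenter(double[] point, Set<Integer> pointCanopies,
                             List<double[]> centers, List<Set<Integer>> centerCanopies) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            // Skip the expensive measurement when no canopy is shared.
            if (Collections.disjoint(pointCanopies, centerCanopies.get(i))) {
                continue;
            }
            double d = distance(point, centers.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

Centers outside the point's canopies are never measured, so the result is only as good as the canopies themselves; this is the trade-off the cheap first pass makes.
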
## Strategy for parallelization
Looking at the sample Hadoop implementation in [http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/)
the processing is done in the following M/R steps:
1. The data is massaged into suitable input format
1. Each mapper performs canopy clustering on the points in its input set and
outputs its canopies' centers
1. The reducer clusters the canopy centers to produce the final canopy
centers
1. The points are then clustered into these final canopies
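
The mapper and reducer steps of the list above can be simulated in a single process to show how the pieces fit together. This is a rough sketch rather than the code of the referenced project: the class `ParallelCanopySketch` is hypothetical, each inner list plays the role of one mapper's input split, the seed point of a canopy is used as its center, and only T2 appears because only centers are produced here.

```java
import java.util.ArrayList;
import java.util.List;

final class ParallelCanopySketch {

    /** Euclidean distance (assumption: stands in for the chosen measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Canopy pass that keeps only the centers (here: the seed point of each canopy). */
    static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center);
            // Points within T2 of the center can no longer seed a canopy.
            remaining.removeIf(p -> distance(center, p) < t2);
        }
        return centers;
    }

    /** One "mapper" pass per split, then a single "reducer" pass over all centers. */
    static List<double[]> finalCenters(List<List<double[]>> splits, double t2) {
        List<double[]> mapperOutput = new ArrayList<>();
        for (List<double[]> split : splits) {
            mapperOutput.addAll(canopyCenters(split, t2));
        }
        return canopyCenters(mapperOutput, t2);
    }
}
```

Because the reducer applies the same thresholds to the mappers' centers, centers that different mappers found for the same region collapse into one final center.
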
Some ideas can be found in the [Cluster computing and MapReduce](http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html)
lecture video series by Google; Canopy Clustering is discussed in [lecture #4](http://www.youtube.com/watch?v=1ZDybXl212Q).
Slides can be found [here](https://code.google.com/edu/submissions/mapreduce-minilecture/lec4-clustering.ppt).
Finally, here is the [Wikipedia page](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm).
## Design of implementation
The implementation accepts as input Hadoop SequenceFiles containing
multidimensional points (VectorWritable). Points may be expressed either as
dense or sparse Vectors and processing is done in two phases: Canopy
generation and, optionally, Clustering.
### Canopy generation phase
During the map step, each mapper processes a subset of the total points and
applies the chosen distance measure and thresholds to generate canopies. In
the mapper, each point found to be within an existing canopy is added to
that canopy, which is kept in an internal list of Canopies. After observing
all its input vectors, the mapper updates all of its Canopies and normalizes
their totals to produce canopy centroids, which are output using a constant
key ("centroid") to a single reducer. The reducer receives all of the initial
centroids and again applies the canopy measure and thresholds to produce a
final set of canopy centroids which is output (i.e. clustering the cluster
centroids). The reducer output format is: SequenceFile(Text, Canopy) with
the _key_ encoding the canopy identifier.
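
The phrase "normalizes their totals" can be read as dividing a running per-dimension sum by the number of points observed. The sketch below shows that reading only; the class `CanopyCentroidSketch` is hypothetical and is not Mahout's Canopy class.

```java
final class CanopyCentroidSketch {
    private final double[] total; // running per-dimension sum of observed points
    private int count;            // number of points observed so far

    /** Starts a canopy from its first point. */
    CanopyCentroidSketch(double[] first) {
        this.total = first.clone();
        this.count = 1;
    }

    /** Adds one more point to the running totals. */
    void observe(double[] point) {
        for (int i = 0; i < total.length; i++) {
            total[i] += point[i];
        }
        count++;
    }

    /** Normalizes the totals by the point count to obtain the centroid. */
    double[] centroid() {
        double[] c = new double[total.length];
        for (int i = 0; i < total.length; i++) {
            c[i] = total[i] / count;
        }
        return c;
    }
}
```

Keeping only a sum and a count means a mapper never has to hold the member points of a canopy in memory in order to emit its centroid.
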
### Clustering phase
During the clustering phase, each mapper reads the Canopies produced by the
first phase. Since all mappers have the same canopy definitions, their
outputs will be combined during the shuffle so that each reducer (many are
allowed here) will see all of the points assigned to one or more canopies.
The output format will then be: SequenceFile(IntWritable,
WeightedVectorWritable) with the _key_ encoding the canopyId. The
WeightedVectorWritable has two fields: a double weight and a VectorWritable
vector. Together they encode the probability that each vector is a member
of the given canopy.
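
As an illustration of the output shape, the sketch below emits one (canopyId, weight, vector) record for every canopy whose center lies within T1 of a point. Everything here is an assumption made for the example: the class `CanopyAssignmentSketch`, the plain `Assignment` type standing in for the IntWritable key and WeightedVectorWritable value, the within-T1 membership rule, the Euclidean distance, and the constant weight of 1.0.

```java
import java.util.ArrayList;
import java.util.List;

final class CanopyAssignmentSketch {

    /** Plain stand-in for one output record: key canopyId, value (weight, vector). */
    static final class Assignment {
        final int canopyId;
        final double weight;
        final double[] vector;

        Assignment(int canopyId, double weight, double[] vector) {
            this.canopyId = canopyId;
            this.weight = weight;
            this.vector = vector;
        }
    }

    /** Euclidean distance (assumption: stands in for the chosen measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Emits a record for each (point, canopy) pair whose distance is below t1. */
    static List<Assignment> assign(List<double[]> points, List<double[]> centers, double t1) {
        List<Assignment> out = new ArrayList<>();
        for (double[] p : points) {
            for (int id = 0; id < centers.size(); id++) {
                if (distance(p, centers.get(id)) < t1) {
                    // Assumed weight: 1.0 marks plain membership in this sketch.
                    out.add(new Assignment(id, 1.0, p));
                }
            }
        }
        return out;
    }
}
```

Grouping the emitted records by `canopyId` mirrors what the shuffle does before each reducer sees the points of its canopies.
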
## Running Canopy Clustering
The canopy clustering algorithm may be run using a command-line invocation
on CanopyDriver.main or by making a Java call to CanopyDriver.run(...).
Both require several arguments.
Invocation using the command line takes the form:
bin/mahout canopy \
-i \
-o