Title: Dirichlet Process Clustering
# Overview
The Dirichlet Process Clustering algorithm performs Bayesian mixture
modeling.
The idea is to use a probabilistic mixture of a number of models to explain
some observed data. Each observed data point is assumed to have come from
one of the models in the mixture, but we don't know which. We deal with
that uncertainty by introducing a so-called latent parameter that specifies
which model each data point came from.
In addition, since this is a Bayesian clustering algorithm, we don't commit
to any single explanation; instead we sample from the distribution over
models and latent assignments of data points to models, given the observed
data and the prior distributions of the model parameters. The sampling
process is initialized by drawing models at random from the prior
distribution for models.
Then, we iteratively assign points to the different models using the
mixture probabilities and the degree of fit between each point and each
model, expressed as the probability that the point was generated by that
model. After points are assigned, new parameters for each model are sampled
from the posterior distribution of the model parameters, given all of the
observed data points assigned to that model. Models without any assigned
data points are also resampled, but since they have no points assigned, the
new samples are effectively drawn from the prior distribution for model
parameters.
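This sampling loop can be sketched as follows. This is a minimal
illustration, not Mahout's implementation: the Model interface, its pdf()
and samplePosterior() methods, and the fixed mixture weights are all
assumptions made for the sketch (in the full algorithm the mixture weights
are themselves resampled from a Dirichlet posterior on each pass).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// A sketch of one sampling iteration; Model, pdf(), and samplePosterior()
// are illustrative assumptions, not Mahout's actual classes.
public class GibbsSketch {

  interface Model {
    double pdf(double[] point);                     // likelihood of a point under this model
    Model samplePosterior(List<double[]> assigned); // resample parameters; empty list => draw from the prior
  }

  static void iterate(double[][] points, Model[] models, double[] mixture, Random rng) {
    List<List<double[]>> assigned = new ArrayList<>();
    for (int k = 0; k < models.length; k++) {
      assigned.add(new ArrayList<double[]>());
    }
    for (double[] x : points) {
      // p(k | x) is proportional to the mixture weight times the model's pdf
      double[] p = new double[models.length];
      double total = 0.0;
      for (int k = 0; k < models.length; k++) {
        p[k] = mixture[k] * models[k].pdf(x);
        total += p[k];
      }
      // draw a cluster index from the normalized probabilities
      double u = rng.nextDouble() * total;
      int k = 0;
      while (k < models.length - 1 && (u -= p[k]) > 0) {
        k++;
      }
      assigned.get(k).add(x);
    }
    // resample every model from its posterior; empty clusters effectively
    // draw fresh parameters from the prior
    for (int k = 0; k < models.length; k++) {
      models[k] = models[k].samplePosterior(assigned.get(k));
    }
  }
}
```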
The result is a set of samples, each representing mixing probabilities,
models, and an assignment of points to models. If the total number of
possible models is substantially larger than the number that ever have
points assigned to them, then this algorithm provides a (nearly)
non-parametric clustering algorithm. These samples give us interesting
information that is lacking from a conventional clustering that consists of
a single assignment of points to clusters. Firstly, by examining the number
of models in each sample that actually have points assigned to them, we can
get information about how many models (clusters) the data support.
Moreover, by examining how often two points are assigned to the same model,
we can get an approximate measure of how likely these points are to be
explained by the same model. Such soft membership information is difficult
to come by with conventional clustering methods.
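For example, the pairwise co-assignment frequency can be read directly off
the samples. The sketch below assumes each sample has been reduced to an
int[] mapping point index to cluster id, which is an illustrative
representation rather than Mahout's on-disk format:

```java
import java.util.List;

// Sketch: the fraction of samples in which points i and j land in the
// same cluster, a soft-membership measure. Each sample is an int[]
// mapping point index -> cluster id (illustrative representation).
public class CoAssignment {
  static double coAssignmentProbability(List<int[]> samples, int i, int j) {
    int together = 0;
    for (int[] assignment : samples) {
      if (assignment[i] == assignment[j]) {
        together++;
      }
    }
    return (double) together / samples.size();
  }
}
```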
Finally, we can get an idea of how stably the data can be described.
Typically, aspects of the data with lots of data available wind up with
stable descriptions, while at the edges there are phenomena that we can't
commit to a solid description of, even though it is clear that the
well-supported explanations are insufficient to account for them. One
difficulty with these samples is that we can't always establish a
correspondence between the models appearing in different samples. Probably
the best way to do this is to look for overlap in the assignments of data
points to the different models, as sketched below.
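One simple way to measure that overlap, assuming each model's assigned
points have been collected into a set of point indices (again an
illustrative representation), is the Jaccard similarity between the point
sets of two models drawn from different samples:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: Jaccard overlap between the sets of point indices assigned to
// two models from different samples; a high value suggests the two
// models describe the same cluster.
public class ModelOverlap {
  static double jaccard(Set<Integer> modelA, Set<Integer> modelB) {
    Set<Integer> intersection = new HashSet<>(modelA);
    intersection.retainAll(modelB);
    Set<Integer> union = new HashSet<>(modelA);
    union.addAll(modelB);
    return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
  }
}
```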
## Design of Implementation
The implementation accepts one input directory containing the data points
to be clustered. The data directory contains multiple input files of
SequenceFile(key, VectorWritable). The input data points are not modified
by the implementation, allowing experimentation with initial clustering and
convergence values.
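For illustration, input in this format might be produced with the Hadoop
and Mahout APIs roughly as follows; the path and key naming here are
assumptions for the sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Sketch: write points as SequenceFile(Text, VectorWritable) input.
// The path and key naming are illustrative, not required by the job.
public class WriteInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("input/points/part-00000"); // illustrative path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
    double[][] points = {{1.0, 1.0}, {8.0, 8.0}, {1.2, 0.8}};
    for (int i = 0; i < points.length; i++) {
      writer.append(new Text(String.valueOf(i)),
                    new VectorWritable(new DenseVector(points[i])));
    }
    writer.close();
  }
}
```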
The program iterates over the input points, outputting a new directory
"clusters-N" containing SequenceFile(Text, DirichletCluster) files for each
iteration N. This process uses a mapper/reducer/driver as follows:
- _DirichletMapper_ - reads the input clusters during its configure()
method, then assigns each input point to a probable cluster, as defined by
the model's pdf() function, and outputs it. Output _key_ is: clusterId.
Output _value_ is: input point.
- _DirichletReducer_ - reads the input clusters during its configure()
method; each reducer then receives clusterId:VectorWritable pairs from all
mappers and accumulates them to produce a new posterior model for each
cluster, which is output. Output _key_ is: clusterId. Output _value_ is:
DirichletCluster. Reducer outputs are used as the input clusters for the
next iteration.
- _DirichletDriver_ - iterates over the points and clusters until the given
number of iterations has been reached (see the sketch of this loop below).
During the iterations, a new "clusters-N" directory is produced for each
pass, with the output clusters from one iteration used as input to the
next. A final, optional pass over the data using the DirichletClusterMapper
clusters all points into an output directory "clusteredPoints"; this pass
has no combiner or reducer steps.
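The driver's iteration pattern can be sketched as below; runIteration() is
an illustrative stand-in for configuring and submitting one
DirichletMapper/DirichletReducer job, not the real DirichletDriver code:

```java
// Sketch of the driver's iteration pattern; paths and runIteration()
// are illustrative stand-ins for the real DirichletDriver internals.
public class DriverSketch {

  // Stand-in for one map/reduce pass; the real driver configures and
  // submits a Hadoop job here.
  static void runIteration(String input, String clustersIn, String clustersOut) {
    // submit DirichletMapper/DirichletReducer job (omitted in this sketch)
  }

  static void runJobSketch(String input, String output, int maxIterations) {
    String clustersIn = output + "/clusters-0"; // initial clusters sampled from the prior
    for (int i = 1; i <= maxIterations; i++) {
      String clustersOut = output + "/clusters-" + i;
      runIteration(input, clustersIn, clustersOut);
      clustersIn = clustersOut; // this iteration's output feeds the next
    }
  }
}
```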
## Running Dirichlet Process Clustering
The Dirichlet clustering algorithm may be run using a command-line
invocation on DirichletDriver.main or by making a Java call to
DirichletDriver.runJob().
Invocation using the command line takes the form:
```
bin/mahout dirichlet \
    -i <input> \
    -o <output>
```