Title: Dirichlet Process Clustering
# Overview
The Dirichlet Process Clustering algorithm performs Bayesian mixture
modeling.
The idea is to use a probabilistic mixture of a number of models to explain
some observed data. Each observed data point is assumed to have come from
one of the models in the mixture, but we don't know which. We deal with
that uncertainty by introducing a so-called latent parameter that specifies
which model each data point came from.
In addition, since this is a Bayesian clustering algorithm, we don't commit
to any single explanation; instead we sample from the distribution over
models and latent assignments of data points to models, given the observed
data and the prior distributions of the model parameters. The sampling
process is initialized by drawing models at random from the prior
distribution for models.
Then, we iteratively assign points to the different models using the
mixture probabilities and the degree of fit between each point and each
model, expressed as the probability that the point was generated by that
model. After points are assigned, new parameters for each model are sampled
from the posterior distribution of the model parameters, given all of the
observed data points assigned to that model. Models without any assigned
data points are also resampled, but since they have no points assigned, the
new samples are effectively drawn from the prior distribution for model
parameters.
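This sampling loop can be sketched as follows. This is a minimal
illustration, not Mahout's implementation: the Model interface, its pdf()
and samplePosterior() methods, and the fixed mixture weights are all
assumptions made for the sketch (in the full algorithm the mixture weights
are themselves resampled from a Dirichlet posterior on each pass).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// A sketch of one sampling iteration; Model, pdf(), and samplePosterior()
// are illustrative assumptions, not Mahout's actual classes.
public class GibbsSketch {

  interface Model {
    double pdf(double[] point);                     // likelihood of a point under this model
    Model samplePosterior(List<double[]> assigned); // resample parameters; empty list => draw from the prior
  }

  static void iterate(double[][] points, Model[] models, double[] mixture, Random rng) {
    List<List<double[]>> assigned = new ArrayList<>();
    for (int k = 0; k < models.length; k++) {
      assigned.add(new ArrayList<double[]>());
    }
    for (double[] x : points) {
      // p(k | x) is proportional to the mixture weight times the model's pdf
      double[] p = new double[models.length];
      double total = 0.0;
      for (int k = 0; k < models.length; k++) {
        p[k] = mixture[k] * models[k].pdf(x);
        total += p[k];
      }
      // draw a cluster index from the normalized probabilities
      double u = rng.nextDouble() * total;
      int k = 0;
      while (k < models.length - 1 && (u -= p[k]) > 0) {
        k++;
      }
      assigned.get(k).add(x);
    }
    // resample every model from its posterior; empty clusters effectively
    // draw fresh parameters from the prior
    for (int k = 0; k < models.length; k++) {
      models[k] = models[k].samplePosterior(assigned.get(k));
    }
  }
}
```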
The result is a set of samples, each representing mixing probabilities,
models, and an assignment of points to models. If the total number of
possible models is substantially larger than the number that ever have
points assigned to them, then this algorithm provides a (nearly)
non-parametric clustering algorithm. These samples give us interesting
information that is lacking from a conventional clustering that consists of
a single assignment of points to clusters. Firstly, by examining the number
of models in each sample that actually have points assigned to them, we can
get information about how many models (clusters) the data support.
Moreover, by examining how often two points are assigned to the same model,
we can get an approximate measure of how likely these points are to be
explained by the same model. Such soft membership information is difficult
to come by with conventional clustering methods.
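For example, the pairwise co-assignment frequency can be read directly off
the samples. The sketch below assumes each sample has been reduced to an
int[] mapping point index to cluster id, which is an illustrative
representation rather than Mahout's on-disk format:

```java
import java.util.List;

// Sketch: the fraction of samples in which points i and j land in the
// same cluster, a soft-membership measure. Each sample is an int[]
// mapping point index -> cluster id (illustrative representation).
public class CoAssignment {
  static double coAssignmentProbability(List<int[]> samples, int i, int j) {
    int together = 0;
    for (int[] assignment : samples) {
      if (assignment[i] == assignment[j]) {
        together++;
      }
    }
    return (double) together / samples.size();
  }
}
```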
Finally, we can get an idea of how stably the data can be described.
Typically, aspects of the data with lots of data available wind up with
stable descriptions, while at the edges there are phenomena that we can't
commit to a solid description of, even though it is clear that the
well-supported explanations are insufficient to account for them. One
difficulty with these samples is that we can't always establish a
correspondence between the models appearing in different samples. Probably
the best way to do this is to look for overlap in the assignments of data
points to the different models, as sketched below.
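One simple way to measure that overlap, assuming each model's assigned
points have been collected into a set of point indices (again an
illustrative representation), is the Jaccard similarity between the point
sets of two models drawn from different samples:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: Jaccard overlap between the sets of point indices assigned to
// two models from different samples; a high value suggests the two
// models describe the same cluster.
public class ModelOverlap {
  static double jaccard(Set<Integer> modelA, Set<Integer> modelB) {
    Set<Integer> intersection = new HashSet<>(modelA);
    intersection.retainAll(modelB);
    Set<Integer> union = new HashSet<>(modelA);
    union.addAll(modelB);
    return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
  }
}
```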
## Design of Implementation
The implementation accepts one input directory containing the data points
to be clustered. The data directory contains multiple input files of
SequenceFile(key, VectorWritable). The input data points are not modified
by the implementation, allowing experimentation with initial clustering and
convergence values.
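For illustration, input in this format might be produced with the Hadoop
and Mahout APIs roughly as follows; the path and key naming here are
assumptions for the sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Sketch: write points as SequenceFile(Text, VectorWritable) input.
// The path and key naming are illustrative, not required by the job.
public class WriteInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("input/points/part-00000"); // illustrative path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
    double[][] points = {{1.0, 1.0}, {8.0, 8.0}, {1.2, 0.8}};
    for (int i = 0; i < points.length; i++) {
      writer.append(new Text(String.valueOf(i)),
                    new VectorWritable(new DenseVector(points[i])));
    }
    writer.close();
  }
}
```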
The program iterates over the input points, outputting a new directory
"clusters-N" containing SequenceFile(Text, DirichletCluster) files for each
iteration N. This process uses a mapper/reducer/driver as follows:
- _DirichletMapper_ - reads the input clusters during its configure()
method, then assigns each input point to a probable cluster, as defined by
the model's pdf() function, and outputs it. Output _key_ is: clusterId.
Output _value_ is: input point.
- _DirichletReducer_ - reads the input clusters during its configure()
method; each reducer then receives clusterId:VectorWritable pairs from all
mappers and accumulates them to produce a new posterior model for each
cluster, which is output. Output _key_ is: clusterId. Output _value_ is:
DirichletCluster. Reducer outputs are used as the input clusters for the
next iteration.
- _DirichletDriver_ - iterates over the points and clusters until the given
number of iterations has been reached (see the sketch of this loop below).
During the iterations, a new "clusters-N" directory is produced for each
pass, with the output clusters from one iteration used as input to the
next. A final, optional pass over the data using the DirichletClusterMapper
clusters all points into an output directory "clusteredPoints"; this pass
has no combiner or reducer steps.
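The driver's iteration pattern can be sketched as below; runIteration() is
an illustrative stand-in for configuring and submitting one
DirichletMapper/DirichletReducer job, not the real DirichletDriver code:

```java
// Sketch of the driver's iteration pattern; paths and runIteration()
// are illustrative stand-ins for the real DirichletDriver internals.
public class DriverSketch {

  // Stand-in for one map/reduce pass; the real driver configures and
  // submits a Hadoop job here.
  static void runIteration(String input, String clustersIn, String clustersOut) {
    // submit DirichletMapper/DirichletReducer job (omitted in this sketch)
  }

  static void runJobSketch(String input, String output, int maxIterations) {
    String clustersIn = output + "/clusters-0"; // initial clusters sampled from the prior
    for (int i = 1; i <= maxIterations; i++) {
      String clustersOut = output + "/clusters-" + i;
      runIteration(input, clustersIn, clustersOut);
      clustersIn = clustersOut; // this iteration's output feeds the next
    }
  }
}
```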
## Running Dirichlet Process Clustering
The Dirichlet clustering algorithm may be run using a command-line
invocation on DirichletDriver.main or by making a Java call to
DirichletDriver.runJob().
Invocation using the command line takes the form:
```
bin/mahout dirichlet \
    -i <input> \
    -o <output>
```