Apache Mahout > Mahout Wiki > Algorithms > Expectation Maximization

Expectation Maximization

The principle of EM can be applied to several learning settings, but is most commonly associated with clustering. The main principle of the algorithm is comparable to k-Means. Yet in contrast to hard cluster assignments, each object is given some probability to belong to a cluster. Accordingly cluster centers are recomputed based on the average of all objects weighted by their probability of belonging to the cluster at hand.

Canopy-modified EM

One can also use the canopies idea to speed up prototypebased clustering methods like K-means and Expectation-Maximization (EM). In general, neither K-means nor EMspecify how many clusters to use. The canopies technique does not help this choice.

Prototypes (our estimates of the cluster centroids) are associated with the canopies that contain them, and the prototypes are only influenced by data that are inside their associated canopies. After creating the canopies, we decide how many prototypes will be created for each canopy. This could be done, for example, using the number of data points in a canopy and AIC or BIC where points that occur in more than one canopy are counted fractionally. Then we place prototypesinto each canopy. This initial placement can be random, as long as it is within the canopy in question, as determined by the inexpensive distance metric.

Then, instead of calculating the distance from each prototype to every point (as is traditional, a O(nk) operation), theE-step instead calculates the distance from each prototype to a much smaller number of points. For each prototype, we find the canopies that contain it (using the cheap distance metric), and only calculate distances (using the expensive distance metric) from that prototype to points within those canopies.

Note that by this procedure prototypes may move across canopy boundaries when canopies overlap. Prototypes may move to cover the data in the overlapping region, and then move entirely into another canopy in order to cover data there.

The canopy-modified EM algorithm behaves very similarly to traditional EM, with the slight difference that points outside the canopy have no influence on points in the canopy, rather than a minute influence. If the canopy property holds, and points in the same cluster fall in the same canopy, then the canopy-modified EM will almost always converge to the same maximum in likelihood as the traditional EM. In fact, the difference in each iterative step (apart from the enormous computational savings of computing fewer terms) will be negligible since points outside the canopy will have exponentially small influence.

Strategy for Parallelization

Map/Reduce Implementation