Apache Mahout: Scalable machine learning and data mining

NOTE: This implementation is a Work-In-Progress, at least till September,

The JIRA issue is here.

Boltzmann Machines

Boltzmann Machines are a type of stochastic neural network whose dynamics closely resemble physical processes. They define a network of units with an overall energy, and the state of the network evolves over time until it reaches thermal equilibrium.

However, Boltzmann machines with unconstrained connectivity converge slowly.
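To make the notion of "overall energy" concrete, here is a minimal sketch assuming the standard textbook formulation with binary units, a symmetric weight matrix, and per-unit biases. The function names `energy` and `gibbs_step` are illustrative and are not taken from the Mahout patch.

```python
import math
import random


def energy(state, weights, biases):
    """Energy of a binary state: E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i."""
    n = len(state)
    pairwise = sum(weights[i][j] * state[i] * state[j]
                   for i in range(n) for j in range(i + 1, n))
    bias_term = sum(biases[i] * state[i] for i in range(n))
    return -pairwise - bias_term


def gibbs_step(state, weights, biases, rng, temperature=1.0):
    """Resample every unit once, in order, from its conditional distribution."""
    n = len(state)
    for i in range(n):
        # Total input to unit i: its bias plus weighted states of all other units.
        total = biases[i] + sum(weights[i][j] * state[j] for j in range(n) if j != i)
        p_on = 1.0 / (1.0 + math.exp(-total / temperature))
        state[i] = 1 if rng.random() < p_on else 0
    return state
```

Repeating `gibbs_step` many times is what "evolving until thermal equilibrium" amounts to in this sketch: low-energy states end up being visited more often than high-energy ones.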

Restricted Boltzmann Machines

Restricted Boltzmann Machines (RBMs) are a variant that is ‘restricted’ in the sense that connections between hidden units within a layer are not allowed. In addition, multiple RBMs can be stacked, with the activities of one RBM’s hidden units forming the input to a higher-level RBM. Together, these two features make RBMs well suited to parallelization.

In the Netflix Prize, RBMs offered predictions that were distinctly orthogonal to those of SVD and k-NN approaches, and contributed immensely to the final solution.
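The practical effect of the restriction is that, given the visible layer, every hidden unit can be computed independently of the others. The sketch below illustrates this with a binary RBM and a single step of contrastive divergence (CD-1), a commonly used training rule. It is an illustration under those assumptions, not the code in the Mahout patch, and all function and parameter names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(visible, weights, hidden_bias):
    """P(h_j = 1 | v) for each hidden unit; no hidden unit depends on another."""
    return [sigmoid(hidden_bias[j]
                    + sum(visible[i] * weights[i][j] for i in range(len(visible))))
            for j in range(len(hidden_bias))]


def visible_probs(hidden, weights, visible_bias):
    """P(v_i = 1 | h) for each visible unit."""
    return [sigmoid(visible_bias[i]
                    + sum(hidden[j] * weights[i][j] for j in range(len(hidden))))
            for i in range(len(visible_bias))]


def sample(probs, rng):
    """Draw a binary vector from independent Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(visible, weights, visible_bias, hidden_bias, rng, learning_rate=0.1):
    """One CD-1 step on a single binary training vector; parameters change in place."""
    h0_probs = hidden_probs(visible, weights, hidden_bias)
    h0 = sample(h0_probs, rng)
    v1 = sample(visible_probs(h0, weights, visible_bias), rng)  # reconstruction
    h1_probs = hidden_probs(v1, weights, hidden_bias)
    for i in range(len(visible)):
        for j in range(len(hidden_bias)):
            # Data-driven statistics minus reconstruction-driven statistics.
            weights[i][j] += learning_rate * (visible[i] * h0_probs[j]
                                              - v1[i] * h1_probs[j])
        visible_bias[i] += learning_rate * (visible[i] - v1[i])
    for j in range(len(hidden_bias)):
        hidden_bias[j] += learning_rate * (h0_probs[j] - h1_probs[j])
    return h0_probs
```

For stacking, the list returned by `hidden_probs` for a trained RBM can be passed as the `visible` input of a second RBM with its own weights.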

RBMs in Apache Mahout

An implementation of Restricted Boltzmann Machines is being developed for Apache Mahout as a Google Summer of Code 2010 project. A recommender interface will also be provided. The key aims of the implementation are:

Accurate - Should replicate known results, including those of the Netflix Prize
Fast - The implementation uses Map-Reduce, so the work can be parallelized across a cluster
Scalable - Should scale to large datasets, with a design whose critical parts do not tie the amount of memory on your cluster systems to the size of your dataset
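As a rough illustration of how the speed and scalability aims could fit a Map-Reduce style, the sketch below splits one part of RBM training (accumulating the data-driven statistics) into a mapper that processes a single shard of the data and a reducer that sums partial results. This is an assumed data-parallel scheme written in plain Python for illustration only. It is not the design used in the Mahout patch, and `map_shard` and `reduce_stats` are hypothetical names.

```python
import math
from functools import reduce


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def map_shard(shard, weights, hidden_bias):
    """Mapper: sum v_i * P(h_j = 1 | v) over the training vectors in one shard."""
    n_visible, n_hidden = len(weights), len(hidden_bias)
    stats = [[0.0] * n_hidden for _ in range(n_visible)]
    for visible in shard:
        probs = [sigmoid(hidden_bias[j]
                         + sum(visible[i] * weights[i][j] for i in range(n_visible)))
                 for j in range(n_hidden)]
        for i in range(n_visible):
            for j in range(n_hidden):
                stats[i][j] += visible[i] * probs[j]
    # Return the count too, so the reducer's total can later be averaged.
    return stats, len(shard)


def reduce_stats(left, right):
    """Reducer: element-wise sum of two partial results."""
    (a, n_a), (b, n_b) = left, right
    summed = [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
    return summed, n_a + n_b


# Usage: two shards of binary vectors, a 2x2 model initialised to zero.
shards = [[[1, 0]], [[1, 1], [0, 1]]]
weights = [[0.0, 0.0], [0.0, 0.0]]
hidden_bias = [0.0, 0.0]
total, count = reduce(reduce_stats,
                      (map_shard(s, weights, hidden_bias) for s in shards))
```

In this sketch each mapper needs only the model parameters and its own shard in memory, so memory per worker depends on the model size and not on the size of the full dataset.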

You can view the patch as it develops here.
