Logistic Regression (SGD)

Logistic regression is a model used to predict the probability that an event occurs. It uses several predictor variables, which may be either numerical or categorical.

Logistic regression is the standard industry workhorse underlying many production fraud detection and advertising quality and targeting products. The Mahout implementation uses Stochastic Gradient Descent (SGD), which allows very large training sets to be used.

For a more detailed analysis of the approach, see Paul Komarek's thesis:

http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en

See MAHOUT-228 for the main JIRA issue for SGD.

Parallelization strategy

The bad news is that SGD is an inherently sequential algorithm. The good news is that it is blazingly fast, so it is no problem for Mahout's implementation to handle training sets of tens of millions of examples. With the down-sampling typical in many datasets, this is equivalent to a dataset with billions of raw training examples.

The SGD system in Mahout is an online learning algorithm, which means that you can learn models incrementally and that you can do performance testing as your system runs. Often this means that you can stop training when a model reaches a target level of performance. The SGD framework includes classes to do online evaluation using cross validation (the CrossFoldLearner) and an evolutionary system to optimize learning hyper-parameters on the fly (the AdaptiveLogisticRegression). The AdaptiveLogisticRegression system makes heavy use of threads to increase machine utilization: it runs 20 CrossFoldLearners in separate threads, each with slightly different learning parameters. As better settings are found, they are propagated to the other learners.
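
For example, the stop-at-target-performance pattern might look like the following minimal sketch. The feature dimension, the 0.90 AUC target, and the TrainingExample/examples() data source are made-up placeholders for illustration, not part of the Mahout API:

    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;

    // 5 folds, 2 target categories, 1000 hashed features, L1 prior.
    CrossFoldLearner learner = new CrossFoldLearner(5, 2, 1000, new L1());
    for (TrainingExample ex : examples()) {             // hypothetical data source
      learner.train(ex.getActual(), ex.getFeatures());  // incremental SGD update
      if (learner.auc() > 0.90) {                       // running held-out estimate
        break;                                          // target performance reached
      }
    }
    learner.close();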

Design of packages

Mahout's SGD system is built from three packages:

  • The vector encoding package (found in org.apache.mahout.vectorizer.encoders)
  • The SGD learning package (found in org.apache.mahout.classifier.sgd)
  • The evolutionary optimization system (found in org.apache.mahout.ep)

Feature vector encoding

Because the SGD algorithms need fixed-length feature vectors and because it is a pain to build a dictionary ahead of time, most SGD applications use the hashed feature-vector encoding system rooted at FeatureVectorEncoder.

The basic idea is that you create a vector, typically a RandomAccessSparseVector, and then you use various feature encoders to progressively add features to that vector. The size of the vector should be large enough to avoid feature collisions as features are hashed.
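
In code, that might look like the following sketch. The 100-dimensional vector and the feature names and values are made up for illustration; real applications usually use far larger vectors to keep hash collisions rare:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    Vector v = new RandomAccessSparseVector(100);

    FeatureVectorEncoder bias = new ConstantValueEncoder("Intercept");
    FeatureVectorEncoder country = new StaticWordValueEncoder("country");
    FeatureVectorEncoder age = new ContinuousValueEncoder("age");

    bias.addToVector("", 1, v);        // constant intercept term, weight 1
    country.addToVector("Canada", v);  // hashed categorical feature
    age.addToVector("25", v);          // numerical feature parsed from a string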

There are specialized encoders for a variety of data types. You can normally encode either a string representation of the value you want to encode, or a byte-level representation to avoid string conversion. In the case of ContinuousValueEncoder and ConstantValueEncoder, it is also possible to encode a null value and pass the real value in as a weight. This avoids numerical parsing entirely, which is handy if you are getting your training data from a system like Avro.
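
For instance, passing a null form with the value carried in the weight might look like this sketch (the age feature and vector size are made up for illustration):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;

    Vector v = new RandomAccessSparseVector(100);
    ContinuousValueEncoder age = new ContinuousValueEncoder("age");
    // A null form skips string parsing entirely; the weight (25.0) is
    // used directly as the feature value.
    age.addToVector((byte[]) null, 25.0, v);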

Here is a class diagram for the encoders package:

SGD Learning

For the simplest applications, you can construct an OnlineLogisticRegression and be off and running. Typically, though, it is nice to have running estimates of performance on held-out data. To get that, you should use a CrossFoldLearner, which keeps a stable of five (by default) OnlineLogisticRegression objects. Each time you pass a training example to a CrossFoldLearner, it passes the example to all but one of its children for training and to the remaining child for evaluating current performance. The children are used for evaluation in a round-robin fashion, so with the default 5-way split each child is trained on 80% of the data and evaluated on the other 20%.
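
A minimal sketch of the simple case looks like this. The feature count and hyper-parameter values are illustrative only, and the TrainingExample/examples() data source is a hypothetical stand-in for the encoding step described above:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    // 2 categories, 1000 hashed features; the hyper-parameter values
    // below are examples, not recommendations.
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(2, 1000, new L1())
            .learningRate(1)
            .lambda(3.0e-5)
            .stepOffset(1000)
            .decayExponent(0.9);

    for (TrainingExample ex : examples()) {            // hypothetical data source
      double p = lr.classifyScalar(ex.getFeatures());  // estimated P(category = 1)
      lr.train(ex.getActual(), ex.getFeatures());      // one SGD step per example
    }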

To avoid the pesky need to configure learning rates, regularization parameters and annealing schedules, you can use the AdaptiveLogisticRegression. This class maintains a pool of CrossFoldLearners and adapts learning rates and regularization on the fly so that you don't have to.
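
Usage is roughly like the following sketch (again with a hypothetical data source; note that getBest() returns the best learner found so far, which may be null early in training before any evaluation has happened):

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;

    AdaptiveLogisticRegression adaptive =
        new AdaptiveLogisticRegression(2, 1000, new L1());
    for (TrainingExample ex : examples()) {   // hypothetical data source
      adaptive.train(ex.getActual(), ex.getFeatures());
    }
    adaptive.close();

    // The best learner found is a CrossFoldLearner, so it carries its own
    // cross-validated performance estimates.
    CrossFoldLearner best = adaptive.getBest().getPayload().getLearner();
    System.out.printf("held-out AUC = %.3f%n", best.auc());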

Here is a class diagram for the classifiers.sgd package. As you can see, the number of twiddlable knobs is pretty large. For some examples, see the TrainNewsGroups example code.