About

The OrinaryLeastSquares regressor in Mahout implements a closed-form solution to Ordinary Least Squares. This is in stark contrast to many “big data machine learning” frameworks which implement a stochastic approach. From the users perspecive this difference can be reduced to:

  • Stochastic- A series of guesses at a line line of best fit.
  • Closed Form- A mathimatical approach has been explored, the properties of the parameters are well understood, and problems which arise (and the remedial measures), exist. This is usually the preferred choice of mathematicians/statisticians, but computational limititaions have forced us to resort to SGD.

Parameters

Parameter Description Default Value
'calcCommonStatistics Calculate commons statistics such as Coeefficient of Determination and Mean Square Error true
'calcStandardErrors Calculate the standard errors (and subsequent "t-scores" and "p-values") of the \(\boldsymbol{\beta}\) estimates true
'addIntercept Add an intercept to \(\mathbf{X}\) true

Example

In this example we disable the “calculate common statistics” parameters, so our summary will NOT contain the coefficient of determination (R-squared) or Mean Square Error

import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares

val drmData = drmParallelize(dense(
      (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
      (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
      (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
      (2, 1, 11,   13, 32.207582),  // Froot Loops
      (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
      (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
      (6, 2, 17,   1,  50.764999),  // Cheerios
      (3, 2, 13,   7,  40.400208),  // Clusters
      (3, 3, 13,   4,  45.811716)), numPartitions = 2)


val drmX = drmData(::, 0 until 4)
val drmY = drmData(::, 4 until 5)

val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY, 'calcCommonStatistics → false)
println(model.summary)