You’ve probably already noticed Mahout has a lot of things going on at different levels, and it can be hard to know where to start. Let’s provide an overview to help you see how the pieces fit together. In general the stack is something like this:
You have a Java/Scala application (skip this if you’re working from an interactive shell or Apache Zeppelin)
def main(args: Array[String]) {
  println("Welcome to My Mahout App")
  if (args.isEmpty) {
This may seem like a trivial part to call out, but the point is important: Mahout runs inline with your regular application code. For example, if this is an Apache Spark app, you do all your Spark things, including ETL and data prep, in the same application, and then invoke Mahout’s mathematically expressive Scala DSL when you’re ready to do math on the data.
So when you get to a point in your code where you’re ready to math it up (in this example, on Spark), you can elegantly express yourself mathematically.
implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
val A = drmWrap(rddA)
val B = drmWrap(rddB)
val C = A.t %*% A + A %*% B.t
We’ve defined a Mahout distributed context (here a SparkDistributedContext, which is a wrapper on the Spark Context) and two Distributed Row Matrices (DRMs), which are wrappers around RDDs (in Spark).
At this point there is a bit of optimization that happens. For example, consider A.t %*% A, which is \(\mathbf{A}^\top \mathbf{A}\). Transposing a large matrix is a very expensive thing to do, and in this case we don’t actually need to do it. There is a more efficient way to calculate \(\mathbf{A}^\top \mathbf{A}\) that avoids physically transposing the matrix.
(Image showing this)
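One standard way to avoid the physical transpose uses the identity \(\mathbf{A}^\top \mathbf{A} = \sum_i \mathbf{a}_i \mathbf{a}_i^\top\), where \(\mathbf{a}_i\) is the i-th row of A written as a column vector. The following is a minimal plain-Scala sketch of that idea on small in-memory arrays. It is not Mahout's implementation, and the names `AtASketch`, `transpose`, `multiply`, and `ata` are made up for illustration.

```scala
object AtASketch {
  type Matrix = Array[Array[Double]]

  // Reference version, step 1: materialize the transpose explicitly.
  def transpose(a: Matrix): Matrix =
    Array.tabulate(a(0).length, a.length)((j, i) => a(i)(j))

  // Reference version, step 2: ordinary dense matrix multiplication.
  def multiply(x: Matrix, y: Matrix): Matrix =
    Array.tabulate(x.length, y(0).length) { (i, j) =>
      var s = 0.0
      var k = 0
      while (k < y.length) { s += x(i)(k) * y(k)(j); k += 1 }
      s
    }

  // Transpose-free version: accumulate the outer product of each row
  // with itself. Only one row of A is needed at a time.
  def ata(a: Matrix): Matrix = {
    val n = a(0).length
    val out = Array.ofDim[Double](n, n)
    for (row <- a; i <- 0 until n; j <- 0 until n)
      out(i)(j) += row(i) * row(j)
    out
  }
}
```

Because each row contributes independently, a row-partitioned matrix can compute a partial sum per partition and then add the partial sums together, which is why this formulation suits a distributed row matrix.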
Mahout converts this code into something that looks like:
OpAtA(A) + OpABt(A, B) // illustrative pseudocode with real functions called
There’s a little more magic that happens at this level, but the punchline is that Mahout translates the pretty Scala into a series of operators, which at the next level are implemented by the engine.
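To make the translation step concrete, here is a toy sketch of how an expression tree could be rewritten into fused operators by pattern matching. The case classes are hypothetical stand-ins that borrow the names from the pseudocode above. This is an assumption-laden illustration of the general technique, not Mahout's actual optimizer.

```scala
object PlanSketch {
  // Hypothetical logical expression tree (not Mahout's real classes).
  sealed trait Expr
  final case class Drm(name: String) extends Expr
  final case class Transpose(a: Expr) extends Expr
  final case class Times(a: Expr, b: Expr) extends Expr
  final case class Plus(a: Expr, b: Expr) extends Expr

  // Fused operators, named after the pseudocode in the text.
  final case class OpAtA(a: Expr) extends Expr
  final case class OpABt(a: Expr, b: Expr) extends Expr

  // Recursively replace transpose-multiply patterns with fused operators,
  // so that no standalone Transpose needs to be executed for them.
  def optimize(e: Expr): Expr = e match {
    case Times(Transpose(a), b) if a == b => OpAtA(optimize(a))
    case Times(a, Transpose(b))           => OpABt(optimize(a), optimize(b))
    case Times(a, b)                      => Times(optimize(a), optimize(b))
    case Plus(a, b)                       => Plus(optimize(a), optimize(b))
    case Transpose(a)                     => Transpose(optimize(a))
    case other                            => other
  }
}
```

With `A = Drm("A")` and `B = Drm("B")`, calling `optimize(Plus(Times(Transpose(A), A), Times(A, Transpose(B))))` yields `Plus(OpAtA(A), OpABt(A, B))`, mirroring the pseudocode.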
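When one creates new engine bindings, one is in essence defining two things. The first is the engine-specific structure underlying a DRM (in Spark, an RDD). That structure has rows of MahoutVectors, so in Spark it is an RDD[(index, MahoutVector)]. This will be important when we get to the native solvers. The second is a set of engine-specific implementations of the operators on that structure; in Spark this means implementing things like AtA on an RDD. See the sparkbindings on GitHub.
Now your mathematically expressive Samsara Scala code has been translated into optimized engine-specific functions.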
Recall how I said the rows of the DRMs are org.apache.mahout.math.Vector. Here is where this becomes important. I’m going to explain this in the context of Spark, but the principles apply to all distributed backends.
If you are familiar with how mapping and reducing work in Spark, then envision this RDD of MahoutVectors: each partition, an indexed collection of vectors, is a block of the distributed matrix. However, this block is entirely in-core, and is therefore treated like an in-core matrix.
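The partition-as-block idea can be sketched without Spark by treating one partition as an iterator of (index, row) pairs and assembling it into a single in-core block. The names `BlockifySketch`, `Block`, and `blockify` are illustrative only, and plain arrays stand in for Mahout's Vector and Matrix types.

```scala
object BlockifySketch {
  // One row of the distributed matrix: (row index, row values).
  type Row = (Int, Array[Double])

  // An in-core block: the row keys plus a dense matrix of the row values.
  final case class Block(keys: Array[Int], rows: Array[Array[Double]]) {
    def nrow: Int = rows.length
    def ncol: Int = if (rows.isEmpty) 0 else rows(0).length
  }

  // Collect every row of a single partition into one block.
  // In Spark, a function like this would typically be applied per partition.
  def blockify(partition: Iterator[Row]): Block = {
    val collected = partition.toArray
    Block(collected.map(_._1), collected.map(_._2))
  }
}
```

Once the rows are gathered this way, any operation on the block is an ordinary in-memory matrix operation; the keys are kept so results can be mapped back to rows of the distributed matrix.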
Now Mahout defines its own in-core BLAS packs and refers to them as Native Solvers. The default native solver is just the plain old JVM, which is painfully slow, but works just about anywhere.
When the data gets to the node, an operation on the matrix block is called. In the same way that Mahout converts abstract operators on the DRM into implementations on various distributed engines, it calls abstract operators on the in-core matrices and vectors, which are implemented on various native solvers.
The default “native solver” is the JVM, which isn’t native at all, and if no actual native solvers are present, operations will fall back to it. However, if a native solver is present (the jar was added to the notebook), then the magic will happen.
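The fallback behaviour can be pictured as a simple dispatch: an abstract in-core operation, a JVM implementation that is always available, and a selection step that prefers a native implementation when one is present. The sketch below uses hypothetical names (`Solver`, `JvmSolver`, `select`). How Mahout actually discovers and loads solvers is not shown here and is not implied by this code.

```scala
object SolverSketch {
  type Matrix = Array[Array[Double]]

  // Abstract in-core operator that every solver must implement.
  trait Solver {
    def name: String
    def mmul(a: Matrix, b: Matrix): Matrix
  }

  // Always-available fallback: plain JVM loops, slow but portable.
  object JvmSolver extends Solver {
    val name = "jvm"
    def mmul(a: Matrix, b: Matrix): Matrix =
      Array.tabulate(a.length, b(0).length) { (i, j) =>
        a(i).indices.map(k => a(i)(k) * b(k)(j)).sum
      }
  }

  // Use the first native solver that is available, else fall back to the JVM.
  def select(available: Seq[Solver]): Solver =
    available.headOption.getOrElse(JvmSolver)
}
```

Calling code only ever sees the `Solver` interface, so the same matrix expression runs unchanged whether `select` returns the JVM fallback or an accelerated backend.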
Imagine we still have our Spark executor: it has this block of a matrix sitting in-core. Now let’s suppose the ViennaCL-OMP native solver is in use. When Spark calls an operation on this in-core matrix, the matrix is dumped out of the JVM and the calculation is carried out on all available CPUs.
In a similar way, the ViennaCL native solver dumps the matrix out of the JVM and looks for a GPU to execute the operations on.
Once the operations are complete, the result is loaded back into the JVM and Spark (or whatever distributed engine is in use), and shipped back to the driver.
The native solver operations are only defined on org.apache.mahout.math.Vector and org.apache.mahout.math.Matrix, which is why it is critical that the underlying structure be composed row-wise of Vectors or Matrices.