Title: Scrunch
Subtitle: A Scala Wrapper for the Apache Crunch Java API
Notice:    Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
           regarding copyright ownership.  The ASF licenses this file
           to you under the Apache License, Version 2.0 (the
           "License"); you may not use this file except in compliance
           with the License.  You may obtain a copy of the License at
           .
             http://www.apache.org/licenses/LICENSE-2.0
           .
           Unless required by applicable law or agreed to in writing,
           software distributed under the License is distributed on an
           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
           either express or implied.  See the License for the specific
           language governing permissions and limitations under the
           License.

## Introduction

Scrunch is an experimental Scala wrapper for the Apache Crunch Java API, based on the same ideas as the
[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created a Scala wrapper for FlumeJava.

## Why Scala?

In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports a mixture of functional
and object-oriented programming styles and has powerful type-inference capabilities, allowing us to create complex
pipelines using very few keystrokes. Here is an implementation of the classic WordCount problem using the Scrunch API:

    import org.apache.crunch.io.{From => from}
    import org.apache.crunch.scrunch._
    import org.apache.crunch.scrunch.Conversions._  // For implicit type conversions

    class WordCountExample {
      val pipeline = new Pipeline[WordCountExample]

      def wordCount(fileName: String) = {
        pipeline.read(from.textFile(fileName))
          .flatMap(_.toLowerCase.split("\\W+"))
          .filter(!_.isEmpty())
          .count
      }
    }

The Scala compiler can infer the return type of the flatMap function as an Array[String], and the Scrunch wrapper code
uses that inferred type to figure out how to serialize the data between the Map and Reduce stages (the sketch at the
end of this page writes these types out by hand).

Here's a slightly more complex example, in which we get the word counts for two different files, compute the delta in
how often each word occurs, and then return only the words that occur more often in the first file than in the second:

    class WordCountExample {
      def wordGt(firstFile: String, secondFile: String) = {
        wordCount(firstFile).cogroup(wordCount(secondFile))
          .map((k, v) => (k, (v._1.sum - v._2.sum)))
          .filter((k, v) => v > 0)
          .map((k, v) => k)
      }
    }

## Materializing Job Outputs

The Scrunch API also incorporates the Java library's `materialize` functionality, which allows us to easily read the
output of a MapReduce pipeline into the client:

    class WordCountExample {
      def hasHamlet = wordGt("shakespeare.txt", "maugham.txt").materialize.exists(_ == "hamlet")
    }

## Notes and Thanks

Scrunch emerged out of conversations with [Dmitriy Ryaboy](http://twitter.com/#!/squarecog),
[Oscar Boykin](http://twitter.com/#!/posco), and [Avi Bryant](http://twitter.com/#!/avibryant) from Twitter. Many
thanks to them for their feedback, guidance, and encouragement. We are also grateful to
[Matei Zaharia](http://twitter.com/#!/matei_zaharia), whose [Spark Project](http://www.spark-project.org/) inspired
much of the original Scrunch API implementation.
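
## A Worked Example with Explicit Types

To make the role of type inference concrete, here is a rough sketch of the wordCount pipeline from above with the
types the compiler normally infers written out by hand, plus a small helper that materializes the counts on the
client. The explicit PCollection annotations and the printCounts helper are illustrative assumptions layered on the
examples above, not part of the original write-up:

    import org.apache.crunch.io.{From => from}
    import org.apache.crunch.scrunch._

    class WordCountTypedExample {
      val pipeline = new Pipeline[WordCountTypedExample]

      // The same pipeline as wordCount above. The element type of each
      // intermediate collection is what Scrunch uses to decide how to
      // serialize data between the map and reduce stages.
      def wordCountTyped(fileName: String) = {
        val lines: PCollection[String] = pipeline.read(from.textFile(fileName))
        val words: PCollection[String] = lines.flatMap(_.toLowerCase.split("\\W+"))
        words.filter(!_.isEmpty()).count  // a PTable pairing each word with its count
      }

      // Materialize the counts on the client and print them, in the same
      // spirit as the hasHamlet example above.
      def printCounts(fileName: String): Unit =
        wordCountTyped(fileName).materialize.foreach(println)
    }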