Title: Scrunch
Subtitle: A Scala Wrapper for the Apache Crunch Java API
Notice:    Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
           regarding copyright ownership.  The ASF licenses this file
           to you under the Apache License, Version 2.0 (the
           "License"); you may not use this file except in compliance
           with the License.  You may obtain a copy of the License at
           .
             http://www.apache.org/licenses/LICENSE-2.0
           .
           Unless required by applicable law or agreed to in writing,
           software distributed under the License is distributed on an
           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
           either express or implied.  See the License for the specific
           language governing permissions and limitations under the
           License.

## Introduction

Scrunch is an experimental Scala wrapper for the Apache Crunch Java API, based on the same ideas as the
[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created a Scala wrapper for FlumeJava.

## Why Scala?

In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports a mixture of functional
and object-oriented programming styles and has powerful type-inference capabilities, allowing us to create complex
pipelines using very few keystrokes. Here is an implementation of the classic WordCount problem using the Scrunch API:

    import org.apache.crunch.io.{From => from}
    import org.apache.crunch.scrunch._
    import org.apache.crunch.scrunch.Conversions._  // For implicit type conversions

    class WordCountExample {
      val pipeline = new Pipeline[WordCountExample]

      def wordCount(fileName: String) = {
        pipeline.read(from.textFile(fileName))
          .flatMap(_.toLowerCase.split("\\W+"))
          .filter(!_.isEmpty())
          .count
      }
    }

The Scala compiler can infer the return type of the flatMap function as an Array[String], and the Scrunch wrapper code
uses that inferred type to figure out how to serialize the data between the Map and Reduce stages (the sketch at the
end of this page writes these types out by hand).

Here's a slightly more complex example, in which we get the word counts for two different files, compute the delta in
how often each word occurs, and then return only the words that occur more often in the first file than in the second:

    class WordCountExample {
      def wordGt(firstFile: String, secondFile: String) = {
        wordCount(firstFile).cogroup(wordCount(secondFile))
          .map((k, v) => (k, (v._1.sum - v._2.sum)))
          .filter((k, v) => v > 0)
          .map((k, v) => k)
      }
    }

## Materializing Job Outputs

The Scrunch API also incorporates the Java library's `materialize` functionality, which allows us to easily read the
output of a MapReduce pipeline into the client:

    class WordCountExample {
      def hasHamlet = wordGt("shakespeare.txt", "maugham.txt").materialize.exists(_ == "hamlet")
    }

## Notes and Thanks

Scrunch emerged out of conversations with [Dmitriy Ryaboy](http://twitter.com/#!/squarecog),
[Oscar Boykin](http://twitter.com/#!/posco), and [Avi Bryant](http://twitter.com/#!/avibryant) from Twitter. Many
thanks to them for their feedback, guidance, and encouragement. We are also grateful to
[Matei Zaharia](http://twitter.com/#!/matei_zaharia), whose [Spark Project](http://www.spark-project.org/) inspired
much of the original Scrunch API implementation.
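
## A Worked Example with Explicit Types

To make the role of type inference concrete, here is a rough sketch of the wordCount pipeline from above with the
types the compiler normally infers written out by hand, plus a small helper that materializes the counts on the
client. The explicit PCollection annotations and the printCounts helper are illustrative assumptions layered on the
examples above, not part of the original write-up:

    import org.apache.crunch.io.{From => from}
    import org.apache.crunch.scrunch._

    class WordCountTypedExample {
      val pipeline = new Pipeline[WordCountTypedExample]

      // The same pipeline as wordCount above. The element type of each
      // intermediate collection is what Scrunch uses to decide how to
      // serialize data between the map and reduce stages.
      def wordCountTyped(fileName: String) = {
        val lines: PCollection[String] = pipeline.read(from.textFile(fileName))
        val words: PCollection[String] = lines.flatMap(_.toLowerCase.split("\\W+"))
        words.filter(!_.isEmpty()).count  // a PTable pairing each word with its count
      }

      // Materialize the counts on the client and print them, in the same
      // spirit as the hasHamlet example above.
      def printCounts(fileName: String): Unit =
        wordCountTyped(fileName).materialize.foreach(println)
    }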