Apache Accumulo MapReduce Example
This example uses mapreduce and accumulo to compute word counts for a set of documents. This is accomplished using a map-only mapreduce job and a accumulo table with aggregators.
To run this example you will need a directory in HDFS containing text files. The accumulo readme will be used to show how to run this example.
$ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README $ hadoop fs -ls /user/username/wc Found 1 items -rw-r--r-- 2 username supergroup 9359 2009-07-15 17:54 /user/username/wc/Accumulo.README
The first part of running this example is to create a table with aggregation for the column family count.
$ ./bin/accumulo shell -u username -p password Shell - Apache Accumulo Interactive Shell - version: 1.3.x-incubating - instance name: instance - instance id: 00000000-0000-0000-0000-000000000000 - - type 'help' for a list of available commands - username@instance> createtable wordCount -a count=org.apache.accumulo.core.iterators.aggregation.StringSummation username@instance wordCount> quit
After creating the table, run the word count map reduce job.
[user1@instance accumulo]$ bin/tool.sh lib/accumulo-examples-*[^c].jar org.apache.accumulo.examples.mapreduce.WordCount instance zookeepers /user/user1/wc wordCount -u username -p password 11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1 11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003 11/02/07 18:20:13 INFO mapred.JobClient: map 0% reduce 0% 11/02/07 18:20:20 INFO mapred.JobClient: map 100% reduce 0% 11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003 11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6 11/02/07 18:20:22 INFO mapred.JobClient: Job Counters 11/02/07 18:20:22 INFO mapred.JobClient: Launched map tasks=1 11/02/07 18:20:22 INFO mapred.JobClient: Data-local map tasks=1 11/02/07 18:20:22 INFO mapred.JobClient: FileSystemCounters 11/02/07 18:20:22 INFO mapred.JobClient: HDFS_BYTES_READ=10487 11/02/07 18:20:22 INFO mapred.JobClient: Map-Reduce Framework 11/02/07 18:20:22 INFO mapred.JobClient: Map input records=255 11/02/07 18:20:22 INFO mapred.JobClient: Spilled Records=0 11/02/07 18:20:22 INFO mapred.JobClient: Map output records=1452
After the map reduce job completes, query the accumulo table to see word counts.
$ ./bin/accumulo shell -u username -p password username@instance> table wordCount username@instance wordCount> scan -b the the count:20080906 [] 75 their count:20080906 [] 2 them count:20080906 [] 1 then count:20080906 [] 1 there count:20080906 [] 1 these count:20080906 [] 3 this count:20080906 [] 6 through count:20080906 [] 1 time count:20080906 [] 3 time. count:20080906 [] 1 to count:20080906 [] 27 total count:20080906 [] 1 tserver, count:20080906 [] 1 tserver.compaction.major.concurrent.max count:20080906 [] 1 ...