Title: Apache Accumulo MapReduce Example
Notice:    Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
           regarding copyright ownership.  The ASF licenses this file
           to you under the Apache License, Version 2.0 (the
           "License"); you may not use this file except in compliance
           with the License.  You may obtain a copy of the License at
           .
             http://www.apache.org/licenses/LICENSE-2.0
           .
           Unless required by applicable law or agreed to in writing,
           software distributed under the License is distributed on an
           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
           either express or implied.  See the License for the specific
           language governing permissions and limitations under the
           License.

This example uses MapReduce and Accumulo to compute word counts for a set of
documents. This is accomplished using a map-only MapReduce job and an Accumulo
table with aggregators.

To run this example you will need a directory in HDFS containing text files.
The Accumulo README will be used to show how to run this example.

    $ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
    $ hadoop fs -ls /user/username/wc
    Found 1 items
    -rw-r--r--   2 username supergroup       9359 2009-07-15 17:54 /user/username/wc/Accumulo.README

The first part of running this example is to create a table with aggregation
for the column family count. The -a option attaches the StringSummation
aggregator to the count column family, so values inserted into that family are
summed together.

    $ ./bin/accumulo shell -u username -p password
    Shell - Apache Accumulo Interactive Shell
    - version: 1.3.x
    - instance name: instance
    - instance id: 00000000-0000-0000-0000-000000000000
    -
    - type 'help' for a list of available commands
    -
    username@instance> createtable wordCount -a count=org.apache.accumulo.core.iterators.aggregation.StringSummation
    username@instance wordCount> quit

After creating the table, run the word count MapReduce job.

    [user1@instance accumulo]$ bin/tool.sh lib/accumulo-examples-*[^c].jar org.apache.accumulo.examples.mapreduce.WordCount instance zookeepers /user/user1/wc wordCount -u username -p password

    11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
    11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
    11/02/07 18:20:13 INFO mapred.JobClient:  map 0% reduce 0%
    11/02/07 18:20:20 INFO mapred.JobClient:  map 100% reduce 0%
    11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
    11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
    11/02/07 18:20:22 INFO mapred.JobClient:   Job Counters
    11/02/07 18:20:22 INFO mapred.JobClient:     Launched map tasks=1
    11/02/07 18:20:22 INFO mapred.JobClient:     Data-local map tasks=1
    11/02/07 18:20:22 INFO mapred.JobClient:   FileSystemCounters
    11/02/07 18:20:22 INFO mapred.JobClient:     HDFS_BYTES_READ=10487
    11/02/07 18:20:22 INFO mapred.JobClient:   Map-Reduce Framework
    11/02/07 18:20:22 INFO mapred.JobClient:     Map input records=255
    11/02/07 18:20:22 INFO mapred.JobClient:     Spilled Records=0
    11/02/07 18:20:22 INFO mapred.JobClient:     Map output records=1452
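The job is map-only: each mapper tokenizes its input lines and writes one
mutation per word, with a value of "1", directly to the table through
AccumuloOutputFormat. Because the StringSummation aggregator adds those ones
together, no reduce phase is needed. The sketch below shows what such a mapper
might look like; it assumes the Hadoop org.apache.hadoop.mapreduce API and
Accumulo's AccumuloOutputFormat, and the class name WordCountMapper is
illustrative (the real code is in
org.apache.accumulo.examples.mapreduce.WordCount).

    import java.io.IOException;

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative map-only word count mapper. Each word becomes a Mutation
    // carrying the value "1"; the StringSummation aggregator on the table
    // sums these at scan and compaction time, so no reducer is required.
    public class WordCountMapper extends Mapper<LongWritable,Text,Text,Mutation> {
      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
          if (word.isEmpty())
            continue;
          Mutation m = new Mutation(new Text(word));
          // The "count" family matches the aggregator configured at table
          // creation; the qualifier here is just a date-like label.
          m.put(new Text("count"), new Text("20080906"), new Value("1".getBytes()));
          // A null key tells AccumuloOutputFormat to use the job's default table.
          context.write(null, m);
        }
      }
    }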
After the MapReduce job completes, query the Accumulo table to see word counts.

    $ ./bin/accumulo shell -u username -p password
    username@instance> table wordCount
    username@instance wordCount> scan -b the
    the          count:20080906 []    75
    their        count:20080906 []    2
    them         count:20080906 []    1
    then         count:20080906 []    1
    there        count:20080906 []    1
    these        count:20080906 []    3
    this         count:20080906 []    6
    through      count:20080906 []    1
    time         count:20080906 []    3
    time.        count:20080906 []    1
    to           count:20080906 []    27
    total        count:20080906 []    1
    tserver,     count:20080906 []    1
    tserver.compaction.major.concurrent.max count:20080906 []    1
    ...
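The totals shown are produced by the StringSummation aggregator, which
combines the inserted "1" values at scan and compaction time, so the same
summed counts are visible through the Java client API as well. Below is a
rough sketch of reading them back with a Scanner; the instance name, zookeeper
host, and credentials are placeholders for your own.

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class ReadWordCounts {
      public static void main(String[] args) throws Exception {
        // "instance", "zoo1:2181", "username", and "password" are placeholders.
        Connector conn = new ZooKeeperInstance("instance", "zoo1:2181")
            .getConnector("username", "password".getBytes());
        Scanner scanner = conn.createScanner("wordCount", new Authorizations());
        // Equivalent to 'scan -b the' in the shell: start scanning at row "the".
        scanner.setRange(new Range(new Text("the"), (Text) null));
        for (Entry<Key,Value> entry : scanner)
          System.out.println(entry.getKey().getRow() + " " + entry.getValue());
      }
    }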