Accumulo supports importing map files produced by an external process into an online table. Often, it is much faster to churn through large amounts of data using map/reduce to produce the map files; the new map files can then be incorporated into Accumulo using bulk ingest.
An important caveat is that the map/reduce job must use a range partitioner instead of the default hash partitioner. The range partitioner uses the current split points of the Accumulo table into which you want to ingest data. To bulk ingest data using map/reduce, take the following high-level steps.
1. Construct an org.apache.accumulo.core.client.Connector instance.
2. Call connector.tableOperations().getSplits() to obtain the table's current split points.
3. Run a map/reduce job using a range partitioner built from the splits obtained in the previous step.
4. Call connector.tableOperations().importDirectory(), passing the output directory of the map/reduce job.

A complete example is available in README.bulkIngest.
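The client-side steps above can be sketched as follows. This is a sketch against the pre-2.0 client API; the instance name, ZooKeeper hosts, credentials, table name, and paths are placeholders, and exact method signatures vary across Accumulo versions.

```java
import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.hadoop.io.Text;

public class BulkIngestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; requires a running Accumulo instance.
        ZooKeeperInstance inst = new ZooKeeperInstance("myInstance", "zkhost1,zkhost2");
        Connector connector = inst.getConnector("user", "password".getBytes());

        // Fetch the table's current split points to configure the range partitioner.
        Collection<Text> splits = connector.tableOperations().getSplits("mytable");

        // ... run the map/reduce job with a range partitioner over these splits,
        //     writing map files to /bulk/output ...

        // Bulk import the job's output; the failures directory receives any
        // map files that could not be assigned.
        connector.tableOperations().importDirectory(
            "mytable", "/bulk/output", "/bulk/failures", false);
    }
}
```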
The reason hash partitioning is not recommended is that it could place a lot of load on the system. Accumulo will look at each map file and determine the tablets to which it should be assigned. With hash partitioning, every map file could be assigned to every tablet. If a tablet has too many map files, it will not allow them to be opened for a query (opening too many map files can kill a Hadoop Data Node), so queries would be disabled until major compactions reduced the number of map files on the tablet. However, when range partitioning using a table's splits, each tablet should only get one map file.
Any set of cut points for range partitioning can be used in a map/reduce job, but using Accumulo's current splits is probably optimal. However, in some cases there may be too many splits. For example, if there are 2000 splits, you would need to run 2001 reducers. To overcome this problem, use the connector.tableOperations().getSplits(<table name>, <max splits>) method. This method will not return more than <max splits> splits, but the splits it returns will optimally partition the data for Accumulo.
Remember that Accumulo never splits rows across tablets. Therefore the range partitioner only considers rows when partitioning.
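To illustrate how a range partitioner maps rows to reducers, here is a simplified, self-contained sketch (not Accumulo's actual RangePartitioner class): binary search the sorted split points on the row alone, ignoring the rest of the key, since rows are never split across tablets.

```java
import java.util.Arrays;

public class RangePartitionDemo {
    // Sorted split points define partitions: a row goes to the index of the
    // first split point >= the row; rows after the last split go to the final
    // partition. N splits yield N + 1 partitions (reducers).
    static int partition(String row, String[] splits) {
        int idx = Arrays.binarySearch(splits, row);
        // binarySearch returns (-(insertion point) - 1) when not found.
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "m", "t"}; // 3 splits => 4 partitions
        System.out.println(partition("apple", splits)); // 0
        System.out.println(partition("horse", splits)); // 1
        System.out.println(partition("m", splits));     // 1
        System.out.println(partition("zebra", splits)); // 3
    }
}
```

Because only the row is used as the partitioning key, all entries for a given row land in the same reducer, and therefore in a map file that maps cleanly onto a single tablet.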
An alternative to bulk ingest is to have a map/reduce job use AccumuloOutputFormat, which can support billions of inserts per hour, depending on the size of your cluster. This is sufficient for most users, but bulk ingest remains the fastest way to incorporate data into Accumulo. In addition, bulk ingest has one advantage over AccumuloOutputFormat: there is no duplicate data insertion. When map/reduce is used to output data to Accumulo, restarted jobs may re-enter data from previous failed attempts. Generally, this only matters when there are aggregators. With bulk ingest, reducers write to new map files, so duplicates are not a concern: if a reducer fails, it simply creates a new map file, and when all reducers finish, the map files are bulk ingested into Accumulo. The disadvantage of bulk ingest compared to AccumuloOutputFormat is that it is tightly coupled to Accumulo internals, so a bulk ingest user may need to make more changes to their code when switching to a new Accumulo version.
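The duplicate-insertion point can be illustrated with a toy model (plain Java maps standing in for a table; this is not Accumulo code): replaying the same mutation is harmless under last-value-wins semantics, but double-counts under a summing aggregator.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplayDemo {
    public static void main(String[] args) {
        // Last-value-wins semantics: replaying a mutation is idempotent.
        Map<String, Long> plain = new HashMap<>();
        plain.put("rowA", 5L);
        plain.put("rowA", 5L); // replay from a restarted job: no harm
        System.out.println(plain.get("rowA")); // 5

        // Summing-aggregator semantics: a replayed mutation double-counts.
        Map<String, Long> aggregated = new HashMap<>();
        aggregated.merge("rowA", 5L, Long::sum);
        aggregated.merge("rowA", 5L, Long::sum); // replay: value is now wrong
        System.out.println(aggregated.get("rowA")); // 10
    }
}
```

Bulk ingest sidesteps this because a failed reducer's partial map file is discarded and regenerated; nothing reaches the table until importDirectory runs once, after all reducers succeed.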