Title: Apache Accumulo Bloom Filter Example
Notice:    Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
           regarding copyright ownership.  The ASF licenses this file
           to you under the Apache License, Version 2.0 (the
           "License"); you may not use this file except in compliance
           with the License.  You may obtain a copy of the License at
           .
             http://www.apache.org/licenses/LICENSE-2.0
           .
           Unless required by applicable law or agreed to in writing,
           software distributed under the License is distributed on an
           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           KIND, either express or implied.  See the License for the
           specific language governing permissions and limitations
           under the License.

This example shows how to create a table with bloom filters enabled. It also
shows how bloom filters increase query performance when looking for values that
do not exist in a table.

Below, a table named bloom_test is created and bloom filters are enabled.

    $ ./bin/accumulo shell -u username -p password
    Shell - Apache Accumulo Interactive Shell
    - version: 1.3.x
    - instance name: instance
    - instance id: 00000000-0000-0000-0000-000000000000
    -
    - type 'help' for a list of available commands
    -
    username@instance> setauths -u username -s exampleVis
    username@instance> createtable bloom_test
    username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
    username@instance bloom_test> exit

Below, 1 million random values are inserted into Accumulo. The randomly
generated rows range between 0 and 1 billion. The random number generator is
initialized with the seed 7.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchWriter -s 7 instance zookeepers username password bloom_test 1000000 0 1000000000 50 2000000 60000 3 exampleVis

Below, the table is flushed. Look at the monitor page and wait for the flush to
complete.

    $ ./bin/accumulo shell -u username -p password
    username@instance> flush -t bloom_test
    Flush of table bloom_test initiated...
    username@instance> exit

The flush is finished when there are no entries in memory and the number of
minor compactions goes to zero. Refresh the page to see changes to the table.

After the flush completes, 500 random queries are done against the table. The
same seed is used to generate the queries, so everything is found in the
table.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 7 instance zookeepers username password bloom_test 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    96.19 lookups/sec   5.20 secs
    num results : 500
    Generating 500 random queries...finished
    102.35 lookups/sec   4.89 secs
    num results : 500

Below, another 500 queries are performed using a different seed, which results
in nothing being found. In this case the lookups are much faster because of
the bloom filters.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 8 instance zookeepers username password bloom_test 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    2212.39 lookups/sec   0.23 secs
    num results : 0
    Did not find 500 rows
    Generating 500 random queries...finished
    4464.29 lookups/sec   0.11 secs
    num results : 0
    Did not find 500 rows

********************************************************************************

Bloom filters can also speed up lookups for entries that exist. In Accumulo,
data is divided into tablets and each tablet has multiple map files. Every
lookup in Accumulo goes to a specific tablet, where a lookup is done on each
map file in the tablet. So if a tablet has three map files, lookups can be up
to three times slower than on a tablet with one map file. However, if the map
files contain unique sets of data, then bloom filters can help eliminate map
files that do not contain the row being looked up.
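
The file-elimination idea can be sketched with a toy model in Python. This is
not Accumulo's implementation: the BloomFilter and MapFile classes, the
lookup and build_files helpers, the filter size, and the hash count are all
hypothetical choices made for illustration, and the data set is scaled down to
1000 rows per file.

```python
import hashlib
import random


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class MapFile:
    """Hypothetical stand-in for one map file: a set of rows plus its filter."""

    def __init__(self, rows):
        self.rows = set(rows)
        self.bloom = BloomFilter()
        for row in self.rows:
            self.bloom.add(row)


def lookup(files, row, use_bloom):
    """Return (found, number of files actually searched)."""
    searched = 0
    found = False
    for f in files:
        if use_bloom and not f.bloom.might_contain(row):
            continue  # filter says "definitely absent", so skip this file
        searched += 1
        if row in f.rows:
            found = True
    return found, searched


def build_files(seeds=(7, 8, 9), rows_per_file=1000):
    """Build one file per seed, each holding rows drawn from [0, 1 billion)."""
    files = []
    for seed in seeds:
        rng = random.Random(seed)
        files.append(MapFile(rng.randrange(1_000_000_000)
                             for _ in range(rows_per_file)))
    return files


if __name__ == "__main__":
    files = build_files()
    row = next(iter(files[0].rows))  # a row known to exist in the first file
    print("without bloom filters:", lookup(files, row, use_bloom=False))
    print("with bloom filters:   ", lookup(files, row, use_bloom=True))
```

Without the filters every file is searched for every row. With them, a file is
searched only when its filter reports a possible match, so a lookup for an
existing row usually touches just the file that holds it, apart from the
occasional false positive.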

To illustrate this effect, two identical tables were created using the
following process. One table had bloom filters, the other did not. Also, the
major compaction ratio was increased to prevent the files from being compacted
into one file.

 * Insert 1 million entries using RandomBatchWriter with a seed of 7
 * Flush the table using the shell
 * Insert 1 million entries using RandomBatchWriter with a seed of 8
 * Flush the table using the shell
 * Insert 1 million entries using RandomBatchWriter with a seed of 9
 * Flush the table using the shell

After following the above steps, each table will have a tablet with three map
files. Each map file will contain 1 million entries generated with a different
seed.

Below, 500 lookups are done against the table without bloom filters using
random number generator seed 7. Even though only one map file will likely
contain entries for this seed, all map files will be interrogated.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 7 instance zookeepers username password bloom_test1 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    35.09 lookups/sec  14.25 secs
    num results : 500
    Generating 500 random queries...finished
    35.33 lookups/sec  14.15 secs
    num results : 500

Below, the same lookups are done against the table with bloom filters. The
lookups were 2.86 times faster because only one map file was used, even though
three map files existed.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 7 instance zookeepers username password bloom_test2 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    99.03 lookups/sec   5.05 secs
    num results : 500
    Generating 500 random queries...finished
    101.15 lookups/sec   4.94 secs
    num results : 500
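
The speedup figure can be checked against the lookup rates printed in the two
transcripts above. The short Python snippet below only does that arithmetic;
the variable names are illustrative, and the numbers are copied from the
output shown in this document.

```python
# Lookup rates (lookups/sec) copied from the two passes of each run above.
without_bloom = [35.09, 35.33]   # table without bloom filters
with_bloom = [99.03, 101.15]     # table with bloom filters

# Ratio of the rates, pass by pass.
speedups = [b / a for a, b in zip(without_bloom, with_bloom)]

for run, speedup in enumerate(speedups, start=1):
    print(f"pass {run}: {speedup:.2f} times faster")
```

The second pass of each run gives the 2.86 ratio quoted above; the first pass
gives a slightly lower ratio of about 2.82.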