Title: Apache Accumulo Bloom Filter Example
Notice:    Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
           regarding copyright ownership.  The ASF licenses this file
           to you under the Apache License, Version 2.0 (the
           "License"); you may not use this file except in compliance
           with the License.  You may obtain a copy of the License at
           .
             http://www.apache.org/licenses/LICENSE-2.0
           .
           Unless required by applicable law or agreed to in writing,
           software distributed under the License is distributed on an
           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           KIND, either express or implied.  See the License for the
           specific language governing permissions and limitations
           under the License.

This example shows how to create a table with bloom filters enabled.  It also
shows how bloom filters increase query performance when looking for values that
do not exist in a table.

The shell session below creates a table named bloom_test and enables bloom
filters on it.

    $ ./bin/accumulo shell -u username -p password
    Shell - Apache Accumulo Interactive Shell
    - version: 1.3.x
    - instance name: instance
    - instance id: 00000000-0000-0000-0000-000000000000
    - 
    - type 'help' for a list of available commands
    - 
    username@instance> setauths -u username -s exampleVis
    username@instance> createtable bloom_test
    username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
    username@instance bloom_test> exit

The command below inserts 1 million random values into Accumulo.  The randomly
generated rows range between 0 and 1 billion.  The random number generator is
initialized with the seed 7.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchWriter -s 7 instance zookeepers username password bloom_test 1000000 0 1000000000 50 2000000 60000 3 exampleVis

The commands below flush the table.  Watch the monitor page and wait for the
flush to complete.

    $ ./bin/accumulo shell -u username -p password
    username@instance> flush -t bloom_test
    Flush of table bloom_test initiated...
    username@instance> exit

The flush is finished when there are no entries in memory and the number of
minor compactions goes to zero.  Refresh the monitor page to see changes to the
table.

After the flush completes, 500 random queries are run against the table.  The
same seed is used to generate the queries, so every query finds an entry in the
table.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 7 instance zookeepers username password bloom_test 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    96.19 lookups/sec   5.20 secs
    num results : 500
    Generating 500 random queries...finished
    102.35 lookups/sec   4.89 secs
    num results : 500

Below, another 500 queries are performed using a different seed, so nothing is
found.  In this case the lookups are much faster (over 2,000 lookups/sec,
compared with about 100 lookups/sec above) because the bloom filters rule out
the missing rows.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 8 instance zookeepers username password bloom_test 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    2212.39 lookups/sec   0.23 secs
    num results : 0
    Did not find 500 rows
    Generating 500 random queries...finished
    4464.29 lookups/sec   0.11 secs
    num results : 0
    Did not find 500 rows
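
Why a miss is so cheap can be seen in a toy bloom filter.  The Python sketch
below is not Accumulo's implementation; the class name `BloomFilter`, the bit
array size, the number of hashes, and the use of SHA-256 are all assumptions
chosen for illustration.  It shows only the core property: a key that was added
is never reported as absent, so a negative answer means the underlying file
does not need to be read.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bit array plus k salted hashes (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions by salting one hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bloom = BloomFilter()
for row in ("row_0000000007", "row_0000000042"):  # hypothetical row ids
    bloom.add(row)

print(bloom.might_contain("row_0000000007"))  # True: added keys are never missed
```

A `True` answer can occasionally be a false positive, in which case the file is
read and nothing is found.  A `False` answer is always correct, which is what
makes lookups for nonexistent rows fast.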

********************************************************************************

Bloom filters can also speed up lookups for entries that exist.  In Accumulo,
data is divided into tablets, and each tablet has one or more map files.  Every
lookup in Accumulo goes to a specific tablet, where a lookup is done on each
map file in the tablet.  So if a tablet has three map files, lookups can be
three times slower than on a tablet with one map file.  However, if the map
files contain unique sets of data, then bloom filters can help eliminate map
files that do not contain the row being looked up.  To illustrate this, two
identical tables were created using the following process.  One table had bloom
filters; the other did not.  Also, the major compaction ratio was increased to
prevent the files from being compacted into one file.

 * Insert 1 million entries using RandomBatchWriter with a seed of 7
 * Flush the table using the shell
 * Insert 1 million entries using RandomBatchWriter with a seed of 8
 * Flush the table using the shell
 * Insert 1 million entries using RandomBatchWriter with a seed of 9
 * Flush the table using the shell

After following the above steps, each table will have a tablet with three map
files.  Each map file will contain 1 million entries generated with a different
seed. 
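
The effect of this layout on lookups can be modelled in a few lines of Python.
This is a rough sketch under stated assumptions, not a reproduction of
RandomBatchWriter: Python's `random` module stands in for the Java random
number generator, the entry count is scaled down, and each file is represented
by a plain set that doubles as an idealised filter with no false positives.
The names `generate_rows`, `FILE_SEEDS`, and `ENTRIES_PER_FILE` are
hypothetical.

```python
import random

FILE_SEEDS = (7, 8, 9)      # one seed per flushed file, as in the steps above
ENTRIES_PER_FILE = 10_000   # scaled down from 1 million for a quick run
MAX_ROW = 1_000_000_000


def generate_rows(seed, count):
    """Return `count` pseudo-random row ids drawn from [0, MAX_ROW)."""
    rng = random.Random(seed)
    return [rng.randrange(MAX_ROW) for _ in range(count)]


# Each "file" is modelled as a set of row ids.  The set also plays the role of
# an idealised filter; a real bloom filter would allow some false positives.
files = [set(generate_rows(seed, ENTRIES_PER_FILE)) for seed in FILE_SEEDS]

# Same seed as the first file, so every query row exists in that file.
queries = generate_rows(7, 500)

# Without a filter, every lookup has to consult every file in the tablet.
probes_without_filter = len(queries) * len(files)

# With a filter, a file is consulted only when the filter says "maybe".
probes_with_filter = sum(1 for q in queries for f in files if q in f)

print("file probes without filter:", probes_without_filter)
print("file probes with filter:   ", probes_with_filter)
```

In this model the unfiltered case always probes three files per lookup, while
the filtered case probes close to one file per lookup, because rows generated
from different seeds rarely collide.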

Below, 500 lookups are done against the table without bloom filters
(bloom_test1) using random number generator seed 7.  Even though only one map
file will likely contain entries for this seed, all map files will be
interrogated.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 7 instance zookeepers username password bloom_test1 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    35.09 lookups/sec  14.25 secs
    num results : 500
    Generating 500 random queries...finished
    35.33 lookups/sec  14.15 secs
    num results : 500

Below, the same lookups are done against the table with bloom filters
(bloom_test2).  The lookups were about 2.86 times faster because only one map
file was used, even though three map files existed.

    $ ./bin/accumulo org.apache.accumulo.examples.client.RandomBatchScanner -s 7 instance zookeepers username password bloom_test2 500 0 1000000000 50 20 exampleVis
    Generating 500 random queries...finished
    99.03 lookups/sec   5.05 secs
    num results : 500
    Generating 500 random queries...finished
    101.15 lookups/sec   4.94 secs
    num results : 500
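
The speedup can be checked directly from the lookup rates printed above.  The
short Python calculation below uses only the numbers from the two transcripts;
the variable names are illustrative.

```python
# Lookup rates (lookups/sec) copied from the two runs of each command above.
without_bloom = [35.09, 35.33]   # bloom_test1
with_bloom = [99.03, 101.15]     # bloom_test2

# Compare run 1 with run 1 and run 2 with run 2.
speedups = [fast / slow for slow, fast in zip(without_bloom, with_bloom)]

for slow, fast, ratio in zip(without_bloom, with_bloom, speedups):
    print(f"{fast} / {slow} = {ratio:.2f}x")
# 99.03 / 35.09 = 2.82x
# 101.15 / 35.33 = 2.86x
```

Both runs land a little under the factor of three that would be expected if
the cost of a lookup were purely proportional to the number of map files read.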