/*
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hbase.util;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.classification.InterfaceAudience;
import org.apache.hadoop.hbase.nio.ByteBuff;

/**
 *
 * Implements a <i>Bloom filter</i>, as defined by Bloom in 1970.
 * <p>
 * The Bloom filter is a data structure that was introduced in 1970 and that has
 * been adopted by the networking research community in the past decade thanks
 * to the bandwidth efficiencies that it offers for the transmission of set
 * membership information between networked hosts. A sender encodes the
 * information into a bit vector, the Bloom filter, that is more compact than a
 * conventional representation. Computation and space costs for construction are
 * linear in the number of elements. The receiver uses the filter to test
 * whether various elements are members of the set. Though the filter will
 * occasionally return a false positive, it will never return a false negative.
 * When creating the filter, the sender can choose its desired point in a
 * trade-off between the false positive rate and the size.
 *
 * <p>
 * Originally inspired by <a href="http://www.one-lab.org/">European Commission
 * One-Lab Project 034819</a>.
 *
 * <p>
 * Bloom filters are very sensitive to the number of elements inserted into
 * them. For HBase, the number of entries depends on the size of the data stored
 * in the column. Currently the default region size is 256MB, so entry count ~=
 * 256MB / (average value size for column). Despite this rule of thumb, there is
 * no efficient way to calculate the entry count after compactions. Therefore,
 * it is often easier to use a dynamic bloom filter that will add extra space
 * instead of allowing the error rate to grow.
 *
 * <p>
 * (http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf)
 *
 * <ul>
 * <li>m denotes the number of bits in the Bloom filter (bitSize)</li>
 * <li>n denotes the number of elements inserted into the Bloom filter
 * (maxKeys)</li>
 * <li>k represents the number of hash functions used (nbHash)</li>
 * <li>e represents the desired false positive rate for the bloom (err)</li>
 * </ul>
 *
 * <p>
 * If we fix the error rate (e) and know the number of entries, then the optimal
 * bloom size m = -(n * ln(err)) / (ln(2)^2) ~= n * ln(err) / ln(0.6185)
 *
 * <p>
 * The probability of false positives is minimized when k = (m/n) * ln(2).
 *
 * @see BloomFilter The general behavior of a filter
 *
 * @see <a
 *      href="http://portal.acm.org/citation.cfm?id=362692&dl=ACM&coll=portal">
 *      Space/Time Trade-Offs in Hash Coding with Allowable Errors</a>
 *
 * @see BloomFilterWriter for the ability to add elements to a Bloom filter
 */
@InterfaceAudience.Private
public interface BloomFilter extends BloomFilterBase {

  /**
   * Check if the specified key is contained in the bloom filter.
   * Used in ROW_COL blooms, where the blooms are serialized as KeyValues.
   * @param keyCell the key whose existence is checked
   * @param bloom bloom filter data to search. This can be null if auto-loading
   *          is supported.
   * @return true if matched by bloom, false if not
   */
  boolean contains(Cell keyCell, ByteBuff bloom);

  /**
   * Check if the specified key is contained in the bloom filter.
   * Used in ROW bloom where the blooms are just plain byte[]
   * @param buf data to check for existence of
   * @param offset offset into the data
   * @param length length of the data
   * @param bloom bloom filter data to search. This can be null if auto-loading
   *          is supported.
   * @return true if matched by bloom, false if not
   */
  boolean contains(byte[] buf, int offset, int length, ByteBuff bloom);

  /**
   * @return true if this Bloom filter can automatically load its data
   *         and thus allows a null byte buffer to be passed to contains()
   */
  boolean supportsAutoLoading();
}
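
// Illustrative sketch only, not part of HBase: a minimal in-memory filter that
// applies the sizing formulas from the class comment, m = -(n * ln(err)) / (ln(2)^2)
// and k = (m/n) * ln(2), and mirrors the byte[]/offset/length form of contains().
// The class name BloomSketch, its methods, and the FNV-1a based double hashing
// are assumptions chosen for brevity; they stand in for whatever hash functions
// and storage layout a real implementation uses.

```java
import java.util.BitSet;

/** Hypothetical stand-alone sketch; names and hashing scheme are assumptions. */
final class BloomSketch {
  private final BitSet bits;
  private final int bitSize;   // m: number of bits in the filter
  private final int hashCount; // k: number of hash functions

  BloomSketch(long maxKeys, double err) {
    this.bitSize = (int) optimalBitSize(maxKeys, err);
    this.hashCount = optimalHashCount(bitSize, maxKeys);
    this.bits = new BitSet(bitSize);
  }

  /** m = -(n * ln(err)) / (ln(2)^2), rounded up to a whole bit. */
  static long optimalBitSize(long maxKeys, double err) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(-maxKeys * Math.log(err) / (ln2 * ln2));
  }

  /** k = (m/n) * ln(2), rounded to the nearest integer and at least 1. */
  static int optimalHashCount(long bitSize, long maxKeys) {
    return Math.max(1, (int) Math.round((double) bitSize / maxKeys * Math.log(2)));
  }

  /** Sets the k bit positions derived from the given byte range. */
  void add(byte[] buf, int offset, int length) {
    int h1 = fnv1a(buf, offset, length, 0x811C9DC5);
    int h2 = fnv1a(buf, offset, length, h1);
    for (int i = 0; i < hashCount; i++) {
      bits.set(index(h1, h2, i));
    }
  }

  /** Returns false only if the key was definitely never added. */
  boolean contains(byte[] buf, int offset, int length) {
    int h1 = fnv1a(buf, offset, length, 0x811C9DC5);
    int h2 = fnv1a(buf, offset, length, h1);
    for (int i = 0; i < hashCount; i++) {
      if (!bits.get(index(h1, h2, i))) {
        return false;
      }
    }
    return true; // possibly a false positive
  }

  /** Double hashing: the i-th position is (h1 + i * h2) mod m. */
  private int index(int h1, int h2, int i) {
    return (int) Math.floorMod((long) h1 + (long) i * h2, (long) bitSize);
  }

  /** 32-bit FNV-1a over buf[offset, offset + length), starting from seed. */
  private static int fnv1a(byte[] buf, int offset, int length, int seed) {
    int h = seed;
    for (int i = offset; i < offset + length; i++) {
      h ^= buf[i] & 0xFF;
      h *= 0x01000193;
    }
    return h;
  }
}
```

// With maxKeys = 1000 and err = 0.01 the formulas give m = 9586 bits and k = 7.
// Any key passed to add() is always reported by contains(); keys never added
// are usually, but not always, rejected.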