If HBase is used as an input source for a MapReduce job, for
example, make sure that the input Scan
instance to the MapReduce job has setCaching
set to something greater
than the default (which is 1). Using the default value means that the
map-task will make a call back to the region-server for every record
processed. Setting this value to 500, for example, will transfer 500
rows at a time to the client to be processed. There is a cost/benefit
tradeoff to having a large cache value, because larger values cost more
memory on both the client and the RegionServer, so bigger isn't always better.
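For illustration, a minimal sketch of such a Scan wired into a MapReduce job might look like the following (the table name, mapper class, and output types are placeholders):

Scan scan = new Scan();
scan.setCaching(500);  // ship 500 rows per RPC instead of the default 1
TableMapReduceUtil.initTableMapperJob(
  "myTable",         // input table name (placeholder)
  scan,
  MyMapper.class,    // mapper class (placeholder)
  Text.class,        // mapper output key type (placeholder)
  Result.class,      // mapper output value type (placeholder)
  job);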
Scan settings in MapReduce jobs deserve special attention. Timeouts (e.g., UnknownScannerException) can result in Map tasks if processing a batch of records takes longer than the scanner timeout allows before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occurring per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.
Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.
Whenever a Scan is used to process large numbers of rows (and especially when used
as a MapReduce source), be aware of which attributes are selected. If scan.addFamily
is called
then all of the attributes in the specified ColumnFamily will be returned to the client.
If only a small number of the available attributes are to be processed, then only those attributes should be specified
in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
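For example, a minimal sketch of selecting only the needed attributes (the family and qualifier names are placeholders):

Scan scan = new Scan();
// scan.addFamily(Bytes.toBytes("colFam1"));  // would return every attribute in colFam1
scan.addColumn(Bytes.toBytes("colFam1"), Bytes.toBytes("attr1"));  // return only attr1...
scan.addColumn(Bytes.toBytes("colFam1"), Bytes.toBytes("attr2"));  // ...and attr2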
When columns are selected explicitly with scan.addColumn,
HBase will schedule seek operations to seek between the
selected columns. When rows have few columns and each column has only a few versions this can be inefficient. A seek operation is generally
slower if it does not seek past at least 5-10 columns/versions or 512-1024 bytes.
In order to opportunistically look ahead a few columns/versions to see if the next column/version can be found that
way before a seek operation is scheduled, a new attribute Scan.HINT_LOOKAHEAD
can be set on the Scan object. The following code instructs the
RegionServer to attempt two iterations of next before a seek is scheduled:
Scan scan = new Scan();
scan.addColumn(...);
scan.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));
table.getScanner(scan);
For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to have the same Input Split (i.e., the RegionServer serving the data), see the Troubleshooting Case Study in Section 13.3.1, “Case Study #1 (Performance Issue On A Single Node)”.
This isn't so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanners you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in try/finally blocks...
Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();
Scan instances can be set to use the block cache in the RegionServer via the
setCacheBlocks method. For input Scans to MapReduce jobs, this should be
false. For frequently accessed rows, it is advisable to use the block
cache.
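For example (a minimal sketch):

Scan scan = new Scan();
scan.setCacheBlocks(false);  // input Scan for a MapReduce job: avoid churning the block cache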
When performing a table scan
where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a
MUST_PASS_ALL operator to the scanner using setFilter. The filter list
should include both a FirstKeyOnlyFilter and a KeyOnlyFilter.
Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk
and minimal network traffic to the client for a single row.
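A sketch of such a row-key-only scan:

Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // stop after the first KeyValue of each row
filters.addFilter(new KeyOnlyFilter());       // strip the value, returning only the key
scan.setFilter(filters);
ResultScanner rs = htable.getScanner(scan);   // remember to close rs (see above)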
When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have too few regions, the reads could be served from too few nodes.
See Section 11.7.2, “Table Creation: Pre-Creating Regions”, as well as Section 11.4, “HBase Configurations”.
Enabling Bloom Filters can save you from having to go to disk and can help improve read latencies.
Bloom filters were developed in HBASE-1200 Add bloomfilters.[25][26]
See also Section 11.6.4, “Bloom Filters”.
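Bloom filters are specified per column family. As an illustration, a sketch using the client API (note that the BloomType enum's location differs across HBase versions, e.g. StoreFile.BloomType in 0.92/0.94):

HColumnDescriptor cf = new HColumnDescriptor("colFam1");  // column family name is a placeholder
cf.setBloomFilterType(BloomType.ROW);  // or BloomType.ROWCOL to hash row+column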
Bloom filters add an entry to the StoreFile
general FileInfo
data structure and then two
extra entries to the StoreFile
metadata
section.
FileInfo has a BLOOM_FILTER_TYPE entry which is set to NONE, ROW or ROWCOL.
BLOOM_FILTER_META holds Bloom Size, Hash Function used, etc. It is small in size and is cached on StoreFile.Reader load.
BLOOM_FILTER_DATA is the actual bloom filter data. It is obtained on-demand and stored in the LRU cache, if the cache is enabled (it is enabled by default).
io.hfile.bloom.enabled in Configuration serves as the kill switch in case something goes wrong. Default = true.
io.hfile.bloom.error.rate = average false positive rate. Default = 1%. Decreasing the rate by ½ (e.g., to .5%) costs +1 bit per bloom entry.
io.hfile.bloom.max.fold = guaranteed minimum fold rate. Most people should leave this alone. Default = 7, meaning the bloom can collapse to at least 1/128th of its original size. See the Development Process section of the document BloomFilters in HBase for more on what this option means; a configuration sketch covering all three properties follows below.
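These properties are normally set in hbase-site.xml on the RegionServers; purely as an illustration, the same values could be set programmatically on a Hadoop Configuration (the values shown are the defaults):

Configuration conf = HBaseConfiguration.create();
conf.setBoolean("io.hfile.bloom.enabled", true);    // kill switch
conf.setFloat("io.hfile.bloom.error.rate", 0.01f);  // average false positive rate (1%)
conf.setInt("io.hfile.bloom.max.fold", 7);          // guaranteed minimum fold rate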
[25] For a description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBASE-1200.
[26] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch, though again it starts from the One-Lab work.