HBase

HBase is a scalable, distributed database built on Hadoop Core.

Requirements

Java 1.6.x, preferably from Sun. Use the latest version available.
This version of HBase will only run on Hadoop 0.20.x.
ssh must be installed and sshd must be running to use Hadoop's scripts to manage remote Hadoop daemons.
HBase depends on ZooKeeper as of release 0.20.0. Clients and Servers now must know where their ZooKeeper Quorum locations before they can do anything else. In basic standalone and pseudo-distributed modes, HBase manages a ZooKeeper instance for you but it is required that you run a ZooKeeper Quorum when running HBase fully distributed (More on this below). The Zookeeper addition changes how some core HBase configuration is done.
Hosts must be able to resolve the fully-qualified domain name of the master.
HBase currently is a file handle hog. The usual default of 1024 on *nix systems is insufficient if you are loading any significant amount of data into regionservers. See the FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs? for how to up the limit. Also, as of 0.18.x hadoop, datanodes have an upper-bound on the number of threads they will support (dfs.datanode.max.xcievers). Default is 256. If loading lots of data into hbase, up this limit on your hadoop cluster.
The clocks on cluster members should be in basic alignments. Some skew is tolerable but wild skew can generate odd behaviors. Run NTP on your cluster, or an equivalent.
This is a list of patches we recommend you apply to your running Hadoop cluster:
- HADOOP-4681 "DFSClient block read failures cause open DFSInputStream to become unusable". This patch will help with the ever-popular, "No live nodes contain current block". The hadoop version bundled with hbase has this patch applied. Its an HDFS client fix so this should do for usual usage but if your cluster is missing the patch, and in particular if calling hbase from a mapreduce job, you may run into this issue.

Windows

If you are running HBase on Windows, you must install Cygwin. Additionally, it is strongly recommended that you add or append to the following environment variables. If you install Cygwin in a location that is not C:\cygwin you should modify the following appropriately.

HOME=c:\cygwin\home\jim
ANT_HOME=(wherever you installed ant)
JAVA_HOME=(wherever you installed java) 
PATH=C:\cygwin\bin;%JAVA_HOME%\bin;%ANT_HOME%\bin; other windows stuff 
SHELL=/bin/bash

For additional information, see the Hadoop Quick Start Guide

Getting Started

What follows presumes you have obtained a copy of HBase, see Releases, and are installing for the first time. If upgrading your HBase instance, see Upgrading.

Three modes are described: standalone, pseudo-distributed (where all servers are run on a single host), and distributed. If new to hbase start by following the standalone instruction.

Whatever your mode, define ${HBASE_HOME} to be the location of the root of your HBase installation, e.g. /user/local/hbase. Edit ${HBASE_HOME}/conf/hbase-env.sh. In this file you can set the heapsize for HBase, etc. At a minimum, set JAVA_HOME to point at the root of your Java installation.

Standalone Mode

If you are running a standalone operation, there should be nothing further to configure; proceed to Running and Confirming Your Installation. If you are running a distributed operation, continue reading.

Distributed Operation: Pseudo- and Fully-Distributed Modes

Distributed mode requires an instance of the Hadoop Distributed File System (DFS). See the Hadoop requirements and instructions for how to set up a DFS.

Pseudo-Distributed Operation

A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 9000 on your local machine:

<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  ...
</configuration>

Note: Let hbase create the directory. If you don't, you'll get warning saying hbase needs a migration run because the directory is missing files expected by hbase (it'll create them if you let it).

Fully-Distributed Operation

For running a fully-distributed operation on more than one host, the following configurations must be made in addition to those described in the pseudo-distributed operation section above. In this mode, a ZooKeeper cluster is required.

In hbase-site.xml, set hbase.cluster.distributed to 'true'.

<configuration>
  ...
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  ...
</configuration>

In fully-distributed operation, you probably want to change your hbase.rootdir from localhost to the name of the node running the HDFS namenode. In addition to hbase-site.xml changes, a fully-distributed operation requires that you modify ${HBASE_HOME}/conf/regionservers. The regionserver file lists all hosts running HRegionServers, one host per line (This file in HBase is like the hadoop slaves file at ${HADOOP_HOME}/conf/slaves).

A distributed HBase depends on a running ZooKeeper cluster. The ZooKeeper configuration file for HBase is stored at ${HBASE_HOME}/conf/zoo.cfg. See the ZooKeeper Getting Started Guide for information about the format and options of that file. Specifically, look at the Running Replicated ZooKeeper section. After configuring zoo.cfg, in ${HBASE_HOME}/conf/hbase-env.sh, set the following to tell HBase to STOP managing its instance of ZooKeeper.

  ...
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false

Though not recommended, it can be convenient having HBase continue to manage ZooKeeper even when in distributed mode (It can be good when testing or taking hbase for a testdrive). Change ${HBASE_HOME}/conf/zoo.cfg and set the server.0 property to the IP of the node that will be running ZooKeeper (Leaving the default value of "localhost" will make it impossible to start HBase).

  ...
server.0=example.org:2888:3888

Then on the example.org server do the following before running HBase.

${HBASE_HOME}/bin/hbase-daemon.sh start zookeeper

To stop ZooKeeper, after you've shut down hbase, do:

${HBASE_HOME}/bin/hbase-daemon.sh stop zookeeper

Be aware that this option is only recommanded for testing purposes as a failure on that node would render HBase unusable.

Of note, if you have made HDFS client configuration on your hadoop cluster, HBase will not see this configuration unless you do one of the following:

Add a pointer to your HADOOP_CONF_DIR to CLASSPATH in hbase-env.sh
Add a copy of hadoop-site.xml to ${HBASE_HOME}/conf, or
If only a small set of HDFS client configurations, add them to hbase-site.xml

An example of such an HDFS client configuration is dfs.replication. If for example, you want to run with a replication factor of 5, hbase will create files with the default of 3 unless you do the above to make the configuration available to HBase.

Running and Confirming Your Installation

If you are running in standalone, non-distributed mode, HBase by default uses the local filesystem.

If you are running a distributed cluster you will need to start the Hadoop DFS daemons and ZooKeeper Quorum before starting HBase and stop the daemons after HBase has shut down.

Start and stop the Hadoop DFS daemons by running ${HADOOP_HOME}/bin/start-dfs.sh. You can ensure it started properly by testing the put and get of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons. These do not need to be started.

Start up your ZooKeeper cluster.

Start HBase with the following command:

${HBASE_HOME}/bin/start-hbase.sh

Once HBase has started, enter ${HBASE_HOME}/bin/hbase shell to obtain a shell against HBase from which you can execute commands. Test your installation by creating, viewing, and dropping To stop HBase, exit the HBase shell and enter:

${HBASE_HOME}/bin/stop-hbase.sh

If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.

The default location for logs is ${HBASE_HOME}/logs.

HBase also puts up a UI listing vital attributes. By default its deployed on the master host at port 60010 (HBase regionservers listen on port 60020 by default and put up an informational http server at 60030).

Upgrading

After installing a new HBase on top of data written by a previous HBase version, before starting your cluster, run the ${HBASE_DIR}/bin/hbase migrate migration script. It will make any adjustments to the filesystem data under hbase.rootdir necessary to run the HBase version. It does not change your install unless you explicitly ask it to.

Example API Usage

For sample Java code, see org.apache.hadoop.hbase.client documentation.

If your client is NOT Java, consider the Thrift or REST libraries.