HBase

What follows presumes you have obtained a copy of HBase and are installing for the first time. If upgrading your HBase instance, see Upgrading.

Define ${HBASE_HOME} to be the location of the root of your HBase installation, e.g. /usr/local/hbase. Edit ${HBASE_HOME}/conf/hbase-env.sh. In this file you can set the heap size for HBase, among other options. At a minimum, set JAVA_HOME to point at the root of your Java installation.

If you are running a standalone operation, there should be nothing further to configure; proceed to Running and Confirming Your Installation. If you are running a distributed operation, continue reading.

Distributed Operation

Distributed mode requires an instance of the Hadoop Distributed File System (DFS) and a ZooKeeper cluster. See the Hadoop requirements and instructions for how to set up a DFS. See the ZooKeeper Getting Started Guide for information about the ZooKeeper distributed coordination service. If you do not configure a ZooKeeper cluster, HBase will manage a single-instance ZooKeeper service for you, running on the master node. This is intended for development and local testing only. It SHOULD NOT be used in a fully-distributed production operation.

Pseudo-Distributed Operation

A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 9000 on your local machine:
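
A minimal hbase-site.xml fragment matching that description might look like the following. The value hdfs://localhost:9000/hbase is an assumption derived from the description above; adjust the host and port to match your namenode.

```xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- Assumed value: /hbase on the HDFS whose namenode is at localhost:9000 -->
    <value>hdfs://localhost:9000/hbase</value>
  </property>
</configuration>
```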

Note: Let HBase create the directory. If you create it yourself, you will get a warning saying HBase needs a migration run, because the directory is missing files HBase expects (it will create them if you let it).

Fully-Distributed Operation

Keep in mind that for a fully-distributed operation, you will probably not want your hbase.rootdir to point to localhost; point it at the hostname of your namenode instead (example.org, say). In addition to hbase-site.xml, a fully-distributed operation requires that you also modify ${HBASE_HOME}/conf/regionservers. The regionservers file lists all the hosts running HRegionServers, one host per line (this file in HBase is like the Hadoop slaves file at ${HADOOP_HOME}/conf/slaves).

Furthermore, you should configure a distributed ZooKeeper cluster. The ZooKeeper configuration file is stored at ${HBASE_HOME}/conf/zoo.cfg. See the ZooKeeper Getting Started Guide for information about the format and options of that file. Specifically, look at the Running Replicated ZooKeeper section. In ${HBASE_HOME}/conf/hbase-env.sh, set HBASE_MANAGES_ZK=false to tell HBase not to manage its own single instance ZooKeeper service.

Of note, if you have made HDFS client configuration changes on your Hadoop cluster, HBase will not see this configuration unless you do one of the following:

Running and Confirming Your Installation

If you are running in standalone, non-distributed mode, HBase by default uses the local filesystem.

If you are running a distributed cluster, you will need to start the Hadoop DFS daemons before starting HBase and stop the daemons after HBase has shut down. Start the Hadoop DFS daemons by running ${HADOOP_HOME}/bin/start-dfs.sh and stop them by running ${HADOOP_HOME}/bin/stop-dfs.sh. You can ensure DFS started properly by testing the put and get of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons, so these do not need to be started.

Once HBase has started, enter ${HBASE_HOME}/bin/hbase shell to obtain a shell against HBase from which you can execute commands. Test your installation by creating, viewing, and dropping a table. To stop HBase, exit the HBase shell and run ${HBASE_HOME}/bin/stop-hbase.sh.

If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.

HBase also puts up a UI listing vital attributes. By default it is deployed on the master host at port 60010 (HBase regionservers listen on port 60020 by default and put up an informational HTTP server at port 60030).
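
As a quick way to confirm that these ports are answering, a small JDK-only probe can help. This is a sketch, not an HBase tool: the class name PortCheckSketch and the method isListening are hypothetical, and it assumes the master and a regionserver run on the host you pass in (localhost if none is given).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of HBase: checks whether the default
// HBase ports accept TCP connections.
class PortCheckSketch {

  // Return true if a TCP connection to host:port succeeds within timeoutMillis.
  static boolean isListening(String host, int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      return true;
    } catch (IOException e) {
      // Refused, unreachable, or timed out: treat as not listening.
      return false;
    }
  }

  public static void main(String[] args) {
    // Assumption: the daemons run on this machine unless a host is given.
    String host = args.length > 0 ? args[0] : "localhost";
    System.out.println("master UI (60010): " + isListening(host, 60010, 2000));
    System.out.println("regionserver (60020): " + isListening(host, 60020, 2000));
    System.out.println("regionserver UI (60030): " + isListening(host, 60030, 2000));
  }
}
```

A successful connection only shows that something is listening on the port; opening the master UI in a browser remains the better check that HBase itself is healthy.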

Upgrading

After installing a new HBase on top of data written by a previous HBase version, and before starting your cluster, run the ${HBASE_HOME}/bin/hbase migrate migration script. It makes any adjustments to the filesystem data under hbase.rootdir that are necessary to run the new HBase version. It does not change your install unless you explicitly ask it to.

Example API Usage

Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. Here's an example of what a simple client might look like. This example assumes that you've created a table called "myTable" with a column family called "myColumnFamily".

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class MyClient {

  public static void main(String args[]) throws IOException {
    // You need a configuration object to tell the client where to connect.
    // But don't worry, the defaults are pulled from the local config file.
    HBaseConfiguration config = new HBaseConfiguration();

    // This instantiates an HTable object that connects you to the "myTable"
    // table. 
    HTable table = new HTable(config, "myTable");

    // To do any sort of update on a row, you use an instance of the BatchUpdate
    // class. A BatchUpdate takes a row and optionally a timestamp which your
    // updates will affect.  If no timestamp, the server applies current time
    // to the edits.
    BatchUpdate batchUpdate = new BatchUpdate("myRow");

    // The BatchUpdate#put method takes a byte [] (or String) that designates
    // what cell you want to put a value into, and a byte array that is the
    // value you want to store. Note that if you want to store Strings, you
    // have to getBytes() from the String for HBase to store it since HBase is
    // all about byte arrays. The same goes for primitives like ints and longs
    // and user-defined classes - you must find a way to reduce it to bytes.
    // The Bytes class from the hbase util package has utility for going from
    // String to utf-8 bytes and back again and help for other base types.
    batchUpdate.put("myColumnFamily:columnQualifier1", 
      Bytes.toBytes("columnQualifier1 value!"));

    // Deletes are batch operations in HBase as well. 
    batchUpdate.delete("myColumnFamily:cellIWantDeleted");

    // Once you've done all the puts you want, you need to commit the results.
    // The HTable#commit method takes the BatchUpdate instance you've been 
    // building and pushes the batch of changes you made into HBase.
    table.commit(batchUpdate);

    // Now, to retrieve the data we just wrote. The values that come back are
    // Cell instances. A Cell is a combination of the value as a byte array and
    // the timestamp the value was stored with. If you happen to know that the 
    // value contained is a string and want an actual string, then you must 
    // convert it yourself.
    Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1");
    // This could throw a NullPointerException if there was no value at the cell
    // location.
    String valueStr = Bytes.toString(cell.getValue());
    
    // Sometimes, you won't know the row you're looking for. In this case, you
    // use a Scanner. This will give you cursor-like interface to the contents
    // of the table.
    Scanner scanner = 
      // we want to get back only "myColumnFamily:columnQualifier1" when we iterate
      table.getScanner(new String[]{"myColumnFamily:columnQualifier1"});
    
    
    // Scanners return RowResult instances. A RowResult is like the
    // row key and the columns all wrapped up in a single Object. 
    // RowResult#getRow gives you the row key. RowResult also implements 
    // Map, so you can get to your column results easily. 
    
    // Now, for the actual iteration. One way is to use a while loop like so:
    RowResult rowResult = scanner.next();
    
    while (rowResult != null) {
      // print out the row we found and the columns we were looking for
      System.out.println("Found row: " + Bytes.toString(rowResult.getRow()) +
        " with value: " + rowResult.get(Bytes.toBytes("myColumnFamily:columnQualifier1")));
      rowResult = scanner.next();
    }
    
    // The other approach is to use a foreach loop. Scanners are iterable!
    // Use one approach or the other: the while loop above has already
    // exhausted this scanner.
    for (RowResult result : scanner) {
      // print out the row we found and the columns we were looking for
      System.out.println("Found row: " + Bytes.toString(result.getRow()) +
        " with value: " + result.get(Bytes.toBytes("myColumnFamily:columnQualifier1")));
    }
    
    // Make sure you close your scanners when you are done!
    // It's probably best to put the iteration into a try/finally with the below
    // inside the finally clause.
    scanner.close();
  }
}
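
The comments in the example note that HBase stores only byte arrays, so Strings and primitives must be reduced to bytes first. The sketch below shows that idea using only the JDK. It is not the HBase Bytes class: the class name ByteCodecSketch and its methods are hypothetical, and the big-endian layout for longs is an assumption of this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: JDK-only conversions between
// common Java types and byte arrays.
class ByteCodecSketch {

  // Encode a String as UTF-8 bytes.
  static byte[] fromString(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Decode UTF-8 bytes back into a String.
  static String toStringValue(byte[] b) {
    return new String(b, StandardCharsets.UTF_8);
  }

  // Encode a long as 8 bytes, most significant byte first.
  static byte[] fromLong(long v) {
    return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
  }

  // Decode 8 big-endian bytes back into a long.
  static long toLong(byte[] b) {
    return ByteBuffer.wrap(b).getLong();
  }

  public static void main(String[] args) {
    byte[] cellValue = fromString("columnQualifier1 value!");
    System.out.println(toStringValue(cellValue));
    System.out.println(toLong(fromLong(42L)));
  }
}
```

Whatever encoding you pick, use the same one when writing and reading a cell; HBase itself does not record which type the bytes represent.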

There are many other methods for putting data into and getting data out of HBase, but these examples should get you started. See the HTable javadoc for more methods. Additionally, there are methods for managing tables in the HBaseAdmin class.
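
The example's closing comment recommends wrapping scanner iteration in a try/finally. Here is a minimal sketch of that pattern. So that it runs without a cluster it uses DemoScanner, a hypothetical stand-in that merely imitates a scanner's iterate-then-close shape; it is not an HBase class.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ScannerCloseSketch {

  // Hypothetical stand-in for an HBase scanner: iterable rows plus close().
  static class DemoScanner implements Iterable<String>, Closeable {
    private final List<String> rows;
    boolean closed = false;

    DemoScanner(String... rows) {
      this.rows = Arrays.asList(rows);
    }

    @Override
    public Iterator<String> iterator() {
      return rows.iterator();
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Collect every row, closing the scanner even if iteration throws.
  static List<String> readAll(DemoScanner scanner) {
    List<String> found = new ArrayList<String>();
    try {
      for (String row : scanner) {
        found.add(row);
      }
    } finally {
      scanner.close();
    }
    return found;
  }

  public static void main(String[] args) {
    DemoScanner scanner = new DemoScanner("row1", "row2");
    System.out.println(readAll(scanner) + " closed=" + scanner.closed);
  }
}
```

Putting close() in the finally clause means the scanner is released whether the loop finishes normally or an exception escapes from it.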

If your client is NOT Java, then you should consider the Thrift or REST libraries.

Requirements