HBase is a scalable, distributed database built on Hadoop Core.

Requirements

Getting Started

What follows presumes you have obtained a copy of HBase and are installing for the first time. If upgrading your HBase instance, see Upgrading.

Define ${HBASE_HOME} to be the location of the root of your HBase installation, e.g. /user/local/hbase. Edit ${HBASE_HOME}/conf/hbase-env.sh. In this file you can set the heapsize for HBase, etc. At a minimum, set JAVA_HOME to point at the root of your Java installation.

If you are running a standalone operation, there should be nothing further to configure; proceed to Running and Confirming Your Installation. If you are running a distributed operation, continue reading.

Distributed Operation

Distributed mode requires an instance of the Hadoop Distributed File System (DFS). See the Hadoop requirements and instructions for how to set up a DFS.

Pseudo-Distributed Operation

A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 9000 on your local machine:

<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  ...
</configuration>

Fully-Distributed Operation

For running a fully-distributed operation on more than one host, the following configurations must be made in addition to those described in the pseudo-distributed operation section above. In hbase-site.xml, you must also configure hbase.master to the host:port pair on which the HMaster runs (read about the HBase master, regionservers, etc). For example, adding the below to your hbase-site.xml says the master is up on port 60000 on the host example.org:

<configuration>
  ...
  <property>
    <name>hbase.master</name>
    <value>example.org:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>
  ...
</configuration>

Keep in mind that for a fully-distributed operation, you may not want your hbase.rootdir to point to localhost (maybe, as in the configuration above, you will want to use example.org). In addition to hbase-site.xml, a fully-distributed operation requires that you also modify ${HBASE_HOME}/conf/regionservers. regionserver lists all the hosts running HRegionServers, one host per line (This file in HBase is like the hadoop slaves file at ${HADOOP_HOME}/conf/slaves).

Of note, if you have made HDFS client configuration on your hadoop cluster, hbase will not see this configuration unless you do one of the following:

An example of such an HDFS client configuration is dfs.replication. If for example, you want to run with a replication factor of 5, hbase will create files with the default of 3 unless you do the above to make the configuration available to hbase.

Running and Confirming Your Installation

If you are running in standalone, non-distributed mode, HBase by default uses the local filesystem.

If you are running a distributed cluster you will need to start the Hadoop DFS daemons before starting HBase and stop the daemons after HBase has shut down. Start and stop the Hadoop DFS daemons by running ${HADOOP_HOME}/bin/start-dfs.sh. You can ensure it started properly by testing the put and get of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons. These do not need to be started.

Start HBase with the following command:

${HBASE_HOME}/bin/start-hbase.sh

Once HBase has started, enter ${HBASE_HOME}/bin/hbase shell to obtain a shell against HBase from which you can execute commands. Test your installation by creating, viewing, and dropping To stop HBase, exit the HBase shell and enter:

${HBASE_HOME}/bin/stop-hbase.sh

If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.

The default location for logs is ${HBASE_HOME}/logs.

HBase also puts up a UI listing vital attributes. By default its deployed on the master host at port 60010 (HBase regionservers listen on port 60020 by default and put up an informational http server at 60030).

Upgrading

After installing a new HBase on top of data written by a previous HBase version, before starting your cluster, run the ${HBASE_DIR}/bin/hbase migrate migration script. It will make any adjustments to the filesystem data under hbase.rootdir necessary to run the HBase version. It does not change your install unless you explicitly ask it to.

Example API Usage

Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. Here's an example of what a simple client might look like. This example assumes that you've created a table called "myTable" with a column family called "myColumnFamily".

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;

public class MyClient {

  public static void main(String args[]) throws IOException {
    // You need a configuration object to tell the client where to connect.
    // But don't worry, the defaults are pulled from the local config file.
    HBaseConfiguration config = new HBaseConfiguration();

    // This instantiates an HTable object that connects you to the "myTable"
    // table. 
    HTable table = new HTable(config, "myTable");

    // To do any sort of update on a row, you use an instance of the BatchUpdate
    // class. A BatchUpdate takes a row and optionally a timestamp which your
    // updates will affect. 
    BatchUpdate batchUpdate = new BatchUpdate("myRow");

    // The BatchUpdate#put method takes a Text that describes what cell you want
    // to put a value into, and a byte array that is the value you want to 
    // store. Note that if you want to store strings, you have to getBytes() 
    // from the string for HBase to understand how to store it. (The same goes
    // for primitives like ints and longs and user-defined classes - you must 
    // find a way to reduce it to bytes.)
    batchUpdate.put("myColumnFamily:columnQualifier1", 
      "columnQualifier1 value!".getBytes());

    // Deletes are batch operations in HBase as well. 
    batchUpdate.delete("myColumnFamily:cellIWantDeleted");

    // Once you've done all the puts you want, you need to commit the results.
    // The HTable#commit method takes the BatchUpdate instance you've been 
    // building and pushes the batch of changes you made into HBase.
    table.commit(batchUpdate);

    // Now, to retrieve the data we just wrote. The values that come back are
    // Cell instances. A Cell is a combination of the value as a byte array and
    // the timestamp the value was stored with. If you happen to know that the 
    // value contained is a string and want an actual string, then you must 
    // convert it yourself.
    Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1");
    String valueStr = new String(cell.getValue());
    
    // Sometimes, you won't know the row you're looking for. In this case, you
    // use a Scanner. This will give you cursor-like interface to the contents
    // of the table.
    Scanner scanner = 
      // we want to get back only "myColumnFamily:columnQualifier1" when we iterate
      table.getScanner(new String[]{"myColumnFamily:columnQualifier1"});
    
    
    // Scanners in HBase 0.2 return RowResult instances. A RowResult is like the
    // row key and the columns all wrapped up in a single interface. 
    // RowResult#getRow gives you the row key. RowResult also implements 
    // Map, so you can get to your column results easily. 
    
    // Now, for the actual iteration. One way is to use a while loop like so:
    RowResult rowResult = scanner.next();
    
    while(rowResult != null) {
      // print out the row we found and the columns we were looking for
      System.out.println("Found row: " + new String(rowResult.getRow()) + " with value: " +
       rowResult.get("myColumnFamily:columnQualifier1".getBytes()));
      
      rowResult = scanner.next();
    }
    
    // The other approach is to use a foreach loop. Scanners are iterable!
    for (RowResult result : scanner) {
      // print out the row we found and the columns we were looking for
      System.out.println("Found row: " + new String(result.getRow()) + " with value: " +
       result.get("myColumnFamily:columnQualifier1".getBytes()));
    }
    
    // Make sure you close your scanners when you are done!
    scanner.close();
  }
}

There are many other methods for putting data into and getting data out of HBase, but these examples should get you started. See the HTable javadoc for more methods. Additionally, there are methods for managing tables in the HBaseAdmin class.

If your client is NOT Java, then you should consider the Thrift or REST libraries.

Related Documentation