HBase has two run modes: Section 2.2.1, “Standalone HBase” and Section 2.2.2, “Distributed”. Out of the box, HBase runs in standalone mode. To set up a distributed deploy, you will need to configure HBase by editing files in the HBase conf directory.
Whatever your mode, you will need to edit conf/hbase-env.sh to tell HBase which java to use. In this file you set HBase environment variables such as the heapsize and other options for the JVM, the preferred location for log files, etc. Set JAVA_HOME to point at the root of your java install.
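As a rough illustration of the settings described above, a conf/hbase-env.sh fragment might look like the following. The paths and the heap value are placeholders chosen for this sketch, not recommendations. HBASE_HEAPSIZE and HBASE_LOG_DIR are the variable names used in the stock hbase-env.sh as far as we know, so check them against the copy shipped with your release.

```shell
# conf/hbase-env.sh (fragment; every value below is an illustrative assumption)

# Root of the Java install that HBase should use (hypothetical path).
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Maximum heap for the HBase JVMs, in MB (example value).
export HBASE_HEAPSIZE=1000

# Preferred location for HBase log files (hypothetical path).
export HBASE_LOG_DIR=/var/log/hbase
```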
Standalone mode is the default and is the mode described in Section 1.2, “Quick Start”. In standalone mode, HBase does not use HDFS; it uses the local filesystem instead, and it runs all HBase daemons and a local ZooKeeper in the same JVM. ZooKeeper binds to a well-known port so clients may talk to HBase.
Distributed mode can be subdivided into pseudo-distributed, where all daemons run on a single node, and fully-distributed, where the daemons are spread across all nodes in the cluster [9].
Distributed modes require an instance of the Hadoop Distributed File System (HDFS). See the Hadoop requirements and instructions for how to set up an HDFS. Before proceeding, ensure you have an appropriate, working HDFS.
Below we describe the different distributed setups. Starting, verifying, and exploring your install, whether pseudo-distributed or fully-distributed, is described in a later section, Section 2.2.3, “Running and Confirming Your Installation”. The same verification script applies to both deploy types.
A pseudo-distributed mode is simply a distributed mode run on a single host. Use this configuration for testing and prototyping on HBase. Do not use this configuration for production or for evaluating HBase performance.
First, set up your HDFS in pseudo-distributed mode.
Next, configure HBase. Below is an example conf/hbase-site.xml. This is the file into which you add local customizations and overrides for ??? and Section 2.2.2.2.3, “HDFS Client Configuration”. Note that the hbase.rootdir property points to the local HDFS instance.
Now skip to Section 2.2.3, “Running and Confirming Your Installation” for how to start and verify your pseudo-distributed install. [10]
Let HBase create the hbase.rootdir directory. If you don't, you'll get a warning saying HBase needs a migration run because the directory is missing files expected by HBase (it'll create them if you let it).
Below is a sample pseudo-distributed hbase-site.xml file for the node h-24-30.example.com.
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://h-24-30.example.com:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>h-24-30.example.com</value>
  </property>
  ...
</configuration>
To start up the initial HBase cluster...
% bin/start-hbase.sh
To start an extra backup master on the same server, run...
% bin/local-master-backup.sh start 1
... the '1' means use ports 60001 & 60011, and this backup master's logfile will be at logs/hbase-${USER}-1-master-${HOSTNAME}.log.
To start multiple backup masters, run...
% bin/local-master-backup.sh start 2 3
You can start up to 9 backup masters (10 total).
To start up more regionservers...
% bin/local-regionservers.sh start 1
where '1' means use ports 60201 & 60301 and its logfile will be at logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log.
To add 4 more regionservers in addition to the one you just started, run...
% bin/local-regionservers.sh start 2 3 4 5
This supports up to 99 extra regionservers (100 total).
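The port numbering in the examples above can be read as a fixed base plus the offset given on the command line. The sketch below assumes exactly that, with bases of 60000/60010 for masters and 60200/60300 for regionservers inferred from the '1' examples. The function names are hypothetical and are not part of HBase.

```shell
# Hypothetical helpers: print the two ports an extra local daemon would use.
# The base ports are inferred from the examples in the text, not read from HBase.

backup_master_ports() {
  offset="$1"
  # First port for the master itself, second for its info server.
  echo "$((60000 + offset)) $((60010 + offset))"
}

extra_regionserver_ports() {
  offset="$1"
  # First port for the regionserver itself, second for its info server.
  echo "$((60200 + offset)) $((60300 + offset))"
}

backup_master_ports 1        # prints: 60001 60011
extra_regionserver_ports 5   # prints: 60205 60305
```

Working the ports out ahead of time is mostly useful for spotting collisions with other services on the same host before starting the extra daemons.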
For running a fully-distributed operation on more than one host, make the following configurations. In hbase-site.xml, add the property hbase.cluster.distributed and set it to true, and point the HBase hbase.rootdir at the appropriate HDFS NameNode and the location in HDFS where you would like HBase to write data. For example, if your NameNode were running at namenode.example.org on port 8020 and you wanted to home your HBase in HDFS at /hbase, make the following configuration.
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  ...
</configuration>
In addition, a fully-distributed mode requires that you modify conf/regionservers. The Section 2.4.1.2, “regionservers” file lists all hosts that you would have running HRegionServers, one host per line (this file in HBase is like the Hadoop slaves file). All servers listed in this file will be started and stopped when HBase cluster start or stop is run.
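A minimal sketch of producing such a file follows. The write_regionservers helper and the hostnames are hypothetical; the only thing taken from the text is the format, one host per line.

```shell
# Hypothetical helper: write one hostname per line to the given file,
# replacing whatever the file contained before.
write_regionservers() {
  target_file="$1"
  shift
  printf '%s\n' "$@" > "${target_file}"
}

# Example call (hostnames are placeholders for your own RegionServer hosts):
# write_regionservers "${HBASE_HOME}/conf/regionservers" \
#   rs1.example.org rs2.example.org rs3.example.org
```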
See Chapter 16, ZooKeeper for how to set up ZooKeeper for HBase.
Of note, if you have made HDFS client configuration on your Hadoop cluster -- i.e. configuration you want HDFS clients to use as opposed to server-side configurations -- HBase will not see this configuration unless you do one of the following:
Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.
Add a copy of hdfs-site.xml (or hadoop-site.xml) or, better, symlinks, under ${HBASE_HOME}/conf, or
if only a small set of HDFS client configurations, add them to hbase-site.xml.
An example of such an HDFS client configuration is dfs.replication. If, for example, you want to run with a replication factor of 5, HBase will create files with the default of 3 unless you do the above to make the configuration available to HBase.
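The symlink option above might be carried out along these lines. This is a sketch under assumptions: link_hdfs_client_conf is a hypothetical helper, and the directories passed to it depend on where Hadoop and HBase are installed on your hosts. The classpath option is shown as a comment because it belongs in conf/hbase-env.sh rather than in a script.

```shell
# Hypothetical helper: expose Hadoop's hdfs-site.xml to HBase through a symlink.
# $1 = Hadoop configuration directory, $2 = HBase conf directory.
link_hdfs_client_conf() {
  hadoop_conf_dir="$1"
  hbase_conf_dir="$2"
  ln -s "${hadoop_conf_dir}/hdfs-site.xml" "${hbase_conf_dir}/hdfs-site.xml"
}

# Example call (paths are assumptions about your layout):
# link_hdfs_client_conf "${HADOOP_CONF_DIR}" "${HBASE_HOME}/conf"

# Alternative from the text, placed in conf/hbase-env.sh instead:
# export HBASE_CLASSPATH="${HADOOP_CONF_DIR}"
```

A symlink keeps a single authoritative copy of the file, so later changes on the Hadoop side do not have to be copied by hand.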
Make sure HDFS is running first. Start the Hadoop HDFS daemons by running bin/start-dfs.sh over in the HADOOP_HOME directory. You can ensure it started properly by testing the put and get of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons. These do not need to be started.
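One way to test the put and get mentioned above is to round-trip a small file and compare the two copies. The sketch below assumes the standard hadoop fs -put, -get and -rm subcommands. The helper name and the scratch path in HDFS are hypothetical.

```shell
# Hypothetical helper: round-trip a small file through HDFS and compare copies.
# $1 = path to the hadoop launcher, e.g. "${HADOOP_HOME}/bin/hadoop".
hdfs_put_get_check() {
  hadoop_cmd="$1"
  work_dir="$(mktemp -d)" || return 1
  remote_path="/tmp/hbase-precheck-$$"   # hypothetical scratch path in HDFS

  echo "hello hdfs" > "${work_dir}/in.txt"

  # Succeed only if the put, the get, and the byte comparison all succeed.
  if "${hadoop_cmd}" fs -put "${work_dir}/in.txt" "${remote_path}" &&
     "${hadoop_cmd}" fs -get "${remote_path}" "${work_dir}/out.txt" &&
     cmp -s "${work_dir}/in.txt" "${work_dir}/out.txt"; then
    status=0
  else
    status=1
  fi

  # Best-effort cleanup of the scratch file and the local working directory.
  "${hadoop_cmd}" fs -rm "${remote_path}" >/dev/null 2>&1
  rm -rf "${work_dir}"
  return "${status}"
}

# Example call:
# hdfs_put_get_check "${HADOOP_HOME}/bin/hadoop" && echo "HDFS looks usable"
```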
If you are managing your own ZooKeeper, start it and confirm it is running; otherwise, HBase will start up ZooKeeper for you as part of its start process.
Start HBase with the following command:
bin/start-hbase.sh
Run the above from the HBASE_HOME directory.
You should now have a running HBase instance. HBase logs can be found in the logs subdirectory. Check them out, especially if HBase had trouble starting.
HBase also puts up a UI listing vital attributes. By default it's deployed on the Master host at port 60010 (HBase RegionServers listen on port 60020 by default and put up an informational http server at 60030). If the Master were running on a host named master.example.org on the default port, to see the Master's homepage you'd point your browser at http://master.example.org:60010.
Once HBase has started, see Section 1.2.3, “Shell Exercises” for how to create tables, add data, scan your insertions, and finally disable and drop your tables.
To stop HBase after exiting the HBase shell, enter
$ ./bin/stop-hbase.sh
stopping hbase...............
Shutdown can take a moment to complete. It can take longer if your cluster is comprised of many machines. If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.
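The ordering described above can be sketched as a small wrapper. It assumes that stop-hbase.sh does not return until shutdown has finished, as the row of dots in the sample output suggests, and that HDFS is stopped with Hadoop's bin/stop-dfs.sh. The function name is hypothetical.

```shell
# Hypothetical helper: stop HBase first, then HDFS, and leave HDFS running
# if the HBase stop script reports a failure.
# $1 = HBASE_HOME, $2 = HADOOP_HOME.
stop_hbase_then_hdfs() {
  hbase_home="$1"
  hadoop_home="$2"
  "${hbase_home}/bin/stop-hbase.sh" && "${hadoop_home}/bin/stop-dfs.sh"
}

# Example call:
# stop_hbase_then_hdfs "${HBASE_HOME}" "${HADOOP_HOME}"
```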
[9] The pseudo-distributed vs fully-distributed nomenclature comes from Hadoop.
[10] See Section 2.2.2.1.2, “Pseudo-distributed Extras” for notes on how to start extra Masters and RegionServers when running pseudo-distributed.