For the most part, accumulo is ready to go out of the box. To start it, first you must distribute and install
the accumulo software to each machine in the cloud that you wish to run on. The software should be installed
in the same directory on each machine and configured identically (or at least similarly... see the configuration
sections for more details). Select one machine to be your bootstrap machine, the one that you will start accumulo
with. Note that you must have passphraseless ssh access to each machine from your bootstrap machine. On this machine,
create a conf/masters file and a conf/slaves file. In the masters file, type the hostname of the machine you wish to run the master on (probably localhost).
In the slaves file, list the hostname of each machine you wish to participate in accumulo as a tablet server, one hostname per line. If you neglect
to create these files, the startup scripts will assume you are trying to run on localhost only and will start a single-node instance.
It is probably a good idea to back up these files, or distribute them to the other nodes as well, so that you can easily boot up accumulo
from another machine, if necessary. You can also create a conf/accumulo-env.sh
file if you want to configure any custom environment variables.
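As a minimal sketch of the two host files described above, the commands below write one master hostname and three tablet server hostnames. The hostnames (master1, tserver1, ...) and the CONF_DIR variable are hypothetical and only for illustration; substitute the machines in your own cluster and the conf directory of your accumulo installation.

```shell
# CONF_DIR is a hypothetical helper variable; it is assumed to point at the
# conf directory of the accumulo installation on the bootstrap machine.
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

# The masters file holds the hostname of the machine that runs the master.
# "master1" is a placeholder hostname.
printf '%s\n' master1 > "$CONF_DIR/masters"

# The slaves file holds the tablet server hostnames, one per line.
# "tserver1" .. "tserver3" are placeholder hostnames.
printf '%s\n' tserver1 tserver2 tserver3 > "$CONF_DIR/slaves"
```

Keeping a copy of both files on the other nodes makes it possible to start accumulo from a different machine later.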
Once properly configured, you can initialize or prepare an instance of accumulo by running: bin/accumulo init
Follow the prompts and you are ready to go. This step only prepares accumulo to run; it does not start accumulo.
Once you have configured accumulo to your liking and distributed the appropriate configuration to each machine, you can start accumulo with bin/start-all.sh. If you later need to bring accumulo servers back online after one or more have been shut down, you can run bin/start-all.sh again; it will only start services that are not already running. Be aware that if you run this command on more than one machine, you may unintentionally start an extra copy of the garbage collector service and the monitoring service, since each of these runs on the machine on which you run the script.
Similar to the start-all.sh script, we provide a bin/stop-all.sh script to shut down accumulo. This will prompt for the root password so that it can ask the master to shut down the tablet servers gracefully. If the tablet servers do not respond, or the master takes too long, you can force a shutdown by pressing Ctrl-C at the password prompt and waiting 15 seconds for the script to force a shutdown. Normally, when the shutdown proceeds gracefully, unresponsive tablet servers are forcibly shut down after 5 seconds.
Accumulo configuration information is stored in an XML file and in ZooKeeper. System-wide configuration information is stored in accumulo-site.xml. In order for accumulo to find this file, its directory must be on the classpath. If accumulo cannot find the file, it will log a warning and use built-in default values. The accumulo scripts try to put the config directory on the classpath.
Per-table configuration was introduced in version 1.0. This information is stored in ZooKeeper and can be manipulated using the config command in the accumulo shell. ZooKeeper will notify all tablet servers when config properties are modified. This makes it possible, for example, to change major compaction settings for a table while accumulo is running.
Per-table configuration settings override system settings.
See the possible configuration options and their default values here
How disk and memory usage are allocated across the cluster, and how server processes are distributed across it, is very important.
There are a few settings that determine how much memory accumulo tablet servers use. In accumulo-env.sh there is a setting called ACCUMULO_TSERVER_OPTS. By default this is set to something like "-Xmx512m -Xms512m". These are JVM options asking Java to use 512 megabytes of memory. By default accumulo stores data written to it outside of the Java memory space in order to avoid pauses caused by the Java garbage collector. The amount of memory it uses for this data is determined by the accumulo setting "tserver.memory.maps.max". Since this memory is outside of the Java managed memory, the process can grow larger than the -Xmx setting. So if -Xmx is set to 512M and tserver.memory.maps.max is set to 1G, a tablet server process can be expected to use 1.5G. If tserver.memory.maps.native.enabled is set to false, then accumulo will only use memory managed by Java and the process will not use more than what -Xmx is set to. In this case the tserver.memory.maps.max setting should be 75% of the -Xmx setting.
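The arithmetic above can be sketched as a short shell calculation. The variable names are hypothetical and exist only for this illustration; the numbers are the ones used in the example (-Xmx of 512M and tserver.memory.maps.max of 1G), and the 75% figure is the guideline stated above for the case where native maps are disabled.

```shell
# All values are in megabytes. Variable names are hypothetical helpers,
# not accumulo settings.
xmx_mb=512          # from -Xmx512m in ACCUMULO_TSERVER_OPTS
maps_max_mb=1024    # tserver.memory.maps.max set to 1G

# Native maps enabled: map memory lives outside the Java heap, so the
# expected process size is roughly the heap plus the map memory.
native_total_mb=$((xmx_mb + maps_max_mb))
echo "native maps enabled:  expect about ${native_total_mb}M per tablet server"

# Native maps disabled (tserver.memory.maps.native.enabled=false): the maps
# share the Java heap, so size tserver.memory.maps.max at 75% of -Xmx.
heap_maps_max_mb=$((xmx_mb * 75 / 100))
echo "native maps disabled: set tserver.memory.maps.max to about ${heap_maps_max_mb}M"
```

Adjust the two input values to match your own accumulo-env.sh and accumulo-site.xml before relying on the result for capacity planning.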