Apache Slider Troubleshooting

Slider can be tricky to start using, because it combines the need to set up a YARN application with the need to have an HBase configuration that works.

Common problems

Not all the containers start, but whenever you kill one, another one comes up.

This is often caused by YARN not having enough capacity in the cluster to start up the requested set of containers. The AM has submitted a list of container requests to YARN, but only when an existing container is released or killed is one of the outstanding requests granted.
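
One way to confirm this is to look at how much capacity the cluster actually has before requests are filed against it. A sketch using the standard YARN CLI (output format varies by Hadoop version; the node ID below is hypothetical and should be taken from the list output):

# List the nodes in the cluster and their state; the ResourceManager
# web UI (usually port 8088) also shows "Memory Used" vs "Memory Total"
# for the cluster as a whole.
yarn node -list

# Show the resource capacity and utilization of a single node
# (hypothetical node ID; copy one from the previous command's output).
yarn node -status worker-node-1:45454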

Fix #1: Ask for smaller containers

Edit the yarn.memory option for the roles to be smaller: set it to 64 for a smaller YARN allocation. This does not affect the actual heap size of the deployed application component.
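
For example, a minimal sketch of a resources.json with a smaller allocation for one component (the component name, priority, and instance count here are placeholders for your own values):

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {},
  "components": {
    "worker": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "5",
      "yarn.memory": "64"
    }
  }
}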

Fix #2: Tell YARN to be less strict about memory consumption

Here are the properties in yarn-site.xml which we set to allow YARN to schedule more role instances than it nominally has room for.

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <description>Whether physical memory limits will be enforced for
    containers.
  </description>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<!-- we really don't want checking here-->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

Important: In a real cluster, the minimum size of an allocation should be larger, such as 256, to stop the RM from being overloaded. When the pmem and vmem checks are enabled, any container whose processes exceed their requested memory allocation will be killed.

The complete instance never comes up: some containers are outstanding

This means that there isn't enough capacity in the cluster to grant all of the outstanding container requests; the fixes listed above apply here too.

Slider instances are unable to create registry paths on secure clusters

This feature requires the YARN Resource Manager to securely set up the user's path in the registry:

  1. The RM must have the specific patch applied to do this. It is not in Apache Hadoop 2.6.0; it is in HDP-2.2.
  2. The registry must be enabled (see the yarn-site.xml sketch below)
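
The registry is switched on in yarn-site.xml. A sketch, assuming the standard Hadoop YARN registry property; verify the property name against your Hadoop distribution:

<property>
  <description>Enable the YARN registry, with the RM creating
    user paths in it at application-submission time.
  </description>
  <name>hadoop.registry.rm.enabled</name>
  <value>true</value>
</property>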

Application Instantiation fails, "TriggerClusterTeardownException: Unstable Cluster"

Slider gives up if it cannot keep enough instances of a role running, or more precisely, if they keep failing.

If this happens on cluster startup, it means that the application is not working.

 org.apache.slider.core.exceptions.TriggerClusterTeardownException: Unstable Cluster: 
 - failed with role worker failing 4 times (4 in startup); threshold is 2
 - last failure: Failure container_1386872971874_0001_01_000006 on host 192.168.1.86,
   see http://hor12n22.gq1.ygridcore.net:19888/jobhistory/logs/192.168.1.86:45454/container_1386872971874_0001_01_000006/ctx/yarn

This message warns that a role (here, worker) is failing to start, and that it has failed more often than the configured failure threshold allows. What it doesn't say is why it failed, because that is not something the AM knows: the cause is hidden in the logs on the container that failed.

The final bit of the exception message can help you track down the problem, as it points you to the logs.

In the example above, the failure was in container_1386872971874_0001_01_000006 on the host 192.168.1.86. If you go to the node manager on that machine (the YARN RM web page will let you do this) and look for that container, you may be able to grab its logs.

A quicker way is to browse to the URL on the next line. Note: the URL depends on yarn.log.server.url being properly configured.
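
If that log server URL is not set up, the aggregated logs can usually be fetched from the command line instead, once the containers have finished. A sketch using the standard yarn logs command, with the application ID from the example above; this assumes log aggregation is on (yarn.log-aggregation-enable set to true):

# Fetch the aggregated logs of every container in the application.
yarn logs -applicationId application_1386872971874_0001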

It is from those logs that the cause of the problem can be determined, because they are the actual output of the actual application which Slider is trying to deploy.
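
If a component is genuinely expected to fail and restart a few times while the cluster settles, the failure threshold itself can be raised per component. A sketch of a resources.json fragment; the option name yarn.container.failure.threshold is an assumption to verify against your Slider version:

{
  "components": {
    "worker": {
      "yarn.container.failure.threshold": "8"
    }
  }
}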

Configuring YARN for better debugging

One configuration option that aids debugging is to tell the node managers to keep data for a short period after containers finish:

<!-- 10 minutes after a failure to see what is left in the directory-->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>

You can then retrieve the logs either through the web UI, or by connecting to the server (usually via ssh) and reading them from the log directory.
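
A sketch of the ssh route; the host and paths here are hypothetical, since the real location is whatever yarn.nodemanager.log-dirs points to on that node:

# Log on to the node that ran the failed container (hypothetical host/user).
ssh yarn@192.168.1.86

# Container logs sit under the node manager's log directory,
# grouped by application ID and then container ID.
ls /var/log/hadoop-yarn/userlogs/application_1386872971874_0001/
cat /var/log/hadoop-yarn/userlogs/application_1386872971874_0001/container_1386872971874_0001_01_000006/stderr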

We also recommend making sure that YARN eventually kills processes that do not shut down cleanly:

<!--time before the process gets a -9 -->
<property>
  <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
  <value>30000</value>
</property>

Running HBase Shell wrapper

We have provided an HBase Shell wrapper, hbase-slider, to facilitate running shell commands without retrieving hbase-site.xml manually.

You can unpack the following scripts from the HBase app package:

hbase-slider
hbase-slider.py
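
For example, a sketch of extracting them with unzip (the package file name here is hypothetical; substitute your actual app package):

# Pull just the wrapper scripts out of the app package.
unzip slider-hbase-app-package.zip "hbase-slider*"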

Use 'chmod +x' to give hbase-slider execution permission. The syntax for using the wrapper is:

./hbase-slider hbasesliderapp

where hbasesliderapp is the name of the Slider HBase instance. The script retrieves hbase-site.xml and then runs the HBase shell.

You can issue the following command to see supported options:

./hbase-slider