High Availability:

Setup Resource Manager HA, Name Node HA, Work Preserving Resource Manager restart and Work Preserving Node Manager restart

Setting up High Availability ensures uninterrupted service provided by long running applications installed by Slider in the event of any or all of the following YARN component failures - Resource Manager, Name Node and Node Manager. This document provides details on how to configure YARN properties to achieve the corresponding high availability setup.

Resource Manager HA (automatic)

      <property>
          <name>yarn.resourcemanager.ha.enabled</name>
          <value>true</value>
      </property>

      <property>
          <name>yarn.resourcemanager.ha.rm-ids</name>
          <value>rm1,rm2</value>
      </property>

      <property>
          <name>yarn.resourcemanager.hostname.rm1</name>
          <value>192.168.1.9</value>
      </property>

      <property>
          <name>yarn.resourcemanager.hostname.rm2</name>
          <value>192.168.1.10</value>
      </property>

      <property>
          <name>yarn.resourcemanager.recovery.enabled</name>
          <value>true</value>
      </property>

      <property>
          <name>yarn.resourcemanager.store.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
      </property>

      <property>
          <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
          <value>true</value>
      </property>

      <property>
          <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
          <value>true</value>
      </property>

      <property>
          <name>yarn.client.failover-proxy-provider</name>
          <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
      </property>

      <property>
          <name>yarn.resourcemanager.zk-address</name>
          <value>192.168.1.9:2181,192.168.1.10:2181</value>
          <description>For multiple zk services, separate them with comma</description>
      </property>

      <property>
          <name>yarn.resourcemanager.cluster-id</name>
          <value>yarn-cluster</value>
      </property>

      <property>
          <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
          <value>/yarn-leader-election</value>
      </property>

      <property>
          <name>yarn.resourcemanager.am.max-attempts</name>
          <value>20</value>
      </property>

Name Node HA

Setting up NN HA is outside the scope of this document. Please refer to this HDFS HA document (r2.6.0) for setup details.

Work Preserving RM Restart

      <property>
          <description>Enable RM to recover state after starting. If true, then
                       yarn.resourcemanager.store.class must be specified
          </description>
          <name>yarn.resourcemanager.recovery.enabled</name>
          <value>true</value>
      </property>

      <property>
          <description>Enable RM work preserving recovery. This configuration is
                       private to YARN for experimenting the feature.  NOTE: this config
                       has to be set on both RM and ALL NMs
          </description>
          <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
          <value>true</value>
      </property>

      <property>
          <description>The class to use as the persistent store.</description>
          <name>yarn.resourcemanager.store.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
      </property>

      <property>
          <description>Host:Port of the ZooKeeper server where RM state will
                       be stored. This must be supplied when using 
                       org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
                       as the value for yarn.resourcemanager.store.class
          </description>
          <name>yarn.resourcemanager.zk.address</name>
          <value>127.0.0.1:2181</value>
      </property>

Work Preserving NM Restart

      <property>
          <description>Enable the node manager to recover after starting</description>
          <name>yarn.nodemanager.recovery.enabled</name>
          <value>false</value>
      </property>

      <property>
          <description>The local filesystem directory in which the node manager
                       will store state when recovery is enabled.
          </description>
          <name>yarn.nodemanager.recovery.dir</name>
          <value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
      </property>