High Availability:
Set up Resource Manager HA, Name Node HA, work-preserving Resource Manager restart, and work-preserving Node Manager restart
Setting up High Availability ensures uninterrupted service from long-running applications deployed by Slider in the event of failure of any or all of the following components: the Resource Manager, the Name Node, and the Node Manager. This document describes the YARN properties to configure for each corresponding high-availability setup.
Resource Manager HA (automatic)
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>192.168.1.9</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>192.168.1.10</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>192.168.1.9:2181,192.168.1.10:2181</value>
  <description>For multiple ZooKeeper services, separate them with commas.</description>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
  <value>/yarn-leader-election</value>
</property>
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>20</value>
</property>
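A common misconfiguration is listing an rm-id in yarn.resourcemanager.ha.rm-ids without a matching hostname entry. The following minimal sketch (not part of Slider or YARN; the helper name check_rm_ha is hypothetical) shows how one could sanity-check a yarn-site.xml fragment for this before deploying:

```python
# Sketch: sanity-check RM HA properties in a yarn-site.xml string.
# Assumption: every id in yarn.resourcemanager.ha.rm-ids needs a
# yarn.resourcemanager.hostname.<id> entry, and ha.enabled must be true.
import xml.etree.ElementTree as ET

def check_rm_ha(yarn_site_xml: str) -> list:
    """Return a list of problems found in the given configuration."""
    props = {}
    for prop in ET.fromstring(yarn_site_xml).iter("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    problems = []
    if props.get("yarn.resourcemanager.ha.enabled") != "true":
        problems.append("yarn.resourcemanager.ha.enabled is not true")
    for rm_id in (props.get("yarn.resourcemanager.ha.rm-ids") or "").split(","):
        key = "yarn.resourcemanager.hostname." + rm_id
        if key not in props:
            problems.append("missing " + key)
    return problems

# Example fragment: rm2 is listed but has no hostname entry.
example = """<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>192.168.1.9</value></property>
</configuration>"""

print(check_rm_ha(example))  # → ['missing yarn.resourcemanager.hostname.rm2']
```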
Name Node HA
Setting up Name Node HA is outside the scope of this document. Please refer to the HDFS High Availability documentation (r2.6.0) for setup details.
Work Preserving RM Restart
<property>
  <description>Enable the RM to recover state after starting. If true,
    yarn.resourcemanager.store.class must be specified.</description>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <description>Enable RM work-preserving recovery. This configuration is
    private to YARN for experimenting with the feature. NOTE: this property
    has to be set on both the RM and ALL NMs.</description>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <description>The class to use as the persistent store.</description>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <description>Host:Port of the ZooKeeper server where RM state will be stored.
    This must be supplied when using
    org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
    as the value for yarn.resourcemanager.store.class.</description>
  <name>yarn.resourcemanager.zk-address</name>
  <value>127.0.0.1:2181</value>
</property>
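During work-preserving recovery the RM waits a short period for Node Managers to re-register before it resumes scheduling. If your NMs are slow to heartbeat back after an RM restart, this wait can be tuned; the property below is our understanding of the Hadoop 2.6 knob (default 10000 ms) and should be verified against your Hadoop version:

```xml
<property>
  <description>Assumed tuning knob: how long (ms) the RM waits for NMs to
    re-register after a work-preserving restart before scheduling resumes.</description>
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
```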
Work Preserving NM Restart
<property>
  <description>Enable the node manager to recover after starting.</description>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <description>The local filesystem directory in which the node manager will
    store state when recovery is enabled.</description>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
</property>
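NM recovery also requires the Node Manager's RPC address to use a fixed port, since the default ephemeral port would change across restarts and break container reconnection. A hedged sketch, assuming the conventional port 45454 from the Hadoop documentation (pick any free, stable port for your cluster):

```xml
<property>
  <description>Use a fixed, non-ephemeral port so that containers can
    reconnect to the NM after a restart. Port 45454 is an example value.</description>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```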