Apache Slider Resource Specification

The resource specification file, resources.json, defines the YARN resource needs of each component type belonging to the application. This includes:

  • container CPU and memory requirements
  • component placement policy, including YARN labels with which to explicitly request nodes
  • failure policy: what to do if components keep failing
  • placement escalation policy
  • where logs generated by the application will be saved; this information is passed to YARN so that the logs can be copied to HDFS and retrieved remotely, even while the application is running

As such, it is the core file used by Slider to configure and manage the application.

Core Properties

An example resource requirement for an application with two components, "master" and "worker", is shown below. Slider automatically adds the requirements for the application's Application Master; this component is named "slider-appmaster".

Some parameters that can be specified for a component instance include:

yarn.component.instances Number of instances of this component type
yarn.memory Amount of memory in MB required for the component instance.
yarn.vcores Number of "virtual cores" requested
yarn.role.priority Unique priority for this component
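
A minimal sketch of such a specification, assuming the two components "master" and "worker" mentioned above; the instance counts, memory and vcore values are purely illustrative:

{
  "schema" : "http://example.org/specification/v2.0.0",
  "metadata" : { },
  "global" : { },
  "components" : {
    "master" : {
      "yarn.role.priority" : "1",
      "yarn.component.instances" : "1",
      "yarn.memory" : "1024",
      "yarn.vcores" : "1"
    },
    "worker" : {
      "yarn.role.priority" : "2",
      "yarn.component.instances" : "4",
      "yarn.memory" : "2048",
      "yarn.vcores" : "1"
    },
    "slider-appmaster" : {
      "yarn.memory" : "1024",
      "yarn.vcores" : "1"
    }
  }
}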

Component instance count: yarn.component.instances

The property yarn.component.instances is one of the most fundamental in Slider: it declares how many instances of a component to instantiate on the cluster.

If the value is set to "0", no instances of a component will be created. If set to a larger number, more instances will be requested. Thus the property sets the size of the application, component-by-component.

The number of instances of each component is application-specific; there are no recommended values.
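
As an illustration (the component names and counts here are arbitrary, not recommendations), a component can be kept in the specification but left undeployed by setting its instance count to "0", while its peers are sized independently:

  "worker" : {
    "yarn.role.priority" : "2",
    "yarn.component.instances" : "4"
  },
  "monitor" : {
    "yarn.role.priority" : "3",
    "yarn.component.instances" : "0"
  }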

Container resource requirements: yarn.memory and yarn.vcores

These two properties define how much memory and CPU capacity each YARN container of this component requires. YARN will queue container requests until enough capacity exists within the cluster to satisfy them. When there is capacity, a container is allocated to Slider, which then deploys an instance of the component.

The larger these numbers, the more capacity the application gets.

If more memory or CPU is requested than needed, containers will take longer to be allocated than necessary, and other work may not be scheduled: the cluster will be under-utilized.

yarn.memory declares the amount of memory to ask for in YARN containers; it should be defined for each component based on the expected memory consumption. It is measured in MB.

If the cluster has hard memory limits enabled and the processes in a container use more physical or virtual memory than was granted, YARN will kill the container. Slider will attempt to recreate the component instance by requesting a new container, though if a component fails too many times Slider will eventually give up and fail the application.

A YARN cluster is usually configured with a minimum container allocation, set in yarn-site.xml by the configuration parameter yarn.scheduler.minimum-allocation-mb; the default value is 1024 MB. It will also have a maximum size set in yarn.scheduler.maximum-allocation-mb; the default is 8192, that is, 8GB. Asking for more than this will result in the request being rejected.
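
For reference, these limits are configured in yarn-site.xml; the values below are simply the defaults quoted above, not recommendations:

  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>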

yarn.vcores declares the number of "virtual cores" to request. These are a site-configured fraction of a physical CPU core; if the ratio of virtual to physical cores is 1:1, then each vcore is allocated a full physical core (which may be a hyperthreaded core if hyperthreading is enabled in the BIOS). If the ratio is 2:1, then each vcore is allocated half a physical core.

This notion of a virtual core is intended to partially isolate applications from differences in cluster performance: a process which needs 2 vcores on one cluster should ideally still ask for 2 vcores on a different cluster, even if the latter has newer CPU parts. In practice, it is not so consistent. Ask for more vcores if your process needs more CPU time.

YARN clusters may be configured to throttle CPU usage: if a process tries to use more than has been granted to the container, it will simply be scheduled with less CPU time. The penalty for using more CPU than requested is therefore less destructive than attempting to use more memory than requested/granted.

Relationship between yarn.memory and JVM Heap Size

Java applications deployed by Slider usually have a JVM heap size property which needs to be defined as part of the application configuration.

The value of yarn.memory MUST be bigger than the heap size allocated to any JVM, as a JVM uses a lot more memory than simply the heap alone. We have found that asking for at least 50% more appears to work, though some experimentation will be needed.

Slider does not attempt to derive a heap size for any component from the YARN allocation.
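
As a rough worked example (the figures are illustrative only), a component whose JVM heap is configured at 512 MB in the application's own configuration, typically appConfig.json, would request about 50% more from YARN:

"HBASE_MASTER" : {
  "yarn.role.priority" : "1",
  "yarn.component.instances" : "1",
  "yarn.memory" : "768"
}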

Component priority: yarn.role.priority

The property yarn.role.priority has two purposes within Slider:

  1. It provides a unique index for individual component types. That is, Slider indexes components not by their names but by their priority values.
  2. It defines the priority within the application that YARN uses when allocating containers for components. Components with higher priority are allocated first.

Generally the latter use, YARN allocation priority, is less important for Slider-deployed applications than for analytics applications designed to scale to as many nodes as can be allocated. A static Slider cluster has a predefined number of instances of each component to request (defined by yarn.component.instances), with the memory and CPU requirements of each component's container defined by yarn.memory and yarn.vcores. It will request the specified number of components —and keep those requests outstanding until they are satisfied.

Example

{
  "schema" : "http://example.org/specification/v2.0.0",
  "metadata" : {
  },
  "global" : {

  },
  "components" : {
    "HBASE_MASTER" : {
      "yarn.role.priority" : "1",
      "yarn.component.instances" : "1"
      "yarn.memory" : "768",
      "yarn.vcores" : "1"
    },
    "slider-appmaster" : {
      "yarn.memory" : "1024",
      "yarn.vcores" : "1"
    },
    "HBASE_REGIONSERVER" : {
      "yarn.role.priority" : "2",
      "yarn.component.instances" : "1"
    }
  }
}

The slider-appmaster component

The examples here all have a component slider-appmaster. This defines the settings of the application master itself: the memory and CPU it requires and, optionally, a label (see "Using Labels"). The yarn.role.priority value is ignored: the priority is always "0", and the instance count, yarn.component.instances, is implicitly set to "1".

The entry exists primarily to allow applications to configure the amount of RAM the AM should request.

Container Failure Policy

YARN containers hosting component instances may fail. This can happen because of:

  1. A problem in the configuration of the instance.
  2. A problem in the application package.
  3. A problem (hardware, software or networking) in the server hosting the container.
  4. Contention for a resource (usually a network port) between the component instance and another running program.
  5. The server or the network connection to it failing.
  6. The server being taken down for maintenance.

Slider reacts to a failed container by requesting a new container from YARN, preferably on a host that has already hosted an instance of that role. Once the container is allocated, Slider redeploys an instance of the component. As it may take time for YARN to have the resources to allocate the container, replacements are not instantiated immediately.

Slider tracks failures in an attempt to differentiate problems in the application package or its configuration from those of the underlying servers. If a component fails too many times, Slider considers the application itself to be failing, and halts.

This leads to the question: what is too many times?

The limits are defined in resources.json:

  1. The duration of a failure window, a time period in which failures are counted. This duration can span days.
  2. The maximum number of failures of any component in this time period.

Failure threshold for a component

The number of times a component may fail within a failure window is defined by the property yarn.container.failure.threshold.

If set to "0" there are no limits on the number of times containers may fail.

The failure thresholds for individual components can be set independently, as in the sketch below.
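
For instance, a low global threshold can be combined with a higher per-component override; the HBase example later in this section uses the same pattern, and the values here are illustrative only:

  "global" : {
    "yarn.container.failure.threshold" : "4"
  },
  "components" : {
    "HBASE_REGIONSERVER" : {
      "yarn.container.failure.threshold" : "15"
    }
  }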

Failure window

The failure window can be set in minutes, hours and days. These must be set in the global options, as they apply to the application as a whole rather than to individual components.

yarn.container.failure.window.days
yarn.container.failure.window.hours
yarn.container.failure.window.minutes

These properties define the duration of the window; they are all combined so the window is, in minutes:

minutes + 60 * (hours + days * 24)
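
For example, with yarn.container.failure.window.days=1, yarn.container.failure.window.hours=2 and yarn.container.failure.window.minutes=30, the window is 30 + 60 * (2 + 1 * 24) = 1590 minutes, i.e. 26.5 hours.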

The initial window is measured from the start of the application master —once the duration of that window is exceeded, all failure counts are reset, and the window begins again.

If the AM itself fails, the failure counts are reset and the window is restarted.

The default is yarn.container.failure.window.hours=6; when changing the window size, the hours value must be set explicitly, even if only to zero, to override this default.

We recommend a window duration of a few hours, and a large failure limit proportional to the number of instances of that component.

Why?

This covers the loss of a large portion of the cluster's hardware, as Slider will try to reinstantiate all the lost components. Yet if a component does fail repeatedly, Slider will eventually conclude that there is a problem and fail with exit code 73, EXIT_DEPLOYMENT_FAILED.

Example

Here is a resources.json file for an HBase cluster:

{
  "schema" : "http://example.org/specification/v2.0.0",
  "metadata" : { },
  "global" : {
    "yarn.container.failure.threshold" : "4",
    "yarn.container.failure.window.days" : "0',
    "yarn.container.failure.window.hours" : "1',
    "yarn.container.failure.window.minutes" : "0'
  },
  "components" : {
    "slider-appmaster" : {
      "yarn.memory" : "256",
      "yarn.vcores" : "1",
      "yarn.component.instances" : "1"
    },
    "HBASE_MASTER" : {
      "yarn.role.priority" : "1",
      "yarn.memory" : "256",
      "yarn.vcores" : "1",
      "yarn.component.instances" : "2"
    },
    "HBASE_REGIONSERVER" : {
      "yarn.role.priority" : "2",
      "yarn.memory" : "512",
      "yarn.container.failure.threshold" : "15",
      "yarn.vcores" : "2",
      "yarn.component.instances" : "10"
    }
  }
}

The window size is set to one hour: after that the counters are reset.

There is a global container failure threshold of four.

Ten region server components are requested; their failure threshold is overridden to fifteen. Given that there are more region servers than masters, a higher number of failures among the region servers is to be expected if the cause of the failures is the underlying hardware.

Choosing a higher value for the region servers ensures that the application is resilient to hardware problems. If there were some configuration problem in the region server deployments, resulting in them all failing rapidly, this threshold would soon be breached, which would cause the application to fail. Thus, configuration problems would still be detected.

These failure thresholds are all heuristics. When initially configuring an application instance, low thresholds reduce the disruption caused by components which are frequently failing due to configuration problems.

In a production application, large failure thresholds and/or shorter windows ensure that the application is resilient to transient failures of the underlying YARN cluster and hardware.

Placement Policies and Escalation

Slider can be configured with different options for placement —the policies by which it chooses where to ask YARN for nodes.

Placement Policy

The "placement policy" of a component is the set of rules by which Slider makes a decision on where to request instances of that component from YARN.

0 Default: placement is spread across the cluster on restarts, with escalation if requests are unmet. Unreliable nodes are avoided.
1 Strict: component instances are requested on the nodes previously used, irrespective of failure history. No escalation takes place.
2 Anywhere: place requests anywhere and ignore placement history.
4 Anti-affinity required: this option is not currently supported.

The placement policy is a bitwise "or" of these values, and can be set in the property yarn.component.placement.policy.

Example:

"HBASE_REST": {
  "yarn.role.priority": "3",
  "yarn.component.instances": "1",
  "yarn.component.placement.policy": "1",
  "yarn.memory": "556"
},

This defines an HBASE_REST component with a placement policy of "1": strict.

On application restarts Slider will re-request the same node.

If the component were configured to request an explicit port for its REST endpoint, then the same URL would reach it whenever this application were deployed —provided the host was available and the port not already in use.

Notes

  1. There's no support for anti-affinity —i.e. to mandate that component instances must never be deployed on the same hosts. Once YARN adds support for this, Slider will support it.

  2. Slider never explicitly blacklists nodes. It does track which nodes have been unreliable "recently", and avoids explicitly requesting them. If YARN does allocate a container on such a node, Slider will still attempt to deploy the component there.

  3. Apart from an (optional) label, placement policies for the application master itself cannot be specified. The Application Master is deployed wherever YARN sees fit.

Node Failure Threshold, yarn.node.failure.threshold

The configuration property yarn.node.failure.threshold defines how "unreliable" a node must be before it is skipped for placement requests; a configuration sketch follows the notes below.

  1. This is per-component.
  2. It is ignored for "strict" or "anywhere" placements.
  3. It is reset at the same time as the container failure counters; that is, at the interval defined by the yarn.container.failure.window properties.
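
A sketch of such a per-component setting; the threshold value of "3" is an illustrative assumption, not a recommendation:

  "HBASE_REGIONSERVER" : {
    "yarn.role.priority" : "2",
    "yarn.component.instances" : "10",
    "yarn.node.failure.threshold" : "3"
  }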

Escalation: yarn.placement.escalate.seconds

For any component whose placement policy is not "anywhere", Slider saves to HDFS a record of the nodes on which instances were running. When starting a cluster, it uses this history to identify hosts on which to request instances.

  1. Slider initially asks for nodes on those specific hosts —provided their recent failure history is considered acceptable.
  2. It tracks which 'placed' requests are outstanding.
  3. If, after the specified escalation time, YARN containers have not been allocated on those nodes, Slider will "escalate" the placement of the requests that are still outstanding.
  4. It currently does this by cancelling each request and re-requesting a container on that node, this time with the relaxLocality flag set.
  5. This tells YARN to seek an alternative location in the cluster if it cannot allocate one on the target host.
  6. If there is enough capacity in the cluster, the new node will then be allocated.

The higher the cost of migrating a component instance from one host to another, the longer we would recommend for an escalation timeout.

Example:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {
  },
  "global": {
  },
  "components": {
    "HBASE_MASTER": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.placement.escalate.seconds": "10"
    },
    "HBASE_REGIONSERVER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "10",
      "yarn.placement.escalate.seconds": "600"
    },
    "slider-appmaster": {
    }
  }
}

This declares that the HBASE_MASTER placement should be escalated after ten seconds, but that HBASE_REGIONSERVER instances should have an escalation timeout of 600 seconds —ten minutes. These values were chosen because an HBase Master can be allocated anywhere in the cluster, whereas a region server is significantly faster if restarted on the same node on which it previously saved all its data. Even though HDFS will have replicated all data elsewhere, it will have been scattered across the cluster —resulting in remote access for most of the data, at least until a full compaction has taken place.

Notes

  1. Escalation goes directly from "specific node" to "anywhere in cluster"; it does not have any intermediate "same-rack" policy.

  2. If components were assigned to specific labels, then even when placement is "escalated", Slider will always ask for containers on the specified labels. That is —it will never relax the constraint of "deploy on the labels specified". If there are not enough labelled nodes for the application, either the cluster administrators need to add more labelled nodes, or the application must be reconfigured with a different label policy.

  3. Escalated components may be allocated containers on nodes which already have a running instance of the same component.

  4. If the placement policy is "strict", there is no escalation. If the node is not available or lacks capacity, the request will remain unsatisfied.

  5. There is no placement escalation option for the application master.

  6. For more details, see: Role History

Using Labels

The resources.json file includes specifications of the labels to be used when allocating containers for the components. The details of the YARN Label feature can be found at YARN-796.

In summary:

  • Nodes can be assigned one or more labels.
  • Capacity Scheduler queues can be defined with access to one or more labels.
  • Application components are associated with the appropriate label expressions.
  • The application is created against a specific queue.

This way, you can guarantee that a certain set of nodes are reserved for an application or for a component within an application.

A label expression is specified through the property yarn.label.expression. If no label expression is specified, only non-labeled nodes are used when allocating containers for component instances.

If a label expression is specified for the slider-appmaster component, it also becomes the default label expression for all components.

Example

Here is a resources.json file for an HBase cluster which uses labels.

The label for the application master is hbase1 and the label expression for the HBASE_MASTER components is hbase1_master. HBASE_REGIONSERVER instances will automatically use label hbase1.

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {
  },
  "global": {
  },
"components": {
    "HBASE_MASTER": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.label.expression":"hbase1_master"
    },
    "HBASE_REGIONSERVER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "10",
    },
    "slider-appmaster": {
      "yarn.label.expression":"hbase1"
    }
  }
}

To deploy this application in a YARN cluster, the following steps must be followed.

  1. Create two labels, hbase1 and hbase1_master (use yarn rmadmin commands)
  2. Assign the labels to nodes (use yarn rmadmin commands)
  3. Perform a queue refresh (yarn rmadmin -refreshQueues)
  4. Create a queue by defining it in the capacity scheduler configuration.
  5. Allow the queue access to the labels and ensure that appropriate min/max capacity is assigned.
  6. Perform a queue refresh again (yarn rmadmin -refreshQueues)
  7. Create the Slider application against the above queue, using the --queue parameter when creating the application.

Notes

  1. If a label expression is defined in the global section, it also applies to all components which do not explicitly declare one. If such a label expression is set there and a different one is defined for the slider-appmaster, the application master's label is used only for its own placement.

  2. To explicitly request that a component's containers are not placed on labelled nodes, irrespective of any global or appmaster settings, set yarn.label.expression to an empty string:

    "HBASE_REGIONSERVER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "10",
      "yarn.label.expression":""
    }
    
  3. If there is not enough capacity within a set of labelled nodes for the requested containers, the application instance will not reach its requested size.

Log Aggregation

Log aggregation at regular intervals for long-running services (LRS) needs to be enabled at the YARN level before any application can exploit this functionality. To enable it, set yarn.log-aggregation-enable to true and set the roll-monitoring interval property shown below to a positive value of 3600 seconds or more. If it is set to a positive value less than 3600 (one hour), the value 3600 is used instead. To perform log aggregation only after the application finishes, set the interval property to -1 (the default).

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
    <value>3600</value>
  </property>

Subsequently, every application owner can set the include and exclude patterns for the file names they intend to aggregate. In Slider, the resources.json file can be used to specify the include and exclude patterns of files under the application's default log directory that need to be backed up. The patterns are Java regular expressions. These properties need to be set at the global level, as shown below:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {
  },
  "global": {
    "yarn.log.include.patterns": "hbase.*",
    "yarn.log.exclude.patterns": "hbase.*out" 
  },
  "components": {
    "HBASE_MASTER": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
    },
    "HBASE_REGIONSERVER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "10",
    },
    "slider-appmaster": {
    }
  }
}

The details of the YARN Log Aggregation feature can be found at YARN-2468.

Putting it all together

Here is an example of a definition of an HBase cluster.

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {
  },
  "global": {
    "yarn.log.include.patterns": "hbase.*",
    "yarn.log.exclude.patterns": "hbase.*out",
    "yarn.container.failure.window.hours": "0",
    "yarn.container.failure.window.minutes": "30",
    "yarn.label.expression":"development"
  },
  "components": {
    "slider-appmaster": {
      "yarn.memory": "1024",
      "yarn.vcores": "1"
      "yarn.label.expression":""
    },
    "HBASE_MASTER": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.placement.escalate.seconds": "10",
      "yarn.vcores": "1",
      "yarn.memory": "1500"
    },
    "HBASE_REGIONSERVER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "1",
      "yarn.vcores": "1",
      "yarn.memory": "1500",
      "yarn.container.failure.threshold": "15",
      "yarn.placement.escalate.seconds": "60"
    },
    "HBASE_REST": {
      "yarn.role.priority": "3",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "1",
      "yarn.container.failure.threshold": "3",
      "yarn.vcores": "1",
      "yarn.memory": "556"
    },
    "HBASE_THRIFT": {
      "yarn.role.priority": "4",
      "yarn.component.instances": "0",
      "yarn.component.placement.policy": "1",
      "yarn.vcores": "1",
      "yarn.memory": "556"
      "yarn.label.expression":"stable"
    },
    "HBASE_THRIFT2": {
      "yarn.role.priority": "5",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "1",
      "yarn.vcores": "1",
      "yarn.memory": "556"
      "yarn.label.expression":"stable"
    }
  }
}

There are ten region servers, with a 60-second timeout for placement escalation; 15 containers can fail in the "recent" time window before the application is considered to have failed.

The time window to reset failures is set to 30 minutes.

The Thrift, Thrift2 and REST servers all have strict placement. The REST server also has a container failure threshold of 3: if it fails to come up three times within the failure window, the entire application deployment is considered a failure.

The default label expression for components is "development". For the application master itself it is "", meaning anywhere. Both Thrift services are requested on the label "stable".