Apache Slider Placement: how Slider brings back nodes in the same location

Last updated 2016-01-06

This document covers how Slider works with YARN to place work around the cluster. For details on the implementation of this work, consult Role History.

Sections

  1. Introduction
  2. Assumptions
  3. Placement policies
  4. Slider and "Outstanding Requests"
  5. Placement History
  6. Changes

Introduction

A Slider Application Instance consists of the Application Master —Slider's code— and the components requested in the resources.json file. For each component, Slider requests a new YARN Container, which is then allocated in the Apache Hadoop cluster by YARN. The Slider application starts the component within the container, and monitors its lifecycle.

Both YARN and Slider participate in the choice of where a container is created. It is up to YARN to allocate the capacity for the container (CPU, memory, etc), possibly killing other containers ("pre-emption") to satisfy the requests submitted by Slider.

Slider can make different kinds of container requests of YARN

  1. Anywhere. Here Slider asks YARN for a container anywhere in the cluster. These requests are the most likely to be satisfied: if there is space anywhere in the cluster, YARN will allocate it.

  2. Labelled. Slider provides a label expression to YARN; the request can only be satisfied on a node in the cluster whose label matches the expression. If there is no capacity on nodes with the label, the request will be unsatisfied —even if there is space elsewhere.

  3. Named Nodes/Racks. Slider lists the explicit hosts and racks upon which a component can be allocated. If there is no capacity in these locations, the request will be unsatisfied —even if there is space elsewhere.

  4. Named Nodes/Racks with relaxation. Slider lists the explicit hosts and racks upon which a component can be allocated. If there is no capacity on these locations, then, after a few seconds, YARN will "relax" the request and try to find anywhere on the cluster with the capacity.

Classic data-analysis code running across a YARN cluster needs locality for the significantly better IO performance when reading data. For short-lived processes, non-local access, while still a downgrade in performance, is preferable to having the code execution delayed for more than a few seconds. Accordingly, named requests with relaxation are a viable request policy offered by YARN schedulers and used by most applications.

For Slider-deployed applications, the threshold at which relaxed placement should be attempted may be minutes rather than seconds, something YARN does not itself support. Slider must therefore implement its own mechanism: it initially makes non-relaxed requests, then, after the configured timeout, cancels them and issues new requests with relaxed placement.

Another recurrent need in Slider applications is Anti-Affinity, that is: explicitly running instances of a component on different machines.

Anti-affinity can be required for a number of reasons.

  • Availability: ensures that if a single node fails, the other instances of the component remain running. Examples: HBase Master nodes, Kafka instances.

  • Performance: If a process is deployed across separate machines, work can be scheduled which spans the hosts, running closer to the data. This is essentially a long-lifespan version of the classic YARN scheduling policy. Servers running across the pool of machines can field requests related to data on their machine.

  • Addressing resource conflicts: If every component instance uses a resource which cannot be shared (example: hard-coded network ports and file paths), then anti-affine placement ensures that there is no conflict between instances. It does not prevent conflicts across component types, or between different Slider/YARN applications.

Another need is history-based placement: re-instantiating component instances on the machine(s) on which they last ran. This can deliver performance and startup advantages, or, for some applications, be essential to recovering data persisted on the server. To achieve this, Slider persists recent placement configurations to HDFS, performing a best-effort reload of this data when starting up.

Assumptions

Here are some assumptions in Slider

  1. Different components are independent: it is not an issue if components of different types (for example, an Accumulo Monitor and an Accumulo Tablet Server) are on the same host. This assumption allows Slider to only worry about affinity issues within a specific component type, rather than across all of them.

  2. After an Application Instance has been started, the rate of change of the application is low: both node failures and flexing happen at the rate of every few hours, rather than every few seconds. This allows Slider to avoid needing data structures and layout persistence code designed for regular and repeated changes.

  3. Instances of a specific component should preferably be deployed onto different servers. This enables Slider to only remember the set of server nodes onto which instances were created, rather than more complex facts such as "two Region Servers were previously running on Node 17". On restart Slider can simply request one instance of a Region Server on a specific node, leaving the other instance to be arbitrarily deployed by YARN. This strategy should help reduce the affinity in the component deployment, increasing their resilience to failure.

  4. There is no need to make sophisticated choices on which nodes to request re-assignment, such as recording the amount of data persisted by a previous instance and prioritizing nodes based on such data. More succinctly: 'the only priority needed when asking for nodes is to ask for the most recently used'.

  5. If a component instance fails on a specific node, asking for a new container on that same node is a valid first attempt at a recovery strategy.

Placement policies

The policy for placing a component can be set in the resources.json file, via the yarn.component.placement.policy field.
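
As an illustration, here is a minimal resources.json fragment requesting "anywhere" placement (policy 2) for a single component. This is a sketch only: the "echo" component name is borrowed from the history example later in this document, and the rest of the file is omitted.

{
  "components": {
    "echo": {
      "yarn.component.placement.policy": "2"
    }
  }
}

The available policy values are listed below.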

Here are the currently supported placement policies

Note that

  1. These are orthogonal to labels: when "anywhere" is used, it means "anywhere that the label expression permits".
  2. The numbers are (currently) part of a bit-mask. Other combinations may be chosen, in which case the outcome is "undefined".

"Normal" (0)

Slider remembers the hosts on which containers were requested, and makes relaxed-placement requests for new instances on those hosts on startup/flex up. Component instances may be co-allocated on the same hosts.

When restarting/expanding a cluster, placement of any request in which the hostname is provided will be relaxed after the timeout specified for a component in "yarn.placement.escalate.seconds"; the default is 30 seconds.
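
For example, to give a component ten minutes before its placement requests are relaxed, the escalation timeout could be raised in its resources.json entry. This is a sketch, with an illustrative component name:

{
  "components": {
    "echo": {
      "yarn.component.placement.policy": "0",
      "yarn.placement.escalate.seconds": "600"
    }
  }
}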

"Anywhere" (2)

Slider asks for component instances anywhere in the cluster (constrained by any label expression).

Any history of where containers were placed before is not used; there's no attempt to spread the placement of containers across nodes.

That is: no other policy will do better than this one, either in the number of containers allocated or in the speed of allocation.

"Strict" (1)

Once a component instance has been deployed on a node, a request for that component will be made against that node, even if the node is considered unreliable. No relaxation of the request will ever take place. If a host upon which an instance was executed has failed, then the request will never be satisfied: it will remain outstanding.

New instances (i.e. ones for which there is no historical placement) will be requested anywhere.

Note: As of Jan 2016, there is no anti-affinity placement when expanding strict placements SLIDER-980. It's possible to work around this by requesting the initial set of instances using anti-affinity, then editing resources.json to switch to strict placements for future requests.
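
A sketch of that workaround (the "echo" component name is illustrative): deploy the initial instances with "yarn.component.placement.policy": "4", then, once they are running, edit the component's entry in resources.json to

{
  "components": {
    "echo": {
      "yarn.component.placement.policy": "1"
    }
  }
}

so that subsequent requests use strict placement.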

"Anti-affinity" (4)

Anti-affinity placement means that no two instances of the same component type are created on the same server.

Anti-affinity placement is implemented in Slider by asking for each instance one-by-one, requesting a container on any host with the desired label that is not running an instance of the component type.

This means that Slider cannot build a large application instance by simultaneously asking for many containers of a specific component type —for any type where anti-affinity is required, the requests must take place one-by-one. This makes placement significantly slower when trying to start a large application.

Furthermore, the guarantee that Slider makes (no two instances of a component on the same host) means that if there is no capacity for a container on any of the hosts not yet running a component instance, Slider will not relax the request.

That is: in exchange for a guarantee of anti-affinity, you may sacrifice the ability to run as many instances as you request, even if there is space in the cluster.

Slider and "Outstanding Requests"

One recurrent support problem people raise with Slider is that it appears to hang with one or more outstanding requests. When this situation arises, it is not a sign that there is anything wrong with Slider. It means there is nowhere for YARN to allocate the desired containers.

Here are the common causes

  • In a small cluster: you are asking for more containers than the cluster actually has.
  • There isn't enough space given everything else running.
  • There may be space, but the priority of the queue to which you are submitting work is lower than that of other work waiting to run.
  • With strict placement: the previously used host has failed; the request will remain unsatisfied until it comes back up.
  • With anti-affinity: Either there aren't enough hosts left for all the desired instances, or the remaining YARN nodes lack the capacity.
  • The amount of memory or CPU you are asking for is too much for any node to satisfy.

When working with labels, the size of the cluster effectively becomes the number of nodes with a given label: if you have eight nodes with the label "gpu", then components with "yarn.label.expression":"gpu" will only be able to work with those eight nodes.
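
For example, a component tied to those GPU nodes would carry the label expression in its resources.json entry (a sketch; the component name is illustrative):

{
  "components": {
    "echo": {
      "yarn.label.expression": "gpu"
    }
  }
}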

Another tactic is to ask for fewer resources: CPU or memory. Don't lie about your needs; CPU performance can be throttled, while memory over-runs can result in the container being killed. It's best to try to work out your component's actual needs (sorry: there's no easy way to do that) and use those in future requests.
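
As a sketch, and assuming the component declares its demands with the usual yarn.memory (MB) and yarn.vcores keys (not covered elsewhere in this document), trimming the per-container footprint means lowering values such as:

{
  "components": {
    "echo": {
      "yarn.memory": "512",
      "yarn.vcores": "1"
    }
  }
}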

Placement History

The history of where instances were deployed is persisted in instance-specific role histories, in the history subdirectory of the cluster data directory. Multiple history files are saved; they are timestamped and appear in sorted order (the hex value of System.currentTimeMillis() is used in the filename). The history files are text files storing a list of Apache Avro records.

Here are the contents of a test cluster, testagentecho, stored in the file ~/.slider/clusters/testagentecho/history/rolehistory-00000151b4b44461.json:

{"entry":{"org.apache.slider.server.avro.RoleHistoryHeader":{"version":1,"saved":1450435691617,"savedx":"151b4b44461","savedate":"18 Dec 2015 10:48:11 GMT","roles":2}}}
{"entry":{"org.apache.slider.server.avro.RoleHistoryMapping":{"rolemap":{"echo":1,"slider-appmaster":0}}}}
{"entry":{"org.apache.slider.server.avro.NodeEntryRecord":{"host":"192.168.56.1","role":1,"active":true,"last_used":0}}}
{"entry":{"org.apache.slider.server.avro.RoleHistoryFooter":{"count":1}}}

The RoleHistoryMapping record lists the current map of component types/roles to YARN priorities, as defined in the "yarn.role.priority" values. Each NodeEntryRecord record lists the cluster nodes where instances are running —or are known to have hosted an instance of a component in the past. In the case of the latter, the timestamp of the last time a node was used is recorded.
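
For instance, the "echo":1 entry in the rolemap above would correspond to a component definition along these lines (a sketch only):

{
  "components": {
    "echo": {
      "yarn.role.priority": "1"
    }
  }
}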

When reloading the history, the most recent history which can be successfully parsed is used as the source of historical placement information.

If something appears to be wrong with cluster recovery, delete the history records. Slider will lose all placement history, and so start from scratch.

Changes

Slider 0.90.2-incubating

Slider now supports Anti-affinity SLIDER-82

This is a regularly requested feature: you can now tell Slider to place all instances of a component on different hosts. Slider will attempt to do this iteratively, requesting the containers one at a time.

Slider 0.80-incubating

A major rework of placement has taken place (Über-JIRA: placement phase 2).

  1. Slider manages the process of relaxing a request from a specific host to "anywhere in the cluster".
  2. Each role/component type may have a configurable timeout for escalation to begin.
  3. Slider periodically checks (every 30s by default) to see if there are outstanding requests that have reached their escalation timeout and yet have not been satisfied.
  4. Such requests are cancelled and "relaxed" requests re-issued.
  5. Labels are always respected; even relaxed requests use any labels specified in resources.json.