Slider Chaos Monkey

Slider includes a built-in Chaos Monkey. This is a service which runs inside the Slider Application Master and randomly kills containers, or even the Application Master itself, without any warning.

This Chaos Monkey is intended for testing. Netflix's original design runs against all their production applications hosted in Amazon's cloud. The Slider Chaos Monkey may also be used in production, if desired, though it is not something we would recommend, especially for any application where the cost of reacting to or recovering from failures is tangible.

If used in production, note that YARN must be configured to tolerate the failure rate, and the Slider failure threshold (yarn.container.failure.threshold) and failure window must be configured to tolerate the increased failure rate.

The Chaos Monkey works as follows:

  1. The monkey wakes up at a configured interval and "comes out to play".
  2. For each chaos action, the monkey then generates a random number.
  3. If the probability of the chaos action is greater than this random number, the action is performed.
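The check cycle above can be sketched in Python. This is a minimal illustration, not Slider's implementation: the function name `play`, the action names, and the assumption that the random number is an integer drawn uniformly from 0 to 9999 (so that a probability of 10000 always fires) are hypothetical.

```python
import random

# Assumed scale: a probability of 10000 means "always" (hypothetical constant).
PROBABILITY_ONE = 10000


def play(action_probabilities, rng):
    """Run one monkey check and return the names of the actions that fire."""
    triggered = []
    for name, probability in action_probabilities.items():
        # A fresh random number is generated for each chaos action.
        roll = rng.randrange(PROBABILITY_ONE)  # integer in 0..9999
        if probability > roll:
            triggered.append(name)
    return triggered


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    print(play({"amfailure": 5, "containerfailure": 1000}, rng))
```

Under these assumptions an action with probability 0 can never fire, because no roll is below zero.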

As an example, if the probability of killing the AM is set to 50%, and the monkey interval is set to one hour, then one would expect the AM to have failed approximately 24 times over a 48-hour period. As the check is random, it is unlikely to be exactly this value, nor will the interval between failures be exactly two hours.

A monkey interval of 30 minutes and an AM kill probability of 25% would result in the aggregate failure rate being approximately the same, but the interval between failures would be different.
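The arithmetic behind these two scenarios can be checked with a short Python sketch. The helper name `expected_failures` is hypothetical, and the calculation assumes every check is independent.

```python
def expected_failures(period_hours, interval_minutes, probability):
    """Expected number of times an action fires over the period."""
    checks = period_hours * 60 / interval_minutes
    return checks * probability


if __name__ == "__main__":
    print(expected_failures(48, 60, 0.50))  # hourly checks, 50% each
    print(expected_failures(48, 30, 0.25))  # half-hourly checks, 25% each
```

Both calls return 24.0. What differs is the granularity: the gap between two failures is always a multiple of 60 minutes in the first case and of 30 minutes in the second.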

Configuring the Chaos Monkey

The Chaos Monkey is configured on a per-application basis, by setting options in the global section of the internal resources file, internal.json.

Any option which takes a probability uses units of hundredths of a percent; that is, 10000 units are equivalent to a probability of 1: an operation will always take place. A value of 100 is translated to 1%, a probability of 0.01.

This unit is used to allow very small percentages to be expressed without resorting to floating point numbers.
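As a quick illustration of the unit, the following Python sketch converts between configuration units, probabilities and percentages. The helper names are hypothetical and are not part of Slider.

```python
PROBABILITY_ONE = 10000  # 10000 units == a probability of 1 (100%)


def units_to_probability(units):
    """Convert configuration units to a probability between 0 and 1."""
    return units / PROBABILITY_ONE


def percent_to_units(percent):
    """Convert a percentage to the nearest whole number of units."""
    return round(percent * 100)


if __name__ == "__main__":
    print(units_to_probability(100))  # the 1% case from the text
    print(percent_to_units(0.05))     # a very small percentage, as an integer
```

Here 0.05% becomes the integer 5, which is the kind of small value the unit is designed to express.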

The monkey can be enabled, after which the interval between checks must be set. Available actions can then have their individual probability set.

Enabling the Monkey

The option internal.chaos.monkey.enabled enables or disables the monkey; it must equal "true" for the monkey to be enabled and other options read.

Interval

The following options set the interval between checks to see if any chaos action is to be triggered. Their values are added together to produce the total interval.

  • internal.chaos.monkey.interval.days
  • internal.chaos.monkey.interval.hours
  • internal.chaos.monkey.interval.minutes
  • internal.chaos.monkey.interval.seconds

If no interval is set, the Chaos Monkey does not start.

Startup delay

The chaos monkey can be given a startup delay before it begins actions. The value defaults to the interval value (see above).

  • internal.chaos.monkey.delay.days
  • internal.chaos.monkey.delay.hours
  • internal.chaos.monkey.delay.minutes
  • internal.chaos.monkey.delay.seconds
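The way the interval and delay options combine can be sketched in Python. This is an illustration under stated assumptions rather than Slider's own code: the helper names are hypothetical, missing parts are assumed to count as zero, and the delay is assumed to fall back to the interval only when no delay option is present at all.

```python
from datetime import timedelta

UNITS = ("days", "hours", "minutes", "seconds")


def total_duration(options, prefix):
    """Sum the days/hours/minutes/seconds options found under a prefix."""
    parts = {unit: int(options.get(f"{prefix}.{unit}", 0)) for unit in UNITS}
    return timedelta(**parts)


def monkey_schedule(options):
    """Return (interval, delay), or None if no interval is set."""
    interval = total_duration(options, "internal.chaos.monkey.interval")
    if interval.total_seconds() <= 0:
        return None  # no interval: the monkey does not start
    delay_prefix = "internal.chaos.monkey.delay"
    has_delay = any(f"{delay_prefix}.{unit}" in options for unit in UNITS)
    # Assumption: with no delay option present, the delay equals the interval.
    delay = total_duration(options, delay_prefix) if has_delay else interval
    return interval, delay


if __name__ == "__main__":
    options = {
        "internal.chaos.monkey.interval.hours": "1",
        "internal.chaos.monkey.interval.minutes": "30",
    }
    print(monkey_schedule(options))
```

With the options shown, the hours and minutes add up to a 90-minute interval, and the delay takes the same value.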

Application Master Kill

The probability of the AM being killed on a monkey check is:

internal.chaos.monkey.probability.amfailure

When the monkey triggers this action, the AM kills itself. YARN is expected to detect this and react by creating a new application master, while leaving the running application itself to continue uninterrupted.

For the AM to recover from failures, YARN must be configured to support application retries.

As a restarted AM resets all its internal state, the Chaos Monkey itself will be restarted with a new interval which begins from the moment the AM is restarted.

Container Kill

The probability of a container being killed in a single monkey "play" is:

internal.chaos.monkey.probability.containerfailure

When the monkey triggers this action, the current list of active YARN containers being used by the application is enumerated, then one of the containers is selected at random to be killed.

The Slider Application Master is expected to notice this event and respond by requesting and re-instantiating a replacement container.

The Slider Application should be configured in its resources.json file to tolerate a failure rate.

If there are no containers hosting application components at the time the chaos monkey performs its actions, then no container will be killed.
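The selection step can be sketched as follows. This is a hypothetical Python helper, not the Slider code; the container IDs in the demo are invented, and a uniform random choice among the active containers is assumed.

```python
import random


def pick_container_to_kill(active_containers, rng):
    """Pick one active container at random, or None if there are none."""
    candidates = list(active_containers)
    if not candidates:
        return None  # nothing to kill on this check
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random()
    print(pick_container_to_kill(["container_01", "container_02"], rng))
    print(pick_container_to_kill([], rng))
```

Returning None for an empty list mirrors the behaviour described above: when no containers are hosting application components, the action does nothing.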

AM Launch Failure

The launch failure check is a special check made at startup if the monkey is enabled; it defines the probability that the launch itself will fail.

It is tested precisely once per application attempt, at launch:

internal.chaos.monkey.probability.amlaunchfailure

Note that the Chaos Monkey must be enabled for this check, which means that:

  1. The monkey must be enabled with internal.chaos.monkey.enabled=true
  2. A non-zero monkey interval must be set via the internal.chaos.monkey.interval properties

Example

A disabled Chaos Monkey

{
  "internal.chaos.monkey.enabled": "false"
}

As this is the default, it does not need to be declared.

An enabled Chaos Monkey

{
  "internal.chaos.monkey.enabled": "true",
  "internal.chaos.monkey.interval.hours": "1",
  "internal.chaos.monkey.interval.minutes": "30",
  "internal.chaos.monkey.probability.containerfailure": "1000",
  "internal.chaos.monkey.probability.amfailure": "5"
}

This configuration:

  1. Enables the Chaos Monkey
  2. Sets the interval to 1h 30m (90 minutes)
  3. Sets the probability of a container failure to 10%
  4. Sets the probability of an application master failure to 0.05%

With these values, the monkey makes 16 checks over a 24-hour period, so the expected number of container kills is 16 * 1000 / 10000 = 1.6.

That is, at least one container is likely to have been killed over the day.

The probability of the AM failing is significantly lower:

16 * 0.0005 = 0.008 = 0.8%
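The figures in this example can be reproduced, and refined slightly, with a short Python sketch. The function names are hypothetical, and the calculation assumes that every check is independent of the others.

```python
CHECKS_PER_DAY = 24 * 60 // 90  # a 90-minute interval gives 16 checks


def expected_events(units, checks=CHECKS_PER_DAY):
    """Expected number of times an action fires, given its units."""
    return checks * units / 10000


def probability_at_least_one(units, checks=CHECKS_PER_DAY):
    """Chance the action fires at least once, assuming independent checks."""
    return 1 - (1 - units / 10000) ** checks


if __name__ == "__main__":
    for name, units in (("containerfailure", 1000), ("amfailure", 5)):
        print(name, expected_events(units), probability_at_least_one(units))
```

For the container case this gives an expected 1.6 kills and roughly an 81% chance of at least one kill during the day. For the AM, the chance of at least one failure is about 0.8%, very close to the simple product 16 * 0.0005 because the per-check probability is so small.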

If any probability is set to zero, such as:

"internal.chaos.monkey.probability.amfailure": "0"

then that check is never made: here the AM will never be killed by the Chaos Monkey.