~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~   http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.

  ---
  Hadoop Map Reduce Next Generation-${project.version} - Fair Scheduler
  ---
  ---
  ${maven.build.timestamp}

Hadoop MapReduce Next Generation - Fair Scheduler

  \[ {{{./index.html}Go Back}} \]

%{toc|section=1|fromDepth=0}

* {Purpose} 

  This document describes the <<<FairScheduler>>>, a pluggable scheduler for Hadoop 
  which provides a way to share large clusters. <<NOTE:>> The Fair Scheduler 
  implementation is currently under development and should be considered experimental.

* {Introduction}

  Fair scheduling is a method of assigning resources to applications such that 
  all apps get, on average, an equal share of resources over time. 
  Hadoop NextGen is capable of scheduling multiple resource types, such as 
  Memory and CPU. Currently only memory is supported, so a "cluster share" is 
  a proportion of aggregate memory in the cluster. When there is a single app 
  running, that app uses the entire cluster. When other apps are submitted, 
  resources that free up are assigned to the new apps, so that each app gets 
  roughly the same amount of resources. Unlike the default Hadoop scheduler, 
  which forms a queue of apps, this lets short apps finish in reasonable time
  while not starving long-lived apps. It is also a reasonable way to share a 
  cluster between a number of users. Finally, fair sharing can also work with 
  app priorities - the priorities are used as weights to determine the 
  fraction of total resources that each app should get.

  The scheduler organizes apps further into "queues", and shares resources 
  fairly between these queues. By default, all users share a single queue, 
  called "default". If an app specifically lists a queue in a container
  resource request, the request is submitted to that queue. It is also 
  possible to assign queues based on the user name included with the request 
  through configuration. Within each queue, fair sharing is used to share 
  capacity between the running apps. Queues can also be given weights in the
  config file to share the cluster non-proportionally.
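
  As a rough illustration of weights, the sketch below uses the allocation
  file format described under Configuration. The queue names are hypothetical,
  and the stated split assumes both queues have enough demand and that no
  minimum or maximum limits apply.

------
<?xml version="1.0"?>
<allocations>
  <!-- Hypothetical queues. With both queues busy, "production" is entitled
       to roughly 2/3 of the cluster and "adhoc" to roughly 1/3. -->
  <queue name="production">
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>
------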

  The fair scheduler supports hierarchical queues. All queues descend from a
  queue named "root". Available resources are distributed among the children
  of the root queue in the typical fair scheduling fashion. Then, the children
  distribute the resources assigned to them to their children in the same
  fashion.  Applications may only be scheduled on leaf queues. Queues can be
  specified as children of other queues by placing them as sub-elements of 
  their parents in the fair scheduler configuration file.
  
  A queue's name starts with the names of its parents, with periods as
  separators.  So a queue named "queue1" under the root queue would be
  referred to as "root.queue1", and a queue named "queue2" under a queue
  named "parent1" would be referred to as "root.parent1.queue2". When
  referring to queues, the root part of the name is optional, so queue1 could
  be referred to as just "queue1", and queue2 could be referred to as just
  "parent1.queue2".

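  The naming scheme above can be sketched as an allocation file fragment. The
  queue names are the ones used in the text; the weights are illustrative
  values added only so that each leaf queue has a property.

------
<?xml version="1.0"?>
<allocations>
  <!-- Parent queue: root.parent1, or just parent1. Apps cannot be
       scheduled here because it is not a leaf queue. -->
  <queue name="parent1">
    <!-- Leaf queue: root.parent1.queue2, or just parent1.queue2 -->
    <queue name="queue2">
      <weight>1.0</weight>
    </queue>
  </queue>
  <!-- Leaf queue directly under root: root.queue1, or just queue1 -->
  <queue name="queue1">
    <weight>1.0</weight>
  </queue>
</allocations>
------
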
  In addition to providing fair sharing, the Fair Scheduler allows assigning 
  guaranteed minimum shares to queues, which is useful for ensuring that 
  certain users, groups or production applications always get sufficient 
  resources. When a queue contains apps, it gets at least its minimum share, 
  but when the queue does not need its full guaranteed share, the excess is 
  split between other running apps. This lets the scheduler guarantee capacity 
  for queues while utilizing resources efficiently when these queues don't
  contain applications.

  The Fair Scheduler lets all apps run by default, but it is also possible to 
  limit the number of running apps per user and per queue through the config 
  file. This can be useful when a user must submit hundreds of apps at once, 
  or in general to improve performance if running too many apps at once would 
  cause too much intermediate data to be created or too much context-switching.
  Limiting the apps does not cause any subsequently submitted apps to fail, 
  only to wait in the scheduler's queue until some of the user's earlier apps 
  finish. Apps to run from each user/queue are chosen in order of priority and
  then submit time, as in the default FIFO scheduler in Hadoop.

  Certain add-ons that existed in the original (MR1) Fair Scheduler are not
  yet supported. Among them is the use of custom policies governing
  priority "boosting" over certain apps.

* {Installation}

  To use the Fair Scheduler, first assign the appropriate scheduler class in
  yarn-site.xml:

------
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
------

* {Configuration}

  Customizing the Fair Scheduler typically involves altering two files. First, 
  scheduler-wide options can be set by adding configuration properties in the 
  fair-scheduler.xml file in your existing configuration directory. Second, in 
  most cases users will want to create a manifest file listing which queues 
  exist and their respective weights and capacities. The location of this file 
  is flexible - but it must be declared in fair-scheduler.xml. 

 * <<<yarn.scheduler.fair.allocation.file>>>

    * Path to allocation file. An allocation file is an XML manifest describing
      queues and their properties, in addition to certain policy defaults. Its
      format is described in the next section.

 * <<<yarn.scheduler.fair.minimum-allocation-mb>>>

    * The smallest container size the scheduler can allocate, in MB of memory.

 * <<<yarn.scheduler.fair.maximum-allocation-mb>>>

    * The largest container the scheduler can allocate, in MB of memory.

 * <<<yarn.scheduler.fair.user-as-default-queue>>>

    * Whether to use the username associated with the allocation as the default 
      queue name, in the event that a queue name is not specified. If this is set 
      to "false" or unset, all jobs have a shared default queue, called "default".
      Defaults to true.

 * <<<yarn.scheduler.fair.preemption>>>

    * Whether to use preemption. Note that preemption is experimental in the current
      version. Defaults to false.

 * <<<yarn.scheduler.fair.sizebasedweight>>>
  
    * Whether to assign shares to individual apps based on their size, rather than
      providing an equal share to all apps regardless of size. Defaults to false.

 * <<<yarn.scheduler.fair.assignmultiple>>>

    * Whether to allow multiple container assignments in one heartbeat. Defaults
      to false.

 * <<<yarn.scheduler.fair.max.assign>>>

    * If assignmultiple is true, the maximum number of containers that can be
      assigned in one heartbeat. Defaults to -1, which sets no limit.

 * <<<yarn.scheduler.fair.locality.threshold.node>>>

    * For applications that request containers on particular nodes, the number of
      scheduling opportunities since the last container assignment to wait before
      accepting a placement on another node. Expressed as a float between 0 and 1
      that gives, as a fraction of the cluster size, the number of scheduling
      opportunities to pass up. The default value of -1.0 means that no
      scheduling opportunities are passed up.

 * <<<yarn.scheduler.fair.locality.threshold.rack>>>

    * For applications that request containers on particular racks, the number of
      scheduling opportunities since the last container assignment to wait before
      accepting a placement on another rack. Expressed as a float between 0 and 1
      that gives, as a fraction of the cluster size, the number of scheduling
      opportunities to pass up. The default value of -1.0 means that no
      scheduling opportunities are passed up.
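
  As a minimal sketch of how some of the properties above might be set, the
  fragment below assumes they are placed in fair-scheduler.xml inside the
  usual Hadoop configuration root element, using the same property syntax
  shown for yarn-site.xml. The allocation file path is a placeholder and the
  values are illustrative, not recommendations.

------
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder path: point this at your own allocation file -->
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>/path/to/allocations.xml</value>
  </property>
  <!-- Send apps without an explicit queue to the shared "default" queue -->
  <property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>false</value>
  </property>
  <!-- Allow several container assignments per heartbeat, capped at 3 -->
  <property>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.max.assign</name>
    <value>3</value>
  </property>
</configuration>
------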

** Allocation file format

  The allocation file must be in XML format. The format contains four types of
  elements:

 * <<Queue elements>>, which represent queues. Each may contain the following
     properties:

   * minResources: minimum amount of aggregate memory

   * maxResources: maximum amount of aggregate memory

   * maxRunningApps: limit the number of apps from the queue to run at once

   * weight: to share the cluster non-proportionally with other queues

   * schedulingMode: either "fifo" or "fair" depending on the in-queue scheduling
     policy desired

   * aclSubmitApps: a list of users that can submit apps to the queue. A (default)
     value of "*" means that any user can submit apps. A queue inherits the ACL of
     its parent, so if a queue2 descends from queue1, and user1 is in queue1's ACL,
     and user2 is in queue2's ACL, then both users may submit to queue2.

   * minSharePreemptionTimeout: number of seconds the queue is under its minimum share
     before it will try to preempt containers to take resources from other queues.

 * <<User elements>>, which represent settings governing the behavior of individual 
     users. They can contain a single property: maxRunningApps, a limit on the 
     number of running apps for a particular user.

 * <<A userMaxAppsDefault element>>, which sets the default running app limit 
   for any users whose limit is not otherwise specified.

 * <<A fairSharePreemptionTimeout element>>, which sets the number of seconds a
   queue is under its fair share before it will try to preempt containers to
   take resources from other queues.

  An example allocation file is given here:

---
<?xml version="1.0"?>
<allocations>
  <queue name="sample_queue">
    <minResources>10000</minResources>
    <maxResources>90000</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <weight>2.0</weight>
    <schedulingMode>fair</schedulingMode>
    <queue name="sample_sub_queue">
      <minResources>5000</minResources>
    </queue>
  </queue>
  <user name="sample_user">
    <maxRunningApps>30</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
---

  Note that for backwards compatibility with the original FairScheduler,
  "queue" elements can instead be named "pool" elements.
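
  The example above does not show ACLs or preemption timeouts. The following
  sketch covers those elements, reusing the queue1/queue2 and user1/user2
  names from the aclSubmitApps description. The timeout and resource values
  are illustrative, and the timeouts presumably only matter when
  yarn.scheduler.fair.preemption is enabled.

------
<?xml version="1.0"?>
<allocations>
  <queue name="queue1">
    <aclSubmitApps>user1</aclSubmitApps>
    <minResources>10000</minResources>
    <!-- Try to preempt after 60 seconds below the minimum share -->
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
    <queue name="queue2">
      <!-- user2 is listed here and user1 is inherited from queue1,
           so both users may submit to queue2 -->
      <aclSubmitApps>user2</aclSubmitApps>
      <schedulingMode>fifo</schedulingMode>
    </queue>
  </queue>
  <!-- Top-level element: seconds below fair share before preempting -->
  <fairSharePreemptionTimeout>120</fairSharePreemptionTimeout>
</allocations>
------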