# Copyright 2008 The Apache Software Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

This package implements a scheduler for Map-Reduce jobs, called the Capacity
Task Scheduler (or just the Capacity Scheduler), which provides a way to
share large clusters. The scheduler provides the following features (which
are described in detail in HADOOP-3421):

 * Support for queues, where a job is submitted to a queue.
 * Queues are guaranteed a fraction of the capacity of the grid (their
   'guaranteed capacity') in the sense that a certain capacity of resources
   will be at their disposal. All jobs submitted to the queues of an Org will
   have access to the capacity guaranteed to the Org.
 * Free resources can be allocated to any queue beyond its guaranteed
   capacity. These excess allocated resources can be reclaimed and made
   available to another queue in order to meet its capacity guarantee.
 * The scheduler guarantees that excess resources taken from a queue will be
   restored to it within N minutes of its need for them.
 * Queues optionally support job priorities (disabled by default).
 * Within a queue, jobs with higher priority will have access to the queue's
   resources before jobs with lower priority. However, once a job is running,
   it will not be preempted for a higher-priority job.
 * In order to prevent one or more users from monopolizing its resources,
   each queue enforces a limit on the percentage of resources allocated to a
   user at any given time, if there is competition for them.
 * Support for memory-intensive jobs, wherein a job can optionally specify
   higher memory requirements than the default, and the tasks of the job will
   only be run on TaskTrackers that have enough memory to spare.

Whenever a TaskTracker is free, the Capacity Scheduler first picks the queue
that needs to reclaim resources the earliest. If no such queue is found, it
then picks the queue with the most free space (the queue whose ratio of
running slots to guaranteed capacity is the lowest).

--------------------------------------------------------------------------------

BUILDING:

In HADOOP_HOME, run 'ant package' to build Hadoop and its contrib packages.

--------------------------------------------------------------------------------

INSTALLING:

To run the capacity scheduler in your Hadoop installation, you need to put it
on the CLASSPATH. The easiest way is to copy the
hadoop-*-capacity-scheduler.jar from
HADOOP_HOME/build/contrib/capacity-scheduler to HADOOP_HOME/lib.
Alternatively, you can modify HADOOP_CLASSPATH to include this jar, in
conf/hadoop-env.sh.
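For example, if you keep the jar where the build placed it, a line like the
following could be added to conf/hadoop-env.sh. The exact jar file name below
is illustrative only; use the file name actually produced by your build.

  # Illustrative only: adjust the path and jar name to match your build.
  export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${HADOOP_HOME}/build/contrib/capacity-scheduler/hadoop-0.19.0-capacity-scheduler.jar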
You will also need to set the following property in the Hadoop config file
(conf/hadoop-site.xml) to have Hadoop use the capacity scheduler:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>

--------------------------------------------------------------------------------

CONFIGURATION:

The following properties can be set in hadoop-site.xml to configure the
scheduler:

mapred.capacity-scheduler.reclaimCapacity.interval:
  The capacity scheduler checks, every 'interval' seconds, whether any
  capacity needs to be reclaimed. The default value is 5 seconds.

The scheduling information for queues is maintained in a configuration file
called 'capacity-scheduler.xml'. Note that the queue names themselves are set
in hadoop-site.xml; capacity-scheduler.xml sets the scheduling properties for
each queue. See that file for configuration details. The configuration
options for each queue are the following (an illustrative stanza appears in
the EXAMPLE section at the end of this file):

mapred.capacity-scheduler.queue.<queue-name>.guaranteed-capacity:
  Percentage of the number of slots in the cluster that are guaranteed to be
  available for jobs in this queue. The sum of guaranteed capacities for all
  queues should be less than or equal to 100.

mapred.capacity-scheduler.queue.<queue-name>.reclaim-time-limit:
  The amount of time, in seconds, within which resources distributed to other
  queues will be reclaimed back for this queue once it needs them.

mapred.capacity-scheduler.queue.<queue-name>.supports-priority:
  If true, priorities of jobs will be taken into account in scheduling
  decisions.

mapred.capacity-scheduler.queue.<queue-name>.minimum-user-limit-percent:
  Each queue enforces a limit on the percentage of resources allocated to a
  user at any given time, if there is competition for them. This user limit
  can vary between a minimum and a maximum value: the former is set to this
  property value, and the latter depends on the number of users who have
  submitted jobs. For example, suppose the value of this property is 25. If
  two users have submitted jobs to a queue, no single user can use more than
  50% of the queue's resources. If a third user submits a job, no single user
  can use more than 33% of the queue's resources. With 4 or more users, no
  user can use more than 25% of the queue's resources. A value of 100 implies
  no user limits are imposed.

--------------------------------------------------------------------------------

IMPLEMENTATION:

When a TaskTracker is free, the capacity scheduler does the following (note
that many of these steps can be, and will be, enhanced over time to provide
better algorithms):

1. It decides whether to give the TaskTracker a Map or a Reduce task,
   depending on how many tasks of that type the TT is already running,
   relative to the maximum number it can run.
2. The scheduler then picks a queue. Queues that need to reclaim capacity
   sooner come before queues that don't. Queues that don't are ordered by the
   ratio of (# of running tasks) / (guaranteed capacity), which indicates how
   much 'free space' the queue has, or how far it is over capacity.
3. A job is picked in the queue based on its state (running jobs are picked
   first), its priority (if the queue supports priorities) or its submission
   time, and whether the job's user is under or over the user limit.
4. A task is picked from the job in the same way it always has been.

Periodically, a thread checks each queue to see if it needs to reclaim any
capacity. Queues that are running below capacity and have tasks waiting need
to reclaim capacity within a certain period of time. If a queue hasn't
received enough tasks in a certain amount of time, tasks will be killed from
queues that are running over capacity.
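--------------------------------------------------------------------------------

EXAMPLE:

The stanza below is an illustrative sketch of what a capacity-scheduler.xml
entry for a single queue might look like, tying together the per-queue
properties described under CONFIGURATION. The queue name 'default' and the
values shown are assumptions made for the sake of the example; use the queue
names defined in your hadoop-site.xml and values suited to your cluster.

<?xml version="1.0"?>
<configuration>

  <!-- Example settings for a queue named 'default'; the queue name and
       values here are illustrative only. -->
  <property>
    <name>mapred.capacity-scheduler.queue.default.guaranteed-capacity</name>
    <value>100</value>
    <description>Percentage of cluster slots guaranteed to this queue.
    </description>
  </property>

  <property>
    <name>mapred.capacity-scheduler.queue.default.reclaim-time-limit</name>
    <value>300</value>
    <description>Seconds within which capacity lent to other queues is
    reclaimed back for this queue.</description>
  </property>

  <property>
    <name>mapred.capacity-scheduler.queue.default.supports-priority</name>
    <value>false</value>
    <description>Whether job priorities are considered in scheduling.
    </description>
  </property>

  <property>
    <name>mapred.capacity-scheduler.queue.default.minimum-user-limit-percent</name>
    <value>100</value>
    <description>Per-user limit (in percent) when there is competition for
    the queue's resources; 100 means no per-user limit.</description>
  </property>

</configuration>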
--------------------------------------------------------------------------------