Hadoop Scheduling

Summary -

This topic covers the following section -

Scheduling

Hadoop is an open source, distributed, scalable, fault-tolerant batch-processing framework that can store and process huge amounts of data (big data). Hadoop enables high-performance processing of data over a set of distributed nodes. Hadoop is also a multi-tasking system that can process multiple data sets for multiple jobs submitted by multiple users at the same time.

Until 2008, Hadoop supported a single scheduler, built into the Job Tracker, that worked well for the traditional batch jobs of Hadoop. A bug report was then submitted requesting a scheduler implementation that was independent of the Job Tracker.

The new scheduler is pluggable and allows the use of new scheduling algorithms to help optimize jobs that have specific characteristics. The change made the scheduler code more readable and opened up the potential for a growing list of schedulers.
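
The idea of a pluggable scheduler can be sketched in a few lines of Python. This is only an illustration of the design, not Hadoop's actual interface: the class names Scheduler, OldestFirst, HighestPriorityFirst and the stand-in JobTracker class, as well as the job fields, are all hypothetical.

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Hypothetical plug-in interface: choose the next job from the pending list."""

    @abstractmethod
    def pick(self, pending):
        ...


class OldestFirst(Scheduler):
    def pick(self, pending):
        # Smallest submission time wins.
        return min(pending, key=lambda job: job["submitted"])


class HighestPriorityFirst(Scheduler):
    def pick(self, pending):
        # Largest priority value wins.
        return max(pending, key=lambda job: job["priority"])


class JobTracker:
    """Stand-in for the component that hands out work.

    It only talks to the Scheduler interface, so the algorithm can be swapped
    without touching this class.
    """

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.pending = []

    def submit(self, job):
        self.pending.append(job)

    def run_next(self):
        job = self.scheduler.pick(self.pending)
        self.pending.remove(job)
        return job["name"]
```

Swapping OldestFirst for HighestPriorityFirst changes which job runs next, while the JobTracker stand-in stays the same.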

With the above change, Hadoop is now a multi-user data warehouse that supports a variety of processing jobs, with a pluggable scheduler framework providing greater control. The following three schedulers are available for scheduling jobs -

FIFO Scheduler
Fair Scheduler
Capacity Scheduler

FIFO Scheduler -

The original scheduling algorithm integrated within the Job Tracker was called FIFO. In FIFO scheduling, the Job Tracker pulls jobs from a work queue, oldest job first. Each job uses the whole cluster, so the other jobs wait for their turn.

The scheduler has no concept of the priority or size of a job, and preemption is not supported. The FIFO approach is simple to implement and efficient, but the FIFO scheduler is not suitable for a shared production-level cluster.
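
The oldest-first behaviour can be illustrated with a minimal Python sketch. It assumes each job is just a (name, duration) pair and that one job holds the whole cluster at a time; the function name run_fifo and the job names are made up for illustration.

```python
from collections import deque


def run_fifo(jobs):
    """Run jobs one at a time in submission order.

    jobs: list of (name, duration) tuples, oldest first.
    Returns a dict mapping each job name to its completion time.
    """
    queue = deque(jobs)   # work queue, oldest job at the left
    clock = 0
    finished = {}
    while queue:
        name, duration = queue.popleft()  # always take the oldest job
        clock += duration                 # the job holds the whole cluster
        finished[name] = clock
    return finished


if __name__ == "__main__":
    # The 2-unit query waits behind the 100-unit report.
    print(run_fifo([("long_report", 100), ("quick_query", 2)]))
    # {'long_report': 100, 'quick_query': 102}
```

The short job finishes at time 102 even though it needs only 2 units of work, which is why this ordering is a poor fit for a cluster shared by many users.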

Fair Scheduler -

The fair scheduler was developed by Facebook. The fair scheduler assigns resources to jobs so that, on average over time, each job gets an equal share of the available resources. As a result, jobs that require less time can access the CPU alongside jobs that require more time to execute, instead of waiting behind them.

The Hadoop implementation creates a set of pools where jobs are placed for selection by the scheduler. Each pool can be assigned a set of shares to balance resources across the jobs in the pools. By default, all pools have equal shares, but a pool can be configured with more or fewer shares depending on the job type.
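
The effect of pool shares can be shown with a small Python sketch. It assumes shares act as simple proportional weights over a fixed number of slots; the function name pool_shares and the pool names are hypothetical, and the real scheduler has more rules than this.

```python
def pool_shares(total_slots, pool_weights):
    """Split total_slots across pools in proportion to their weights.

    pool_weights: dict mapping pool name to its configured share (weight).
    Returns a dict mapping pool name to the number of slots it is entitled to.
    """
    total_weight = sum(pool_weights.values())
    return {
        pool: total_slots * weight / total_weight
        for pool, weight in pool_weights.items()
    }


if __name__ == "__main__":
    # Default: equal shares.
    print(pool_shares(100, {"etl": 1, "adhoc": 1}))   # {'etl': 50.0, 'adhoc': 50.0}
    # Custom configuration: the etl pool gets three times the share.
    print(pool_shares(100, {"etl": 3, "adhoc": 1}))   # {'etl': 75.0, 'adhoc': 25.0}
```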

Each user is assigned to one pool to ensure fairness. Every user gets a fair share of the cluster capacity over time. A user gets only the assigned share of resources, regardless of the number of jobs submitted. If a user submits no jobs, that user's share does not sit idle; the excess capacity is split among the other pools.

The scheduler implementation keeps track of the compute time for each job in the system. Multiple users can run jobs on the cluster at the same time.
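
One way to picture how tracking compute time leads to fairness is the following Python sketch. It is a deliberate simplification, assuming a single slot that is handed out one tick at a time to the job that has received the least compute time so far; the function name run_fair and the job names are hypothetical.

```python
def run_fair(jobs):
    """Give each tick to the unfinished job with the least compute time so far.

    jobs: dict mapping job name to the number of ticks of work it needs.
    Returns a dict mapping each job name to its completion time.
    """
    remaining = dict(jobs)
    used = {name: 0 for name in jobs}   # compute time received per job
    finished = {}
    clock = 0
    while remaining:
        # Least-served job goes next; ties are broken by job name.
        name = min(remaining, key=lambda n: (used[n], n))
        used[name] += 1
        remaining[name] -= 1
        clock += 1
        if remaining[name] == 0:
            finished[name] = clock
            del remaining[name]
    return finished


if __name__ == "__main__":
    print(run_fair({"long_report": 100, "quick_query": 2}))
    # {'quick_query': 4, 'long_report': 102}
```

Here the 2-tick job finishes at tick 4 while the 100-tick job is still running. Under strict oldest-first ordering with the long job submitted first, the short job would not finish until tick 102.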

Capacity Scheduler -

The capacity scheduler was developed by Yahoo. It is similar to the fair scheduler, with some differences. It also aims to share the cluster fairly among users. Capacity scheduling was designed for large clusters.

In capacity scheduling, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity.

All queues are monitored for the allocated capacity. If a queue is not consuming its allocated capacity, the excess capacity can be temporarily allocated to other queues. In capacity scheduling, the jobs with a higher priority have access to resources sooner than lower-priority jobs.
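
The lending of unused capacity can be sketched in Python as follows. This is a simplified model, not the scheduler's real algorithm: it assumes capacities are percentages of a fixed slot count, and it lends spare slots to queues in dictionary order. The function name allocate and the queue names are hypothetical.

```python
def allocate(total_slots, capacity_pct, demand):
    """Allocate slots to queues, lending unused guaranteed capacity.

    capacity_pct: dict mapping queue name to guaranteed capacity in percent.
    demand: dict mapping queue name to the number of slots it currently wants.
    Returns a dict mapping queue name to the slots it receives.
    """
    guaranteed = {q: total_slots * pct // 100 for q, pct in capacity_pct.items()}
    # Step 1: each queue gets what it asks for, up to its guarantee.
    alloc = {q: min(demand[q], guaranteed[q]) for q in guaranteed}
    spare = total_slots - sum(alloc.values())
    # Step 2: lend spare slots to queues that still want more
    # (in dictionary order, which is an assumption of this sketch).
    for q in guaranteed:
        extra = min(spare, demand[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc


if __name__ == "__main__":
    # prod uses only 20 of its 60 guaranteed slots, so dev borrows the other 40.
    print(allocate(100, {"prod": 60, "dev": 40}, {"prod": 20, "dev": 90}))
    # {'prod': 20, 'dev': 80}
```

When prod later demands its full capacity again, the same function returns 60 slots for prod and 40 for dev, which reflects the temporary nature of the loan.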

Capacity scheduling also enforces strict access controls on queues. These access controls are defined on a per-queue basis and restrict the ability to submit jobs to queues and the ability to view and modify jobs in queues.
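
Per-queue access control can be pictured with the Python sketch below. The QueueAcl class, its two user sets, and the helper functions are hypothetical names chosen for illustration; they are not Hadoop's configuration keys or API.

```python
from dataclasses import dataclass, field


@dataclass
class QueueAcl:
    """Hypothetical per-queue access control list."""
    submitters: set = field(default_factory=set)       # users who may submit jobs
    administrators: set = field(default_factory=set)   # users who may view and modify jobs


def can_submit(acls, queue, user):
    """True if the user may submit jobs to the given queue."""
    return user in acls[queue].submitters


def can_administer(acls, queue, user):
    """True if the user may view and modify jobs in the given queue."""
    return user in acls[queue].administrators


if __name__ == "__main__":
    acls = {
        "prod": QueueAcl(submitters={"alice"}, administrators={"ops"}),
        "dev": QueueAcl(submitters={"alice", "bob"}, administrators={"alice"}),
    }
    print(can_submit(acls, "prod", "bob"))       # False
    print(can_submit(acls, "dev", "bob"))        # True
    print(can_administer(acls, "prod", "ops"))   # True
```

Because each queue carries its own lists, a user can be allowed to submit to one queue and be refused on another.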