Summary -
In this topic, we described the following -
MapReduce is a programming model designed to process large amounts of data in parallel by dividing the job into several independent local tasks. Running the tasks locally, on the nodes where the data already resides, drastically reduces network usage: instead of moving data across the cluster, the processing is moved to the data nodes that hold it.
The following tasks occur when a user submits a MapReduce job to Hadoop -
- The local Job Client prepares the job for submission and hands it off to the Job Tracker.
- The Job Tracker schedules the job and distributes the map work among the Task Trackers for parallel processing.
- Each Task Tracker spawns a map task to perform the work.
- The Job Tracker receives progress information from the Task Trackers.
- Once the mapping-phase results are available, the Job Tracker distributes the reduce work among the Task Trackers for parallel processing.
- Each Task Tracker spawns a reduce task to perform the work.
- The Job Tracker receives progress information from the Task Trackers.
- Once the reduce tasks complete, a cleanup task is performed.
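From the application's point of view, all of these steps are triggered by a single job submission. Below is a minimal, hypothetical word-count driver sketch using Hadoop's Java MapReduce API; the class names WordCountDriver, WordCountMapper, and WordCountReducer are placeholders (the mapper and reducer are sketched in the Map Task and Reduce Task sections below).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // optional local aggregation
        job.setReducerClass(WordCountReducer.class);   // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submitting the job kicks off the steps listed above: the client
        // prepares the job, the Job Tracker schedules map and reduce tasks,
        // and progress is reported back until the job completes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```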
The following components play the key roles in the MapReduce job lifecycle -
- Job Client
- Job Tracker
- Task Tracker
- Map Task
- Reduce Task
Job Client -
The Job Client prepares a job for execution. When a MapReduce job is submitted to Hadoop, the local Job Client performs the following -
- Validates the job configuration.
- Generates the input splits.
- Copies the job resources to a shared location (HDFS directory) which is accessible to the Job Tracker and Task Trackers.
- Submits the job to the Job Tracker.
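As an illustration of the split-generation step: the Job Client asks the job's InputFormat to compute the splits at submission time, and the split sizes can be influenced from the job configuration. A minimal sketch, assuming the org.apache.hadoop.mapreduce API and a text input; the split-size values are illustrative only.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitConfigExample {
    // A minimal sketch: TextInputFormat splits text files into line-oriented
    // records, and the hints below influence how large each split can be.
    // The 64 MB / 128 MB figures are example values, not recommendations.
    static void configureSplits(Job job) {
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```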
Job Tracker -
The Job Tracker is responsible for the following tasks -
- scheduling jobs
- dividing a job into map and reduce tasks
- distributing map and reduce tasks among worker nodes
- task failure recovery
- tracking the job status
The Job Tracker performs the following when preparing to run a job -
- Fetches input splits from the shared location where the Job Client placed the information.
- Creates a map task for each split.
- Assigns each map task to a Task Tracker.
The Job Tracker monitors the health of the Task Trackers and the progress of the job. Once the mapping-phase results are available, the Job Tracker performs the following steps -
- Creates reduce tasks up to the maximum enabled by the job configuration.
- Assigns each map result partition to a reduce task.
- Assigns each reduce task to a Task Tracker.
A job is marked as complete when all map and reduce tasks have completed successfully, or, if there is no reduce step, when all map tasks have completed successfully.
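The "maximum enabled by the job configuration" above corresponds to the number of reduce tasks requested by the job. A minimal sketch, assuming the org.apache.hadoop.mapreduce API; the reducer count is an illustrative value.

```java
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountExample {
    // A minimal sketch: the Job Tracker creates at most this many reduce
    // tasks for the job. Setting the count to 0 produces a map-only job.
    static void configureReducers(Job job, boolean mapOnly) {
        job.setNumReduceTasks(mapOnly ? 0 : 4);  // 4 is an illustrative value
    }
}
```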
Task Tracker -
A Task Tracker manages the tasks assigned to it and reports status to the Job Tracker. Often, the Task Tracker runs on the associated worker node, but it is not required to be on the same host.
When the Job Tracker assigns a map or reduce task to a Task Tracker, the Task Tracker performs the following -
- Fetches job resources locally.
- Spawns a child JVM on the node to execute the map or reduce task.
- Reports status to the Job Tracker.
The task spawned by the Task Tracker runs the job's map or reduce functions.
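The child JVM that runs each task can be tuned through standard Hadoop properties. A minimal sketch, assuming the classic MRv1 property mapred.child.java.opts and an illustrative heap size.

```java
import org.apache.hadoop.conf.Configuration;

public class ChildJvmExample {
    // A minimal sketch: each map or reduce task runs in a child JVM spawned
    // by the Task Tracker, and its JVM options can be set per job.
    // The -Xmx512m heap size is an illustrative value, not a recommendation.
    static Configuration withChildJvmOpts() {
        Configuration conf = new Configuration();
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
    }
}
```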
Map Task -
The Hadoop MapReduce framework creates a map task to process each InputSplit. The map task -
- Creates input key-value pairs, using the InputFormat to fetch the input data locally.
- Applies the job-supplied map function to each key-value pair.
- Performs local sorting and aggregation of the results.
- Runs the Combiner for further aggregation if the job includes a Combiner.
- Stores the results locally in memory and on the local file system.
- Communicates with the Task Tracker about progress and status.
- Notifies the Task Tracker when the task completes.
Map task results are passed through a local sort by key to prepare the data for the reduce tasks. If a Combiner is configured for the job, it also runs in the map task; the Combiner consolidates the data locally and reduces the amount of data that must be transferred to the reduce tasks.
When a map task notifies the Task Tracker that it has completed, the Task Tracker notifies the Job Tracker, which then makes the results available to the reduce tasks.
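To make the map side concrete, below is a minimal word-count mapper sketch using the org.apache.hadoop.mapreduce API. The class name WordCountMapper is a placeholder matching the driver sketch earlier, and TextInputFormat-style (offset, line) input is assumed.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input key-value pair produced by the InputFormat.
    // With TextInputFormat the key is a byte offset and the value is one line.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            // Emit (word, 1); the framework sorts these locally by key and,
            // if a Combiner is configured, aggregates them before transfer.
            context.write(word, ONE);
        }
    }
}
```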
Reduce Task -
The reduce phase aggregates the results from the map phase into the final results. Normally, the result set is smaller than the input set, but this is application dependent. The reduction is carried out by parallel reduce tasks.
The reduce input keys and values need not have the same type as the output keys and values. The reduce phase is optional, and a job can be configured to stop after the map phase completes. A reduce task is carried out in three phases - copy, sort, and reduce.
A reduce task -
- Fetches job resources locally.
- Runs the copy phase to fetch local copies of its assigned map results from the map worker nodes.
- Once the copy phase completes, runs the sort phase to merge the copied results into a single sorted set of (key, value-list) pairs.
- Once the sort phase completes, executes the reduce phase by invoking the job-supplied reduce function on each (key, value-list) pair.
- Saves the end results to the output destination (HDFS).
The input to the reduce function is a set of key-value pairs where each value is a list of values sharing the same key. When a reduce task notifies the Task Tracker that it has completed, the Task Tracker notifies the Job Tracker, and the Job Tracker marks the job as complete once the end results have been saved to the output destination (HDFS).
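To make the reduce side concrete, below is the matching word-count reducer sketch using the org.apache.hadoop.mapreduce API. The class name WordCountReducer is a placeholder; in the earlier driver sketch the same class also serves as the Combiner, which works here because summing counts is associative and commutative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    // Called once per key with the list of values sharing that key,
    // after the copy and sort phases have merged the map outputs.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        // The configured OutputFormat writes the final pairs to HDFS.
        context.write(key, result);
    }
}
```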