Mappers run on individual machines and prepare intermediate results. They accept splits as input, whereas a reducer accepts a partition of the data as input. Preparing those partitions from the intermediate mapper results is the responsibility of the sort & spill phase, which is the 5th circle given below.
During this phase, the keys from all the mappers are collected, sorted and shuffled, and then sent to the one machine where the reducer will run.
- Task tracker 1 (TT1) initiates a reducer task.
- TT1 updates the Job Tracker about the completion status.
- Similarly, TT2 or any other task tracker also updates the Job Tracker about its completion status.
- The reducer task goes to TT1, where the mapping task has finished, to collect the interim results.
- TT2 reads the mapper output and streams it to the reducer.
- The previous two steps are repeated for every other task tracker involved in mapping tasks.
- Once the reducer has received all the mapping results, it performs the sorting and spilling.
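To make the idea of preparing partitions a little more concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `PartitionSketch` and its methods are names I made up for illustration. The rule of taking the key's hash modulo the number of reducers follows the same idea as Hadoop's default hash partitioning, but the real framework does much more (spilling to disk, merging, fetching over the network), so please read this only as a simplified picture.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Hadoop.
final class PartitionSketch {

    // Pick the reducer for a key: non-negative hash modulo the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group intermediate (key, value) pairs into one map per reducer.
    // TreeMap keeps the keys of every partition in sorted order.
    static List<TreeMap<String, List<Integer>>> partition(
            List<Map.Entry<String, Integer>> mapperOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            int target = partitionFor(pair.getKey(), numReducers);
            partitions.get(target)
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Made-up mapper output, as if collected from several mappers.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("pear", 1), Map.entry("apple", 1), Map.entry("pear", 1));
        System.out.println(partition(mapperOutput, 2));
    }
}
```

Every occurrence of the same key lands in the same partition, which is why one reducer can see all the values for a key.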
We shall talk about the 3rd circle today, as we have already talked about job submission and job initialization.
Scheduling the jobs is an interesting concept. I’m really excited to see the communication between Scheduler, Job tracker and Task tracker.
- The task tracker keeps sending heartbeats to the Job Tracker about the status of its tasks. In effect, it tells the Job Tracker that a task is completed and that it wants more work.
- The Job Tracker updates the task status and makes a note of the task tracker's message.
- The Job Tracker goes to the Scheduler asking for tasks.
- The Scheduler updates its task scheduling record. Depending on the job scheduling policy (execution policy, priority etc.), it either makes the job client wait or processes the job.
- Job tracker gets the task.
- It submits the task to the task tracker.
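The conversation above can be sketched in a few lines of plain Java. Everything here is my own simplification: `HeartbeatSketch`, `FifoScheduler` and `JobTracker` are hypothetical classes and not the real Hadoop ones, and I picked a first-in, first-out policy only because it is the simplest one to show. A real scheduler also looks at priority and other policies.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of the heartbeat exchange; none of these are Hadoop classes.
final class HeartbeatSketch {

    // Assumed policy: hand out tasks in the order they were queued.
    static final class FifoScheduler {
        private final Queue<String> pending = new ArrayDeque<>();

        void submit(String task) {
            pending.add(task);
        }

        // Returns the next waiting task, or null when nothing is waiting.
        String nextTask() {
            return pending.poll();
        }
    }

    static final class JobTracker {
        private final FifoScheduler scheduler;
        private final Map<String, String> status = new HashMap<>();

        JobTracker(FifoScheduler scheduler) {
            this.scheduler = scheduler;
        }

        // Called for every heartbeat. finishedTask is null when the tracker
        // has nothing to report. Returns the next task for it, or null.
        String heartbeat(String trackerId, String finishedTask) {
            if (finishedTask != null) {
                status.put(finishedTask, "COMPLETED");       // note the tracker's message
            }
            String next = scheduler.nextTask();              // ask the scheduler for work
            if (next != null) {
                status.put(next, "RUNNING on " + trackerId); // hand the task to the tracker
            }
            return next;
        }

        String statusOf(String task) {
            return status.get(task);
        }
    }

    public static void main(String[] args) {
        FifoScheduler scheduler = new FifoScheduler();
        scheduler.submit("map-1");
        scheduler.submit("map-2");
        JobTracker jobTracker = new JobTracker(scheduler);

        String first = jobTracker.heartbeat("TT1", null);
        String second = jobTracker.heartbeat("TT1", first);
        System.out.println(first + " -> " + jobTracker.statusOf(first));
        System.out.println(second + " -> " + jobTracker.statusOf(second));
    }
}
```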
Dear fellow Hadoopers,
After a quick introduction to HDFS, my instructor started on Map/Reduce concepts today. I realized that he touched upon many concepts in a short session. Here is what I scribbled down.
M/R – As the name implies, it has two parts.
Map
- These are Java programs written using the M/R algorithm
- Mapping programs run on each block of a big data file
- They transform data, for example picking the URL out of a web server access log file
Reduce
- These are also Java programs written using the M/R algorithm
- We do aggregations with reduce
- The output of mapping becomes the input of reducing.
- We do many types of aggregation to arrive at the results required by the business logic
- A good example is counting how many requests were received for a particular URL of a web server.
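Here is a small sketch of that web server example in plain Java, without the Hadoop API. The class `UrlHitsSketch` and its methods are hypothetical, and I am assuming a log line where the request appears as the first double-quoted field, such as `"GET /index.html HTTP/1.1"`. Real access logs vary, so the parsing would need adjusting.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Hadoop.
final class UrlHitsSketch {

    // "Map" side: pull the requested URL out of one log line.
    // Assumes the request is the first double-quoted field of the line;
    // a line without that field would throw an exception here.
    static String extractUrl(String logLine) {
        String request = logLine.split("\"")[1]; // e.g. GET /index.html HTTP/1.1
        return request.split(" ")[1];            // the middle token is the URL
    }

    // "Reduce" side: add up one hit per extracted URL.
    static Map<String, Integer> countHits(List<String> logLines) {
        Map<String, Integer> hits = new TreeMap<>();
        for (String line : logLines) {
            hits.merge(extractUrl(line), 1, Integer::sum);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only for demonstration.
        List<String> lines = List.of(
                "10.0.0.1 - - [01/Jan/2015:10:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 512",
                "10.0.0.2 - - [01/Jan/2015:10:00:05 +0000] \"GET /about.html HTTP/1.1\" 200 128",
                "10.0.0.3 - - [01/Jan/2015:10:00:09 +0000] \"GET /index.html HTTP/1.1\" 200 512");
        System.out.println(countHits(lines));
    }
}
```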
M/R Process flow
The following gives an example of the MR process flow.
M/R 1 process flow example
We have many words in a block of a big data file. This is our input.
During the split phase, Hadoop reads the input sequentially. K1 is the key and V1 is the value; each value holds one record of the input.
Mapping takes the output of the split and parses the content. It emits K2, V2 key-value pairs, where K2 is the word and V2 is its number of appearances.
The process then moves to the shuffling phase, which takes the K2, V2 pairs and groups them by word, so that each word sits together with all of its appearance counts.
The output of shuffling is given to the reduce process, where aggregation is performed and the final result, a list of K3, V3 pairs, is prepared.
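The four phases can be traced with a tiny plain Java program. This is only my sketch of the flow and does not use the Hadoop API: `WordCountFlowSketch` is a made-up class, and I use the line number as K1 purely for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Hadoop.
final class WordCountFlowSketch {

    // Split: one record per line. K1 = line number (an assumption), V1 = line text.
    static Map<Integer, String> split(String input) {
        Map<Integer, String> records = new TreeMap<>();
        String[] lines = input.split("\n");
        for (int i = 0; i < lines.length; i++) {
            records.put(i, lines[i]);
        }
        return records;
    }

    // Map: emit (K2 = word, V2 = 1) for every word in every record.
    static List<Map.Entry<String, Integer>> map(Map<Integer, String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : records.values()) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle: group all V2 values under their K2, with keys kept sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values to get (K3 = word, V3 = total count).
    static TreeMap<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String input = "deer bear river\ncar car river\ndeer car bear";
        System.out.println(reduce(shuffle(map(split(input)))));
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; here everything happens in one process just to show how the key-value pairs change shape from phase to phase.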
M/R 1 Task Execution
- We launch the job from the client to the Job Tracker
- The Job Tracker runs on a server-class machine
- The Job Tracker can submit the job to each task tracker on the data nodes.
- It can monitor the jobs on all nodes.
- If required, it can kill the tasks.
We divide the Hadoop process flow into two major classifications:
Storage and Jobs
Storage – Building blocks
We have an HDFS cluster. In this cluster we have different data nodes, like node 1, node 2 ….. node n. All these nodes are slave nodes in which data is stored as small chunks. These nodes are administered by the admin node using the name node.
When huge data comes in, the admin node has control over how many nodes are available, where the data can be stored, etc. The name node has the catalog of the data already saved in the slave nodes.
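The catalog idea can be pictured as a simple lookup table. The sketch below is my own toy model in plain Java: `NameNodeSketch` is a hypothetical class, the round-robin placement is an assumption made for simplicity, and it ignores replication and everything else the real name node handles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a catalog; it is not the real HDFS name node.
final class NameNodeSketch {
    private final List<String> dataNodes;
    private final Map<String, List<String>> catalog = new HashMap<>();
    private int next = 0; // index of the data node that gets the next chunk

    NameNodeSketch(List<String> dataNodes) {
        this.dataNodes = dataNodes;
    }

    // Decide where each chunk of the file goes (assumed round-robin policy)
    // and remember the placement in the catalog.
    List<String> store(String file, int chunkCount) {
        List<String> placement = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            placement.add(dataNodes.get(next));
            next = (next + 1) % dataNodes.size();
        }
        catalog.put(file, placement);
        return placement;
    }

    // Look up which data node holds each chunk; empty list for unknown files.
    List<String> locate(String file) {
        return catalog.getOrDefault(file, List.of());
    }

    public static void main(String[] args) {
        NameNodeSketch nameNode = new NameNodeSketch(List.of("node1", "node2", "node3"));
        nameNode.store("access.log", 5);
        System.out.println(nameNode.locate("access.log"));
    }
}
```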
Jobs – Building blocks
We start with the MapReduce engine. It provides a set of advanced APIs to retrieve the data. MapReduce jobs are scheduled by the Job Tracker, which is the admin of jobs. It relies on task trackers running on individual data nodes to complete a job. Generally a data node has both the data and a task tracker.
The Job Tracker tracks job completion. If the job fails on a node, it finds an alternate node to get the task done.
I depicted it pictorially below.
Hadoop storage and jobs – Javashine
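Coming back to the point that the Job Tracker finds an alternate node when work fails on one node, that behaviour can be sketched roughly as below. This is a simplification under my own assumptions: `FailoverSketch` is a made-up class, and the `runsOn` predicate is only a stand-in for actually running the task on a node.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration class; it is not part of Hadoop.
final class FailoverSketch {

    // Try the task on each candidate node in order and return the first node
    // on which it succeeds. An empty result means every node failed.
    static Optional<String> runWithFailover(List<String> nodes, Predicate<String> runsOn) {
        for (String node : nodes) {
            if (runsOn.test(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3");
        // Pretend node1 is broken and every other node works.
        Optional<String> ranOn = runWithFailover(nodes, node -> !node.equals("node1"));
        System.out.println(ranOn.orElse("no node could run the task"));
    }
}
```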