We are in 6th circle today, which is the reducer function. A job is submitted by the user, which has been initiated in 2nd circle for which the setup is completed in 3rd circle.
Map Task was executed in 4th circle and sort & shuffle was completed in 5th circle.
The reducer will collect the output from all the mappers to apply the user defined reduce function.
- Task tracker launches the reduce task
- Reduce task (not reduce function) read the jar and xml of the job.
- It execute the shuffle. Because the time the reducer task started, all the mappers may not have completed the job. So it goes to individual mapper machines to collect the output and shuffles them.
- Once all the mapping activity is finished it invokes the user reducer function (one more reducers).
- Each reducers will complete their jobwrite the output records to HDFS.
- Those output would be stored in temporary output file first.
- Once all the reducers have completed their job, final output would be written to the reducer partition file.
Mappers run on individual machines and prepare intermediate results. They accept splits as inputs. Reducer accepts partition of data as inputs. Preparing the partitions from the intermediate mapper results is the responsibility of this sort & spill phase, which is the 5th circle given below.
During this phase, we find the keys from all the mappers, sort and shuffle them before sending them to one machine, where the reducer will run.
- The task tracker 1 initiates a reducer task.
- TT1 updates the job tracker about the completion status.
- Similarly TT2 or any other task trackers also updates the Job Tracker about completion status.
- The reducer task goes to TT1 where the mapping task is finished to collect the interim results.
- TT2 read the mapper output streams it to the reducer
- Task 4 and 5 is repeated for other task trackers also who are all involved in mapping tasks.
- Once the reducer received all the mapping results, it performs a sorting and spilling.
The user had submitted his job. He has permissions. We have slots in the cluster. Job setup is completed. We look at 4th circle given below – The Map Task Execution
The below given diagram depicts the Map Task Execution.
- The task tracker launches the Map Task
- The Map task read the jar file given the user. This is what we write in Eclipse. In the entire frameworks, this is what our contribution 🙂
The Map task also reads the job config (input path, output path etc). It gets everything from HDFS, as all these are already uploaded to HDFS initially.
- The Map task reads the input splits from HDFS
- From the input splits, Map task creates the record.
- The Map task invokes the user Mapper with the record
- The mapper writes intermediate output
- The task sort them based on key and flush them to disk.
- Map task informs Task Tracker about the completion of the job.
We shall talk about 3rd circle today, as we talk about Job submission and Job initialilzation already.
Scheduling the jobs is an interesting concept. I’m really excited to see the communication between Scheduler, Job tracker and Task tracker.
- The task tracker keeps on sending heartbeats to Job Tracker about the status of the job. So, it says to Job Tracker that job is completed and it wants more jobs.
- Job Tracker updates the task status and make a note of Task Tracker’s message.
- Job Tracker goes to Scheduler asking for tasks.
- Scheduler updates the tasks scheduler record. Based on job scheduling policy, either it makes the job client to wait or process the job. It is based on execution policy, priority etc.
- Job tracker gets the task.
- It submits the task to the task tracker.
Dear fellow Hadoopers,
After a quick introduction to HDFS, my instructor started Map/Reduce concepts today. I could realize that he touched upon many concepts in a short sessions. Here is what I scribbled down.
M/R – As the name implies, it has two parts.
- These are java programs written using M/R algorithm
- Mapping programs runs on each block of big data file
- Transformation of data, picking the URL from a web server access log file
- These are also java programs writting using M/R algorithm
- We do aggregations with reduce
- The output of mapping becomes the input of Reducing.
- We do many type of aggregations to arrive at the right results required by business logic
- A good example is how many requests received for a particular URL of a web server.
M/R Process flow
The following gives an example of the MR process flow.
M/R 1 process flow example
We have many words in a block of a big data file. This is our input.
During the split phase, Hadoop reads the input sequentially. K1 is the key and V1 is the value. Each value denotes each record in the input.
Mapping takes the output of split and parse the content. It makes a K2 and V2 key value pair, denotes the word and number of appearance.
The process is moved to shuffling phase, where we give the K2, V2 to prepare the word and its appearance.
This output of shuffling is given to reduce process where aggregation is performed and final result is k3, v3 list is prepared.
M/R 1 Task Execution
- We launch the job from client to Job tracker
- Job tracker is running on a server class machine
- Job tracker can submit the job to each task tracker on data nodes.
- It can monitor the jobs on all nodes.
- If it is preferred, it can kill the tasks.