Dear fellow Hadoopers,
After a quick introduction to HDFS, my instructor started on Map/Reduce concepts today. I realized he touched on many concepts in a short session. Here is what I scribbled down.
M/R – As the name implies, it has two parts: Map and Reduce.
Map
- These are Java programs written using the M/R algorithm
- Mapping programs run on each block of the big data file
- They transform the data, e.g. picking the URL out of a web server access log file
Reduce
- These are also Java programs written using the M/R algorithm
- We do aggregations with Reduce
- The output of mapping becomes the input of reducing
- We do many types of aggregation to arrive at the results required by the business logic
- A good example is counting how many requests were received for a particular URL of a web server.
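To see the two halves together, here is a minimal plain-Java sketch of that URL-count example (no Hadoop involved; the simplified log format and sample lines are my own assumptions):

```java
import java.util.HashMap;
import java.util.Map;

public class UrlCount {
    // Map step: extract the requested URL from one access-log line.
    // Assumes a simplified format: ip - - [date] "GET /path HTTP/1.1" status
    static String mapLine(String line) {
        String request = line.split("\"")[1];   // text between the first pair of quotes
        return request.split(" ")[1];           // "GET /path HTTP/1.1" -> "/path"
    }

    // Reduce step: aggregate a request count per URL.
    static Map<String, Integer> reduce(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(mapLine(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] log = {                        // hypothetical sample data
            "10.0.0.1 - - [01/Jan/2013] \"GET /index.html HTTP/1.1\" 200",
            "10.0.0.2 - - [01/Jan/2013] \"GET /about.html HTTP/1.1\" 200",
            "10.0.0.3 - - [01/Jan/2013] \"GET /index.html HTTP/1.1\" 404"
        };
        System.out.println(reduce(log));
    }
}
```

In real Hadoop the map and reduce steps would be separate classes running on different nodes; here they are folded into one program just to show the data flow.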
M/R Process flow
The following gives an example of the M/R process flow.
We have many words in a block of a big data file. This is our input.
During the split phase, Hadoop reads the input sequentially and forms (K1, V1) key/value pairs; each pair denotes one record of the input.
The map phase takes the output of the split and parses its content, producing (K2, V2) pairs where K2 is a word and V2 the number of times it appears.
The process then moves to the shuffle phase, where the (K2, V2) pairs are grouped so that all the counts for the same word come together.
The output of shuffling is given to the reduce process, where aggregation is performed and the final result, a list of (K3, V3) pairs, is prepared.
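The phase sequence above can be sketched in plain Java as a toy word-count simulation (this is not the Hadoop API; the input text is made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountPhases {
    public static Map<String, Integer> run(String block) {
        // Split phase: each line of the block becomes one (K1, V1) record
        String[] records = block.split("\n");

        // Map phase: each record is parsed into (K2, V2) = (word, 1) pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle phase: group all V2 values under the same K2
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }

        // Reduce phase: aggregate each group into (K3, V3) = (word, total)
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                reduced.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        System.out.println(run("to be or\nnot to be"));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In a real cluster each phase runs on different machines over different blocks; this sketch only shows how the key/value pairs change shape from (K1, V1) to (K3, V3).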
Job submission
- We launch the job from the client to the JobTracker
- The JobTracker runs on a server-class machine
- The JobTracker submits tasks to the TaskTracker on each data node
- It monitors the jobs on all nodes
- If needed, it can kill tasks