YARN process flow


Yesterday I wrote my note on MR v1. YARN, which is MR v2, was released to address the problems seen in MR v1.


It addresses the following limitations of MR v1.

  1. MR v1 has two major daemons: Job Tracker and Task Tracker. As all the tasks are aggregated under these daemons, it may hang when we process a large set of data.
  2. MR v1 does not provide high availability (HA).

YARN (MR2) process flow

  1. JOB SUBMISSION: The user runs the YARN job code to launch a job. A new JVM is launched.
  2. JOB SUBMISSION: The job is submitted to the App Manager of the Resource Manager.
  3. JOB SUBMISSION: In turn, the Resource Manager returns a job identifier. The job resources are copied to HDFS. The Name Node holds the metadata of these resources (the job details, not the file content).
  4. RESOURCE MGMT: The Resource Manager talks to one of the machines where a Node Manager is running (Machine A) and asks that Node Manager to start an Application Master on that machine.
  5. RESOURCE MGMT: The Node Manager accepts the call and spawns the Application Master.
  6. RESOURCE MGMT: The Application Master goes to the Name Node to get the metadata of the job. It calculates the resource usage, split information, etc. needed to execute the tasks.
  7. RESOURCE MGMT: It requests the Application Master instance ID from the RM.
  8. RESOURCE MGMT: The Application Master identifies machines and calculates their resource information. It informs the Resource Manager which Node Managers it has identified to execute the tasks.
  9. RESOURCE MGMT: The Resource Manager sends an acknowledgement to the Application Master so that containers can be allocated on the identified Node Managers.
  10. RESOURCE MGMT: The Application Master tells the Node Manager of Machine B, the box where the data lies, that its containers have been approved.
  11. TASK EXECUTION: The Node Manager creates the container. The task will be executed inside it.
  12. TASK EXECUTION: The task container gets the job resources from HDFS, which were copied there in Step 3.
  13. TASK EXECUTION: The task starts.
  14. TASK EXECUTION: It is the Application Master’s duty to monitor the container, so the container sends its MR status (task-related information) to the Application Master.
  15. TASK EXECUTION: The Node Manager of Machine B updates the Resource Manager about resource consumption (CPU, memory, etc.). Based on this data, the Resource Manager decides which machines to use for other job executions.
  16. TASK COMPLETION: The container informs the Application Master about the job completion. The Resource Manager is updated, and the job queue is updated. The Application Master de-registers the container, and the container is terminated.
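
To make the 16 steps easier to picture, here is a small toy model of the flow in Python. It is only a sketch for this note: the classes `NodeManager` and `ResourceManager`, the function `run_job`, and the memory sizes are all names and numbers I made up, and none of this is the real Hadoop API. It assumes the Resource Manager simply prefers the machine that holds the data and falls back to any machine with enough free memory.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class NodeManager:
    """Toy stand-in for a per-machine Node Manager (hypothetical, not Hadoop's class)."""
    host: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch_container(self, app_id, memory_mb, role):
        # Steps 5 and 11: the Node Manager creates the container on its own box.
        if memory_mb > self.free_mb:
            raise RuntimeError(f"{self.host}: not enough free memory")
        self.free_mb -= memory_mb
        container = {"app": app_id, "mb": memory_mb, "role": role}
        self.containers.append(container)
        return container

    def release(self, container):
        # Step 16: the container is terminated and its memory is given back.
        self.containers.remove(container)
        self.free_mb += container["mb"]


class ResourceManager:
    """Toy Resource Manager: hands out job ids and picks a node for each request."""

    def __init__(self, nodes):
        self.nodes = {n.host: n for n in nodes}
        self.apps = {}
        self._ids = count(1)

    def submit(self):
        # Steps 2-3: accept the job and return a job identifier.
        app_id = f"application_{next(self._ids):04d}"
        self.apps[app_id] = "SUBMITTED"
        return app_id

    def allocate(self, memory_mb, preferred_host=None):
        # Try the preferred host first (data locality), then the others in order.
        ordered = sorted(self.nodes.values(), key=lambda n: n.host != preferred_host)
        for node in ordered:
            if node.free_mb >= memory_mb:
                return node
        raise RuntimeError("no node has enough free memory")


def run_job(rm, split_hosts, am_mb=512, task_mb=1024):
    """Walk through submission, resource management, execution and completion."""
    log = []
    app_id = rm.submit()                                  # steps 1-3
    log.append(f"submitted {app_id}")
    am_node = rm.allocate(am_mb)                          # step 4
    am = am_node.launch_container(app_id, am_mb, "AM")    # step 5
    rm.apps[app_id] = "RUNNING"
    log.append(f"AM on {am_node.host}")
    running = []
    for host in split_hosts:                              # steps 6-11
        node = rm.allocate(task_mb, preferred_host=host)
        running.append((node, node.launch_container(app_id, task_mb, "task")))
        log.append(f"task on {node.host}")
    for node, container in running:                       # steps 12-16
        node.release(container)
    am_node.release(am)
    rm.apps[app_id] = "FINISHED"
    log.append("finished")
    return app_id, log


if __name__ == "__main__":
    cluster = [NodeManager("A", 2048), NodeManager("B", 4096)]
    for line in run_job(ResourceManager(cluster), ["B", "B"])[1]:
        print(line)
```

In this toy run the Application Master lands on Machine A and both tasks land on Machine B, because the input splits are assumed to live on B. Real YARN scheduling is far more involved (queues, heartbeats, vcores), so treat this only as a memory aid for the steps above.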

Happy Weekend, Hadoopers!

Virtualization Terminologies – 1

VMware ESX server:

This is a popular virtualization solution from VMware.
It installs on bare metal and allows multiple OSs and their applications to run in VMs.

Physical server/Virtual Host:

The operating system running on bare metal

Guests/Virtual Machines:

Individual operating systems hosted by ESX or a virtual host


An OS that runs directly on the ESX server host.

Resource Manager:

Partitions the physical resources of the underlying server
Uses a proportional-share mechanism to allocate CPU, memory and disk resources to VMs that are powered on
Manages the reservations and limits of each VM
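
To get a feel for how shares, reservations and limits interact, here is a minimal Python sketch of a proportional-share allocator. It is my own simplified model, not VMware's actual algorithm: the `VM` class, the `allocate` function and the numbers are hypothetical. It assumes every powered-on VM first receives its reservation, and the leftover capacity is then split in proportion to shares, with anything above a VM's limit handed back to the others.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical VM record; the field names are assumptions for this sketch."""
    name: str
    shares: int
    reservation: float = 0.0       # guaranteed minimum
    limit: float = float("inf")    # hard upper bound
    powered_on: bool = True


def allocate(capacity, vms):
    """Split `capacity` (e.g. MHz or MB) among the powered-on VMs."""
    active = [vm for vm in vms if vm.powered_on]

    # Every powered-on VM gets its reservation first.
    alloc = {vm.name: vm.reservation for vm in active}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")

    # Hand out the rest by shares; VMs that hit their limit drop out and
    # their unused portion is redistributed in the next round.
    open_vms = [vm for vm in active if alloc[vm.name] < vm.limit]
    while remaining > 1e-9 and open_vms:
        total_shares = sum(vm.shares for vm in open_vms)
        if total_shares == 0:
            break
        handed_out = 0.0
        for vm in open_vms:
            offer = remaining * vm.shares / total_shares
            grant = min(offer, vm.limit - alloc[vm.name])
            alloc[vm.name] += grant
            handed_out += grant
        remaining -= handed_out
        open_vms = [vm for vm in open_vms if vm.limit - alloc[vm.name] > 1e-9]
    return alloc


if __name__ == "__main__":
    vms = [
        VM("web", shares=2000),
        VM("db", shares=1000),
        VM("test", shares=1000, powered_on=False),
    ]
    print(allocate(3000, vms))
```

With these assumed numbers the powered-off VM gets nothing, and the 3000 units are split 2:1 between the other two, matching their shares.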

VMkernel Hardware Interface:

Hides the hardware differences from the ESX server and guest users
Enables hardware-specific service delivery
Includes device drivers and Virtual Machine File System (VMFS)

Virtual Machine Monitor (VMM):

Responsible for virtualizing CPUs
When a VM starts, control is transferred to the VMM
Sets the system state