We divide the Hadoop process flow into two major classifications:
Storage and Jobs
Storage – Building blocks
We have an HDFS cluster. In this cluster we have different data nodes: node 1, node 2, ... node n. All of these are slave nodes, in which data is stored as small chunks (blocks). These nodes are administered by the admin node using the NameNode.
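As a rough sketch, here is how a client might write a file into the cluster with the Java FileSystem API. The NameNode address hdfs://namenode:9000 and the path /demo/sample.txt are assumptions for illustration, not values from a real cluster:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode address; in a real cluster this usually
        // comes from core-site.xml instead of being hard-coded.
        FileSystem fs = FileSystem.get(
                new URI("hdfs://namenode:9000"), new Configuration());

        // Write a file; HDFS transparently splits it into blocks and
        // spreads those blocks across the slave (data) nodes.
        try (FSDataOutputStream out = fs.create(new Path("/demo/sample.txt"))) {
            out.writeUTF("Hello HDFS");
        }
        fs.close();
    }
}
```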
When huge data comes in, the admin node has control over how many nodes are available, where the data can be stored, and so on. The NameNode holds the catalog (the metadata) of the data already saved in the slave nodes.
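We can see this catalog in action by asking the NameNode where the blocks of a file live. A minimal sketch, assuming the same hypothetical /demo/sample.txt file from above and a configuration that already points at the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCatalogExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the loaded configuration points at HDFS.
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the NameNode which slave nodes hold each block of the file.
        FileStatus status = fs.getFileStatus(new Path("/demo/sample.txt"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is replicated on several slave nodes.
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```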
Jobs – Building blocks
We start with the MapReduce engine. It provides a set of APIs to process and retrieve the data. MapReduce jobs are scheduled by the JobTracker, which is the admin of jobs. It relies on TaskTrackers running on individual data nodes to complete a job. Generally, a data node hosts both the data and a TaskTracker, so tasks run close to the data they work on.
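The classic word-count job illustrates this flow: map tasks run on the data nodes via their TaskTrackers, and reduce tasks aggregate the results. A minimal sketch using the standard Hadoop MapReduce API, with input and output paths taken from the command line:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the data nodes, close to the stored blocks.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sums the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run this with something like `hadoop jar wordcount.jar WordCount /input /output`.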
The JobTracker tracks job completion. If a task fails on a node, it finds an alternate node to get the task done.
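This retry behavior is configurable per job. A minimal sketch using the classic mapred API; the value 4 is the usual default and is shown here only as an assumed setting:

```java
import org.apache.hadoop.mapred.JobConf;

public class RetryConfigExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(RetryConfigExample.class);

        // If a task fails on one node, the JobTracker reschedules it on
        // an alternate node, up to this many attempts before the job
        // itself is marked as failed.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);
    }
}
```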
I have depicted this pictorially below.