Hadoop process flow

We divide the Hadoop process flow into two major categories:

Storage and Jobs

Storage – Building blocks

We have an HDFS cluster. In this cluster we have a number of data nodes: node 1, node 2, ... node n. These are all slave nodes, and the data is stored on them in chunks (blocks). The data nodes are administered by the admin node through the name node.

When a large amount of data comes in, the admin node decides where it goes: it knows how many nodes are available and where the data can be stored. The name node keeps the catalog of the data already saved on the slave nodes.
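To make the catalog idea concrete, here is a minimal sketch in plain Python. It is not the HDFS API: the class name ToyNameNode, the round-robin placement and the tiny block size are all assumptions made for illustration, and real HDFS also replicates each block across nodes, which is left out here.

```python
class ToyNameNode:
    """Hypothetical stand-in for the name node: it records which data node holds each block."""

    def __init__(self, data_nodes, block_size):
        self.data_nodes = list(data_nodes)
        self.block_size = block_size
        self.catalog = {}                                # file name -> [(block index, node)]
        self.storage = {n: {} for n in self.data_nodes}  # node -> {(file name, block index): bytes}

    def put(self, name, data):
        """Split data into fixed-size blocks and spread them over the data nodes."""
        self.catalog[name] = []
        for index, start in enumerate(range(0, len(data), self.block_size)):
            block = data[start:start + self.block_size]
            # Round-robin placement is an assumption of this sketch.
            node = self.data_nodes[index % len(self.data_nodes)]
            self.storage[node][(name, index)] = block
            self.catalog[name].append((index, node))

    def get(self, name):
        """Use the catalog to collect the blocks back in order."""
        return b"".join(self.storage[node][(name, index)]
                        for index, node in self.catalog[name])


name_node = ToyNameNode(["node1", "node2", "node3"], block_size=4)
name_node.put("log.txt", b"abcdefghij")
print(name_node.catalog["log.txt"])  # [(0, 'node1'), (1, 'node2'), (2, 'node3')]
print(name_node.get("log.txt"))      # b'abcdefghij'
```

The point of the sketch is that the data nodes only hold blocks. Finding and reassembling a file always goes through the catalog.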

Jobs – Building blocks

We start with the MapReduce engine. It provides a set of advanced APIs to process and retrieve the data. MapReduce jobs are scheduled by the Job Tracker, which is the admin of jobs. It relies on Task Trackers running on the individual data nodes to complete a job. Generally, each data node holds both the data and a task tracker.
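The map and reduce steps themselves can be sketched with the classic word count. This is plain Python running in a single process, not the Hadoop MapReduce API. The function names map_phase, shuffle, reduce_phase and run_job are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(run_job(["big data big cluster", "data node"]))
# {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster the map calls would run as separate tasks on the data nodes that hold the input, which is why the data and the task tracker usually sit on the same node.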

The job tracker tracks job completion. If a task fails on one node, it finds an alternate node to get the task done.
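The failover behaviour can be sketched as a simple loop. This is a simplification, not the real Job Tracker logic: the helper run_with_failover, the flaky_runner callback and the node names are hypothetical, and a failure is modelled as a RuntimeError.

```python
def run_with_failover(task, nodes, run_on_node):
    """Try the task on each node in turn; return (node, result) from the first success."""
    errors = {}
    for node in nodes:
        try:
            return node, run_on_node(node, task)
        except RuntimeError as exc:
            # Remember why this node failed, then move on to the next one.
            errors[node] = str(exc)
    raise RuntimeError(f"task {task!r} failed on every node: {errors}")


def flaky_runner(node, task):
    """Hypothetical task runner in which node1 is always down."""
    if node == "node1":
        raise RuntimeError("node1 is down")
    return f"{task} done on {node}"


print(run_with_failover("count-words", ["node1", "node2"], flaky_runner))
# ('node2', 'count-words done on node2')
```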
The diagram below depicts both the storage and the jobs parts.

Hadoop storage and jobs - Javashine


