Anatomy of Read & Write in HDFS


I covered some basic commands to read content from HDFS in my earlier post, Basic HDFS commands – Demo.


How does it happen internally? How does my HDFS CLI client know where the block is? Here is the explanation given by my instructor.

hadoop023 - HDFS Read Anatomy

  1. The HDFS client asks DFS to open a file
  2. DFS goes to the name node to get the block metadata
  3. We need to read 2 blocks – one from data node 1 and another from data node 3
  4. An FSDataInputStream is opened to read the block at data node 1
  5. In parallel, an FSDataInputStream is opened to read the block at data node 3
  6. Once the blocks are read and merged, the stream is closed
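The steps above can be sketched as a small simulation. This is not the real Hadoop client API – the block IDs, node names, and `read_block` helper are all hypothetical stand-ins for what the name node and FSDataInputStream would actually provide:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata the name node might return for a 2-block file:
# each block id is mapped to a data node holding a replica (steps 2-3).
block_locations = {
    "blk_0": "datanode1",
    "blk_1": "datanode3",
}

def read_block(block_id, data_node):
    """Stand-in for an FSDataInputStream fetching one block (steps 4-5)."""
    return f"<contents of {block_id} from {data_node}>"

# Blocks are fetched in parallel, then merged in block order so the
# client sees one continuous stream (step 6).
with ThreadPoolExecutor() as pool:
    futures = {bid: pool.submit(read_block, bid, dn)
               for bid, dn in block_locations.items()}
    merged = "".join(futures[bid].result() for bid in sorted(futures))

print(merged)
```

The key idea the simulation captures is that the client talks to the name node only for metadata; the actual block data is streamed directly from the data nodes.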

I have written some content to HDFS as explained in my earlier post, Store files in two disks in Hadoop – Storage Reliability. Let’s see how write operations work inside HDFS.

hadoop024 - HDFS Write Anatomy

  1. The HDFS client contacts the filesystem to say, ‘thala, I want to write some content’
  2. The client asks the name node for a location to write. The name node keeps an account of occupied and free space in HDFS.
  3. Once the client gets the location, it starts writing the content.
  4. Content is written to data node 1 first. Data node 1 replicates to DN2, and DN2 replicates to DN3. The default replication factor is 3 for production clusters.
  5. If all 3 replications are successful, an acknowledgement packet is sent back to the client.
  6. After the acknowledgement, the client closes the write stream.
  7. A completion signal is sent to the name node.
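The replication pipeline in steps 4–5 can be sketched the same way. Again, the node names and packet handling are illustrative, not real HDFS internals – the point is that each node stores its replica and forwards the packet, and the acknowledgement travels back only after the whole chain succeeds:

```python
REPLICATION_FACTOR = 3
pipeline = ["datanode1", "datanode2", "datanode3"]
replicas = {}  # which node stored the packet, for inspection

def write_through_pipeline(packet, nodes):
    """Forward a packet down the chain (step 4); the ack propagates
    back upstream only if every downstream node succeeded (step 5)."""
    if not nodes:
        return True  # end of chain: ack flows back toward the client
    head, rest = nodes[0], nodes[1:]
    replicas[head] = packet                       # this node stores its replica
    return write_through_pipeline(packet, rest)   # then forwards downstream

acked = write_through_pipeline(b"some content", pipeline)
print("client received ack:", acked)
```

Notice that the client only writes once, to the first data node; the data nodes themselves form the replication chain, which keeps the client’s outbound bandwidth constant regardless of the replication factor.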

Here I have an important question. What will happen if any one of these flows is violated or fails?!
