I drafted some basic commands to read content from HDFS in my earlier post Basic HDFS commands – Demo.
How does it happen internally? How does my HDFS CLI client know where the block is? Here is the explanation given by my instructor.
- The HDFS client asks the DistributedFileSystem to open a file
- The DistributedFileSystem contacts the Name Node to get the block metadata – which blocks make up the file, and which Data Nodes hold them
- In this example, we need to read 2 blocks – one from Data Node 1 and another one from Data Node 3
- An FSDataInputStream is opened, and the block on Data Node 1 is read first
- The stream then reads the block on Data Node 3 – blocks are read in order, each from the closest Data Node that holds a replica
- Once the blocks are read and merged into the full file content, the stream is closed
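The read flow above can be sketched as a toy model in Python. Note that the names and data shapes here are my own invention for illustration – this is not the real Hadoop API, just the idea of "ask for block metadata, then fetch each block from its Data Node and merge":

```python
# Toy model of the HDFS read flow described above.
# Block ids, node names, and the metadata shape are illustrative only.

# Metadata the "Name Node" would return: block id -> Data Node holding it
block_metadata = {
    "blk_0001": "datanode1",
    "blk_0002": "datanode3",
}

# Content stored on each "Data Node"
datanodes = {
    "datanode1": {"blk_0001": b"Hello, "},
    "datanode3": {"blk_0002": b"HDFS!"},
}

def open_and_read(block_metadata, datanodes):
    """Mimic the input stream: fetch each block from its Data Node
    in order and merge the pieces into the full file content."""
    content = b""
    for block_id, node in block_metadata.items():
        content += datanodes[node][block_id]   # read this block from that node
    return content

print(open_and_read(block_metadata, datanodes))  # b'Hello, HDFS!'
```

The merge works because the metadata lists blocks in file order; the real client does the same thing, just over the network.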
I wrote some content to HDFS as explained in my earlier post Store files in two disks in Hadoop – Storage Reliability. Let’s see how the write operations work inside HDFS.
- The HDFS client contacts the DistributedFileSystem to say, ‘thala, I want to write some content’
- The DistributedFileSystem asks the Name Node for locations to write to. The Name Node keeps track of occupied and free space across HDFS.
- Once the client gets the locations, it starts writing the content.
- Content is written to Data Node 1 first. Data Node 1 forwards it to DN2, and DN2 forwards it to DN3 – a replication pipeline. The default replication factor is 3 for production clusters.
- If all 3 replications are successful, an acknowledgement packet is sent back to the client.
- After the acknowledgement, the client closes the write stream.
- A completed signal is sent to the Name Node.
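The write pipeline above can also be sketched as a toy model. Again, the function and node names are my own for illustration, not the Hadoop API – the point is that the client writes once, the Data Nodes forward the data down the chain, and the acknowledgement comes back only when every replica succeeded:

```python
# Toy model of the HDFS write pipeline described above.
# Node names and the ack check are simplified for illustration.

def write_with_pipeline(content, pipeline, storage):
    """Write content to the first Data Node; each node forwards it to
    the next in the pipeline (DN1 -> DN2 -> DN3). Return True only when
    all replicas are stored, mimicking the acknowledgement packet."""
    for node in pipeline:                       # data flows down the chain
        storage.setdefault(node, []).append(content)
    acks = sum(1 for node in pipeline if content in storage[node])
    return acks == len(pipeline)                # True == ack sent to client

storage = {}
pipeline = ["datanode1", "datanode2", "datanode3"]  # replication factor 3
acked = write_with_pipeline(b"some content", pipeline, storage)
print(acked)  # True: all 3 replicas written, so the client can close the stream
```

After a successful ack, the real client closes the stream and the Name Node is told the file is complete.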
Here I have an important question. What will happen if any one of these steps is violated or fails?!