After setting up the Hadoop ecosystem, I was asked to spend some time (quickly) on the HDFS architecture. Here are some highlights.
What HDFS does –
- Starts up sequentially and lets jobs process the data in parallel
- Distributes data across multiple machines that together form the HDFS cluster
- Logically divides files into fixed-size blocks (the last block of a file may be smaller)
- Spreads the blocks across multiple machines
- Creates replicas of the blocks
- Maintains 3 copies of each block (by default) to ensure high availability (HA)
- Maintains data integrity by computing block checksums
- Works in a master-slave approach
- Provides fault tolerance
- Provides high scalability for storing large volumes of data
- Runs on top of the Unix file system (which provides the data storage)
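To make the block and replica idea concrete, here is a minimal Python sketch. It cuts a file of a given size into 128 MB blocks and assigns each block to 3 data nodes. The round-robin placement and the node names (dn1 to dn4) are my own assumptions for illustration. Real HDFS uses a rack-aware placement policy, which this does not reproduce.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size mentioned in this post
REPLICATION = 3                 # default number of copies per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only holds whatever is left of the file.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes.

    Assumption: simple round-robin, NOT the real HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough data nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(300 * mb)
    for block_id, targets in place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]).items():
        print(block_id, blocks[block_id][1] // mb, "MB ->", targets)
```

A 300 MB file ends up as two full blocks and one smaller block, and losing any single node still leaves two copies of every block.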
I have a question at this point – how should a Hadoop cluster be sized? What’s the best practice?
Hadoop Operation Modes
- Standalone – No daemons are running, and HDFS is not present
- Pseudo-distributed – All daemons are running and HDFS is present, but everything runs on one machine. It may be suitable for SIT or dev
- Fully distributed – This is a real cluster, where admin nodes, name nodes, secondary name nodes and data nodes run on separate machines
Data storage – Blocks & Splits
Files are physically stored as blocks spread across machines. Each block is 128 MB by default.
Splits are logical entities used at read time.
How the data in the blocks is read during the map phase is decided by the Hadoop programmer.
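The difference between physical blocks and logical splits can be sketched in a few lines of Python. The split size formula below (max of a minimum size and the smaller of a maximum size and the block size) follows what I understand Hadoop's FileInputFormat to do. It is a simplification that ignores the extra slack Hadoop allows on the last split, and the function and parameter names are mine.

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Simplified split size: the block size, clamped between min_size and max_size."""
    return max(min_size, min(max_size, block_size))


def logical_splits(file_size, split_size):
    """Return (offset, length) pairs describing the logical splits of a file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    mb = 1024 * 1024
    # With default limits, a split covers one whole 128 MB block.
    print(len(logical_splits(300 * mb, compute_split_size(128 * mb))))
    # Capping the split size at 64 MB gives more, smaller splits over the same blocks.
    print(len(logical_splits(300 * mb, compute_split_size(128 * mb, max_size=64 * mb))))
```

The blocks on disk do not change in the second case. Only the way the file is carved up for reading changes, which is the part the programmer controls.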
HDFS computes checksums to validate block integrity. When a checksum fails, the block is marked as corrupt and scheduled for deletion. A fresh copy of the faulty block is then recreated on another data node from a healthy replica.
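Here is a small Python sketch of how checksum-based corruption detection works in principle. It stores one checksum per 512-byte chunk and compares them again on read. Two assumptions: the 512-byte chunk size matches what I believe is the HDFS default, and I use zlib's CRC32 because it ships with Python, whereas HDFS uses its own CRC variant.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumption: chunk size per checksum


def chunk_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Compute one CRC32 value per chunk of the block data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def find_corrupt_chunks(data, expected, chunk=BYTES_PER_CHECKSUM):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk)
    if len(actual) != len(expected):
        raise ValueError("block length changed; checksum lists differ in size")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    block = bytes(range(256)) * 8          # 2048 bytes, i.e. 4 chunks
    stored = chunk_checksums(block)        # written alongside the block

    damaged = bytearray(block)
    damaged[600] ^= 0xFF                   # flip one byte inside the second chunk
    print(find_corrupt_chunks(bytes(damaged), stored))
```

Once a mismatch like this is found, the replica is treated as bad and a healthy copy is used instead.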
Data and Metadata
Data is stored in blocks. Information about the blocks is stored as metadata. When the namenode starts up, it loads all the metadata into memory (a huge amount of RAM required??)
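To get a feel for the RAM question, here is a rough back-of-the-envelope estimate in Python. It assumes about 150 bytes of namenode memory per namespace object (one per file plus one per block), which is a commonly quoted rule of thumb and not a measured value from our cluster. Treat the result as an order of magnitude only.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024
BYTES_PER_OBJECT = 150  # assumption: rule-of-thumb figure, not a measurement


def estimate_namenode_memory(num_files, avg_file_size,
                             block_size=BLOCK_SIZE,
                             bytes_per_object=BYTES_PER_OBJECT):
    """Rough metadata footprint in bytes: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


if __name__ == "__main__":
    mb = 1024 * 1024
    needed = estimate_namenode_memory(num_files=1_000_000, avg_file_size=200 * mb)
    print(round(needed / mb), "MB of metadata (rough estimate)")
```

The estimate grows with the number of files and blocks, not with the raw data volume, so many small files cost far more namenode memory than a few large ones.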
I have another question here – is there any way to monitor corrupted blocks and their recreation?
As per our installation, the Hadoop conf files are located in /opt/hadoop/etc/hadoop. Let’s see what’s inside hdfs-site.xml.
dfs.name.dir is where the namenode stores the metadata about the data. It is located in /var/lib/hadoop/cache/hadoop/dfs/name.
dfs.data.dir is where the actual data is stored. It is located in /var/lib/hadoop/cache/hadoop/dfs/data as per our configuration.
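The two properties can also be read programmatically. The Python sketch below parses an hdfs-site.xml style document and returns its properties as a dictionary. The embedded XML is a stand-in I wrote from the two values above, not a copy of our actual file, and read_properties is a helper name of my own.

```python
import xml.etree.ElementTree as ET

# Assumption: a minimal stand-in for hdfs-site.xml holding only the two properties above.
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/cache/hadoop/dfs/data</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the configuration."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


if __name__ == "__main__":
    for name, value in read_properties(SAMPLE_HDFS_SITE).items():
        print(name, "=", value)
```

To read the real file, the same function can be given the contents of /opt/hadoop/etc/hadoop/hdfs-site.xml.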
Start the DFS service and open the name node portal at http://gandhari:50070/
Here is the data node detail
data name namesecondary
Let’s browse through the metadata and the actual blocks.
When some of the blocks are corrupted or deleted, they will be recreated. I’m not going to test this at the moment.
I’ll write about some of the extended features in my next blog post.