HDFS – A bird’s eye view

After setting up the Hadoop ecosystem, I have been directed to spend some time (quickly) on the HDFS architecture. Here are some highlights.

Responsibilities

HDFS does the following –

  1. Starts sequentially and processes jobs in parallel
  2. Distributes data across multiple machines to form the HDFS cluster
  3. Logically divides files into equal-sized blocks
  4. Spreads the blocks across multiple machines
  5. Creates replicas of the blocks
  6. Maintains 3 copies of each block (by default) to ensure high availability – see the quick check after this list
  7. Maintains data integrity by computing block checksums
  8. Works in a master-slave fashion
  9. Provides fault tolerance
  10. Provides high scalability to store high volumes of data
  11. Runs on top of the Unix file system (data storage)
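
A quick way to confirm a couple of these defaults (the replication factor and the block size) is the hdfs getconf tool – only a minimal sanity check, assuming the hdfs command is on the PATH:

hdfs getconf -confKey dfs.replication   # default replication factor, 3 unless overridden
hdfs getconf -confKey dfs.blocksize     # default block size in bytes, 134217728 (128 MB) unless overridden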

I have a question at this point – how do we size a Hadoop cluster? What’s the best practice?

Hadoop Operation Modes

  1. Standalone – no daemons are running, and HDFS is not present
  2. Pseudo-distributed – all daemons run on a single machine, and HDFS is present. It may be suitable for dev or SIT – a quick check of the daemons is sketched after this list
  3. Fully distributed – a real cluster where the admin nodes, name nodes and secondary name nodes run on separate machines
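
In pseudo-distributed mode, a quick way to verify which daemons are up is jps. On our single-machine setup I would expect something roughly like this (the process IDs are illustrative):

hadoop@gandhari:~$ jps
2401 NameNode
2534 DataNode
2718 SecondaryNameNode
2950 ResourceManager
3082 NodeManager
3190 Jps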

Data storage – Blocks & Splits

Files are physically stored as blocks spread across multiple machines. Each block is 128 MB by default.

Splits are logical entities used at read time – each mapper processes one split.

[Figure: HDFS blocks and splits]

How the blocks are read as input splits during the mapping phase is decided by the Hadoop programmer.

[Figure: Mapper program with splits and blocks]
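
To see the block size and replication factor of a file already in HDFS, and to override the block size while writing a new file, something like the following should work (the file paths are illustrative, not from our cluster):

hdfs dfs -stat "%o %r %n" /user/hadoop/sample.txt                                # block size, replication factor, file name
hdfs dfs -D dfs.blocksize=67108864 -put sample.txt /user/hadoop/sample-64mb.txt  # write with a 64 MB block size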

Block verification

HDFS computes checksums to validate block integrity. When a checksum verification fails, that block is marked for deletion and a healthy replica is recreated on another data node.

[Figure: HDFS checksum]
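
Checksums are also exposed at the file level. A quick way to see one (the path below is just an illustrative file, not something from our cluster):

hdfs dfs -checksum /user/hadoop/sample.txt   # prints the file checksum derived from the per-block checksums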

Data and Metadata

Data is stored in blocks. The information about the blocks is stored as metadata. When the NameNode starts up, it loads all the metadata into memory (a huge amount of RAM required??)

I have another question here – is there any way to monitor corrupted blocks and their re-replication?
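
I haven’t tried this yet, but the stock tools that look relevant are dfsadmin and fsck; a sketch of what I would run:

hdfs dfsadmin -report                 # cluster summary, including under-replicated and missing blocks
hdfs fsck / -list-corruptfileblocks   # list the files that currently have corrupt blocks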

Inside NameNode

As per our installation, the Hadoop conf files are located in /opt/hadoop/etc/hadoop. Let’s see what’s inside hdfs-site.xml.

dfs.name.dir is where the NameNode stores the metadata about the data. It is located in /var/lib/hadoop/cache/hadoop/dfs/name.

dfs.data.dir is where the actual data (the blocks) is stored. It is located in /var/lib/hadoop/cache/hadoop/dfs/data as per our configuration.
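
For reference, the relevant part of our hdfs-site.xml looks roughly like this (the property names and paths are the ones described above; the rest of the file is omitted):

hadoop@gandhari:/opt/hadoop/etc/hadoop$ cat hdfs-site.xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/cache/hadoop/dfs/data</value>
  </property>
</configuration>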

Start the DFS service and access the NameNode portal at http://gandhari:50070/.
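
A sketch of that step, assuming $HADOOP_HOME/sbin is on the PATH:

start-dfs.sh                           # starts the NameNode, DataNode(s) and SecondaryNameNode
curl -s http://gandhari:50070/ | head  # the NameNode web UI should answer on port 50070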

[Screenshot: NameNode web UI]

Here are the data node details.

[Screenshot: DataNode details]

[Screenshot: DataNode – browse directory]

hadoop@gandhari:/var/lib/hadoop/cache/hadoop/dfs$ ls
data  name  namesecondary

[Screenshot: DataNode details]

Let’s browse through meta data and actual blocks.

[Screenshot: DFS metadata]

[Screenshot: DFS blocks]

[Screenshot: DFS blocks are readable]
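
Since blocks are just ordinary files on the underlying file system, they can also be located and read directly from the DataNode’s data directory. A rough sketch – the directory layout below dfs/data depends on the Hadoop version, so treat the find results as illustrative:

find /var/lib/hadoop/cache/hadoop/dfs/data -name 'blk_*' ! -name '*.meta' | head -3          # list a few block files (excluding the .meta checksum files)
cat "$(find /var/lib/hadoop/cache/hadoop/dfs/data -name 'blk_*' ! -name '*.meta' | head -1)"  # a block of a plain-text file is readable as-is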

When some of the blocks are corrupted or deleted, they will be re-replicated automatically. I’m not going to test that at the moment.

I’ll write about some of the extended features in my next blog post.
