Filesystem closed

Hi hadoopers,

Here is the exception that tripped me up on Saturday night and failed my Mapper task.

  • The Mapper reads the input lines one by one and tokenizes them.
  • The last token contains the path of a file in HDFS.
  • I need to open that file and read its contents.

For the above task, this is the flow I followed in the Mapper.


However, my mapper failed with the following exception.

org.apache.hadoop.mapred.MapTask: Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader@1cb3ec38 Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(
at org.apache.hadoop.hdfs.DFSInputStream.close(

The FileSystem object is supposed to be global — FileSystem.get() returns a shared, cached instance. When I closed the filesystem, the Mapper's own input was also closed, which broke the complete flow. So I closed only the file's input stream and did not close the filesystem explicitly, which resolved the problem.
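Here is a minimal sketch of the corrected pattern (the class and field names are my own illustration, not the original mapper): obtain the shared FileSystem, close each input stream after reading, and never call fs.close().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PathReadingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // the last token of the input line is an HDFS path
        String[] tokens = value.toString().split("\t");
        Path path = new Path(tokens[tokens.length - 1]);

        // FileSystem.get() returns a shared, cached instance
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                context.write(new Text(path.toString()), new Text(line));
            }
        } finally {
            br.close(); // close only the stream,
                        // NOT fs.close(), which would also kill the Mapper's own input
        }
    }
}
```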



MapReduce Job Execution Process – Job Cleanup

Hi Hadoopers,

So we are looking at the 7th circle today – the job cleanup.



An MR job writes many intermediate results and junk files during its operation. Once the job is completed, these files would occupy space on HDFS with no further benefit. Hence the cleanup task is launched.


  1. The Job Tracker informs all the task trackers to perform the cleanup.
  2. Each task tracker cleans up its work folders.
  3. They clean up the temporary directory.
  4. Once the cleanup task is successful, the Task Tracker ends the job by writing the _SUCCESS file.


MapReduce Job Execution Process – Reduce Function

Hi Hadoopers,

We are in the 6th circle today, which is the reducer function. The job was submitted by the user, initiated in the 2nd circle, and its setup was completed in the 3rd circle.

The Map Task was executed in the 4th circle, and sort & shuffle was completed in the 5th circle.


The reducer will collect the output from all the mappers to apply the user-defined reduce function.


  1. The task tracker launches the reduce task.
  2. The reduce task (not the reduce function) reads the jar and xml of the job.
  3. It executes the shuffle. By the time the reduce task starts, not all the mappers may have completed their work, so it goes to the individual mapper machines to collect their output and shuffles it.
  4. Once all the mapping activity is finished, it invokes the user-defined reduce function (one or more reducers).
  5. Each reducer completes its job and writes the output records to HDFS.
  6. The output is stored in a temporary output file first.
  7. Once all the reducers have completed their job, the final output is written to the reducer partition file.
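As a concrete illustration of the user-defined reduce function invoked in step 4, here is a minimal word-count style reducer (the class name and types are my own example, not taken from this job):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // the shuffle has already grouped all values belonging to this key
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // this write lands in the temporary output file first (step 6)
        context.write(key, new IntWritable(sum));
    }
}
```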

MapReduce Job Execution Process – Map Task Execution

Hi Hadoopers,

The user has submitted his job. He has permissions. We have slots in the cluster. Job setup is completed. We now look at the 4th circle given below – the Map Task Execution.



The below given diagram depicts the Map Task Execution.


  1. The task tracker launches the Map Task.
  2. The Map task reads the jar file given by the user. This is what we write in Eclipse; in the entire framework, this is our contribution 🙂
    The Map task also reads the job config (input path, output path etc.). It gets everything from HDFS, as all of these were uploaded to HDFS initially.
  3. The Map task reads the input splits from HDFS.
  4. From the input splits, the Map task creates the records.
  5. The Map task invokes the user Mapper with each record.
  6. The Mapper writes intermediate output.
  7. The task sorts the output based on key and flushes it to disk.
  8. The Map task informs the Task Tracker about the completion of the job.
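For reference, the "user Mapper" invoked in step 5 is the part we write ourselves. A minimal word-count style sketch (the names here are illustrative, not from a specific job):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // each call receives one record created from the input split (step 4)
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            // intermediate output (step 6); the framework sorts and spills it (step 7)
            context.write(new Text(st.nextToken()), ONE);
        }
    }
}
```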

MapReduce Job Execution Process – Job Submission

Hi Hadoopers,

After publishing many posts about MapReduce code, we’ll now look at the MR internals: how an MR job is submitted and executed.


This post talks about first circle – Job Submission.

We compiled the MR code and the jar is ready. We execute the job with hadoop jar xxxxxx. First the job is submitted to Hadoop. There are schedulers which run the job based on cluster capacity and availability.

I want to scribble down quick notes on Job Submission using the below given Gantt diagram.


  1. The user submits the job to the Job Client.
  2. The Job Client talks to the Job Tracker to get a job id.
  3. The Job Client creates a staging directory in HDFS. This is where all the files related to the job get uploaded.
  4. The MR code and configurations are uploaded to the staging directory, with 10 replicas of their blocks. The jar file of the job, the job splits, the split metadata and job.xml, which holds the job description, are all uploaded.
  5. Splits are computed automatically and the input is read.
  6. The split metadata is uploaded to HDFS.
  7. The job is submitted and is ready to execute.

Lab12 – Writing local file to HDFS

Hi Hadoopers,

You might have seen my earlier post on how to read from HDFS using the APIs. Here is a post that shows how to write to HDFS.

Input file in HDD – /opt/hadoop/feed/output/2016-09-18
HDFS file – /user/hadoop/feed/2016-09-18


The following code needs two inputs: one is the local file and the other is the HDFS file.

package org.grassfield.hadoop.input;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * upload local file to HDFS
 * @author pandian
 */
public class LoadItemsHdfs {

    /**
     * @param args localFile remoteFile
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Path path = new Path("hdfs://gandhari:9000" + args[1]);
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fs.create(path, true)));
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        String line = null;
        // copy the local file to HDFS line by line
        while ((line = br.readLine()) != null) {
            bw.write(line);
            bw.newLine();
        }
        br.close();
        bw.close();
    }
}


Let’s execute it.

hadoop@gandhari:~/jars$ hadoop jar FeedCategoryCount-9.jar org.grassfield.hadoop.input.LoadItemsHdfs ../feed/output/2016-09-18 /user/hadoop/feed/2016-09-18
hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/feed
Found 1 items
-rw-r--r--   3 hadoop supergroup     120817 2016-09-18 06:51 /user/hadoop/feed/2016-09-18
hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/feed/2016-09-18

application/rss+xml     Today Online - Hot news null null    Rosberg in pole position to claim victory on Sunday
 Today       Sat Sep 17 23:44:59 MYT 2016    []
 application/rss+xml     Today Online - Hot news null null    No slowing Tang down despite qualifying setback Today  Sat Sep 17 22:09:20 MYT 2016    []

HDFS Permissions

The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model. Each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.

This assignment will create a new user and assign him a folder in HDFS to demonstrate the permission capabilities.


Add a Unix user

hadoop@gandhari:~$ sudo groupadd feeder
hadoop@gandhari:~$ sudo useradd -g feeder -m feeder
hadoop@gandhari:~$ sudo passwd feeder

Create a folder in HDFS and assign permissions

hadoop@gandhari:~$ hadoop fs -mkdir /feeder
hadoop@gandhari:~$ hadoop fs -chown -R feeder:feeder /feeder
hadoop@gandhari:~$ hadoop fs -ls /
Found 6 items
-rw-r--r--   1 hadoop supergroup       1749 2016-08-24 06:01 /data
drwxr-xr-x   - feeder feeder              0 2016-09-05 15:34 /feeder
drwxr-xr-x   - hadoop supergroup          0 2016-09-05 15:15 /hbase
drwxr-xr-x   - hadoop supergroup          0 2016-08-24 13:53 /pigdata
drwxrwx---   - hadoop supergroup          0 2016-08-24 16:14 /tmp
drwxr-xr-x   - hadoop supergroup          0 2016-08-24 13:56 /user

We need to enable the permissions in hdfs-site.xml

hadoop@gandhari:~$ vi etc/hadoop/hdfs-site.xml
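The property to enable is dfs.permissions.enabled (in older Hadoop releases it was called dfs.permissions). A sketch of the relevant fragment:

```xml
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
```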

After this change, we need to restart the dfs daemon.


Let’s test the permissions using another user, kannan, who does not have write permission to /data/feeder.

kannan@gandhari:~$ /opt/hadoop/bin/hadoop fs -put javashine.xml /data/feeder
put: Permission denied: user=kannan, access=EXECUTE, inode="/data":hadoop:supergroup:-rw-r--r--

See you in another interesting post!

Anatomy of Read & Write in HDFS


I have drafted some basic commands to read the content from HDFS in my earlier post Basic HDFS commands – Demo


How does it happen internally? How does my CLI HDFS client know where the block is? Here is the explanation given by my instructor.

[Diagram: HDFS Read Anatomy]

  1. The HDFS client asks the DFS to open a file.
  2. The DFS goes to the name node to get the block metadata.
  3. We need to read 2 blocks – one from data node 1 and another from data node 3.
  4. An FSDataInputStream is opened to read the block at data node 1.
  5. In parallel, an FSDataInputStream is opened to read the block at data node 3.
  6. Once the blocks are read and merged, the stream is closed.

I have written some contents to HDFS as explained in my earlier post Store files in two disks in Hadoop – Storage Reliability. Let’s see how the write operations work inside HDFS.

[Diagram: HDFS Write Anatomy]

  1. The HDFS client contacts the filesystem to say, ‘Boss, I want to write some content’.
  2. HDFS gets the location to write from the name node, which keeps an account of occupied and free space in HDFS.
  3. Once the client gets the location, it starts writing the content.
  4. The content is written to Data Node 1 first. Data node 1 replicates it to DN2, and DN2 replicates to DN3. The default replication factor is 3 for production clusters.
  5. If all 3 replications are successful, an acknowledgement packet is sent to the client.
  6. After the acknowledgement, the client closes the write stream.
  7. A completed signal is sent to the Name node.

Here I have an important question: what will happen if any step of this flow is violated or fails?!

Store files in two disks in Hadoop – Storage Reliability

Hi BigDs,

As I’m using a single node cluster, I’m trying to store the copy on two different disks in my server, as directed by my instructor. Let’s see how.

Initially, the following is my data directory configured in hdfs-site.xml.


I’ll change it as below.
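Assuming the original configuration pointed at the single directory /var/lib/hadoop/cache/hadoop/dfs/data, the changed property (dfs.datanode.data.dir in Hadoop 2.x; dfs.data.dir in 1.x) would list both directories, comma-separated:

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/var/lib/hadoop/cache/hadoop/dfs/data,/var/lib/hadoop/cache/hadoop/dfs/data2</value>
</property>
```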


hadoop@gandhari:/var/lib/hadoop/cache/hadoop/dfs$ sudo mkdir data2

hadoop@gandhari:/var/lib/hadoop/cache/hadoop/dfs$ sudo chown hadoop:hadoop data2/

hadoop@gandhari:/var/lib/hadoop/cache/hadoop/dfs$ ls -alt
total 24
drwx------ 3 hadoop hadoop 4096 Sep  2 15:20 data
drwxr-xr-x 3 hadoop hadoop 4096 Sep  2 15:20 name
drwxr-xr-x 6 root   root   4096 Sep  2 15:19 .
drwxr-xr-x 2 hadoop hadoop 4096 Sep  2 15:19 data2
drwxr-xr-x 3 hadoop hadoop 4096 Aug 26 20:16 namesecondary
drwxr-xr-x 4 root   root   4096 Aug 20 09:52 ..

Stop and start the dfs. We can see the data2 folder is being populated.

hadoop@gandhari:/var/lib/hadoop/cache/hadoop/dfs$ ls /var/lib/hadoop/cache/hadoop/dfs/data2
current  in_use.lock

hadoop@gandhari:/var/lib/hadoop/cache/hadoop/dfs$ ls /var/lib/hadoop/cache/hadoop/dfs/data2/current/
BP-419586781-  VERSION

The data node log shows it is scanning two folders now.

2016-09-02 15:29:48,346 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-419586781- on volume /var/lib/hadoop/cache/hadoop/dfs/data/current…

2016-09-02 15:29:48,407 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-419586781- on volume /var/lib/hadoop/cache/hadoop/dfs/data2/current..


HDFS Replication – a quick demo

Hi BigD,

I want to share with you a quick demo of replication. Let’s open the DFS health page first.

[Screenshot: DFS health page showing replication]

3 is the default replication factor for production clusters

So the content will be replicated 3 times. This can be overridden in /opt/hadoop/etc/hadoop/hdfs-site.xml as given below.
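A sketch of the override in hdfs-site.xml (the dfs.replication property), here setting a replication factor of 1 for a single-node cluster:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```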


Let’s set it to 1 using the CLI.

hadoop@gandhari:~$ hadoop dfs -setrep -w 1 /data
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Replication 1 set: /data
Waiting for /data ... done
[Screenshot: DFS health page after changing the replication factor]

After changing the replication factor