Lab 20: Sequential File Creation

Hi hadoopers,

I have been told that SequenceFiles can be built from the many small chunks of files placed in HDFS. I have a lot of such files in my Feed analytics project, so I hope this will help me free up the considerable HDFS space currently blocked by small HTML files.


So, this program accepts a directory as input. All files inside the directory will be packed into a single SequenceFile.

This folder is my input.

$ hadoop fs -ls /user/hadoop/lab20/input
Found 2 items
-rw-r--r--   3 hadoop supergroup         79 2016-10-08 19:32 /user/hadoop/lab20/input/employee.csv
-rw-r--r--   3 hadoop supergroup         36 2016-10-08 19:32 /user/hadoop/lab20/input/salary.csv

Here is the mapper. It receives each line of the input files with its byte offset as the key, converts the offset to Text and emits the line unchanged. Note that the original file names are not preserved by this simple version; the key is just the line's offset within its file.

package org.grassfield.nandu.etl;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeqFileMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Emit the byte offset (as Text) and the line itself.
        context.write(new Text(key.toString()), value);
    }
}

Here is the reducer. It is an identity reducer: every value arriving for a key is written straight to the output.

package org.grassfield.nandu.etl;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SeqFileReducer
        extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values,
            Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Pass every (key, value) pair through unchanged.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}

Here is the driver. Besides wiring up the mapper and reducer, it sets SequenceFileOutputFormat as the output format, which is what makes the job write its output as a SequenceFile.

package org.grassfield.nandu.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SeqFileJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("Sequential File Job");
        job.setJarByClass(SeqFileJob.class);

        job.setMapperClass(SeqFileMapper.class);
        job.setReducerClass(SeqFileReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Plain text in, SequenceFile out.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SeqFileJob(), args));
    }
}
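Since the whole point is to save space, it may also be worth compressing the SequenceFile. This is optional and was not enabled for the run below; here is a minimal sketch of the extra lines that could go into run() before the job is submitted, assuming DefaultCodec is acceptable (any codec installed on the cluster would do):

        // Optional: block-compress the SequenceFile output to save more space.
        // Requires: import org.apache.hadoop.io.SequenceFile;
        //           import org.apache.hadoop.io.compress.DefaultCodec;
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);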



Let’s execute it.

$ hadoop jar FeedCategoryCount-20.jar org.grassfield.nandu.etl.SeqFileJob /user/hadoop/lab20/input /user/hadoop/lab20/02

$ hadoop fs -ls /user/hadoop/lab20/02
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-10-08 19:36 /user/hadoop/lab20/02/_SUCCESS
-rw-r--r--   3 hadoop supergroup        256 2016-10-08 19:36 /user/hadoop/lab20/02/part-r-00000
And here is the output! A SequenceFile is a binary container, so hadoop fs -cat dumps its header and records as unreadable bytes:

$ hadoop fs -cat /user/hadoop/lab20/02/part-r-00000
㜄▒Ӛ▒▒▒▒#▒▒▒▒#
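
To see the actual key/value pairs, use hadoop fs -text instead of -cat; it recognizes SequenceFiles and prints their records as text:

$ hadoop fs -text /user/hadoop/lab20/02/part-r-00000

The same check can be done programmatically. Below is a minimal sketch using the SequenceFile.Reader API; SeqFileDump is a hypothetical helper class, not part of the lab, and it assumes the Hadoop client configuration on the classpath points at the cluster:

package org.grassfield.nandu.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper: dumps the SequenceFile given as args[0].
public class SeqFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
        try {
            // Instantiate key/value holders of whatever Writable
            // types the file header declares (Text/Text here).
            Writable key = (Writable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}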
