Lab 20: Sequential File Creation

Hi hadoopers,

I have been told that sequence files can be created from the many small chunks of files placed in HDFS. I have a lot of such files in my Feed analytics project, so I hope this will help me free up the considerable space in HDFS currently blocked by small HTML files.


So, this program accepts a directory as input. All files inside the directory will be written into a single sequence file.
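To see why packing helps, here is a minimal plain-JDK sketch of the idea (no Hadoop API involved; the class and method names are mine, for illustration only): many small payloads are appended to one container as length-prefixed (name, bytes) records, which is essentially what a sequence file does with its key/value pairs.

```java
import java.io.*;
import java.util.*;

// Conceptual sketch: pack several small "files" into one container
// as length-prefixed (name, bytes) records -- the same idea a
// SequenceFile uses to relieve the NameNode from tracking many
// tiny files.
public class SmallFilePacker {

    // Append one record: UTF name, then payload length, then payload bytes.
    public static void pack(DataOutputStream out, String name, byte[] data)
            throws IOException {
        out.writeUTF(name);
        out.writeInt(data.length);
        out.write(data);
    }

    // Read all records back into a name -> bytes map.
    public static Map<String, byte[]> unpack(byte[] container)
            throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(container));
        while (in.available() > 0) {
            String name = in.readUTF();
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            files.put(name, data);
        }
        return files;
    }
}
```

One container holding thousands of such records occupies far fewer HDFS blocks (and NameNode entries) than thousands of individual small files.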

This folder is my input.

$ hadoop fs -ls /user/hadoop/lab20/input
Found 2 items
-rw-r--r--   3 hadoop supergroup         79 2016-10-08 19:32 /user/hadoop/lab20/input/employee.csv
-rw-r--r--   3 hadoop supergroup         36 2016-10-08 19:32 /user/hadoop/lab20/input/salary.csv

Here is the mapper

package org.grassfield.nandu.etl;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeqFileMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // The input key is the byte offset of the line within its file;
        // emit it as Text so it becomes the sequence file's key.
        context.write(new Text(key.toString()), value);
    }
}
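For reference, TextInputFormat hands this mapper the byte offset of each line as its key. A quick plain-JDK simulation of those (offset, line) records, assuming newline-terminated UTF-8 input (the helper name is mine, not a Hadoop API):

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class OffsetRecords {

    // Produce the (byte offset, line) pairs TextInputFormat would emit
    // for newline-terminated file contents.
    public static List<String[]> records(String contents) {
        List<String[]> out = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n")) {
            out.add(new String[]{Long.toString(offset), line});
            // Advance past the line plus its trailing '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }
}
```

So a two-line salary.csv yields the pairs ("0", first line) and ("9", second line), and those offsets are what end up as keys in the sequence file.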

Here is the reducer

package org.grassfield.nandu.etl;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SeqFileReducer
        extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values,
            Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Pass every grouped line through; SequenceFileOutputFormat
        // serializes these (key, value) pairs into the sequence file.
        for (Text value : values) {
            context.write(key, value);
        }
    }

}
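Because the map key is the byte offset emitted as Text, lines starting at the same offset in different input files share a key and arrive in the same reduce call, and keys sort lexicographically as strings rather than numerically. That is why lines from employee.csv and salary.csv interleave in the final output. A plain-JDK sketch of that shuffle grouping (no Hadoop needed; the sample pairs are made up for illustration):

```java
import java.util.*;

public class ShuffleSketch {

    // Group (key, value) pairs the way the shuffle does before reduce():
    // values with equal keys land in one list, and keys are kept sorted
    // (lexicographically, since Text keys compare as strings).
    public static TreeMap<String, List<String>> group(List<String[]> pairs) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(kv[1]);
        }
        return grouped;
    }
}
```

Note the string ordering: key "15" sorts before key "9", which matches the jumbled-looking but deterministic ordering in the output below.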

Here is the Driver

package org.grassfield.nandu.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class SeqFileJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Job(Configuration) is deprecated; Job.getInstance is the
        // preferred factory method.
        Job job = Job.getInstance(getConf());
        job.setJarByClass(this.getClass());
        job.setJobName("Sequential File Job");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SeqFileMapper.class);
        job.setReducerClass(SeqFileReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new SeqFileJob(), args);
        System.exit(exitCode);
    }

}

Let’s execute it.

$ hadoop jar FeedCategoryCount-20.jar org.grassfield.nandu.etl.SeqFileJob /user/hadoop/lab20/input /user/hadoop/lab20/02

$ hadoop fs -ls /user/hadoop/lab20/02
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-10-08 19:36 /user/hadoop/lab20/02/_SUCCESS
-rw-r--r--   3 hadoop supergroup        256 2016-10-08 19:36 /user/hadoop/lab20/02/part-r-00000
And here is the output! A sequence file is a binary format, so hadoop fs -cat shows the SEQ magic header with the key and value class names, followed by raw bytes; hadoop fs -text would decode it into readable key/value pairs instead.

$ hadoop fs -cat /user/hadoop/lab20/02/part-r-00000
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text㜄▒Ӛ▒▒▒▒#▒▒▒▒#
                                                                   101,200020/101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati
1101,4000
2102,3000"48102,Bheema,Pandu,Kunti,Hidimbi
                                          102,1500