Lab 09 – Identity Mapper & Reducer

The identity mapper is what Hadoop runs by default when we don't specify a mapper class. It simply copies each input record to the output unchanged, the byte-offset key along with the line value. This exercise shows how to write such an identity mapper ourselves.

Mapper

The mapper simply reads each input key/value pair and writes it straight back out.

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;

/**
 * Identity Mapper and Reducer, like the identity function in mathematics,
 * do not transform the input; they return it as-is in the output.
 * The Identity Mapper takes the input key/value pair and spits it out
 * without any processing.
 * 
 * @author pandian
 *
 */
public class IdentityMapper extends Mapper<Object, Object, Object, Object> {

    @Override
    protected void map(Object key, Object value, Mapper<Object, Object, Object, Object>.Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }

}
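
Since the driver below uses TextInputFormat, the concrete types flowing through this mapper are a LongWritable byte offset and a Text line. For illustration only, a typed variant might look like the following sketch (the class name TypedIdentityMapper is ours, not part of the lab):

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TypedIdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset where the line starts, value is the line itself;
        // both are emitted without any change
        context.write(key, value);
    }

}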

Driver

The driver class looks like this.

package org.grassfield.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Driver for Identity Mapper
 * @author pandian
 *
 */
public class IdentityMapperDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        GenericOptionsParser gop = new GenericOptionsParser(conf, args);
        args=gop.getRemainingArgs();
        
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        
        Job job = new Job (conf, "IdentityMapper");
        job.setJarByClass(this.getClass());
        
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        job.setNumReduceTasks(0);
        
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        
        job.setMapperClass(IdentityMapper.class);
        job.waitForCompletion(true);
        return 0;
    }
    
    public static void main(String [] args) throws Exception{
        System.exit(ToolRunner.run(new Configuration(), new IdentityMapperDriver(), args));
    }

}

Execution

Let’s jar this and execute it now.

I’ll use /user/hadoop/lab07/2016-09-12 as the input file; the output will be written to /user/hadoop/lab09.

Execute it.

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop jar IdentityMR-0.0.1-SNAPSHOT.jar org.grassfield.hadoop.IdentityMapperDriver /user/hadoop/lab07/2016-09-12 /user/hadoop/lab09/

The lab09 directory is now populated with the mapper output.

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -ls /user/hadoop/lab09
 Found 2 items
 -rw-r--r--   3 hadoop supergroup          0 2016-09-16 05:29 /user/hadoop/lab09/_SUCCESS
 -rw-r--r--   3 hadoop supergroup     546863 2016-09-16 05:29 /user/hadoop/lab09/part-m-00000

And here is the content. Note the first token on each line: it is the byte offset at which that input line starts. For example, the file’s first line is the 10-character date 2016-09-12; with its newline that is 11 bytes, so the next record carries key 11, as seen below.

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab09/part-m-00000
 11      application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    SBF leads first business mission to Myanmar under new government        Today   http://www.todayonline.com/business/sbf-leads-first-business-mission-myanmar-under-new-government       Mon Sep 12 07:00:00 MYT 2016    []
 299     application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    TODAY's morning briefing for Sept 12    Today   http://www.todayonline.com/singapore/todays-morning-briefing-sept-12    Mon Sep 12 01:13:53 MYT 2016    []
 530     application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    Society ‘far from ready’ to do away with race markers        Today   http://www.todayonline.com/singapore/society-far-ready-do-away-race-markers     Mon Sep 12 00:35:04 MYT 2016    []

Using Reducer

The identity reducer doesn’t do anything special either: it writes out whatever it receives. Because the framework sorts the map output by key before the reduce phase, the reducer’s output comes out sorted. Let’s see how to modify the code above to use a reducer.

Note: if you don’t want a reduce phase at all, you must call job.setNumReduceTasks(0), as shown in the program above. Merely skipping setReducerClass() does not mean no reducer will run: with one or more reduce tasks and no reducer class set, the framework’s default (identity) reducer runs, as sketched below.
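
If you are curious what that fallback does, here is a behavioural sketch. The class name DefaultLikeReducer is ours and this is not the framework source, but the effect is the same, and it deliberately has the same shape as the IdentityReducer we write below:

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: behaves like the framework's fallback reducer, which simply
// replays every (key, value) pair of each group to the output unchanged.
public class DefaultLikeReducer<K, V> extends Reducer<K, V, K, V> {

    @Override
    protected void reduce(K key, Iterable<V> values, Context context)
            throws IOException, InterruptedException {
        for (V value : values) {
            context.write(key, value);
        }
    }

}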

Mapper

We don’t need a change to the mapper program.

Reducer

Let’s write a reducer that accepts its input and writes it to the output without any transformation.

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Reducer;

/**
 * Identity Reducer, just like the identity function in mathematics:
 * it does not transform the input. It writes each key/value pair of
 * every group back out without any processing.
 * 
 * @author pandian
 *
 */
public class IdentityReducer extends Reducer<Object, Object, Object, Object> {

    @Override
    protected void reduce(Object key, Iterable<Object> values, Reducer<Object, Object, Object, Object>.Context context)
            throws IOException, InterruptedException {
        for (Object value : values) {
            context.write(key, value);
        }
    }

}

Driver

Coming back to the driver to make the changes needed to include the reducer: the number of reduce tasks is set to 1 and the reducer class is registered with setReducerClass(IdentityReducer.class).

package org.grassfield.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Driver for Identity Mapper
 * @author pandian
 *
 */
public class IdentityMapperDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        GenericOptionsParser gop = new GenericOptionsParser(conf, args);
        args=gop.getRemainingArgs();
        
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        
        Job job = new Job (conf, "IdentityMapper");
        job.setJarByClass(this.getClass());
        
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        job.setNumReduceTasks(1);
        
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.waitForCompletion(true);
        return 0;
    }
    
    public static void main(String [] args) throws Exception{
        System.exit(ToolRunner.run(new Configuration(), new IdentityMapperDriver(), args));
    }

}

Execution

Let’s execute the program now.

hadoop@gandhari:~/jars$ hadoop jar IdentityMR-2.jar org.grassfield.hadoop.IdentityMapperDriver /user/hadoop/lab07/2016-09-12 /user/hadoop/lab09/19

16/09/17 07:35:48 INFO mapreduce.Job: Counters: 38
        File System Counters
                FILE: Number of bytes read=1126302
                FILE: Number of bytes written=2194526
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1070278
                HDFS: Number of bytes written=546863
                HDFS: Number of read operations=13
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Map-Reduce Framework
                Map input records=1726
                Map output records=1726
                Map output bytes=551840
                Map output materialized bytes=558208
                Input split bytes=114
                Combine input records=0
                Combine output records=0
                Reduce input groups=1726
                Reduce shuffle bytes=558208
                Reduce input records=1726
                Reduce output records=1726
                Spilled Records=3452
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=0
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=942669824
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=535139
        File Output Format Counters
                Bytes Written=546863


hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/lab09/19
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-17 07:35 /user/hadoop/lab09/19/_SUCCESS
-rw-r--r--   3 hadoop supergroup     546863 2016-09-17 07:35 /user/hadoop/lab09/19/part-r-00000

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab09/19/part-r-00000

0       2016-09-12
11      application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    SBF leads first business mission to Myanmar under new government        Today   http://www.todayonline.com/business/sbf-leads-first-business-mission-myanmar-under-new-government       Mon Sep 12 07:00:00 MYT 2016    []
299     application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    TODAY's morning briefing for Sept 12    Today   http://www.todayonline.com/singapore/todays-morning-briefing-sept-12    Mon Sep 12 01:13:53 MYT 2016    []
530     application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    Society ‘far from ready’ to do away with race markers        Today   http://www.todayonline.com/singapore/society-far-ready-do-away-race-markers     Mon Sep 12 00:35:04 MYT 2016    []

The line starting with 0 did not appear in this orderly position during our mapping phase. In the shuffle the records were sorted by key (the LongWritable offsets), so it has been brought to the front of the reducer output.

Let’s modify our reducer to write only the values, dropping the offset keys. Only the reduce() method changes, as given below (NullWritable is imported from org.apache.hadoop.io):

    @Override
    protected void reduce(Object key, Iterable<Object> values, Reducer<Object, Object, Object, Object>.Context context)
            throws IOException, InterruptedException {
        for (Object value : values) {
            // the text value becomes the output key; TextOutputFormat skips the
            // NullWritable, so only the text is written to the file
            context.write(value, NullWritable.get());
        }
    }
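
Pieced together, the full class after this change would look roughly like the following sketch, keeping the IdentityReducer name used in the driver:

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentityReducer extends Reducer<Object, Object, Object, Object> {

    @Override
    protected void reduce(Object key, Iterable<Object> values, Reducer<Object, Object, Object, Object>.Context context)
            throws IOException, InterruptedException {
        for (Object value : values) {
            // drop the offset key: emit the value alone, paired with a NullWritable
            context.write(value, NullWritable.get());
        }
    }

}

No driver change should be needed here: TextOutputFormat writes the key and skips a NullWritable value, which matches the output shown below.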

Let’s execute it now.

hadoop@gandhari:~/jars$ hadoop jar IdentityMR-3.jar org.grassfield.hadoop.IdentityMapperDriver /user/hadoop/lab07/2016-09-12 /user/hadoop/lab09/20

hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/lab09/20
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-17 07:58 /user/hadoop/lab09/20/_SUCCESS
-rw-r--r--   3 hadoop supergroup     535139 2016-09-17 07:58 /user/hadoop/lab09/20/part-r-00000

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab09/20/part-r-00000|more

2016-09-12
application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    SBF leads first business mission to Myanmar under new government        Today   http://www.todayonline.com/business/sbf-leads-first-business-mission-myanmar-under-new-government       Mon Sep 12 07:00:00 MYT 2016    []
application/rss+xml     Singapore       null    http://www.todayonline.com/taxonomy/term/3/all  null    TODAY's morning briefing for Sept 12    Today   http://www.todayonline.com/singapore/todays-morning-briefing-sept-12    Mon Sep 12 01:13:53 MYT 2016    []

Have a good week, hadoopers!
