Lab 05 – A Hadoop Combiner demo

Dear Hadoopers,

Let’s fine-tune our exercise by adding a Combiner, which performs a map-side mini-reduce on each mapper’s output before the shuffle, cutting down the data sent over the network to the reducer. I recommend going through the previous exercise (Lab 04) first to follow this post.


Combiner

I’m adding a combiner class, FeedCategoryCombiner, by extending the Reducer class. Its content is the same as that of the reducer shown in Lab 04.

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * This is the new Combiner class introduced for lab exercise #05.
 * It performs a local, map-side aggregation of the per-category
 * counts before they are shuffled to the reducer.
 * @author pandian
 */
public class FeedCategoryCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum the partial counts emitted by the mapper for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
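
Since the combiner’s content is identical to the reducer’s, you could skip the extra class and register FeedCategoryReducer itself as the combiner; that is safe here because summing counts is associative and commutative. A minimal sketch of that alternative in the driver:

job.setCombinerClass(FeedCategoryReducer.class);

I kept a separate class in this lab to make the combiner’s role explicit. Either way, remember that Hadoop may invoke the combiner zero, one, or many times per map output, so it must never change the meaning of the data it forwards.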

Changes to the Driver program

We need to register the combiner with the job in our driver. The change is marked with a comment below.

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Job job = new Job(conf, "Feed Category Count");
        job.setJarByClass(getClass());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setCombinerClass(FeedCategoryCombiner.class);    // new in Lab 05
        job.setReducerClass(FeedCategoryReducer.class);
        job.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.setMapperClass(FeedCategoryCountMapper.class);
        job.waitForCompletion(true);
        return 0;
    }
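
Before shipping the jar to the cluster, the combiner can be sanity-checked locally. Here is a minimal sketch of such a unit test, assuming MRUnit and JUnit are on the test classpath (the test class itself is hypothetical, not part of the lab code):

package org.grassfield.hadoop;

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class FeedCategoryCombinerTest {

    @Test
    public void testCombinerSumsPartialCounts() throws Exception {
        // Two partial counts for one category should be merged
        // into a single summed record.
        ReduceDriver.newReduceDriver(new FeedCategoryCombiner())
                .withInput(new Text("Big Data"),
                        Arrays.asList(new IntWritable(4), new IntWritable(6)))
                .withOutput(new Text("Big Data"), new IntWritable(10))
                .runTest();
    }
}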

Execution


I built the project as a Maven build with the clean install goals and SCPed the jar file to the Hadoop server.

Okay, let’s execute it.

hadoop@gandhari:~/jars$ hadoop jar FeedCategoryCount-2.0.jar org.grassfield.hadoop.FeedCategoryCountDriver /user/hadoop/lab03/feed /user/hadoop/output/lab05

Map-Reduce Framework
                Map input records=793
                Map output records=45
                Map output bytes=647
                Map output materialized bytes=515
                Input split bytes=108
                Combine input records=45
                Combine output records=28
                Reduce input groups=28
                Reduce shuffle bytes=515
                Reduce input records=28
                Reduce output records=28
                Spilled Records=56
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=0
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1189085184

The counters confirm the combiner did its job: the 45 map output records were combined down to 28 before the shuffle (Combine input records=45, Combine output records=28), so the reducer only had to fetch and process 28 records. What about my output folder?

hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/output/lab05
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-10 17:34 /user/hadoop/output/lab05/_SUCCESS
-rw-r--r--   3 hadoop supergroup        386 2016-09-10 17:34 /user/hadoop/output/lab05/part-r-00000

Here is the aggregated output.

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/output/lab05/part-r-00000
Application Master      1
BDFS split      1
Big Data        10
Flume   1
HBase   1
HDFS    6
HDFS block      2
HDFS commands   2
HDFS permissions        1
HDFS replication        1
Hadoop  2
Hive    1
Hue     1
Job Tracker     1
Map Reduce      1
MapReduce       1
Oozie   1
Pig     1
Resource Manager        1
Task Container  1
Task Tracker    1
YARN    1
ZooKeeper       1
ZooKeeperFailoverController     1
hadoop federation       1
hadoop high availability        1
hadoop rack-aware       1
sqoop   1

See you in another interesting post.
