Lab 05 – A Hadoop combiner demo

Dear Hadoopers,

Let’s fine-tune our exercise by adding a combiner. Reading the previous exercise (Lab 04) first is recommended to follow this post.



I’m adding a combiner class, FeedCategoryCombiner, which extends the Reducer class. Its content is the same as that of the reducer shown in Lab 04.

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * This is the new Combiner class introduced for lab exercise #05
 * @author pandian
 */
public class FeedCategoryCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Add up the partial counts emitted for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
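
To see why a combiner that reuses the reducer's summing logic is safe, here is a minimal plain-Java sketch that does not use Hadoop at all. It assumes two hypothetical map tasks that emit (category, 1) pairs; the class CombinerSketch, its helper methods, and the sample categories are illustrative only and are not part of the lab code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of the combiner idea; hypothetical, not Hadoop API. */
public class CombinerSketch {

    /** Sums the values per key, which is what both the reducer and the combiner do. */
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Builds the (category, 1) pairs a mapper would emit for the given categories. */
    public static List<Map.Entry<String, Integer>> mapOutput(String... categories) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String category : categories) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(category, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed sample output of two map tasks
        List<Map.Entry<String, Integer>> map1 = mapOutput("HDFS", "HDFS", "Hive");
        List<Map.Entry<String, Integer>> map2 = mapOutput("HDFS", "Hive", "Pig");

        // Without a combiner: every map output record is shuffled to the reducer
        List<Map.Entry<String, Integer>> shuffledRaw = new ArrayList<>(map1);
        shuffledRaw.addAll(map2);

        // With a combiner: each map task pre-aggregates its own output first
        List<Map.Entry<String, Integer>> shuffledCombined = new ArrayList<>();
        shuffledCombined.addAll(sumByKey(map1).entrySet());
        shuffledCombined.addAll(sumByKey(map2).entrySet());

        System.out.println("Records shuffled without combiner: " + shuffledRaw.size());
        System.out.println("Records shuffled with combiner: " + shuffledCombined.size());
        System.out.println("Totals match: "
                + sumByKey(shuffledRaw).equals(sumByKey(shuffledCombined)));
    }
}
```

With this sample data, 6 records are shuffled without the combiner and 5 with it, and the final totals are identical. That works because addition gives the same result no matter how the values are grouped or ordered, so summing partial sums is the same as summing the raw values.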

Changes to Driver program

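We need to register the combiner with the job in the driver. The only new line is the call to setCombinerClass.

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Job job = new Job(conf, "Feed Category Count");
        // Mapper, reducer and output type settings are unchanged from Lab 04 (not repeated here)
        // New in Lab 05: run FeedCategoryCombiner on each map task's output
        job.setCombinerClass(FeedCategoryCombiner.class);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        // Job submission is unchanged from Lab 04 (not repeated here)
        return 0;
    }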



I ran a Maven build of the project with the clean install goals and copied the jar file to the Hadoop server with scp.

Okay, let’s execute it.

hadoop@gandhari:~/jars$ hadoop jar FeedCategoryCount-2.0.jar org.grassfield.hadoop.FeedCategoryCountDriver /user/hadoop/lab03/feed /user/hadoop/output/lab05

Map-Reduce Framework
                Map input records=793
                Map output records=45
                Map output bytes=647
                Map output materialized bytes=515
                Input split bytes=108
                Combine input records=45
                Combine output records=28
                Reduce input groups=28
                Reduce shuffle bytes=515
                Reduce input records=28
                Reduce output records=28
                Spilled Records=56
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=0
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1189085184
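
The counters show the combiner at work: 45 records went into it and 28 came out, so the reducer received 28 records instead of 45. Here is a small sketch of that arithmetic, with the two counter values copied from the log above; the class and method names are only illustrative.

```java
/** Works out how much the combiner shrank the map output, using the counters above. */
public class CombinerSavings {

    /** Percentage of records removed by the combiner before the shuffle. */
    public static double savedPercent(long combineInput, long combineOutput) {
        return 100.0 * (combineInput - combineOutput) / combineInput;
    }

    public static void main(String[] args) {
        long combineInput = 45;   // "Combine input records" from the job log
        long combineOutput = 28;  // "Combine output records" from the job log

        System.out.println("Records removed: " + (combineInput - combineOutput));
        System.out.printf("Reduction: %.1f%%%n", savedPercent(combineInput, combineOutput));
    }
}
```

That is 17 fewer records, roughly 38% of the map output. On a data set this small the saving is negligible, but it applies to the same shuffle step on larger inputs.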

What about my output folder?

hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/output/lab05
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-10 17:34 /user/hadoop/output/lab05/_SUCCESS
-rw-r--r--   3 hadoop supergroup        386 2016-09-10 17:34 /user/hadoop/output/lab05/part-r-00000

Here is the aggregated output.

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/output/lab05/part-r-00000
Application Master      1
BDFS split      1
Big Data        10
Flume   1
HBase   1
HDFS    6
HDFS block      2
HDFS commands   2
HDFS permissions        1
HDFS replication        1
Hadoop  2
Hive    1
Hue     1
Job Tracker     1
Map Reduce      1
MapReduce       1
Oozie   1
Pig     1
Resource Manager        1
Task Container  1
Task Tracker    1
YARN    1
ZooKeeper       1
ZooKeeperFailoverController     1
hadoop federation       1
hadoop high availability        1
hadoop rack-aware       1
sqoop   1
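
As a sanity check, this output can be compared with the job counters: it should have 28 rows (Reduce output records=28), and the counts should add up to the 45 records emitted by the mapper (Map output records=45). Here is a minimal sketch, with the counts typed in by hand from the listing above; the class name OutputCheck is illustrative.

```java
import java.util.Arrays;

/** Cross-checks the aggregated output against the job counters. */
public class OutputCheck {

    // Counts from part-r-00000, in the order listed above
    static final int[] COUNTS = {
        1, 1, 10, 1, 1, 6, 2, 2, 1, 1, 2,        // Application Master .. Hadoop
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,   // Hive .. ZooKeeperFailoverController
        1, 1, 1, 1                               // hadoop federation .. sqoop
    };

    /** Number of distinct categories in the output. */
    public static int rows() {
        return COUNTS.length;
    }

    /** Sum of all category counts. */
    public static int total() {
        return Arrays.stream(COUNTS).sum();
    }

    public static void main(String[] args) {
        System.out.println("Rows: " + rows());    // compare with Reduce output records
        System.out.println("Total: " + total());  // compare with Map output records
    }
}
```

It prints 28 rows and a total of 45, which agrees with the counters, so the combiner changed how many records were shuffled but not the final counts.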

See you in another interesting post.

