Lab 08 – MapReduce using custom class as Key

Hi,

I can say lab 07 is a milestone in this BigData blog post series, as it has a complete project structure: pre-processing and staging, transforming, and producing output. I used the same exercise to learn how to use a custom class in the MR process. Here is the updated code.

Instead of Text, I have used a custom class, EntryCategory, as the key.
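
Any class used as a MapReduce key must implement WritableComparable, so that Hadoop can serialize it across the shuffle and sort it for grouping. A minimal EntryCategory could look like the sketch below (assuming it simply wraps the category string):

package org.grassfield.hadoop.entity;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * Sketch of the custom key class. Hadoop keys must implement
 * WritableComparable so they can be serialized and sorted in the shuffle.
 */
public class EntryCategory implements WritableComparable<EntryCategory> {
    private String category = "";

    public void setCategory(String category) {
        this.category = category;
    }

    public String getCategory() {
        return category;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialize the key between map and reduce
        out.writeUTF(category);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialize in the same order as write()
        category = in.readUTF();
    }

    @Override
    public int compareTo(EntryCategory other) {
        // defines the sort and grouping order of the keys
        return category.compareTo(other.category);
    }

    @Override
    public int hashCode() {
        // used by the default HashPartitioner; keep consistent with equals()
        return category.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof EntryCategory
                && category.equals(((EntryCategory) obj).category);
    }

    @Override
    public String toString() {
        // TextOutputFormat writes the key using toString()
        return category;
    }
}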

Mapper

The changes I made to use the custom class as the key are in the Mapper's type parameters and in the context.write() call.

package org.grassfield.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.grassfield.hadoop.entity.EntryCategory;

/**
 * A Mapper program to count the categories in an RSS XML file. This may not
 * be the right approach to parse the XML; it is only for demo purposes.
 * 
 * @author pandian
 *
 */
public class FeedCategoryCountMapper extends Mapper<LongWritable, Text, EntryCategory, IntWritable> {
    private IntWritable one = new IntWritable(1);
    

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /*
         * Input fields (tab-separated): 1. feed generator, 2. feed title,
         * 3. feed_author, 4. feed_url, 5. feed_time, 6. item-subject,
         * 7. item-author, 8. item url, 9. item date, 10. category
         */
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, "\t");
        int countTokens = st.countTokens();
        if (countTokens != 10) {
            System.err.println("Incorrect record " + line);
            return;
        }
        // skip the first nine fields; the tenth is the category list
        for (int i = 0; i < 9; i++)
            st.nextToken();
        String catCsv = st.nextToken();
        // strip the surrounding '[' and ']' from the category list
        catCsv = catCsv.substring(1, catCsv.length() - 1);
        st = new StringTokenizer(catCsv, ",");
        while (st.hasMoreTokens()) {
            EntryCategory category = new EntryCategory();
            category.setCategory(st.nextToken().trim());
            context.write(category, one);
        }
    }
}

Previously the output key was Text. It has been changed to org.grassfield.hadoop.entity.EntryCategory, which is our custom key.

Reducer

Here is the updated Reducer.

/**
 * 
 */
package org.grassfield.hadoop;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.grassfield.hadoop.entity.EntryCategory;

/**
 * Reducer for Feed Category Mapper
 * @author pandian
 *
 */
public class FeedCategoryReducer extends 
    Reducer<EntryCategory, IntWritable, EntryCategory, IntWritable> {

    @Override
    protected void reduce(EntryCategory key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum=0;
        for (IntWritable value:values){
            sum+=value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

I need to make similar changes to the Combiner and the Partitioner as well.

Combiner

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.grassfield.hadoop.entity.EntryCategory;

/**
 * Combiner class introduced in lab exercise #05, updated to use the custom key
 * @author pandian
 *
 */
public class FeedCategoryCombiner extends Reducer<EntryCategory, IntWritable, EntryCategory, IntWritable> {
    @Override
    protected void reduce(EntryCategory key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum=0;
        for (IntWritable value:values){
            sum+=value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Partitioner

package org.grassfield.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.grassfield.hadoop.entity.EntryCategory;

/**
 * Let's decide the way to manage the load of the reducers.
 * Reducer 0 - keys starting with an upper-case letter (A-Z)
 * Reducer 1 - keys starting with a lower-case letter (a-z)
 * Reducer 2 - everything else (digits, symbols, non-Latin scripts)
 * @author pandian
 *
 */
public class FeedCategoryPartitioner extends Partitioner<EntryCategory, IntWritable> {

    @Override
    public int getPartition(EntryCategory word, IntWritable count, int numReducer) {
        String s = word.toString();
        if (s.length() == 0)
            return 0;
        char b = s.charAt(0);
        // assumes the job is configured with three reducers
        if (b >= 'A' && b <= 'Z')
            return 0;
        if (b >= 'a' && b <= 'z')
            return 1;
        // digits, symbols and non-Latin scripts fall through here
        return 2;
    }

}
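
Driver

The driver wires all of this together. For the custom key to take effect, EntryCategory has to be registered as the output key class. The actual driver may differ, but a sketch of what FeedCategoryCountDriver could contain:

package org.grassfield.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.grassfield.hadoop.entity.EntryCategory;

/**
 * Hypothetical driver sketch: registers the mapper, combiner, partitioner
 * and reducer, and declares EntryCategory as the key class.
 */
public class FeedCategoryCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "feed category count");
        job.setJarByClass(FeedCategoryCountDriver.class);

        job.setMapperClass(FeedCategoryCountMapper.class);
        job.setCombinerClass(FeedCategoryCombiner.class);
        job.setPartitionerClass(FeedCategoryPartitioner.class);
        job.setReducerClass(FeedCategoryReducer.class);

        // three reducers to match the three partitions above
        job.setNumReduceTasks(3);

        // the custom class is declared as the key type; since map and reduce
        // emit the same types, setOutputKeyClass covers both stages
        job.setOutputKeyClass(EntryCategory.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}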

Execution

Let me export my project as a jar and execute it as shown below.

hadoop@gandhari:~/jars$ hadoop jar FeedCategoryCount-8.0.jar org.grassfield.hadoop.FeedCategoryCountDriver /user/hadoop/lab08/2016-09-15 /user/hadoop/lab08/03

        File System Counters
                FILE: Number of bytes read=428250
                FILE: Number of bytes written=1690331
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=3666720
                HDFS: Number of bytes written=144496
                HDFS: Number of read operations=38
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=16
        Map-Reduce Framework
                Map input records=2458
                Map output records=7038
                Map output bytes=175688
                Map output materialized bytes=108705
                Input split bytes=114
                Combine input records=7038
                Combine output records=3917
                Reduce input groups=3917
                Reduce shuffle bytes=108705
                Reduce input records=3917
                Reduce output records=3917
                Spilled Records=7834
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=0
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1220542464
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=916680
        File Output Format Counters
                Bytes Written=89171

The counters show the combiner at work: 7038 map output records were collapsed into 3917 before the shuffle. Let’s see the output.

hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/lab08/03
Found 4 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-15 05:39 /user/hadoop/lab08/03/_SUCCESS
-rw-r--r--   3 hadoop supergroup      22346 2016-09-15 05:39 /user/hadoop/lab08/03/part-r-00000
-rw-r--r--   3 hadoop supergroup      10633 2016-09-15 05:39 /user/hadoop/lab08/03/part-r-00001
-rw-r--r--   3 hadoop supergroup      56192 2016-09-15 05:39 /user/hadoop/lab08/03/part-r-00002

Let’s cat it. The partitioner did its job: categories starting with upper-case letters went to part-r-00000, lower-case ones to part-r-00001, and everything else (digits, handles and Tamil script) to part-r-00002.

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab08/03/part-r-00000|more
A Man Who Escaped       1
A Quiet Tidy Man        1
A. R. Rahman    3
A.P. Government 1
A320    3
hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab08/03/part-r-00001|more
accelerated depreciation        1
acyclic graph   1
acyclovir       1
ad blocking     1
advertising     1
hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab08/03/part-r-00002|more
90      1
@Team_Optimum   1
அகரமுதல்வன்       1
அகிரா குரோசவா   1
அகிலா   3
