Lab 07 – A mini MR project with mapper, reducer and partitioner

Hi Hadoopers,

This is a good point for a milestone check, so more and more practicals are scheduled in the course. I also had a long weekend for Hari Raya Haji, so I did a mini project with the following scope.

  1. Collect the RSS feed of multiple sites
  2. XML parsing and flat file preparation
  3. Use the MR to find the categories and their occurrences.

Let me present it in graphical format.

[Figure: Lab 07 – mini project workflow with mapper, reducer and partitioner]

Download the feeds

We have a list of feeds from a variety of sources: IT news, corporate sites, personal blogs, etc. To start the project I used 80+ different feeds. Each feed is downloaded and copied to the hard disk, as sketched below.
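The post doesn't show the download step itself; a minimal sketch of fetching one feed to disk in Java might look like this (the feed URL and target path are placeholders, not from the actual project):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class FeedDownloader {
    public static void main(String[] args) throws Exception {
        // placeholder feed URL and target directory for illustration
        URL feed = new URL("http://example.com/feed.xml");
        Path target = Paths.get("feeds", "example.xml");
        Files.createDirectories(target.getParent());
        // stream the feed straight to a file on the local disk
        try (InputStream in = feed.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}

In the real project this would loop over the 80+ feed URLs and write one XML file per feed.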

XML parsing and flat file preparation

I already maintain a simple RSS parser on SourceForge named jAtomRSS.

I checked it out and made a few changes for this project. jAtomRSS reads all those XML feeds and creates a flat file, with one record per blog entry containing 10 tab-separated tokens as below.

  1. feed generator
  2. feed title
  3. feed author
  4. feed URL
  5. feed time
  6. item subject
  7. item author
  8. item URL
  9. item date
  10. category (all the category text is given as comma-separated values, enclosed in square brackets [])

MapReduce


Our mapper, FeedCategoryCountMapper, reads the output of jAtomRSS. It takes the 10th token (the category list) and emits each category it finds, so the reducers can count the occurrences.

Input to the mapper looks like this:

Blogger    அனு    அனுசுயா    http://vanusuya.blogspot.com/    Sat Mar 26 19:40:29 MYT 2016    ஜான்சிபார் -3    அனுசுயா    http://vanusuya.blogspot.com/2013/02/3.html    Thu Feb 07 15:46:41 MYT 2013    [ஆப்பிரிக்கா, சுயபுராணம், பயணகட்டுரை]

The flat file is copied to HDFS under the directory /user/hadoop/lab07:

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -mkdir /user/hadoop/lab07
hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -put ../feed/output/2016-09-12 /user/hadoop/lab07

Let’s see how it is parsed.

        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, "\t");
        if (st.countTokens() != 10) {
            System.err.println("Incorrect record " + line);
            return;
        }
        // skip the first nine tokens; only the category list (token 10) matters here
        for (int i = 0; i < 9; i++)
            st.nextToken();
        String catCsv = st.nextToken();
        // strip the enclosing [ and ] around the comma-separated categories
        catCsv = catCsv.substring(1, catCsv.length() - 1);
        // emit each individual category with a count of 1
        st = new StringTokenizer(catCsv, ",");
        while (st.hasMoreTokens()) {
            category.set(st.nextToken().trim());
            context.write(category, one);
        }
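For context, here is a minimal sketch of the class this snippet lives in, assuming the conventional new-API Mapper signature; the fields one and category match the snippet above, but the actual source may differ:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FeedCategoryCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // reused output objects, the usual Hadoop idiom
    private static final IntWritable one = new IntWritable(1);
    private final Text category = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... the parsing logic shown above goes here ...
    }
}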

FeedCategoryPartitioner is our partitioner. It sends all keys starting with an uppercase letter to reducer 0, lowercase letters to reducer 1, and everything else (digits and non-English characters) to reducer 2.

        String s = word.toString();
        if (s.length() == 0)
            return 0;   // empty keys fall back to partition 0
        char c = s.charAt(0);
        if (c >= 'A' && c <= 'Z')
            return 0;   // uppercase A-Z goes to reducer 0
        if (c >= 'a' && c <= 'z')
            return 1;   // lowercase a-z goes to reducer 1
        return 2;       // digits and non-English text go to reducer 2
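Again for context, a sketch of the enclosing class, assuming the standard Partitioner contract (the snippet above would be the body of getPartition; the real source may be organised differently):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FeedCategoryPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable value, int numPartitions) {
        // ... the first-character test shown above goes here ...
        return 0;
    }
}

Note that the returned partition number must be less than the number of reduce tasks, which is why the job runs with exactly three reducers.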

FeedCategoryReducer is the reducer class. Three reducer tasks run, one per partition. Here is the implementation of the reduce method:

        // sum all the 1s emitted by the mapper for this category
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
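Wrapped in its class, the reducer might look like this (a sketch assuming the standard new-API Reducer signature):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FeedCategoryReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}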

Code is available at https://sourceforge.net/p/feedanalylitcs/code/HEAD/tree/FeedCategoryCount/src/org/grassfield/hadoop/
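The driver is in the repository linked above; for orientation, here is a hedged sketch of how FeedCategoryCountDriver might wire the mapper, partitioner and three reducers together using standard Job boilerplate (the actual source may differ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FeedCategoryCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "feed category count");
        job.setJarByClass(FeedCategoryCountDriver.class);
        job.setMapperClass(FeedCategoryCountMapper.class);
        job.setPartitionerClass(FeedCategoryPartitioner.class);
        job.setReducerClass(FeedCategoryReducer.class);
        job.setNumReduceTasks(3);   // one reducer per partition: 0, 1 and 2
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}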

Our code is ready. Let’s execute it.


hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop jar FeedCategoryCount-7.0.jar org.grassfield.hadoop.FeedCategoryCountDriver /user/hadoop/lab07/2016-09-12 /user/hadoop/lab07/05

It created three part files (plus the _SUCCESS marker):

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -ls /user/hadoop/lab07/05
Found 4 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-12 14:04 /user/hadoop/lab07/05/_SUCCESS
-rw-r--r--   3 hadoop supergroup       8534 2016-09-12 14:04 /user/hadoop/lab07/05/part-r-00000
-rw-r--r--   3 hadoop supergroup       2842 2016-09-12 14:04 /user/hadoop/lab07/05/part-r-00001
-rw-r--r--   3 hadoop supergroup      15683 2016-09-12 14:04 /user/hadoop/lab07/05/part-r-00002

Let’s cat the files.

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab07/05/part-r-00000|more
A320    1
A320neo 3
A321neo 1
A380    2
ADFS    1
APM     6
hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab07/05/part-r-00001|more
acyclovir       1
ad blocking     1
advertising     1
airport.        1
ambari  1
analysis        1
hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab07/05/part-r-00002|more
#CitrixPartnerLove      1
#Summertrip2015 7
0-days  1
1928    1
1997    1
2008    1
63 nayanmars    2
737     1
737MAX  1
747     1
747-8F  1
787     2
ஃபிரான்ஸ் காஃப்கா  2
அகிரா குரோசவா   1
அக்ஷயா.  1
அஞ்சலை   1

You might have observed that the output of reducer 0 (part-r-00000) starts with capital letters, reducer 1 (part-r-00001) with lowercase letters, and numbers and non-English text are handled by reducer 2 (part-r-00002).

Good! See you in another interesting blog post.

You can follow my SourceForge project, RSS Atom Feed Analytics With MapReduce, for further improvements.
