Automation & Analysis of RSS & ATOM newsfeeds using Hadoop Ecosystem

This project was carried out to extract, analyse and display RSS and ATOM news feeds.

The final goals of this project are as given below.
1. Provide an automated workflow for feed extraction and analysis
2. Provide a browser-based user interface for analysis and reporting

Along with the above main goals, the framework is designed to be scalable, so that other feed mechanisms (such as social media streaming) and machine learning analytics can be opted in later.

The report can be downloaded from the location given below.

automation-analysis-of-rss-atom-newsfeeds-ramaiah-murugapandian

 

Lab 30: Hive Queries

Hi hadoopers,

I have a program that extracts RSS feeds from different sources into a tab-delimited text file. I used Hive to do some mining on it today. Let’s see the results.


The file has 12 fields separated by tab. Here is the table description.


CREATE EXTERNAL TABLE IF NOT EXISTS feed_article (
    feedgenerator   STRING,
    feedtitle       STRING,
    feed_author     STRING,
    feed_url        STRING,
    feed_time       STRING,
    item_subject    STRING,
    item_author     STRING,
    itemurl         STRING,
    itemdate        STRING,
    category        STRING,
    DescriptionFile STRING,
    uniqueId        BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hadoop/lab27';

hive> describe feed_article;
OK
feedgenerator           string
feedtitle               string
feed_author             string
feed_url                string
feed_time               string
item_subject            string
item_author             string
itemurl                 string
itemdate                string
category                string
descriptionfile         string
uniqueid                bigint
Time taken: 0.062 seconds, Fetched: 12 row(s)

Count how many articles were published today.


hive> select count(*) from feed_article;
OK
3699
Time taken: 1.624 seconds, Fetched: 1 row(s)

List the distinct authors seen today.


hive> select distinct item_author from feed_article;
,Alok Deshpande
-தி.இன்பராஜ்-
-பா.ராஜா
A.D.Balasubramaniyan
A.T.S Pandian
AFP
AP
Aekaanthan
Aishwarya Parikh
Akanksha Jain
Alex Barile

Let’s see which sites have a lot of articles.


hive> select feedtitle, count(*) from feed_article group by feedtitle;
NULL    139
A Wandering Mind        1
APMdigest Hot Topics: APM       2
Application Performance Monitoring Blog | AppDynamics   1
BSNLTeleServices | BSNL Broadband Plans, Bill Payment Selfcare Portal   3
Bangalore Aviation      1
Blog Feed       1
Cloudera Engineering Blog       1
DailyThanthi.com        20

Who wrote many articles today? (Note that this query orders by author name in descending order, so it shows the counts for the last five authors alphabetically, not the top counts.)

hive> select item_author, count (*) from feed_article group by item_author order by item_author desc limit 5;
OK
ஹாவேரி, 1
ஹரி கிருஷ்ணன்     14
ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  4
ஷங்கர்    2
Time taken: 2.476 seconds, Fetched: 5 row(s)

Authors of which websites wrote many articles today?

hive> select item_author, feedtitle, count (*) from feed_article group by item_author, feedtitle order by item_author desc limit 10;
ஹாவேரி, Dinamani - பெங்களூரு - http://www.dinamani.com/all-editions/edition-bangalore/ 1
ஹரி கிருஷ்ணன்     Dinamani - தினந்தோறும் திருப்புகழ் - http://www.dinamani.com/specials/dinanthorum-thirupugal/     14
ஹரன் பிரசன்னா     ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  தி இந்து - முகப்பு        1
ஸ்கிரீனன்  தி இந்து - தமிழ் சினிமா   1
ஸ்கிரீனன்  தி இந்து - சினிமா        2
ஷங்கர்    தி இந்து - சினிமா        1
ஷங்கர்    தி இந்து - முகப்பு        1
வெங்கடேசன். ஆர்    Dinamani - வேலைவாய்ப்பு - http://www.dinamani.com/employment/  32
வெங்கடேசன். ஆர்    Dinamani - விவசாயம் - http://www.dinamani.com/agriculture/    2
Time taken: 2.493 seconds, Fetched: 10 row(s)

Which feed generator software was used to publish the articles?


hive> select feedgenerator, count (*) from feed_article group by feedgenerator order by feedgenerator desc limit 10;
https://wordpress.org/?v=4.6.1  5
https://wordpress.org/?v=4.5.4  80
https://wordpress.org/?v=4.5.2  1
http://wordpress.org/?v=4.2.10  2
http://wordpress.org/?v=4.1.4   7
http://wordpress.org/?v=3.5.1   10
http://wordpress.org/?v=3.0     1
http://wordpress.com/   13
application/rss+xml     3434
Jive Engage 8.0.2.0  (http://jivesoftware.com/products/)        1
Time taken: 2.473 seconds, Fetched: 10 row(s)

Lab 07 – A mini MR project with mapper, reducer and partitioner

Hi Hadoopers,

I prefer to make a milestone check at this point, as more and more practicals are scheduled in the course. I also had a long weekend for Hari Raya Haji. So, I did a mini project with the scope given below.

  1. Collect the RSS feed of multiple sites
  2. XML parsing and flat file preparation
  3. Use MapReduce to find the categories and their occurrence counts.

Let me present it in graphical format.

[Figure: hadoop030-lab-07-mini-project-with-mapper-reducer-combiner]

Download the feeds

We have a list of feeds from a variety of sources: IT, corporate sites, personal blogs etc. To start the project, I have used 80+ different feeds. Each feed is downloaded and copied to the hard disk.
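
The download step itself is plain Java I/O. Here is a minimal sketch, assuming a hypothetical feeds.txt with one feed URL per line; the real downloader is part of the jAtomRSS workflow described next.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class FeedDownloader {
    public static void main(String[] args) throws Exception {
        // feeds.txt (hypothetical name): one feed URL per line
        List<String> urls = Files.readAllLines(Paths.get("feeds.txt"));
        Path outDir = Paths.get("feed/input");
        Files.createDirectories(outDir);
        int i = 0;
        for (String u : urls) {
            try (InputStream in = new URL(u.trim()).openStream()) {
                // one XML file per feed, e.g. feed/input/feed-0.xml
                Files.copy(in, outDir.resolve("feed-" + i++ + ".xml"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}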

XML parsing and flat file preparation

I already maintain a simple RSS parser on SourceForge named jAtomRSS.

I checked it out and made a few changes so it would work for this project. jAtomRSS reads all those XML feeds and creates a flat file. It creates one record per blog post, with 10 tokens separated by tabs as listed below; a sketch of the record format follows the list.

  1. feed generator
  2. feed title
  3. feed_author
  4. feed_url
  5. feed_time
  6. item-subject
  7. item-author
  8. item url
  9. item date
  10. category (all the category text is given as comma-separated values, enclosed in square brackets [])
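
Conceptually, each record is just these ten fields joined with tabs. A small sketch of how one such line could be produced; the method and field names here are illustrative, not jAtomRSS’s actual API.

import java.util.List;

public class FlatFileRecord {
    // Joins the ten fields of one feed item with tabs; the category
    // list is rendered as comma-separated values inside [ and ].
    static String toRecord(String generator, String feedTitle, String feedAuthor,
            String feedUrl, String feedTime, String itemSubject, String itemAuthor,
            String itemUrl, String itemDate, List<String> categories) {
        return String.join("\t", generator, feedTitle, feedAuthor, feedUrl,
                feedTime, itemSubject, itemAuthor, itemUrl, itemDate,
                "[" + String.join(", ", categories) + "]");
    }
}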

MapReduce


Our mapper, FeedCategoryCountMapper, reads the output of jAtomRss. It takes the 10th token, splits out each category string and emits it with a count of one.

Input to the mapper would be like this:

Blogger    அனு    அனுசுயா    http://vanusuya.blogspot.com/    Sat Mar 26 19:40:29 MYT 2016    ஜான்சிபார் -3    அனுசுயா    http://vanusuya.blogspot.com/2013/02/3.html    Thu Feb 07 15:46:41 MYT 2013    [ஆப்பிரிக்கா, சுயபுராணம், பயணகட்டுரை]

The flat file is copied to HDFS under the directory /user/hadoop/lab07:

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -mkdir /user/hadoop/lab07
hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -put ../feed/output/2016-09-12 /user/hadoop/lab07

Let’s see how it is parsed. Here is the mapper; the class skeleton and imports around the original map method are reconstructed, and the full source is in the repository linked below.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FeedCategoryCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text category = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line, "\t");
        if (st.countTokens() != 10) {
            // skip malformed records
            System.err.println("Incorrect record " + line);
            return;
        }
        // skip the first nine tokens; the 10th is the category list
        for (int i = 0; i < 9; i++)
            st.nextToken();
        // strip the surrounding [ and ] from the category CSV
        String catCsv = st.nextToken();
        catCsv = catCsv.substring(1, catCsv.length() - 1);
        // emit (category, 1) for every comma-separated category
        st = new StringTokenizer(catCsv, ",");
        while (st.hasMoreTokens()) {
            category.set(st.nextToken().trim());
            context.write(category, one);
        }
    }
}

FeedCategoryPartitioner is our partitioner. It sends all text starting with an uppercase English letter to reducer 1, lowercase letters to reducer 2, and everything else (digits, symbols and non-English characters) to reducer 3.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FeedCategoryPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable value, int numReduceTasks) {
        String s = word.toString();
        if (s.length() == 0)
            return 0;
        char b = s.charAt(0); // charAt avoids charset issues with getBytes()
        if (b >= 'A' && b <= 'Z')
            return 0; // uppercase English letters -> reducer 1
        if (b >= 'a' && b <= 'z')
            return 1; // lowercase English letters -> reducer 2
        return 2; // digits, symbols and non-English text -> reducer 3
    }
}

FeedCategoryReducer is the reducer class. Three reducer tasks are in action to complete this job. Here is the implementation of the reduce method:

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum)); // total count per category
    }

Code is available at https://sourceforge.net/p/feedanalylitcs/code/HEAD/tree/FeedCategoryCount/src/org/grassfield/hadoop/
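
The driver class FeedCategoryCountDriver wires the mapper, partitioner and reducer together and requests three reduce tasks, one per partition. Here is a minimal sketch of what such a driver looks like; the actual implementation is in the repository above and may differ in details.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FeedCategoryCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "feed category count");
        job.setJarByClass(FeedCategoryCountDriver.class);
        job.setMapperClass(FeedCategoryCountMapper.class);
        job.setPartitionerClass(FeedCategoryPartitioner.class);
        job.setReducerClass(FeedCategoryReducer.class);
        job.setNumReduceTasks(3); // one reducer per partition
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}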

Our code is ready. Let’s execute it.


hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop jar FeedCategoryCount-7.0.jar org.grassfield.hadoop.FeedCategoryCountDriver /user/hadoop/lab07/2016-09-12 /user/hadoop/lab07/05

It has created 3 part files (plus the _SUCCESS marker):

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -ls /user/hadoop/lab07/05
Found 4 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-12 14:04 /user/hadoop/lab07/05/_SUCCESS
-rw-r--r--   3 hadoop supergroup       8534 2016-09-12 14:04 /user/hadoop/lab07/05/part-r-00000
-rw-r--r--   3 hadoop supergroup       2842 2016-09-12 14:04 /user/hadoop/lab07/05/part-r-00001
-rw-r--r--   3 hadoop supergroup      15683 2016-09-12 14:04 /user/hadoop/lab07/05/part-r-00002

Let’s cat the files.

hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab07/05/part-r-00000|more
A320    1
A320neo 3
A321neo 1
A380    2
ADFS    1
APM     6
hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab07/05/part-r-00001|more
acyclovir       1
ad blocking     1
advertising     1
airport.        1
ambari  1
analysis        1
hadoop@gandhari:/opt/hadoop-2.6.4/jars$ hadoop fs -cat /user/hadoop/lab07/05/part-r-00002|more
#CitrixPartnerLove      1
#Summertrip2015 7
0-days  1
1928    1
1997    1
2008    1
63 nayanmars    2
737     1
737MAX  1
747     1
747-8F  1
787     2
ஃபிரான்ஸ் காஃப்கா  2
அகிரா குரோசவா   1
அக்ஷயா.  1
அஞ்சலை   1

You might have observed that the output of reducer 1 starts with capital letters and the output of reducer 2 starts with lowercase letters, while numbers and non-English text were processed by reducer 3.

Good! See you in another interesting blog post.

You can follow my SourceForge project, RSS Atom Feed Analytics With MapReduce, for further improvements.

A simple RSS Parsing Android application

Adding a few more changes to the code given in my earlier post, I am attaching the source code as well as the APK file for a simple RSS SAX parser.

This will parse the feed http://foxbrush.wordpress.com/feed and show the content on the screen. More fine-tuning is required and will be added later.
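
The heart of such a parser is a SAX DefaultHandler that collects the title of each <item> element while the feed streams through. Below is a minimal desktop sketch of the idea; the element names assume an RSS 2.0 feed, and the actual app builds on the Feedparser code credited below.

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RssTitleHandler extends DefaultHandler {
    private final List<String> titles = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private boolean inItem;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("item".equals(qName))
            inItem = true;
        text.setLength(0); // collect fresh character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inItem && "title".equals(qName))
            titles.add(text.toString().trim()); // one title per <item>
        if ("item".equals(qName))
            inItem = false;
    }

    public static void main(String[] args) throws Exception {
        RssTitleHandler handler = new RssTitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new URL("http://foxbrush.wordpress.com/feed").openStream(), handler);
        for (String title : handler.titles)
            System.out.println(title);
    }
}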

Source file (eclipse project archive) – http://ubuntuone.com/4Id0xEpemi3JwEhIRvwQKd

APK file – http://ubuntuone.com/4SjwRCdQwh6QsoWBd46TAY

Android version required: 2.3.1. Tested with my HTC.

Feedparser (c) – http://www.ibm.com/developerworks/opensource/library/x-android/