Lab 30: Hive Queries

Hi hadoopers,

I have a program that will extract RSS feeds from different sources in tab limited text file. I used to Hive to do some mining today. Let’s see the results.

pighive3

The file has 12 fields separated by tab. Here is the table description.


CREATE external TABLE IF NOT EXISTS feed_article (feedgenerator STRING,feedtitle STRING,feed_author STRING,feed_url STRING,feed_time STRING,item_subject STRING,item_author STRING,itemurl STRING,itemdate STRING,category STRING,DescriptionFile STRING,uniqueId bigint) ROW FORMAT DELIMITED     FIELDS TERMINATED BY '\t'      STORED AS TEXTFILE     LOCATION '/user/hadoop/lab27';

hive> describe feed_article;
OK
feedgenerator           string
feedtitle               string
feed_author             string
feed_url                string
feed_time               string
item_subject            string
item_author             string
itemurl                 string
itemdate                string
category                string
descriptionfile         string
uniqueid                bigint
Time taken: 0.062 seconds, Fetched: 12 row(s)

Count how many articles published today.


hive> select count(*) from feed_article;
OK
3699
Time taken: 1.624 seconds, Fetched: 1 row(s)

List of distinct authors today.


hive> select distinct item_author from feed_article;
,Alok Deshpande
-தி.இன்பராஜ்-
-பா.ராஜா
A.D.Balasubramaniyan
A.T.S Pandian
AFP
AP
Aekaanthan
Aishwarya Parikh
Akanksha Jain
Alex Barile

Lets see which site has lot of articles


hive> select feedtitle, count(*) from feed_article group by feedtitle;
NULL    139
A Wandering Mind        1
APMdigest Hot Topics: APM       2
Application Performance Monitoring Blog | AppDynamics   1
BSNLTeleServices | BSNL Broadband Plans, Bill Payment Selfcare Portal   3
Bangalore Aviation      1
Blog Feed       1
Cloudera Engineering Blog       1
DailyThanthi.com        20

Who wrote many articles today?

hive> select item_author, count (*) from feed_article group by item_author order by item_author desc limit 5;
OK
ஹாவேரி, 1
ஹரி கிருஷ்ணன்     14
ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  4
ஷங்கர்    2
Time taken: 2.476 seconds, Fetched: 5 row(s)

Author of which website wrote many article today?

hive> hive> select item_author, feedtitle, count (*) from feed_article group by item_author, feedtitle order by item_author desc limit 10;
ஹாவேரி, Dinamani - பெங்களூரு - http://www.dinamani.com/all-editions/edition-bangalore/ 1
ஹரி கிருஷ்ணன்     Dinamani - தினந்தோறும் திருப்புகழ் - http://www.dinamani.com/specials/dinanthorum-thirupugal/     14
ஹரன் பிரசன்னா     ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  தி இந்து - முகப்பு        1
ஸ்கிரீனன்  தி இந்து - தமிழ் சினிமா   1
ஸ்கிரீனன்  தி இந்து - சினிமா        2
ஷங்கர்    தி இந்து - சினிமா        1
ஷங்கர்    தி இந்து - முகப்பு        1
வெங்கடேசன். ஆர்    Dinamani - வேலைவாய்ப்பு - http://www.dinamani.com/employment/  32
வெங்கடேசன். ஆர்    Dinamani - விவசாயம் - http://www.dinamani.com/agriculture/    2
Time taken: 2.493 seconds, Fetched: 10 row(s)

Using which feed software the articles were published.


hive> select feedgenerator, count (*) from feed_article group by feedgenerator order by feedgenerator desc limit 10;
https://wordpress.org/?v=4.6.1  5
https://wordpress.org/?v=4.5.4  80
https://wordpress.org/?v=4.5.2  1
http://wordpress.org/?v=4.2.10  2
http://wordpress.org/?v=4.1.4   7
http://wordpress.org/?v=3.5.1   10
http://wordpress.org/?v=3.0     1
http://wordpress.com/   13
application/rss+xml     3434
Jive Engage 8.0.2.0  (http://jivesoftware.com/products/)        1
Time taken: 2.473 seconds, Fetched: 10 row(s)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s