Lab 30: Hive Queries

Hi hadoopers,

I have a program that will extract RSS feeds from different sources in tab limited text file. I used to Hive to do some mining today. Let’s see the results.


The file has 12 fields separated by tab. Here is the table description.

CREATE external TABLE IF NOT EXISTS feed_article (feedgenerator STRING,feedtitle STRING,feed_author STRING,feed_url STRING,feed_time STRING,item_subject STRING,item_author STRING,itemurl STRING,itemdate STRING,category STRING,DescriptionFile STRING,uniqueId bigint) ROW FORMAT DELIMITED     FIELDS TERMINATED BY '\t'      STORED AS TEXTFILE     LOCATION '/user/hadoop/lab27';

hive> describe feed_article;
feedgenerator           string
feedtitle               string
feed_author             string
feed_url                string
feed_time               string
item_subject            string
item_author             string
itemurl                 string
itemdate                string
category                string
descriptionfile         string
uniqueid                bigint
Time taken: 0.062 seconds, Fetched: 12 row(s)

Count how many articles published today.

hive> select count(*) from feed_article;
Time taken: 1.624 seconds, Fetched: 1 row(s)

List of distinct authors today.

hive> select distinct item_author from feed_article;
,Alok Deshpande
A.T.S Pandian
Aishwarya Parikh
Akanksha Jain
Alex Barile

Lets see which site has lot of articles

hive> select feedtitle, count(*) from feed_article group by feedtitle;
NULL    139
A Wandering Mind        1
APMdigest Hot Topics: APM       2
Application Performance Monitoring Blog | AppDynamics   1
BSNLTeleServices | BSNL Broadband Plans, Bill Payment Selfcare Portal   3
Bangalore Aviation      1
Blog Feed       1
Cloudera Engineering Blog       1        20

Who wrote many articles today?

hive> select item_author, count (*) from feed_article group by item_author order by item_author desc limit 5;
ஹாவேரி, 1
ஹரி கிருஷ்ணன்     14
ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  4
ஷங்கர்    2
Time taken: 2.476 seconds, Fetched: 5 row(s)

Author of which website wrote many article today?

hive> hive> select item_author, feedtitle, count (*) from feed_article group by item_author, feedtitle order by item_author desc limit 10;
ஹாவேரி, Dinamani - பெங்களூரு - 1
ஹரி கிருஷ்ணன்     Dinamani - தினந்தோறும் திருப்புகழ் -     14
ஹரன் பிரசன்னா     ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  தி இந்து - முகப்பு        1
ஸ்கிரீனன்  தி இந்து - தமிழ் சினிமா   1
ஸ்கிரீனன்  தி இந்து - சினிமா        2
ஷங்கர்    தி இந்து - சினிமா        1
ஷங்கர்    தி இந்து - முகப்பு        1
வெங்கடேசன். ஆர்    Dinamani - வேலைவாய்ப்பு -  32
வெங்கடேசன். ஆர்    Dinamani - விவசாயம் -    2
Time taken: 2.493 seconds, Fetched: 10 row(s)

Using which feed software the articles were published.

hive> select feedgenerator, count (*) from feed_article group by feedgenerator order by feedgenerator desc limit 10;  5  80  1  2   7   10     1   13
application/rss+xml     3434
Jive Engage  (        1
Time taken: 2.473 seconds, Fetched: 10 row(s)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s