
Lab 25: Combiners using Pig Union

Hi Friends,

I archive blog entries and news entries on a daily basis using one of my projects, jatomrss. It saves the news articles as tab-delimited text:

$ hadoop fs -ls /user/hadoop/feed/*.jatomrss.log
-rw-r--r--   3 hadoop supergroup    5933545 2016-10-01 21:44 /user/hadoop/feed/2016-10-01.jatomrss.log
-rw-r--r--   3 hadoop supergroup    6313692 2016-10-02 17:03 /user/hadoop/feed/2016-10-02.jatomrss.log
-rw-r--r--   3 hadoop supergroup     596174 2016-10-03 06:37 /user/hadoop/feed/2016-10-03.jatomrss.log
-rw-r--r--   3 hadoop supergroup     827974 2016-10-04 06:53 /user/hadoop/feed/2016-10-04.jatomrss.log
-rw-r--r--   3 hadoop supergroup    1367507 2016-10-05 07:41 /user/hadoop/feed/2016-10-05.jatomrss.log
-rw-r--r--   3 hadoop supergroup      10927 2016-10-06 07:29 /user/hadoop/feed/2016-10-06.jatomrss.log
-rw-r--r--   3 hadoop supergroup    1536870 2016-10-07 06:24 /user/hadoop/feed/2016-10-07.jatomrss.log
-rw-r--r--   3 hadoop supergroup    1719126 2016-10-08 07:13 /user/hadoop/feed/2016-10-08.jatomrss.log
-rw-r--r--   3 hadoop supergroup    1870073 2016-10-09 09:36 /user/hadoop/feed/2016-10-09.jatomrss.log
-rw-r--r--   3 hadoop supergroup    1376982 2016-10-11 05:11 /user/hadoop/feed/2016-10-11.jatomrss.log
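As an aside, since every daily file shares the same layout, a Hadoop glob pattern in the load path is one way to read several days with a single LOAD (a sketch; the brace pattern in the path is illustrative):

```pig
-- Load both days at once via a Hadoop glob in the path.
-- This is an alternative to UNION when the files share one schema.
feedEntriesOct1and2 = load '/user/hadoop/feed/2016-10-0{1,2}.jatomrss.log'
    using PigStorage('\t')
    as (generator:chararray, feedTitle:chararray, feedAuthor:chararray,
        feedUrl:chararray, feed_time:chararray, entrySubject:chararray,
        entryAuthor:chararray, entryUrl:chararray, entryDate:chararray,
        categorySet:chararray, descriptionFile:chararray, uniqueId:chararray);
```

This lab keeps the relations separate, though, so that UNION can be demonstrated.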

Let's use the UNION operator to combine the feeds from October 1st and 2nd.


The relation for October 1st is loaded as follows:


feedEntryRecord20161001 = load '/user/hadoop/feed/2016-10-01.jatomrss.log'
    using PigStorage('\t')
    as (generator:chararray, feedTitle:chararray, feedAuthor:chararray,
        feedUrl:chararray, feed_time:chararray, entrySubject:chararray,
        entryAuthor:chararray, entryUrl:chararray, entryDate:chararray,
        categorySet:chararray, descriptionFile:chararray, uniqueId:chararray);

The relation for October 2nd is loaded the same way:


feedEntryRecord20161002 = load '/user/hadoop/feed/2016-10-02.jatomrss.log'
    using PigStorage('\t')
    as (generator:chararray, feedTitle:chararray, feedAuthor:chararray,
        feedUrl:chararray, feed_time:chararray, entrySubject:chararray,
        entryAuthor:chararray, entryUrl:chararray, entryDate:chararray,
        categorySet:chararray, descriptionFile:chararray, uniqueId:chararray);

Let’s describe them.


grunt> describe feedEntryRecord20161001;
feedEntryRecord20161001: {generator: chararray,feedTitle: chararray,feedAuthor: chararray,feedUrl: chararray,feed_time: chararray,entrySubject: chararray,entryAuthor: chararray,entryUrl: chararray,entryDate: chararray,categorySet: chararray,descriptionFile: chararray,uniqueId: chararray}
grunt> describe feedEntryRecord20161002;
feedEntryRecord20161002: {generator: chararray,feedTitle: chararray,feedAuthor: chararray,feedUrl: chararray,feed_time: chararray,entrySubject: chararray,entryAuthor: chararray,entryUrl: chararray,entryDate: chararray,categorySet: chararray,descriptionFile: chararray,uniqueId: chararray}

Here is the combiner! For the union to work, both relations need identical schemas, and as the describe output above shows, they do.


union_data = union feedEntryRecord20161001, feedEntryRecord20161002;

grunt> describe union_data;
union_data: {generator: chararray,feedTitle: chararray,feedAuthor: chararray,feedUrl: chararray,feed_time: chararray,entrySubject: chararray,entryAuthor: chararray,entryUrl: chararray,entryDate: chararray,categorySet: chararray,descriptionFile: chararray,uniqueId: chararray}
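A note in passing: plain UNION lines the fields up positionally. If one day's file carried an extra column, UNION ONSCHEMA (available since Pig 0.8) merges by field name instead and fills missing fields with null. A sketch with a hypothetical second relation:

```pig
-- Hypothetical: feedEntryRecordNew carries an extra 'language' field.
-- UNION ONSCHEMA matches fields by name; rows from the old relation
-- get null for 'language' instead of the union failing.
union_by_name = union onschema feedEntryRecord20161001, feedEntryRecordNew;
```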

Let's store it to check if the combiner works as expected.

store union_data into '/user/hadoop/lab25/union_data' using PigStorage('\t');

Input(s):
Successfully read 10612 records from: "/user/hadoop/feed/2016-10-02.jatomrss.log"
Successfully read 10295 records from: "/user/hadoop/feed/2016-10-01.jatomrss.log"

Output(s):
Successfully stored 20907 records (206484670 bytes) in: "/user/hadoop/lab25/union_data"

Counters:
Total records written : 20907
Total bytes written : 206484670
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local1564505665_0014
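To double-check the counts reported above, the union can also be counted directly from grunt (a sketch; COUNT_STAR is used rather than COUNT so rows are counted even if their first field is null):

```pig
-- Group everything into a single bag and count the rows;
-- the result should match the total the store job reported.
grouped = group union_data all;
row_count = foreach grouped generate COUNT_STAR(union_data);
dump row_count;
```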