Lab 23: Apache Pig Basics

Hi ETL enthusiast,

This post will talk about important concepts in Pig.


PIG Conventions
Convention   Description       Example
( )          tuple data type   (John,18,4.0F)
{ }          bag data type     (1,{(1,2,3)})
                               (4,{(4,2,1),(4,3,3)})
                               (8,{(8,3,4)})
[ ]          map data type     [name#John,phone#5551212]

Relations, Bags, Tuples, Fields

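A field is a piece of data, e.g. Jessica.

A tuple is an ordered set of fields, e.g. (Jessica, F, 35, NY).

A bag is a collection of tuples, e.g. {(Jessica, F, 35, NY),(Nathan, M, 35, NJ)}.

A relation is a bag of tuples (an outer bag). It is what statements such as LOAD and GROUP produce.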
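
As a rough analogy only (Pig is not Python), these notions map onto familiar Python structures. The variable names below are made up for illustration, and a Python list is ordered whereas a Pig bag is not.

```python
# Rough Python analogy for Pig's data model (illustrative only).
field = "Jessica"                      # a field: one piece of data
tuple_ = ("Jessica", "F", 35, "NY")    # a tuple: an ordered set of fields
bag = [                                # a bag: a collection of tuples
    ("Jessica", "F", 35, "NY"),
    ("Nathan", "M", 35, "NJ"),
]
pig_map = {"name": "John", "phone": "5551212"}  # a map: [name#John,phone#5551212]

# Fields are addressed by position, much like $0, $1, ... in Pig.
first_names = [t[0] for t in bag]
print(first_names)  # ['Jessica', 'Nathan']
```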

Data Types

Pig has simple and complex data types.

Simple types    Description                                         Example
int             Signed 32-bit integer                               10
long            Signed 64-bit integer                               Data: 10L or 10l. Display: 10L
float           32-bit floating point                               Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F. Display: 10.5F or 1050.0F
double          64-bit floating point                               Data: 10.5 or 10.5e2 or 10.5E2. Display: 10.5 or 1050.0
chararray       Character array (string) in Unicode UTF-8 format    hello world
bytearray       Byte array (blob)
boolean         Boolean                                             true/false (case insensitive)
datetime        Datetime                                            1970-01-01T00:00:00.000+00:00

Complex types   Description                                         Example
tuple           An ordered set of fields.                           (19,2)
bag             A collection of tuples.                             {(19,2), (18,1)}
map             A set of key-value pairs.                           [open#apache]

Nulls, Operators, and Functions

Operator / function                                   Interaction with nulls
Comparison operators: ==, !=, >, <, >=, <=            If either subexpression is null, the result is null.
Comparison operator: matches                          If either the string being matched against or the string defining the match is null, the result is null.
Arithmetic operators: +, -, *, /, % (modulo),         If either subexpression is null, the resulting expression is null.
  ? : (bincond)
Null operator: is null                                If the tested value is null, returns true; otherwise, returns false.
Null operator: is not null                            If the tested value is not null, returns true; otherwise, returns false.
Dereference operators: tuple (.) or map (#)           If the de-referenced tuple or map is null, returns null.
Operators: COGROUP, GROUP, JOIN                       These operators handle nulls differently from one another (see the Pig documentation).
Function: COUNT_STAR                                  This function counts all values, including nulls.
Cast operator                                         Casting a null from one type to another type results in a null.
Functions: AVG, MIN, MAX, SUM, COUNT                  These functions ignore nulls.
Function: CONCAT                                      If either subexpression is null, the resulting expression is null.
Function: SIZE                                        If the tested object is null, returns null.
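
The rules above can be modeled in a few lines of plain Python, with None standing in for null. This is only a sketch of the semantics, not Pig code: all the pig_* helpers are hypothetical names, and returning null from AVG when there are no non-null values is an assumption of the sketch.

```python
# A plain-Python model of Pig's null rules; None stands in for null.
# The pig_* helpers are hypothetical and are not part of any Pig API.

def pig_eq(a, b):
    """== : the result is null if either side is null."""
    return None if a is None or b is None else a == b

def pig_add(a, b):
    """+ : the result is null if either side is null."""
    return None if a is None or b is None else a + b

def pig_concat(a, b):
    """CONCAT: the result is null if either argument is null."""
    return None if a is None or b is None else a + b

def pig_count(values):
    """COUNT: ignores nulls."""
    return sum(1 for v in values if v is not None)

def pig_count_star(values):
    """COUNT_STAR: counts every value, nulls included."""
    return len(values)

def pig_avg(values):
    """AVG: ignores nulls (assumption: null when nothing is left)."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None

ages = [45, None, 41]
print(pig_eq(None, 45))                        # None
print(pig_concat("Sr", None))                  # None
print(pig_count(ages), pig_count_star(ages))   # 2 3
print(pig_avg(ages))                           # 43.0
```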

Operators

The following operators are available in Pig. This post uses LOAD, GROUP, COGROUP, DUMP, DESCRIBE, EXPLAIN and ILLUSTRATE.

Operator            Description

Loading and Storing
LOAD                To load data from the file system (local/HDFS) into a relation.
STORE               To save a relation to the file system (local/HDFS).

Filtering
FILTER              To remove unwanted rows from a relation.
DISTINCT            To remove duplicate rows from a relation.
FOREACH, GENERATE   To generate data transformations based on columns of data.
STREAM              To transform a relation using an external program.

Grouping and Joining
JOIN                To join two or more relations.
COGROUP             To group the data in two or more relations.
GROUP               To group the data in a single relation.
CROSS               To create the cross product of two or more relations.

Sorting
ORDER               To arrange a relation in sorted order based on one or more fields (ascending or descending).
LIMIT               To get a limited number of tuples from a relation.

Combining and Splitting
UNION               To combine two or more relations into a single relation.
SPLIT               To split a single relation into two or more relations.

Diagnostic Operators
DUMP                To print the contents of a relation on the console.
DESCRIBE            To describe the schema of a relation.
EXPLAIN             To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE          To view the step-by-step execution of a series of statements.

Pig Latin statements

Let’s use the following as a sample table:

emp id   name       age   Desig
1        Dharma     45    Sr Manager
2        Bheema     43    Cook
3        Arjuna     41    Instructor
4        Nakula     35    Jr Instructor
5        Sahadeva   33    Jr Instructor

grunt> fs -cat lab23/employee.csv
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor


Load, Describe, Illustrate, Dump, Explain

grunt> employee = load 'lab23/employee.csv' using PigStorage(',') as (emp_id:int,emp_name:chararray,emp_age:int,emp_desig:chararray);

Relation name: employee
Input file path: lab23/employee.csv (a relative path)
Storage function: PigStorage(), which loads and stores data as structured text files. Its default delimiter is tab (\t); here we pass a comma instead.
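
To make the load step concrete, here is a minimal Python sketch of what splitting on a delimiter and applying a schema amounts to. It is a model, not Pig's implementation: load_like_pigstorage is a hypothetical helper, and the schema is written as (name, Python type) pairs purely for illustration.

```python
# Sketch of LOAD ... USING PigStorage(',') AS (...), modeled in plain Python.
# load_like_pigstorage is a hypothetical helper, not a Pig API.

def load_like_pigstorage(lines, delimiter, schema):
    """Split each line on the delimiter and cast each field per the schema."""
    relation = []
    for line in lines:
        raw_fields = line.rstrip("\n").split(delimiter)
        # Pair every raw field with the cast function declared for its column.
        row = tuple(cast(raw) for (_, cast), raw in zip(schema, raw_fields))
        relation.append(row)
    return relation

schema = [("emp_id", int), ("emp_name", str), ("emp_age", int), ("emp_desig", str)]
lines = [
    "1, Dharma, 45, Sr Manager",
    "2, Bheema, 43, Cook",
]
employee = load_like_pigstorage(lines, ",", schema)
print(employee[0])  # (1, ' Dharma', 45, ' Sr Manager')
```

The string fields keep their leading space because the split happens on the comma alone. One difference from this sketch: where Python's int() raises an error on bad input, Pig turns a value it cannot cast into null.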

grunt> describe employee;
employee: {emp_id: int,emp_name: chararray,emp_age: int,emp_desig: chararray}

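grunt> illustrate employee;
--------------------------------------------------------------------------------------------------
| employee     | emp_id:int    | emp_name:chararray    | emp_age:int    | emp_desig:chararray    |
--------------------------------------------------------------------------------------------------
|              | 5             |  Sahadeva             |  33            |  Jr Instructor         |
--------------------------------------------------------------------------------------------------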

grunt> dump employee;
(1, Dharma,45, Sr Manager)
(2, Bheema,43, Cook)
(3, Arjuna,41, Instructor)
(4, Nakula,35, Jr Instructor)
(5, Sahadeva,33, Jr Instructor)

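This executes a MapReduce job that reads the data from HDFS and prints the contents on the screen. Note the leading space in each chararray field: it comes from the space after each comma in the CSV, because PigStorage splits on the delimiter only and does not trim the values.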

grunt> explain employee;

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-119
Map Plan
employee: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-118
|
|---employee: New For Each(false,false,false,false)[bag] - scope-117
|   |
|   Cast[int] - scope-106
|   |
|   |---Project[bytearray][0] - scope-105
|   |
|   Cast[chararray] - scope-109
|   |
|   |---Project[bytearray][1] - scope-108
|   |
|   Cast[int] - scope-112
|   |
|   |---Project[bytearray][2] - scope-111
|   |
|   Cast[chararray] - scope-115
|   |
|   |---Project[bytearray][3] - scope-114
|
|---employee: Load(hdfs://gandhari:9000/user/hadoop/lab23/employee.csv:PigStorage(',')) - scope-104--------
Global sort: false
----------------

Group

grunt> edesig = group employee by emp_desig;

grunt> dump edesig;
( Cook,{(2, Bheema,43, Cook)})
( Instructor,{(3, Arjuna,41, Instructor)})
( Sr Manager,{(1, Dharma,45, Sr Manager)})
( Jr Instructor,{(5, Sahadeva,33, Jr Instructor),(4, Nakula,35, Jr Instructor)})

Think about this in MapReduce terms: to get the same result by hand, I would have had to write a mapper extending Mapper, a reducer extending Reducer, and a driver 🙂
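
For intuition, the grouping itself can be modeled in plain Python. This is a sketch of what GROUP produces, not how Pig executes it: group_by is a hypothetical helper, and the rows are typed in by hand to match the dump above (leading spaces included).

```python
from collections import defaultdict

# Rows as they look after the load (leading spaces kept from the CSV).
employee = [
    (1, " Dharma", 45, " Sr Manager"),
    (2, " Bheema", 43, " Cook"),
    (3, " Arjuna", 41, " Instructor"),
    (4, " Nakula", 35, " Jr Instructor"),
    (5, " Sahadeva", 33, " Jr Instructor"),
]

def group_by(relation, key):
    """Return {group key: bag of whole tuples}, modeling GROUP ... BY."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return dict(groups)

# group employee by emp_desig
edesig = group_by(employee, key=lambda row: row[3])
print(len(edesig[" Jr Instructor"]))  # 2

# group employee by (emp_age, emp_desig): the key becomes a tuple
e_age_desig = group_by(employee, key=lambda row: (row[2], row[3]))
print(e_age_desig[(45, " Sr Manager")])  # [(1, ' Dharma', 45, ' Sr Manager')]
```

Each group holds the complete original tuples, which is why the grouping column shows up both as the key and inside every tuple of the bag.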

Let’s do one more grouping by age

grunt> eage = group employee by emp_age;

grunt> dump eage;
(33,{(5, Sahadeva,33, Jr Instructor)})
(35,{(4, Nakula,35, Jr Instructor)})
(41,{(3, Arjuna,41, Instructor)})
(43,{(2, Bheema,43, Cook)})
(45,{(1, Dharma,45, Sr Manager)})

Let’s group on both columns at once.

grunt> e_age_desig = group employee by (emp_age, emp_desig);

grunt> dump e_age_desig;
((33, Jr Instructor),{(5, Sahadeva,33, Jr Instructor)})
((35, Jr Instructor),{(4, Nakula,35, Jr Instructor)})
((41, Instructor),{(3, Arjuna,41, Instructor)})
((43, Cook),{(2, Bheema,43, Cook)})
((45, Sr Manager),{(1, Dharma,45, Sr Manager)})

Yes. Writing MapReduce code for simple tasks is time consuming. It is better to use the right tool for the job.

Co-group

In addition to employee, we have one other table, student, shown below.

grunt> fs -cat lab23/student.csv;
1, Duryodhana, 15, 11
2, Dushasana, 14, 10
3, Dushala, 13, 9
4, Dronacharya,45, 12

grunt> student = load 'lab23/student.csv' using PigStorage(',') as (stud_id:int,stud_name:chararray,stud_age:int,stud_class:chararray);

grunt> illustrate student;
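-------------------------------------------------------------------------------------------------
| student     | stud_id:int   | stud_name:chararray   | stud_age:int   | stud_class:chararray   |
-------------------------------------------------------------------------------------------------
|             | 3             |  Dushala              |  13            |  9                     |
-------------------------------------------------------------------------------------------------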

grunt> cogroupdata = COGROUP student by stud_age, employee by emp_age;

grunt> dump cogroupdata;

(13,{(3, Dushala,13, 9)},{})
(14,{(2, Dushasana,14, 10)},{})
(15,{(1, Duryodhana,15, 11)},{})
(33,{},{(5, Sahadeva,33, Jr Instructor)})
(35,{},{(4, Nakula,35, Jr Instructor)})
(41,{},{(3, Arjuna,41, Instructor)})
(43,{},{(2, Bheema,43, Cook)})
(45,{(4, Dronacharya,45, 12)},{(1, Dharma,45, Sr Manager)})

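COGROUP is similar to GROUP, but it works across two or more relations: for each key it returns one bag per relation. A bag is empty ({}) when that relation has no tuple with the key. In the output above, only age 45 appears in both student and employee.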
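
The same idea in plain Python, as a sketch only: cogroup is a hypothetical helper, the rows are copied by hand from the two files, and the keys are sorted just to get stable, readable output (that ordering is an assumption of the sketch, not a guarantee made by Pig). Null keys are not modeled here.

```python
def cogroup(left, left_key, right, right_key):
    """For every key found in either relation, return (key, left bag, right bag)."""
    keys = sorted({left_key(r) for r in left} | {right_key(r) for r in right})
    return [
        (
            k,
            [r for r in left if left_key(r) == k],    # empty list plays the role of {}
            [r for r in right if right_key(r) == k],
        )
        for k in keys
    ]

student = [
    (1, " Duryodhana", 15, " 11"),
    (2, " Dushasana", 14, " 10"),
    (3, " Dushala", 13, " 9"),
    (4, " Dronacharya", 45, " 12"),
]
employee = [
    (1, " Dharma", 45, " Sr Manager"),
    (2, " Bheema", 43, " Cook"),
    (3, " Arjuna", 41, " Instructor"),
    (4, " Nakula", 35, " Jr Instructor"),
    (5, " Sahadeva", 33, " Jr Instructor"),
]

# COGROUP student by stud_age, employee by emp_age
cogroupdata = cogroup(student, lambda r: r[2], employee, lambda r: r[2])
print(cogroupdata[0])   # (13, [(3, ' Dushala', 13, ' 9')], [])
print(cogroupdata[-1])  # (45, [(4, ' Dronacharya', 45, ' 12')], [(1, ' Dharma', 45, ' Sr Manager')])
```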

 

Ref: http://pig.apache.org
