# Lab 23: Apache Pig Basics

Hi ETL enthusiast,

This post will talk about important concepts in Pig.

PIG Conventions

| Convention | Description | Example |
| --- | --- | --- |
| ( ) | tuple data type | (John,18,4.0F) |
| { } | bag data type | (1,{(1,2,3)}) (4,{(4,2,1),(4,3,3)}) (8,{(8,3,4)}) |
| [ ] | map data type | [name#John,phone#5551212] |

### Relations, Bags, Tuples, Fields

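A field is a piece of data, e.g. Jessica.

A tuple is an ordered set of fields, e.g. (Jessica, F, 35, NY).

A bag is a collection of tuples, e.g. {(Jessica, F, 35, NY),(Nathan, M, 35, NJ)}.

A relation is a bag (more specifically, an outer bag); it is what Pig Latin statements such as LOAD and GROUP produce and consume.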
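
To make these terms concrete, here is a rough Python analogy. It is only an illustration of the data model, not how Pig stores anything internally, and the variable names are made up for this sketch.

```python
# A rough Python analogy for Pig's data model (illustration only).

# A field is a single piece of data.
field = "Jessica"

# A tuple is an ordered set of fields; a Python tuple is a close match.
jessica = ("Jessica", "F", 35, "NY")
nathan = ("Nathan", "M", 35, "NJ")

# A bag is a collection of tuples and may contain duplicates,
# so a list is a closer analogy than a set.
bag = [jessica, nathan]

# A map is a set of key/value pairs, written [key#value] in Pig.
contact = {"name": "John", "phone": "5551212"}

# Fields in a tuple can be addressed by position ($0, $1, ... in Pig).
first_names = [t[0] for t in bag]
print(first_names)  # ['Jessica', 'Nathan']
```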

### Data Types

Data Types: Simple and Complex

| Type | Description | Example |
| --- | --- | --- |
| **Simple Types** | | |
| int | Signed 32-bit integer | 10 |
| long | Signed 64-bit integer | Data: 10L or 10l; Display: 10L |
| float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F |
| double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0 |
| chararray | Character array (string) in Unicode UTF-8 format | hello world |
| bytearray | Byte array (blob) | |
| boolean | boolean | true/false (case insensitive) |
| datetime | datetime | 1970-01-01T00:00:00.000+00:00 |
| **Complex Types** | | |
| tuple | An ordered set of fields. | (19,2) |
| bag | A collection of tuples. | {(19,2), (18,1)} |
| map | A set of key value pairs. | [open#apache] |

### Nulls, Operators, and Functions

| Operator | Interaction |
| --- | --- |
| Comparison operators: `==`, `!=`, `>`, `<`, `>=`, `<=` | If either subexpression is null, the result is null. |
| Comparison operator: `matches` | If either the string being matched against or the string defining the match is null, the result is null. |
| Arithmetic operators: `+`, `-`, `*`, `/`, `%` (modulo), `? :` (bincond) | If either subexpression is null, the resulting expression is null. |
| Null operator: `is null` | If the tested value is null, returns true; otherwise, returns false (see Null Operators). |
| Null operator: `is not null` | If the tested value is not null, returns true; otherwise, returns false (see Null Operators). |
| Dereference operators: tuple (`.`) or map (`#`) | If the de-referenced tuple or map is null, returns null. |
| Operators: COGROUP, GROUP, JOIN | These operators handle nulls differently (see examples below). |
| Function: COUNT_STAR | This function counts all values, including nulls. |
| Cast operator | Casting a null from one type to another type results in a null. |
| Functions: AVG, MIN, MAX, SUM, COUNT | These functions ignore nulls. |
| Function: CONCAT | If either subexpression is null, the resulting expression is null. |
| Function: SIZE | If the tested object is null, returns null. |
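
A few of these null rules are easier to see in action. The sketch below models them in plain Python, using `None` for Pig's null. The helper names (`add`, `equals`, `count`, `count_star`, `avg`) are hypothetical and exist only for this illustration; they are not Pig APIs.

```python
# Plain-Python model of a few null rules, with None standing in for null.

def add(a, b):
    """Arithmetic: if either operand is null, the result is null."""
    return None if a is None or b is None else a + b

def equals(a, b):
    """Comparison: if either operand is null, the result is null (not False)."""
    return None if a is None or b is None else a == b

def count(values):
    """COUNT ignores nulls."""
    return sum(1 for v in values if v is not None)

def count_star(values):
    """COUNT_STAR counts every value, including nulls."""
    return len(values)

def avg(values):
    """AVG ignores nulls. Returning None for an all-null input is an
    assumption made for this sketch."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

ages = [45, None, 41, 35]
print(add(1, None))        # None
print(equals(None, None))  # None
print(count(ages))         # 3
print(count_star(ages))    # 4
print(avg(ages))           # average of 45, 41 and 35 only
```

Note that comparing null with null gives null rather than true, which is why `is null` exists as a separate operator.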

### Operators

The following operators are available in Pig. This post uses a handful of them, including LOAD, GROUP, COGROUP, and the diagnostic operators.

| Operator | Description |
| --- | --- |
| LOAD | To load the data from the file system (local/HDFS) into a relation. |
| STORE | To save a relation to the file system (local/HDFS). |
| **Filtering** | |
| FILTER | To remove unwanted rows from a relation. |
| DISTINCT | To remove duplicate rows from a relation. |
| FOREACH, GENERATE | To generate data transformations based on columns of data. |
| STREAM | To transform a relation using an external program. |
| **Grouping and Joining** | |
| JOIN | To join two or more relations. |
| COGROUP | To group the data in two or more relations. |
| GROUP | To group the data in a single relation. |
| CROSS | To create the cross product of two or more relations. |
| **Sorting** | |
| ORDER | To arrange a relation in a sorted order based on one or more fields (ascending or descending). |
| LIMIT | To get a limited number of tuples from a relation. |
| **Combining and Splitting** | |
| UNION | To combine two or more relations into a single relation. |
| SPLIT | To split a single relation into two or more relations. |
| **Diagnostic Operators** | |
| DUMP | To print the contents of a relation on the console. |
| DESCRIBE | To describe the schema of a relation. |
| EXPLAIN | To view the logical, physical, or MapReduce execution plans to compute a relation. |
| ILLUSTRATE | To view the step-by-step execution of a series of statements. |

### Pig Latin statements

Let’s use the following as a sample table

| emp id | name | age | Desig |
| --- | --- | --- | --- |
| 1 | Dharma | 45 | Sr Manager |
| 2 | Bheema | 43 | Cook |
| 3 | Arjuna | 41 | Instructor |
| 4 | Nakula | 35 | Jr Instructor |
| 5 | Sahadeva | 33 | Jr Instructor |

grunt> fs -cat lab23/employee.csv
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor

grunt> employee = load 'lab23/employee.csv' using PigStorage(',') as (emp_id:int,emp_name:chararray,emp_age:int,emp_desig:chararray);

Relation name: employee
Input file path: lab23/employee.csv (I have used relative path)
Storage function: We have used the PigStorage() function. It loads and stores data as structured text files. Default delimiter is \t. We use comma here.
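
To picture what the load step does with each line, here is a minimal Python sketch: split on the delimiter, then cast each field to the type named in the schema. `load_line` is a hypothetical helper written for this post, not part of Pig. Turning values that fail to cast into `None` mirrors how Pig turns values it cannot convert into nulls.

```python
def load_line(line, delimiter=",", schema=(int, str, int, str)):
    """Split one text line on the delimiter and cast each field.

    Hypothetical helper for illustration only. Fields that fail to cast
    become None, standing in for Pig's null.
    """
    fields = line.rstrip("\n").split(delimiter)
    record = []
    for cast, raw in zip(schema, fields):
        try:
            record.append(cast(raw))
        except ValueError:
            record.append(None)
    return tuple(record)

row = load_line("1, Dharma, 45, Sr Manager")
print(row)  # (1, ' Dharma', 45, ' Sr Manager')
```

Notice that the space after each comma stays inside the string fields. The same thing shows up in the Pig output in this post, where names appear as ` Dharma` with a leading space. If that matters for your data, Pig's built-in TRIM function is one way to clean it up.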

grunt> describe employee;
employee: {emp_id: int,emp_name: chararray,emp_age: int,emp_desig: chararray}

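grunt> illustrate employee;
--------------------------------------------------------------------------------------------------
| employee     | emp_id:int    | emp_name:chararray    | emp_age:int    | emp_desig:chararray    |
--------------------------------------------------------------------------------------------------
|              | 5             |  Sahadeva             |  33            |  Jr Instructor         |
--------------------------------------------------------------------------------------------------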

grunt> dump employee;
(1, Dharma,45, Sr Manager)
(2, Bheema,43, Cook)
(3, Arjuna,41, Instructor)
(4, Nakula,35, Jr Instructor)

This will execute a MapReduce job to read data from HDFS and print the content on the screen.

grunt> explain employee;

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-119
Map Plan
employee: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-118
|
|---employee: New For Each(false,false,false,false)[bag] - scope-117
|   |
|   Cast[int] - scope-106
|   |
|   |---Project[bytearray][0] - scope-105
|   |
|   Cast[chararray] - scope-109
|   |
|   |---Project[bytearray][1] - scope-108
|   |
|   Cast[int] - scope-112
|   |
|   |---Project[bytearray][2] - scope-111
|   |
|   Cast[chararray] - scope-115
|   |
|   |---Project[bytearray][3] - scope-114
|
Global sort: false
----------------

### Group

grunt> edesig = group employee by emp_desig;

grunt> dump edesig;
( Cook,{(2, Bheema,43, Cook)})
( Instructor,{(3, Arjuna,41, Instructor)})
( Sr Manager,{(1, Dharma,45, Sr Manager)})
( Jr Instructor,{(5, Sahadeva,33, Jr Instructor),(4, Nakula,35, Jr Instructor)})
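
The shape of that result, one tuple per group holding the group key and a bag of the matching rows, can be modelled in a few lines of Python. This is only a sketch of the semantics, not of how Pig executes the statement; `group_by` is a hypothetical helper, and the rows are typed in by hand from the sample table.

```python
from collections import defaultdict

# Rows as they come out of the load step, leading spaces included.
employee = [
    (1, " Dharma", 45, " Sr Manager"),
    (2, " Bheema", 43, " Cook"),
    (3, " Arjuna", 41, " Instructor"),
    (4, " Nakula", 35, " Jr Instructor"),
    (5, " Sahadeva", 33, " Jr Instructor"),
]

def group_by(relation, key):
    """Return a list of (group, bag) pairs; key maps a row to its group value."""
    bags = defaultdict(list)
    for row in relation:
        bags[key(row)].append(row)
    return list(bags.items())

# Equivalent in spirit to: edesig = group employee by emp_desig;
edesig = group_by(employee, key=lambda row: row[3])
for group, bag in edesig:
    print(group, bag)
```

The order of the groups and of the rows inside each bag is not meaningful here, just as it is not in the Pig output.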

Just think about this in terms of MapReduce. To get this far, I would have had to write a mapper extending Mapper, a reducer extending Reducer, and a driver 🙂

Let’s do one more grouping by age

grunt> eage = group employee by emp_age;

grunt> dump eage;
(35,{(4, Nakula,35, Jr Instructor)})
(41,{(3, Arjuna,41, Instructor)})
(43,{(2, Bheema,43, Cook)})
(45,{(1, Dharma,45, Sr Manager)})

Let’s do a grouping based on both the columns.

grunt> e_age_desig = group employee by (emp_age, emp_desig);

grunt> dump e_age_desig;
((33, Jr Instructor),{(5, Sahadeva,33, Jr Instructor)})
((35, Jr Instructor),{(4, Nakula,35, Jr Instructor)})
((41, Instructor),{(3, Arjuna,41, Instructor)})
((43, Cook),{(2, Bheema,43, Cook)})
((45, Sr Manager),{(1, Dharma,45, Sr Manager)})
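
With two grouping columns, the group key is itself a tuple. The Python sketch below shows the same idea using a sort followed by `itertools.groupby`, which is loosely similar to the sort-and-shuffle step a hand-written MapReduce job would rely on. The function name `age_and_desig` is made up for this example, and the rows are again typed in from the sample table.

```python
from itertools import groupby

employee = [
    (1, " Dharma", 45, " Sr Manager"),
    (2, " Bheema", 43, " Cook"),
    (3, " Arjuna", 41, " Instructor"),
    (4, " Nakula", 35, " Jr Instructor"),
    (5, " Sahadeva", 33, " Jr Instructor"),
]

def age_and_desig(row):
    """Composite key: (emp_age, emp_desig)."""
    return (row[2], row[3])

# Sort by the key, then collect consecutive rows that share it.
ordered = sorted(employee, key=age_and_desig)
e_age_desig = [(key, list(rows)) for key, rows in groupby(ordered, key=age_and_desig)]

for key, bag in e_age_desig:
    print(key, bag)
```

Because no two employees share both age and designation, every bag holds exactly one row, and the two Jr Instructors now land in separate groups.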

Yes. Writing MapReduce code for simple tasks is time-consuming. Instead, we need to pick the right tool to get the job done.

### Co-group

In addition to employee, we have one other table, student, as given below.

grunt> fs -cat lab23/student.csv;
1, Duryodhana, 15, 11
2, Dushasana, 14, 10
3, Dushala, 13, 9
4, Dronacharya,45, 12

grunt> student = load 'lab23/student.csv' using PigStorage(',') as (stud_id:int,stud_name:chararray,stud_age:int,stud_class:chararray);

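grunt> illustrate student;
-------------------------------------------------------------------------------------------------
| student     | stud_id:int   | stud_name:chararray   | stud_age:int   | stud_class:chararray   |
-------------------------------------------------------------------------------------------------
|             | 3             |  Dushala              |  13            |  9                     |
-------------------------------------------------------------------------------------------------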

grunt> cogroupdata = COGROUP student by stud_age, employee by emp_age;

grunt> dump cogroupdata;

(13,{(3, Dushala,13, 9)},{})
(14,{(2, Dushasana,14, 10)},{})
(15,{(1, Duryodhana,15, 11)},{})