Lab 24: Joins using Pig

Let’s make some joins today, dear hadoopers.

pig-on-elephant

Join

We have a table for employee.

grunt> dump employee;
 (1, Dharma,45, Sr Manager)
 (2, Bheema,43, Cook)
 (3, Arjuna,41, Instructor)
 (4, Nakula,35, Jr Instructor)
 (5, Sahadeva,33, Jr Instructor)

We have another table salary as given below.

grunt> dump salary;

(1,2000)
 (2,1500)
 (3,1000)
 (4,500)
 (5,500)

Let’s join salary table with employee table on emp_id field.

grunt> salary_table = JOIN employee by emp_id, salary by emp_id;

grunt> dump salary_table;
 Input(s):
 Successfully read 5 records from: "hdfs://gandhari:9000/user/hadoop/lab23/salary.csv"
 Successfully read 5 records from: "hdfs://gandhari:9000/user/hadoop/lab23/employee.csv"

Output(s):
 Successfully stored 5 records (16134806 bytes) in: "hdfs://gandhari:9000/tmp/temp-2101978857/tmp596113376"

Counters:
 Total records written : 5
 Total bytes written : 16134806
 Spillable Memory Manager spill count : 0
 Total bags proactively spilled: 0
 Total records proactively spilled: 0

(1, Dharma,45, Sr Manager,1,2000)
 (2, Bheema,43, Cook,2,1500)
 (3, Arjuna,41, Instructor,3,1000)
 (4, Nakula,35, Jr Instructor,4,500)
 (5, Sahadeva,33, Jr Instructor,5,500)

Inner Join

The employee table looks like this.

(1, dharma,45, Sr Manager)
 (2, Bheema,43, Cook)
 (3, Arjuna,41, Instructor)
 (4, Nakula,35, Jr Instructor)
 (5, Sahadeva,33, Jr Instructor)

grunt> employee = load 'employee.csv' using PigStorage (',') as (emp_id:int,emp_name:chararray,emp_age:int,emp_desig:chararray);

grunt> describe employee;
 employee: {emp_id: int,emp_name: chararray,emp_age: int,emp_desig: chararray}

There is a change in salary table. No salary for Arjuna this month 🙂 Id 3 is missing in salary table.

grunt> dump salary;
 (1,2000)
 (2,1500)
 (4,500)
 (5,500)

grunt> salary = load 'salary.csv' using PigStorage (',') as (emp_id:int,emp_salary:int);

grunt> describe salary;
 salary: {emp_id: int,emp_salary: int}

Lets do an inner join now.

grunt> employee_salary = join employee by emp_id, salary by emp_id;

grunt> describe employee_salary;
 employee_salary: {employee::emp_id: int,employee::emp_name: chararray,employee::emp_age: int,employee::emp_desig: chararray,salary::emp_id: int,salary::emp_salary: int}

grunt> illustrate employee_salary;

--------------------------------------------------------------------------------------------------
 | employee     | emp_id:int    | emp_name:chararray    | emp_age:int    | emp_desig:chararray    |
 --------------------------------------------------------------------------------------------------
 |              | 5             |  Sahadeva             | 33             |  Jr Instructor         |
 |              | 5             |  Sahadeva             | 33             |  Jr Instructor         |
 --------------------------------------------------------------------------------------------------
 --------------------------------------------------
 | salary     | emp_id:int    | emp_salary:int    |
 --------------------------------------------------
 |            | 5             | 500               |
 |            | 5             | 500               |
 --------------------------------------------------
 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 | employee_salary     | employee::emp_id:int    | employee::emp_name:chararray    | employee::emp_age:int    | employee::emp_desig:chararray    | salary::emp_id:int    | salary::emp_salary:int    |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 |                     | 5                       |  Sahadeva                       | 33                       |  Jr Instructor                   | 5                     | 500                       |
 |                     | 5                       |  Sahadeva                       | 33                       |  Jr Instructor                   | 5                     | 500                       |
 |                     | 5                       |  Sahadeva                       | 33                       |  Jr Instructor                   | 5                     | 500                       |
 |                     | 5                       |  Sahadeva                       | 33                       |  Jr Instructor                   | 5                     | 500                       |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

grunt> dump employee_salary;

(1, dharma,45, Sr Manager,1,2000)
 (2, Bheema,43, Cook,2,1500)
 (4, Nakula,35, Jr Instructor,4,500)
 (5, Sahadeva,33, Jr Instructor,5,500)

Left Outer Join

grunt> left_outer = join employee by emp_id left outer, salary by emp_id;

grunt> describe left_outer;
 left_outer: {employee::emp_id: int,employee::emp_name: chararray,employee::emp_age: int,employee::emp_desig: chararray,salary::emp_id: int,salary::emp_salary: int}

grunt> dump left_outer;

(1, dharma,45, Sr Manager,1,2000)
 (2, Bheema,43, Cook,2,1500)
 (3, Arjuna,41, Instructor,,)
 (4, Nakula,35, Jr Instructor,4,500)
 (5, Sahadeva,33, Jr Instructor,5,500)

We get all the records from employee table with null value for those value is not given in salary table.

Right Outer Join

Lets do a right outer join now

grunt> right_outer = join salary by emp_id right, employee by emp_id;

grunt> describe right_outer;
 right_outer: {salary::emp_id: int,salary::emp_salary: int,employee::emp_id: int,employee::emp_name: chararray,employee::emp_age: int,employee::emp_desig: chararray}

grunt> dump right_outer;

(1,2000,1, dharma,45, Sr Manager)
 (2,1500,2, Bheema,43, Cook)
 (,,3, Arjuna,41, Instructor)
 (4,500,4, Nakula,35, Jr Instructor)
 (5,500,5, Sahadeva,33, Jr Instructor)

We could see all the records from right table even if there is no match in the left table.

Full outer join

Last topic of this blog post – Lets do a full outer join, to record from each table when there is at least a single match.

grunt> full_outer = join employee by emp_id full outer, salary by emp_id;
 grunt> describe full_outer;
 full_outer: {employee::emp_id: int,employee::emp_name: chararray,employee::emp_age: int,employee::emp_desig: chararray,salary::emp_id: int,salary::emp_salary: int}

grunt> dump full_outer;

(1, dharma,45, Sr Manager,1,2000)
 (2, Bheema,43, Cook,2,1500)
 (3, Arjuna,41, Instructor,,)
 (4, Nakula,35, Jr Instructor,4,500)
 (5, Sahadeva,33, Jr Instructor,5,500)

Can we join by two keys? yes. join employee by (emp_id, emp_name)!

Cross

Lets do a cross to bring the X relationship between tables. To make it simpler, I have taken less data in our employee table.

grunt> employee = load 'employee.csv' using PigStorage (',') as (emp_id:int,emp_name:chararray,emp_age:int,emp_desig:chararray);

grunt> dump employee;

(1, Dharma,45, Sr Manager)
(2, Bheema,43, Cook)

grunt> leave = load 'leave.csv' using PigStorage (',') as (emp_id:int,emp_leave_date:chararray);

grunt> dump leave;

(1,2016-10-11)
(2,2016-10-12)

grunt> cross_data = cross employee, leave;

grunt> describe cross_data;
cross_data: {employee::emp_id: int,employee::emp_name: chararray,employee::emp_age: int,employee::emp_desig: chararray,leave::emp_id: int,leave::emp_leave_date: chararray}

grunt> dump cross_data;

(2, Bheema,43, Cook,2,2016-10-12)
(2, Bheema,43, Cook,1,2016-10-11)
(1, Dharma,45, Sr Manager,2,2016-10-12)
(1, Dharma,45, Sr Manager,1,2016-10-11)

See you in another interesting post.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s