
Lab 22: Getting started with Apache Pig

Hi Hadoopers,

I'm happy to start another series of blog posts on Big Data – Apache Pig!

Pig – Eats everything!

Here are the basic commands

If you are looking for installation, you can find it here.


Version

$ pig -i
Apache Pig version 0.16.0 (r1746530)
compiled Jun 01 2016, 23:10:49

Launch Pig

The following command launches Pig in local mode, which works against the local file system. The -l option sets the path for the client-side log file.

$ pig -l /tmp -x local

... ... ...

2016-10-09 03:02:04,064 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

... ... ...

grunt> ls
file:/opt/hadoop-2.6.4/pigfarm/pig_1475951593891.log       2615

The following command launches Pig in MapReduce mode, where it connects to HDFS.

$ pig -l /tmp -x mapreduce

2016-10-09 03:03:24,768 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://gandhari:9000
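
In either mode, you can also run a Pig Latin script non-interactively instead of typing statements into the Grunt shell. A minimal sketch, assuming your statements are saved in a file named employee.pig (an illustrative file name):

$ pig -x local employee.pig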

Some of the basic commands

grunt> fs -ls 

Found 20 items 
drwxr-xr-x   - hadoop supergroup          0 2016-10-06 06:07 dc 
drwxr-xr-x   - hadoop supergroup          0 2016-10-08 07:13 feed 
drwxr-xr-x   - hadoop supergroup          0 2016-09-10 10:51 lab01 
drwxr-xr-x   - hadoop supergroup          0 2016-09-10 15:20 lab03 
drwxr-xr-x   - hadoop supergroup          0 2016-09-12 14:04 lab07 
drwxr-xr-x   - hadoop supergroup          0 2016-09-15 05:39 lab08 
drwxr-xr-x   - hadoop supergroup          0 2016-09-17 07:58 lab09 
drwxr-xr-x   - hadoop supergroup          0 2016-09-17 15:47 lab10 
drwxr-xr-x   - hadoop supergroup          0 2016-09-22 21:07 lab13 
drwxr-xr-x   - hadoop supergroup          0 2016-09-25 00:40 lab15 
drwxr-xr-x   - hadoop supergroup          0 2016-10-02 18:17 lab16 
drwxr-xr-x   - hadoop supergroup          0 2016-10-08 09:25 lab17 
drwxr-xr-x   - hadoop supergroup          0 2016-10-08 11:27 lab18 
drwxr-xr-x   - hadoop supergroup          0 2016-10-08 19:35 lab19 
drwxr-xr-x   - hadoop supergroup          0 2016-10-08 19:36 lab20 
drwxr-xr-x   - hadoop supergroup          0 2016-10-09 00:54 lab21 
drwxr-xr-x   - hadoop supergroup          0 2016-09-11 09:41 output 
drwxr-xr-x   - hadoop supergroup          0 2016-08-27 08:40 share 
drwxr-xr-x   - hadoop supergroup          0 2016-09-04 15:41 trial1 
drwxr-xr-x   - hadoop supergroup          0 2016-09-04 16:00 trial2 
grunt> pwd
hdfs://gandhari:9000/user/hadoop
grunt> fs -cp /user/hadoop/lab20/input/*.csv lab22 
grunt> fs -cat /user/hadoop/lab20/input/employee.csv 
101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati 
102,Bheema,Pandu,Kunti,Hidimbi
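
Any HDFS shell command can be invoked from Grunt through the fs keyword. For example (these paths are only illustrative):

grunt> fs -mkdir /user/hadoop/scratch
grunt> fs -rm -r /user/hadoop/scratch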

Load

Creating a relation from the employee.csv file:

grunt> a = load '/user/hadoop/lab20/input/employee.csv' using PigStorage(',') as (empid:int,emp_name:chararray,fathers_name:chararray,mothers_name:chararray,wifes_name:chararray);
2016-10-09 03:32:44,930 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
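
The schema in the as clause is optional. If you leave it out, every field defaults to bytearray and has to be referenced positionally as $0, $1 and so on. A minimal sketch of the same load without a schema (raw and names are just illustrative relation names):

grunt> raw = load '/user/hadoop/lab20/input/employee.csv' using PigStorage(',');
grunt> names = foreach raw generate $0, $1;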

Describe

It prints the schema of the relation:

grunt> describe a;
a: {empid: int,emp_name: chararray,fathers_name: chararray,mothers_name: chararray,wifes_name: chararray}

Dump

It dumps the contents of the relation to the screen:

grunt> dump a;
(101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati)
(102,Bheema,Pandu,Kunti,Hidimbi)
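
Note that dump executes the whole plan, so on a big data set it is safer to preview a few records with limit first. A minimal sketch (few is an illustrative name):

grunt> few = limit a 1;
grunt> dump few;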

Explain

explain shows the logical, physical, and MapReduce execution plans used to compute the relation.

grunt> explain a;
2016-10-09 03:41:37,719 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-10-09 03:41:37,720 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-10-09 03:41:37,721 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
a: (Name: LOStore Schema: empid#31:int,emp_name#32:chararray,fathers_name#33:chararray,mothers_name#34:chararray,wifes_name#35:chararray)
|
|---a: (Name: LOForEach Schema: empid#31:int,emp_name#32:chararray,fathers_name#33:chararray,mothers_name#34:chararray,wifes_name#35:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false,false,false] Schema: empid#31:int,emp_name#32:chararray,fathers_name#33:chararray,mothers_name#34:chararray,wifes_name#35:chararray)ColumnPrune:OutputUids=[32, 33, 34, 35, 31]ColumnPrune:InputUids=[32, 33, 34, 35, 31]
    |   |   |
    |   |   (Name: Cast Type: int Uid: 31)
    |   |   |
    |   |   |---empid:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 32)
    |   |   |
    |   |   |---emp_name:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 33)
    |   |   |
    |   |   |---fathers_name:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 34)
    |   |   |
    |   |   |---mothers_name:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 35)
    |   |   |
    |   |   |---wifes_name:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: empid#31:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: emp_name#32:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: fathers_name#33:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[3] Schema: mothers_name#34:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[4] Schema: wifes_name#35:bytearray)
    |
    |---a: (Name: LOLoad Schema: empid#31:bytearray,emp_name#32:bytearray,fathers_name#33:bytearray,mothers_name#34:bytearray,wifes_name#35:bytearray)RequiredFields:null
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
a: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---a: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
    |
    |---a: Load(/user/hadoop/lab20/input/employee.csv:PigStorage(',')) - scope-19

2016-10-09 03:41:37,731 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-10-09 03:41:37,733 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-10-09 03:41:37,733 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-37
Map Plan
a: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---a: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
    |
    |---a: Load(/user/hadoop/lab20/input/employee.csv:PigStorage(',')) - scope-19
--------
Global sort: false
----------------

Store

grunt> store a into '/user/hadoop/lab22/01' using PigStorage(',');

Input(s):
Successfully read 2 records (10756671 bytes) from: "/user/hadoop/lab20/input/employee.csv"

Output(s):
Successfully stored 2 records (10756592 bytes) in: "/user/hadoop/lab22/01"

Counters:
Total records written : 2
Total bytes written : 10756592
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local1725228402_0002

grunt> fs -ls /user/hadoop/lab22/01
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-10-09 08:00 /user/hadoop/lab22/01/_SUCCESS
-rw-r--r--   3 hadoop supergroup         79 2016-10-09 08:00 /user/hadoop/lab22/01/part-m-00000

grunt> fs -cat /user/hadoop/lab22/01/part-m-00000
101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati
102,Bheema,Pandu,Kunti,Hidimbi
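
The delimiter argument is optional on store as well; if you leave it out, the default PigStorage writes tab-delimited output. A minimal sketch, writing to an illustrative directory that must not already exist:

grunt> store a into '/user/hadoop/lab22/02';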

Filter by

grunt> b = filter a by empid < 102;

grunt> dump b;

(101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati)
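
Filter conditions can be combined with and, or and not. A minimal sketch that filters on two fields at once (sons is an illustrative name):

grunt> sons = filter a by empid < 200 and fathers_name == 'Pandu';
grunt> dump sons;
(102,Bheema,Pandu,Kunti,Hidimbi)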

Order by

grunt> describe a;
a: {empid: int,emp_name: chararray,fathers_name: chararray,mothers_name: chararray,wifes_name: chararray}
grunt> c = order a by emp_name;

grunt> dump c;

(102,Bheema,Pandu,Kunti,Hidimbi)
(101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati)
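
The sort is ascending by default; append desc to reverse it. A minimal sketch (c2 is an illustrative name):

grunt> c2 = order a by emp_name desc;
grunt> dump c2;
(101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati)
(102,Bheema,Pandu,Kunti,Hidimbi)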

Group by

grunt> d = group a by fathers_name;

grunt> dump d;

(Pandu,{(102,Bheema,Pandu,Kunti,Hidimbi)})
(Dhritarashtra,{(101,Duryodhana,Dhritarashtra,Gandhari,Bhanumati)})
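
Each output tuple carries the group key plus a bag of all matching rows, so group by is usually followed by foreach to aggregate over the bag. A minimal sketch counting children per father (e is an illustrative name; inside the foreach, the bag is referenced by the original relation name a):

grunt> e = foreach d generate group as fathers_name, COUNT(a) as children;
grunt> dump e;
(Pandu,1)
(Dhritarashtra,1)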

Wow. Pig saves me so much time!

See you in another interesting post.