In earlier posts I documented some of the key functions in Apache Pig. Today, let's write a simple Pig Latin script to parse and analyse an Apache HTTP Server access log. Here you go.
$ cat accessLogETL.pig
-- demo script javashine.wordpress.com
-- extract user IPs
access = LOAD '/user/cloudera/pig/access.log' USING PigStorage(' ')
    AS (hostIp:chararray, clientId:chararray, userId:chararray,
        reqTime:chararray, reqTimeZone:chararray, reqMethod:chararray,
        reqLine:chararray, reqProt:chararray, statusCode:chararray,
        respLength:int, referrer:chararray, userAgentMozilla:chararray,
        userAgentPf1:chararray, userAgentPf2:chararray,
        userAgentPf3:chararray, userAgentRender:chararray,
        userAgentBrowser:chararray);

hostIpList = FOREACH access GENERATE hostIp;
hostIpList = DISTINCT hostIpList;
STORE hostIpList INTO '/user/cloudera/pig/hostIpList' USING PigStorage('\t');

hostIpUrlList = FOREACH access GENERATE (hostIp, reqTime, reqTimeZone, reqLine);
hostIpUrlList = DISTINCT hostIpUrlList;
STORE hostIpUrlList INTO '/user/cloudera/pig/hostIpUrlList' USING PigStorage('\t');

hostIpBandwidthList = FOREACH access GENERATE hostIp, respLength;
groupByIp = GROUP hostIpBandwidthList BY hostIp;
bandwidthByIp = FOREACH groupByIp GENERATE hostIpBandwidthList.hostIp,
    SUM(hostIpBandwidthList.respLength);
STORE bandwidthByIp INTO '/user/cloudera/pig/bandwidthByIp' USING PigStorage('\t');

$ pig -x mapreduce accessLogETL.pig

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1485688219066_0027	1	1	12	12	12	12	8	8	8	8	access,hostIpList,hostIpUrlList	DISTINCT,MULTI_QUERY	/user/cloudera/pig/hostIpList,/user/cloudera/pig/hostIpUrlList,
job_1485688219066_0028	1	1	7	7	7	7	9	9	9	9	bandwidthByIp,groupByIp,hostIpBandwidthList	GROUP_BY	/user/cloudera/pig/bandwidthByIp,

Input(s):
Successfully read 470749 records (59020755 bytes) from: "/user/cloudera/pig/access.log"

Output(s):
Successfully stored 877 records (12342 bytes) in: "/user/cloudera/pig/hostIpList"
Successfully stored 449192 records (31075055 bytes) in: "/user/cloudera/pig/hostIpUrlList"
Successfully stored 877 records (6729928 bytes) in: "/user/cloudera/pig/bandwidthByIp"

Counters:
Total records written : 450946
Total bytes written : 37817325
Spillable Memory Manager spill count : 7
Total bags proactively spilled: 3
Total records proactively spilled: 213378
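As an aside, the seventeen-column schema above is only needed because PigStorage(' ') splits the combined log format on every space, scattering the timestamp and user-agent string across several fields. Piggybank ships a loader that understands the combined log format directly. A sketch of that alternative (the jar path is an assumption and varies by installation):

```pig
-- Assumption: piggybank.jar location differs per distribution.
REGISTER /usr/lib/pig/piggybank.jar;

-- CombinedLogLoader parses the Apache combined log format in one step,
-- so the timestamp and user-agent arrive as single fields.
logs = LOAD '/user/cloudera/pig/access.log'
    USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
```

With that loader the downstream FOREACH projections would reference positional or parsed fields instead of the hand-split columns, but the space-split approach used in this post works fine for the queries shown here.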
Let's see our results now.
The above script yields three outputs.
The first is the list of distinct user IPs that accessed the web server.
$ hadoop fs -ls /user/cloudera/pig/hostIpList
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:43 /user/cloudera/pig/hostIpList/_SUCCESS
-rw-r--r--   1 cloudera cloudera      12342 2017-01-31 12:43 /user/cloudera/pig/hostIpList/part-r-00000
[cloudera@quickstart pig]$ hadoop fs -cat /user/cloudera/pig/hostIpList/part-r-00000
::1
10.1.1.5
107.21.1.8
14.134.7.6
37.48.94.6
46.4.90.68
46.4.90.86
The second output gives us the list of user IPs, their access times, and the URLs they accessed. (The square brackets around the timestamp come from the log itself; splitting on spaces leaves them attached to the reqTime and reqTimeZone fields.)
$ hadoop fs -ls /user/cloudera/pig/hostIpUrlList
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:43 /user/cloudera/pig/hostIpUrlList/_SUCCESS
-rw-r--r--   1 cloudera cloudera   31075055 2017-01-31 12:43 /user/cloudera/pig/hostIpUrlList/part-r-00000
[cloudera@quickstart pig]$ hadoop fs -cat /user/cloudera/pig/hostIpUrlList/part-r-00000
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm)
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm2/)
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm/helloWorld.action)
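Each row above is a single parenthesised tuple rather than plain tab-separated columns. That is because the extra parentheses in the GENERATE clause pack the four fields into one tuple column. Dropping them would make PigStorage('\t') write four separate columns (a sketch, untested against this dataset):

```pig
-- Without the wrapping parentheses, each field becomes its own column,
-- so the stored file is plain tab-separated text.
hostIpUrlList = FOREACH access GENERATE hostIp, reqTime, reqTimeZone, reqLine;
hostIpUrlList = DISTINCT hostIpUrlList;
```

Either form works; the tuple form just needs an extra parsing step if a downstream tool expects flat columns.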
And finally, the total bandwidth consumed by each user IP.
$ hadoop fs -ls /user/cloudera/pig/bandwidthByIp
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:44 /user/cloudera/pig/bandwidthByIp/_SUCCESS
-rw-r--r--   1 cloudera cloudera    6729928 2017-01-31 12:44 /user/cloudera/pig/bandwidthByIp/part-r-00000
$ hadoop fs -cat /user/cloudera/pig/bandwidthByIp/part-r-00000
{(193.138.219.245)}	1313
{(193.138.219.250),(193.138.219.250),(193.138.219.250)}	3939
{(195.154.181.113)}	496
{(195.154.181.168)}	1026
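Notice that the first column is a bag of (possibly repeated) IP tuples. That happens because the FOREACH projects hostIpBandwidthList.hostIp, which is the bag of all grouped records, rather than the group key itself. Projecting the key emits one plain IP per row; a sketch of that variant, reusing the aliases from the script above:

```pig
-- Projecting "group" (the grouping key) instead of the bag column
-- yields one scalar IP per row: e.g. "193.138.219.250<TAB>3939".
hostIpBandwidthList = FOREACH access GENERATE hostIp, respLength;
groupByIp = GROUP hostIpBandwidthList BY hostIp;
bandwidthByIp = FOREACH groupByIp
    GENERATE group AS hostIp,
             SUM(hostIpBandwidthList.respLength) AS totalBytes;
```

The bandwidth totals are identical either way; only the readability of the first column changes.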