Apache Access Log analysis with Apache Pig

So far I have documented some of the key functions in Apache Pig. Today, Let’s write a simple Pig Latin script to parse and analyse Apache’s access log. Here you go.

$ cat accessLogETL.pig
-- demo script javashine.wordpress.com
-- extract user IPs
access = load '/user/cloudera/pig/access.log' using PigStorage (' ') as (hostIp:chararray, clientId:chararray, userId:chararray, reqTime:chararray, reqTimeZone:chararray, reqMethod:chararray, reqLine:chararray, reqProt:chararray, statusCode:chararray, respLength:int, referrer:chararray, userAgentMozilla:chararray, userAgentPf1:chararray, userAgentPf2:chararray, userAgentPf3:chararray, userAgentRender:chararray, userAgentBrowser:chararray);
hostIpList = FOREACH access GENERATE hostIp;
hostIpList = DISTINCT hostIpList;
STORE hostIpList INTO '/user/cloudera/pig/hostIpList' USING PigStorage('\t');

hostIpUrlList = FOREACH access GENERATE (hostIp,reqTime,reqTimeZone,reqLine);
hostIpUrlList = DISTINCT hostIpUrlList;
STORE hostIpUrlList INTO '/user/cloudera/pig/hostIpUrlList' USING PigStorage('\t');

hostIpBandwidthList = FOREACH access GENERATE (hostIp), respLength;
groupByIp = GROUP hostIpBandwidthList BY hostIp;
bandwidthByIp = FOREACH groupByIp GENERATE hostIpBandwidthList.hostIp, SUM(hostIpBandwidthList.respLength);
STORE bandwidthByIp INTO '/user/cloudera/pig/bandwidthByIp' USING PigStorage('\t');

$ pig -x mapreduce accessLogETL.pig
Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
job_1485688219066_0027  1       1       12      12      12      12      8       8       8       8       access,hostIpList,hostIpUrlList DISTINCT,MULTI_QUERY    /user/cloudera/pig/hostIpList,/user/cloudera/pig/hostIpUrlList,
job_1485688219066_0028  1       1       7       7       7       7       9       9       9       9       bandwidthByIp,groupByIp,hostIpBandwidthList     GROUP_BY        /user/cloudera/pig/bandwidthByIp,

Input(s):
Successfully read 470749 records (59020755 bytes) from: "/user/cloudera/pig/access.log"

Output(s):
Successfully stored 877 records (12342 bytes) in: "/user/cloudera/pig/hostIpList"
Successfully stored 449192 records (31075055 bytes) in: "/user/cloudera/pig/hostIpUrlList"
Successfully stored 877 records (6729928 bytes) in: "/user/cloudera/pig/bandwidthByIp"

Counters:
Total records written : 450946
Total bytes written : 37817325
Spillable Memory Manager spill count : 7
Total bags proactively spilled: 3
Total records proactively spilled: 213378

Let’s see our results now.

The above script will yield us three output.
First is the list of user IPs accessed the web server.

$ hadoop fs -ls /user/cloudera/pig/hostIpList
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:43 /user/cloudera/pig/hostIpList/_SUCCESS
-rw-r--r--   1 cloudera cloudera      12342 2017-01-31 12:43 /user/cloudera/pig/hostIpList/part-r-00000
[cloudera@quickstart pig]$ hadoop fs -cat /user/cloudera/pig/hostIpList/part-r-00000
::1
10.1.1.5
107.21.1.8
14.134.7.6
37.48.94.6
46.4.90.68
46.4.90.86

The second output will give us the list of user IPs, their access time and accessed URL

$ hadoop fs -ls /user/cloudera/pig/hostIpUrlList
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:43 /user/cloudera/pig/hostIpUrlList/_SUCCESS
-rw-r--r--   1 cloudera cloudera   31075055 2017-01-31 12:43 /user/cloudera/pig/hostIpUrlList/part-r-00000
[cloudera@quickstart pig]$ hadoop fs -cat /user/cloudera/pig/hostIpUrlList/part-r-00000
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm)
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm2/)
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm/helloWorld.action)

And, finally the bandwidth spent for each user IP.

$ hadoop fs -ls /user/cloudera/pig/bandwidthByIp
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:44 /user/cloudera/pig/bandwidthByIp/_SUCCESS
-rw-r--r--   1 cloudera cloudera    6729928 2017-01-31 12:44 /user/cloudera/pig/bandwidthByIp/part-r-00000
$ hadoop fs -cat /user/cloudera/pig/bandwidthByIp/part-r-00000
{(193.138.219.245)}     1313
{(193.138.219.250),(193.138.219.250),(193.138.219.250)} 3939
{(195.154.181.113)}     496
{(195.154.181.168)}     1026

chinese-new-year-2017

Advertisements