Loading Data into R

I have written about storing and retrieving objects in R in my previous post. Let's see how to load data into R here.

c Command

R offers a command called c, which stands for combine. It is used to enter numeric, character, and alphanumeric data.

> marks = c (100, 80, 85, 70, 35)


The following commands show how to load numeric, character, and alphanumeric data. Note how R responds when you give alphanumeric data: the numeric values are coerced into character strings.

> marks = c (100, 80, 85, 70, 35)
> marks
[1] 100  80  85  70  35
> names = c("sun", "moon", "earth")
> names
[1] "sun"   "moon"  "earth"
> alphanu = c("sun", "moon", "earth", 2, 3)
> alphanu
[1] "sun"   "moon"  "earth" "2"     "3"    
> #append data
> marks = c(marks, 10, 20)
> marks
[1] 100  80  85  70  35  10  20
> #combine two objects
> combo = c(names, marks)
> combo
 [1] "sun"   "moon"  "earth" "100"   "80"    "85"    "70"    "35"    "10"    "20"

Scan command

We give the complete data, comma separated, in a single call when we use the c command. The scan command helps us enter the data interactively. Press Enter on an empty line to complete the data loading process.

> #scan numbers
> scan()
1: 10
2: 20
3: 30
4: 
Read 3 items
[1] 10 20 30
> scan(what='character')
1: tamil
2: english
3: maths
4: 
Read 3 items
[1] "tamil"   "english" "maths"

Loading single-dimensional data from flat files

Scan shall also be used to read your data files. I have a data file at D:/gandhari/videos/Advanced Business Analytics/marks.txt


Here is the way we shall read the values.

> marks = scan(file = 'D:/gandhari/videos/Advanced Business Analytics/marks.txt')
Read 20 items
> marks;
 [1]  80  90 100 100  90  70  85  67  74  76  50  55  57  62  51  35  30  27  40  39

So scan loads everything as a single-dimensional vector.
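
If the file actually holds tabular values, the vector can be reshaped with the matrix command; a small sketch, assuming the 20 marks represent 4 rows of 5 values each:

> #reshape the flat vector into a 4 x 5 matrix, filling row by row
> m = matrix(marks, nrow = 4, byrow = TRUE)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]   80   90  100  100   90
[2,]   70   85   67   74   76
[3,]   50   55   57   62   51
[4,]   35   30   27   40   39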

I have given the complete path of the file in the above example. If you have multiple files in the same folder, it is easier to change the working directory to ease the loading process. Then we shall give only the file name instead of the complete path.

> getwd()
[1] "D:/gandhari/documents"
> setwd("D:/gandhari/videos/Advanced Business Analytics/")
> marks = scan(file="marks.txt")

Loading multi-dimensional data from CSV file

How do we load a multi-dimensional array? Let's use the read.csv command.

This is my input file, marks.csv: four rows of five comma-separated marks each.

> marks<-read.csv(file = 'marks.csv', header = FALSE, sep = ",")
> marks
  V1 V2  V3  V4 V5
1 80 90 100 100 90
2 70 85  67  74 76
3 50 55  57  62 51
4 35 30  27  40 39

V1, V2, … V5 are the variable (column) names R assigns when the file has no header row.

1, 2, … 4 are the row numbers.
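
read.csv returns a data frame, so individual columns and cells can be picked out directly; for example:

> #pick the first column and a single cell
> marks$V1
[1] 80 70 50 35
> marks[1, 3]
[1] 100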

RStudio data import

RStudio has an option to import CSV files interactively using its GUI.

Our input data is the same marks.csv file used above.

To import it this way, open the Import Dataset menu in RStudio's Environment pane, choose the text file option, select marks.csv, review the import options and the data preview in the dialog, and click Import.


Apache Access Log analysis with Apache Pig

So far I have documented some of the key functions in Apache Pig. Today, let's write a simple Pig Latin script to parse and analyse Apache's access log. Here you go.

$ cat accessLogETL.pig
-- demo script javashine.wordpress.com
-- extract user IPs
access = load '/user/cloudera/pig/access.log' using PigStorage (' ') as (hostIp:chararray, clientId:chararray, userId:chararray, reqTime:chararray, reqTimeZone:chararray, reqMethod:chararray, reqLine:chararray, reqProt:chararray, statusCode:chararray, respLength:int, referrer:chararray, userAgentMozilla:chararray, userAgentPf1:chararray, userAgentPf2:chararray, userAgentPf3:chararray, userAgentRender:chararray, userAgentBrowser:chararray);
hostIpList = FOREACH access GENERATE hostIp;
hostIpList = DISTINCT hostIpList;
STORE hostIpList INTO '/user/cloudera/pig/hostIpList' USING PigStorage('\t');

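-- extract distinct (host IP, request time, URL) access records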
hostIpUrlList = FOREACH access GENERATE (hostIp,reqTime,reqTimeZone,reqLine);
hostIpUrlList = DISTINCT hostIpUrlList;
STORE hostIpUrlList INTO '/user/cloudera/pig/hostIpUrlList' USING PigStorage('\t');

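-- compute the total response bytes served per host IP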
hostIpBandwidthList = FOREACH access GENERATE (hostIp), respLength;
groupByIp = GROUP hostIpBandwidthList BY hostIp;
bandwidthByIp = FOREACH groupByIp GENERATE hostIpBandwidthList.hostIp, SUM(hostIpBandwidthList.respLength);
STORE bandwidthByIp INTO '/user/cloudera/pig/bandwidthByIp' USING PigStorage('\t');

$ pig -x mapreduce accessLogETL.pig
Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature Outputs
job_1485688219066_0027  1       1       12      12      12      12      8       8       8       8       access,hostIpList,hostIpUrlList DISTINCT,MULTI_QUERY    /user/cloudera/pig/hostIpList,/user/cloudera/pig/hostIpUrlList,
job_1485688219066_0028  1       1       7       7       7       7       9       9       9       9       bandwidthByIp,groupByIp,hostIpBandwidthList     GROUP_BY        /user/cloudera/pig/bandwidthByIp,

Input(s):
Successfully read 470749 records (59020755 bytes) from: "/user/cloudera/pig/access.log"

Output(s):
Successfully stored 877 records (12342 bytes) in: "/user/cloudera/pig/hostIpList"
Successfully stored 449192 records (31075055 bytes) in: "/user/cloudera/pig/hostIpUrlList"
Successfully stored 877 records (6729928 bytes) in: "/user/cloudera/pig/bandwidthByIp"

Counters:
Total records written : 450946
Total bytes written : 37817325
Spillable Memory Manager spill count : 7
Total bags proactively spilled: 3
Total records proactively spilled: 213378

Let’s see our results now.

The above script yields three outputs.
The first is the list of user IPs that accessed the web server.

$ hadoop fs -ls /user/cloudera/pig/hostIpList
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:43 /user/cloudera/pig/hostIpList/_SUCCESS
-rw-r--r--   1 cloudera cloudera      12342 2017-01-31 12:43 /user/cloudera/pig/hostIpList/part-r-00000
[cloudera@quickstart pig]$ hadoop fs -cat /user/cloudera/pig/hostIpList/part-r-00000
::1
10.1.1.5
107.21.1.8
14.134.7.6
37.48.94.6
46.4.90.68
46.4.90.86

The second output gives us the list of user IPs, their access times, and the URLs they accessed.

$ hadoop fs -ls /user/cloudera/pig/hostIpUrlList
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:43 /user/cloudera/pig/hostIpUrlList/_SUCCESS
-rw-r--r--   1 cloudera cloudera   31075055 2017-01-31 12:43 /user/cloudera/pig/hostIpUrlList/part-r-00000
[cloudera@quickstart pig]$ hadoop fs -cat /user/cloudera/pig/hostIpUrlList/part-r-00000
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm)
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm2/)
(10.1.1.5,[22/Jan/2017:17:51:34,+0000],/egcrm/helloWorld.action)

And finally, the bandwidth spent for each user IP. Note that the script projects hostIpBandwidthList.hostIp, which is a bag, so each row below starts with a bag of the grouped IP values; generating the group key instead would print a single IP per row.

$ hadoop fs -ls /user/cloudera/pig/bandwidthByIp
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2017-01-31 12:44 /user/cloudera/pig/bandwidthByIp/_SUCCESS
-rw-r--r--   1 cloudera cloudera    6729928 2017-01-31 12:44 /user/cloudera/pig/bandwidthByIp/part-r-00000
$ hadoop fs -cat /user/cloudera/pig/bandwidthByIp/part-r-00000
{(193.138.219.245)}     1313
{(193.138.219.250),(193.138.219.250),(193.138.219.250)} 3939
{(195.154.181.113)}     496
{(195.154.181.168)}     1026
