# Multiple ANOVA, Post hoc test using R

I have written about how to run the ANOVA test in my previous post Analysis of Variance ANOVA using R. We analyzed the salary difference between different level of education.

For ease of (my!) understanding, I would take the same data set in this post as well. So here is the same data set.

> sal
id gender      educ  Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1   1 female        UG  Jr Engineer   JLM  10000              1000        3                      4
2   2   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
3   3   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
4   4   male        PG     Engineer   MLM  15000             15000        7                      2
5   5 female        PG  Sr Engineer   MLM  25000             25000       12                      4
6   6   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
7   7   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
8   8 female        PG     Engineer   MLM  13000             13000        7                      3
9   9 female        PG     Engineer   MLM  14000             14000        7                      2
10 10 female        PG     Engineer   MLM  16000             16000        8                      4
11 11 female        UG  Jr Engineer   JLM  10000              1000        3                      4
12 12   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
13 13   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
14 14   male        PG     Engineer   MLM  15000             15000        7                      2
15 15 female        PG  Sr Engineer   MLM  25000             25000       12                      4
16 16   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
17 17   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
18 18 female        PG     Engineer   MLM  13000             13000        7                      3
19 19 female        PG     Engineer   MLM  14000             14000        7                      2
20 20 female        PG     Engineer   MLM  16000             16000        8                      4
21 21 female        PG  Sr Engineer   MLM  25000             25000       12                      4
22 22   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
23 23   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
24 24 female        PG     Engineer   MLM  13000             13000        7                      3
25 25 female        PG     Engineer   MLM  14000             14000        7                      2
26 26 female        PG     Engineer   MLM  16000             16000        8                      4
27 27 female        UG  Jr Engineer   JLM  10000              1000        3                      4
28 28   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
29 29   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
30 30   male        PG     Engineer   MLM  15000             15000        7                      2
31 31 female        PG  Sr Engineer   MLM  25000             25000       12                      4
32 32 female        PG  Sr Engineer   MLM  25000             25000       12                      4
33 33   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
34 34   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
35 35 female        PG     Engineer   MLM  13000             13000        7                      3
36 36 female        PG     Engineer   MLM  14000             14000        7                      2
37 37 female        PG     Engineer   MLM  16000             16000        8                      4
38 38 female        UG  Jr Engineer   JLM  10000              1000        3                      4
39 39   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
40 40   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
41 41   male        PG     Engineer   MLM  15000             15000        7                      2
42 42 female        PG  Sr Engineer   MLM  25000             25000       12                      4
43 43   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
44 44   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
45 45 female        PG     Engineer   MLM  13000             13000        7                      3
46 46 female        PG     Engineer   MLM  16000             16000        8                      4
47 47 female        UG  Jr Engineer   JLM  10000              1000        3                      4
48 48   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
49 49   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
50 50   male        PG     Engineer   MLM  15000             15000        7                      2


We have already executed ANOVA test. Following is the output.

> aov1 <-aov(Salary~educ, data=sal)
> aov1
Call:
aov(formula = Salary ~ educ, data = sal)

Terms:
educ   Residuals
Sum of Squares  35270186667   538293333
Deg. of Freedom           3          46

Residual standard error: 3420.823
Estimated effects may be unbalanced
> summary(aov1)
Df    Sum Sq   Mean Sq F value Pr(>F)
educ         3 3.527e+10 1.176e+10    1005 <2e-16 ***
Residuals   46 5.383e+08 1.170e+07
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


### Variance between groups

What we get above is the overall significant difference between DV (salary) and IV (Education)

To compare the difference between the group, we use Post hoc test. We shall use TukeyHSD() for this. We need to go for this approach, only if the anova is significant. If anova is not significant, there is no need for posthoc.

We’d see how to run get the variances across the groups in this post.

> tukey <- TukeyHSD(aov1)
> tukey
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = Salary ~ educ, data = sal)

$educ diff lwr upr p adj DOCTORATE-DIPLOMA 93333.333 88624.720 98041.947 0.0000000 PG-DIPLOMA 10373.333 7395.345 13351.322 0.0000000 UG-DIPLOMA 3333.333 -1375.280 8041.947 0.2477298 PG-DOCTORATE -82960.000 -87426.983 -78493.017 0.0000000 UG-DOCTORATE -90000.000 -95766.850 -84233.150 0.0000000 UG-PG -7040.000 -11506.983 -2573.017 0.0006777  • diff – mean difference between education level • lwr – lower mean • upr – upper mean If signs between lwr and upr are same, irrelevant of + or -, that denotes significant difference. When you compare diploma (lower degree) with doctorate (higher degree), the difference would be +ve and vice versa. If you just want to see the difference, + or – is not significant. Let’s plot this in a graph. > plot(tukey)  0 is the mid point. So, anything near 0 do not have significant difference. From the top, first plot is for the comparison between DOCTORATE-DIPLOMA. You would see a high positive difference. If you see the plot for UG-DOCTORATE, is it second highest difference, but this is negative difference. Anything near 0 like UG-DIPLOMA, does not have significant difference. ### ANOVA between multiple variables We received a new data set from company now, which has a new column Loan.deducation. Last.drawn.salary changes with respect to his loans. > sal <- read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 5901.74 4098.26 3 4 2 2 male DOCTORATE Chairman TLM 100000 4247.31 95752.69 20 4 3 3 male DIPLOMA Jr HR JLM 6000 3895.76 2104.24 1 3 4 4 male PG Engineer MLM 15000 9108.36 5891.64 7 2 5 5 female PG Sr Engineer MLM 25000 4269.39 20730.61 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 4137.31 1862.69 1 1  Company wants to see the differences among Salary (column 6), Loan.deduction (column 7) and Last.drawn.salary (column 8). We combine apply and anova as given below. > aovset <- apply(sal[,6:8], 2, function(x)aov(x~educ, data = sal))  • sal[,6:8] takes all rows of columns 6, 7 and 8 • aov is our function Following is the variance between education and Last.drawn.salary. > summary(aovset$Last.drawn.salary)
Df    Sum Sq   Mean Sq F value Pr(>F)
educ         3 3.342e+10 1.114e+10   674.2 <2e-16 ***
Residuals   46 7.602e+08 1.653e+07
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


F value is 674, which means, the change is significant. Following would be more interesting.

> mtcars5by5$total [1] 300.90 300.90 231.65 398.48 564.85 > #total is added a new variable in our data set > mtcars5by5 mpg cyl disp hp drat total Mazda RX4 21.0 6 160 110 3.90 300.90 Mazda RX4 Wag 21.0 6 160 110 3.90 300.90 Datsun 710 22.8 4 108 93 3.85 231.65 Hornet 4 Drive 21.4 6 258 110 3.08 398.48 Hornet Sportabout 18.7 8 360 175 3.15 564.85 > #column sum > apply(mtcars5by5, 2, sum) mpg cyl disp hp drat total 104.90 30.00 1046.00 598.00 17.88 1796.78  ### Transform Transform() helps us to prepare data. Using this, we shall create n number of new variables. > transform(mtcars5by5,tot=sum(mtcars5by5[,1:5]),mtper=mpg/drat,ntprod=mpg/hp) mpg cyl disp hp drat total tot mtper ntprod Mazda RX4 21.0 6 160 110 3.90 300.90 1796.78 5.384615 0.1909091 Mazda RX4 Wag 21.0 6 160 110 3.90 300.90 1796.78 5.384615 0.1909091 Datsun 710 22.8 4 108 93 3.85 231.65 1796.78 5.922078 0.2451613 Hornet 4 Drive 21.4 6 258 110 3.08 398.48 1796.78 6.948052 0.1945455 Hornet Sportabout 18.7 8 360 175 3.15 564.85 1796.78 5.936508 0.1068571  I have added new variables tot, mtper, ntprod above. ### lapply This help us to issue a function over a list. It loops over a list and evaluate a function on each element > lapply(mtcars5by5, mean)$mpg
[1] 20.98

$cyl [1] 6$disp
[1] 209.2

$hp [1] 119.6$drat
[1] 3.576

$total [1] 359.356  It went through the complete list to provide the mean of each variable. ### tapply tapply() helps us to apply the function in a ragged array or a subset of vector. > tapply(mtcars5by5$mpg, mtcars5by5$cyl, mean) 4 6 8 22.80000 21.13333 18.70000  Consider our mtcars data set. I need to find out mean and maximum horse power hp grouped by different gears > mtcars mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 > ctmean v s automatic 15.05 20.74286 manual 19.75 28.37143 > tapply(mtcars$hp, mtcars$gear, mean) 3 4 5 176.1333 89.5000 195.6000 > tapply(mtcars$hp, mtcars$gear, max) 3 4 5 245 123 335  We got mean horse power for 3, 4 and 5 gears. We may even provide a list to group the mean operation. In the below given example, we shall calculate the mean for different transmission model am (0 – automatic; 1 – manual) and v/s. > list(mtcars$am,mtcars$vs) [[1]] [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 [[2]] [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1 > ctmean <- tapply(mtcars$hp, list(mtcars$am,mtcars$vs), mean)
> rownames(ctmean) <- c("automatic", "manual")
> colnames(ctmean) <- c("v", "s")
> ctmean
v         s
automatic 194.1667 102.14286
manual    180.8333  80.57143


I think I shall stop here. See you in another interesting post.

Have a leisurely weekend.

# Exploring data files with R

I have written about data types and data structures of R in my previous post Working with data types of R. We shall explore a data set in this post.

mtcars is a dataset exists in R already

> mtcars
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


### File structure

Very first question would be about the size of the data set.

> dim(mtcars)
[1] 32 11


It contains 32 rows and 11 columns.

Now, how many variables we have in mtcars?

> names(mtcars)
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"


We shall preview the data using head and tail commands.

> head(mtcars)
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> tail(mtcars)
mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2


Similar to unix tail, head commands, you would see first and last 6 records in R.

This is the time to know about the structure of the data set.

> str(mtcars)
'data.frame':	32 obs. of  11 variables:
$mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...$ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
$disp: num 160 160 108 258 360 ...$ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
$drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...$ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
$qsec: num 16.5 17 18.6 19.4 17 ...$ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
$am : num 1 1 1 0 0 0 0 0 0 0 ...$ gear: num  4 4 4 3 3 3 3 4 4 4 ...
$carb: num 4 4 1 1 2 1 4 2 2 4 ...  So, this is a data frame. Each variables are explained above. How the data is being stored? > mode(mtcars) [1] "list"  It is stored as a list. Let’s take another dataset available with R – airquality. > head(airquality, n=10) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10  You may parameterize head commands as I have shown above. Let’s omit the records with NA. > aqNoNA=na.omit(airquality) > dim(aqNoNA) [1] 111 6 > dim(airquality) [1] 153 6  So, out new object aqNoNA contains the records without missing cases, totally 111 rows. We have is.na command to check if the data is NA. > is.na(airquality) Ozone Solar.R Wind Temp Month Day [1,] FALSE FALSE FALSE FALSE FALSE FALSE [2,] FALSE FALSE FALSE FALSE FALSE FALSE [3,] FALSE FALSE FALSE FALSE FALSE FALSE [4,] FALSE FALSE FALSE FALSE FALSE FALSE [5,] TRUE TRUE FALSE FALSE FALSE FALSE [6,] FALSE TRUE FALSE FALSE FALSE FALSE > sum(is.na(airquality)) [1] 44  So totally 44 NAs found. Summary command gives us minimum, quartile 1, median, mean, 3rd quartile, maximum and number of NAa. > summary(airquality) Ozone Solar.R Wind Temp Month Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000 Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000 Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000 Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000 NA's :37 NA's :7 Day Min. : 1.0 1st Qu.: 8.0 Median :16.0 Mean :15.8 3rd Qu.:23.0 Max. :31.0  1st quartile meant of 24% data 2nd quartile meant of 50 percentile of data 3rd quartile meant of 75 percentile of data We shall filter the summary commands as given below. > summary(airquality$Ozone)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
1.00   18.00   31.50   42.13   63.25  168.00      37


We have given the summary of only one variable Ozone above.

We shall apply other mathematical functions as given below

> sd(aqNoNA$Ozone) [1] 33.27597  We have calculated the standard deviation for Ozone above. I have taken the data without NA, as my sd operation will fail if I have NA. ### Finding the Outlier > myNumbers <- rep(c(1,4,10, 100), c(5, 5, 5, 1)) > myNumbers [1] 1 1 1 1 1 4 4 4 4 4 10 10 10 10 10 100 > mean(myNumbers) [1] 10.9375 > sd(myNumbers) [1] 24.04293 > scale(myNumbers) [,1] [1,] -0.41332316 [2,] -0.41332316 [3,] -0.41332316 [4,] -0.41332316 [5,] -0.41332316 [6,] -0.28854636 [7,] -0.28854636 [8,] -0.28854636 [9,] -0.28854636 [10,] -0.28854636 [11,] -0.03899275 [12,] -0.03899275 [13,] -0.03899275 [14,] -0.03899275 [15,] -0.03899275 [16,] 3.70431136 attr(,"scaled:center") [1] 10.9375 attr(,"scaled:scale") [1] 24.04293  scaling of the metrics (z score) is calculated using the formula each sample data – mean / standard deviation After scaling, if you find any values is greater than ±2, they are called outliers (odd points located away from the central measures). If SD > mean, it is outlier. Other way to test the normality is Shapiro test > shapiro.test(myNumbers) Shapiro-Wilk normality test data: myNumbers W = 0.40055, p-value = 3.532e-07  If the p-value is greater than 0.05, it is normalized data. Here it is not. ### Data Analysis Here comes another interesting part. How to analyze the data, after you upload your files. To explain this, I’d prepare sample data set first. > #Employee ID > sn <- seq(1, 10, 1) > sn [1] 1 2 3 4 5 6 7 8 9 10 > #Employee gender > gender <- rep(c("male", "female"), c(6,4)) > gender [1] "male" "male" "male" "male" "male" "male" "female" "female" "female" [10] "female" > #available departments > dept <- rep(c("Admin", "HR", "Prod", "Contractor"), c(1, 3, 3, 3)) > dept [1] "Admin" "HR" "HR" "HR" "Prod" "Prod" [7] "Prod" "Contractor" "Contractor" "Contractor" > #Employee Salary > sal <- rnorm(10, 1000, 200); > sal [1] 888.2026 876.6272 919.7453 1005.9058 1084.4704 1337.7696 909.4302 801.1482 [9] 1025.9457 1182.9774 > #Now our data set > mydataset <- data.frame(sn, gender, sal, dept) > mydataset sn gender sal dept 1 1 male 888.2026 Admin 2 2 male 876.6272 HR 3 3 male 919.7453 HR 4 4 male 1005.9058 HR 5 5 male 1084.4704 Prod 6 6 male 1337.7696 Prod 7 7 female 909.4302 Prod 8 8 female 801.1482 Contractor 9 9 female 1025.9457 Contractor 10 10 female 1182.9774 Contractor  So we have serial number, gender, salary and department. Which gender is majority in the given data? we shall use table function to arrive at a simple frequency distribution table. > #which gender is more in each dept > #this is frequency distribution > table(mydataset$gender)

female   male
4      6


So we have 4 females and 6 males. Let’s group the above FD by department now.

> table(mydataset$gender, mydataset$dept)

female     0          3  0    1
male       1          0  3    2
> #assign the results to an object
> freqDis <- table(mydataset$gender, mydataset$dept);
> #transpose the table
> t(freqDis)

female male
Contractor      3    0
HR              0    3
Prod            1    2


So all departments except contractors, have male as majority. The function t() stands for transpose.

Let’s do a proportion of the data using prop.table()

> #proportion
> prop.table(freqDis)

female   0.0        0.3 0.0  0.1
male     0.1        0.0 0.3  0.2


Rather than proportion, % of males and females would give us a better visibility. hence we multiply proportion by 100.

> #Percentage
> prop.table(freqDis)*100

female     0         30  0   10
male      10          0 30   20


Column sum and row sums are frequently asked in our day today life.

> #sum
> colSums(freqDis)
1          3          3          3


Row sum shall be calculated as below.

> rowSums(freqDis)
female   male
4      6


Salary is interesting part in our profession. I’d like to see who earns more using aggregate() function? Male or Female?

> #who earns more - male or female?
> aggregate(sal~gender, mean, data = mydataset)
gender       sal
1 female  979.8754
2   male 1018.7868
> #sal is continuous variable
> #gender is categorical variable
> #mean is the function
> #data is our data source


We used mean salary above. Lets use sum now.

> aggregate(sal~gender, sum, data = mydataset)
gender      sal
1 female 3919.501
2   male 6112.721


Similarly, we use standard deviation.

> aggregate(sal~gender, sd, data = mydataset)
gender      sal
1 female 163.5837
2   male 175.1006


So salary package for females looks to be more consistent than that of males.

We calculated all the above functions individually. psych package helps to calculate everything in one command.

> install.packages("psych")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/psych_1.7.5.zip'
Content type 'application/zip' length 3966969 bytes (3.8 MB)

package ‘psych’ successfully unpacked and MD5 sums checked

> library(psych)
> describe (mydataset)
vars  n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 10    5.50   3.03   5.50    5.50   3.71   1.00   10.00   9.00  0.00
gender*    2 10    1.60   0.52   2.00    1.62   0.00   1.00    2.00   1.00 -0.35
sal        3 10 1003.22 162.35 962.83  986.66 119.22 801.15 1337.77 536.62  0.71
dept*      4 10    2.80   1.03   3.00    2.88   1.48   1.00    4.00   3.00 -0.20
kurtosis    se
sn         -1.56  0.96
gender*    -2.05  0.16
sal        -0.72 51.34
dept*      -1.42  0.33


n is total samples, sd is standard deviation etc. All the aggregated functions are calculated for serial numbmer, gender, salary and department.

In the above data, I get only the salary data, which is in 3rd row.

> describe (mydataset[,3])
vars  n    mean     sd median trimmed    mad    min     max  range skew kurtosis
X1    1 10 1003.22 162.35 962.83  986.66 119.22 801.15 1337.77 536.62 0.71    -0.72
se
X1 51.34


Lets group the above described data by gender, ie, mean, sd for males and females separately.

> describe.by(mydataset, mydataset$gender) Descriptive statistics by group group: female vars n mean sd median trimmed mad min max range skew sn 1 4 8.50 1.29 8.50 8.50 1.48 7.00 10.00 3.00 0.00 gender* 2 4 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN sal 3 4 979.88 163.58 967.69 979.88 166.64 801.15 1182.98 381.83 0.14 dept* 4 4 2.50 1.00 2.00 2.50 0.00 2.00 4.00 2.00 0.75 kurtosis se sn -2.08 0.65 gender* NaN 0.00 sal -2.04 81.79 dept* -1.69 0.50 --------------------------------------------------------------- group: male vars n mean sd median trimmed mad min max range skew sn 1 6 3.50 1.87 3.50 3.50 2.22 1.00 6.00 5.00 0.00 gender* 2 6 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.00 NaN sal 3 6 1018.79 175.10 962.83 1018.79 119.22 876.63 1337.77 461.14 0.83 dept* 4 6 3.00 1.10 3.00 3.00 0.74 1.00 4.00 3.00 -0.76 kurtosis se sn -1.80 0.76 gender* NaN 0.00 sal -1.02 71.48 dept* -0.92 0.45 Warning message: describe.by is deprecated. Please use the describeBy function  # Working with data types of R I have discussed about package management in the previous post. This post concentrates on data type, data structure and data coercion. ### Data types & Data Structure We have different Object types in R as given below. • Vectors • Lists • Matrices • Arrays • Factors • Data Frames We have the following data types in R. • Logical myvar <- TRUE  • Numeric myvar <- 23.5  • Integer myvar <- 2L  • Complex myvar <- 2+5i  • Character myvar <- "GOOD"  • Raw myvar <- charToRaw("GOOD")  We could see them with examples below. I have already written about c command in Basic functions in R language > id1 <- c(1,2,3,4) > id1+10 [1] 11 12 13 14  rep command helps us to repeat the values. > gender <- rep(c("male", "female"), c(6,4)) > gender [1] "male" "male" "male" "male" "male" "male" "female" "female" [9] "female" "female"  male is repeated 6 times where as female repeated 4 times. Serial number is given by the command sn. > sn <- seq(1, 10, 1) > sn [1] 1 2 3 4 5 6 7 8 9 10 > sn <-seq(1, 10, 2) > sn [1] 1 3 5 7 9  we passed start value, end value and interval. cbind meant for column bind. > #vector to matrix - cbind > slno_gender <- cbind(sn, gender) > slno_gender sn gender [1,] "1" "male" [2,] "2" "male" [3,] "3" "male" [4,] "4" "male" [5,] "5" "male" [6,] "6" "male" [7,] "7" "female" [8,] "8" "female" [9,] "9" "female" [10,] "10" "female"  This gives column wise data, in matrix. Matrix takes single data type only. Hence serial number is changed to character data type. Let’s look at similar example using DataFrame. We shall overcome this data type problem. > #data frame > df <- data.frame(sn, gender) > df sn gender 1 1 male 2 2 male 3 3 male 4 4 male 5 5 male 6 6 male 7 7 female 8 8 female 9 9 female 10 10 female  Let’s create another object called names > name <- c("Sophia","Jackson","Isabella","Lucas","Charlotte","Oliver","Amelia","Benjamin","Sarah","Julian") > name [1] "Sophia" "Jackson" "Isabella" "Lucas" "Charlotte" "Oliver" [7] "Amelia" "Benjamin" "Sarah" "Julian"  Let’s create another object called salary. when we specify number of samples (10), mean (1000) and standard deviation (20), this command would generate random numbers to satisfy this. > sal <- rnorm(10, 1000, 200); > sal [1] 898.5731 1105.1638 757.7067 934.9025 1006.0053 1014.4837 1188.6763 611.0265 [9] 643.5498 1121.6005 > mean(sal) [1] 928.1688 > sd(sal) [1] 200.4666  Let’s combine 3 objects together now to form a new object, using data.frame > mydataset <- data.frame(sn, gender, sal) > mydataset sn gender sal 1 1 male 898.5731 2 2 male 1105.1638 3 3 male 757.7067 4 4 male 934.9025 5 5 male 1006.0053 6 6 male 1014.4837 7 7 female 1188.6763 8 8 female 611.0265 9 9 female 643.5498 10 10 female 1121.6005  After showing the above example for data frame, lets switch to list. > mydatalist <- list(v1=id1, v2=sn, v3=slno_gender, v4=mydataset) > mydatalist$v1
[1] 1 2 3 4

$v2 [1] 1 2 3 4 5 6 7 8 9 10$v3
sn   gender
[1,] "1"  "male"
[2,] "2"  "male"
[3,] "3"  "male"
[4,] "4"  "male"
[5,] "5"  "male"
[6,] "6"  "male"
[7,] "7"  "female"
[8,] "8"  "female"
[9,] "9"  "female"
[10,] "10" "female"

$v4 sn gender sal 1 1 male 898.5731 2 2 male 1105.1638 3 3 male 757.7067 4 4 male 934.9025 5 5 male 1006.0053 6 6 male 1014.4837 7 7 female 1188.6763 8 8 female 611.0265 9 9 female 643.5498 10 10 female 1121.6005  I put different object types together in a single collection and stored it in one object. We shall refer to individual data as variables as shown below. > mydatalist$v1
[1] 1 2 3 4


Cool isn’t it!

### Data Coercion

Let’s look at data coercion now. mode command helps is to find out the data type of an object

> mode(mydatalist$v1) [1] "numeric" > mode(mydatalist$v3)
[1] "character"
> mode(gender)
[1] "character"


Let’s convert this character type to factor type now.


gender1<-as.factor(gender)



How is this possible to represent the text  as numeric? Let’s look at the class, which shows us the data structure.

> class(gender1)
[1] "factor


Okay, let’s unclass gender1 now to know how is it stored.

> unclass(gender1)
[1] 2 2 2 2 2 2 1 1 1 1
attr(,"levels")
[1] "female" "male"


So 1 represents female, 2 represents male.

See you in another interest post.