# Regression (Explanatory) in R

Hi,

I have written about Regression – Predictive model in my earlier post Regression testing in R. Following posts are useful if you want to know what is regression.

Previous post talks about predicting unknown values using known values. This post would explain about how much change is observed between IV(s) and DV.

> setwd("D:/gandhari/videos/Advanced Business Analytics")
> student_data <- read.csv("student_data.csv") > student_data
id gender sup.help sup.under sup.appre adv.comp adv.access tut.prof tut.sched val.devel val.meet sat.glad sat.expe loy.proud loy.recom loy.pleas scholarships job
1   1 female        7         1         7        5          5        5         4         5        6        7        7         7         7         7           no  no
2   2   male        7         1         7        6          6        6         6         6        7        7        7         7         7         7           no yes
3   3 female        6         1         7        6          6        6         6         6        7        7        6         7         7         7           no  no
4   4   male        1         7         1        1          2        3         2         1        1        1        1         1         1         1          yes  no
5   5 female        6         5         7        7          6        7         7         7        7        7        7         7         7         7           no yes
6   6   male        3         1         7        7          7        6         7         6        6        7        6         7         7         7          yes  no
7   7 female        5         2         7        7          6        6         7         4        3        7        7         7         7         7          yes  no
8   8   male        6         1         7        7          7        7         5         7        6        7        7         5         6         7          yes yes
9   9 female        7         1         7        6          6        5         5         5        5        7        6         6         7         7           no yes
10 10   male        2         4         7        7          6        6         6         4        2        5        4         4         7         7           no  no
> str(student_data)
'data.frame': 10 obs. of 18 variables:
$id : int 1 2 3 4 5 6 7 8 9 10$ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2
$sup.help : int 7 7 6 1 6 3 5 6 7 2$ sup.under : int 1 1 1 7 5 1 2 1 1 4
$sup.appre : int 7 7 7 1 7 7 7 7 7 7$ adv.comp : int 5 6 6 1 7 7 7 7 6 7
$adv.access : int 5 6 6 2 6 7 6 7 6 6$ tut.prof : int 5 6 6 3 7 6 6 7 5 6
$tut.sched : int 4 6 6 2 7 7 7 5 5 6$ val.devel : int 5 6 6 1 7 6 4 7 5 4
$val.meet : int 6 7 7 1 7 6 3 6 5 2$ sat.glad : int 7 7 7 1 7 7 7 7 7 5
$sat.expe : int 7 7 6 1 7 6 7 7 6 4$ loy.proud : int 7 7 7 1 7 7 7 5 6 4
$loy.recom : int 7 7 7 1 7 7 7 6 7 7$ loy.pleas : int 7 7 7 1 7 7 7 7 7 7
$scholarships: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 2 1 1$ job : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 1<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			>﻿</span>


Sometimes, the dataset is not completely visible in wordpress. Hence I’m giving it as an image below.

support, advice, satisfaction and loyalty has multiple variables in the above data set as sup.help, sup.under etc.

Let’s make it as a single variable (mean) for easy analysis.

> #get sing score for support advice satisfaction loyalty
> student_data$support <- apply(student_data[,3:5],1,mean) > summary (student_data$support)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.000   4.417   4.667   4.600   5.000   6.000
> student_data$value <- rowMeans(student_data[,10:11]) > student_data$sat <- rowMeans(student_data[,12:13])
> student_data$loy <- rowMeans(student_data[,14:16])  So we found the mean using apply() and rowMeans(). Those mean values are appended to our original data set student_data. Now, let’s take only 4 variables – gender and the 3 new variables value, sat and loy in a new data set for analysis. > student_data_min <- student_data[,c(2, 20:22)] > head(student_data_min) gender value sat loy 1 female 5.5 7.0 7 2 male 6.5 7.0 7 3 female 6.5 6.5 7 4 male 1.0 1.0 1 5 female 7.0 7.0 7 6 male 6.0 6.5 7  Looks simple and great, isn’t it? • If value for money is good, satisfaction score would be high. • If the customer is satisfied, he would be loyal to the organization. So Loy is our dependent variable DV. sat and value are our independent variables IV. I’m using regression to know how gender influences loyalty. > #DV - loy > #IV - sat, value > loyalty_gender_reln <- lm(loy~gender, data=student_data_min) > summary (loyalty_gender_reln) Call: lm(formula = loy ~ gender, data = student_data_min) Residuals: Min 1Q Median 3Q Max -4.4000 0.0667 0.0667 0.6000 1.6000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.9333 0.7951 8.720 2.34e-05 *** gendermale -1.5333 1.1245 -1.364 0.21 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.778 on 8 degrees of freedom Multiple R-squared: 0.1886, Adjusted R-squared: 0.08717 F-statistic: 1.859 on 1 and 8 DF, p-value: 0.2098 > #R2 is 18%, which says weak relation. So gender does not influence the loyalty.  R-squared value is 0.1886, which is 18.86%, which shows very weak correlation. Hence I’d decide gender doesn’t influence loyalty. Here is the influence of value for money on loyalty. > loyalty_value_reln <- lm(loy~value, data = student_data_min) > summary(loyalty_value_reln) Call: lm(formula = loy ~ value, data = student_data_min) Residuals: Min 1Q Median 3Q Max -2.2182 -0.4953 -0.0403 0.5287 1.9618 Coefficients: Estimate Std. Error t value Pr(<|t|) (Intercept) 2.4901 1.1731 2.123 0.0665 . value 0.7280 0.2181 3.338 0.0103 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.276 on 8 degrees of freedom Multiple R-squared: 0.582, Adjusted R-squared: 0.5298 F-statistic: 11.14 on 1 and 8 DF, p-value: 0.01027 > #58%  Value for money has 58.2% influence on loyalty. Following is the influence of satisfaction against loyalty. > loyalty_sat_reln <- lm (loy~sat, data = student_data_min) > summary(loyalty_sat_reln) Call: lm(formula = loy ~ sat, data = student_data_min) Residuals: Min 1Q Median 3Q Max -1.08586 -0.08586 -0.08586 0.29040 1.21212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.6515 0.6992 0.932 0.379 sat 0.9192 0.1115 8.241 3.53e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6408 on 8 degrees of freedom Multiple R-squared: 0.8946, Adjusted R-squared: 0.8814 F-statistic: 67.91 on 1 and 8 DF, p-value: 3.525e-05 > #89%  Wah, 89.46%. So to keep up our customers, satisfaction should be high. This is the message we read. I wish my beloved Air India should read this post. We are combining everything below. > loyalty_everything <- lm(loy~., data = student_data_min) > summary(loyalty_everything) Call: lm(formula = loy ~ ., data = student_data_min) Residuals: Min 1Q Median 3Q Max -1.01381 -0.28807 -0.01515 0.33286 1.13931 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.66470 1.03039 0.645 0.54273 gendermale -0.01796 0.53076 -0.034 0.97411 value -0.10252 0.23777 -0.431 0.68141 sat 1.00478 0.26160 3.841 0.00855 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7273 on 6 degrees of freedom Multiple R-squared: 0.8982, Adjusted R-squared: 0.8472 F-statistic: 17.64 on 3 and 6 DF, p-value: 0.00222  Really, I don’t know how to read the above value at the moment. I’d update this post (if I don’t forget!) To collate the results and show in a consolidated format, we use screenreg() of rexreg package. > install.packages("texreg") Installing package into ‘D:/gandhari/documents/R/win-library/3.4’ (as ‘lib’ is unspecified) trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/texreg_1.36.23.zip' Content type 'application/zip' length 651831 bytes (636 KB) downloaded 636 KB package ‘texreg’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\pandian\AppData\Local\Temp\Rtmp085gnT\downloaded_packages > library("texreg") Version: 1.36.23 Date: 2017-03-03 Author: Philip Leifeld (University of Glasgow) Please cite the JSS article in your publications -- see citation("texreg"). > library(texreg) > screenreg(list(loyalty_gender_reln, loyalty_value_reln, loyalty_sat_reln, loyalty_everything)) ==================================================== Model 1 Model 2 Model 3 Model 4 ---------------------------------------------------- (Intercept) 6.93 *** 2.49 0.65 0.66 (0.80) (1.17) (0.70) (1.03) gendermale -1.53 -0.02 (1.12) (0.53) value 0.73 * -0.10 (0.22) (0.24) sat 0.92 *** 1.00 ** (0.11) (0.26) ---------------------------------------------------- R^2 0.19 0.58 0.89 0.90 Adj. R^2 0.09 0.53 0.88 0.85 Num. obs. 10 10 10 10 RMSE 1.78 1.28 0.64 0.73 ==================================================== *** p < 0.001, ** p < 0.01, * p < 0.05  So this linear regression post explains the relation between the variables. See you in another post with an interesting topic. # Regression testing in R T-test and ANOVA, are two parametric statistical techniques used to test the hypothesis. When the population means of only two groups is to be compared, the t-test is used, but when means of more than two groups are to be compared, ANOVA is used. Here are my previous posts about ANOVA These T-Test and ANOVA belongs to General Linear Model (GLM) family. So we can compare 2 groups with ANOVA. If we have more than 2 groups, we shall use Regression. We can have multiple IV on DV. ANOVA allows Categorical IV only. But regression allows both Categorical and Continuous data, in addition to multiple IV. DV should be continuous. We need to check correlation before getting into regression. If we do not have regression or poor correlation, lets not think about regression. I have written about correlation in the following posts. While correlation shows is degree of relation (+ve or -ve), regression shows us the correlation and sign of causation. So we are going to estimate DV based on changes to IV. Let’s take the same salary data used in my previous examples. > setwd("d:/gandhari/videos/Advanced Business Analytics/") > sal <-read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 5901.74 4098.26 3 4 2 2 male DOCTORATE Chairman TLM 100000 4247.31 95752.69 20 4 3 3 male DIPLOMA Jr HR JLM 6000 3895.76 2104.24 1 3 4 4 male PG Engineer MLM 15000 9108.36 5891.64 7 2 5 5 female PG Sr Engineer MLM 25000 4269.39 20730.61 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 4137.31 1862.69 1 1 > dim(sal) [1] 50 10 > str(sal) 'data.frame': 50 obs. of 10 variables:$ id                    : int  1 2 3 4 5 6 7 8 9 10 ...
$gender : Factor w/ 2 levels "female","male": 1 2 2 2 1 2 2 1 1 1 ...$ educ                  : Factor w/ 4 levels "DIPLOMA","DOCTORATE",..: 4 2 1 3 3 1 1 3 3 3 ...
$Designation : Factor w/ 6 levels "Chairman","Engineer",..: 4 1 5 2 6 4 3 2 2 2 ...$ Level                 : Factor w/ 3 levels "JLM","MLM","TLM": 1 3 1 2 2 1 1 2 2 2 ...
$Salary : int 10000 100000 6000 15000 25000 6000 8000 13000 14000 16000 ...$ Loan.deduction        : num  5902 4247 3896 9108 4269 ...
$Last.drawn.salary : num 4098 95753 2104 5892 20731 ...$ Pre..Exp              : int  3 20 1 7 12 1 2 7 7 8 ...
$Ratings.by.interviewer: int 4 4 3 2 4 1 4 3 2 4 ... > tail (sal, n=10) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 41 41 male PG Engineer MLM 15000 1741.33 13258.67 7 2 42 42 female PG Sr Engineer MLM 25000 2934.33 22065.67 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 2803.03 3196.97 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 5480.77 2519.23 2 4 45 45 female PG Engineer MLM 13000 1317.26 11682.74 7 3 46 46 female PG Engineer MLM 16000 9927.11 6072.89 8 4 47 47 female UG Jr Engineer JLM 10000 2507.66 7492.34 3 4 48 48 male DOCTORATE Chairman TLM 100000 9684.88 90315.12 20 4 49 49 male DIPLOMA Jr HR JLM 6000 2717.26 3282.74 1 3 50 50 male PG Engineer MLM 15000 4512.12 10487.88 7 2  Hope the preview of the data set I’ve given above makes sense. To predict, we should have two type of data – training data and testing data > salarytrain <-sal[1:35,] > salarytest <- sal[36:50,] > dim (salarytrain) [1] 35 10 > dim (salarytest) [1] 15 10  Let’s run the regression now. > salreg <- lm(Salary~educ, data=salarytrain) > summary(salreg) Call: lm(formula = Salary ~ educ, data = salarytrain) Residuals: Min 1Q Median 3Q Max -4333.3 -2333.3 -727.3 636.4 7666.7 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6727 1128 5.962 1.37e-06 *** educDOCTORATE 93273 2438 38.264 < 2e-16 *** educPG 10606 1432 7.405 2.44e-08 *** educUG 3273 2438 1.343 0.189 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3742 on 31 degrees of freedom Multiple R-squared: 0.9803, Adjusted R-squared: 0.9783 F-statistic: 513.1 on 3 and 31 DF, p-value: <; 2.2e-16  The formula for the prediction is given below. • Y = a + b1 * X1 + c • Y is DV • X1 is IV • a is intercept or baseline or constant • b1 is error value. Let’s substitute the values. Predicted Y = 6727 + 1128 * Education + code R square value is 0.9803, which is 98.03. This is a high level of correlation. 98% influence of explained variance between education and salary. Remaining 2% is unexplained variance. Intercept 6727 is the baseline, which means, a person with no education may get 6727 salary. when he gets 1st level of education, he will get 6727+1128. when he gets 2nd level of education, he will get 6727+(2 x 1128) and so on. We have considered only the education in this example. Plus point of regression is, we shall use more than one IV. In this case, I want to consider years of experience in addition to education. Then my command goes as below. > salexp <- lm(Salary~educ + Pre..Exp, data=salarytrain) > summary(salexp) Call: lm(formula = Salary ~ educ + Pre..Exp, data = salarytrain) Residuals: Min 1Q Median 3Q Max -886.44 -102.30 34.25 78.50 1113.56 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3705.6 162.5 22.80 < 2e-16 *** educDOCTORATE 51977.1 1019.4 50.99 < 2e-16 *** educPG -5330.2 417.5 -12.77 1.17e-13 *** educUG -353.2 327.2 -1.08 0.289 Pre..Exp 2215.9 52.0 42.61 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 485 on 30 degrees of freedom Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996 F-statistic: 2.336e+04 on 4 and 30 DF, p-value: < 2.2e-16  This time I get 99.97% influence of education and experience in deciding someones salary. If you see the signs of estimate, Education UG or PG does not make a big difference. But previous experience and DOCTORATE surge our R square value. If R square score is low, your correlation is weak. Do not use prediction in this case or search for right IVs. Let’s predict the salary now. > salpred <- predict(salreg, salarytest) > salpred 36 37 38 39 40 41 42 43 44 45 46 17333.333 17333.333 10000.000 100000.000 6727.273 17333.333 17333.333 6727.273 6727.273 17333.333 17333.333 47 48 49 50 10000.000 100000.000 6727.273 17333.333  So this is the prediction of salaries from rows 36 to 50. Let’s use cbind for better understanding. What you see as Salary is the actual salary. What you see under salpred is predicted salary. In some cases, the prediction is close, in some cases, it is far. So difference between actual salary (actual Y) and predicted salary (predicted Y) is called residual. Residual should be lower to have better prediction. > cbind(salarytest, salpred) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer salpred 36 36 female PG Engineer MLM 14000 716.48 13283.52 7 2 17333.333 37 37 female PG Engineer MLM 16000 6595.95 9404.05 8 4 17333.333 38 38 female UG Jr Engineer JLM 10000 5433.07 4566.93 3 4 10000.000 39 39 male DOCTORATE Chairman TLM 100000 9028.68 90971.32 20 4 100000.000 40 40 male DIPLOMA Jr HR JLM 6000 794.66 5205.34 1 3 6727.273 41 41 male PG Engineer MLM 15000 1741.33 13258.67 7 2 17333.333 42 42 female PG Sr Engineer MLM 25000 2934.33 22065.67 12 4 17333.333 43 43 male DIPLOMA Jr Engineer JLM 6000 2803.03 3196.97 1 1 6727.273 44 44 male DIPLOMA Jr Associate JLM 8000 5480.77 2519.23 2 4 6727.273 45 45 female PG Engineer MLM 13000 1317.26 11682.74 7 3 17333.333 46 46 female PG Engineer MLM 16000 9927.11 6072.89 8 4 17333.333 47 47 female UG Jr Engineer JLM 10000 2507.66 7492.34 3 4 10000.000 48 48 male DOCTORATE Chairman TLM 100000 9684.88 90315.12 20 4 100000.000 49 49 male DIPLOMA Jr HR JLM 6000 2717.26 3282.74 1 3 6727.273 50 50 male PG Engineer MLM 15000 4512.12 10487.88 7 2 17333.333  See you in another interesting post. # Multiple ANOVA, Post hoc test using R I have written about how to run the ANOVA test in my previous post Analysis of Variance ANOVA using R. We analyzed the salary difference between different level of education. For ease of (my!) understanding, I would take the same data set in this post as well. So here is the same data set. > sal id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 8 female PG Engineer MLM 13000 13000 7 3 9 9 female PG Engineer MLM 14000 14000 7 2 10 10 female PG Engineer MLM 16000 16000 8 4 11 11 female UG Jr Engineer JLM 10000 1000 3 4 12 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 14 male PG Engineer MLM 15000 15000 7 2 15 15 female PG Sr Engineer MLM 25000 25000 12 4 16 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 18 female PG Engineer MLM 13000 13000 7 3 19 19 female PG Engineer MLM 14000 14000 7 2 20 20 female PG Engineer MLM 16000 16000 8 4 21 21 female PG Sr Engineer MLM 25000 25000 12 4 22 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 24 female PG Engineer MLM 13000 13000 7 3 25 25 female PG Engineer MLM 14000 14000 7 2 26 26 female PG Engineer MLM 16000 16000 8 4 27 27 female UG Jr Engineer JLM 10000 1000 3 4 28 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 30 male PG Engineer MLM 15000 15000 7 2 31 31 female PG Sr Engineer MLM 25000 25000 12 4 32 32 female PG Sr Engineer MLM 25000 25000 12 4 33 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 35 female PG Engineer MLM 13000 13000 7 3 36 36 female PG Engineer MLM 14000 14000 7 2 37 37 female PG Engineer MLM 16000 16000 8 4 38 38 female UG Jr Engineer JLM 10000 1000 3 4 39 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 41 male PG Engineer MLM 15000 15000 7 2 42 42 female PG Sr Engineer MLM 25000 25000 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 45 female PG Engineer MLM 13000 13000 7 3 46 46 female PG Engineer MLM 16000 16000 8 4 47 47 female UG Jr Engineer JLM 10000 1000 3 4 48 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 50 male PG Engineer MLM 15000 15000 7 2  We have already executed ANOVA test. Following is the output. > aov1 <-aov(Salary~educ, data=sal) > aov1 Call: aov(formula = Salary ~ educ, data = sal) Terms: educ Residuals Sum of Squares 35270186667 538293333 Deg. of Freedom 3 46 Residual standard error: 3420.823 Estimated effects may be unbalanced > summary(aov1) Df Sum Sq Mean Sq F value Pr(>F) educ 3 3.527e+10 1.176e+10 1005 <2e-16 *** Residuals 46 5.383e+08 1.170e+07 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  ### Variance between groups What we get above is the overall significant difference between DV (salary) and IV (Education) To compare the difference between the group, we use Post hoc test. We shall use TukeyHSD() for this. We need to go for this approach, only if the anova is significant. If anova is not significant, there is no need for posthoc. We’d see how to run get the variances across the groups in this post. > tukey <- TukeyHSD(aov1) > tukey Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = Salary ~ educ, data = sal)$educ
DOCTORATE-DIPLOMA  93333.333  88624.720  98041.947 0.0000000
PG-DIPLOMA         10373.333   7395.345  13351.322 0.0000000
UG-DIPLOMA          3333.333  -1375.280   8041.947 0.2477298
PG-DOCTORATE      -82960.000 -87426.983 -78493.017 0.0000000
UG-DOCTORATE      -90000.000 -95766.850 -84233.150 0.0000000
UG-PG              -7040.000 -11506.983  -2573.017 0.0006777

• diff – mean difference between education level
• lwr – lower mean
• upr – upper mean

If signs between lwr and upr are same, irrelevant of + or -, that denotes significant difference.

When you compare diploma (lower degree) with doctorate (higher degree), the difference would be +ve and vice versa. If you just want to see the difference, + or – is not significant.

Let’s plot this in a graph.

> plot(tukey)


0 is the mid point. So, anything near 0 do not have significant difference.

From the top, first plot is for the comparison between DOCTORATE-DIPLOMA. You would see a high positive difference. If you see the plot for UG-DOCTORATE, is it second highest difference, but this is negative difference. Anything near 0 like UG-DIPLOMA, does not have significant difference.

### ANOVA between multiple variables

We received a new data set from company now, which has a new column Loan.deducation. Last.drawn.salary changes with respect to his loans.

> sal <- read.csv("sal.csv")
id gender      educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer
1  1 female        UG Jr Engineer   JLM  10000        5901.74           4098.26        3                      4
2  2   male DOCTORATE    Chairman   TLM 100000        4247.31          95752.69       20                      4
3  3   male   DIPLOMA       Jr HR   JLM   6000        3895.76           2104.24        1                      3
4  4   male        PG    Engineer   MLM  15000        9108.36           5891.64        7                      2
5  5 female        PG Sr Engineer   MLM  25000        4269.39          20730.61       12                      4
6  6   male   DIPLOMA Jr Engineer   JLM   6000        4137.31           1862.69        1                      1


Company wants to see the differences among Salary (column 6), Loan.deduction (column 7) and Last.drawn.salary (column 8). We combine apply and anova as given below.

> aovset <- apply(sal[,6:8], 2, function(x)aov(x~educ, data = sal))

• sal[,6:8] takes all rows of columns 6, 7 and 8
• aov is our function

Following is the variance between education and Last.drawn.salary.

> summary(aovset$Last.drawn.salary) Df Sum Sq Mean Sq F value Pr(>F) educ 3 3.342e+10 1.114e+10 674.2 <2e-16 *** Residuals 46 7.602e+08 1.653e+07 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  F value is 674, which means, the change is significant. Following would be more interesting. > summary(aovset$Loan.deduction)
Df    Sum Sq Mean Sq F value Pr(>F)
educ         3  25577616 8525872    1.14  0.343
Residuals   46 343898395 7476052


F value for Loan.deduction is lesser than 4. So, there is no change in the deductions between different education level.

See you in another interesting post.

# Analysis of Variance ANOVA using R

I have written about different types of hypothesis testing in my previous posts.

This post discuss about how to calculate ANOVA using R. This post will not talk about anova, as I have written this in One way analysis of variance and Calculating Anova with LibreOffice Calc

So this is also similar to t test. The difference exists in in the type of variables used. As given in Testing of Independence – Chi Square test – Manual, LibreOffice, R blog post, T test would be used when two variables are categorical. In addition, T test uses only two groups.

We can use categorical (IV) and continuous DV variables in Anova. We may use more than 2 variables.

An increase in number of schools, leads to increase in educational expense of the Government. We call this as reasonable change.

An increase in number of schools, leads to increase school fees of each family. This is unfortunate, right. So this is unreasonable change.

So reasonable change is a change in independent variable has a change in dependent variable.

An unreasonable change is a change in dependent variable with no change in independent variable.

We would care about f ratio here.

f ratio = ratio between explained and unexplained variable.

f ratio = explained variance/unexplained variance

If f > 4, the variables have significant difference. Before making a decision, refer to p value.

Let me take the same data set used in my previous posts.

> sal
id gender      educ  Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1   1 female        UG  Jr Engineer   JLM  10000              1000        3                      4
2   2   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
3   3   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
4   4   male        PG     Engineer   MLM  15000             15000        7                      2
5   5 female        PG  Sr Engineer   MLM  25000             25000       12                      4
6   6   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
7   7   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
8   8 female        PG     Engineer   MLM  13000             13000        7                      3
9   9 female        PG     Engineer   MLM  14000             14000        7                      2
10 10 female        PG     Engineer   MLM  16000             16000        8                      4
11 11 female        UG  Jr Engineer   JLM  10000              1000        3                      4
12 12   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
13 13   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
14 14   male        PG     Engineer   MLM  15000             15000        7                      2
15 15 female        PG  Sr Engineer   MLM  25000             25000       12                      4
16 16   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
17 17   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
18 18 female        PG     Engineer   MLM  13000             13000        7                      3
19 19 female        PG     Engineer   MLM  14000             14000        7                      2
20 20 female        PG     Engineer   MLM  16000             16000        8                      4
21 21 female        PG  Sr Engineer   MLM  25000             25000       12                      4
22 22   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
23 23   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
24 24 female        PG     Engineer   MLM  13000             13000        7                      3
25 25 female        PG     Engineer   MLM  14000             14000        7                      2
26 26 female        PG     Engineer   MLM  16000             16000        8                      4
27 27 female        UG  Jr Engineer   JLM  10000              1000        3                      4
28 28   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
29 29   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
30 30   male        PG     Engineer   MLM  15000             15000        7                      2
31 31 female        PG  Sr Engineer   MLM  25000             25000       12                      4
32 32 female        PG  Sr Engineer   MLM  25000             25000       12                      4
33 33   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
34 34   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
35 35 female        PG     Engineer   MLM  13000             13000        7                      3
36 36 female        PG     Engineer   MLM  14000             14000        7                      2
37 37 female        PG     Engineer   MLM  16000             16000        8                      4
38 38 female        UG  Jr Engineer   JLM  10000              1000        3                      4
39 39   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
40 40   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
41 41   male        PG     Engineer   MLM  15000             15000        7                      2
42 42 female        PG  Sr Engineer   MLM  25000             25000       12                      4
43 43   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
44 44   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
45 45 female        PG     Engineer   MLM  13000             13000        7                      3
46 46 female        PG     Engineer   MLM  16000             16000        8                      4
47 47 female        UG  Jr Engineer   JLM  10000              1000        3                      4
48 48   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
49 49   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
50 50   male        PG     Engineer   MLM  15000             15000        7                      2


Lets take education qualification and salary drawn. Usually higher the educational qualification, higher the salary.

> aggregate(Salary~educ, mean, data=sal)
educ     Salary
1   DIPLOMA   6666.667
2 DOCTORATE 100000.000
3        PG  17040.000
4        UG  10000.000


Obviously there is change. When diploma holder gets 6666 rupees, a doctorate gets 100000. Let’s check the analysis of variance aov() now.

> aov1 <- aov(Salary~educ, data=sal)
> aov1
Call:
aov(formula = Salary ~ educ, data = sal)

Terms:
educ   Residuals
Sum of Squares  35270186667   538293333
Deg. of Freedom           3          46

Residual standard error: 3420.823
Estimated effects may be unbalanced


Let’s look at the mean salary and anova summary below

> aggregate(Salary~educ, mean, data=sal)
educ     Salary
1   DIPLOMA   6666.667
2 DOCTORATE 100000.000
3        PG  17040.000
4        UG  10000.000

> summary(aov1)
Df    Sum Sq   Mean Sq F value Pr(>F)
educ         3 3.527e+10 1.176e+10    1005 <2e-16 ***
Residuals   46 5.383e+08 1.170e+07
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


F value is 1005. Our threshold is 4. We are getting very high value.
p value *** is <0.001. It is lesser than standard alpha 5% which is 0.05.

So change is salary is significant with the change in education. This data has significant difference.

The first row educ indicates explained variance. The second row, residuals, indicates unexplained variance.

The *** indicates that our model is not only significant for 0.05, but it is significant even at 0.001. Out of 1000, we may reject 1 time.

# Testing of difference – T test using R

Hi,

I have written about a hypothesis testing of independence in my previous post Testing of Independence – Chi Square test – Manual, LibreOffice, R. This post talks about testing of mean difference.

Lets take the same salary data set used in my previous post.

> sal
id gender      educ  Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1   1 female        UG  Jr Engineer   JLM  10000              1000        3                      4
2   2   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
3   3   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
4   4   male        PG     Engineer   MLM  15000             15000        7                      2
5   5 female        PG  Sr Engineer   MLM  25000             25000       12                      4
6   6   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
7   7   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
8   8 female        PG     Engineer   MLM  13000             13000        7                      3
9   9 female        PG     Engineer   MLM  14000             14000        7                      2
10 10 female        PG     Engineer   MLM  16000             16000        8                      4
11 11 female        UG  Jr Engineer   JLM  10000              1000        3                      4
12 12   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
13 13   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
14 14   male        PG     Engineer   MLM  15000             15000        7                      2
15 15 female        PG  Sr Engineer   MLM  25000             25000       12                      4
16 16   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
17 17   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
18 18 female        PG     Engineer   MLM  13000             13000        7                      3
19 19 female        PG     Engineer   MLM  14000             14000        7                      2
20 20 female        PG     Engineer   MLM  16000             16000        8                      4
21 21 female        PG  Sr Engineer   MLM  25000             25000       12                      4
22 22   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
23 23   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
24 24 female        PG     Engineer   MLM  13000             13000        7                      3
25 25 female        PG     Engineer   MLM  14000             14000        7                      2
26 26 female        PG     Engineer   MLM  16000             16000        8                      4
27 27 female        UG  Jr Engineer   JLM  10000              1000        3                      4
28 28   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
29 29   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
30 30   male        PG     Engineer   MLM  15000             15000        7                      2
31 31 female        PG  Sr Engineer   MLM  25000             25000       12                      4
32 32 female        PG  Sr Engineer   MLM  25000             25000       12                      4
33 33   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
34 34   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
35 35 female        PG     Engineer   MLM  13000             13000        7                      3
36 36 female        PG     Engineer   MLM  14000             14000        7                      2
37 37 female        PG     Engineer   MLM  16000             16000        8                      4
38 38 female        UG  Jr Engineer   JLM  10000              1000        3                      4
39 39   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
40 40   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
41 41   male        PG     Engineer   MLM  15000             15000        7                      2
42 42 female        PG  Sr Engineer   MLM  25000             25000       12                      4
43 43   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
44 44   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
45 45 female        PG     Engineer   MLM  13000             13000        7                      3
46 46 female        PG     Engineer   MLM  16000             16000        8                      4
47 47 female        UG  Jr Engineer   JLM  10000              1000        3                      4
48 48   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
49 49   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
50 50   male        PG     Engineer   MLM  15000             15000        7                      2


Let’s find out the difference of mean salary between male and female.

> aggregate(Salary~gender, mean, data=sal)
gender Salary
1 female  16040
2   male  27000


Mean salary of female μ0 = 16040
Mean salary of female μ1 = 27000
The symbol ~ differentiates between dependent and independent variables

Obviously there is a difference. Let’s see what a t-test in R shows us.

> t.test(Salary~gender, data = sal)

Welch Two Sample t-test

data:  Salary by gender
t = -1.4494, df = 25.039, p-value = 0.1596
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-26532.252   4612.252
sample estimates:
mean in group female   mean in group male
16040                27000


How to interpret this result?

The p-value is compared with the desired significance level of our test and, if it is smaller, the result is significant. That is, if the null hypothesis were to be rejected at the 5% significance level, this would be reported as “p < 0.05″. Small p-values suggest that the null hypothesis is unlikely to be true. But our p-value 0.15 > 0.05. Hence null hypothesis is rejected and alternate hypothesis is accepted. There is a difference in salary between both genders.

Higher the t value (ignore the sign), higher the difference.

# Testing of Independence – Chi Square test – Manual, LibreOffice, R

Hi,

I have written about testing of hypothesis in my earlier posts

Statisticians recommended right testing approaches for different type of data.

When we have –

• both data as categorical, we shall use Chi Square Test
• Continuous and Continuous data, we shall use correlation
• Categorical and Continuous data, we shall use t test or anova.

In this post, I’d be using the below given data set.

   id gender      educ  Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1 female        UG  Jr Engineer   JLM  10000              1000        3                      4
2   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
3   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
4   male        PG     Engineer   MLM  15000             15000        7                      2
5 female        PG  Sr Engineer   MLM  25000             25000       12                      4
6   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
7   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
8 female        PG     Engineer   MLM  13000             13000        7                      3
9 female        PG     Engineer   MLM  14000             14000        7                      2
10 female        PG     Engineer   MLM  16000             16000        8                      4
11 female        UG  Jr Engineer   JLM  10000              1000        3                      4
12   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
13   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
14   male        PG     Engineer   MLM  15000             15000        7                      2
15 female        PG  Sr Engineer   MLM  25000             25000       12                      4
16   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
17   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
18 female        PG     Engineer   MLM  13000             13000        7                      3
19 female        PG     Engineer   MLM  14000             14000        7                      2
20 female        PG     Engineer   MLM  16000             16000        8                      4
21 female        PG  Sr Engineer   MLM  25000             25000       12                      4
22   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
23   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
24 female        PG     Engineer   MLM  13000             13000        7                      3
25 female        PG     Engineer   MLM  14000             14000        7                      2
26 female        PG     Engineer   MLM  16000             16000        8                      4
27 female        UG  Jr Engineer   JLM  10000              1000        3                      4
28   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
29   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
30   male        PG     Engineer   MLM  15000             15000        7                      2
31 female        PG  Sr Engineer   MLM  25000             25000       12                      4
32 female        PG  Sr Engineer   MLM  25000             25000       12                      4
33   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
34   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
35 female        PG     Engineer   MLM  13000             13000        7                      3
36 female        PG     Engineer   MLM  14000             14000        7                      2
37 female        PG     Engineer   MLM  16000             16000        8                      4
38 female        UG  Jr Engineer   JLM  10000              1000        3                      4
39   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
40   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
41   male        PG     Engineer   MLM  15000             15000        7                      2
42 female        PG  Sr Engineer   MLM  25000             25000       12                      4
43   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
44   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
45 female        PG     Engineer   MLM  13000             13000        7                      3
46 female        PG     Engineer   MLM  16000             16000        8                      4
47 female        UG  Jr Engineer   JLM  10000              1000        3                      4
48   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
49   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
50   male        PG     Engineer   MLM  15000             15000        7                      2


We shall use chi square test for two types of hypothesis testing

• test of independence of variables
• test goodness of fit

### Testing of independence

We can find out the association between two (at least) categorical variables. Higher the chi square value, better the result is. We shall use this to test our hypothesis.

### Goodness of fit

When we use chi square test to find the goodness of fit, we shall use 2 categorical variables. higher the chi square value, better the result is. We shall use this to test BLR, SEM tests.

### Example for Testing of independence

This post talks about testing of independence. We have employee data given above. Following are my hypothesis.

H0 = Number of female employees and level of management are not related.

H1 = Number of female employees and level of management are related.

We would solve this using three methods

1. Manual way of chi square test
2. Chi square test with LibreOffice Calc
3. Chi square test with R

#### Manual way of chi square test

We prepare the count of female employees in each level as given below. I have used COUNTIFS() function of LibreOffice.

Calculate the row (highlighted in pink colour) and column sums (blue colour) and summation of all row sums (saffron colour).

The values are called observed values. We shall find out the expected values as well easily as given below.

Expected value = column sum x row sum/sum of rowsum

=J15*N12/N15 = 25 x 20/50 = 10

Finally our table looks like this.

All the observed values (O), Expected values (E) are substituted in the below table. We calculate the Chi square value χ2 which is 19.

 O E O-E (O-E)2 (O-E)2/E 5 10 -5 25 2.5 20 12.5 7.5 56.25 4.5 0 2.5 -2.5 6.25 2.5 15 10 5 25 2.5 5 12.5 -7.5 56.25 4.5 5 2.5 2.5 6.25 2.5 χ2 19

Level of significance or Type 1 error = 5%, which is 0.05

Degrees of freedom = (row count – 1) x (column count – 1) = 2

Critical value of χ2 is 5.991, which is looked up using the level of significance and degrees of freedom in the below given table.

Make a decision

To accept our null hypothesis H0, calculated χ2 < critical χ2.
Our calculated χ2 = 19
Our critical χ2 = 5.991

Hence, we reject null hypothesis and accept alternate hypothesis.

You may watch the following video to understand the above calculation.

#### Chi square test with LibreOffice Calc

We have already found out the frequency distribution of females and males per each management level. Let’s use the same.

Select Data>Statistics>Chi-square Test

Choose the input cells

Select the Output Cell

Finally my selections are given as below

After pressing OK, We get the following result

Make a decision

If pα reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p 0.00007485 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted.

#### Chi square test with R

I have the data set stored as sal.csv file. I’m importing it and store to sal object.

> setwd("d:/gandhari/videos/Advanced Business Analytics/")
id gender      educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1  1 female        UG Jr Engineer   JLM  10000              1000        3                      4
2  2   male DOCTORATE    Chairman   TLM 100000            100000       20                      4
3  3   male   DIPLOMA       Jr HR   JLM   6000              6000        1                      3
4  4   male        PG    Engineer   MLM  15000             15000        7                      2
5  5 female        PG Sr Engineer   MLM  25000             25000       12                      4
6  6   male   DIPLOMA Jr Engineer   JLM   6000              8000        1                      1


As I wrote in Exploring data files with R I create a Frequency Distribution table using table() function.

> gender_level_table <- table(sal$Level, sal$gender)
> gender_level_table

female male
JLM      5   15
MLM     20    5
TLM      0    5


Use chisq.test() function with gender_level_table as its input, to run the chi square test

> chisq.test(gender_level_table)

Pearson's Chi-squared test

data:  gender_level_table
X-squared = 19, df = 2, p-value = 7.485e-05

Warning message:
In chisq.test(gender_level_table) :
Chi-squared approximation may be incorrect


Make a decision

If pα reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p 7.485e-05 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted.

See you in another interesting post. Happy Sunday.

# Looping with apply commands in R

After a long post Exploring data files with R, this is the time to get into looping. Instead of looping statements like while, for etc, we shall apply command in R.

Let’s take the mtcars data set available in R.

> mtcars
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


My aim is to find the mean of the above data. I have already written about summary() in my previous post. It gives the min, max, mean, median for each variables of mtcars.

> summary(mtcars)
mpg             cyl             disp             hp             drat
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
wt             qsec             vs               am              gear
Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000
1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000
Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000
Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688
3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000
Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000
carb
Min.   :1.000
1st Qu.:2.000
Median :2.000
Mean   :2.812
3rd Qu.:4.000
Max.   :8.000


### Apply

To find the mean of all variables, we need to do a looping across all rows of mtcars, which is performed using apply() command.

> apply(mtcars, 2, mean)
mpg        cyl       disp         hp       drat         wt       qsec
20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
vs         am       gear       carb
0.437500   0.406250   3.687500   2.812500


Arguments:

1. mtcars – dataset
2. 2 – row or column wise calculation. 1 means row, 2 means column
3. mean – function

Similar task shall be accomplished using colMeans(), rowMeans().

> rowMeans(mtcars)
Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
29.90727            29.98136            23.59818            38.73955
Hornet Sportabout             Valiant          Duster 360           Merc 240D
53.66455            35.04909            59.72000            24.63455
Merc 230            Merc 280           Merc 280C          Merc 450SE
27.23364            31.86000            31.78727            46.43091
Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
46.50000            46.35000            66.23273            66.05855
Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
65.97227            19.44091            17.74227            18.81409
Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
24.88864            47.24091            46.00773            58.75273
Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
57.37955            18.92864            24.77909            24.88027
Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
60.97182            34.50818            63.15545            26.26273
> colMeans(mtcars)
mpg        cyl       disp         hp       drat         wt       qsec
20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
vs         am       gear       carb
0.437500   0.406250   3.687500   2.812500


But these row or column commands do not have all functions like sd(),scale() etc which is possible with apply command. Lets take a small dataset.

> mtcars5by5 <- mtcars[1:5, 1:5]
> mtcars5by5
mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15


For the above data set, below given is the row wise and column wise sum.

> mtcars5by5$total <- apply(mtcars5by5, 1, sum) > mtcars5by5$total
[1] 300.90 300.90 231.65 398.48 564.85
> #total is added a new variable in our data set
> mtcars5by5
mpg cyl disp  hp drat  total
Mazda RX4         21.0   6  160 110 3.90 300.90
Mazda RX4 Wag     21.0   6  160 110 3.90 300.90
Datsun 710        22.8   4  108  93 3.85 231.65
Hornet 4 Drive    21.4   6  258 110 3.08 398.48
Hornet Sportabout 18.7   8  360 175 3.15 564.85
> #column sum
> apply(mtcars5by5, 2, sum)
mpg     cyl    disp      hp    drat   total
104.90   30.00 1046.00  598.00   17.88 1796.78


### Transform

Transform() helps us to prepare data. Using this, we shall create n number of new variables.

> transform(mtcars5by5,tot=sum(mtcars5by5[,1:5]),mtper=mpg/drat,ntprod=mpg/hp)
mpg cyl disp  hp drat  total     tot    mtper    ntprod
Mazda RX4         21.0   6  160 110 3.90 300.90 1796.78 5.384615 0.1909091
Mazda RX4 Wag     21.0   6  160 110 3.90 300.90 1796.78 5.384615 0.1909091
Datsun 710        22.8   4  108  93 3.85 231.65 1796.78 5.922078 0.2451613
Hornet 4 Drive    21.4   6  258 110 3.08 398.48 1796.78 6.948052 0.1945455
Hornet Sportabout 18.7   8  360 175 3.15 564.85 1796.78 5.936508 0.1068571


I have added new variables tot, mtper, ntprod above.

### lapply

This help us to issue a function over a list. It loops over a list and evaluate a function on each element

> lapply(mtcars5by5, mean)
$mpg [1] 20.98$cyl
[1] 6

$disp [1] 209.2$hp
[1] 119.6

$drat [1] 3.576$total
[1] 359.356


It went through the complete list to provide the mean of each variable.

### tapply

tapply() helps us to apply the function in a ragged array or a subset of vector.

> tapply(mtcars5by5$mpg, mtcars5by5$cyl, mean)
4        6        8
22.80000 21.13333 18.70000


Consider our mtcars data set. I need to find out mean and maximum horse power hp grouped by different gears

> mtcars
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
> ctmean
v        s
automatic 15.05 20.74286
manual    19.75 28.37143
> tapply(mtcars$hp, mtcars$gear, mean)
3        4        5
176.1333  89.5000 195.6000
> tapply(mtcars$hp, mtcars$gear, max)
3   4   5
245 123 335


We got mean horse power for 3, 4 and 5 gears.

We may even provide a list to group the mean operation. In the below given example, we shall calculate the mean for different transmission model am (0 – automatic; 1 – manual) and v/s.

> list(mtcars$am,mtcars$vs)
[[1]]
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1

[[2]]
[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1

> ctmean <- tapply(mtcars$hp, list(mtcars$am,mtcars$vs), mean) > rownames(ctmean) <- c("automatic", "manual") > colnames(ctmean) <- c("v", "s") > ctmean v s automatic 194.1667 102.14286 manual 180.8333 80.57143  I think I shall stop here. See you in another interesting post. Have a leisurely weekend. # Exploring data files with R I have written about data types and data structures of R in my previous post Working with data types of R. We shall explore a data set in this post. mtcars is a dataset exists in R already > mtcars mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2  ### File structure Very first question would be about the size of the data set. > dim(mtcars) [1] 32 11  It contains 32 rows and 11 columns. Now, how many variables we have in mtcars? > names(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"  We shall preview the data using head and tail commands. > head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > tail(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2  Similar to unix tail, head commands, you would see first and last 6 records in R. This is the time to know about the structure of the data set. > str(mtcars) 'data.frame': 32 obs. of 11 variables:$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$cyl : num 6 6 4 6 8 6 8 4 4 6 ...$ disp: num  160 160 108 258 360 ...
$hp : num 110 110 93 110 175 105 245 62 95 123 ...$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$wt : num 2.62 2.88 2.32 3.21 3.44 ...$ qsec: num  16.5 17 18.6 19.4 17 ...
$vs : num 0 0 1 1 0 1 0 1 1 1 ...$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
$gear: num 4 4 4 3 3 3 3 4 4 4 ...$ carb: num  4 4 1 1 2 1 4 2 2 4 ...


So, this is a data frame. Each variables are explained above.

How the data is being stored?

> mode(mtcars)
[1] "list"


It is stored as a list.

Let’s take another dataset available with R – airquality.

> head(airquality, n=10)
Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10


You may parameterize head commands as I have shown above.

Let’s omit the records with NA.

> aqNoNA=na.omit(airquality)
> dim(aqNoNA)
[1] 111   6
> dim(airquality)
[1] 153   6


So, out new object aqNoNA contains the records without missing cases, totally 111 rows.

We have is.na command to check if the data is NA.

> is.na(airquality)
Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE
> sum(is.na(airquality))
[1] 44


So totally 44 NAs found.

Summary command gives us minimum, quartile 1, median, mean, 3rd quartile, maximum and number of NAa.

> summary(airquality)
Ozone           Solar.R           Wind             Temp           Month
Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
NA's   :37       NA's   :7
Day
Min.   : 1.0
1st Qu.: 8.0
Median :16.0
Mean   :15.8
3rd Qu.:23.0
Max.   :31.0


1st quartile meant of 24% data

2nd quartile meant of 50 percentile of data

3rd quartile meant of 75 percentile of data

We shall filter the summary commands as given below.

> summary(airquality$Ozone) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 1.00 18.00 31.50 42.13 63.25 168.00 37  We have given the summary of only one variable Ozone above. We shall apply other mathematical functions as given below > sd(aqNoNA$Ozone)
[1] 33.27597


We have calculated the standard deviation for Ozone above. I have taken the data without NA, as my sd operation will fail if I have NA.

### Finding the Outlier

> myNumbers <- rep(c(1,4,10, 100), c(5, 5, 5, 1))
> myNumbers
[1]   1   1   1   1   1   4   4   4   4   4  10  10  10  10  10 100
> mean(myNumbers)
[1] 10.9375
> sd(myNumbers)
[1] 24.04293
> scale(myNumbers)
[,1]
[1,] -0.41332316
[2,] -0.41332316
[3,] -0.41332316
[4,] -0.41332316
[5,] -0.41332316
[6,] -0.28854636
[7,] -0.28854636
[8,] -0.28854636
[9,] -0.28854636
[10,] -0.28854636
[11,] -0.03899275
[12,] -0.03899275
[13,] -0.03899275
[14,] -0.03899275
[15,] -0.03899275
[16,]  3.70431136
attr(,"scaled:center")
[1] 10.9375
attr(,"scaled:scale")
[1] 24.04293


scaling of the metrics (z score) is calculated using the formula

each sample data – mean / standard deviation

After scaling, if you find any values is greater than ±2, they are called outliers (odd points located away from the central measures). If SD > mean, it is outlier.

Other way to test the normality is Shapiro test

> shapiro.test(myNumbers)

Shapiro-Wilk normality test

data:  myNumbers
W = 0.40055, p-value = 3.532e-07


If the p-value is greater than 0.05, it is normalized data. Here it is not.

### Data Analysis

Here comes another interesting part. How to analyze the data, after you upload your files.

To explain this, I’d prepare sample data set first.

> #Employee ID
> sn <- seq(1, 10, 1)
> sn
[1]  1  2  3  4  5  6  7  8  9 10
> #Employee gender
> gender <- rep(c("male", "female"), c(6,4))
> gender
[1] "male"   "male"   "male"   "male"   "male"   "male"   "female" "female" "female"
[10] "female"
> #available departments
> dept <- rep(c("Admin", "HR", "Prod", "Contractor"), c(1, 3, 3, 3))
> dept
[1] "Admin"      "HR"         "HR"         "HR"         "Prod"       "Prod"
[7] "Prod"       "Contractor" "Contractor" "Contractor"
> #Employee Salary
> sal <- rnorm(10, 1000, 200);
> sal
[1]  888.2026  876.6272  919.7453 1005.9058 1084.4704 1337.7696  909.4302  801.1482
[9] 1025.9457 1182.9774
> #Now our data set
> mydataset <- data.frame(sn, gender, sal, dept)
> mydataset
sn gender       sal       dept
2   2   male  876.6272         HR
3   3   male  919.7453         HR
4   4   male 1005.9058         HR
5   5   male 1084.4704       Prod
6   6   male 1337.7696       Prod
7   7 female  909.4302       Prod
8   8 female  801.1482 Contractor
9   9 female 1025.9457 Contractor
10 10 female 1182.9774 Contractor


So we have serial number, gender, salary and department.

Which gender is majority in the given data? we shall use table function to arrive at a simple frequency distribution table.

> #which gender is more in each dept
> #this is frequency distribution
> table(mydataset$gender) female male 4 6  So we have 4 females and 6 males. Let’s group the above FD by department now. > table(mydataset$gender, mydataset$dept) Admin Contractor HR Prod female 0 3 0 1 male 1 0 3 2 > #assign the results to an object > freqDis <- table(mydataset$gender, mydataset$dept); > #transpose the table > t(freqDis) female male Admin 0 1 Contractor 3 0 HR 0 3 Prod 1 2  So all departments except contractors, have male as majority. The function t() stands for transpose. Let’s do a proportion of the data using prop.table() > #proportion > prop.table(freqDis) Admin Contractor HR Prod female 0.0 0.3 0.0 0.1 male 0.1 0.0 0.3 0.2  Rather than proportion, % of males and females would give us a better visibility. hence we multiply proportion by 100. > #Percentage > prop.table(freqDis)*100 Admin Contractor HR Prod female 0 30 0 10 male 10 0 30 20  Column sum and row sums are frequently asked in our day today life. > #sum > colSums(freqDis) Admin Contractor HR Prod 1 3 3 3  Row sum shall be calculated as below. > rowSums(freqDis) female male 4 6  Salary is interesting part in our profession. I’d like to see who earns more using aggregate() function? Male or Female? > #who earns more - male or female? > aggregate(sal~gender, mean, data = mydataset) gender sal 1 female 979.8754 2 male 1018.7868 > #sal is continuous variable > #gender is categorical variable > #mean is the function > #data is our data source  We used mean salary above. Lets use sum now. > aggregate(sal~gender, sum, data = mydataset) gender sal 1 female 3919.501 2 male 6112.721  Similarly, we use standard deviation. > aggregate(sal~gender, sd, data = mydataset) gender sal 1 female 163.5837 2 male 175.1006  So salary package for females looks to be more consistent than that of males. We calculated all the above functions individually. psych package helps to calculate everything in one command. > install.packages("psych") Installing package into ‘D:/gandhari/documents/R/win-library/3.4’ (as ‘lib’ is unspecified) trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/psych_1.7.5.zip' Content type 'application/zip' length 3966969 bytes (3.8 MB) downloaded 3.8 MB package ‘psych’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\pandian\AppData\Local\Temp\Rtmpy2L0Yq\downloaded_packages > library(psych) > describe (mydataset) vars n mean sd median trimmed mad min max range skew sn 1 10 5.50 3.03 5.50 5.50 3.71 1.00 10.00 9.00 0.00 gender* 2 10 1.60 0.52 2.00 1.62 0.00 1.00 2.00 1.00 -0.35 sal 3 10 1003.22 162.35 962.83 986.66 119.22 801.15 1337.77 536.62 0.71 dept* 4 10 2.80 1.03 3.00 2.88 1.48 1.00 4.00 3.00 -0.20 kurtosis se sn -1.56 0.96 gender* -2.05 0.16 sal -0.72 51.34 dept* -1.42 0.33  n is total samples, sd is standard deviation etc. All the aggregated functions are calculated for serial numbmer, gender, salary and department. In the above data, I get only the salary data, which is in 3rd row. > describe (mydataset[,3]) vars n mean sd median trimmed mad min max range skew kurtosis X1 1 10 1003.22 162.35 962.83 986.66 119.22 801.15 1337.77 536.62 0.71 -0.72 se X1 51.34  Lets group the above described data by gender, ie, mean, sd for males and females separately. > describe.by(mydataset, mydataset$gender)

Descriptive statistics by group
group: female
vars n   mean     sd median trimmed    mad    min     max  range skew
sn         1 4   8.50   1.29   8.50    8.50   1.48   7.00   10.00   3.00 0.00
gender*    2 4   1.00   0.00   1.00    1.00   0.00   1.00    1.00   0.00  NaN
sal        3 4 979.88 163.58 967.69  979.88 166.64 801.15 1182.98 381.83 0.14
dept*      4 4   2.50   1.00   2.00    2.50   0.00   2.00    4.00   2.00 0.75
kurtosis    se
sn         -2.08  0.65
gender*      NaN  0.00
sal        -2.04 81.79
dept*      -1.69  0.50
---------------------------------------------------------------
group: male
vars n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 6    3.50   1.87   3.50    3.50   2.22   1.00    6.00   5.00  0.00
gender*    2 6    2.00   0.00   2.00    2.00   0.00   2.00    2.00   0.00   NaN
sal        3 6 1018.79 175.10 962.83 1018.79 119.22 876.63 1337.77 461.14  0.83
dept*      4 6    3.00   1.10   3.00    3.00   0.74   1.00    4.00   3.00 -0.76
kurtosis    se
sn         -1.80  0.76
gender*      NaN  0.00
sal        -1.02 71.48
dept*      -0.92  0.45
Warning message:
describe.by is deprecated.  Please use the describeBy function


# Working with data types of R

I have discussed about package management in the previous post. This post concentrates on data type, data structure and data coercion.

### Data types & Data Structure

We have different Object types in R as given below.

• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames

We have the following data types in R.

• Logical
myvar <- TRUE

• Numeric
myvar <- 23.5

• Integer
myvar <- 2L

• Complex
myvar <- 2+5i

• Character
myvar <- "GOOD"

• Raw
myvar <- charToRaw("GOOD")


We could see them with examples below.

I have already written about c command in Basic functions in R language

> id1 <- c(1,2,3,4)
> id1+10
[1] 11 12 13 14


rep command helps us to repeat the values.

> gender <- rep(c("male", "female"), c(6,4))
> gender
[1] "male" "male" "male" "male" "male" "male" "female" "female"
[9] "female" "female"


male is repeated 6 times where as female repeated 4 times.

Serial number is given by the command sn.

> sn <- seq(1, 10, 1)
> sn
[1] 1 2 3 4 5 6 7 8 9 10 > sn <-seq(1, 10, 2)
> sn
[1] 1 3 5 7 9


we passed start value, end value and interval.

cbind meant for column bind.

> #vector to matrix - cbind
> slno_gender <- cbind(sn, gender)
> slno_gender
sn gender
[1,] "1" "male"
[2,] "2" "male"
[3,] "3" "male"
[4,] "4" "male"
[5,] "5" "male"
[6,] "6" "male"
[7,] "7" "female"
[8,] "8" "female"
[9,] "9" "female"
[10,] "10" "female"


This gives column wise data, in matrix. Matrix takes single data type only. Hence serial number is changed to character data type.

Let’s look at similar example using DataFrame. We shall overcome this data type problem.

> #data frame
> df <- data.frame(sn, gender)
> df
sn gender
1   1   male
2   2   male
3   3   male
4   4   male
5   5   male
6   6   male
7   7 female
8   8 female
9   9 female
10 10 female


Let’s create another object called names

> name <- c("Sophia","Jackson","Isabella","Lucas","Charlotte","Oliver","Amelia","Benjamin","Sarah","Julian")
> name
[1] "Sophia"    "Jackson"   "Isabella"  "Lucas"     "Charlotte" "Oliver"
[7] "Amelia"    "Benjamin"  "Sarah"     "Julian"


Let’s create another object called salary.  when we specify number of samples (10), mean (1000) and standard deviation (20), this command would generate random numbers to satisfy this.

> sal <- rnorm(10, 1000, 200);
> sal
[1]  898.5731 1105.1638  757.7067  934.9025 1006.0053 1014.4837 1188.6763  611.0265
[9]  643.5498 1121.6005
> mean(sal)
[1] 928.1688
> sd(sal)
[1] 200.4666


Let’s combine 3 objects together now to form a new object, using data.frame

> mydataset <- data.frame(sn, gender, sal)
> mydataset
sn gender       sal
1   1   male  898.5731
2   2   male 1105.1638
3   3   male  757.7067
4   4   male  934.9025
5   5   male 1006.0053
6   6   male 1014.4837
7   7 female 1188.6763
8   8 female  611.0265
9   9 female  643.5498
10 10 female 1121.6005


After showing the above example for data frame, lets switch to list.

> mydatalist <- list(v1=id1, v2=sn, v3=slno_gender, v4=mydataset)
> mydatalist
$v1 [1] 1 2 3 4$v2
[1]  1  2  3  4  5  6  7  8  9 10

$v3 sn gender [1,] "1" "male" [2,] "2" "male" [3,] "3" "male" [4,] "4" "male" [5,] "5" "male" [6,] "6" "male" [7,] "7" "female" [8,] "8" "female" [9,] "9" "female" [10,] "10" "female"$v4
sn gender       sal
1   1   male  898.5731
2   2   male 1105.1638
3   3   male  757.7067
4   4   male  934.9025
5   5   male 1006.0053
6   6   male 1014.4837
7   7 female 1188.6763
8   8 female  611.0265
9   9 female  643.5498
10 10 female 1121.6005


I put different object types together in a single collection and stored it in one object. We shall refer to individual data as variables as shown below.

> mydatalist$v1 [1] 1 2 3 4  Cool isn’t it! ### Data Coercion Let’s look at data coercion now. mode command helps is to find out the data type of an object > mode(mydatalist$v1)
[1] "numeric"
> mode(mydatalist\$v3)
[1] "character"
> mode(gender)
[1] "character"


Let’s convert this character type to factor type now.


gender1<-as.factor(gender)



How is this possible to represent the text  as numeric? Let’s look at the class, which shows us the data structure.

> class(gender1)
[1] "factor


Okay, let’s unclass gender1 now to know how is it stored.

> unclass(gender1)
[1] 2 2 2 2 2 2 1 1 1 1
attr(,"levels")
[1] "female" "male"


So 1 represents female, 2 represents male.

See you in another interest post.

# Package management in R

Hi,

We have seen how to load the data into R language in my previous post Loading Data into R. It is an important part of this blog series. Let’s talk about packages now.

Packages are not new to programmers. Any programming language comes with packages, of course limited set of packages. Additional packages are added a la carte. We shall see same behavior in R as well. The default installation of is a thin solution, which has only basic packages. If needed we need to add additional packages. Lets see how.

### Viewing the packages

search() would help us to check the list of loaded packages.

> search()
[5] "package:graphics"  "package:grDevices" "package:utils"     "package:datasets"
[9] "package:methods"   "Autoloads"         "package:base"

installed.packages() shows us the packages installed but not loaded.

> installed.packages()
Package        LibPath                                   Version
BH           "BH"           "D:/gandhari/documents/R/win-library/3.4" "1.65.0-1"
hms          "hms"          "D:/gandhari/documents/R/win-library/3.4" "0.3"
R6           "R6"           "D:/gandhari/documents/R/win-library/3.4" "2.2.2"
Rcpp         "Rcpp"         "D:/gandhari/documents/R/win-library/3.4" "0.12.12"
rlang        "rlang"        "D:/gandhari/documents/R/win-library/3.4" "0.1.2"
tibble       "tibble"       "D:/gandhari/documents/R/win-library/3.4" "1.3.4"
base         "base"         "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
boot         "boot"         "C:/Program Files/R/R-3.4.1/library"      "1.3-19"
class        "class"        "C:/Program Files/R/R-3.4.1/library"      "7.3-14"
cluster      "cluster"      "C:/Program Files/R/R-3.4.1/library"      "2.0.6"
codetools    "codetools"    "C:/Program Files/R/R-3.4.1/library"      "0.2-15"
compiler     "compiler"     "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
datasets     "datasets"     "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
foreign      "foreign"      "C:/Program Files/R/R-3.4.1/library"      "0.8-69"
graphics     "graphics"     "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
grDevices    "grDevices"    "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
grid         "grid"         "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
KernSmooth   "KernSmooth"   "C:/Program Files/R/R-3.4.1/library"      "2.23-15"
lattice      "lattice"      "C:/Program Files/R/R-3.4.1/library"      "0.20-35"
MASS         "MASS"         "C:/Program Files/R/R-3.4.1/library"      "7.3-47"
Matrix       "Matrix"       "C:/Program Files/R/R-3.4.1/library"      "1.2-10"
methods      "methods"      "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
mgcv         "mgcv"         "C:/Program Files/R/R-3.4.1/library"      "1.8-17"
nlme         "nlme"         "C:/Program Files/R/R-3.4.1/library"      "3.1-131"
nnet         "nnet"         "C:/Program Files/R/R-3.4.1/library"      "7.3-12"
parallel     "parallel"     "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
rpart        "rpart"        "C:/Program Files/R/R-3.4.1/library"      "4.1-11"
spatial      "spatial"      "C:/Program Files/R/R-3.4.1/library"      "7.3-11"
splines      "splines"      "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
stats        "stats"        "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
stats4       "stats4"       "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
survival     "survival"     "C:/Program Files/R/R-3.4.1/library"      "2.41-3"
tcltk        "tcltk"        "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
tools        "tools"        "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
translations "translations" "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
utils        "utils"        "C:/Program Files/R/R-3.4.1/library"      "3.4.1"
Priority      Depends
BH           NA            NA
hms          NA            NA
R6           NA            "R (>= 3.0)"
Rcpp         NA            "R (>= 3.0.0)"
rlang        NA            "R (>= 3.1.0)"
tibble       NA            "R (>= 3.1.0)"
base         "base"        NA
boot         "recommended" "R (>= 3.0.0), graphics, stats"
class        "recommended" "R (>= 3.0.0), stats, utils"
cluster      "recommended" "R (>= 3.0.1)"
codetools    "recommended" "R (>= 2.1)"
compiler     "base"        NA
datasets     "base"        NA
foreign      "recommended" "R (>= 3.0.0)"
graphics     "base"        NA
grDevices    "base"        NA
grid         "base"        NA
KernSmooth   "recommended" "R (>= 2.5.0), stats"
lattice      "recommended" "R (>= 3.0.0)"
MASS         "recommended" "R (>= 3.1.0), grDevices, graphics, stats, utils"
Matrix       "recommended" "R (>= 3.0.1)"
methods      "base"        NA
mgcv         "recommended" "R (>= 2.14.0), nlme (>= 3.1-64)"
nlme         "recommended" "R (>= 3.0.2)"
nnet         "recommended" "R (>= 2.14.0), stats, utils"
parallel     "base"        NA
rpart        "recommended" "R (>= 2.15.0), graphics, stats, grDevices"
spatial      "recommended" "R (>= 3.0.0), graphics, stats, utils"
splines      "base"        NA
stats        "base"        NA
stats4       "base"        NA
survival     "recommended" "R (>= 2.13.0)"
tcltk        "base"        NA
tools        "base"        NA
translations NA            NA
utils        "base"        NA
BH           NA                                                 NA
hms          "methods"                                          NA
R6           NA                                                 NA
Rcpp         "methods, utils"                                   NA
readr        "Rcpp (>= 0.12.0.5), tibble, hms, R6"              "Rcpp, BH"
rlang        NA                                                 NA
tibble       "methods, rlang, Rcpp (>= 0.12.3), utils"          "Rcpp"
base         NA                                                 NA
boot         NA                                                 NA
class        "MASS"                                             NA
cluster      "graphics, grDevices, stats, utils"                NA
codetools    NA                                                 NA
compiler     NA                                                 NA
datasets     NA                                                 NA
foreign      "methods, utils, stats"                            NA
graphics     "grDevices"                                        NA
grDevices    NA                                                 NA
grid         "grDevices, utils"                                 NA
KernSmooth   NA                                                 NA
lattice      "grid, grDevices, graphics, stats, utils"          NA
MASS         "methods"                                          NA
Matrix       "methods, graphics, grid, stats, utils, lattice"   NA
methods      "utils, stats"                                     NA
mgcv         "methods, stats, graphics, Matrix"                 NA
nlme         "graphics, stats, utils, lattice"                  NA
nnet         NA                                                 NA
parallel     "tools, compiler"                                  NA
rpart        NA                                                 NA
spatial      NA                                                 NA
splines      "graphics, stats"                                  NA
stats        "utils, grDevices, graphics"                       NA
stats4       "graphics, methods, stats"                         NA
survival     "graphics, Matrix, methods, splines, stats, utils" NA
tcltk        "utils"                                            NA
tools        NA                                                 NA
translations NA                                                 NA
utils        NA                                                 NA
Suggests
BH           NA
hms          "testthat, lubridate"
R6           "knitr, microbenchmark, pryr, testthat, ggplot2, scales"
Rcpp         "RUnit, inline, rbenchmark, highlight, pkgKitten (>= 0.1.2)"
readr        "curl, testthat, knitr, rmarkdown, stringi, covr"
rlang        "knitr, rmarkdown (>= 0.2.65), testthat, covr"
tibble       "covr, dplyr, knitr (>= 1.5.32), microbenchmark, nycflights13,\ntestthat, rmarkdown, withr"
base         "methods"
boot         "MASS, survival"
class        NA
cluster      "MASS"
codetools    NA
compiler     NA
datasets     NA
foreign      NA
graphics     NA
grDevices    "KernSmooth"
grid         "lattice"
KernSmooth   "MASS"
lattice      "KernSmooth, MASS, latticeExtra"
MASS         "lattice, nlme, nnet, survival"
Matrix       "expm, MASS"
methods      "codetools"
mgcv         "splines, parallel, survival, MASS"
nlme         "Hmisc, MASS"
nnet         "MASS"
parallel     "methods"
rpart        "survival"
spatial      "MASS"
splines      "Matrix, methods"
stats        "MASS, Matrix, SuppDists, methods, stats4"
stats4       NA
survival     NA
tcltk        NA
tools        "codetools, methods, xml2, curl"
translations NA
utils        "methods, XML"
BH           NA                                      "BSL-1.0"
hms          NA                                      "GPL-3"
R6           NA                                      "MIT + file LICENSE"
Rcpp         NA                                      "GPL (>= 2)"
rlang        NA                                      "GPL-3"
tibble       NA                                      "MIT + file LICENSE"
base         NA                                      "Part of R 3.4.1"
boot         NA                                      "Unlimited"
class        NA                                      "GPL-2 | GPL-3"
cluster      NA                                      "GPL (>= 2)"
codetools    NA                                      "GPL"
compiler     NA                                      "Part of R 3.4.1"
datasets     NA                                      "Part of R 3.4.1"
foreign      NA                                      "GPL (>= 2)"
graphics     NA                                      "Part of R 3.4.1"
grDevices    NA                                      "Part of R 3.4.1"
grid         NA                                      "Part of R 3.4.1"
KernSmooth   NA                                      "Unlimited"
lattice      "chron"                                 "GPL (>= 2)"
MASS         NA                                      "GPL-2 | GPL-3"
Matrix       "MatrixModels, graph, SparseM, sfsmisc" "GPL (>= 2) | file LICENCE"
methods      NA                                      "Part of R 3.4.1"
mgcv         NA                                      "GPL (>= 2)"
nlme         NA                                      "GPL (>= 2) | file LICENCE"
nnet         NA                                      "GPL-2 | GPL-3"
parallel     "snow, nws, Rmpi"                       "Part of R 3.4.1"
rpart        NA                                      "GPL-2 | GPL-3"
spatial      NA                                      "GPL-2 | GPL-3"
splines      NA                                      "Part of R 3.4.1"
stats        NA                                      "Part of R 3.4.1"
stats4       NA                                      "Part of R 3.4.1"
survival     NA                                      "LGPL (>= 2)"
tcltk        NA                                      "Part of R 3.4.1"
tools        NA                                      "Part of R 3.4.1"
translations NA                                      "Part of R 3.4.1"
utils        NA                                      "Part of R 3.4.1"
BH           NA              NA                    NA      NA     "no"
hms          NA              NA                    NA      NA     "no"
R6           NA              NA                    NA      NA     "no"
Rcpp         NA              NA                    NA      NA     "yes"
readr        NA              NA                    NA      NA     "yes"
rlang        NA              NA                    NA      NA     "yes"
tibble       NA              NA                    NA      NA     "yes"
base         NA              NA                    NA      NA     NA
boot         NA              NA                    NA      NA     "no"
class        NA              NA                    NA      NA     "yes"
cluster      NA              NA                    NA      NA     "yes"
codetools    NA              NA                    NA      NA     "no"
compiler     NA              NA                    NA      NA     NA
datasets     NA              NA                    NA      NA     NA
foreign      NA              NA                    NA      NA     "yes"
graphics     NA              NA                    NA      NA     "yes"
grDevices    NA              NA                    NA      NA     "yes"
grid         NA              NA                    NA      NA     "yes"
KernSmooth   NA              NA                    NA      NA     "yes"
lattice      NA              NA                    NA      NA     "yes"
MASS         NA              NA                    NA      NA     "yes"
Matrix       NA              NA                    NA      NA     "yes"
methods      NA              NA                    NA      NA     "yes"
mgcv         NA              NA                    NA      NA     "yes"
nlme         NA              NA                    NA      NA     "yes"
nnet         NA              NA                    NA      NA     "yes"
parallel     NA              NA                    NA      NA     "yes"
rpart        NA              NA                    NA      NA     "yes"
spatial      NA              NA                    NA      NA     "yes"
splines      NA              NA                    NA      NA     "yes"
stats        NA              NA                    NA      NA     "yes"
stats4       NA              NA                    NA      NA     NA
survival     NA              NA                    NA      NA     "yes"
tcltk        NA              NA                    NA      NA     "yes"
tools        NA              NA                    NA      NA     "yes"
translations NA              NA                    NA      NA     NA
utils        NA              NA                    NA      NA     "yes"
Built
BH           "3.4.1"
hms          "3.4.1"
R6           "3.4.1"
Rcpp         "3.4.1"
rlang        "3.4.1"
tibble       "3.4.1"
base         "3.4.1"
boot         "3.4.1"
class        "3.4.1"
cluster      "3.4.1"
codetools    "3.4.1"
compiler     "3.4.1"
datasets     "3.4.1"
foreign      "3.4.1"
graphics     "3.4.1"
grDevices    "3.4.1"
grid         "3.4.1"
KernSmooth   "3.4.1"
lattice      "3.4.1"
MASS         "3.4.1"
Matrix       "3.4.1"
methods      "3.4.1"
mgcv         "3.4.1"
nlme         "3.4.1"
nnet         "3.4.1"
parallel     "3.4.1"
rpart        "3.4.1"
spatial      "3.4.1"
splines      "3.4.1"
stats        "3.4.1"
stats4       "3.4.1"
survival     "3.4.1"
tcltk        "3.4.1"
tools        "3.4.1"
translations "3.4.1"
utils        "3.4.1"

R Studio IDE has a tab which shows the loaded/not loaded packages.

### Installing new packages

To install a new package we shall use Install packages option in R Studio, or install.packages() command.

> install.packages("regress")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/regress_1.3-15.zip'
Content type 'application/zip' length 32695 bytes (31 KB)

package ‘regress’ successfully unpacked and MD5 sums checked

C:\Users\pandian\AppData\Local\Temp\Rtmpq29bYI\downloaded_packages

### Removing packages

Removing a package using R Studio is as easy as clicking the x mark.

To do it from console, we shall use remove.packages()

> remove.packages("regress")
Removing package from ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)

To load a package, we shall just check ✅ the needed package in package tab. The same task shall be performed using library() command in console.

> library("BH", lib.loc="~/R/win-library/3.4")

See in another interesting post. 💖 you all.