# Regression (Explanatory) in R

Hi,

I have written about Regression – Predictive model in my earlier post Regression testing in R. Following posts are useful if you want to know what is regression.

Previous post talks about predicting unknown values using known values. This post would explain about how much change is observed between IV(s) and DV.

> setwd("D:/gandhari/videos/Advanced Business Analytics")
> student_data <- read.csv("student_data.csv") > student_data
id gender sup.help sup.under sup.appre adv.comp adv.access tut.prof tut.sched val.devel val.meet sat.glad sat.expe loy.proud loy.recom loy.pleas scholarships job
1   1 female        7         1         7        5          5        5         4         5        6        7        7         7         7         7           no  no
2   2   male        7         1         7        6          6        6         6         6        7        7        7         7         7         7           no yes
3   3 female        6         1         7        6          6        6         6         6        7        7        6         7         7         7           no  no
4   4   male        1         7         1        1          2        3         2         1        1        1        1         1         1         1          yes  no
5   5 female        6         5         7        7          6        7         7         7        7        7        7         7         7         7           no yes
6   6   male        3         1         7        7          7        6         7         6        6        7        6         7         7         7          yes  no
7   7 female        5         2         7        7          6        6         7         4        3        7        7         7         7         7          yes  no
8   8   male        6         1         7        7          7        7         5         7        6        7        7         5         6         7          yes yes
9   9 female        7         1         7        6          6        5         5         5        5        7        6         6         7         7           no yes
10 10   male        2         4         7        7          6        6         6         4        2        5        4         4         7         7           no  no
> str(student_data)
'data.frame': 10 obs. of 18 variables:
$id : int 1 2 3 4 5 6 7 8 9 10$ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2
$sup.help : int 7 7 6 1 6 3 5 6 7 2$ sup.under : int 1 1 1 7 5 1 2 1 1 4
$sup.appre : int 7 7 7 1 7 7 7 7 7 7$ adv.comp : int 5 6 6 1 7 7 7 7 6 7
$adv.access : int 5 6 6 2 6 7 6 7 6 6$ tut.prof : int 5 6 6 3 7 6 6 7 5 6
$tut.sched : int 4 6 6 2 7 7 7 5 5 6$ val.devel : int 5 6 6 1 7 6 4 7 5 4
$val.meet : int 6 7 7 1 7 6 3 6 5 2$ sat.glad : int 7 7 7 1 7 7 7 7 7 5
$sat.expe : int 7 7 6 1 7 6 7 7 6 4$ loy.proud : int 7 7 7 1 7 7 7 5 6 4
$loy.recom : int 7 7 7 1 7 7 7 6 7 7$ loy.pleas : int 7 7 7 1 7 7 7 7 7 7
$scholarships: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 2 1 1$ job : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 1<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			>﻿</span>


Sometimes, the dataset is not completely visible in wordpress. Hence I’m giving it as an image below.

support, advice, satisfaction and loyalty has multiple variables in the above data set as sup.help, sup.under etc.

Let’s make it as a single variable (mean) for easy analysis.

> #get sing score for support advice satisfaction loyalty
> student_data$support <- apply(student_data[,3:5],1,mean) > summary (student_data$support)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3.000   4.417   4.667   4.600   5.000   6.000
> student_data$value <- rowMeans(student_data[,10:11]) > student_data$sat <- rowMeans(student_data[,12:13])
> student_data$loy <- rowMeans(student_data[,14:16])  So we found the mean using apply() and rowMeans(). Those mean values are appended to our original data set student_data. Now, let’s take only 4 variables – gender and the 3 new variables value, sat and loy in a new data set for analysis. > student_data_min <- student_data[,c(2, 20:22)] > head(student_data_min) gender value sat loy 1 female 5.5 7.0 7 2 male 6.5 7.0 7 3 female 6.5 6.5 7 4 male 1.0 1.0 1 5 female 7.0 7.0 7 6 male 6.0 6.5 7  Looks simple and great, isn’t it? • If value for money is good, satisfaction score would be high. • If the customer is satisfied, he would be loyal to the organization. So Loy is our dependent variable DV. sat and value are our independent variables IV. I’m using regression to know how gender influences loyalty. > #DV - loy > #IV - sat, value > loyalty_gender_reln <- lm(loy~gender, data=student_data_min) > summary (loyalty_gender_reln) Call: lm(formula = loy ~ gender, data = student_data_min) Residuals: Min 1Q Median 3Q Max -4.4000 0.0667 0.0667 0.6000 1.6000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.9333 0.7951 8.720 2.34e-05 *** gendermale -1.5333 1.1245 -1.364 0.21 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.778 on 8 degrees of freedom Multiple R-squared: 0.1886, Adjusted R-squared: 0.08717 F-statistic: 1.859 on 1 and 8 DF, p-value: 0.2098 > #R2 is 18%, which says weak relation. So gender does not influence the loyalty.  R-squared value is 0.1886, which is 18.86%, which shows very weak correlation. Hence I’d decide gender doesn’t influence loyalty. Here is the influence of value for money on loyalty. > loyalty_value_reln <- lm(loy~value, data = student_data_min) > summary(loyalty_value_reln) Call: lm(formula = loy ~ value, data = student_data_min) Residuals: Min 1Q Median 3Q Max -2.2182 -0.4953 -0.0403 0.5287 1.9618 Coefficients: Estimate Std. Error t value Pr(<|t|) (Intercept) 2.4901 1.1731 2.123 0.0665 . value 0.7280 0.2181 3.338 0.0103 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.276 on 8 degrees of freedom Multiple R-squared: 0.582, Adjusted R-squared: 0.5298 F-statistic: 11.14 on 1 and 8 DF, p-value: 0.01027 > #58%  Value for money has 58.2% influence on loyalty. Following is the influence of satisfaction against loyalty. > loyalty_sat_reln <- lm (loy~sat, data = student_data_min) > summary(loyalty_sat_reln) Call: lm(formula = loy ~ sat, data = student_data_min) Residuals: Min 1Q Median 3Q Max -1.08586 -0.08586 -0.08586 0.29040 1.21212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.6515 0.6992 0.932 0.379 sat 0.9192 0.1115 8.241 3.53e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6408 on 8 degrees of freedom Multiple R-squared: 0.8946, Adjusted R-squared: 0.8814 F-statistic: 67.91 on 1 and 8 DF, p-value: 3.525e-05 > #89%  Wah, 89.46%. So to keep up our customers, satisfaction should be high. This is the message we read. I wish my beloved Air India should read this post. We are combining everything below. > loyalty_everything <- lm(loy~., data = student_data_min) > summary(loyalty_everything) Call: lm(formula = loy ~ ., data = student_data_min) Residuals: Min 1Q Median 3Q Max -1.01381 -0.28807 -0.01515 0.33286 1.13931 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.66470 1.03039 0.645 0.54273 gendermale -0.01796 0.53076 -0.034 0.97411 value -0.10252 0.23777 -0.431 0.68141 sat 1.00478 0.26160 3.841 0.00855 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7273 on 6 degrees of freedom Multiple R-squared: 0.8982, Adjusted R-squared: 0.8472 F-statistic: 17.64 on 3 and 6 DF, p-value: 0.00222  Really, I don’t know how to read the above value at the moment. I’d update this post (if I don’t forget!) To collate the results and show in a consolidated format, we use screenreg() of rexreg package. > install.packages("texreg") Installing package into ‘D:/gandhari/documents/R/win-library/3.4’ (as ‘lib’ is unspecified) trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/texreg_1.36.23.zip' Content type 'application/zip' length 651831 bytes (636 KB) downloaded 636 KB package ‘texreg’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\pandian\AppData\Local\Temp\Rtmp085gnT\downloaded_packages > library("texreg") Version: 1.36.23 Date: 2017-03-03 Author: Philip Leifeld (University of Glasgow) Please cite the JSS article in your publications -- see citation("texreg"). > library(texreg) > screenreg(list(loyalty_gender_reln, loyalty_value_reln, loyalty_sat_reln, loyalty_everything)) ==================================================== Model 1 Model 2 Model 3 Model 4 ---------------------------------------------------- (Intercept) 6.93 *** 2.49 0.65 0.66 (0.80) (1.17) (0.70) (1.03) gendermale -1.53 -0.02 (1.12) (0.53) value 0.73 * -0.10 (0.22) (0.24) sat 0.92 *** 1.00 ** (0.11) (0.26) ---------------------------------------------------- R^2 0.19 0.58 0.89 0.90 Adj. R^2 0.09 0.53 0.88 0.85 Num. obs. 10 10 10 10 RMSE 1.78 1.28 0.64 0.73 ==================================================== *** p < 0.001, ** p < 0.01, * p < 0.05  So this linear regression post explains the relation between the variables. See you in another post with an interesting topic. Advertisements # Testing of difference – T test using R Hi, I have written about a hypothesis testing of independence in my previous post Testing of Independence – Chi Square test – Manual, LibreOffice, R. This post talks about testing of mean difference. Lets take the same salary data set used in my previous post. > sal id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 8 female PG Engineer MLM 13000 13000 7 3 9 9 female PG Engineer MLM 14000 14000 7 2 10 10 female PG Engineer MLM 16000 16000 8 4 11 11 female UG Jr Engineer JLM 10000 1000 3 4 12 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 14 male PG Engineer MLM 15000 15000 7 2 15 15 female PG Sr Engineer MLM 25000 25000 12 4 16 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 18 female PG Engineer MLM 13000 13000 7 3 19 19 female PG Engineer MLM 14000 14000 7 2 20 20 female PG Engineer MLM 16000 16000 8 4 21 21 female PG Sr Engineer MLM 25000 25000 12 4 22 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 24 female PG Engineer MLM 13000 13000 7 3 25 25 female PG Engineer MLM 14000 14000 7 2 26 26 female PG Engineer MLM 16000 16000 8 4 27 27 female UG Jr Engineer JLM 10000 1000 3 4 28 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 30 male PG Engineer MLM 15000 15000 7 2 31 31 female PG Sr Engineer MLM 25000 25000 12 4 32 32 female PG Sr Engineer MLM 25000 25000 12 4 33 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 35 female PG Engineer MLM 13000 13000 7 3 36 36 female PG Engineer MLM 14000 14000 7 2 37 37 female PG Engineer MLM 16000 16000 8 4 38 38 female UG Jr Engineer JLM 10000 1000 3 4 39 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 41 male PG Engineer MLM 15000 15000 7 2 42 42 female PG Sr Engineer MLM 25000 25000 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 45 female PG Engineer MLM 13000 13000 7 3 46 46 female PG Engineer MLM 16000 16000 8 4 47 47 female UG Jr Engineer JLM 10000 1000 3 4 48 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 50 male PG Engineer MLM 15000 15000 7 2  Let’s find out the difference of mean salary between male and female. > aggregate(Salary~gender, mean, data=sal) gender Salary 1 female 16040 2 male 27000  Mean salary of female μ0 = 16040 Mean salary of female μ1 = 27000 The symbol ~ differentiates between dependent and independent variables Obviously there is a difference. Let’s see what a t-test in R shows us. > t.test(Salary~gender, data = sal) Welch Two Sample t-test data: Salary by gender t = -1.4494, df = 25.039, p-value = 0.1596 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -26532.252 4612.252 sample estimates: mean in group female mean in group male 16040 27000  How to interpret this result? The p-value is compared with the desired significance level of our test and, if it is smaller, the result is significant. That is, if the null hypothesis were to be rejected at the 5% significance level, this would be reported as “p < 0.05″. Small p-values suggest that the null hypothesis is unlikely to be true. But our p-value 0.15 > 0.05. Hence null hypothesis is rejected and alternate hypothesis is accepted. There is a difference in salary between both genders. Higher the t value (ignore the sign), higher the difference. # Testing of Independence – Chi Square test – Manual, LibreOffice, R Hi, I have written about testing of hypothesis in my earlier posts Statisticians recommended right testing approaches for different type of data. When we have – • both data as categorical, we shall use Chi Square Test • Continuous and Continuous data, we shall use correlation • Categorical and Continuous data, we shall use t test or anova. In this post, I’d be using the below given data set.  id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 female UG Jr Engineer JLM 10000 1000 3 4 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 male PG Engineer MLM 15000 15000 7 2 5 female PG Sr Engineer MLM 25000 25000 12 4 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 female PG Engineer MLM 13000 13000 7 3 9 female PG Engineer MLM 14000 14000 7 2 10 female PG Engineer MLM 16000 16000 8 4 11 female UG Jr Engineer JLM 10000 1000 3 4 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 male PG Engineer MLM 15000 15000 7 2 15 female PG Sr Engineer MLM 25000 25000 12 4 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 female PG Engineer MLM 13000 13000 7 3 19 female PG Engineer MLM 14000 14000 7 2 20 female PG Engineer MLM 16000 16000 8 4 21 female PG Sr Engineer MLM 25000 25000 12 4 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 female PG Engineer MLM 13000 13000 7 3 25 female PG Engineer MLM 14000 14000 7 2 26 female PG Engineer MLM 16000 16000 8 4 27 female UG Jr Engineer JLM 10000 1000 3 4 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 male PG Engineer MLM 15000 15000 7 2 31 female PG Sr Engineer MLM 25000 25000 12 4 32 female PG Sr Engineer MLM 25000 25000 12 4 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 female PG Engineer MLM 13000 13000 7 3 36 female PG Engineer MLM 14000 14000 7 2 37 female PG Engineer MLM 16000 16000 8 4 38 female UG Jr Engineer JLM 10000 1000 3 4 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 male PG Engineer MLM 15000 15000 7 2 42 female PG Sr Engineer MLM 25000 25000 12 4 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 female PG Engineer MLM 13000 13000 7 3 46 female PG Engineer MLM 16000 16000 8 4 47 female UG Jr Engineer JLM 10000 1000 3 4 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 male PG Engineer MLM 15000 15000 7 2  We shall use chi square test for two types of hypothesis testing • test of independence of variables • test goodness of fit ### Testing of independence We can find out the association between two (at least) categorical variables. Higher the chi square value, better the result is. We shall use this to test our hypothesis. ### Goodness of fit When we use chi square test to find the goodness of fit, we shall use 2 categorical variables. higher the chi square value, better the result is. We shall use this to test BLR, SEM tests. ### Example for Testing of independence This post talks about testing of independence. We have employee data given above. Following are my hypothesis. H0 = Number of female employees and level of management are not related. H1 = Number of female employees and level of management are related. We would solve this using three methods 1. Manual way of chi square test 2. Chi square test with LibreOffice Calc 3. Chi square test with R #### Manual way of chi square test We prepare the count of female employees in each level as given below. I have used COUNTIFS() function of LibreOffice. Calculate the row (highlighted in pink colour) and column sums (blue colour) and summation of all row sums (saffron colour). The values are called observed values. We shall find out the expected values as well easily as given below. Expected value = column sum x row sum/sum of rowsum =J15*N12/N15 = 25 x 20/50 = 10 Finally our table looks like this. All the observed values (O), Expected values (E) are substituted in the below table. We calculate the Chi square value χ2 which is 19.  O E O-E (O-E)2 (O-E)2/E 5 10 -5 25 2.5 20 12.5 7.5 56.25 4.5 0 2.5 -2.5 6.25 2.5 15 10 5 25 2.5 5 12.5 -7.5 56.25 4.5 5 2.5 2.5 6.25 2.5 χ2 19 Level of significance or Type 1 error = 5%, which is 0.05 Degrees of freedom = (row count – 1) x (column count – 1) = 2 Critical value of χ2 is 5.991, which is looked up using the level of significance and degrees of freedom in the below given table. Make a decision To accept our null hypothesis H0, calculated χ2 < critical χ2. Our calculated χ2 = 19 Our critical χ2 = 5.991 Hence, we reject null hypothesis and accept alternate hypothesis. You may watch the following video to understand the above calculation. #### Chi square test with LibreOffice Calc We have already found out the frequency distribution of females and males per each management level. Let’s use the same. Select Data>Statistics>Chi-square Test Choose the input cells Select the Output Cell Finally my selections are given as below After pressing OK, We get the following result Make a decision If pα reject the null hypothesis. If p>α fail to reject the null hypothesis. Our p 0.00007485 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted. #### Chi square test with R I have the data set stored as sal.csv file. I’m importing it and store to sal object. > setwd("d:/gandhari/videos/Advanced Business Analytics/") > sal <-read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1  As I wrote in Exploring data files with R I create a Frequency Distribution table using table() function. > gender_level_table <- table(sal$Level, sal$gender) > gender_level_table female male JLM 5 15 MLM 20 5 TLM 0 5  Use chisq.test() function with gender_level_table as its input, to run the chi square test > chisq.test(gender_level_table) Pearson's Chi-squared test data: gender_level_table X-squared = 19, df = 2, p-value = 7.485e-05 Warning message: In chisq.test(gender_level_table) : Chi-squared approximation may be incorrect  Make a decision If pα reject the null hypothesis. If p>α fail to reject the null hypothesis. Our p 7.485e-05 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted. See you in another interesting post. Happy Sunday. # Exploring data files with R I have written about data types and data structures of R in my previous post Working with data types of R. We shall explore a data set in this post. mtcars is a dataset exists in R already > mtcars mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2  ### File structure Very first question would be about the size of the data set. > dim(mtcars) [1] 32 11  It contains 32 rows and 11 columns. Now, how many variables we have in mtcars? > names(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"  We shall preview the data using head and tail commands. > head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > tail(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2  Similar to unix tail, head commands, you would see first and last 6 records in R. This is the time to know about the structure of the data set. > str(mtcars) 'data.frame': 32 obs. of 11 variables:$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$cyl : num 6 6 4 6 8 6 8 4 4 6 ...$ disp: num  160 160 108 258 360 ...
$hp : num 110 110 93 110 175 105 245 62 95 123 ...$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$wt : num 2.62 2.88 2.32 3.21 3.44 ...$ qsec: num  16.5 17 18.6 19.4 17 ...
$vs : num 0 0 1 1 0 1 0 1 1 1 ...$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
$gear: num 4 4 4 3 3 3 3 4 4 4 ...$ carb: num  4 4 1 1 2 1 4 2 2 4 ...


So, this is a data frame. Each variables are explained above.

How the data is being stored?

> mode(mtcars)
[1] "list"


It is stored as a list.

Let’s take another dataset available with R – airquality.

> head(airquality, n=10)
Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10


You may parameterize head commands as I have shown above.

Let’s omit the records with NA.

> aqNoNA=na.omit(airquality)
> dim(aqNoNA)
[1] 111   6
> dim(airquality)
[1] 153   6


So, out new object aqNoNA contains the records without missing cases, totally 111 rows.

We have is.na command to check if the data is NA.

> is.na(airquality)
Ozone Solar.R  Wind  Temp Month   Day
[1,] FALSE   FALSE FALSE FALSE FALSE FALSE
[2,] FALSE   FALSE FALSE FALSE FALSE FALSE
[3,] FALSE   FALSE FALSE FALSE FALSE FALSE
[4,] FALSE   FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
[6,] FALSE    TRUE FALSE FALSE FALSE FALSE
> sum(is.na(airquality))
[1] 44


So totally 44 NAs found.

Summary command gives us minimum, quartile 1, median, mean, 3rd quartile, maximum and number of NAa.

> summary(airquality)
Ozone           Solar.R           Wind             Temp           Month
Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
NA's   :37       NA's   :7
Day
Min.   : 1.0
1st Qu.: 8.0
Median :16.0
Mean   :15.8
3rd Qu.:23.0
Max.   :31.0


1st quartile meant of 24% data

2nd quartile meant of 50 percentile of data

3rd quartile meant of 75 percentile of data

We shall filter the summary commands as given below.

> summary(airquality$Ozone) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 1.00 18.00 31.50 42.13 63.25 168.00 37  We have given the summary of only one variable Ozone above. We shall apply other mathematical functions as given below > sd(aqNoNA$Ozone)
[1] 33.27597


We have calculated the standard deviation for Ozone above. I have taken the data without NA, as my sd operation will fail if I have NA.

### Finding the Outlier

> myNumbers <- rep(c(1,4,10, 100), c(5, 5, 5, 1))
> myNumbers
[1]   1   1   1   1   1   4   4   4   4   4  10  10  10  10  10 100
> mean(myNumbers)
[1] 10.9375
> sd(myNumbers)
[1] 24.04293
> scale(myNumbers)
[,1]
[1,] -0.41332316
[2,] -0.41332316
[3,] -0.41332316
[4,] -0.41332316
[5,] -0.41332316
[6,] -0.28854636
[7,] -0.28854636
[8,] -0.28854636
[9,] -0.28854636
[10,] -0.28854636
[11,] -0.03899275
[12,] -0.03899275
[13,] -0.03899275
[14,] -0.03899275
[15,] -0.03899275
[16,]  3.70431136
attr(,"scaled:center")
[1] 10.9375
attr(,"scaled:scale")
[1] 24.04293


scaling of the metrics (z score) is calculated using the formula

each sample data – mean / standard deviation

After scaling, if you find any values is greater than ±2, they are called outliers (odd points located away from the central measures). If SD > mean, it is outlier.

Other way to test the normality is Shapiro test

> shapiro.test(myNumbers)

Shapiro-Wilk normality test

data:  myNumbers
W = 0.40055, p-value = 3.532e-07


If the p-value is greater than 0.05, it is normalized data. Here it is not.

### Data Analysis

Here comes another interesting part. How to analyze the data, after you upload your files.

To explain this, I’d prepare sample data set first.

> #Employee ID
> sn <- seq(1, 10, 1)
> sn
[1]  1  2  3  4  5  6  7  8  9 10
> #Employee gender
> gender <- rep(c("male", "female"), c(6,4))
> gender
[1] "male"   "male"   "male"   "male"   "male"   "male"   "female" "female" "female"
[10] "female"
> #available departments
> dept <- rep(c("Admin", "HR", "Prod", "Contractor"), c(1, 3, 3, 3))
> dept
[1] "Admin"      "HR"         "HR"         "HR"         "Prod"       "Prod"
[7] "Prod"       "Contractor" "Contractor" "Contractor"
> #Employee Salary
> sal <- rnorm(10, 1000, 200);
> sal
[1]  888.2026  876.6272  919.7453 1005.9058 1084.4704 1337.7696  909.4302  801.1482
[9] 1025.9457 1182.9774
> #Now our data set
> mydataset <- data.frame(sn, gender, sal, dept)
> mydataset
sn gender       sal       dept
2   2   male  876.6272         HR
3   3   male  919.7453         HR
4   4   male 1005.9058         HR
5   5   male 1084.4704       Prod
6   6   male 1337.7696       Prod
7   7 female  909.4302       Prod
8   8 female  801.1482 Contractor
9   9 female 1025.9457 Contractor
10 10 female 1182.9774 Contractor


So we have serial number, gender, salary and department.

Which gender is majority in the given data? we shall use table function to arrive at a simple frequency distribution table.

> #which gender is more in each dept
> #this is frequency distribution
> table(mydataset$gender) female male 4 6  So we have 4 females and 6 males. Let’s group the above FD by department now. > table(mydataset$gender, mydataset$dept) Admin Contractor HR Prod female 0 3 0 1 male 1 0 3 2 > #assign the results to an object > freqDis <- table(mydataset$gender, mydataset$dept); > #transpose the table > t(freqDis) female male Admin 0 1 Contractor 3 0 HR 0 3 Prod 1 2  So all departments except contractors, have male as majority. The function t() stands for transpose. Let’s do a proportion of the data using prop.table() > #proportion > prop.table(freqDis) Admin Contractor HR Prod female 0.0 0.3 0.0 0.1 male 0.1 0.0 0.3 0.2  Rather than proportion, % of males and females would give us a better visibility. hence we multiply proportion by 100. > #Percentage > prop.table(freqDis)*100 Admin Contractor HR Prod female 0 30 0 10 male 10 0 30 20  Column sum and row sums are frequently asked in our day today life. > #sum > colSums(freqDis) Admin Contractor HR Prod 1 3 3 3  Row sum shall be calculated as below. > rowSums(freqDis) female male 4 6  Salary is interesting part in our profession. I’d like to see who earns more using aggregate() function? Male or Female? > #who earns more - male or female? > aggregate(sal~gender, mean, data = mydataset) gender sal 1 female 979.8754 2 male 1018.7868 > #sal is continuous variable > #gender is categorical variable > #mean is the function > #data is our data source  We used mean salary above. Lets use sum now. > aggregate(sal~gender, sum, data = mydataset) gender sal 1 female 3919.501 2 male 6112.721  Similarly, we use standard deviation. > aggregate(sal~gender, sd, data = mydataset) gender sal 1 female 163.5837 2 male 175.1006  So salary package for females looks to be more consistent than that of males. We calculated all the above functions individually. psych package helps to calculate everything in one command. > install.packages("psych") Installing package into ‘D:/gandhari/documents/R/win-library/3.4’ (as ‘lib’ is unspecified) trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/psych_1.7.5.zip' Content type 'application/zip' length 3966969 bytes (3.8 MB) downloaded 3.8 MB package ‘psych’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\pandian\AppData\Local\Temp\Rtmpy2L0Yq\downloaded_packages > library(psych) > describe (mydataset) vars n mean sd median trimmed mad min max range skew sn 1 10 5.50 3.03 5.50 5.50 3.71 1.00 10.00 9.00 0.00 gender* 2 10 1.60 0.52 2.00 1.62 0.00 1.00 2.00 1.00 -0.35 sal 3 10 1003.22 162.35 962.83 986.66 119.22 801.15 1337.77 536.62 0.71 dept* 4 10 2.80 1.03 3.00 2.88 1.48 1.00 4.00 3.00 -0.20 kurtosis se sn -1.56 0.96 gender* -2.05 0.16 sal -0.72 51.34 dept* -1.42 0.33  n is total samples, sd is standard deviation etc. All the aggregated functions are calculated for serial numbmer, gender, salary and department. In the above data, I get only the salary data, which is in 3rd row. > describe (mydataset[,3]) vars n mean sd median trimmed mad min max range skew kurtosis X1 1 10 1003.22 162.35 962.83 986.66 119.22 801.15 1337.77 536.62 0.71 -0.72 se X1 51.34  Lets group the above described data by gender, ie, mean, sd for males and females separately. > describe.by(mydataset, mydataset$gender)

Descriptive statistics by group
group: female
vars n   mean     sd median trimmed    mad    min     max  range skew
sn         1 4   8.50   1.29   8.50    8.50   1.48   7.00   10.00   3.00 0.00
gender*    2 4   1.00   0.00   1.00    1.00   0.00   1.00    1.00   0.00  NaN
sal        3 4 979.88 163.58 967.69  979.88 166.64 801.15 1182.98 381.83 0.14
dept*      4 4   2.50   1.00   2.00    2.50   0.00   2.00    4.00   2.00 0.75
kurtosis    se
sn         -2.08  0.65
gender*      NaN  0.00
sal        -2.04 81.79
dept*      -1.69  0.50
---------------------------------------------------------------
group: male
vars n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 6    3.50   1.87   3.50    3.50   2.22   1.00    6.00   5.00  0.00
gender*    2 6    2.00   0.00   2.00    2.00   0.00   2.00    2.00   0.00   NaN
sal        3 6 1018.79 175.10 962.83 1018.79 119.22 876.63 1337.77 461.14  0.83
dept*      4 6    3.00   1.10   3.00    3.00   0.74   1.00    4.00   3.00 -0.76
kurtosis    se
sn         -1.80  0.76
gender*      NaN  0.00
sal        -1.02 71.48
dept*      -0.92  0.45
Warning message:
describe.by is deprecated.  Please use the describeBy function


# Linear Programming – Covering Model using LibreOffice Calc Solver

🕋 Eid Mubarak, Selamat Hari Raya Haji ☪️

I have written about Linear Programming – Allocation model in my previous post Linear Programming and Linear Programming with LibreOffice Calc Solver. This post would talk about Linear Programming – Covering models.

First question would be – what’s the difference between Allocation model and Covering model. There is no difference in the optimization function. The difference exists in the constraints. All our constraints talk about maximum in allocation model. All those constraints had symbol. Covering models talk about minimization, usually cost.

### Example

I’ll use the data set given in https://paginas.fe.up.pt/~mac/ensino/docs/OR/otherDocs/PowellAllocationCoveringBlendingConstraints.pdf

Dahlby Outfitters wishes to introduce packaged trail mix as a new product. The ingredients for the trail mix are seeds, raisins, flakes, and two kinds of nuts. Each ingredient contains certain amounts of vitamins, minerals, protein, and calories.

The marketing department has specified that the product be designed so that a certain minimum nutritional profile is met. The decision problem is to determine the optimal product composition—that is, to minimize the product cost by choosing the amount for each of the ingredients in the mix. The data shown below summarize the parameters of the problem:

 Component Grams per pound Nutritional Requirement Seeds Raisins Flakes Pecans Walnuts Vitamins 10 20 10 30 20 20 Minerals 5 7 4 9 2 10 Protein 1 4 10 2 1 15 Calories 500 450 160 300 500 600 Cost/pound 4 5 3 7 6

Lets  denote the product names as S, R, F, P and W. Our objective function would be like this.

Total Cost = 4S+5R+3F+7P+6W

Rewriting the above statement as –

Zmin = 4S+5R+3F+7P+6W

subject to constraints –

 Vitamin content 10S + 20R + 10F + 30P + 20W greater than or eq 20 Mineral content 5S + 7R + 4F + 9P + 2W greater than or eq 10 Protein content 1S + 4R + 10F + 2P + 1W greater than or eq 15 Calorie content 500S + 450R + 160F + 300P + 500W greater than or eq 600

Rewriting the above constraints as linear equations as given below,

10S + 20R + 10F + 30P + 20W ≥ 20
5S + 7R + 4F + 9P + 2W ≥ 10
1S + 4R + 10F + 2P + 1W ≥ 15
500S + 450R + 160F + 300P + 500W ≥ 600

Prepare the data set. G9 is highlighted in yellow colour. This would be our minimizing figure.

The data have given the cost of each product already. So, our aim is to find how much amount of each product shall be produced. This would be the decision variable. We need to find out. The cells of the decision variables are also highlighted in yellow colour.

Let’s write the constraints now. Our aim is to find how much vitamin, mineral etc to be added in our product. Those cells are highlighted in yellow colour.

Let’s open the Solver now. Following is my selection.

1. Target cell is where we find the minimum cost.
2. As we are talking about minimum, we choose ‘optimize result to’ as ‘Minimum’
3. By changing cells = Decision variable cells
4. Limiting Constraints are highlighted with => operator.

Following is the result.

The answer I get in Calc is not equal to what I see in the reference PDF. However, let’s take it as the decision at the moment –

We would take 24.6, 10, 15, 600 for vitamins, minerals, protein and calories.

Linear programming suggests us to avoid pecans and walnuts.

0.5 x seeds, 0.3 x Raisins and 1.3 x Flakes are sufficient.

With this, we would be able to provide 24.6 vitamins, 10 minerals, 15 protein and 600 calories.

With this I’m closing the statistics post. I’d be starting the next part of this series soon, which is R programming.

# Linear Programming with LibreOffice Calc Solver

Hi,

I wrote about Linear Programming using a car factory example in my previous post Linear Programming. That was a manual graphical method computation. In this post, I’ll solve the same maximization problem using LibreOffice Calc.

Lets prepare the data –

What we need to find out? We need to find out how many cars of both types can be produced. Let’s leave one blank cell for each SUV (highlighted in yellow colour).

What’s out objective function? Maximum profit obtained using x number of SUV cars and y number of Sedan cars.

Let’s add another blank cell. The formula of this cell would be =SUMPRODUCT(B6:C6,B9:C9)

This is nothing but, number of SUV cars x SUV profit + number of Sedan cars x Sedan profit

We have constraints on raw material, machine and man power, right? Let’s add the the first constraint as below.

Number of SUV cars x raw material required by SUV car (2) should be lesser than or equal to maximum raw material available (18 tonnes). This is represented using the following formula.

=SUMPRODUCT(B3:C3,B9:C9)

Similarly we are adding another two constraints using the following formulas.

=SUMPRODUCT(B4:C4,B9:C9)

=SUMPRODUCT(B5:C5,B9:C9)

Finally out data set looks like this.

Open Tools> Solver

Let’s fill the solver window as below.

Target cell = Profit cell G9

We want maximization function, because we need to know the maximum count of cars can be produced.

Optimize results to = Maximum

We want to get the best results by changing the value of the count of cars shall be produced.

By changing cells = number of SUV and Sedan cars = B9 and C9.

1. Constraints cells B12:B14 should be lesser than maximum available resources D12:D14
2. Number of cars should not be negative
3. I tested a third constraint, which is maximum number of cars shall not exceed 10 (You may ignore this).

Finally, my selection is given as below.

Ok Siri, get me the results.

This solver suggests me not to produce any Sedan cars. It asks me to produce 2 SUV cars. This is same as that of the results I got in graphical analysis Linear Programming (point c in the below given graph).

In addition LibreOffice Calc, suggests me maximum resources can be consumed!

See you in another interesting post.

# Linear Programming

We talked about some of the interesting statistical computations in my earlier posts. Now, let’s get into Linear Programming.

Programming here, does not refer to computer programming. It refers to statistical operations.

Linear programming helps in planning to obtain optimum result that meets a specific goal after considering all the possible options. It is widely used to calculate the optimum allocation of scarce resources among competing demands. Formulation of linear programing is the representation of the problem situation in a mathematical form.

### Properties or prerequisites for linear programming model

1. Relationship among the decision variables must be linear.
2. A model must have an objective function
3. Constraints like resource constraints should exist.
4. None of the constraint should be a negative number

### Steps for the formulation of linear programming model

1. Identify the objective of the model
2. Identify the suitable variables and their appropriate unit to measure
3. Identify the required constraints and other parameters
4. Assign right algebraic symbols to the objective function (Z) and the variables (x1, x2 … xn)
5. Form the objective function z = c1x1 + c2x2 + .. .. .. + cnxn
6. Include all the constraints
ai1x1 + ai2x2 + .. .. .. +ainxn ≤ bi
0 ≤ xj
where i = 1 ~ m and j = 1 ~ n

### Simplex Algorithm

This is an algorithm for starting at some extreme feasible point and by using a sequence of exchanges, moving on to other such points until a final solution point is found.

### Example

Lets look at an example. A popular car maker in GST near Chennai has the following resources available per day.

• 18 tonnes of raw materials
• 9 hours of machine hours
• 8 hours of man power

This manufacturing plant of the company gets 5 lakhs and 2 lakh as profit from SUV and Sedan type cars respectively.  How many SUV and Sedan cars should be produced by the company to maximize total profit?

 Resources Requirement per unit Daily Availability SUV Sedan Raw material 2 1 18 tonnes Machine 2 3 9 hours Man Power 4 2 8 hours Profit 5 2 lakhs

Let’s perform the steps one by one.

#### Identify the key variables

Let x1 be the number of SUVs
Let x2 be the number of Sedans

#### Define the objective function

Based on the profit, company gets 3 Lakhs and 1 Lakh as benefit from x1 and x2. So,

Zmax = f(x,y) = 5x1+2x2

#### Include the constraints

Availability of the raw materials, machine hours and man power are our constraints.

Raw material 2 x1+1 x2≤18
machine hours 2 x1+3 x2≤9
Man Power 4 x1+2 x2≤8

Non-negative constraints:

The plant cannot produce -ve cars. So,

x1≥0
x2≥0

Hence our linear programming model for this case is given as:

Max Z = 5x1+2x2

Subject to constraints,

2 x1+1 x2≤18
2 x1+3 x2≤9
4 x1+2 x2≤8
x1≥0
x2≥0

#### Graphical method

Remove the inequalities to form equations

2x1+1x2=18  ————– Constraint 1
2x1+3x2=9  ————– Constraint 2
4x1+2x2=8  ————– Constraint 3
x1=0
x2=0

Substitute x1 = 0 in first equation.

2x1+1x2=18
0x1+1x2=18
x2=18

Now substitute x2 = 0

2x1+1x2=18
2×1+0x2=18
x1=9

So our data points for first constraint is given as below

 x1 0 9 x2 18 0

Similarly compute the data points for constraints 2 and 3.

Data points for constraint 2 is given below.

 x1 0 4.5 x2 3 0

Data points for constraint 2 is given below.

 x1 0 2 x2 4 0

Let’s plot the x-y or scatter graph now.

All our constraints are ≤. So we need to find an area in the graph which is lesser than all three constraints. This is highlighted in the below given graph.

So we have 4 data points a, b, c and d. Lets find those values from the graph.

 a b c d x1 0 0.8 2 0 x2 3 2.5 0 0

So we have got 4 sets of values, which is to be substituted in the objective function to find the maximization.

Max Z = 5x1+2x2

Substituting the values of data point a – (0, 3).

Max Z = 5(0)+2(3) = 6.

Substituting the values of data point b – (0.8, 2.5).

Max Z = 5(0.8)+2(2.5) = 9.

Substituting the values of data point c – (2, 0).

Max Z = 5(2)+2(0) = 10.

Substituting the values of data point d – (0, 0).

Max Z = 5(0)+2(0) = 0.

So, based on our finding, 2 SUV and 0 Sedan would give us best profit.

You may look at the following tutorial videos –