Hi,

I wrote about regression as a predictive model in my earlier post, Regression testing in R. The following posts are useful if you want to know what regression is.

The previous post covered predicting unknown values from known values. This post looks at how much of the change in the dependent variable (DV) is explained by the independent variable(s) (IV).

> setwd("D:/gandhari/videos/Advanced Business Analytics")
> student_data <- read.csv("student_data.csv")
> student_data
   id gender sup.help sup.under sup.appre adv.comp adv.access tut.prof tut.sched val.devel val.meet sat.glad sat.expe loy.proud loy.recom loy.pleas scholarships job
1   1 female        7         1         7        5          5        5         4         5        6        7        7         7         7         7           no  no
2   2   male        7         1         7        6          6        6         6         6        7        7        7         7         7         7           no yes
3   3 female        6         1         7        6          6        6         6         6        7        7        6         7         7         7           no  no
4   4   male        1         7         1        1          2        3         2         1        1        1        1         1         1         1          yes  no
5   5 female        6         5         7        7          6        7         7         7        7        7        7         7         7         7           no yes
6   6   male        3         1         7        7          7        6         7         6        6        7        6         7         7         7          yes  no
7   7 female        5         2         7        7          6        6         7         4        3        7        7         7         7         7          yes  no
8   8   male        6         1         7        7          7        7         5         7        6        7        7         5         6         7          yes yes
9   9 female        7         1         7        6          6        5         5         5        5        7        6         6         7         7           no yes
10 10   male        2         4         7        7          6        6         6         4        2        5        4         4         7         7           no  no
> str(student_data)
'data.frame': 10 obs. of 18 variables:
 $ id          : int 1 2 3 4 5 6 7 8 9 10
 $ gender      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2
 $ sup.help    : int 7 7 6 1 6 3 5 6 7 2
 $ sup.under   : int 1 1 1 7 5 1 2 1 1 4
 $ sup.appre   : int 7 7 7 1 7 7 7 7 7 7
 $ adv.comp    : int 5 6 6 1 7 7 7 7 6 7
 $ adv.access  : int 5 6 6 2 6 7 6 7 6 6
 $ tut.prof    : int 5 6 6 3 7 6 6 7 5 6
 $ tut.sched   : int 4 6 6 2 7 7 7 5 5 6
 $ val.devel   : int 5 6 6 1 7 6 4 7 5 4
 $ val.meet    : int 6 7 7 1 7 6 3 6 5 2
 $ sat.glad    : int 7 7 7 1 7 7 7 7 7 5
 $ sat.expe    : int 7 7 6 1 7 6 7 7 6 4
 $ loy.proud   : int 7 7 7 1 7 7 7 5 6 4
 $ loy.recom   : int 7 7 7 1 7 7 7 6 7 7
 $ loy.pleas   : int 7 7 7 1 7 7 7 7 7 7
 $ scholarships: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 2 1 1
 $ job         : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 1

Sometimes the dataset is not completely visible in WordPress, so I'm also giving it as an image below.

Support, advice, satisfaction and loyalty are each measured by several variables in the above data set, such as sup.help, sup.under, etc.

Let's combine each group into a single variable (the mean) for easier analysis.

> #get single score for support advice satisfaction loyalty
> student_data$support <- apply(student_data[,3:5],1,mean)
> summary (student_data$support)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   4.417   4.667   4.600   5.000   6.000
> student_data$value <- rowMeans(student_data[,10:11])
> student_data$sat <- rowMeans(student_data[,12:13])
> student_data$loy <- rowMeans(student_data[,14:16])

So we found the row means using apply() and rowMeans(). Those mean values are appended to our original data set student_data as new columns. Now, let's take only 4 variables into a new data set for analysis: gender and the 3 new variables value, sat and loy.
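As a quick check that the two approaches agree, here is a minimal sketch using only the three support columns copied from the data shown earlier. The data frame name `sup` and the two result names are just for illustration.

```r
# The three support items, copied from the data set printed above
sup <- data.frame(
  sup.help  = c(7, 7, 6, 1, 6, 3, 5, 6, 7, 2),
  sup.under = c(1, 1, 1, 7, 5, 1, 2, 1, 1, 4),
  sup.appre = c(7, 7, 7, 1, 7, 7, 7, 7, 7, 7)
)

# Row-wise mean, computed two ways
via_apply    <- apply(sup, 1, mean)   # MARGIN = 1 means "once per row"
via_rowMeans <- rowMeans(sup)         # dedicated helper for the same job

# TRUE when both give the same numbers
all.equal(unname(via_apply), unname(via_rowMeans))

# Overall mean of the support score (matches the Mean in summary() above)
mean(via_rowMeans)
```

Either call works here; rowMeans() is simply the shorter way to write it.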

> student_data_min <- student_data[,c(2, 20:22)]
> head(student_data_min)
  gender value sat loy
1 female   5.5 7.0   7
2   male   6.5 7.0   7
3 female   6.5 6.5   7
4   male   1.0 1.0   1
5 female   7.0 7.0   7
6   male   6.0 6.5   7

Looks simple and great, doesn't it?

- If value for money is good, satisfaction score would be high.
- If the customer is satisfied, he would be loyal to the organization.

So loy is our dependent variable (DV), and sat and value are our independent variables (IV). First, I'm using regression to see how gender influences loyalty.

> #DV - loy
> #IV - sat, value
> loyalty_gender_reln <- lm(loy~gender, data=student_data_min)
> summary (loyalty_gender_reln)
Call:
lm(formula = loy ~ gender, data = student_data_min)
Residuals:
    Min      1Q  Median      3Q     Max
-4.4000  0.0667  0.0667  0.6000  1.6000
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.9333     0.7951   8.720 2.34e-05 ***
gendermale   -1.5333     1.1245  -1.364     0.21
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.778 on 8 degrees of freedom
Multiple R-squared: 0.1886, Adjusted R-squared: 0.08717
F-statistic: 1.859 on 1 and 8 DF, p-value: 0.2098
> #R2 is 18%, which says weak relation. So gender does not influence the loyalty.

The R-squared value is 0.1886, so gender explains only about 18.86% of the variation in loyalty, which is a very weak relation. The p-value for gendermale (0.21) is also well above 0.05. Hence I'd conclude that gender doesn't influence loyalty in this data.
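To make the R-squared figure less of a black box, here is a small sketch that recomputes it by hand as 1 minus (residual sum of squares / total sum of squares). The loy and gender vectors are typed in from the data above (loy is the row mean of the three loy.* columns); the variable names are only for illustration.

```r
# loy and gender for the ten students, derived from the data above
loy    <- c(7, 7, 7, 1, 7, 7, 7, 6, 20/3, 6)
gender <- factor(rep(c("female", "male"), times = 5))

fit <- lm(loy ~ gender)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((loy - mean(loy))^2)
r2_by_hand <- 1 - ss_res / ss_tot

r2_by_hand                # about 0.1886, as in the summary above
summary(fit)$r.squared    # the same value, as reported by R
```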

Here is the influence of value for money on loyalty.

> loyalty_value_reln <- lm(loy~value, data = student_data_min)
> summary(loyalty_value_reln)
Call:
lm(formula = loy ~ value, data = student_data_min)
Residuals:
    Min      1Q  Median      3Q     Max
-2.2182 -0.4953 -0.0403  0.5287  1.9618
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4901     1.1731   2.123   0.0665 .
value         0.7280     0.2181   3.338   0.0103 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.276 on 8 degrees of freedom
Multiple R-squared: 0.582, Adjusted R-squared: 0.5298
F-statistic: 11.14 on 1 and 8 DF, p-value: 0.01027
> #58%

Value for money explains 58.2% of the variation in loyalty (R-squared = 0.582). Next is the influence of satisfaction on loyalty.

> loyalty_sat_reln <- lm (loy~sat, data = student_data_min)
> summary(loyalty_sat_reln)
Call:
lm(formula = loy ~ sat, data = student_data_min)
Residuals:
     Min       1Q   Median       3Q      Max
-1.08586 -0.08586 -0.08586  0.29040  1.21212
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.6515     0.6992   0.932    0.379
sat           0.9192     0.1115   8.241 3.53e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6408 on 8 degrees of freedom
Multiple R-squared: 0.8946, Adjusted R-squared: 0.8814
F-statistic: 67.91 on 1 and 8 DF, p-value: 3.525e-05
> #89%

Wah, 89.46%. So to keep our customers, satisfaction should be high. That is the message we read here. I wish my beloved Air India would read this post.
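For a model with a single numeric predictor, R-squared is just the squared correlation between the predictor and the DV. The sketch below checks that for satisfaction and loyalty, with sat and loy typed in from the data above (each is the row mean of its sat.* or loy.* columns); the variable names are only for illustration.

```r
# sat and loy for the ten students, derived from the data above
sat <- c(7, 7, 6.5, 1, 7, 6.5, 7, 7, 6.5, 4.5)
loy <- c(7, 7, 7, 1, 7, 7, 7, 6, 20/3, 6)

r <- cor(sat, loy)            # Pearson correlation
r_squared_from_cor <- r^2
r_squared_from_lm  <- summary(lm(loy ~ sat))$r.squared

r_squared_from_cor            # about 0.8946
r_squared_from_lm             # same value
```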

Below, we combine all the predictors in one model.

> loyalty_everything <- lm(loy~., data = student_data_min)
> summary(loyalty_everything)
Call:
lm(formula = loy ~ ., data = student_data_min)
Residuals:
     Min       1Q   Median       3Q      Max
-1.01381 -0.28807 -0.01515  0.33286  1.13931
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.66470    1.03039   0.645  0.54273
gendermale  -0.01796    0.53076  -0.034  0.97411
value       -0.10252    0.23777  -0.431  0.68141
sat          1.00478    0.26160   3.841  0.00855 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7273 on 6 degrees of freedom
Multiple R-squared: 0.8982, Adjusted R-squared: 0.8472
F-statistic: 17.64 on 3 and 6 DF, p-value: 0.00222

Really, I don't know how to read the above values at the moment. I'll update this post (if I don't forget!)
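One possible reading, offered as a sketch rather than a definitive interpretation: in a model with several predictors, each estimate is the expected change in loy for a one-unit change in that predictor while the other predictors are held fixed. The code below rebuilds the four columns from the data above (the names `d`, `fit_all` and `significant` are only for illustration), pulls out the coefficient table, and lists the predictors whose p-value is below the usual 0.05 cut-off. It also prints the correlation between value and sat, since two predictors that overlap heavily can make one of them look unimportant once the other is in the model.

```r
# Columns rebuilt from the data set printed earlier in the post
gender <- factor(rep(c("female", "male"), times = 5))
value  <- c(5.5, 6.5, 6.5, 1, 7, 6, 3.5, 6.5, 5, 3)
sat    <- c(7, 7, 6.5, 1, 7, 6.5, 7, 7, 6.5, 4.5)
loy    <- c(7, 7, 7, 1, 7, 7, 7, 6, 20/3, 6)
d <- data.frame(gender, value, sat, loy)

fit_all <- lm(loy ~ ., data = d)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit_all))

# Terms whose p-value is below 0.05
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
significant

# How strongly the two numeric predictors overlap with each other
cor(d$value, d$sat)
```

With only 10 observations, any such reading should be treated with caution.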

To collate the results and show them in a consolidated format, we use screenreg() from the texreg package.

> install.packages("texreg")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/texreg_1.36.23.zip'
Content type 'application/zip' length 651831 bytes (636 KB)
downloaded 636 KB
package ‘texreg’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\pandian\AppData\Local\Temp\Rtmp085gnT\downloaded_packages
> library("texreg")
Version: 1.36.23
Date: 2017-03-03
Author: Philip Leifeld (University of Glasgow)
Please cite the JSS article in your publications -- see citation("texreg").
> library(texreg)
> screenreg(list(loyalty_gender_reln, loyalty_value_reln, loyalty_sat_reln, loyalty_everything))
====================================================
             Model 1   Model 2  Model 3   Model 4
----------------------------------------------------
(Intercept)   6.93 ***  2.49     0.65      0.66
             (0.80)    (1.17)   (0.70)    (1.03)
gendermale   -1.53                        -0.02
             (1.12)                       (0.53)
value                   0.73 *            -0.10
                       (0.22)             (0.24)
sat                              0.92 ***  1.00 **
                                (0.11)    (0.26)
----------------------------------------------------
R^2           0.19      0.58     0.89      0.90
Adj. R^2      0.09      0.53     0.88      0.85
Num. obs.    10        10       10        10
RMSE          1.78      1.28     0.64      0.73
====================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

So this linear regression post explains the relation between the variables.

See you in another post with an interesting topic.