# Regression (Explanatory) in R

Hi,

I wrote about regression as a predictive model in my earlier post, Regression testing in R. The following posts are useful if you want to know what regression is.

The previous post talked about predicting unknown values from known values. This post explains how much of the change in the dependent variable (DV) is explained by the independent variable(s) (IV).

```
> setwd("D:/gandhari/videos/Advanced Business Analytics")
> student_data <- read.csv("student_data.csv")
> student_data
   id gender sup.help sup.under sup.appre adv.comp adv.access tut.prof tut.sched val.devel val.meet sat.glad sat.expe loy.proud loy.recom loy.pleas scholarships job
1   1 female        7         1         7        5          5        5         4         5        6        7        7         7         7         7           no  no
2   2   male        7         1         7        6          6        6         6         6        7        7        7         7         7         7           no yes
3   3 female        6         1         7        6          6        6         6         6        7        7        6         7         7         7           no  no
4   4   male        1         7         1        1          2        3         2         1        1        1        1         1         1         1          yes  no
5   5 female        6         5         7        7          6        7         7         7        7        7        7         7         7         7           no yes
6   6   male        3         1         7        7          7        6         7         6        6        7        6         7         7         7          yes  no
7   7 female        5         2         7        7          6        6         7         4        3        7        7         7         7         7          yes  no
8   8   male        6         1         7        7          7        7         5         7        6        7        7         5         6         7          yes yes
9   9 female        7         1         7        6          6        5         5         5        5        7        6         6         7         7           no yes
10 10   male        2         4         7        7          6        6         6         4        2        5        4         4         7         7           no  no
> str(student_data)
'data.frame': 10 obs. of 18 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2
$ sup.help : int 7 7 6 1 6 3 5 6 7 2
$ sup.under : int 1 1 1 7 5 1 2 1 1 4
$ sup.appre : int 7 7 7 1 7 7 7 7 7 7
$ adv.comp : int 5 6 6 1 7 7 7 7 6 7
$ adv.access : int 5 6 6 2 6 7 6 7 6 6
$ tut.prof : int 5 6 6 3 7 6 6 7 5 6
$ tut.sched : int 4 6 6 2 7 7 7 5 5 6
$ val.devel : int 5 6 6 1 7 6 4 7 5 4
$ val.meet : int 6 7 7 1 7 6 3 6 5 2
$ sat.glad : int 7 7 7 1 7 7 7 7 7 5
$ sat.expe : int 7 7 6 1 7 6 7 7 6 4
$ loy.proud : int 7 7 7 1 7 7 7 5 6 4
$ loy.recom : int 7 7 7 1 7 7 7 6 7 7
$ loy.pleas : int 7 7 7 1 7 7 7 7 7 7
$ scholarships: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 2 1 1
$ job : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 1
```

Sometimes the dataset is not completely visible in WordPress, so I'm giving it as an image below.

Support, advice, satisfaction and loyalty are each measured by multiple variables in the above data set, such as sup.help, sup.under, etc.

Let's combine each group into a single variable (the mean) for easier analysis.

```
> #get single score for support advice satisfaction loyalty
> student_data$support <- apply(student_data[,3:5],1,mean)
> summary(student_data$support)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   4.417   4.667   4.600   5.000   6.000
> student_data$value <- rowMeans(student_data[,10:11])
> student_data$sat <- rowMeans(student_data[,12:13])
> student_data$loy <- rowMeans(student_data[,14:16])
```

So we found the mean using apply() and rowMeans(). Those mean values are appended to our original data set student_data. Now, let’s take only 4 variables – gender and the 3 new variables value, sat and loy in a new data set for analysis.
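As a small check that the two approaches above agree, here is a minimal sketch. The `scores` data frame is a hypothetical stand-in that copies only the first three students' `sup.*` values from the listing; it is not the full data set.

```r
# Hypothetical mini data frame: first three students' sup.* scores from the post
scores <- data.frame(
  sup.help  = c(7, 7, 6),
  sup.under = c(1, 1, 1),
  sup.appre = c(7, 7, 7)
)

# Row-wise mean with apply(): MARGIN = 1 means "work across each row"
via_apply <- unname(apply(scores, 1, mean))

# Same calculation with rowMeans(), which is built for exactly this job
via_rowmeans <- unname(rowMeans(scores))

print(via_apply)
print(via_rowmeans)
print(isTRUE(all.equal(via_apply, via_rowmeans)))
```

Both calls give the same numbers; `rowMeans()` is simply shorter to type and usually faster on larger data.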

```
> student_data_min <- student_data[,c(2, 20:22)]
> head(student_data_min)
  gender value sat loy
1 female   5.5 7.0   7
2   male   6.5 7.0   7
3 female   6.5 6.5   7
4   male   1.0 1.0   1
5 female   7.0 7.0   7
6   male   6.0 6.5   7
```

Looks simple and great, isn’t it?

• If value for money is good, satisfaction score would be high.
• If the customer is satisfied, he would be loyal to the organization.

So loy is our dependent variable (DV), and sat and value are our independent variables (IVs). First, I'm using regression to see how gender influences loyalty.

```
> #DV - loy
> #IV - sat, value
> loyalty_gender_reln <- lm(loy~gender, data=student_data_min)
> summary (loyalty_gender_reln)

Call:
lm(formula = loy ~ gender, data = student_data_min)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4000  0.0667  0.0667  0.6000  1.6000

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.9333     0.7951   8.720 2.34e-05 ***
gendermale   -1.5333     1.1245  -1.364     0.21
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.778 on 8 degrees of freedom
Multiple R-squared:  0.1886,	Adjusted R-squared:  0.08717
F-statistic: 1.859 on 1 and 8 DF,  p-value: 0.2098

> #R2 is about 19%, a weak relation, and the p-value (0.21) is not significant. So gender does not appear to influence loyalty.
```

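The R-squared value is 0.1886, so gender explains only about 18.86% of the variation in loyalty, and the gendermale coefficient is not statistically significant (p = 0.21). Hence I'd conclude that gender doesn't noticeably influence loyalty in this sample.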
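To see what the two coefficients in this gender model mean, here is a minimal sketch. It assumes the `loy` values below, which I recomputed by hand as the row means of the `loy.*` columns in the listing; the vectors stand in for the columns of `student_data_min`.

```r
# loy recomputed by hand from loy.proud, loy.recom, loy.pleas (assumption: matches the post's data)
loy    <- c(7, 7, 7, 1, 7, 7, 7, 6, 20 / 3, 6)
gender <- factor(rep(c("female", "male"), times = 5))

fit <- lm(loy ~ gender)

# Mean loyalty per gender
group_means <- tapply(loy, gender, mean)

# With a single two-level factor, the intercept is the mean of the
# reference level (female) and gendermale is the male-minus-female gap
print(coef(fit))
print(group_means)
print(group_means[["male"]] - group_means[["female"]])
```

So the estimate of -1.53 only says that the male students in this sample scored about 1.5 points lower on average, and the large standard error is why that gap is not significant.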

Here is the influence of value for money on loyalty.

```
> loyalty_value_reln <- lm(loy~value, data = student_data_min)
> summary(loyalty_value_reln)

Call:
lm(formula = loy ~ value, data = student_data_min)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2182 -0.4953 -0.0403  0.5287  1.9618

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4901     1.1731   2.123   0.0665 .
value         0.7280     0.2181   3.338   0.0103 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.276 on 8 degrees of freedom
Multiple R-squared:  0.582,	Adjusted R-squared:  0.5298
F-statistic: 11.14 on 1 and 8 DF,  p-value: 0.01027
> #58%
```

Value for money explains 58.2% of the variation in loyalty. Next is the influence of satisfaction on loyalty.
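For a regression with one predictor, R-squared is just the squared correlation between the two variables. A minimal sketch of that check follows, assuming the `value` and `loy` vectors below, which I recomputed by hand from the `val.*` and `loy.*` columns in the listing.

```r
# Row means recomputed by hand from the printed data (assumption: matches the post)
value <- c(5.5, 6.5, 6.5, 1, 7, 6, 3.5, 6.5, 5, 3)
loy   <- c(7, 7, 7, 1, 7, 7, 7, 6, 20 / 3, 6)

fit <- lm(loy ~ value)

# R-squared as reported by summary()
r2_model <- summary(fit)$r.squared

# R-squared as squared Pearson correlation
r2_cor <- cor(value, loy)^2

# R-squared as 1 - (residual sum of squares / total sum of squares)
r2_manual <- 1 - sum(residuals(fit)^2) / sum((loy - mean(loy))^2)

print(c(model = r2_model, cor_squared = r2_cor, manual = r2_manual))
```

All three numbers should agree, which is a handy way to remember what the "58%" is measuring.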

```
> loyalty_sat_reln <- lm (loy~sat, data = student_data_min)
> summary(loyalty_sat_reln)

Call:
lm(formula = loy ~ sat, data = student_data_min)

Residuals:
     Min       1Q   Median       3Q      Max
-1.08586 -0.08586 -0.08586  0.29040  1.21212

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.6515     0.6992   0.932    0.379
sat           0.9192     0.1115   8.241 3.53e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6408 on 8 degrees of freedom
Multiple R-squared:  0.8946,	Adjusted R-squared:  0.8814
F-statistic: 67.91 on 1 and 8 DF,  p-value: 3.525e-05

> #89%
```

Wah, 89.46%! So to keep our customers, satisfaction should be high. That is the message we read here. I wish my beloved Air India would read this post.

Now let's combine all the predictors in one model.

```
> loyalty_everything <- lm(loy~., data = student_data_min)
> summary(loyalty_everything)

Call:
lm(formula = loy ~ ., data = student_data_min)

Residuals:
     Min       1Q   Median       3Q      Max
-1.01381 -0.28807 -0.01515  0.33286  1.13931

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.66470    1.03039   0.645  0.54273
gendermale  -0.01796    0.53076  -0.034  0.97411
value       -0.10252    0.23777  -0.431  0.68141
sat          1.00478    0.26160   3.841  0.00855 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7273 on 6 degrees of freedom
Multiple R-squared:  0.8982,	Adjusted R-squared:  0.8472
F-statistic: 17.64 on 3 and 6 DF,  p-value: 0.00222
```

Really, I don’t know how to read the above output at the moment. I’ll update this post (if I don’t forget!)
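One possible way to start reading the combined model is to check how strongly the predictors overlap and whether the extra predictors improve on satisfaction alone. The sketch below is only that, a sketch: the vectors are recomputed by hand from the printed data, and with 10 observations any reading is tentative.

```r
# Vectors recomputed by hand from the printed data (assumption: matches the post)
gender <- factor(rep(c("female", "male"), times = 5))
value  <- c(5.5, 6.5, 6.5, 1, 7, 6, 3.5, 6.5, 5, 3)
sat    <- c(7, 7, 6.5, 1, 7, 6.5, 7, 7, 6.5, 4.5)
loy    <- c(7, 7, 7, 1, 7, 7, 7, 6, 20 / 3, 6)

# How strongly do the two numeric predictors move together?
predictor_cor <- cor(value, sat)

# Nested models: satisfaction only versus everything
sat_only <- lm(loy ~ sat)
full     <- lm(loy ~ gender + value + sat)

# F-test: do gender and value add anything once sat is in the model?
comparison <- anova(sat_only, full)

print(predictor_cor)
print(comparison)
print(c(sat_only = summary(sat_only)$adj.r.squared,
        full     = summary(full)$adj.r.squared))
```

If value and sat turn out to be highly correlated, the model cannot separate their effects well, which may be why value loses its significance next to sat. The adjusted R-squared values point the same way: 0.88 for satisfaction alone against 0.85 for the combined model.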

To collate the results and show them in a consolidated format, we use screenreg() from the texreg package.

```
> install.packages("texreg")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/texreg_1.36.23.zip'
Content type 'application/zip' length 651831 bytes (636 KB)

package ‘texreg’ successfully unpacked and MD5 sums checked

> library("texreg")
Version:  1.36.23
Date:     2017-03-03
Author:   Philip Leifeld (University of Glasgow)

> library(texreg)
> screenreg(list(loyalty_gender_reln, loyalty_value_reln, loyalty_sat_reln, loyalty_everything))

====================================================
             Model 1    Model 2  Model 3    Model 4
----------------------------------------------------
(Intercept)   6.93 ***   2.49     0.65       0.66
             (0.80)     (1.17)   (0.70)     (1.03)
gendermale   -1.53                          -0.02
             (1.12)                         (0.53)
value                    0.73 *             -0.10
                        (0.22)              (0.24)
sat                               0.92 ***   1.00 **
                                 (0.11)     (0.26)
----------------------------------------------------
R^2           0.19       0.58     0.89       0.90
Adj. R^2      0.09       0.53     0.88       0.85
Num. obs.    10         10       10         10
RMSE          1.78       1.28     0.64       0.73
====================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
```
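If texreg is not installed, the R-squared rows of the table can be collected with base R alone. This is a minimal sketch, not a replacement for screenreg(); the vectors are recomputed by hand from the printed data and the list names are my own labels.

```r
# Vectors recomputed by hand from the printed data (assumption: matches the post)
gender <- factor(rep(c("female", "male"), times = 5))
value  <- c(5.5, 6.5, 6.5, 1, 7, 6, 3.5, 6.5, 5, 3)
sat    <- c(7, 7, 6.5, 1, 7, 6.5, 7, 7, 6.5, 4.5)
loy    <- c(7, 7, 7, 1, 7, 7, 7, 6, 20 / 3, 6)

# The same four models as in the post, kept in a named list
models <- list(
  gender     = lm(loy ~ gender),
  value      = lm(loy ~ value),
  sat        = lm(loy ~ sat),
  everything = lm(loy ~ gender + value + sat)
)

# Pull R-squared and adjusted R-squared out of each summary
r2     <- sapply(models, function(m) summary(m)$r.squared)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)

print(round(rbind(R2 = r2, Adj.R2 = adj_r2), 2))
```

Reading the adjusted row side by side makes it easy to see that the satisfaction-only model is the strongest of the four here.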

So this linear regression post explains the relation between the variables.

See you in another post with an interesting topic.