Regression (Explanatory) in R

Hi,

I wrote about regression as a predictive model in my earlier post Regression testing in R. The following posts are useful if you want to know what regression is.

The previous post dealt with predicting unknown values from known values. This post explains how much of the change in the dependent variable (DV) is explained by the independent variable(s) (IVs).

> setwd("D:/gandhari/videos/Advanced Business Analytics")
> student_data <- read.csv("student_data.csv")
> student_data
   id gender sup.help sup.under sup.appre adv.comp adv.access tut.prof tut.sched val.devel val.meet sat.glad sat.expe loy.proud loy.recom loy.pleas scholarships job
1   1 female        7         1         7        5          5        5         4         5        6        7        7         7         7         7           no  no
2   2   male        7         1         7        6          6        6         6         6        7        7        7         7         7         7           no yes
3   3 female        6         1         7        6          6        6         6         6        7        7        6         7         7         7           no  no
4   4   male        1         7         1        1          2        3         2         1        1        1        1         1         1         1          yes  no
5   5 female        6         5         7        7          6        7         7         7        7        7        7         7         7         7           no yes
6   6   male        3         1         7        7          7        6         7         6        6        7        6         7         7         7          yes  no
7   7 female        5         2         7        7          6        6         7         4        3        7        7         7         7         7          yes  no
8   8   male        6         1         7        7          7        7         5         7        6        7        7         5         6         7          yes yes
9   9 female        7         1         7        6          6        5         5         5        5        7        6         6         7         7           no yes
10 10   male        2         4         7        7          6        6         6         4        2        5        4         4         7         7           no  no
> str(student_data)
'data.frame': 10 obs. of 18 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2
$ sup.help : int 7 7 6 1 6 3 5 6 7 2
$ sup.under : int 1 1 1 7 5 1 2 1 1 4
$ sup.appre : int 7 7 7 1 7 7 7 7 7 7
$ adv.comp : int 5 6 6 1 7 7 7 7 6 7
$ adv.access : int 5 6 6 2 6 7 6 7 6 6
$ tut.prof : int 5 6 6 3 7 6 6 7 5 6
$ tut.sched : int 4 6 6 2 7 7 7 5 5 6
$ val.devel : int 5 6 6 1 7 6 4 7 5 4
$ val.meet : int 6 7 7 1 7 6 3 6 5 2
$ sat.glad : int 7 7 7 1 7 7 7 7 7 5
$ sat.expe : int 7 7 6 1 7 6 7 7 6 4
$ loy.proud : int 7 7 7 1 7 7 7 5 6 4
$ loy.recom : int 7 7 7 1 7 7 7 6 7 7
$ loy.pleas : int 7 7 7 1 7 7 7 7 7 7
$ scholarships: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 2 1 1
$ job : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 1

Sometimes the data set is not completely visible in WordPress, so I'm also giving it as an image below.

Student_data

Support, advice, value, satisfaction, and loyalty each span multiple variables in the above data set, such as sup.help, sup.under, etc.

Let's collapse each group into a single variable (its mean) for easier analysis.

> # get a single score each for support, value, satisfaction, and loyalty
> student_data$support <- apply(student_data[,3:5], 1, mean)
> summary(student_data$support)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   4.417   4.667   4.600   5.000   6.000
> student_data$value <- rowMeans(student_data[,10:11])
> student_data$sat <- rowMeans(student_data[,12:13])
> student_data$loy <- rowMeans(student_data[,14:16])
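
As a quick optional sanity check: rowMeans() is just a faster equivalent of apply(x, 1, mean), so the two should agree on, say, the support columns (this line should print TRUE):

> # both compute row-wise arithmetic means
> all.equal(rowMeans(student_data[,3:5]), apply(student_data[,3:5], 1, mean))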

So we computed the means using apply() and rowMeans(), and those mean values are appended to our original student_data data set. Now, let's take only four variables (gender plus the three new variables value, sat, and loy) into a new data set for analysis.

> student_data_min <- student_data[,c(2, 20:22)]
> head(student_data_min)
  gender value sat loy
1 female   5.5 7.0   7
2   male   6.5 7.0   7
3 female   6.5 6.5   7
4   male   1.0 1.0   1
5 female   7.0 7.0   7
6   male   6.0 6.5   7

Looks simple and neat, doesn't it?

  • If value for money is good, the satisfaction score should be high.
  • If customers are satisfied, they should be loyal to the organization.

So loy is our dependent variable (DV), and gender, value, and sat are our independent variables (IVs). First, I'm using regression to see how much gender influences loyalty.

> #DV - loy
> #IV - sat, value
> loyalty_gender_reln <- lm(loy~gender, data=student_data_min)
> summary (loyalty_gender_reln)

Call:
lm(formula = loy ~ gender, data = student_data_min)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4000  0.0667  0.0667  0.6000  1.6000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.9333     0.7951   8.720 2.34e-05 ***
gendermale   -1.5333     1.1245  -1.364     0.21
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.778 on 8 degrees of freedom
Multiple R-squared:  0.1886,	Adjusted R-squared:  0.08717
F-statistic: 1.859 on 1 and 8 DF,  p-value: 0.2098

> # R2 is about 19%, a weak relation; so gender does not influence loyalty

The R-squared value is 0.1886, i.e., 18.86%, which indicates a very weak relationship (and the coefficient's p-value of 0.21 is not significant). Hence I'd conclude that gender doesn't influence loyalty.
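
By the way, these numbers can be pulled out of the fitted model programmatically instead of being read off the printout; a small optional sketch:

> s <- summary(loyalty_gender_reln)
> s$r.squared                                # 0.1886
> s$coefficients["gendermale", "Pr(>|t|)"]   # 0.21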

Here is the influence of value for money on loyalty.

> loyalty_value_reln <- lm(loy~value, data = student_data_min)
> summary(loyalty_value_reln)

Call:
lm(formula = loy ~ value, data = student_data_min)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2182 -0.4953 -0.0403  0.5287  1.9618 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4901     1.1731   2.123   0.0665 .
value         0.7280     0.2181   3.338   0.0103 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.276 on 8 degrees of freedom
Multiple R-squared:  0.582,	Adjusted R-squared:  0.5298
F-statistic: 11.14 on 1 and 8 DF,  p-value: 0.01027
> #58%

Value for money explains 58.2% of the variance in loyalty. Next, here is the influence of satisfaction on loyalty.

> loyalty_sat_reln <- lm (loy~sat, data = student_data_min)
> summary(loyalty_sat_reln)

Call:
lm(formula = loy ~ sat, data = student_data_min)

Residuals:
     Min       1Q   Median       3Q      Max
-1.08586 -0.08586 -0.08586  0.29040  1.21212 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.6515     0.6992   0.932    0.379
sat           0.9192     0.1115   8.241 3.53e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6408 on 8 degrees of freedom
Multiple R-squared:  0.8946,	Adjusted R-squared:  0.8814
F-statistic: 67.91 on 1 and 8 DF,  p-value: 3.525e-05

> #89%

Wow, 89.46%! So to retain our customers, satisfaction should be high. That is the takeaway here. I wish my beloved Air India would read this post.
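
With only 10 observations, though, the point estimate is shaky. As a quick optional check, confint() reports 95% confidence intervals for the intercept and the sat slope:

> confint(loyalty_sat_reln)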

We are combining everything below.

> loyalty_everything <- lm(loy~., data = student_data_min)
> summary(loyalty_everything)

Call:
lm(formula = loy ~ ., data = student_data_min)

Residuals:
     Min       1Q   Median       3Q      Max
-1.01381 -0.28807 -0.01515  0.33286  1.13931 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.66470    1.03039   0.645  0.54273
gendermale  -0.01796    0.53076  -0.034  0.97411
value       -0.10252    0.23777  -0.431  0.68141
sat          1.00478    0.26160   3.841  0.00855 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7273 on 6 degrees of freedom
Multiple R-squared:  0.8982,	Adjusted R-squared:  0.8472
F-statistic: 17.64 on 3 and 6 DF,  p-value: 0.00222

Reading the combined model: once sat is in the model, gender and value add almost nothing (their p-values are large, and value's estimate even flips sign), so satisfaction carries nearly all the explanatory power. The overall R-squared (0.8982) barely improves on the sat-only model (0.8946).
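
One way to check this formally is a nested-model F test comparing the sat-only model with the full model; a small p-value would mean the extra predictors help (here we expect a large one):

> anova(loyalty_sat_reln, loyalty_everything)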

To collate the results into a consolidated view, we use screenreg() from the texreg package.

> install.packages("texreg")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/texreg_1.36.23.zip'
Content type 'application/zip' length 651831 bytes (636 KB)
downloaded 636 KB

package ‘texreg’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\pandian\AppData\Local\Temp\Rtmp085gnT\downloaded_packages
> library("texreg")
Version:  1.36.23
Date:     2017-03-03
Author:   Philip Leifeld (University of Glasgow)

Please cite the JSS article in your publications -- see citation("texreg").
> screenreg(list(loyalty_gender_reln, loyalty_value_reln, loyalty_sat_reln, loyalty_everything))

====================================================
             Model 1    Model 2  Model 3    Model 4 
----------------------------------------------------
(Intercept)   6.93 ***   2.49     0.65       0.66   
             (0.80)     (1.17)   (0.70)     (1.03)  
gendermale   -1.53                          -0.02   
             (1.12)                         (0.53)  
value                    0.73 *             -0.10   
                        (0.22)              (0.24)  
sat                               0.92 ***   1.00 **
                                 (0.11)     (0.26)  
----------------------------------------------------
R^2           0.19       0.58     0.89       0.90   
Adj. R^2      0.09       0.53     0.88       0.85   
Num. obs.    10         10       10         10      
RMSE          1.78       1.28     0.64       0.73   
====================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

So this linear regression post explained the relationships between our variables.

See you in another post with an interesting topic.


Regression testing in R

The t-test and ANOVA are two parametric statistical techniques used to test hypotheses. When the population means of only two groups are to be compared, the t-test is used; when the means of more than two groups are to be compared, ANOVA is used.

Here are my previous posts about ANOVA

Both the t-test and ANOVA belong to the General Linear Model (GLM) family.

So the t-test and ANOVA let us compare group means. Regression takes us further: we can have multiple IVs acting on one DV. ANOVA allows categorical IVs only, but regression allows both categorical and continuous IVs, in addition to multiple IVs. The DV should be continuous.
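Since both belong to the GLM family, ANOVA on a categorical IV and lm() with the same factor are the same underlying model. A small sketch (using the sal data set loaded in the next section, with its Level factor); the F statistics should match:

> anova(lm(Salary ~ Level, data = sal))     # regression interface
> summary(aov(Salary ~ Level, data = sal))  # ANOVA interface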

We need to check correlation before getting into regression. If there is no correlation, or only a poor one, let's not think about regression at all. I have written about correlation in the following posts.

While correlation shows the degree and direction (+ve or -ve) of a relationship, regression gives us an equation for that relationship. So we are going to estimate the DV based on changes in the IVs.

Let’s take the same salary data used in my previous examples.

> setwd("d:/gandhari/videos/Advanced Business Analytics/")
> sal <-read.csv("sal.csv")
> head(sal)
  id gender      educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer
1  1 female        UG Jr Engineer   JLM  10000        5901.74           4098.26        3                      4
2  2   male DOCTORATE    Chairman   TLM 100000        4247.31          95752.69       20                      4
3  3   male   DIPLOMA       Jr HR   JLM   6000        3895.76           2104.24        1                      3
4  4   male        PG    Engineer   MLM  15000        9108.36           5891.64        7                      2
5  5 female        PG Sr Engineer   MLM  25000        4269.39          20730.61       12                      4
6  6   male   DIPLOMA Jr Engineer   JLM   6000        4137.31           1862.69        1                      1
> dim(sal)
[1] 50 10
> str(sal)
'data.frame':	50 obs. of  10 variables:
 $ id                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender                : Factor w/ 2 levels "female","male": 1 2 2 2 1 2 2 1 1 1 ...
 $ educ                  : Factor w/ 4 levels "DIPLOMA","DOCTORATE",..: 4 2 1 3 3 1 1 3 3 3 ...
 $ Designation           : Factor w/ 6 levels "Chairman","Engineer",..: 4 1 5 2 6 4 3 2 2 2 ...
 $ Level                 : Factor w/ 3 levels "JLM","MLM","TLM": 1 3 1 2 2 1 1 2 2 2 ...
 $ Salary                : int  10000 100000 6000 15000 25000 6000 8000 13000 14000 16000 ...
 $ Loan.deduction        : num  5902 4247 3896 9108 4269 ...
 $ Last.drawn.salary     : num  4098 95753 2104 5892 20731 ...
 $ Pre..Exp              : int  3 20 1 7 12 1 2 7 7 8 ...
 $ Ratings.by.interviewer: int  4 4 3 2 4 1 4 3 2 4 ...
 > tail (sal, n=10)
   id gender      educ  Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer
41 41   male        PG     Engineer   MLM  15000        1741.33          13258.67        7                      2
42 42 female        PG  Sr Engineer   MLM  25000        2934.33          22065.67       12                      4
43 43   male   DIPLOMA  Jr Engineer   JLM   6000        2803.03           3196.97        1                      1
44 44   male   DIPLOMA Jr Associate   JLM   8000        5480.77           2519.23        2                      4
45 45 female        PG     Engineer   MLM  13000        1317.26          11682.74        7                      3
46 46 female        PG     Engineer   MLM  16000        9927.11           6072.89        8                      4
47 47 female        UG  Jr Engineer   JLM  10000        2507.66           7492.34        3                      4
48 48   male DOCTORATE     Chairman   TLM 100000        9684.88          90315.12       20                      4
49 49   male   DIPLOMA        Jr HR   JLM   6000        2717.26           3282.74        1                      3
50 50   male        PG     Engineer   MLM  15000        4512.12          10487.88        7                      2
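
Before building the model, here is the correlation check mentioned earlier, as a minimal sketch (salary against previous experience):

> cor(sal$Salary, sal$Pre..Exp)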

Hope the preview of the data set I've given above makes sense.
To predict, we need two types of data: training data and testing data.

> salarytrain <-sal[1:35,]
> salarytest <- sal[36:50,]
> dim (salarytrain)
[1] 35 10
> dim (salarytest)
[1] 15 10
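
Note that this split simply takes the first 35 rows. If the rows are ordered in any way, a random split is safer; a minimal sketch (the seed value is arbitrary):

> set.seed(42)                                  # any seed will do
> idx <- sample(seq_len(nrow(sal)), size = 35)  # 35 random row numbers
> salarytrain <- sal[idx, ]
> salarytest  <- sal[-idx, ]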

Let’s run the regression now.

> salreg <- lm(Salary~educ, data=salarytrain)
> summary(salreg)

Call:
lm(formula = Salary ~ educ, data = salarytrain)

Residuals:
    Min      1Q  Median      3Q     Max
-4333.3 -2333.3  -727.3   636.4  7666.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)       6727       1128   5.962 1.37e-06 ***
educDOCTORATE    93273       2438  38.264  < 2e-16 ***
educPG           10606       1432   7.405 2.44e-08 ***
educUG            3273       2438   1.343    0.189
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3742 on 31 degrees of freedom
Multiple R-squared:  0.9803,	Adjusted R-squared:  0.9783
F-statistic: 513.1 on 3 and 31 DF,  p-value: < 2.2e-16

The general formula for the prediction is given below.

  • Y = a + b1 * X1 + e
  • Y is the DV
  • X1 is the IV
  • a is the intercept, baseline, or constant
  • b1 is the coefficient (slope) of X1
  • e is the error term

Let's substitute the values. Since educ is a factor, R creates one dummy (0/1) variable per non-baseline level, with DIPLOMA as the baseline:
Predicted Salary = 6727 + 93273 * educDOCTORATE + 10606 * educPG + 3273 * educUG

The R-squared value is 0.9803, i.e., 98.03%. This is very high: education explains about 98% of the variance in salary, and the remaining 2% is unexplained variance.

The intercept 6727 is the baseline: a person in the reference category (a DIPLOMA holder) is predicted to earn 6727.

A UG holder is predicted to earn 6727 + 3273 = 10000.
A PG holder is predicted to earn 6727 + 10606 = 17333, and a DOCTORATE holder 6727 + 93273 = 100000.
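
To see exactly which dummy columns R builds for educ, model.matrix() gives a quick optional peek:

> head(model.matrix(~ educ, data = salarytrain))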

We have considered only education in this example. A plus point of regression is that we can use more than one IV. Here, I want to consider years of experience in addition to education. The command goes as below.

> salexp <- lm(Salary~educ + Pre..Exp, data=salarytrain)
> summary(salexp)

Call:
lm(formula = Salary ~ educ + Pre..Exp, data = salarytrain)

Residuals:
    Min      1Q  Median      3Q     Max
-886.44 -102.30   34.25   78.50 1113.56 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3705.6      162.5   22.80  < 2e-16 ***
educDOCTORATE  51977.1     1019.4   50.99  < 2e-16 ***
educPG         -5330.2      417.5  -12.77 1.17e-13 ***
educUG          -353.2      327.2   -1.08    0.289
Pre..Exp        2215.9       52.0   42.61  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 485 on 30 degrees of freedom
Multiple R-squared:  0.9997,	Adjusted R-squared:  0.9996
F-statistic: 2.336e+04 on 4 and 30 DF,  p-value: < 2.2e-16

This time education and experience together explain 99.97% of the variance in salary. Looking at the estimates, educUG is not significant (p = 0.289), while previous experience and DOCTORATE are strong, highly significant predictors. Notice also that educPG's estimate flips sign once experience is in the model, because education and experience overlap. If your R-squared is low, the relationship is weak; do not use the model for prediction in that case, and search for better IVs instead.

Let’s predict the salary now.

> salpred <- predict(salreg, salarytest)
> salpred
        36         37         38         39         40         41         42         43         44         45         46
 17333.333  17333.333  10000.000 100000.000   6727.273  17333.333  17333.333   6727.273   6727.273  17333.333  17333.333
        47         48         49         50
 10000.000 100000.000   6727.273  17333.333

So these are the predicted salaries for rows 36 to 50, using the education-only model salreg.

Let's use cbind() for a better view. Salary is the actual salary; salpred is the predicted salary. In some cases the prediction is close; in others it is far off. The difference between the actual salary (actual Y) and the predicted salary (predicted Y) is called the residual. The lower the residuals, the better the prediction.
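
A single number summarizing those residuals is the test-set RMSE; a minimal sketch:

> # residual = actual - predicted; RMSE aggregates them into one number
> resid_test <- salarytest$Salary - salpred
> sqrt(mean(resid_test^2))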

> cbind(salarytest, salpred)
   id gender      educ  Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer    salpred
36 36 female        PG     Engineer   MLM  14000         716.48          13283.52        7                      2  17333.333
37 37 female        PG     Engineer   MLM  16000        6595.95           9404.05        8                      4  17333.333
38 38 female        UG  Jr Engineer   JLM  10000        5433.07           4566.93        3                      4  10000.000
39 39   male DOCTORATE     Chairman   TLM 100000        9028.68          90971.32       20                      4 100000.000
40 40   male   DIPLOMA        Jr HR   JLM   6000         794.66           5205.34        1                      3   6727.273
41 41   male        PG     Engineer   MLM  15000        1741.33          13258.67        7                      2  17333.333
42 42 female        PG  Sr Engineer   MLM  25000        2934.33          22065.67       12                      4  17333.333
43 43   male   DIPLOMA  Jr Engineer   JLM   6000        2803.03           3196.97        1                      1   6727.273
44 44   male   DIPLOMA Jr Associate   JLM   8000        5480.77           2519.23        2                      4   6727.273
45 45 female        PG     Engineer   MLM  13000        1317.26          11682.74        7                      3  17333.333
46 46 female        PG     Engineer   MLM  16000        9927.11           6072.89        8                      4  17333.333
47 47 female        UG  Jr Engineer   JLM  10000        2507.66           7492.34        3                      4  10000.000
48 48   male DOCTORATE     Chairman   TLM 100000        9684.88          90315.12       20                      4 100000.000
49 49   male   DIPLOMA        Jr HR   JLM   6000        2717.26           3282.74        1                      3   6727.273
50 50   male        PG     Engineer   MLM  15000        4512.12          10487.88        7                      2  17333.333

See you in another interesting post.

Computing Regression with LibreOffice Calc

Hi,

I wrote about the regression equation using the algebraic method in my previous post Linear Regression. Now, I'll show you how to compute it using LibreOffice Calc.

Let’s start!

Choose your data set. I'm choosing the same data set I used in my previous post. You may check whether my algebraic computation was correct!

Regression 3

Select Data > Statistics > Regression.

Regression 4

Choose variable 1, which is our X.

Regression 5

Here is how we choose Variable 1

Regression 6

Here is how we choose Variable 2, which is our Y.

Regression 7

Choose the output cell, where you want LibreOffice to write the output.

Regression 8

Finally, these are all the options we selected.

Regression 9

And here is the output. You may check the slope and intercept values against my previous post 🙂

Regression 10

See you in another interesting post.

Linear Regression

I have written about correlation in my previous post Identifying the correlation coefficient using LibreOffice Calc. Correlation describes the effect of one variable on another, that is, the relationship between known values.

In this post, we pay attention to predicting unknown values from known values, which is regression. The primary objective of regression analysis is to provide estimates of dependent variables from independent variables.

Calculation of Regression Equation

We shall use the following methods to study regression:

  • Algebraic Method
  • Graphical Method

Calculation of Regression using algebraic method

The regression equation of X on Y is given below:

Xe = a + bY

a, b = two unknown constants of the line:

a is the level (intercept) of the fitted line.

b is the slope of the line.

Xe is the value of X computed from the relationship for a given Y.

By the least squares method, we find the values of a and b that determine the regression line. This line is called the “line of best fit”.

ΣX = Na + bΣY

ΣXY = aΣY + bΣY²

N is the number of observed pairs of values.

Now, the regression equation of Y on X:

Ye = a + bX

To compute the values of a and b, the normal equations are given below.

ΣY = Na + bΣX

ΣXY = aΣX + bΣX²
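
Solving these two normal equations simultaneously yields the familiar closed forms (a standard least-squares result, stated here for reference):

b = (N·ΣXY − ΣX·ΣY) / (N·ΣX² − (ΣX)²)

a = (ΣY − b·ΣX) / N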

Example

Let's take the following data set to work out.

X Y
10 10
12 22
13 24
16 27
17 29
20 33
25 37

Totals: ΣX = 113, ΣY = 182

So, lets compute the values needed for our equation –

X Y X² XY
10 10 100 100
12 22 144 264
13 24 169 312
16 27 256 432
17 29 289 493
20 33 400 660
25 37 625 925
ΣX = 113 ΣY = 182 ΣX² = 1983 ΣXY = 3186

The regression equation of Y on X:

Ye = a + bX

To compute the values of a and b, substitute into the first normal equation:

ΣY = Na + bΣX

182 = 7a + 113b

Rewriting the above equation:

113b + 7a = 182 ...(1)

Substituting the values into the second normal equation:

ΣXY = aΣX + bΣX²

3186 = 113a + 1983b

113a + 1983b = 3186 ...(2)

Multiply equation (1) by 113 and equation (2) by 7:

113 × 113b + 113 × 7a = 113 × 182

12769b + 791a = 20566 ...(3)

7 × 113a + 7 × 1983b = 7 × 3186

791a + 13881b = 22302 ...(4)

Subtracting (4) from (3):

12769b + 791a = 20566

13881b + 791a = 22302 (−)
-------------------------------------
−1112b = −1736

Multiplying both sides by −1:

1112b = 1736

b = 1736/1112

b ≈ 1.5612

Substitute the value of b into (1) to find the value of a:

113b + 7a = 182

113 × 1.5612 + 7a ≈ 182

176.41 + 7a = 182

7a = 182 − 176.41

7a = 5.59

a = 5.59/7

a ≈ 0.7986

The equation of the straight line is:

Ye = a + bX

Substituting the values of a and b:

Ye = 0.7986 + 1.5612X

This is the regression equation of Y on X.
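
As a cross-check, here is the same fit in R (a quick sketch; lm() should reproduce a ≈ 0.7986 and b ≈ 1.5612):

> x <- c(10, 12, 13, 16, 17, 20, 25)
> y <- c(10, 22, 24, 27, 29, 33, 37)
> coef(lm(y ~ x))   # (Intercept) ~ 0.7986, x ~ 1.5612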

When I plot the graph of the given data set, it looks like this.

Regression 1.PNG

Using the regression equation, I predict Y for the missing values of X, such as 11, 14, 15, etc. Here is the complete table with actual and predicted values; predicted values (shown in a different colour in the original chart) are marked with an asterisk.

X Y
10 10
11 17.97*
12 22
13 24
14 22.65*
15 24.22*
16 27
17 29
18 28.90*
19 30.46*
20 33
21 33.58*
22 35.14*
23 36.71*
24 38.27*
25 37
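
The starred rows can be reproduced with predict(), continuing the x and y vectors from the sketch above:

> fit <- lm(y ~ x)
> predict(fit, newdata = data.frame(x = c(11, 14, 15, 18, 19, 21, 22, 23, 24)))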

Let's plot this table as a scatter chart now. Here you go!

Regression 2

Similarly, you may predict the value of X for a given Y.

In my next post, I'll write about how to compute regression using LibreOffice Calc. Bye until then!