Regression analysis in R

The t-test and ANOVA are two parametric statistical techniques used for hypothesis testing. When the population means of only two groups are to be compared, the t-test is used; when the means of more than two groups are to be compared, ANOVA is used.

Here are my previous posts about ANOVA

Both the t-test and ANOVA belong to the General Linear Model (GLM) family.

So we can compare groups with ANOVA, and we can have multiple IVs acting on one DV. However, ANOVA allows categorical IVs only, whereas regression allows both categorical and continuous IVs, in addition to multiple IVs. The DV should be continuous.

We need to check correlation before getting into regression. If there is no correlation, or only a poor one, let's not think about regression. I have written about correlation in the following posts.

While correlation shows the degree of relation (+ve or -ve), regression models the relationship itself, so we can estimate the DV based on changes to the IV.
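For instance, once the salary data set used below is loaded, a single cor() call gives that preliminary check (Pre..Exp and Salary are columns of that data set):

> cor(sal$Pre..Exp, sal$Salary)   # near +1 or -1: strong linear relation; near 0: skip regression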

Let’s take the same salary data used in my previous examples.

> setwd("d:/gandhari/videos/Advanced Business Analytics/")
> sal <-read.csv("sal.csv")
> head(sal)
  id gender      educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer
1  1 female        UG Jr Engineer   JLM  10000        5901.74           4098.26        3                      4
2  2   male DOCTORATE    Chairman   TLM 100000        4247.31          95752.69       20                      4
3  3   male   DIPLOMA       Jr HR   JLM   6000        3895.76           2104.24        1                      3
4  4   male        PG    Engineer   MLM  15000        9108.36           5891.64        7                      2
5  5 female        PG Sr Engineer   MLM  25000        4269.39          20730.61       12                      4
6  6   male   DIPLOMA Jr Engineer   JLM   6000        4137.31           1862.69        1                      1
> dim(sal)
[1] 50 10
> str(sal)
'data.frame':	50 obs. of  10 variables:
 $ id                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender                : Factor w/ 2 levels "female","male": 1 2 2 2 1 2 2 1 1 1 ...
 $ educ                  : Factor w/ 4 levels "DIPLOMA","DOCTORATE",..: 4 2 1 3 3 1 1 3 3 3 ...
 $ Designation           : Factor w/ 6 levels "Chairman","Engineer",..: 4 1 5 2 6 4 3 2 2 2 ...
 $ Level                 : Factor w/ 3 levels "JLM","MLM","TLM": 1 3 1 2 2 1 1 2 2 2 ...
 $ Salary                : int  10000 100000 6000 15000 25000 6000 8000 13000 14000 16000 ...
 $ Loan.deduction        : num  5902 4247 3896 9108 4269 ...
 $ Last.drawn.salary     : num  4098 95753 2104 5892 20731 ...
 $ Pre..Exp              : int  3 20 1 7 12 1 2 7 7 8 ...
 $ Ratings.by.interviewer: int  4 4 3 2 4 1 4 3 2 4 ...
> tail(sal, n=10)
   id gender      educ  Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer
41 41   male        PG     Engineer   MLM  15000        1741.33          13258.67        7                      2
42 42 female        PG  Sr Engineer   MLM  25000        2934.33          22065.67       12                      4
43 43   male   DIPLOMA  Jr Engineer   JLM   6000        2803.03           3196.97        1                      1
44 44   male   DIPLOMA Jr Associate   JLM   8000        5480.77           2519.23        2                      4
45 45 female        PG     Engineer   MLM  13000        1317.26          11682.74        7                      3
46 46 female        PG     Engineer   MLM  16000        9927.11           6072.89        8                      4
47 47 female        UG  Jr Engineer   JLM  10000        2507.66           7492.34        3                      4
48 48   male DOCTORATE     Chairman   TLM 100000        9684.88          90315.12       20                      4
49 49   male   DIPLOMA        Jr HR   JLM   6000        2717.26           3282.74        1                      3
50 50   male        PG     Engineer   MLM  15000        4512.12          10487.88        7                      2

Hope the preview of the data set I’ve given above makes sense.
To predict, we need two types of data: training data and testing data.

> salarytrain <-sal[1:35,]
> salarytest <- sal[36:50,]
> dim (salarytrain)
[1] 35 10
> dim (salarytest)
[1] 15 10
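A side note: the split above is sequential. If your rows are ordered in some way, a random split avoids putting all of one group into training; a minimal sketch of that alternative:

> set.seed(42)                     # fix the random seed so the split is reproducible
> idx <- sample(nrow(sal), 35)     # choose 35 random rows for training
> salarytrain <- sal[idx, ]
> salarytest  <- sal[-idx, ]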

Let’s run the regression now.

> salreg <- lm(Salary~educ, data=salarytrain)
> summary(salreg)

Call:
lm(formula = Salary ~ educ, data = salarytrain)

Residuals:
    Min      1Q  Median      3Q     Max
-4333.3 -2333.3  -727.3   636.4  7666.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)       6727       1128   5.962 1.37e-06 ***
educDOCTORATE    93273       2438  38.264  < 2e-16 ***
educPG           10606       1432   7.405 2.44e-08 ***
educUG            3273       2438   1.343    0.189
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3742 on 31 degrees of freedom
Multiple R-squared:  0.9803,	Adjusted R-squared:  0.9783
F-statistic: 513.1 on 3 and 31 DF,  p-value: < 2.2e-16

The formula for the prediction is given below.

  • Y = a + b1 * X1 + e
  • Y is the DV
  • X1 is the IV
  • a is the intercept or baseline or constant
  • b1 is the coefficient (slope) of X1
  • e is the error term

Let's substitute the values. R dummy-codes the educ factor against its baseline level (DIPLOMA), so each remaining coefficient is added only when that education level applies:

Predicted Salary = 6727 + 93273 × DOCTORATE + 10606 × PG + 3273 × UG

where DOCTORATE, PG and UG are 0/1 indicators.

The R-squared value is 0.9803, i.e. 98.03%. This is a very strong fit: education explains about 98% of the variance in salary; the remaining 2% is unexplained variance.

The intercept 6727 is the baseline: the expected salary for the reference education level, DIPLOMA (the first factor level).

A UG holder is predicted to get 6727 + 3273, a PG holder 6727 + 10606, and a doctorate holder 6727 + 93273; each coefficient is the difference from the DIPLOMA baseline.
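You can check these sums straight from the fitted model; coef() returns the named vector of estimates:

> b <- coef(salreg)                      # named vector: intercept and education dummies
> b["(Intercept)"] + b["educPG"]         # PG prediction: 6727 + 10606 = 17333
> b["(Intercept)"] + b["educDOCTORATE"]  # doctorate prediction: 6727 + 93273 = 100000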

We have considered only education in this example. The plus point of regression is that we can use more than one IV. In this case, I want to consider years of experience in addition to education. Then my command goes as below.

> salexp <- lm(Salary~educ + Pre..Exp, data=salarytrain)
> summary(salexp)

Call:
lm(formula = Salary ~ educ + Pre..Exp, data = salarytrain)

Residuals:
    Min      1Q  Median      3Q     Max
-886.44 -102.30   34.25   78.50 1113.56 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3705.6      162.5   22.80  < 2e-16 ***
educDOCTORATE  51977.1     1019.4   50.99  < 2e-16 ***
educPG         -5330.2      417.5  -12.77 1.17e-13 ***
educUG          -353.2      327.2   -1.08    0.289
Pre..Exp        2215.9       52.0   42.61  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 485 on 30 degrees of freedom
Multiple R-squared:  0.9997,	Adjusted R-squared:  0.9996
F-statistic: 2.336e+04 on 4 and 30 DF,  p-value: < 2.2e-16

This time, education and experience together explain 99.97% of the variance in salary. Looking at the estimates, UG does not make a significant difference (p = 0.289), but previous experience and DOCTORATE contribute strongly and push up our R-squared value. If the R-squared score is low, the fit is weak; do not use the model for prediction in that case, or search for the right IVs.
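Because salexp is simply salreg plus Pre..Exp, the two models are nested, and R's anova() can test formally whether adding experience improves the fit:

> anova(salreg, salexp)   # F-test comparing the nested models

A tiny p-value here would confirm what the jump in R-squared already suggests.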

Let's predict the salary now, using the education-only model salreg.

> salpred <- predict(salreg, salarytest)
> salpred
        36         37         38         39         40         41         42         43         44         45         46
 17333.333  17333.333  10000.000 100000.000   6727.273  17333.333  17333.333   6727.273   6727.273  17333.333  17333.333
        47         48         49         50
 10000.000 100000.000   6727.273  17333.333

So this is the prediction of salaries from rows 36 to 50.

Let's use cbind for a better view. What you see as Salary is the actual salary; what you see under salpred is the predicted salary. In some cases the prediction is close, in others it is far off. The difference between the actual salary (actual Y) and the predicted salary (predicted Y) is called the residual. The smaller the residuals, the better the prediction.

> cbind(salarytest, salpred)
   id gender      educ  Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer    salpred
36 36 female        PG     Engineer   MLM  14000         716.48          13283.52        7                      2  17333.333
37 37 female        PG     Engineer   MLM  16000        6595.95           9404.05        8                      4  17333.333
38 38 female        UG  Jr Engineer   JLM  10000        5433.07           4566.93        3                      4  10000.000
39 39   male DOCTORATE     Chairman   TLM 100000        9028.68          90971.32       20                      4 100000.000
40 40   male   DIPLOMA        Jr HR   JLM   6000         794.66           5205.34        1                      3   6727.273
41 41   male        PG     Engineer   MLM  15000        1741.33          13258.67        7                      2  17333.333
42 42 female        PG  Sr Engineer   MLM  25000        2934.33          22065.67       12                      4  17333.333
43 43   male   DIPLOMA  Jr Engineer   JLM   6000        2803.03           3196.97        1                      1   6727.273
44 44   male   DIPLOMA Jr Associate   JLM   8000        5480.77           2519.23        2                      4   6727.273
45 45 female        PG     Engineer   MLM  13000        1317.26          11682.74        7                      3  17333.333
46 46 female        PG     Engineer   MLM  16000        9927.11           6072.89        8                      4  17333.333
47 47 female        UG  Jr Engineer   JLM  10000        2507.66           7492.34        3                      4  10000.000
48 48   male DOCTORATE     Chairman   TLM 100000        9684.88          90315.12       20                      4 100000.000
49 49   male   DIPLOMA        Jr HR   JLM   6000        2717.26           3282.74        1                      3   6727.273
50 50   male        PG     Engineer   MLM  15000        4512.12          10487.88        7                      2  17333.333
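To put a figure on those residuals for the test rows, subtract the predictions from the actual salaries; a quick sketch:

> resid_test <- salarytest$Salary - salpred   # actual Y minus predicted Y (the residuals)
> mean(abs(resid_test))                       # mean absolute error as one summary number

The closer this number is to zero, the better the model predicts unseen data.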

See you in another interesting post.


Identifying the correlation coefficient using LibreOffice Calc

Hi,

I have written about correlation in my previous post Correlation and Pearson's correlation coefficient. I also worked out a small data set to identify the correlation coefficient. In this post, let's see how quickly we can generate the same using LibreOffice Calc (the FOSS equivalent of Microsoft Excel).

Here is the data set. We have two variables, x and y, with 7 samples each.

correlation 1

Select Data > Statistics > Correlation

correlation 2

Choose all the data we have (the values of x and y given above).

correlation 3

Use the Results to: field to decide where LibreOffice should write the output. I have chosen the cell D2.

correlation 4

Finally, the following are my inputs. Click OK.

correlation 5

Here is the output.

correlation 6

The correlation between Column 1 and Column 2 is 0.95.

What does this indicate? Check the picture given below. Our correlation coefficient is 0.95 on the scale of -1 to +1. So x and y are highly positively correlated.

pearson correlation coefficient interpretation (c) http://www.mathcaptain.com/statistics/pearson-correlation-coefficient.html

If this is true, the graphical representation should show a rising line, shouldn't it? Let's try.

correlation 7

Obviously!
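For the record, the same coefficient takes one line in R (using the x and y values from this data set, listed in the next post):

> x <- c(12, 9, 8, 10, 11, 13, 7)
> y <- c(14, 8, 6, 9, 11, 12, 3)
> cor(x, y)   # about 0.95, matching the Calc output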

Correlation and Pearson’s correlation coefficient

Correlation is a measure of degree to which two variables are related.

For instance, the Tamilnadu government opened more primary schools and offered a free noon-meal scheme in the past, which improved education quality. The same government has opened liquor shops all over Tamilnadu, which increased the number of accidents.

So we are talking about two variables in each case. In our first example, the education promotion activities taken by the government are one variable and education quality is another. The next example has the number of liquor shops as the first variable and the number of accidents as the second.

So, correlation is a linear association between two variables which helps us determine the relationship between them. The correlation coefficient lies in the range of -1.00 to +1.00, the sign indicating a +ve or -ve relationship.

It not only gives an estimate of the degree of association between two or more variables, but also helps us test the interdependence of the variables.

We can also use Spearman's coefficient ρ (rho), which is apt for continuous, discrete, and ordinal variables alike.
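In R, cor() covers both: Pearson is the default and Spearman is one argument away (x and y here stand for any two numeric vectors):

> cor(x, y)                        # Pearson's r, the default
> cor(x, y, method = "spearman")   # Spearman's rho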

Types of Correlation

  1. Positive Correlation
  2. Negative Correlation
  3. Simple Correlation
  4. Multiple Correlation
  5. Partial Correlation
  6. Total Correlation
  7. Linear Correlation
  8. Non-linear Correlation

Positive Correlation

The correlation depends on the direction of the variables: an increase in variable A causing an increase in variable B leads to a positive correlation.

Examples

  • Height and Weight
  • Demand and Price

Negative Correlation

The correlation depends on the direction of the variables: an increase in variable A causing a decrease in variable B leads to a negative correlation.

Examples

  • Number of files and free space in the hard drive
  • Price and competition

Simple & Multiple Correlation

We have already seen that correlation is a relation between two variables. This is simple correlation. Sometimes you may have more than two variables; this would be multiple correlation. For example: the number of students enrolled in a school, the number of similar schools in its vicinity, and the number of school-going children in the same area.

Partial & Total Correlation

Analyzing the correlation while excluding one or more variables is called partial correlation. We consider all variables in a total correlation.

Linear & Non-linear Correlation

If the ratio of change between two variables is uniform (directly or inversely proportional), we say it is a linear correlation. If not, it is non-linear.

Computing Coefficient of Correlation Manually

Let's take the following data set for analysis.

 x    y
12   14
 9    8
 8    6
10    9
11   11
13   12
 7    3

Karl Pearson's formula for the coefficient of correlation r is given as:

r = (N·Σxy − Σx·Σy) / √([N·Σx² − (Σx)²] × [N·Σy² − (Σy)²])

 x    y    x²    y²    xy
12   14   144   196   168
 9    8    81    64    72
 8    6    64    36    48
10    9   100    81    90
11   11   121   121   121
13   12   169   144   156
 7    3    49     9    21

Σx = 70, Σy = 63, Σx² = 728, Σy² = 651, Σxy = 676

N is the number of sample pairs: 14 values across 2 columns, so N = 14/2 = 7.

Pearson Equation for Correlation Coefficient

r = (N·Σxy − Σx·Σy) / √([N·Σx² − (Σx)²] × [N·Σy² − (Σy)²])

r = ((676 × 7) − 70 × 63) / √([728 × 7 − (70)²] × [651 × 7 − (63)²])

r = (4732 − 4410) / √([5096 − 4900] × [4557 − 3969])

r = 322 / 339.4819582835 = 0.948504013668671
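We can confirm this arithmetic in R by typing the formula in directly; the built-in cor() agrees:

> x <- c(12, 9, 8, 10, 11, 13, 7)
> y <- c(14, 8, 6, 9, 11, 12, 3)
> n <- length(x)
> (sum(x*y)*n - sum(x)*sum(y)) / sqrt((sum(x^2)*n - sum(x)^2) * (sum(y^2)*n - sum(y)^2))
[1] 0.948504
> cor(x, y)
[1] 0.948504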

pearson correlation coefficient interpretation (c) http://www.mathcaptain.com/statistics/pearson-correlation-coefficient.html

On the scale of -1 to +1, our coefficient is +0.95, which shows a strong +ve correlation.