T-test and ANOVA, are two parametric statistical techniques used to test the hypothesis. When the population means of only two groups is to be compared, the t-test is used, but when means of more than two groups are to be compared, ANOVA is used.
- One way analysis of variance
- Calculating Anova with LibreOffice Calc
- Analysis of Variance ANOVA using R
- Multiple ANOVA, Post hoc test using R
These T-Test and ANOVA belongs to General Linear Model (GLM) family.
So we can compare 2 groups with ANOVA. If we have more than 2 groups, we shall use Regression. We can have multiple IV on DV. ANOVA allows Categorical IV only. But regression allows both Categorical and Continuous data, in addition to multiple IV. DV should be continuous.
We need to check correlation before getting into regression. If we do not have regression or poor correlation, lets not think about regression. I have written about correlation in the following posts.
- Correlation and Pearson’s correlation coefficient
- Identifying the correlation coefficient using LibreOffice Calc
While correlation shows is degree of relation (+ve or -ve), regression shows us the correlation and sign of causation. So we are going to estimate DV based on changes to IV.
Let’s take the same salary data used in my previous examples.
> setwd("d:/gandhari/videos/Advanced Business Analytics/") > sal <-read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 5901.74 4098.26 3 4 2 2 male DOCTORATE Chairman TLM 100000 4247.31 95752.69 20 4 3 3 male DIPLOMA Jr HR JLM 6000 3895.76 2104.24 1 3 4 4 male PG Engineer MLM 15000 9108.36 5891.64 7 2 5 5 female PG Sr Engineer MLM 25000 4269.39 20730.61 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 4137.31 1862.69 1 1 > dim(sal) [1] 50 10 > str(sal) 'data.frame': 50 obs. of 10 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 ... $ gender : Factor w/ 2 levels "female","male": 1 2 2 2 1 2 2 1 1 1 ... $ educ : Factor w/ 4 levels "DIPLOMA","DOCTORATE",..: 4 2 1 3 3 1 1 3 3 3 ... $ Designation : Factor w/ 6 levels "Chairman","Engineer",..: 4 1 5 2 6 4 3 2 2 2 ... $ Level : Factor w/ 3 levels "JLM","MLM","TLM": 1 3 1 2 2 1 1 2 2 2 ... $ Salary : int 10000 100000 6000 15000 25000 6000 8000 13000 14000 16000 ... $ Loan.deduction : num 5902 4247 3896 9108 4269 ... $ Last.drawn.salary : num 4098 95753 2104 5892 20731 ... $ Pre..Exp : int 3 20 1 7 12 1 2 7 7 8 ... $ Ratings.by.interviewer: int 4 4 3 2 4 1 4 3 2 4 ... > tail (sal, n=10) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 41 41 male PG Engineer MLM 15000 1741.33 13258.67 7 2 42 42 female PG Sr Engineer MLM 25000 2934.33 22065.67 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 2803.03 3196.97 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 5480.77 2519.23 2 4 45 45 female PG Engineer MLM 13000 1317.26 11682.74 7 3 46 46 female PG Engineer MLM 16000 9927.11 6072.89 8 4 47 47 female UG Jr Engineer JLM 10000 2507.66 7492.34 3 4 48 48 male DOCTORATE Chairman TLM 100000 9684.88 90315.12 20 4 49 49 male DIPLOMA Jr HR JLM 6000 2717.26 3282.74 1 3 50 50 male PG Engineer MLM 15000 4512.12 10487.88 7 2
Hope the preview of the data set I’ve given above makes sense.
To predict, we should have two type of data – training data and testing data
> salarytrain <-sal[1:35,] > salarytest <- sal[36:50,] > dim (salarytrain) [1] 35 10 > dim (salarytest) [1] 15 10
Let’s run the regression now.
> salreg <- lm(Salary~educ, data=salarytrain) > summary(salreg) Call: lm(formula = Salary ~ educ, data = salarytrain) Residuals: Min 1Q Median 3Q Max -4333.3 -2333.3 -727.3 636.4 7666.7 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6727 1128 5.962 1.37e-06 *** educDOCTORATE 93273 2438 38.264 < 2e-16 *** educPG 10606 1432 7.405 2.44e-08 *** educUG 3273 2438 1.343 0.189 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3742 on 31 degrees of freedom Multiple R-squared: 0.9803, Adjusted R-squared: 0.9783 F-statistic: 513.1 on 3 and 31 DF, p-value: <; 2.2e-16
The formula for the prediction is given below.
- Y = a + b1 * X1 + c
- Y is DV
- X1 is IV
- a is intercept or baseline or constant
- b1 is error value.
Let’s substitute the values.
Predicted Y = 6727 + 1128 * Education + code
R square value is 0.9803, which is 98.03. This is a high level of correlation. 98% influence of explained variance between education and salary. Remaining 2% is unexplained variance.
Intercept 6727 is the baseline, which means, a person with no education may get 6727 salary.
when he gets 1st level of education, he will get 6727+1128.
when he gets 2nd level of education, he will get 6727+(2 x 1128) and so on.
We have considered only the education in this example. Plus point of regression is, we shall use more than one IV. In this case, I want to consider years of experience in addition to education. Then my command goes as below.
> salexp <- lm(Salary~educ + Pre..Exp, data=salarytrain) > summary(salexp) Call: lm(formula = Salary ~ educ + Pre..Exp, data = salarytrain) Residuals: Min 1Q Median 3Q Max -886.44 -102.30 34.25 78.50 1113.56 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3705.6 162.5 22.80 < 2e-16 *** educDOCTORATE 51977.1 1019.4 50.99 < 2e-16 *** educPG -5330.2 417.5 -12.77 1.17e-13 *** educUG -353.2 327.2 -1.08 0.289 Pre..Exp 2215.9 52.0 42.61 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 485 on 30 degrees of freedom Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996 F-statistic: 2.336e+04 on 4 and 30 DF, p-value: < 2.2e-16
This time I get 99.97% influence of education and experience in deciding someones salary. If you see the signs of estimate, Education UG or PG does not make a big difference. But previous experience and DOCTORATE surge our R square value. If R square score is low, your correlation is weak. Do not use prediction in this case or search for right IVs.
Let’s predict the salary now.
> salpred <- predict(salreg, salarytest) > salpred 36 37 38 39 40 41 42 43 44 45 46 17333.333 17333.333 10000.000 100000.000 6727.273 17333.333 17333.333 6727.273 6727.273 17333.333 17333.333 47 48 49 50 10000.000 100000.000 6727.273 17333.333
So this is the prediction of salaries from rows 36 to 50.
Let’s use cbind for better understanding. What you see as Salary is the actual salary. What you see under salpred is predicted salary. In some cases, the prediction is close, in some cases, it is far. So difference between actual salary (actual Y) and predicted salary (predicted Y) is called residual. Residual should be lower to have better prediction.
> cbind(salarytest, salpred) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer salpred 36 36 female PG Engineer MLM 14000 716.48 13283.52 7 2 17333.333 37 37 female PG Engineer MLM 16000 6595.95 9404.05 8 4 17333.333 38 38 female UG Jr Engineer JLM 10000 5433.07 4566.93 3 4 10000.000 39 39 male DOCTORATE Chairman TLM 100000 9028.68 90971.32 20 4 100000.000 40 40 male DIPLOMA Jr HR JLM 6000 794.66 5205.34 1 3 6727.273 41 41 male PG Engineer MLM 15000 1741.33 13258.67 7 2 17333.333 42 42 female PG Sr Engineer MLM 25000 2934.33 22065.67 12 4 17333.333 43 43 male DIPLOMA Jr Engineer JLM 6000 2803.03 3196.97 1 1 6727.273 44 44 male DIPLOMA Jr Associate JLM 8000 5480.77 2519.23 2 4 6727.273 45 45 female PG Engineer MLM 13000 1317.26 11682.74 7 3 17333.333 46 46 female PG Engineer MLM 16000 9927.11 6072.89 8 4 17333.333 47 47 female UG Jr Engineer JLM 10000 2507.66 7492.34 3 4 10000.000 48 48 male DOCTORATE Chairman TLM 100000 9684.88 90315.12 20 4 100000.000 49 49 male DIPLOMA Jr HR JLM 6000 2717.26 3282.74 1 3 6727.273 50 50 male PG Engineer MLM 15000 4512.12 10487.88 7 2 17333.333
See you in another interesting post.