My Spring web app, which was running smoothly, refused to start. I don’t find any logs in log4j or tomcat log. Only clue I had was –

No Spring WebApplicationInitializer types detected on classpath

Here is how I solved it –

- Stop the tomcat
- Clean and build all eclipse projects
- Goto server tab and select the tomcat server. Press clean. Press clean work directory.
- Right click on the tomcat and remove it.
- Delete the tomcat from the eclipse run times
- Add tomcat server to the eclipse servers again
- Start the application

Filed under: Errors, spring4, tomcat Tagged: eclipse ]]>

Criteria c = super.getSession().createCriteria(Ticket.class); c.add(Restrictions.not(Restrictions.like("status", "%CLOSED%")));

Filed under: core java, hibernate ]]>

org.hibernate.HibernateException: No Session found for current thread

Any such operations needs @Transactional annotations.

Filed under: Errors, hibernate Tagged: junit ]]>

After adding javax.el to pom, this is resolved.

<dependency> <groupId>javax.el</groupId> <artifactId>javax.el-api</artifactId> <version>2.2.4</version> </dependency> <dependency> <groupId>org.glassfish.web</groupId> <artifactId>javax.el</artifactId> <version>2.2.4</version> </dependency>

Filed under: core java, Errors, hibernate Tagged: EL ]]>

I have written about Regression – Predictive model in my earlier post Regression testing in R. Following posts are useful if you want to know what is regression.

Previous post talks about predicting unknown values using known values. This post would explain about how much change is observed between IV(s) and DV.

> setwd("D:/gandhari/videos/Advanced Business Analytics") > student_data <- read.csv("student_data.csv") > student_data id gender sup.help sup.under sup.appre adv.comp adv.access tut.prof tut.sched val.devel val.meet sat.glad sat.expe loy.proud loy.recom loy.pleas scholarships job 1 1 female 7 1 7 5 5 5 4 5 6 7 7 7 7 7 no no 2 2 male 7 1 7 6 6 6 6 6 7 7 7 7 7 7 no yes 3 3 female 6 1 7 6 6 6 6 6 7 7 6 7 7 7 no no 4 4 male 1 7 1 1 2 3 2 1 1 1 1 1 1 1 yes no 5 5 female 6 5 7 7 6 7 7 7 7 7 7 7 7 7 no yes 6 6 male 3 1 7 7 7 6 7 6 6 7 6 7 7 7 yes no 7 7 female 5 2 7 7 6 6 7 4 3 7 7 7 7 7 yes no 8 8 male 6 1 7 7 7 7 5 7 6 7 7 5 6 7 yes yes 9 9 female 7 1 7 6 6 5 5 5 5 7 6 6 7 7 no yes 10 10 male 2 4 7 7 6 6 6 4 2 5 4 4 7 7 no no > str(student_data) 'data.frame': 10 obs. of 18 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 $ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 $ sup.help : int 7 7 6 1 6 3 5 6 7 2 $ sup.under : int 1 1 1 7 5 1 2 1 1 4 $ sup.appre : int 7 7 7 1 7 7 7 7 7 7 $ adv.comp : int 5 6 6 1 7 7 7 7 6 7 $ adv.access : int 5 6 6 2 6 7 6 7 6 6 $ tut.prof : int 5 6 6 3 7 6 6 7 5 6 $ tut.sched : int 4 6 6 2 7 7 7 5 5 6 $ val.devel : int 5 6 6 1 7 6 4 7 5 4 $ val.meet : int 6 7 7 1 7 6 3 6 5 2 $ sat.glad : int 7 7 7 1 7 7 7 7 7 5 $ sat.expe : int 7 7 6 1 7 6 7 7 6 4 $ loy.proud : int 7 7 7 1 7 7 7 5 6 4 $ loy.recom : int 7 7 7 1 7 7 7 6 7 7 $ loy.pleas : int 7 7 7 1 7 7 7 7 7 7 $ scholarships: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 2 1 1 $ job : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 1<span data-mce-type="bookmark" id="mce_SELREST_start" data-mce-style="overflow:hidden;line-height:0" style="overflow:hidden;line-height:0" ></span>

Sometimes, the dataset is not completely visible in wordpress. Hence I’m giving it as an image below.

support, advice, satisfaction and loyalty has multiple variables in the above data set as sup.help, sup.under etc.

Let’s make it as a single variable (mean) for easy analysis.

> #get sing score for support advice satisfaction loyalty > student_data$support <- apply(student_data[,3:5],1,mean) > summary (student_data$support) Min. 1st Qu. Median Mean 3rd Qu. Max. 3.000 4.417 4.667 4.600 5.000 6.000 > student_data$value <- rowMeans(student_data[,10:11]) > student_data$sat <- rowMeans(student_data[,12:13]) > student_data$loy <- rowMeans(student_data[,14:16])

So we found the mean using apply() and rowMeans(). Those mean values are appended to our original data set student_data. Now, let’s take only 4 variables – gender and the 3 new variables value, sat and loy in a new data set for analysis.

> student_data_min <- student_data[,c(2, 20:22)] > head(student_data_min) gender value sat loy 1 female 5.5 7.0 7 2 male 6.5 7.0 7 3 female 6.5 6.5 7 4 male 1.0 1.0 1 5 female 7.0 7.0 7 6 male 6.0 6.5 7

Looks simple and great, isn’t it?

- If value for money is good, satisfaction score would be high.
- If the customer is satisfied, he would be loyal to the organization.

So Loy is our dependent variable DV. sat and value are our independent variables IV. I’m using regression to know how gender influences loyalty.

> #DV - loy > #IV - sat, value > loyalty_gender_reln <- lm(loy~gender, data=student_data_min) > summary (loyalty_gender_reln) Call: lm(formula = loy ~ gender, data = student_data_min) Residuals: Min 1Q Median 3Q Max -4.4000 0.0667 0.0667 0.6000 1.6000 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.9333 0.7951 8.720 2.34e-05 *** gendermale -1.5333 1.1245 -1.364 0.21 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.778 on 8 degrees of freedom Multiple R-squared: 0.1886, Adjusted R-squared: 0.08717 F-statistic: 1.859 on 1 and 8 DF, p-value: 0.2098 > #R2 is 18%, which says weak relation. So gender does not influence the loyalty.

R-squared value is 0.1886, which is 18.86%, which shows very weak correlation. Hence I’d decide gender doesn’t influence loyalty.

Here is the influence of value for money on loyalty.

> loyalty_value_reln <- lm(loy~value, data = student_data_min) > summary(loyalty_value_reln) Call: lm(formula = loy ~ value, data = student_data_min) Residuals: Min 1Q Median 3Q Max -2.2182 -0.4953 -0.0403 0.5287 1.9618 Coefficients: Estimate Std. Error t value Pr(<|t|) (Intercept) 2.4901 1.1731 2.123 0.0665 . value 0.7280 0.2181 3.338 0.0103 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.276 on 8 degrees of freedom Multiple R-squared: 0.582, Adjusted R-squared: 0.5298 F-statistic: 11.14 on 1 and 8 DF, p-value: 0.01027 > #58%

Value for money has 58.2% influence on loyalty. Following is the influence of satisfaction against loyalty.

> loyalty_sat_reln <- lm (loy~sat, data = student_data_min) > summary(loyalty_sat_reln) Call: lm(formula = loy ~ sat, data = student_data_min) Residuals: Min 1Q Median 3Q Max -1.08586 -0.08586 -0.08586 0.29040 1.21212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.6515 0.6992 0.932 0.379 sat 0.9192 0.1115 8.241 3.53e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6408 on 8 degrees of freedom Multiple R-squared: 0.8946, Adjusted R-squared: 0.8814 F-statistic: 67.91 on 1 and 8 DF, p-value: 3.525e-05 > #89%

Wah, 89.46%. So to keep up our customers, satisfaction should be high. This is the message we read. I wish my beloved Air India should read this post.

We are combining everything below.

> loyalty_everything <- lm(loy~., data = student_data_min) > summary(loyalty_everything) Call: lm(formula = loy ~ ., data = student_data_min) Residuals: Min 1Q Median 3Q Max -1.01381 -0.28807 -0.01515 0.33286 1.13931 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.66470 1.03039 0.645 0.54273 gendermale -0.01796 0.53076 -0.034 0.97411 value -0.10252 0.23777 -0.431 0.68141 sat 1.00478 0.26160 3.841 0.00855 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7273 on 6 degrees of freedom Multiple R-squared: 0.8982, Adjusted R-squared: 0.8472 F-statistic: 17.64 on 3 and 6 DF, p-value: 0.00222

Really, I don’t know how to read the above value at the moment. I’d update this post (if I don’t forget!)

To collate the results and show in a consolidated format, we use screenreg() of rexreg package.

> install.packages("texreg") Installing package into ‘D:/gandhari/documents/R/win-library/3.4’ (as ‘lib’ is unspecified) trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/texreg_1.36.23.zip' Content type 'application/zip' length 651831 bytes (636 KB) downloaded 636 KB package ‘texreg’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\pandian\AppData\Local\Temp\Rtmp085gnT\downloaded_packages > library("texreg") Version: 1.36.23 Date: 2017-03-03 Author: Philip Leifeld (University of Glasgow) Please cite the JSS article in your publications -- see citation("texreg"). > library(texreg) > screenreg(list(loyalty_gender_reln, loyalty_value_reln, loyalty_sat_reln, loyalty_everything)) ==================================================== Model 1 Model 2 Model 3 Model 4 ---------------------------------------------------- (Intercept) 6.93 *** 2.49 0.65 0.66 (0.80) (1.17) (0.70) (1.03) gendermale -1.53 -0.02 (1.12) (0.53) value 0.73 * -0.10 (0.22) (0.24) sat 0.92 *** 1.00 ** (0.11) (0.26) ---------------------------------------------------- R^2 0.19 0.58 0.89 0.90 Adj. R^2 0.09 0.53 0.88 0.85 Num. obs. 10 10 10 10 RMSE 1.78 1.28 0.64 0.73 ==================================================== *** p < 0.001, ** p < 0.01, * p < 0.05

So this linear regression post explains the relation between the variables.

See you in another post with an interesting topic.

Filed under: Business Analytics Tagged: Alternate Hypothesis, Hypothesis testing, Null Hypothesis, R language, R Studio, Regression, Regression Equation, Statistics ]]>

Here are my previous posts about ANOVA

- One way analysis of variance
- Calculating Anova with LibreOffice Calc
- Analysis of Variance ANOVA using R
- Multiple ANOVA, Post hoc test using R

These T-Test and ANOVA belongs to General Linear Model (GLM) family.

So we can compare 2 groups with ANOVA. If we have more than 2 groups, we shall use Regression. We can have multiple IV on DV. ANOVA allows Categorical IV only. But regression allows both Categorical and Continuous data, in addition to multiple IV. DV should be continuous.

We need to check correlation before getting into regression. If we do not have regression or poor correlation, lets not think about regression. I have written about correlation in the following posts.

- Correlation and Pearson’s correlation coefficient
- Identifying the correlation coefficient using LibreOffice Calc

While correlation shows is degree of relation (+ve or -ve), regression shows us the correlation and sign of causation. So we are going to estimate DV based on changes to IV.

Let’s take the same salary data used in my previous examples.

> setwd("d:/gandhari/videos/Advanced Business Analytics/") > sal <-read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 5901.74 4098.26 3 4 2 2 male DOCTORATE Chairman TLM 100000 4247.31 95752.69 20 4 3 3 male DIPLOMA Jr HR JLM 6000 3895.76 2104.24 1 3 4 4 male PG Engineer MLM 15000 9108.36 5891.64 7 2 5 5 female PG Sr Engineer MLM 25000 4269.39 20730.61 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 4137.31 1862.69 1 1 > dim(sal) [1] 50 10 > str(sal) 'data.frame': 50 obs. of 10 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 ... $ gender : Factor w/ 2 levels "female","male": 1 2 2 2 1 2 2 1 1 1 ... $ educ : Factor w/ 4 levels "DIPLOMA","DOCTORATE",..: 4 2 1 3 3 1 1 3 3 3 ... $ Designation : Factor w/ 6 levels "Chairman","Engineer",..: 4 1 5 2 6 4 3 2 2 2 ... $ Level : Factor w/ 3 levels "JLM","MLM","TLM": 1 3 1 2 2 1 1 2 2 2 ... $ Salary : int 10000 100000 6000 15000 25000 6000 8000 13000 14000 16000 ... $ Loan.deduction : num 5902 4247 3896 9108 4269 ... $ Last.drawn.salary : num 4098 95753 2104 5892 20731 ... $ Pre..Exp : int 3 20 1 7 12 1 2 7 7 8 ... $ Ratings.by.interviewer: int 4 4 3 2 4 1 4 3 2 4 ... > tail (sal, n=10) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 41 41 male PG Engineer MLM 15000 1741.33 13258.67 7 2 42 42 female PG Sr Engineer MLM 25000 2934.33 22065.67 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 2803.03 3196.97 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 5480.77 2519.23 2 4 45 45 female PG Engineer MLM 13000 1317.26 11682.74 7 3 46 46 female PG Engineer MLM 16000 9927.11 6072.89 8 4 47 47 female UG Jr Engineer JLM 10000 2507.66 7492.34 3 4 48 48 male DOCTORATE Chairman TLM 100000 9684.88 90315.12 20 4 49 49 male DIPLOMA Jr HR JLM 6000 2717.26 3282.74 1 3 50 50 male PG Engineer MLM 15000 4512.12 10487.88 7 2

Hope the preview of the data set I’ve given above makes sense.

To predict, we should have two type of data – training data and testing data

> salarytrain <-sal[1:35,] > salarytest <- sal[36:50,] > dim (salarytrain) [1] 35 10 > dim (salarytest) [1] 15 10

Let’s run the regression now.

> salreg <- lm(Salary~educ, data=salarytrain) > summary(salreg) Call: lm(formula = Salary ~ educ, data = salarytrain) Residuals: Min 1Q Median 3Q Max -4333.3 -2333.3 -727.3 636.4 7666.7 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6727 1128 5.962 1.37e-06 *** educDOCTORATE 93273 2438 38.264 < 2e-16 *** educPG 10606 1432 7.405 2.44e-08 *** educUG 3273 2438 1.343 0.189 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3742 on 31 degrees of freedom Multiple R-squared: 0.9803, Adjusted R-squared: 0.9783 F-statistic: 513.1 on 3 and 31 DF, p-value: <; 2.2e-16

The formula for the prediction is given below.

- Y = a + b1 * X1 + c
- Y is DV
- X1 is IV
- a is intercept or baseline or constant
- b1 is error value.

Let’s substitute the values.

Predicted Y = 6727 + 1128 * Education + code

R square value is 0.9803, which is 98.03. This is a high level of correlation. 98% influence of explained variance between education and salary. Remaining 2% is unexplained variance.

Intercept 6727 is the baseline, which means, a person with no education may get 6727 salary.

when he gets 1st level of education, he will get 6727+1128.

when he gets 2nd level of education, he will get 6727+(2 x 1128) and so on.

We have considered only the education in this example. Plus point of regression is, we shall use more than one IV. In this case, I want to consider years of experience in addition to education. Then my command goes as below.

> salexp <- lm(Salary~educ + Pre..Exp, data=salarytrain) > summary(salexp) Call: lm(formula = Salary ~ educ + Pre..Exp, data = salarytrain) Residuals: Min 1Q Median 3Q Max -886.44 -102.30 34.25 78.50 1113.56 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3705.6 162.5 22.80 < 2e-16 *** educDOCTORATE 51977.1 1019.4 50.99 < 2e-16 *** educPG -5330.2 417.5 -12.77 1.17e-13 *** educUG -353.2 327.2 -1.08 0.289 Pre..Exp 2215.9 52.0 42.61 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 485 on 30 degrees of freedom Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996 F-statistic: 2.336e+04 on 4 and 30 DF, p-value: < 2.2e-16

This time I get 99.97% influence of education and experience in deciding someones salary. If you see the signs of estimate, Education UG or PG does not make a big difference. But previous experience and DOCTORATE surge our R square value. If R square score is low, your correlation is weak. Do not use prediction in this case or search for right IVs.

Let’s predict the salary now.

> salpred <- predict(salreg, salarytest) > salpred 36 37 38 39 40 41 42 43 44 45 46 17333.333 17333.333 10000.000 100000.000 6727.273 17333.333 17333.333 6727.273 6727.273 17333.333 17333.333 47 48 49 50 10000.000 100000.000 6727.273 17333.333

So this is the prediction of salaries from rows 36 to 50.

Let’s use cbind for better understanding. What you see as Salary is the actual salary. What you see under salpred is predicted salary. In some cases, the prediction is close, in some cases, it is far. So difference between actual salary (actual Y) and predicted salary (predicted Y) is called residual. Residual should be lower to have better prediction.

> cbind(salarytest, salpred) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer salpred 36 36 female PG Engineer MLM 14000 716.48 13283.52 7 2 17333.333 37 37 female PG Engineer MLM 16000 6595.95 9404.05 8 4 17333.333 38 38 female UG Jr Engineer JLM 10000 5433.07 4566.93 3 4 10000.000 39 39 male DOCTORATE Chairman TLM 100000 9028.68 90971.32 20 4 100000.000 40 40 male DIPLOMA Jr HR JLM 6000 794.66 5205.34 1 3 6727.273 41 41 male PG Engineer MLM 15000 1741.33 13258.67 7 2 17333.333 42 42 female PG Sr Engineer MLM 25000 2934.33 22065.67 12 4 17333.333 43 43 male DIPLOMA Jr Engineer JLM 6000 2803.03 3196.97 1 1 6727.273 44 44 male DIPLOMA Jr Associate JLM 8000 5480.77 2519.23 2 4 6727.273 45 45 female PG Engineer MLM 13000 1317.26 11682.74 7 3 17333.333 46 46 female PG Engineer MLM 16000 9927.11 6072.89 8 4 17333.333 47 47 female UG Jr Engineer JLM 10000 2507.66 7492.34 3 4 10000.000 48 48 male DOCTORATE Chairman TLM 100000 9684.88 90315.12 20 4 100000.000 49 49 male DIPLOMA Jr HR JLM 6000 2717.26 3282.74 1 3 6727.273 50 50 male PG Engineer MLM 15000 4512.12 10487.88 7 2 17333.333

See you in another interesting post.

Filed under: Business Analytics Tagged: Alternate Hypothesis, Anova, Correlation, Correlation Coefficient, Hypothesis testing, Null Hypothesis, One-way Anova, Regression, Regression Equation ]]>

For ease of (my!) understanding, I would take the same data set in this post as well. So here is the same data set.

> sal id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 8 female PG Engineer MLM 13000 13000 7 3 9 9 female PG Engineer MLM 14000 14000 7 2 10 10 female PG Engineer MLM 16000 16000 8 4 11 11 female UG Jr Engineer JLM 10000 1000 3 4 12 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 14 male PG Engineer MLM 15000 15000 7 2 15 15 female PG Sr Engineer MLM 25000 25000 12 4 16 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 18 female PG Engineer MLM 13000 13000 7 3 19 19 female PG Engineer MLM 14000 14000 7 2 20 20 female PG Engineer MLM 16000 16000 8 4 21 21 female PG Sr Engineer MLM 25000 25000 12 4 22 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 24 female PG Engineer MLM 13000 13000 7 3 25 25 female PG Engineer MLM 14000 14000 7 2 26 26 female PG Engineer MLM 16000 16000 8 4 27 27 female UG Jr Engineer JLM 10000 1000 3 4 28 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 30 male PG Engineer MLM 15000 15000 7 2 31 31 female PG Sr Engineer MLM 25000 25000 12 4 32 32 female PG Sr Engineer MLM 25000 25000 12 4 33 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 35 female PG Engineer MLM 13000 13000 7 3 36 36 female PG Engineer MLM 14000 14000 7 2 37 37 female PG Engineer MLM 16000 16000 8 4 38 38 female UG Jr Engineer JLM 10000 1000 3 4 39 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 41 male PG Engineer MLM 15000 15000 7 2 42 42 female PG Sr Engineer MLM 25000 25000 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 45 female PG Engineer MLM 13000 13000 7 3 46 46 female PG Engineer MLM 16000 16000 8 4 47 47 female UG Jr Engineer JLM 10000 1000 3 4 48 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 50 male PG Engineer MLM 15000 15000 7 2

We have already executed ANOVA test. Following is the output.

> aov1 <-aov(Salary~educ, data=sal) > aov1 Call: aov(formula = Salary ~ educ, data = sal) Terms: educ Residuals Sum of Squares 35270186667 538293333 Deg. of Freedom 3 46 Residual standard error: 3420.823 Estimated effects may be unbalanced > summary(aov1) Df Sum Sq Mean Sq F value Pr(>F) educ 3 3.527e+10 1.176e+10 1005 <2e-16 *** Residuals 46 5.383e+08 1.170e+07 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

What we get above is the overall significant difference between DV (salary) and IV (Education)

To compare the difference between the group, we use Post hoc test. We shall use TukeyHSD() for this. We need to go for this approach, only if the anova is significant. If anova is not significant, there is no need for posthoc.

We’d see how to run get the variances across the groups in this post.

> tukey <- TukeyHSD(aov1) > tukey Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = Salary ~ educ, data = sal) $educ diff lwr upr p adj DOCTORATE-DIPLOMA 93333.333 88624.720 98041.947 0.0000000 PG-DIPLOMA 10373.333 7395.345 13351.322 0.0000000 UG-DIPLOMA 3333.333 -1375.280 8041.947 0.2477298 PG-DOCTORATE -82960.000 -87426.983 -78493.017 0.0000000 UG-DOCTORATE -90000.000 -95766.850 -84233.150 0.0000000 UG-PG -7040.000 -11506.983 -2573.017 0.0006777

- diff – mean difference between education level
- lwr – lower mean
- upr – upper mean

If signs between lwr and upr are same, irrelevant of + or -, that denotes significant difference.

When you compare diploma (lower degree) with doctorate (higher degree), the difference would be +ve and vice versa. If you just want to see the difference, + or – is not significant.

Let’s plot this in a graph.

> plot(tukey)

0 is the mid point. So, anything near 0 do not have significant difference.

From the top, first plot is for the comparison between DOCTORATE-DIPLOMA. You would see a high positive difference. If you see the plot for UG-DOCTORATE, is it second highest difference, but this is negative difference. Anything near 0 like UG-DIPLOMA, does not have significant difference.

We received a new data set from company now, which has a new column Loan.deducation. Last.drawn.salary changes with respect to his loans.

> sal <- read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Loan.deduction Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 5901.74 4098.26 3 4 2 2 male DOCTORATE Chairman TLM 100000 4247.31 95752.69 20 4 3 3 male DIPLOMA Jr HR JLM 6000 3895.76 2104.24 1 3 4 4 male PG Engineer MLM 15000 9108.36 5891.64 7 2 5 5 female PG Sr Engineer MLM 25000 4269.39 20730.61 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 4137.31 1862.69 1 1

Company wants to see the differences among Salary (column 6), Loan.deduction (column 7) and Last.drawn.salary (column 8). We combine apply and anova as given below.

> aovset <- apply(sal[,6:8], 2, function(x)aov(x~educ, data = sal))

- sal[,6:8] takes all rows of columns 6, 7 and 8
- aov is our function

Following is the variance between education and Last.drawn.salary.

> summary(aovset$Last.drawn.salary) Df Sum Sq Mean Sq F value Pr(>F) educ 3 3.342e+10 1.114e+10 674.2 <2e-16 *** Residuals 46 7.602e+08 1.653e+07 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

F value is 674, which means, the change is significant. Following would be more interesting.

> summary(aovset$Loan.deduction) Df Sum Sq Mean Sq F value Pr(>F) educ 3 25577616 8525872 1.14 0.343 Residuals 46 343898395 7476052

F value for Loan.deduction is lesser than 4. So, there is no change in the deductions between different education level.

See you in another interesting post.

Filed under: Business Analytics Tagged: Alternate Hypothesis, Anova, Hypothesis testing, Null Hypothesis, One-way Anova, Post hoc test, Tukey test ]]>

- Testing of Independence – Chi Square test – Manual, LibreOffice, R
- Testing of difference – T test using R

This post discuss about how to calculate ANOVA using R. This post will not talk about anova, as I have written this in One way analysis of variance and Calculating Anova with LibreOffice Calc

So this is also similar to t test. The difference exists in in the type of variables used. As given in Testing of Independence – Chi Square test – Manual, LibreOffice, R blog post, T test would be used when two variables are categorical. In addition, T test uses only two groups.

We can use categorical (IV) and continuous DV variables in Anova. We may use more than 2 variables.

An increase in number of schools, leads to increase in educational expense of the Government. We call this as reasonable change.

An increase in number of schools, leads to increase school fees of each family. This is unfortunate, right. So this is unreasonable change.

So reasonable change is a change in independent variable has a change in dependent variable.

An unreasonable change is a change in dependent variable with no change in independent variable.

We would care about f ratio here.

f ratio = ratio between explained and unexplained variable.

f ratio = explained variance/unexplained variance

If f > 4, the variables have significant difference. Before making a decision, refer to p value.

Let me take the same data set used in my previous posts.

> sal id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 8 female PG Engineer MLM 13000 13000 7 3 9 9 female PG Engineer MLM 14000 14000 7 2 10 10 female PG Engineer MLM 16000 16000 8 4 11 11 female UG Jr Engineer JLM 10000 1000 3 4 12 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 14 male PG Engineer MLM 15000 15000 7 2 15 15 female PG Sr Engineer MLM 25000 25000 12 4 16 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 18 female PG Engineer MLM 13000 13000 7 3 19 19 female PG Engineer MLM 14000 14000 7 2 20 20 female PG Engineer MLM 16000 16000 8 4 21 21 female PG Sr Engineer MLM 25000 25000 12 4 22 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 24 female PG Engineer MLM 13000 13000 7 3 25 25 female PG Engineer MLM 14000 14000 7 2 26 26 female PG Engineer MLM 16000 16000 8 4 27 27 female UG Jr Engineer JLM 10000 1000 3 4 28 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 30 male PG Engineer MLM 15000 15000 7 2 31 31 female PG Sr Engineer MLM 25000 25000 12 4 32 32 female PG Sr Engineer MLM 25000 25000 12 4 33 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 35 female PG Engineer MLM 13000 13000 7 3 36 36 female PG Engineer MLM 14000 14000 7 2 37 37 female PG Engineer MLM 16000 16000 8 4 38 38 female UG Jr Engineer JLM 10000 1000 3 4 39 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 41 male PG Engineer MLM 15000 15000 7 2 42 42 female PG Sr Engineer MLM 25000 25000 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 45 female PG Engineer MLM 13000 13000 7 3 46 46 female PG Engineer MLM 16000 16000 8 4 47 47 female UG Jr Engineer JLM 10000 1000 3 4 48 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 50 male PG Engineer MLM 15000 15000 7 2

Lets take education qualification and salary drawn. Usually higher the educational qualification, higher the salary.

> aggregate(Salary~educ, mean, data=sal) educ Salary 1 DIPLOMA 6666.667 2 DOCTORATE 100000.000 3 PG 17040.000 4 UG 10000.000

Obviously there is change. When diploma holder gets 6666 rupees, a doctorate gets 100000. Let’s check the analysis of variance aov() now.

> aov1 <- aov(Salary~educ, data=sal) > aov1 Call: aov(formula = Salary ~ educ, data = sal) Terms: educ Residuals Sum of Squares 35270186667 538293333 Deg. of Freedom 3 46 Residual standard error: 3420.823 Estimated effects may be unbalanced

Let’s look at the mean salary and anova summary below

> aggregate(Salary~educ, mean, data=sal) educ Salary 1 DIPLOMA 6666.667 2 DOCTORATE 100000.000 3 PG 17040.000 4 UG 10000.000

> summary(aov1) Df Sum Sq Mean Sq F value Pr(>F) educ 3 3.527e+10 1.176e+10 1005 <2e-16 *** Residuals 46 5.383e+08 1.170e+07 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

F value is **1005**. Our threshold is 4. We are getting very high value.

p value *** is <0.001. It is lesser than standard alpha 5% which is 0.05.

So change is salary is significant with the change in education. This data has significant difference.

The first row educ indicates explained variance. The second row, residuals, indicates unexplained variance.

The *** indicates that our model is not only significant for 0.05, but it is significant even at 0.001. Out of 1000, we may reject 1 time.

Filed under: Business Analytics ]]>

I have written about a hypothesis testing of independence in my previous post Testing of Independence – Chi Square test – Manual, LibreOffice, R. This post talks about testing of mean difference.

Lets take the same salary data set used in my previous post.

> sal id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 8 female PG Engineer MLM 13000 13000 7 3 9 9 female PG Engineer MLM 14000 14000 7 2 10 10 female PG Engineer MLM 16000 16000 8 4 11 11 female UG Jr Engineer JLM 10000 1000 3 4 12 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 14 male PG Engineer MLM 15000 15000 7 2 15 15 female PG Sr Engineer MLM 25000 25000 12 4 16 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 18 female PG Engineer MLM 13000 13000 7 3 19 19 female PG Engineer MLM 14000 14000 7 2 20 20 female PG Engineer MLM 16000 16000 8 4 21 21 female PG Sr Engineer MLM 25000 25000 12 4 22 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 24 female PG Engineer MLM 13000 13000 7 3 25 25 female PG Engineer MLM 14000 14000 7 2 26 26 female PG Engineer MLM 16000 16000 8 4 27 27 female UG Jr Engineer JLM 10000 1000 3 4 28 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 30 male PG Engineer MLM 15000 15000 7 2 31 31 female PG Sr Engineer MLM 25000 25000 12 4 32 32 female PG Sr Engineer MLM 25000 25000 12 4 33 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 35 female PG Engineer MLM 13000 13000 7 3 36 36 female PG Engineer MLM 14000 14000 7 2 37 37 female PG Engineer MLM 16000 16000 8 4 38 38 female UG Jr Engineer JLM 10000 1000 3 4 39 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 41 male PG Engineer MLM 15000 15000 7 2 42 42 female PG Sr Engineer MLM 25000 25000 12 4 43 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 45 female PG Engineer MLM 13000 13000 7 3 46 46 female PG Engineer MLM 16000 16000 8 4 47 47 female UG Jr Engineer JLM 10000 1000 3 4 48 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 50 male PG Engineer MLM 15000 15000 7 2

Let’s find out the difference of mean salary between male and female.

> aggregate(Salary~gender, mean, data=sal) gender Salary 1 female 16040 2 male 27000

Mean salary of female **μ _{0}** = 16040

The symbol ~ differentiates between dependent and independent variables

Obviously there is a difference. Let’s see what a t-test in R shows us.

> t.test(Salary~gender, data = sal) Welch Two Sample t-test data: Salary by gender t = -1.4494, df = 25.039, p-value = 0.1596 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -26532.252 4612.252 sample estimates: mean in group female mean in group male 16040 27000

*How to interpret this result?*

The p-value is compared with the desired significance level of our test and, if it is smaller, the result is significant. That is, if the null hypothesis were to be rejected at the 5% significance level, this would be reported as “p < 0.05″. Small p-values suggest that the null hypothesis is unlikely to be true. But our p-value 0.15 > 0.05. Hence null hypothesis is rejected and alternate hypothesis is accepted. There is a difference in salary between both genders.

Higher the t value (ignore the sign), higher the difference.

Filed under: Business Analytics Tagged: Alternate Hypothesis, Hypothesis testing, Null Hypothesis, R language, R Studio, Statistics, T test ]]>

I have written about testing of hypothesis in my earlier posts

- One way analysis of variance
- Calculating Anova with LibreOffice Calc
- Correlation and Pearson’s correlation coefficient
- Identifying the correlation coefficient using LibreOffice Calc

Statisticians recommended right testing approaches for different type of data.

When we have –

- both data as categorical, we shall use Chi Square Test
- Continuous and Continuous data, we shall use correlation
- Categorical and Continuous data, we shall use t test or anova.

In this post, I’d be using the below given data set.

id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 female UG Jr Engineer JLM 10000 1000 3 4 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 male PG Engineer MLM 15000 15000 7 2 5 female PG Sr Engineer MLM 25000 25000 12 4 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 7 male DIPLOMA Jr Associate JLM 8000 8000 2 4 8 female PG Engineer MLM 13000 13000 7 3 9 female PG Engineer MLM 14000 14000 7 2 10 female PG Engineer MLM 16000 16000 8 4 11 female UG Jr Engineer JLM 10000 1000 3 4 12 male DOCTORATE Chairman TLM 100000 100000 20 4 13 male DIPLOMA Jr HR JLM 6000 6000 1 3 14 male PG Engineer MLM 15000 15000 7 2 15 female PG Sr Engineer MLM 25000 25000 12 4 16 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 17 male DIPLOMA Jr Associate JLM 8000 8000 2 4 18 female PG Engineer MLM 13000 13000 7 3 19 female PG Engineer MLM 14000 14000 7 2 20 female PG Engineer MLM 16000 16000 8 4 21 female PG Sr Engineer MLM 25000 25000 12 4 22 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 23 male DIPLOMA Jr Associate JLM 8000 8000 2 4 24 female PG Engineer MLM 13000 13000 7 3 25 female PG Engineer MLM 14000 14000 7 2 26 female PG Engineer MLM 16000 16000 8 4 27 female UG Jr Engineer JLM 10000 1000 3 4 28 male DOCTORATE Chairman TLM 100000 100000 20 4 29 male DIPLOMA Jr HR JLM 6000 6000 1 3 30 male PG Engineer MLM 15000 15000 7 2 31 female PG Sr Engineer MLM 25000 25000 12 4 32 female PG Sr Engineer MLM 25000 25000 12 4 33 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 34 male DIPLOMA Jr Associate JLM 8000 8000 2 4 35 female PG Engineer MLM 13000 13000 7 3 36 female PG Engineer MLM 14000 14000 7 2 37 female PG Engineer MLM 16000 16000 8 4 38 female UG Jr Engineer JLM 10000 1000 3 4 39 male DOCTORATE Chairman TLM 100000 100000 20 4 40 male DIPLOMA Jr HR JLM 6000 6000 1 3 41 male PG Engineer MLM 15000 15000 7 2 42 female PG Sr Engineer MLM 25000 25000 12 4 43 male DIPLOMA Jr Engineer JLM 6000 8000 1 1 44 male DIPLOMA Jr Associate JLM 8000 8000 2 4 45 female PG Engineer MLM 13000 13000 7 3 46 female PG Engineer MLM 16000 16000 8 4 47 female UG Jr Engineer JLM 10000 1000 3 4 48 male DOCTORATE Chairman TLM 100000 100000 20 4 49 male DIPLOMA Jr HR JLM 6000 6000 1 3 50 male PG Engineer MLM 15000 15000 7 2

We shall use chi square test for two types of hypothesis testing

- test of independence of variables
- test goodness of fit

We can find out the association between two (at least) categorical variables. Higher the chi square value, better the result is. We shall use this to test our hypothesis.

When we use chi square test to find the goodness of fit, we shall use 2 categorical variables. higher the chi square value, better the result is. We shall use this to test BLR, SEM tests.

This post talks about testing of independence. We have employee data given above. Following are my hypothesis.

H0 = Number of female employees and level of management are not related.

H1 = Number of female employees and level of management are related.

We would solve this using three methods

- Manual way of chi square test
- Chi square test with LibreOffice Calc
- Chi square test with R

We prepare the count of female employees in each level as given below. I have used COUNTIFS() function of LibreOffice.

Calculate the row (highlighted in pink colour) and column sums (blue colour) and summation of all row sums (saffron colour).

The values are called observed values. We shall find out the expected values as well easily as given below.

Expected value = column sum x row sum/sum of rowsum

=J15*N12/N15 = 25 x 20/50 = 10

Finally our table looks like this.

All the observed values (O), Expected values (E) are substituted in the below table. We calculate the Chi square value χ2 which is 19.

O | E | O-E | (O-E)2 | (O-E)2/E |

5 | 10 | -5 | 25 | 2.5 |

20 | 12.5 | 7.5 | 56.25 | 4.5 |

0 | 2.5 | -2.5 | 6.25 | 2.5 |

15 | 10 | 5 | 25 | 2.5 |

5 | 12.5 | -7.5 | 56.25 | 4.5 |

5 | 2.5 | 2.5 | 6.25 | 2.5 |

χ2 |
19 |

Level of significance or Type 1 error = 5%, which is 0.05

Degrees of freedom = (row count – 1) x (column count – 1) = 2

Critical value of χ2 is 5.991, which is looked up using the level of significance and degrees of freedom in the below given table.

*Make a decision*

To accept our null hypothesis H0, calculated χ2 < critical χ2.

Our calculated χ2 = 19

Our critical χ2 = 5.991

Hence, we reject null hypothesis and accept alternate hypothesis.

You may watch the following video to understand the above calculation.

We have already found out the frequency distribution of females and males per each management level. Let’s use the same.

Select Data>Statistics>Chi-square Test

Choose the input cells

Select the Output Cell

Finally my selections are given as below

After pressing OK, We get the following result

*Make a decision*

If p≤α reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p 0.00007485 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted.

I have the data set stored as *sal.csv* file. I’m importing it and store to sal object.

> setwd("d:/gandhari/videos/Advanced Business Analytics/") > sal <-read.csv("sal.csv") > head(sal) id gender educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer 1 1 female UG Jr Engineer JLM 10000 1000 3 4 2 2 male DOCTORATE Chairman TLM 100000 100000 20 4 3 3 male DIPLOMA Jr HR JLM 6000 6000 1 3 4 4 male PG Engineer MLM 15000 15000 7 2 5 5 female PG Sr Engineer MLM 25000 25000 12 4 6 6 male DIPLOMA Jr Engineer JLM 6000 8000 1 1

As I wrote in Exploring data files with R I create a Frequency Distribution table using table() function.

> gender_level_table <- table(sal$Level, sal$gender) > gender_level_table female male JLM 5 15 MLM 20 5 TLM 0 5

Use chisq.test() function with gender_level_table as its input, to run the chi square test

> chisq.test(gender_level_table) Pearson's Chi-squared test data: gender_level_table X-squared = 19, df = 2, p-value = 7.485e-05 Warning message: In chisq.test(gender_level_table) : Chi-squared approximation may be incorrect

*Make a decision*

If p≤α reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p 7.485e-05 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted.

See you in another interesting post. Happy Sunday.

Filed under: Business Analytics Tagged: Alternate Hypothesis, Frequency Distribution, Hypothesis testing, LibreOffice, Null Hypothesis, R language, R Studio, Statistics, Type 1 error ]]>