Hi,

I have written about hypothesis testing in my earlier posts:

- One way analysis of variance
- Calculating Anova with LibreOffice Calc
- Correlation and Pearson’s correlation coefficient
- Identifying the correlation coefficient using LibreOffice Calc

Statisticians recommend different tests for different types of data.

When we have:

- two categorical variables, we use the chi-square test
- two continuous variables, we use correlation
- one categorical and one continuous variable, we use a t-test or ANOVA.

In this post, I’ll use the data set given below.

| id | gender | educ | Designation | Level | Salary | Last.drawn.salary | Pre..Exp | Ratings.by.interviewer |
|---|---|---|---|---|---|---|---|---|
| 1 | female | UG | Jr Engineer | JLM | 10000 | 1000 | 3 | 4 |
| 2 | male | DOCTORATE | Chairman | TLM | 100000 | 100000 | 20 | 4 |
| 3 | male | DIPLOMA | Jr HR | JLM | 6000 | 6000 | 1 | 3 |
| 4 | male | PG | Engineer | MLM | 15000 | 15000 | 7 | 2 |
| 5 | female | PG | Sr Engineer | MLM | 25000 | 25000 | 12 | 4 |
| 6 | male | DIPLOMA | Jr Engineer | JLM | 6000 | 8000 | 1 | 1 |
| 7 | male | DIPLOMA | Jr Associate | JLM | 8000 | 8000 | 2 | 4 |
| 8 | female | PG | Engineer | MLM | 13000 | 13000 | 7 | 3 |
| 9 | female | PG | Engineer | MLM | 14000 | 14000 | 7 | 2 |
| 10 | female | PG | Engineer | MLM | 16000 | 16000 | 8 | 4 |
| 11 | female | UG | Jr Engineer | JLM | 10000 | 1000 | 3 | 4 |
| 12 | male | DOCTORATE | Chairman | TLM | 100000 | 100000 | 20 | 4 |
| 13 | male | DIPLOMA | Jr HR | JLM | 6000 | 6000 | 1 | 3 |
| 14 | male | PG | Engineer | MLM | 15000 | 15000 | 7 | 2 |
| 15 | female | PG | Sr Engineer | MLM | 25000 | 25000 | 12 | 4 |
| 16 | male | DIPLOMA | Jr Engineer | JLM | 6000 | 8000 | 1 | 1 |
| 17 | male | DIPLOMA | Jr Associate | JLM | 8000 | 8000 | 2 | 4 |
| 18 | female | PG | Engineer | MLM | 13000 | 13000 | 7 | 3 |
| 19 | female | PG | Engineer | MLM | 14000 | 14000 | 7 | 2 |
| 20 | female | PG | Engineer | MLM | 16000 | 16000 | 8 | 4 |
| 21 | female | PG | Sr Engineer | MLM | 25000 | 25000 | 12 | 4 |
| 22 | male | DIPLOMA | Jr Engineer | JLM | 6000 | 8000 | 1 | 1 |
| 23 | male | DIPLOMA | Jr Associate | JLM | 8000 | 8000 | 2 | 4 |
| 24 | female | PG | Engineer | MLM | 13000 | 13000 | 7 | 3 |
| 25 | female | PG | Engineer | MLM | 14000 | 14000 | 7 | 2 |
| 26 | female | PG | Engineer | MLM | 16000 | 16000 | 8 | 4 |
| 27 | female | UG | Jr Engineer | JLM | 10000 | 1000 | 3 | 4 |
| 28 | male | DOCTORATE | Chairman | TLM | 100000 | 100000 | 20 | 4 |
| 29 | male | DIPLOMA | Jr HR | JLM | 6000 | 6000 | 1 | 3 |
| 30 | male | PG | Engineer | MLM | 15000 | 15000 | 7 | 2 |
| 31 | female | PG | Sr Engineer | MLM | 25000 | 25000 | 12 | 4 |
| 32 | female | PG | Sr Engineer | MLM | 25000 | 25000 | 12 | 4 |
| 33 | male | DIPLOMA | Jr Engineer | JLM | 6000 | 8000 | 1 | 1 |
| 34 | male | DIPLOMA | Jr Associate | JLM | 8000 | 8000 | 2 | 4 |
| 35 | female | PG | Engineer | MLM | 13000 | 13000 | 7 | 3 |
| 36 | female | PG | Engineer | MLM | 14000 | 14000 | 7 | 2 |
| 37 | female | PG | Engineer | MLM | 16000 | 16000 | 8 | 4 |
| 38 | female | UG | Jr Engineer | JLM | 10000 | 1000 | 3 | 4 |
| 39 | male | DOCTORATE | Chairman | TLM | 100000 | 100000 | 20 | 4 |
| 40 | male | DIPLOMA | Jr HR | JLM | 6000 | 6000 | 1 | 3 |
| 41 | male | PG | Engineer | MLM | 15000 | 15000 | 7 | 2 |
| 42 | female | PG | Sr Engineer | MLM | 25000 | 25000 | 12 | 4 |
| 43 | male | DIPLOMA | Jr Engineer | JLM | 6000 | 8000 | 1 | 1 |
| 44 | male | DIPLOMA | Jr Associate | JLM | 8000 | 8000 | 2 | 4 |
| 45 | female | PG | Engineer | MLM | 13000 | 13000 | 7 | 3 |
| 46 | female | PG | Engineer | MLM | 16000 | 16000 | 8 | 4 |
| 47 | female | UG | Jr Engineer | JLM | 10000 | 1000 | 3 | 4 |
| 48 | male | DOCTORATE | Chairman | TLM | 100000 | 100000 | 20 | 4 |
| 49 | male | DIPLOMA | Jr HR | JLM | 6000 | 6000 | 1 | 3 |
| 50 | male | PG | Engineer | MLM | 15000 | 15000 | 7 | 2 |

We use the chi-square test for two types of hypothesis testing:

- test of independence of variables
- test of goodness of fit

### Testing of independence

The test of independence checks whether two categorical variables are associated. The higher the chi-square value, the stronger the evidence that the variables are related. We shall use this to test our hypothesis.

### Goodness of fit

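When we use the chi-square test for goodness of fit, we compare the observed frequencies of a categorical variable with the frequencies expected under a hypothesised distribution or model. Here a higher chi-square value means a larger gap between observed and expected, that is, a poorer fit. Chi-square fit statistics are also used when assessing models such as BLR and SEM.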
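To make goodness of fit concrete, here is a minimal sketch in R. It takes the number of employees at each level from the data set above (20 JLM, 25 MLM, 5 TLM) and tests them against an assumed equal split across the three levels. The equal split is only my assumption for illustration; it is not a hypothesis from this post.

```r
# Employees per management level, counted from the data set above.
level_counts <- c(JLM = 20, MLM = 25, TLM = 5)

# Hypothetical expectation for illustration: employees spread evenly
# across the three levels (one third each).
assumed_proportions <- rep(1 / 3, 3)

# Goodness-of-fit test: observed counts against the assumed proportions.
fit <- chisq.test(level_counts, p = assumed_proportions)
print(fit)
```

A large chi-square value (and a small p-value) here would say that the observed counts do not fit the assumed equal split well.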

### Example for Testing of independence

This post covers the test of independence, using the employee data given above. Following are my hypotheses.

H0: The number of female employees and the level of management are not related.

H1: The number of female employees and the level of management are related.

We will solve this using three methods:

- Manual way of chi square test
- Chi square test with LibreOffice Calc
- Chi square test with R

#### Manual way of chi square test

We first count the female and male employees at each management level, as given below. I have used the COUNTIFS() function of LibreOffice Calc.

Calculate the row sums (highlighted in pink), the column sums (blue) and the grand total, which is the sum of all row sums (saffron).
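If you would like to cross-check the spreadsheet counts, the same table of counts and sums can be sketched in R. The counts are typed in by hand from the data set above; the object names are my own choice.

```r
# Observed counts from the data set above (rows: level, columns: gender).
observed <- matrix(
  c(5, 15,
    20, 5,
    0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(Level = c("JLM", "MLM", "TLM"),
                  gender = c("female", "male"))
)

# addmargins() appends the row sums, the column sums and the grand total.
observed_with_totals <- addmargins(observed)
print(observed_with_totals)
```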

These counts are called the observed values. The expected values can be found easily, as given below.

Expected value = (column sum x row sum) / grand total

For example, =J15*N12/N15 = (25 x 20) / 50 = 10
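The same formula can be applied to every cell at once. This is a small sketch in R, assuming the observed counts are typed in as a matrix; outer() multiplies each row sum by each column sum.

```r
# Observed counts from the data set above (rows: level, columns: gender).
observed <- matrix(
  c(5, 15, 20, 5, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(Level = c("JLM", "MLM", "TLM"),
                  gender = c("female", "male"))
)

# Expected value of each cell = row sum x column sum / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(expected)
```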

Finally our table looks like this.

All the observed values (O) and expected values (E) are substituted in the table below. The chi-square value χ² works out to 19.

| O | E | O-E | (O-E)² | (O-E)²/E |
|---|---|---|---|---|
| 5 | 10 | -5 | 25 | 2.5 |
| 20 | 12.5 | 7.5 | 56.25 | 4.5 |
| 0 | 2.5 | -2.5 | 6.25 | 2.5 |
| 15 | 10 | 5 | 25 | 2.5 |
| 5 | 12.5 | -7.5 | 56.25 | 4.5 |
| 5 | 2.5 | 2.5 | 6.25 | 2.5 |
| | | | χ² | 19 |
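The arithmetic of this table can be replayed in a few lines of R. This is only a sketch with the O and E values copied by hand in the same order as the table rows.

```r
# Observed (O) and expected (E) values, in the same order as the table rows.
observed <- c(5, 20, 0, 15, 5, 5)
expected <- c(10, 12.5, 2.5, 10, 12.5, 2.5)

# Contribution of each cell: (O - E)^2 / E
contribution <- (observed - expected)^2 / expected

# The chi-square statistic is the sum of all contributions.
chi_square <- sum(contribution)
print(contribution)
print(chi_square)
```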

Level of significance (the probability of a Type I error), α = 5%, which is 0.05

Degrees of freedom = (row count – 1) x (column count – 1) = (3 – 1) x (2 – 1) = 2

The critical value of χ² is 5.991, which is looked up in the table given below using the level of significance and the degrees of freedom.
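If a printed chi-square table is not at hand, the critical value can also be obtained in R with qchisq(). A minimal sketch, using the same α and degrees of freedom as above:

```r
alpha <- 0.05               # level of significance
df <- (3 - 1) * (2 - 1)     # (row count - 1) x (column count - 1)

# Critical value: the point with probability alpha to its right
# under the chi-square distribution with df degrees of freedom.
critical_value <- qchisq(1 - alpha, df = df)
print(round(critical_value, 3))
```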

*Make a decision*

We fail to reject the null hypothesis H0 only if the calculated χ² is less than the critical χ².

Our calculated χ² = 19

Our critical χ² = 5.991

Since 19 > 5.991, we reject the null hypothesis and accept the alternative hypothesis.

You may watch the following video to understand the above calculation.

#### Chi square test with LibreOffice Calc

We have already found the frequency distribution of females and males at each management level. Let’s use the same.

Select Data > Statistics > Chi-square Test

Choose the input cells

Select the Output Cell

Finally, my selections are as given below.

After pressing OK, we get the following result.

*Make a decision*

If p≤α reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p-value of 0.00007485 is less than α = 0.05, so the null hypothesis is rejected and the alternative hypothesis is accepted.
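As a cross-check outside the spreadsheet, the p-value is the area to the right of the calculated χ² under the chi-square distribution. A small sketch in R, assuming the statistic of 19 and 2 degrees of freedom from the manual calculation:

```r
chi_square <- 19   # calculated chi-square value
df <- 2            # degrees of freedom
alpha <- 0.05      # level of significance

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(chi_square, df = df, lower.tail = FALSE)
print(p_value)

# TRUE means we reject the null hypothesis.
reject_null <- p_value <= alpha
print(reject_null)
```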

#### Chi square test with R

I have the data set stored as the *sal.csv* file. I import it and store it in the sal object.

    > setwd("d:/gandhari/videos/Advanced Business Analytics/")
    > sal <- read.csv("sal.csv")
    > head(sal)
      id gender      educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
    1  1 female        UG Jr Engineer   JLM  10000              1000        3                      4
    2  2   male DOCTORATE    Chairman   TLM 100000            100000       20                      4
    3  3   male   DIPLOMA       Jr HR   JLM   6000              6000        1                      3
    4  4   male        PG    Engineer   MLM  15000             15000        7                      2
    5  5 female        PG Sr Engineer   MLM  25000             25000       12                      4
    6  6   male   DIPLOMA Jr Engineer   JLM   6000              8000        1                      1

As I wrote in Exploring data files with R, I create a frequency distribution table using the table() function.

    > gender_level_table <- table(sal$Level, sal$gender)
    > gender_level_table
          female male
      JLM      5   15
      MLM     20    5
      TLM      0    5

Use the chisq.test() function with gender_level_table as its input to run the chi-square test.

    > chisq.test(gender_level_table)

    	Pearson's Chi-squared test

    data:  gender_level_table
    X-squared = 19, df = 2, p-value = 7.485e-05

    Warning message:
    In chisq.test(gender_level_table) :
      Chi-squared approximation may be incorrect
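About the warning: chisq.test() raises it when some expected counts are below 5, which is the case for the two TLM cells (2.5 each). The sketch below rebuilds the table by hand so that it runs without sal.csv, inspects the expected counts, and runs Fisher's exact test as one possible cross-check. Fisher's test is my suggestion here, not part of the method used in this post.

```r
# Same counts as gender_level_table, typed in by hand so the sketch is standalone.
gender_level_table <- as.table(matrix(
  c(5, 15, 20, 5, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("JLM", "MLM", "TLM"), c("female", "male"))
))

# Re-run the test quietly and look at the expected counts it used.
result <- suppressWarnings(chisq.test(gender_level_table))
print(result$expected)

# Number of cells with an expected count below 5 (the cause of the warning).
cells_below_5 <- sum(result$expected < 5)
print(cells_below_5)

# Fisher's exact test does not rely on the chi-square approximation.
exact <- fisher.test(gender_level_table)
print(exact$p.value)
```

If the exact test also gives a p-value below α, the decision taken from the chi-square test stays the same.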

*Make a decision*

If p≤α reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p-value of 7.485e-05 is less than α = 0.05, so the null hypothesis is rejected and the alternative hypothesis is accepted.

See you in another interesting post. Happy Sunday.