# Testing of Independence – Chi Square test – Manual, LibreOffice, R

Hi,

I have written about testing of hypothesis in my earlier posts

Statisticians recommended right testing approaches for different type of data.

When we have –

• both data as categorical, we shall use Chi Square Test
• Continuous and Continuous data, we shall use correlation
• Categorical and Continuous data, we shall use t test or anova.

In this post, I’d be using the below given data set.

   id gender      educ  Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1 female        UG  Jr Engineer   JLM  10000              1000        3                      4
2   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
3   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
4   male        PG     Engineer   MLM  15000             15000        7                      2
5 female        PG  Sr Engineer   MLM  25000             25000       12                      4
6   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
7   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
8 female        PG     Engineer   MLM  13000             13000        7                      3
9 female        PG     Engineer   MLM  14000             14000        7                      2
10 female        PG     Engineer   MLM  16000             16000        8                      4
11 female        UG  Jr Engineer   JLM  10000              1000        3                      4
12   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
13   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
14   male        PG     Engineer   MLM  15000             15000        7                      2
15 female        PG  Sr Engineer   MLM  25000             25000       12                      4
16   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
17   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
18 female        PG     Engineer   MLM  13000             13000        7                      3
19 female        PG     Engineer   MLM  14000             14000        7                      2
20 female        PG     Engineer   MLM  16000             16000        8                      4
21 female        PG  Sr Engineer   MLM  25000             25000       12                      4
22   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
23   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
24 female        PG     Engineer   MLM  13000             13000        7                      3
25 female        PG     Engineer   MLM  14000             14000        7                      2
26 female        PG     Engineer   MLM  16000             16000        8                      4
27 female        UG  Jr Engineer   JLM  10000              1000        3                      4
28   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
29   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
30   male        PG     Engineer   MLM  15000             15000        7                      2
31 female        PG  Sr Engineer   MLM  25000             25000       12                      4
32 female        PG  Sr Engineer   MLM  25000             25000       12                      4
33   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
34   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
35 female        PG     Engineer   MLM  13000             13000        7                      3
36 female        PG     Engineer   MLM  14000             14000        7                      2
37 female        PG     Engineer   MLM  16000             16000        8                      4
38 female        UG  Jr Engineer   JLM  10000              1000        3                      4
39   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
40   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
41   male        PG     Engineer   MLM  15000             15000        7                      2
42 female        PG  Sr Engineer   MLM  25000             25000       12                      4
43   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
44   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
45 female        PG     Engineer   MLM  13000             13000        7                      3
46 female        PG     Engineer   MLM  16000             16000        8                      4
47 female        UG  Jr Engineer   JLM  10000              1000        3                      4
48   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
49   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
50   male        PG     Engineer   MLM  15000             15000        7                      2


We shall use chi square test for two types of hypothesis testing

• test of independence of variables
• test goodness of fit

### Testing of independence

We can find out the association between two (at least) categorical variables. Higher the chi square value, better the result is. We shall use this to test our hypothesis.

### Goodness of fit

When we use chi square test to find the goodness of fit, we shall use 2 categorical variables. higher the chi square value, better the result is. We shall use this to test BLR, SEM tests.

### Example for Testing of independence

This post talks about testing of independence. We have employee data given above. Following are my hypothesis.

H0 = Number of female employees and level of management are not related.

H1 = Number of female employees and level of management are related.

We would solve this using three methods

1. Manual way of chi square test
2. Chi square test with LibreOffice Calc
3. Chi square test with R

#### Manual way of chi square test

We prepare the count of female employees in each level as given below. I have used COUNTIFS() function of LibreOffice.

Calculate the row (highlighted in pink colour) and column sums (blue colour) and summation of all row sums (saffron colour).

The values are called observed values. We shall find out the expected values as well easily as given below.

Expected value = column sum x row sum/sum of rowsum

=J15*N12/N15 = 25 x 20/50 = 10

Finally our table looks like this.

All the observed values (O), Expected values (E) are substituted in the below table. We calculate the Chi square value χ2 which is 19.

 O E O-E (O-E)2 (O-E)2/E 5 10 -5 25 2.5 20 12.5 7.5 56.25 4.5 0 2.5 -2.5 6.25 2.5 15 10 5 25 2.5 5 12.5 -7.5 56.25 4.5 5 2.5 2.5 6.25 2.5 χ2 19

Level of significance or Type 1 error = 5%, which is 0.05

Degrees of freedom = (row count – 1) x (column count – 1) = 2

Critical value of χ2 is 5.991, which is looked up using the level of significance and degrees of freedom in the below given table.

Make a decision

To accept our null hypothesis H0, calculated χ2 < critical χ2.
Our calculated χ2 = 19
Our critical χ2 = 5.991

Hence, we reject null hypothesis and accept alternate hypothesis.

You may watch the following video to understand the above calculation.

#### Chi square test with LibreOffice Calc

We have already found out the frequency distribution of females and males per each management level. Let’s use the same.

Select Data>Statistics>Chi-square Test

Choose the input cells

Select the Output Cell

Finally my selections are given as below

After pressing OK, We get the following result

Make a decision

If pα reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p 0.00007485 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted.

#### Chi square test with R

I have the data set stored as sal.csv file. I’m importing it and store to sal object.

> setwd("d:/gandhari/videos/Advanced Business Analytics/")
id gender      educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1  1 female        UG Jr Engineer   JLM  10000              1000        3                      4
2  2   male DOCTORATE    Chairman   TLM 100000            100000       20                      4
3  3   male   DIPLOMA       Jr HR   JLM   6000              6000        1                      3
4  4   male        PG    Engineer   MLM  15000             15000        7                      2
5  5 female        PG Sr Engineer   MLM  25000             25000       12                      4
6  6   male   DIPLOMA Jr Engineer   JLM   6000              8000        1                      1


As I wrote in Exploring data files with R I create a Frequency Distribution table using table() function.

> gender_level_table <- table(sal$Level, sal$gender)
> gender_level_table

female male
JLM      5   15
MLM     20    5
TLM      0    5


Use chisq.test() function with gender_level_table as its input, to run the chi square test

> chisq.test(gender_level_table)

Pearson's Chi-squared test

data:  gender_level_table
X-squared = 19, df = 2, p-value = 7.485e-05

Warning message:
In chisq.test(gender_level_table) :
Chi-squared approximation may be incorrect


Make a decision

If pα reject the null hypothesis. If p>α fail to reject the null hypothesis.

Our p 7.485e-05 is lesser than alpha 0.05. So null hypothesis is rejected and alternate hypothesis is accepted.

See you in another interesting post. Happy Sunday.

# Linear Programming – Covering Model using LibreOffice Calc Solver

🕋 Eid Mubarak, Selamat Hari Raya Haji ☪️

I have written about Linear Programming – Allocation model in my previous post Linear Programming and Linear Programming with LibreOffice Calc Solver. This post would talk about Linear Programming – Covering models.

First question would be – what’s the difference between Allocation model and Covering model. There is no difference in the optimization function. The difference exists in the constraints. All our constraints talk about maximum in allocation model. All those constraints had symbol. Covering models talk about minimization, usually cost.

### Example

I’ll use the data set given in https://paginas.fe.up.pt/~mac/ensino/docs/OR/otherDocs/PowellAllocationCoveringBlendingConstraints.pdf

Dahlby Outfitters wishes to introduce packaged trail mix as a new product. The ingredients for the trail mix are seeds, raisins, flakes, and two kinds of nuts. Each ingredient contains certain amounts of vitamins, minerals, protein, and calories.

The marketing department has specified that the product be designed so that a certain minimum nutritional profile is met. The decision problem is to determine the optimal product composition—that is, to minimize the product cost by choosing the amount for each of the ingredients in the mix. The data shown below summarize the parameters of the problem:

 Component Grams per pound Nutritional Requirement Seeds Raisins Flakes Pecans Walnuts Vitamins 10 20 10 30 20 20 Minerals 5 7 4 9 2 10 Protein 1 4 10 2 1 15 Calories 500 450 160 300 500 600 Cost/pound 4 5 3 7 6

Lets  denote the product names as S, R, F, P and W. Our objective function would be like this.

Total Cost = 4S+5R+3F+7P+6W

Rewriting the above statement as –

Zmin = 4S+5R+3F+7P+6W

subject to constraints –

 Vitamin content 10S + 20R + 10F + 30P + 20W greater than or eq 20 Mineral content 5S + 7R + 4F + 9P + 2W greater than or eq 10 Protein content 1S + 4R + 10F + 2P + 1W greater than or eq 15 Calorie content 500S + 450R + 160F + 300P + 500W greater than or eq 600

Rewriting the above constraints as linear equations as given below,

10S + 20R + 10F + 30P + 20W ≥ 20
5S + 7R + 4F + 9P + 2W ≥ 10
1S + 4R + 10F + 2P + 1W ≥ 15
500S + 450R + 160F + 300P + 500W ≥ 600

Prepare the data set. G9 is highlighted in yellow colour. This would be our minimizing figure.

The data have given the cost of each product already. So, our aim is to find how much amount of each product shall be produced. This would be the decision variable. We need to find out. The cells of the decision variables are also highlighted in yellow colour.

Let’s write the constraints now. Our aim is to find how much vitamin, mineral etc to be added in our product. Those cells are highlighted in yellow colour.

Let’s open the Solver now. Following is my selection.

1. Target cell is where we find the minimum cost.
2. As we are talking about minimum, we choose ‘optimize result to’ as ‘Minimum’
3. By changing cells = Decision variable cells
4. Limiting Constraints are highlighted with => operator.

Following is the result.

The answer I get in Calc is not equal to what I see in the reference PDF. However, let’s take it as the decision at the moment –

We would take 24.6, 10, 15, 600 for vitamins, minerals, protein and calories.

Linear programming suggests us to avoid pecans and walnuts.

0.5 x seeds, 0.3 x Raisins and 1.3 x Flakes are sufficient.

With this, we would be able to provide 24.6 vitamins, 10 minerals, 15 protein and 600 calories.

With this I’m closing the statistics post. I’d be starting the next part of this series soon, which is R programming.

# Linear Programming with LibreOffice Calc Solver

Hi,

I wrote about Linear Programming using a car factory example in my previous post Linear Programming. That was a manual graphical method computation. In this post, I’ll solve the same maximization problem using LibreOffice Calc.

Lets prepare the data –

What we need to find out? We need to find out how many cars of both types can be produced. Let’s leave one blank cell for each SUV (highlighted in yellow colour).

What’s out objective function? Maximum profit obtained using x number of SUV cars and y number of Sedan cars.

Let’s add another blank cell. The formula of this cell would be =SUMPRODUCT(B6:C6,B9:C9)

This is nothing but, number of SUV cars x SUV profit + number of Sedan cars x Sedan profit

We have constraints on raw material, machine and man power, right? Let’s add the the first constraint as below.

Number of SUV cars x raw material required by SUV car (2) should be lesser than or equal to maximum raw material available (18 tonnes). This is represented using the following formula.

=SUMPRODUCT(B3:C3,B9:C9)

Similarly we are adding another two constraints using the following formulas.

=SUMPRODUCT(B4:C4,B9:C9)

=SUMPRODUCT(B5:C5,B9:C9)

Finally out data set looks like this.

Open Tools> Solver

Let’s fill the solver window as below.

Target cell = Profit cell G9

We want maximization function, because we need to know the maximum count of cars can be produced.

Optimize results to = Maximum

We want to get the best results by changing the value of the count of cars shall be produced.

By changing cells = number of SUV and Sedan cars = B9 and C9.

1. Constraints cells B12:B14 should be lesser than maximum available resources D12:D14
2. Number of cars should not be negative
3. I tested a third constraint, which is maximum number of cars shall not exceed 10 (You may ignore this).

Finally, my selection is given as below.

Ok Siri, get me the results.

This solver suggests me not to produce any Sedan cars. It asks me to produce 2 SUV cars. This is same as that of the results I got in graphical analysis Linear Programming (point c in the below given graph).

In addition LibreOffice Calc, suggests me maximum resources can be consumed!

See you in another interesting post.

# Computing Regression with LibreOffice Calc

Hi,

I wrote about regression equation using algebraic method in my previous post Linear Regression. Now, I’ll show you how to compute it using LibreOffice Calc.

Let’s start!

Choose your data set. I’m choosing same data set I used in my previous post. You may check if my algebraic method computation is correct or wrong!

Select Data>Statistics>Regression

Choose variable 1, which is our X.

Here is how we choose Variable 1

Here is how we choose Variable 2, which is our Y.

Choose the output cell, where you want LibreOffice to write the output.

Finally these are what we selected.

And, here is the output. You may check the slope and intercept value with my previous post 🙂

See you in another interesting post.

# Identifying the correlation coefficient using LibreOffice Calc

Hi,

I have written about correlation in my previous post Correlation and Pearson’s correlation coefficient. I also worked out a small data set to identify the correlation coefficient. In this post, let’s see how quickly we shall generate the same using LibreOffice Calc (FOSS Equivalent of Microsoft Excel).

Here is the data set. We have two variables x and y. We have 7 samples for each.

Select Data>Statistics>Correlation

Choose all data set we have (values of x and y given above)

Click the Results to: button to decide where LibreOffice to write the output. I have chosen the cell D2.

Finally, following is my inputs. Click Ok.

Here is the output.

Column2 and column 1 is 0.95.

What does this indicate? Check the below given picture. Out correlation coefficient is 0.95 in the scale of -1 to +1. So x and y are highly positively correlated.

pearson correlation coefficient interpretation (c) http://www.mathcaptain.com/statistics/pearson-correlation-coefficient.html

If this is true, the graphical representation should show a raising line, shouldn’t it? Let’s try.

Obviously!