Testing of Independence – Chi Square test – Manual, LibreOffice, R

Hi,

I have written about testing of hypothesis in my earlier posts

Statisticians recommend the right testing approach for each type of data.

When we have –

• both variables are categorical, we shall use the Chi Square Test
• both variables are continuous, we shall use correlation
• one categorical and one continuous variable, we shall use the t test or ANOVA.

In this post, I’d be using the data set given below.

   id gender      educ  Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1 female        UG  Jr Engineer   JLM  10000              1000        3                      4
2   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
3   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
4   male        PG     Engineer   MLM  15000             15000        7                      2
5 female        PG  Sr Engineer   MLM  25000             25000       12                      4
6   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
7   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
8 female        PG     Engineer   MLM  13000             13000        7                      3
9 female        PG     Engineer   MLM  14000             14000        7                      2
10 female        PG     Engineer   MLM  16000             16000        8                      4
11 female        UG  Jr Engineer   JLM  10000              1000        3                      4
12   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
13   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
14   male        PG     Engineer   MLM  15000             15000        7                      2
15 female        PG  Sr Engineer   MLM  25000             25000       12                      4
16   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
17   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
18 female        PG     Engineer   MLM  13000             13000        7                      3
19 female        PG     Engineer   MLM  14000             14000        7                      2
20 female        PG     Engineer   MLM  16000             16000        8                      4
21 female        PG  Sr Engineer   MLM  25000             25000       12                      4
22   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
23   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
24 female        PG     Engineer   MLM  13000             13000        7                      3
25 female        PG     Engineer   MLM  14000             14000        7                      2
26 female        PG     Engineer   MLM  16000             16000        8                      4
27 female        UG  Jr Engineer   JLM  10000              1000        3                      4
28   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
29   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
30   male        PG     Engineer   MLM  15000             15000        7                      2
31 female        PG  Sr Engineer   MLM  25000             25000       12                      4
32 female        PG  Sr Engineer   MLM  25000             25000       12                      4
33   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
34   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
35 female        PG     Engineer   MLM  13000             13000        7                      3
36 female        PG     Engineer   MLM  14000             14000        7                      2
37 female        PG     Engineer   MLM  16000             16000        8                      4
38 female        UG  Jr Engineer   JLM  10000              1000        3                      4
39   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
40   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
41   male        PG     Engineer   MLM  15000             15000        7                      2
42 female        PG  Sr Engineer   MLM  25000             25000       12                      4
43   male   DIPLOMA  Jr Engineer   JLM   6000              8000        1                      1
44   male   DIPLOMA Jr Associate   JLM   8000              8000        2                      4
45 female        PG     Engineer   MLM  13000             13000        7                      3
46 female        PG     Engineer   MLM  16000             16000        8                      4
47 female        UG  Jr Engineer   JLM  10000              1000        3                      4
48   male DOCTORATE     Chairman   TLM 100000            100000       20                      4
49   male   DIPLOMA        Jr HR   JLM   6000              6000        1                      3
50   male        PG     Engineer   MLM  15000             15000        7                      2


We shall use the chi square test for two types of hypothesis testing

• test of independence of variables
• test of goodness of fit

Testing of independence

We can find out the association between two (or more) categorical variables. The higher the chi square value, the stronger the evidence of an association. We shall use this to test our hypothesis.

Goodness of fit

When we use the chi square test for goodness of fit, we compare the observed frequency distribution of a single categorical variable against an expected (theoretical) distribution. The higher the chi square value, the larger the deviation of the observed frequencies from the expected ones. We shall use this in BLR and SEM model-fit checks.
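
To make the goodness-of-fit idea concrete, here is a minimal sketch in R with hypothetical die-roll counts (not from the data set above); chisq.test() accepts a vector of observed counts and a p argument of expected proportions:

```r
# Hypothetical example: do 60 die rolls fit a fair die?
rolls <- c(8, 9, 11, 12, 10, 10)              # observed counts for faces 1..6
result <- chisq.test(rolls, p = rep(1/6, 6))  # expected: equal proportions
result$statistic                              # chi square value
result$p.value                                # compare against alpha = 0.05
```

A large p-value here means the observed counts are consistent with the expected distribution.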

Example for Testing of independence

This post talks about testing of independence. We have the employee data given above. Following are my hypotheses.

H0: Number of female employees and level of management are not related.

H1: Number of female employees and level of management are related.

We would solve this using three methods

1. Manual way of chi square test
2. Chi square test with LibreOffice Calc
3. Chi square test with R

Manual way of chi square test

We prepare the count of female and male employees at each level as given below. I have used the COUNTIFS() function of LibreOffice.

Calculate the row sums (highlighted in pink), the column sums (blue) and the grand total of all row sums (saffron).

These values are called observed values. We can find the expected values just as easily, as given below.

Expected value = (column sum × row sum) / grand total

=J15*N12/N15 = 25 x 20/50 = 10

Finally our table looks like this.

All the observed values (O) and expected values (E) are substituted in the below table. We calculate the chi square value χ2, which is 19.

O     E     O-E    (O-E)²   (O-E)²/E
5     10    -5     25       2.5
20    12.5  7.5    56.25    4.5
0     2.5   -2.5   6.25     2.5
15    10    5      25       2.5
5     12.5  -7.5   56.25    4.5
5     2.5   2.5    6.25     2.5
                   χ²       19
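
The arithmetic in the table above can be double-checked with a few lines of R, using the observed and expected counts in the same order:

```r
O <- c(5, 20, 0, 15, 5, 5)            # observed counts
E <- c(10, 12.5, 2.5, 10, 12.5, 2.5)  # expected counts
chi_sq <- sum((O - E)^2 / E)
chi_sq  # 19
```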

Level of significance or Type 1 error = 5%, which is 0.05

Degrees of freedom = (row count – 1) x (column count – 1) = 2

Critical value of χ2 is 5.991, which is looked up using the level of significance and degrees of freedom in the below given table.
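
Instead of a printed table, the critical value can be computed directly in R with qchisq(), the quantile function of the chi square distribution:

```r
alpha <- 0.05                      # level of significance
df <- 2                            # degrees of freedom
critical <- qchisq(1 - alpha, df)  # critical chi square value
critical                           # approximately 5.991
```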

Make a decision

We fail to reject the null hypothesis H0 only when the calculated χ2 < critical χ2.
Our calculated χ2 = 19
Our critical χ2 = 5.991

Hence, we reject the null hypothesis and accept the alternate hypothesis.

You may watch the following video to understand the above calculation.

Chi square test with LibreOffice Calc

We have already found out the frequency distribution of females and males for each management level. Let’s use the same.

Select Data>Statistics>Chi-square Test

Choose the input cells

Select the Output Cell

Finally my selections are given as below

After pressing OK, we get the following result.

Make a decision

If p < α, reject the null hypothesis. If p > α, fail to reject the null hypothesis.

Our p-value 0.00007485 is less than alpha 0.05. So the null hypothesis is rejected and the alternate hypothesis is accepted.

Chi square test with R

I have the data set stored as a sal.csv file. I’m importing it and storing it in the sal object.

> setwd("d:/gandhari/videos/Advanced Business Analytics/")
> sal <- read.csv("sal.csv")
> head(sal)
  id gender      educ Designation Level Salary Last.drawn.salary Pre..Exp Ratings.by.interviewer
1  1 female        UG Jr Engineer   JLM  10000              1000        3                      4
2  2   male DOCTORATE    Chairman   TLM 100000            100000       20                      4
3  3   male   DIPLOMA       Jr HR   JLM   6000              6000        1                      3
4  4   male        PG    Engineer   MLM  15000             15000        7                      2
5  5 female        PG Sr Engineer   MLM  25000             25000       12                      4
6  6   male   DIPLOMA Jr Engineer   JLM   6000              8000        1                      1


As I wrote in Exploring data files with R, I create a frequency distribution table using the table() function.

> gender_level_table <- table(sal$Level, sal$gender)
> gender_level_table

female male
JLM      5   15
MLM     20    5
TLM      0    5


Use the chisq.test() function with gender_level_table as its input to run the chi square test.

> chisq.test(gender_level_table)

Pearson's Chi-squared test

data:  gender_level_table
X-squared = 19, df = 2, p-value = 7.485e-05

Warning message:
In chisq.test(gender_level_table) :
Chi-squared approximation may be incorrect
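
The warning appears because some expected counts fall below 5, where the chi square approximation gets shaky. We can inspect the expected counts behind the test; the snippet below rebuilds the same frequency table as a matrix so it runs standalone:

```r
# Same counts as gender_level_table above, rebuilt as a matrix
gender_level_table <- matrix(c(5, 20, 0, 15, 5, 5), nrow = 3,
                             dimnames = list(c("JLM", "MLM", "TLM"),
                                             c("female", "male")))
result <- chisq.test(gender_level_table)
result$expected  # both TLM cells are only 2.5, below the usual threshold of 5
```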


Make a decision

If p < α, reject the null hypothesis. If p > α, fail to reject the null hypothesis.

Our p-value 7.485e-05 is less than alpha 0.05. So the null hypothesis is rejected and the alternate hypothesis is accepted.

See you in another interesting post. Happy Sunday.

Exploring data files with R

I have written about data types and data structures of R in my previous post Working with data types of R. We shall explore a data set in this post.

mtcars is a dataset that already exists in R.

> mtcars
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2


File structure

The very first question would be about the size of the data set.

> dim(mtcars)
[1] 32 11


It contains 32 rows and 11 columns.

Now, how many variables do we have in mtcars?

> names(mtcars)
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"


We shall preview the data using the head and tail commands.

> head(mtcars)
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> tail(mtcars)
mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2


Similar to the Unix head and tail commands, you see the first and the last 6 records in R.

This is the time to know about the structure of the data set.

> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


So, this is a data frame. Each variable is explained above. How is the data being stored?

> mode(mtcars)
[1] "list"


It is stored as a list. Let’s take another dataset available with R – airquality.

> head(airquality, n=10)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10


You may parameterize the head command as I have shown above. Let’s omit the records with NA.

> aqNoNA <- na.omit(airquality)
> dim(aqNoNA)
[1] 111   6
> dim(airquality)
[1] 153   6


So, our new object aqNoNA contains the records without missing cases, totally 111 rows. We have the is.na command to check if the data is NA.

> is.na(airquality)
      Ozone Solar.R  Wind  Temp Month   Day
 [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
 [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
 [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
 [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
 [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
 [6,] FALSE    TRUE FALSE FALSE FALSE FALSE
> sum(is.na(airquality))
[1] 44


So, totally 44 NAs are found. The summary command gives us the minimum, 1st quartile, median, mean, 3rd quartile, maximum and the number of NAs.

> summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
 NA's   :37       NA's   :7                                                       
      Day       
 Min.   : 1.0   
 1st Qu.: 8.0   
 Median :16.0   
 Mean   :15.8   
 3rd Qu.:23.0   
 Max.   :31.0   


The 1st quartile means the 25th percentile of the data, the 2nd quartile (the median) the 50th percentile, and the 3rd quartile the 75th percentile. We shall filter the summary command as given below.

> summary(airquality$Ozone)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
1.00   18.00   31.50   42.13   63.25  168.00      37


We have given the summary of only one variable Ozone above.
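
The individual numbers in that summary can also be pulled out with base functions; na.rm = TRUE drops the NAs for the computation:

```r
median(airquality$Ozone, na.rm = TRUE)  # 31.5, matching the summary above
mean(airquality$Ozone, na.rm = TRUE)    # 42.13 (rounded)
```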

We shall apply other mathematical functions as given below

> sd(aqNoNA$Ozone)
[1] 33.27597


We have calculated the standard deviation for Ozone above. I have taken the data without NA, as the sd operation would return NA if the input contains NA.

Finding the Outlier

> myNumbers <- rep(c(1, 4, 10, 100), c(5, 5, 5, 1))
> myNumbers
 [1]   1   1   1   1   1   4   4   4   4   4  10  10  10  10  10 100
> mean(myNumbers)
[1] 10.9375
> sd(myNumbers)
[1] 24.04293
> scale(myNumbers)
             [,1]
 [1,] -0.41332316
 [2,] -0.41332316
 [3,] -0.41332316
 [4,] -0.41332316
 [5,] -0.41332316
 [6,] -0.28854636
 [7,] -0.28854636
 [8,] -0.28854636
 [9,] -0.28854636
[10,] -0.28854636
[11,] -0.03899275
[12,] -0.03899275
[13,] -0.03899275
[14,] -0.03899275
[15,] -0.03899275
[16,]  3.70431136
attr(,"scaled:center")
[1] 10.9375
attr(,"scaled:scale")
[1] 24.04293


Scaling of the metrics (the z score) is calculated using the formula (each sample value – mean) / standard deviation.

After scaling, if you find any value beyond ±2, it is called an outlier (an odd point located away from the central measures). When the SD is greater than the mean, it is also a hint that outliers are present.

Another way to test normality is the Shapiro test.

> shapiro.test(myNumbers)

	Shapiro-Wilk normality test

data:  myNumbers
W = 0.40055, p-value = 3.532e-07


If the p-value is greater than 0.05, the data is normally distributed. Here it is not.

Data Analysis

Here comes another interesting part: how to analyze the data after you upload your files. To explain this, I’d prepare a sample data set first.
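
Besides the z-score rule, another common check (not part of the original walkthrough) is the boxplot/IQR rule; boxplot.stats() flags points lying more than 1.5 × IQR beyond the quartile hinges:

```r
myNumbers <- rep(c(1, 4, 10, 100), c(5, 5, 5, 1))
boxplot.stats(myNumbers)$out  # values flagged as outliers
```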
> #Employee ID
> sn <- seq(1, 10, 1)
> sn
 [1]  1  2  3  4  5  6  7  8  9 10
> #Employee gender
> gender <- rep(c("male", "female"), c(6, 4))
> gender
 [1] "male"   "male"   "male"   "male"   "male"   "male"   "female" "female" "female"
[10] "female"
> #available departments
> dept <- rep(c("Admin", "HR", "Prod", "Contractor"), c(1, 3, 3, 3))
> dept
 [1] "Admin"      "HR"         "HR"         "HR"         "Prod"       "Prod"      
 [7] "Prod"       "Contractor" "Contractor" "Contractor"
> #Employee Salary
> sal <- rnorm(10, 1000, 200)
> sal
 [1]  888.2026  876.6272  919.7453 1005.9058 1084.4704 1337.7696  909.4302  801.1482
 [9] 1025.9457 1182.9774
> #Now our data set
> mydataset <- data.frame(sn, gender, sal, dept)
> mydataset
   sn gender       sal       dept
1   1   male  888.2026      Admin
2   2   male  876.6272         HR
3   3   male  919.7453         HR
4   4   male 1005.9058         HR
5   5   male 1084.4704       Prod
6   6   male 1337.7696       Prod
7   7 female  909.4302       Prod
8   8 female  801.1482 Contractor
9   9 female 1025.9457 Contractor
10 10 female 1182.9774 Contractor


So we have serial number, gender, salary and department. Which gender is the majority in the given data? We shall use the table function to arrive at a simple frequency distribution table.

> #which gender is more in each dept
> #this is frequency distribution
> table(mydataset$gender)

female   male
4      6


So we have 4 females and 6 males. Let’s group the above FD by department now.

> table(mydataset$gender, mydataset$dept)
         
           Admin Contractor HR Prod
  female       0          3  0    1
  male         1          0  3    2
> #assign the results to an object
> freqDis <- table(mydataset$gender, mydataset$dept);
> #transpose the table
> t(freqDis)

             female male
  Admin           0    1
  Contractor      3    0
  HR              0    3
  Prod            1    2


So all departments except Contractor have males as the majority. The function t() stands for transpose.

Let’s compute proportions of the data using prop.table().

> #proportion
> prop.table(freqDis)

         Admin Contractor  HR Prod
  female   0.0        0.3 0.0  0.1
  male     0.1        0.0 0.3  0.2


Rather than proportions, the percentage of males and females would give us better visibility. Hence we multiply the proportions by 100.

> #Percentage
> prop.table(freqDis)*100

         Admin Contractor HR Prod
  female     0         30  0   10
  male      10          0 30   20


Column sums and row sums are frequently needed in our day-to-day work.

> #sum
> colSums(freqDis)
     Admin Contractor         HR       Prod 
         1          3          3          3 


The row sums shall be calculated as below.

> rowSums(freqDis)
female   male
4      6
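
Both kinds of sums can also be produced in one shot with addmargins(); the snippet below rebuilds the same gender × dept counts as a table so it runs standalone:

```r
# Same counts as freqDis above, rebuilt as a table
freqDis <- as.table(matrix(c(0, 1, 3, 0, 0, 3, 1, 2), nrow = 2,
                           dimnames = list(c("female", "male"),
                                           c("Admin", "Contractor", "HR", "Prod"))))
addmargins(freqDis)  # appends a Sum row and a Sum column
```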


Salary is the interesting part in our profession. I’d like to see who earns more, male or female, using the aggregate() function.

> #who earns more - male or female?
> aggregate(sal~gender, mean, data = mydataset)
gender       sal
1 female  979.8754
2   male 1018.7868
> #sal is continuous variable
> #gender is categorical variable
> #mean is the function
> #data is our data source


We used the mean salary above. Let’s use sum now.

> aggregate(sal~gender, sum, data = mydataset)
gender      sal
1 female 3919.501
2   male 6112.721
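
The same grouped sum can be computed with base R's tapply() as well; the snippet reuses the gender vector and the salary values printed earlier (rnorm() output varies per run, so these exact numbers are just the ones shown above):

```r
gender <- rep(c("male", "female"), c(6, 4))
sal <- c(888.2026, 876.6272, 919.7453, 1005.9058, 1084.4704, 1337.7696,
         909.4302, 801.1482, 1025.9457, 1182.9774)
tapply(sal, gender, sum)  # grouped sums: female 3919.501, male 6112.721
```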


Similarly, we use standard deviation.

> aggregate(sal~gender, sd, data = mydataset)
gender      sal
1 female 163.5837
2   male 175.1006


So the salary package for females looks to be more consistent than that of males.

We calculated all the above functions individually. The psych package helps to calculate everything in one command.

> install.packages("psych")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/psych_1.7.5.zip'
Content type 'application/zip' length 3966969 bytes (3.8 MB)

package ‘psych’ successfully unpacked and MD5 sums checked

> library(psych)
> describe (mydataset)
vars  n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 10    5.50   3.03   5.50    5.50   3.71   1.00   10.00   9.00  0.00
gender*    2 10    1.60   0.52   2.00    1.62   0.00   1.00    2.00   1.00 -0.35
sal        3 10 1003.22 162.35 962.83  986.66 119.22 801.15 1337.77 536.62  0.71
dept*      4 10    2.80   1.03   3.00    2.88   1.48   1.00    4.00   3.00 -0.20
kurtosis    se
sn         -1.56  0.96
gender*    -2.05  0.16
sal        -0.72 51.34
dept*      -1.42  0.33


n is the total number of samples, sd is the standard deviation, etc. All the aggregate functions are calculated for serial number, gender, salary and department.

In the above data, I get only the salary data, which is in the 3rd column.

> describe (mydataset[,3])
vars  n    mean     sd median trimmed    mad    min     max  range skew kurtosis
X1    1 10 1003.22 162.35 962.83  986.66 119.22 801.15 1337.77 536.62 0.71    -0.72
se
X1 51.34


Let’s group the above described data by gender, i.e., mean and sd for males and females separately.

> describe.by(mydataset, mydataset$gender)

Descriptive statistics by group
group: female
vars n   mean     sd median trimmed    mad    min     max  range skew
sn         1 4   8.50   1.29   8.50    8.50   1.48   7.00   10.00   3.00 0.00
gender*    2 4   1.00   0.00   1.00    1.00   0.00   1.00    1.00   0.00  NaN
sal        3 4 979.88 163.58 967.69  979.88 166.64 801.15 1182.98 381.83 0.14
dept*      4 4   2.50   1.00   2.00    2.50   0.00   2.00    4.00   2.00 0.75
kurtosis    se
sn         -2.08  0.65
gender*      NaN  0.00
sal        -2.04 81.79
dept*      -1.69  0.50
---------------------------------------------------------------
group: male
vars n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 6    3.50   1.87   3.50    3.50   2.22   1.00    6.00   5.00  0.00
gender*    2 6    2.00   0.00   2.00    2.00   0.00   2.00    2.00   0.00   NaN
sal        3 6 1018.79 175.10 962.83 1018.79 119.22 876.63 1337.77 461.14  0.83
dept*      4 6    3.00   1.10   3.00    3.00   0.74   1.00    4.00   3.00 -0.76
kurtosis    se
sn         -1.80  0.76
gender*      NaN  0.00
sal        -1.02 71.48
dept*      -0.92  0.45
Warning message:
describe.by is deprecated.  Please use the describeBy function


Frequency Distribution

Let’s talk about Frequency distribution today.

I wrote about various data collection and sampling methods in my previous blog post Sampling Techniques.

After data collection or sampling, the first task a researcher does is organizing or categorizing. It would help him/her to get an overview of the data set. Frequency distribution is a simple method at this stage.

It contains at least two columns

1. Scale of Measurement – X
2. Frequency – f

The X column would list the values from min to max without missing any value.

f contains the tallies for the scale. Each tally represents one occurrence. Let’s explain this with simple data as given below.

Following are the arrivals of flights at Trichy Airport today.

Origin              Airline             Flight  Arrival  Status
(DXB) Dubai         Air India Express   612     00:05    Landed
(SIN) Singapore     TigerAir            2668    00:35    Landed
(SHJ) Sharjah       Air India Express   614     02:35    Landed
(CMB) Colombo       Srilankan           131     08:40    Landed
(KUL) Kuala Lumpur  AirAsia             25      08:55    Landed
(KUL) Kuala Lumpur  (MXD) Malindo Air   221     09:45    Landed
(SIN) Singapore     TigerAir            2662    10:10    En Route
(MAA) Chennai       Jet Airways         2748    11:05    Landed
(CMB) Colombo       Srilankan           133     14:30    Landed
(SIN) Singapore     Air India Express   681     15:10    En Route
(KUL) Kuala Lumpur  AirAsia             27      16:35    En Route
(MAA) Chennai       Jet Airways         2411    17:35    Scheduled
(MAA) Chennai       Jet Airways         2789    21:25    Scheduled
(KUL) Kuala Lumpur  AirAsia             23      21:45    Scheduled
(KUL) Kuala Lumpur  (MXD) Malindo Air   223     22:35    Scheduled
(SIN) Singapore     TigerAir            2664    22:50    Scheduled
(KUL) Kuala Lumpur  AirAsia             29      23:45    Scheduled

I want to do a timeline analysis of how many flights landed during different parts of the day.

Let’s perform a frequency distribution. I want to classify based on 6-hour intervals.

So our class count is given as below

24 hours / 6-hour interval = 4 classes

Generally, the number of classes is identified as 1 + 3.3 log(n), where n is the number of observations in the data.

1 + 3.3 log(17) ≈ 5. Anyway, for my own convenience, I chose 6-hour intervals (4 classes) here.

What’s the lowest value given in the above table? 00:05

What’s the highest value given? 23:45

Now let’s identify our class width:

(highest – lowest) / number of classes ≈ 23.7 hours / 4 ≈ 5.9 hours. Let’s round it to 6.

Our class width is 6 hours now.

Following is the FD table. Lower class limit and Upper Class Limit denotes X and Frequency denotes f.

Lower Class Limit  Upper Class Limit  Frequency
00:00              06:00              3
06:00              12:00              5
12:00              18:00              4
18:00              00:00              5
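
The same frequency distribution can be reproduced in R with cut(), after entering the arrival times as decimal hours:

```r
# Arrival times from the flight table, as decimal hours
arrivals <- c(0 + 5/60, 0 + 35/60, 2 + 35/60, 8 + 40/60, 8 + 55/60,
              9 + 45/60, 10 + 10/60, 11 + 5/60, 14 + 30/60, 15 + 10/60,
              16 + 35/60, 17 + 35/60, 21 + 25/60, 21 + 45/60, 22 + 35/60,
              22 + 50/60, 23 + 45/60)
# right = FALSE gives left-closed classes: [0,6), [6,12), [12,18), [18,24)
fd <- table(cut(arrivals, breaks = c(0, 6, 12, 18, 24), right = FALSE))
fd  # 3, 5, 4, 5 flights per 6-hour class
```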

Excel would do this job in no time! Following is the output from the Excel Histogram function.

Bin     Frequency
00:00   0
06:00   3
12:00   5
18:00   4
More    5

Types of Frequency Distribution – Skewing

The above graph for Trichy airport, does it show us any trend? Yes, it does. Please look at the below given graph with a trend line. We see a tail on the left side and the head on the right side. We call this behaviour skew!

When the head is on the left side and the tail is on the right side, we call the skew positive. Vice versa is called negative skew. When you see a bell-like trend, peaked in the center with the tails extending uniformly to the left and right, it is called a symmetric distribution.

Here you go. See you in the next post.