Mean Deviation or Average Deviation

I spoke about positional measures in my previous post Measures of dispersion – Range, Quartile Deviation.

Mean Deviation

Let’s discuss mean deviation, which is the arithmetic mean of the absolute deviations of a data series from a measure of central tendency (mean, median or mode).

Mean Deviation = Σ |D| / N

Σ |D| is the sum of the absolute deviations and N is the number of observations.

The deviations may be measured from any measure of central tendency (mean, median or mode); their signs are ignored, so only absolute values are added.

Coefficient of Mean Deviation

Coefficient of mean deviation is obtained by dividing the mean deviation by the average used for calculating the mean deviation.
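Written as formulas, the two definitions so far look like this. The symbol A is only an assumed shorthand for whichever average (mean, median or mode) the deviations were taken from; it is not notation from the post.

```latex
\text{Mean Deviation} = \frac{\sum |D|}{N},
\qquad
\text{Coefficient of Mean Deviation} = \frac{\text{Mean Deviation}}{A}
```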

How to calculate Mean Deviation?

  1. Calculate the mean, median or mode of the data series
  2. Take the deviation of each item from that average, ignoring the + and – signs. Denote these deviations by |D|
  3. Compute the total of these deviations to get Σ |D|
  4. Divide the total by number of items.

Data set:

100, 150, 200, 250, 360, 490, 500, 600, 671

X      |D| from mean (369)    |D| from median (360)
100    269 (369 – 100)        260 (360 – 100)
150    219                    210
200    169                    160
250    119                    110
360    9                      0
490    121                    130
500    131                    140
600    231                    240
671    302                    311

ΣX = 3321

Average or Mean is 3321/9 = 369

Median is 360

Σ |D| from the mean = 1570; Σ |D| from the median = 1561

Mean Deviation from the mean = Σ |D|/N = 1570/9 = 174.44

Coefficient of mean deviation (from the mean) = Mean Deviation/Mean = 174.44/369 = 0.47

Mean Deviation from the median = Σ |D|/N = 1561/9 = 173.44

Coefficient of mean deviation (from the median) = Mean Deviation/Median = 173.44/360 = 0.48
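The same calculation can be sketched in Python using only the standard library. The helper name mean_deviation is made up for this illustration; it simply follows the four steps listed earlier.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average of the absolute deviations of `values` from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


data = [100, 150, 200, 250, 360, 490, 500, 600, 671]

avg = mean(data)    # 3321 / 9
mid = median(data)  # middle item of the sorted series

md_mean = mean_deviation(data, avg)    # deviations taken from the mean
md_median = mean_deviation(data, mid)  # deviations taken from the median

print(f"From the mean:   MD = {md_mean:.2f}, coefficient = {md_mean / avg:.2f}")
print(f"From the median: MD = {md_median:.2f}, coefficient = {md_median / mid:.2f}")
```

Passing the center in as an argument keeps one function usable for the mean, the median or the mode.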

 

Measures of dispersion – Range, Quartile Deviation

Happy Independence Day

My previous post about measures of central tendency was simple and interesting. But, my dear budding data scientists, a central value alone doesn’t describe the whole population. India has a variety of states. While the southern states are performing reasonably well, the north Indian states may bring down the average for the entire country. In this situation, how accurate would our decisions be if they rested on that statistical analysis alone?

So, we need to look at the dispersion of the samples as well. In everyday usage, dispersion is the process of distributing something or someone over an area. In statistics, it describes how scattered or how tightly packed our samples are. We talked about measurements at the center in the previous post. Now let’s talk about how far the samples are from the center.

Methods of measuring dispersion

I want to write about the following now.

  1. Range
  2. Mean Deviation or Standard Deviation
  3. Lorenz Curve

Range

It is a rough measure of dispersion because it is based only on the two extreme items, not on all the available items.

Range R = Largest value L – Smallest Value S

Coefficient of Range = (L-S)/(L+S)

The following are the percentage marks obtained by students in class 10-A, which is doing well in studies.

90, 95, 93, 99, 100

Range R = L – S

R = 100 – 90

R = 10.

Coefficient of Range = (L – S)/(L + S)

C = (100 – 90)/(100 + 90)

C = 10/190 = 0.0526.

The following are the percentage marks obtained by students in class 10-B, which has a variety of students.

10, 25, 17, 80, 40

Range R = L – S

R = 80 – 10

R = 70.

Coefficient of Range = (L – S)/(L + S)

C = (80 – 10)/(80 + 10)

C = 70/90 = 0.778

The following are the percentage marks obtained by students in class 10-C, which is doing poorly in studies.

10, 15, 13, 19, 20

Range R = 20 – 10 = 10

Coefficient of Range = (20 – 10)/(20 + 10) = 10/30 = 0.333

The following are the percentage marks obtained by students in class 10-D, which is doing well except for a few students.

90, 95, 93, 20, 100

Range R = 100 – 20 = 80

Coefficient of Range = (100 – 20)/(100 + 20) = 80/120 = 0.667

Class   % of marks             Max   Min   Range   Coefficient of Range
10-A    90, 95, 93, 99, 100    100   90    10      0.052631579
10-B    10, 25, 17, 80, 40     80    10    70      0.777777778
10-C    10, 15, 13, 19, 20     20    10    10      0.333333333
10-D    90, 95, 93, 20, 100    100   20    80      0.666666667

  • Range is used in industrial quality assurance (QA) to identify samples that fall outside the accepted range.
  • Range is used to identify variation in the prices of commodities.
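The four class examples can be reproduced with a short Python sketch. The function names value_range and coefficient_of_range are invented for this illustration.

```python
def value_range(values):
    """Range R = largest value L minus smallest value S."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Coefficient of Range = (L - S) / (L + S)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


classes = {
    "10-A": [90, 95, 93, 99, 100],
    "10-B": [10, 25, 17, 80, 40],
    "10-C": [10, 15, 13, 19, 20],
    "10-D": [90, 95, 93, 20, 100],
}

for name, marks in classes.items():
    r = value_range(marks)
    c = coefficient_of_range(marks)
    print(f"{name}: range = {r}, coefficient of range = {c:.3f}")
```

Note how a single low mark of 20 in class 10-D pushes its range to 80, which is why range is only a rough measure.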

Inter-Quartile Range & Quartile Deviation

To avoid the influence of extreme values, let’s eliminate the lowest 25% and the highest 25% of the items in the series. As the measure of dispersion, we use the distance between the first and third quartiles, which is called the inter-quartile range.

Interquartile range = Third Quartile Q3 – First Quartile Q1

Semi-interquartile range = Interquartile range / 2

Let’s take the data set given below.

Age Members
20 3
30 61
40 132
50 153
60 140
70 51
80 3

Compute the cumulative frequency (c.f.):

Age members c.f
20 3 3
30 61 64
40 132 196
50 153 349
60 140 489
70 51 540
80 3 543

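First quartile Q1 = value of the (N+1)/4th item

Q1 = value of the (543+1)/4th item

Q1 = value of the 136th item

The first cumulative frequency that reaches 136 is 196, which belongs to age 40 in the above table. So Q1 = 40.

Third quartile Q3 = value of the 3 × (N+1)/4th item

Q3 = value of the 408th item. The first cumulative frequency that reaches 408 is 489, which belongs to age 60. So Q3 = 60.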

Quartile deviation is QD = (Q3 – Q1) / 2

QD = (60-40)/2

QD = 10

Coefficient of Quartile Deviation = (Q3-Q1)/(Q3+Q1)

c = (60 – 40)/(60 + 40)

c = 20/100 = 0.2
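The quartile lookup above can be sketched in Python. The helper value_at is a made-up name; it returns the age whose cumulative frequency first reaches the requested item position, which is the rule used in the worked example (no interpolation within a class is attempted).

```python
from bisect import bisect_left
from itertools import accumulate

ages = [20, 30, 40, 50, 60, 70, 80]
members = [3, 61, 132, 153, 140, 51, 3]

# Running total of the frequencies: the c.f. column of the table.
cumulative = list(accumulate(members))
n = cumulative[-1]  # total number of members


def value_at(position):
    """Age whose cumulative frequency first reaches `position`."""
    return ages[bisect_left(cumulative, position)]


q1 = value_at((n + 1) / 4)      # 136th item
q3 = value_at(3 * (n + 1) / 4)  # 408th item

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)

print(f"Q1 = {q1}, Q3 = {q3}")
print(f"Quartile deviation = {quartile_deviation}")
print(f"Coefficient of quartile deviation = {coefficient_of_qd}")
```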

See you in another post.

 

Measures of Central Tendency

A measure of central tendency gives us an idea about where the middle of the data lies.

We shall discuss the following in this post.

  • Mean
  • Median
  • Mode

Mean

  • It can be used with both discrete and continuous data
  • It is equal to the sum of all values divided by the number of values
  • It is usually represented by x-bar, x̄.

x̄ = Σx/n

Let’s take the following example to explain the mean. Here is the data about the length of the runways in the commercial airports of Tamil Nadu.

Airport Runway length in meters
Chennai – I 3,658
Chennai – II 2,925
Trichy 2,480
Madurai 2,285
Coimbatore 3,120
Tuticorin 1,351
Salem 1,806

Hence the mean length of the runways is

x̄ or μ = (3658+2925+2480+2285+3120+1351+1806) / 7
μ = 17,625/7
μ ≈ 2,518 meters!

You may use the AVERAGE function in Excel: =AVERAGE(D2:D8).
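For readers who prefer code to a spreadsheet, here is a minimal Python sketch of the same calculation using the standard statistics module. The variable names are chosen for this illustration only.

```python
from statistics import mean

# Runway lengths in meters, taken from the table above.
runway_lengths_m = {
    "Chennai – I": 3658,
    "Chennai – II": 2925,
    "Trichy": 2480,
    "Madurai": 2285,
    "Coimbatore": 3120,
    "Tuticorin": 1351,
    "Salem": 1806,
}

total = sum(runway_lengths_m.values())
mean_length = mean(runway_lengths_m.values())  # total / number of airports

print(f"Sum = {total}, mean = {mean_length:.0f} meters")
```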

Median

When we sort the data in ascending or descending order, the item found in the middle is called the median.

Let’s take the same example in ascending order.

Airport Runway length in meters Average
Tuticorin 1,351 2,518
Salem 1,806 2,518
Madurai 2,285 2,518
Trichy 2,480 2,518
Chennai – II 2,925 2,518
Coimbatore 3,120 2,518
Chennai – I 3,658 2,518

So we have 7 values: 1351, 1806, 2285, 2480, 2925, 3120, 3658. The value in the middle is 2480, which is our median.

If we have an even number of samples, the average of the two middle values is the median.

Excel function: =MEDIAN(D2:D8)
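A minimal Python sketch of both cases follows. The six-item list is only an illustration of the even-count rule, made by dropping the longest runway from the data; it is not a second data set from the post.

```python
from statistics import median

lengths = [3658, 2925, 2480, 2285, 3120, 1351, 1806]

# median() sorts internally, so the input order does not matter.
odd_median = median(lengths)  # 7 items: the 4th item of the sorted list

# Illustration only: drop the longest runway to get an even count.
six_lengths = sorted(lengths)[:-1]
even_median = median(six_lengths)  # average of the two middle items

print(f"Median of 7 items: {odd_median}")
print(f"Median of 6 items: {even_median}")
```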

Mode

The value that occurs most frequently in the data is called the mode. Let’s modify the data given above slightly to explain the mode.

Airport Runway length in meters
Tuticorin 1,000
Salem 2,000
Madurai 2,000
Trichy 2,000
Chennai – II 3,000
Coimbatore 3,000
Chennai – I 4,000

2000 is repeated thrice. Hence this is the mode of this data set.

Excel function: =MODE.SNGL(E2:E8)
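The same result can be sketched in Python. statistics.mode returns the most frequent value, and collections.Counter is used here only to show how often it occurs.

```python
from collections import Counter
from statistics import mode

# Modified runway lengths in meters, from the table above.
lengths = [1000, 2000, 2000, 2000, 3000, 3000, 4000]

most_common_value = mode(lengths)

# Counter.most_common(1) returns a list with one (value, count) pair.
value, count = Counter(lengths).most_common(1)[0]

print(f"Mode = {most_common_value}, repeated {count} times")
```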

 

Statistics – Basic terminologies

I’m starting a new series of blog posts after my Big Data series. I’ll be starting with statistical concepts, and I’m planning to take it as far as programming with R. Let’s see how it goes.

Let’s start with the jargon, or terminology.

We’ll be discussing the following.

  1. Data
  2. Population
  3. Sample
  4. Sampling
  5. Characteristic
  6. Variable & Attribute
  7. Parameter & Statistic

Data

We observe numerical figures for a desired characteristic. A collection of such numerical figures is called data.

Let’s take the table given below. It shows the number of flights operated by different airlines for Tier 2 cities of the Indian state of Tamil Nadu. This is a collection of numerical figures, which we call a data set.

[Table image: the number of flights operated by different airlines for Tier 2 cities of Tamil Nadu]

We may classify the data into two categories.

  1. Categorical Data (or Qualitative Data) – Examples: Weight = "low", Height = "short"
  2. Numerical Data (or Quantitative Data) – Examples: Height = 1.8 m, Weight = 70 kg

Population, Sample and Sampling

Statistical investigations are always performed on a collection of individuals, their attributes and metrics. Such a collection is called the population. For example, India is a vast country with a population of 1.3 billion people.

[Figure: Population, Sampling and Samples]

A finite subset of the population is a sample. When we want to know what Indian people think about China, it is impractical to consult all 1.3 billion people. We choose a small set of people to survey. This small set is called a sample: a small part of something, used to represent the whole.

The process of selecting a sample is called sampling. Each sample/observation may measure different properties.

Characteristic

A quality possessed by a sample is called a characteristic. Examples are the height of individuals or the nationality of a group of passengers.

Variable & Attribute

If a characteristic is measurable, it is a variable. It is usually measured in numbers. Examples are age and height.

Check the table given below. The numbers of flights operated by Air India, Silk Air etc. are variables.

[Table image: the number of flights operated by different airlines for Tier 2 cities of Tamil Nadu]

If the characteristic cannot be measured, it is an attribute. Examples are marital statuses: single, married, widowed, divorced etc.

The following is an example of an attribute. Delhi, Bangalore and Mumbai are not numerical measures.

[Figure: Airline Hub; attribute variable comparison]

Parameter & Statistic

A parameter is a characteristic of a population. A statistic is a characteristic of a sample.

[Figure: Population, Sampling and Samples]

Let’s take the example again. What is the mean salary of Indians? This mean salary is a characteristic. When you answer this question from the whole population, it is called the population mean μ. If you answer it from a sample, it is called the sample mean x̄.
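The difference between a parameter and a statistic can be sketched in Python. The salaries below are entirely made up for illustration (an evenly spaced list, not real Indian salary data), and the sample size of 100 and the seed are arbitrary assumptions.

```python
import random
from statistics import mean

# Hypothetical population: 10,000 invented salaries, NOT real data.
population = [20_000 + 10 * i for i in range(10_000)]

random.seed(42)  # fixed seed so the sketch gives the same sample each run
sample = random.sample(population, k=100)  # sampling without replacement

population_mean = mean(population)  # parameter (μ): uses every member
sample_mean = mean(sample)          # statistic (x̄): uses only the sample

print(f"Population mean μ = {population_mean}")
print(f"Sample mean x̄     = {sample_mean}")
```

The sample mean will usually land near the population mean but not exactly on it, and a different sample would give a different value.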