Mean Deviation or Average Deviation

I spoke about positional measures in my previous post Measures of dispersion – Range, Quartile Deviation.

Mean Deviation

Let’s discuss Mean Deviation, which is the arithmetic mean of the absolute deviations of a data series, computed from any measure of central tendency (mean, median or mode).

Mean Deviation = Σ |D| / N

Σ |D| is the total of the absolute deviations and N is the number of items.

The deviations can be taken from any measure of central tendency, and each one is used as an absolute value, ignoring its sign.

Coefficient of Mean Deviation

Coefficient of mean deviation is obtained by dividing the mean deviation by the average used for calculating the mean deviation.

How to calculate Mean Deviation?

  1. Calculate the mean, median or mode of the data series
  2. Take the deviation of each item from that average, ignoring the + and – signs. Denote these deviations by |D|
  3. Compute the total of these deviations to get Σ |D|
  4. Divide the total by the number of items.

Data set:

100, 150, 200, 250, 360, 490, 500, 600, 671

X      |D| = |X – x̄|      |D| = |X – Median|
100    269 (369 – 100)    260 (360 – 100)
150    219                210
200    169                160
250    119                110
360    9                  0
490    121                130
500    131                140
600    231                240
671    302                311
ΣX = 3321

Average or Mean is 3321/9 = 369

Median is 360

Σ |D| from the mean = 1570; Σ |D| from the median = 1561

Mean Deviation from the mean = Σ |D|/N = 1570/9 = 174.44

Coefficient of mean deviation = Mean Deviation / Mean = 174.44/369 = 0.47

Mean Deviation from the median = Σ |D|/N  =1561/9 = 173.44

Coefficient of mean deviation from the median = Mean Deviation from the median / Median = 173.44/360 = 0.48
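The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `mean_deviation` is my own, and it uses just the standard `statistics` module.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average of the absolute deviations |D| of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


data = [100, 150, 200, 250, 360, 490, 500, 600, 671]

avg = mean(data)    # 3321 / 9 = 369
med = median(data)  # middle item of the sorted series = 360

md_from_mean = mean_deviation(data, avg)
md_from_median = mean_deviation(data, med)

# Mean deviation and its coefficient, first from the mean, then from the median
print(round(md_from_mean, 2), round(md_from_mean / avg, 2))      # 174.44 0.47
print(round(md_from_median, 2), round(md_from_median / med, 2))  # 173.44 0.48
```

The same helper works for a deviation from the mode: pass the mode as `center`.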



Measures of dispersion – Range, Quartile Deviation

Happy Independence Day


My previous post about measures of central tendency was simple and interesting. But, my dear budding data scientists, a central value alone doesn’t represent the whole population completely. India has a variety of states. While the southern states are performing reasonably well, the north Indian states may bring down the efficiency of the entire country. In this situation, how accurate would the decisions from our statistical analysis be?

So, we need to look at the dispersion of samples as well. In everyday usage, dispersion is the process of distributing something or someone over an area. In statistics, it describes how scattered or how squeezed our samples are. We talked about measurements at the center in the previous post. Now let’s talk about how far the samples are from the center.

Methods of measuring dispersion

I want to write about the following now.

  1. Range
  2. Mean Deviation or Standard Deviation
  3. Lorenz Curve


Range

Range is a rough measure of dispersion that is based only on the two extreme items, not on all the available items.

Range R = Largest value L – Smallest Value S

Coefficient of Range = (L-S)/(L+S)

Following are the percentages of marks obtained by students in class 10-A, which is doing well in studies.

90, 95, 93, 99, 100

Range R = L – S

R = 100 – 90

R = 10.

Coefficient of Range = (L – S)/(L + S)

C = (100 – 90)/(100 + 90)

C = 10/190 = 0.0526.

Following are the percentages of marks obtained by students in class 10-B, which has a variety of students.

10, 25, 17, 80, 40

Range R = L – S

R = 80 – 10

R = 70.

Coefficient of Range = (L – S)/(L + S)

C = (80 – 10)/(80 + 10)

C = 70/90 = 0.778

Following are the percentages of marks obtained by students in class 10-C, which is doing poorly in studies.

10, 15, 13, 19, 20

Range = 20 – 10 = 10

Coefficient of Range = 10/30 = 0.333

Following are the percentages of marks obtained by students in class 10-D, which is doing well except for some students.

90, 95, 93, 20, 100

Range = 100 – 20 = 80

Coefficient of Range = 80/120 = 0.667

Class   % of marks            Max   Min   Range   Coefficient of Range
10A     90, 95, 93, 99, 100   100   90    10      0.052631579
10B     10, 25, 17, 80, 40    80    10    70      0.777777778
10C     10, 15, 13, 19, 20    20    10    10      0.333333333
10D     90, 95, 93, 20, 100   100   20    80      0.666666667
  • Range is used in industrial quality assurance (QA) to identify samples that are not within the accepted range.
  • Range is used to identify the variation in the prices of commodities.
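The four classes above can be recomputed with a short Python sketch. The function name `range_and_coefficient` is my own choice for illustration.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) using R = L - S and C = (L - S)/(L + S)."""
    largest, smallest = max(values), min(values)
    value_range = largest - smallest
    return value_range, value_range / (largest + smallest)


classes = {
    "10A": [90, 95, 93, 99, 100],
    "10B": [10, 25, 17, 80, 40],
    "10C": [10, 15, 13, 19, 20],
    "10D": [90, 95, 93, 20, 100],
}

for name, marks in classes.items():
    r, c = range_and_coefficient(marks)
    print(name, r, round(c, 3))
```

Note how a single low mark in 10D moves the range from 10 to 80. This is why range is only a rough measure.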

Inter-Quartile Range & Quartile Deviation

To avoid the influence of extreme values, let’s eliminate the lowest 25% and the highest 25% of the items in the series. To obtain a measure of dispersion, we shall use the distance between the first and third quartiles, which is called the inter-quartile range.

Interquartile range = Third Quartile Q3 – First Quartile Q1

Semi-interquartile range = Interquartile range / 2

Let’s take the below given data set

Age Members
20 3
30 61
40 132
50 153
60 140
70 51
80 3

Compute the cumulative frequency (c.f.)

Age Members c.f.
20 3 3
30 61 64
40 132 196
50 153 349
60 140 489
70 51 540
80 3 543

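First quartile Q1 = value of the (N+1)/4th item

Q1 = (543+1)/4

Q1 = 136th item

The 136th item falls in the row whose cumulative frequency (196) is the first to reach 136, so Q1 = 40.

Third quartile Q3 = value of the 3 × (N+1)/4th item

Q3 = 3 × 136 = 408th item, which falls in the row with cumulative frequency 489, so Q3 = 60.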

Quartile deviation is QD = (Q3 – Q1) / 2

QD = (60-40)/2

QD = 10

Coefficient of Quartile Deviation = (Q3-Q1)/(Q3+Q1)

c = (60 – 40)/(60 + 40)

c = 0.2
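Here is a small Python sketch of the same quartile lookup on the age table. The helper `item_value` is a name I made up. It simply walks the cumulative frequencies until they reach the requested item position.

```python
from itertools import accumulate

ages = [20, 30, 40, 50, 60, 70, 80]
members = [3, 61, 132, 153, 140, 51, 3]


def item_value(position, values, frequencies):
    """Return the value whose cumulative frequency first reaches `position`."""
    for value, cf in zip(values, accumulate(frequencies)):
        if cf >= position:
            return value
    raise ValueError("position is beyond the last item")


n = sum(members)                                 # N = 543
q1 = item_value((n + 1) / 4, ages, members)      # 136th item
q3 = item_value(3 * (n + 1) / 4, ages, members)  # 408th item

quartile_deviation = (q3 - q1) / 2
coefficient = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, coefficient)   # 40 60 10.0 0.2
```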

See you in another post.


Measures of Central Tendency

A measure of central tendency gives us an idea about where the middle of the data lies.


We shall discuss the following in this post.

  • Mean
  • Median
  • Mode


Mean

  • It can be used with both discrete and continuous data
  • It is equal to the sum of all values divided by the number of values
  • Usually it is represented by x-bar, x̄.

x̄ = Σx/n

Let’s take the following example to explain mean. Here is the data about the length of the runways in commercial airports of Tamil Nadu.

Airport Runway length in meters
Chennai – I 3,658
Chennai – II 2,925
Trichy 2,480
Madurai 2,285
Coimbatore 3,120
Tuticorin 1,351
Salem 1,806

Hence the mean length of runways is

x̄  or μ = (3658 + 2925 + 2480 + 2285 + 3120 + 1351 + 1806) / 7
μ = 17,625/7
μ ≈ 2518 meters!

You may use average function in Excel – =AVERAGE(D2:D8).
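If you prefer Python over Excel, the standard `statistics` module gives the same result. This is a minimal sketch using the runway lengths from the table above. The variable names are my own.

```python
from statistics import mean

runway_lengths = {
    "Chennai – I": 3658,
    "Chennai – II": 2925,
    "Trichy": 2480,
    "Madurai": 2285,
    "Coimbatore": 3120,
    "Tuticorin": 1351,
    "Salem": 1806,
}

lengths = list(runway_lengths.values())
total = sum(lengths)          # Σx = 17625
mean_length = mean(lengths)   # Σx / n

print(total, round(mean_length))  # 17625 2518
```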


Median

When we sort the data in ascending or descending order, the item found in the middle is called the median.

Let’s take the same example in ascending order.

Airport Runway length in meters Average
Tuticorin 1,351 2,518
Salem 1,806 2,518
Madurai 2,285 2,518
Trichy 2,480 2,518
Chennai – II 2,925 2,518
Coimbatore 3,120 2,518
Chennai – I 3,658 2,518

So in 1351, 1806, 2285, 2480, 2925, 3120, 3658 we have 7 items. Let’s take the item in the middle, the 4th one, which is 2480. That is our median.

If we have an even number of samples, the average of the two middle values is the median.

Excel function: =MEDIAN(D2:D8)
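The same can be tried in Python. In this sketch, the six-item list is only an illustration I derived by dropping the longest runway, to show the even-count rule.

```python
from statistics import median

lengths = [3658, 2925, 2480, 2285, 3120, 1351, 1806]

# Odd number of items: the middle item of the sorted data
odd_median = median(lengths)

# Even number of items (longest runway dropped for illustration):
# the average of the two middle items, 2285 and 2480
even_median = median(sorted(lengths)[:-1])

print(odd_median, even_median)  # 2480 2382.5
```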


Mode

The value observed most frequently in the data is called the mode. Let’s modify the above given data slightly to explain mode.

Airport Runway length in meters
Tuticorin 1,000
Salem 2,000
Madurai 2,000
Trichy 2,000
Chennai – II 3,000
Coimbatore 3,000
Chennai – I 4,000

2000 is repeated thrice. Hence this is the mode of this data set.

Excel function: =MODE.SNGL(E2:E8)
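A Python equivalent, as a small sketch: `statistics.mode` returns a single most common value, while `statistics.multimode` (Python 3.8 or later) returns all values that tie for most common.

```python
from statistics import mode, multimode

lengths = [1000, 2000, 2000, 2000, 3000, 3000, 4000]

most_common = mode(lengths)     # the single most frequent value
all_modes = multimode(lengths)  # every value tied for most frequent

print(most_common, all_modes)   # 2000 [2000]
```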


Frequency Distribution

Let’s talk about Frequency distribution today.

I wrote about various data collection and sampling methods in my previous blog post Sampling Techniques.

After data collection or sampling, the first task a researcher does is organizing or categorizing the data. It helps him/her get an overview of the data set. Frequency distribution is a simple method for this stage.

It contains at least two columns

  1. Scale of Measurement – X
  2. Frequency – f

The X column lists the values from the minimum to the maximum, without missing any value.

The f column contains the tallies for each value on the scale. Each tally represents one occurrence. Let’s explain this with a simple data set as given below.

Following are the flight arrivals at Trichy Airport today.

Origin Airline Flight Arrival Status
(DXB) Dubai Air India Express 612 00:05 Landed
(SIN) Singapore TigerAir 2668 00:35 Landed
(SHJ) Sharjah Air India Express 614 02:35 Landed
(CMB) Colombo Srilankan 131 08:40 Landed
(KUL) Kuala Lumpur AirAsia 25 08:55 Landed
(KUL) Kuala Lumpur (MXD) Malindo Air 221 09:45 Landed
(SIN) Singapore TigerAir 2662 10:10 En Route
(MAA) Chennai Jet Airways 2748 11:05 Landed
(CMB) Colombo Srilankan 133 14:30 Landed
(SIN) Singapore Air India Express 681 15:10 En Route
(KUL) Kuala Lumpur AirAsia 27 16:35 En Route
(MAA) Chennai Jet Airways 2411 17:35 Scheduled
(MAA) Chennai Jet Airways 2789 21:25 Scheduled
(KUL) Kuala Lumpur AirAsia 23 21:45 Scheduled
(KUL) Kuala Lumpur (MXD) Malindo Air 223 22:35 Scheduled
(SIN) Singapore TigerAir 2664 22:50 Scheduled
(KUL) Kuala Lumpur AirAsia 29 23:45 Scheduled

I want to do a timeline analysis of how many flights arrive during different parts of the day.

Let’s perform a frequency distribution. I want to classify the arrivals in 6-hour intervals.

So our class count is given as below.

24 hours / 6-hour interval = 4 classes

Generally, the number of classes is estimated as 1 + 3.3 log(n), where n is the number of observations in the data and log is the base-10 logarithm.

1 + 3.3 log(17) ≈ 5. Anyway, for my own convenience, I chose 4 classes here.
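The rule of thumb above is easy to check in Python. This is a sketch with a function name of my own. The logarithm is base 10, which is what gives the value 5 for 17 observations.

```python
import math


def suggested_class_count(n_observations):
    """Rule of thumb from the text: 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n_observations)


k = suggested_class_count(17)
print(round(k, 2), round(k))  # 5.06 5
```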


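What’s the lowest value given in the above table? 00:05

What’s the highest value given? 23:45

Now let’s identify our class width.

Class width = (highest value – lowest value) / number of classes = (23:45 – 00:05) / 4

That is about 5.9 hours. Let’s round it to 6.

Our class width is 6 hours now.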

Following is the FD table. The lower and upper class limits denote X, and Frequency denotes f.

Lower Class Limit Upper Class Limit Frequency
00:00 06:00 3
06:00 12:00 5
12:00 18:00 4
18:00 00:00 5
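The table above can be reproduced with a few lines of Python. This is just a sketch: I typed the arrival times from the flight list by hand, and `class_index` is a helper name of my own.

```python
from collections import Counter

arrivals = [
    "00:05", "00:35", "02:35", "08:40", "08:55", "09:45", "10:10", "11:05",
    "14:30", "15:10", "16:35", "17:35", "21:25", "21:45", "22:35", "22:50",
    "23:45",
]

CLASS_WIDTH_HOURS = 6


def class_index(hhmm):
    """Map an 'HH:MM' arrival time to its 6-hour class (0, 1, 2 or 3)."""
    hours, _minutes = hhmm.split(":")
    return int(hours) // CLASS_WIDTH_HOURS


counts = Counter(class_index(t) for t in arrivals)

for i in range(24 // CLASS_WIDTH_HOURS):
    lower = i * CLASS_WIDTH_HOURS
    upper = (lower + CLASS_WIDTH_HOURS) % 24
    print(f"{lower:02d}:00 {upper:02d}:00 {counts[i]}")
```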

Excel would do this job in no time! Following is the output from the Excel Histogram function.

Bin Frequency
00:00 0
06:00 3
12:00 5
18:00 4
More 5



Types of Frequency Distribution – Skewing

Does the above graph for Trichy airport show us any trend? Yes, it does. Please look at the graph below with a trend line. We see a tail on the left side, and the head is on the right side. We call this behaviour skew!


When the head is on the left side and the tail is on the right side, we call the skew positive. The opposite is called negative skew. When you see a bell-like shape, high in the center with tails extending uniformly to the left and right, it is called a symmetric distribution.
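The sign of the skew can also be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5th power of the variance). That is one common measure, and I am assuming it here since the post does not name one. The three small data sets are made up purely for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    avg = sum(values) / n
    m2 = sum((x - avg) ** 2 for x in values) / n
    m3 = sum((x - avg) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 9]  # head on the left, tail on the right
left_tailed = [-x for x in right_tailed]       # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # True: positive skew
print(skewness(left_tailed) < 0)   # True: negative skew
print(skewness(symmetric))         # 0.0: symmetric
```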

Here you go. See you in next post.

Population, Sampling and Samples

Sampling Techniques

This post assumes you have seen the basic concepts of statistics mentioned in my previous post Statistics – Basic terminologies.

Population, Sampling and Samples


Let’s discuss sampling in detail in this post, because sampling is crucial in your data analysis. The higher the quality of the samples, the better your results would be.


Population

The collection of all units of a specified type in a given region at a particular point of time is called a population or universe.


  • Population of persons in a region
  • Population of trees or birds in a forest

Sampling Unit

An elementary unit, or a group of such units, which besides being clearly defined, identifiable and observable is also convenient for the purpose of sampling, is called a sampling unit.


  • A family in a family budget survey
  • A farm, or a group of farms of a single household, in a crop survey.

Sampling Frame

A map showing the boundaries of the sampling units is a sampling frame.

A list of all sampling units belonging to the population to be studied, with their identification particulars, is also a sampling frame.


  • List of farms in villages of India

Random Sampling or Probability Sampling

One or more sampling units selected from a population according to some specified procedure are said to constitute a sample. If the selection is governed by ascertainable laws of chance, it is called a random or probability sample.

Assume a population consists of N sampling units U1, U2, U3, ….. UN. We select a sample of n units by selecting them unit by unit, with equal probability for every unit at each draw, with or without replacing the sampling units selected in the previous draws.


  • Select one person from each economic tier
  • Select one tax payer from each tax slab

Non-Random Sampling

A sample selected by a non-random process is called a non-random sample. A non-random sample which is drawn using a certain amount of judgement, with a view to getting the right samples, is called a judgement or purposive sample. This type of sampling is seldom used in large-scale surveys, because it is not possible to get strictly valid estimates of the population parameters under consideration.

To identify a population uniquely, one should know the following.

  • Elementary units
  • Population characteristics, which vary numerically from unit to unit.
Population                                    Elementary unit   Characteristic
Students of a particular class of a school    Each student      Marks obtained in final exam
LED TVs produced by a company                 Each TV           Length of life in years

Types of Population

  1. Finite (a countable number of elementary units): e.g., the odd numbers up to a fixed limit (1, 3, 5 …)
  2. Infinite (infinite and uncountable elementary units): e.g., pressure, humidity
  3. Real (existing objects with factual observations): e.g., Indian Census
  4. Hypothetical (hypothetically illustrated by the investigator): e.g., coin tossing, dice throwing

Sampling procedure

Steps and examples are given below

Sampling Process.png

Types of Sampling

We already talked about Probability and non-probability methods. Let’s talk about other sampling methods now.


Probability Sampling

Simple Random Sampling

SRS always uses Equal Probability of Selection (EPS), although not all EPS selections are SRS. It is applicable when the population is –

  1. small
  2. homogeneous
  3. readily available

This is done by assigning a number to each unit in the sampling frame. A table of random numbers or a lottery is used to determine which units are selected.


Disadvantages

  • If the sampling frame or the population is large, this is impractical
  • Minority subgroups may not be adequately selected
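In code, the lottery can be played by a random number generator. This is a minimal sketch with Python's standard `random` module. The sampling frame of 20 numbered units is hypothetical, and the seed is fixed only to make the run repeatable.

```python
import random

# Hypothetical sampling frame: every unit gets a number, U1 ... U20
sampling_frame = [f"U{i}" for i in range(1, 21)]

rng = random.Random(42)  # fixed seed, an arbitrary choice for repeatability

# Without replacement: a unit can be drawn only once
without_replacement = rng.sample(sampling_frame, k=5)

# With replacement: the same unit may be drawn again
with_replacement = rng.choices(sampling_frame, k=5)

print(without_replacement)
print(with_replacement)
```

In both cases every unit has the same chance of being picked at each draw.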

Systematic Sampling

This method sorts the population in some order and chooses the samples at regular intervals.

i1, i2, i3, i4, i5, i6, i7, i8, i9

In the above population, the sampled items (marked in red in the original post) are taken at regular intervals. Please note that I have not selected the first or last items of my population.

  1. Every 10th name in telephone directory
  2. Every 5th sapling in a sugarcane farm


Advantages

  • Convenient
  • Selecting the sampling frame is easy
  • Sample evenly spread over entire population


Disadvantages

  • Hidden periodicities in the population may affect the precision
  • Difficult to assess the precision of the estimate from a single survey.
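Systematic sampling maps nicely to list slicing in Python. In this sketch the function name, the interval of 3 and the start position are my own assumptions. The original highlighting is not available, so this shows only one possible selection that skips the first and last items.

```python
import random


def systematic_sample(population, interval, start=None, rng=random):
    """Pick every `interval`-th item, beginning at index `start`.

    If `start` is not given, a random start within the first interval is used.
    """
    if start is None:
        start = rng.randrange(interval)
    return population[start::interval]


population = [f"i{n}" for n in range(1, 10)]  # i1 ... i9

sample = systematic_sample(population, interval=3, start=1)
print(sample)  # ['i2', 'i5', 'i8']
```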

Stratified Sampling

Sampling may be biased. How? Assume I’m sampling from a group of people from different villages. I may select only men, only women, or more of one gender. Such samples may affect my precision. Hence, when the population includes a number of distinct categories, the frame can be organized into separate ‘strata’. Each stratum is then sampled as an independent sub-population, in which individual elements are randomly selected. Men form one stratum, women another. This process is called Stratified Random Sampling.
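A rough Python sketch of the idea: group the units by stratum, then draw a simple random sample inside each group. The villagers, the stratum labels and the sample size of 3 per stratum are all made up for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, per_stratum, rng):
    """Split units into strata, then randomly sample within each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }


# Hypothetical frame: 20 people, each tagged with a stratum
villagers = [(f"P{i}", "men" if i % 2 else "women") for i in range(1, 21)]

rng = random.Random(7)  # fixed seed only for repeatability
picked = stratified_sample(villagers, lambda person: person[1], 3, rng)

for stratum, people in picked.items():
    print(stratum, [name for name, _ in people])
```

Because each stratum is sampled on its own, neither group can be left out by chance.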

Cluster Sampling

Here we do the sampling in a different way. We would be sampling twice.

First, we choose the areas of sampling.

The population is divided into clusters (usually based on geographical locations).

The sampling units are groups rather than individuals (e.g., senior citizens of Palani Murugan Street, flower vendors of Dhendapani shopping complex, etc.).

A sample of such clusters is selected (e.g., senior citizens of Palani Murugan Street).

All units from the selected clusters are studied (all senior citizens of Palani Murugan Street).
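The steps above can be sketched in Python as follows. The street names and members are hypothetical placeholders, and `cluster_sample` is a name I chose.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical clusters: senior citizens grouped by street
clusters = {
    "Street A": ["A1", "A2", "A3"],
    "Street B": ["B1", "B2"],
    "Street C": ["C1", "C2", "C3", "C4"],
    "Street D": ["D1"],
}

rng = random.Random(3)  # fixed seed only for repeatability
studied = cluster_sample(clusters, 2, rng)

for street, people in studied.items():
    print(street, people)
```

The random choice happens at the cluster level only. Inside a chosen cluster, everyone is studied.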

Non-Probability Sampling

Non-random sampling may be used for surveys that are not mission critical, where we are not seriously concerned about the accuracy of the results.

Convenience Sampling

This means collecting samples from the part of a population that is convenient to reach (located nearby, data readily available, etc.). We should not take a decision based on this sampling, as it doesn’t represent the entire population. It is generally used in pilot phases.

For example, taking agricultural yield values from the internet, or surveying only the neighbourhood about the Government.

Judgement Sampling

We may do a sampling based on our own experience and preference. Samples are non-randomly chosen by the researcher based on his own judgement.

For example, a teacher may use his own judgement to pick some students 👩‍🎓👨‍🎓 for an extra coaching class, choosing those whom he considers weak in studies.

Quota Sampling

Quota, ration etc. are not new to India 🇮🇳. Similar to stratified sampling, the population is divided into strata. Then judgement sampling is used to pick the units from each stratum.

Snowball sampling

This means getting efficient and cost-effective samples through referrals from people we already know. For example, when Google released its email service, it chose the users (samples) based on referrals.


Statistics – Basic terminologies

I’m starting a new series of blog posts after my Big Data series. I’d be starting with statistical concepts. I’m planning to take it up to programming with R. Let’s see how it goes.

Let’s start with the jargon, or terminology.

We’d be discussing the following.

  1. Data
  2. Population
  3. Sample
  4. Sampling
  5. Characteristic
  6. Variable & Attribute
  7. Parameter


Data

We observe numerical figures for a desired characteristic. A collection of such numerical figures is called data.

Let’s take the below given table. This denotes the number of flights operated by different Airlines for Tier 2 cities of Tamil Nadu state of India. This is a collection of numerical figures, which we call as a data set.

The number of flights operated by different Airlines for Tier 2 cities of Tamilnadu state of India. This is a collection of numerical figures, which we call as a data set.


We may classify the data into two categories.

  1. Categorical Data (or Qualitative Data) – Examples: Weight = “low”, Height = “short”
  2. Numerical Data (or Quantitative Data) – Examples: Height = 1.8 m, Weight = 70 kg.

Population, Sample and Sampling

Statistical investigations are always performed against a collection of metrics, individuals and their attributes. Such a collection is called a population. For example, India is a vast country with a population of 1.3 billion people.

Population, Sampling and Samples


A finite subset of a population is a sample. When we want to know what Indian people think about China, it is impractical to consult all 1.3 billion people. We choose a small set of people to do the survey. This small set is called a sample. It is a small part of something, used to represent the whole.


The process of selection is Sampling. Each sample/observation/data may measure different properties.


Characteristic

A quality possessed by the units of a sample is called a characteristic. For example, the height of individuals, the nationality of a group of passengers, etc.

Variable & Attribute

If a characteristic is measurable, it is variable. This is usually measured in numbers. For example, age, height etc.

Check the below given table. Number of flights operated by Air India, Silk Air etc are variables.

The number of flights operated by different Airlines for Tier 2 cities of Tamilnadu state of India. This is a collection of numerical figures, which we call as a data set.


If the characteristic cannot be measured, it is an attribute. For example, marital status: single, married, widowed, divorced, etc.

Following is an example of an attribute. Delhi, Bangalore and Mumbai are not numerical measures.

Airline Hub

attribute variable comparison

Parameter & Statistic

A parameter is a characteristic of a population. A statistic is a characteristic of a sample.

Population, Sampling and Samples


Let’s take the example again. What is the mean salary of Indians? This mean salary is a characteristic. When you answer this question from the population, it is called the population mean μ. If you answer it from a sample, it is called the sample mean x̅.
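To see the difference between a parameter and a statistic, here is a toy Python sketch. The salary figures are randomly generated and purely hypothetical. They are not real data about anyone.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed only for repeatability

# Hypothetical population: 10,000 made-up monthly salaries
population = [rng.randint(10_000, 100_000) for _ in range(10_000)]

# Parameter: computed from the whole population
population_mean = mean(population)

# Statistic: computed from a sample of 100 units
sample = rng.sample(population, 100)
sample_mean = mean(sample)

print(round(population_mean), round(sample_mean))
```

The two numbers are usually close but not identical. The sample mean is only an estimate of the population mean.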