Exploring data files with R

I have written about data types and data structures of R in my previous post Working with data types of R. We shall explore a data set in this post.

mtcars is a dataset exists in R already

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

File structure

Very first question would be about the size of the data set.

> dim(mtcars)
[1] 32 11

It contains 32 rows and 11 columns.

Now, how many variables we have in mtcars?

> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

We shall preview the data using head and tail commands.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Similar to unix tail, head commands, you would see first and last 6 records in R.

This is the time to know about the structure of the data set.

> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

So, this is a data frame. Each variables are explained above.

How the data is being stored?

> mode(mtcars)
[1] "list"

It is stored as a list.

Let’s take another dataset available with R – airquality.

> head(airquality, n=10)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10

You may parameterize head commands as I have shown above.

Let’s omit the records with NA.

> aqNoNA=na.omit(airquality)
> dim(aqNoNA)
[1] 111   6
> dim(airquality)
[1] 153   6

So, out new object aqNoNA contains the records without missing cases, totally 111 rows.

We have is.na command to check if the data is NA.

> is.na(airquality)
       Ozone Solar.R  Wind  Temp Month   Day
  [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
  [6,] FALSE    TRUE FALSE FALSE FALSE FALSE
> sum(is.na(airquality))
[1] 44

So totally 44 NAs found.

Summary command gives us minimum, quartile 1, median, mean, 3rd quartile, maximum and number of NAa.

> summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
 NA's   :37       NA's   :7
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0

1st quartile meant of 24% data

2nd quartile meant of 50 percentile of data

3rd quartile meant of 75 percentile of data

We shall filter the summary commands as given below.

> summary(airquality$Ozone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00   18.00   31.50   42.13   63.25  168.00      37

We have given the summary of only one variable Ozone above.

We shall apply other mathematical functions as given below

> sd(aqNoNA$Ozone)
[1] 33.27597

We have calculated the standard deviation for Ozone above. I have taken the data without NA, as my sd operation will fail if I have NA.

Finding the Outlier

> myNumbers <- rep(c(1,4,10, 100), c(5, 5, 5, 1))
> myNumbers
 [1]   1   1   1   1   1   4   4   4   4   4  10  10  10  10  10 100
> mean(myNumbers)
[1] 10.9375
> sd(myNumbers)
[1] 24.04293
> scale(myNumbers)
             [,1]
 [1,] -0.41332316
 [2,] -0.41332316
 [3,] -0.41332316
 [4,] -0.41332316
 [5,] -0.41332316
 [6,] -0.28854636
 [7,] -0.28854636
 [8,] -0.28854636
 [9,] -0.28854636
[10,] -0.28854636
[11,] -0.03899275
[12,] -0.03899275
[13,] -0.03899275
[14,] -0.03899275
[15,] -0.03899275
[16,]  3.70431136
attr(,"scaled:center")
[1] 10.9375
attr(,"scaled:scale")
[1] 24.04293

scaling of the metrics (z score) is calculated using the formula

each sample data – mean / standard deviation

After scaling, if you find any values is greater than ±2, they are called outliers (odd points located away from the central measures). If SD > mean, it is outlier.

Outlier

Other way to test the normality is Shapiro test

> shapiro.test(myNumbers)

	Shapiro-Wilk normality test

data:  myNumbers
W = 0.40055, p-value = 3.532e-07

If the p-value is greater than 0.05, it is normalized data. Here it is not.

Data Analysis

Here comes another interesting part. How to analyze the data, after you upload your files.

To explain this, I’d prepare sample data set first.

> #Employee ID
> sn&lt;-seq(1, 10, 1)
> sn
 [1]  1  2  3  4  5  6  7  8  9 10
> #Employee gender
> gender&lt;-rep(c("male", "female"), c(6,4))
> gender
 [1] "male"   "male"   "male"   "male"   "male"   "male"   "female" "female" "female"
[10] "female"
> #available departments
> dept&lt;-rep(c("Admin", "HR", "Prod", "Contractor"), c(1, 3, 3, 3))
> dept
 [1] "Admin"      "HR"         "HR"         "HR"         "Prod"       "Prod"
 [7] "Prod"       "Contractor" "Contractor" "Contractor"
> #Employee Salary
> sal <- rnorm(10, 1000, 200);
> sal
 [1]  888.2026  876.6272  919.7453 1005.9058 1084.4704 1337.7696  909.4302  801.1482
 [9] 1025.9457 1182.9774
> #Now our data set
> mydataset &lt;- data.frame(sn, gender, sal, dept)
> mydataset
   sn gender       sal       dept
1   1   male  888.2026      Admin
2   2   male  876.6272         HR
3   3   male  919.7453         HR
4   4   male 1005.9058         HR
5   5   male 1084.4704       Prod
6   6   male 1337.7696       Prod
7   7 female  909.4302       Prod
8   8 female  801.1482 Contractor
9   9 female 1025.9457 Contractor
10 10 female 1182.9774 Contractor

So we have serial number, gender, salary and department.

Which gender is majority in the given data? we shall use table function to arrive at a simple frequency distribution table.

> #which gender is more in each dept
> #this is frequency distribution
> table(mydataset$gender)

female   male
     4      6

So we have 4 females and 6 males. Let’s group the above FD by department now.

> table(mydataset$gender, mydataset$dept)

         Admin Contractor HR Prod
  female     0          3  0    1
  male       1          0  3    2
> #assign the results to an object
> freqDis <- table(mydataset$gender, mydataset$dept);
> #transpose the table
> t(freqDis)

             female male
  Admin           0    1
  Contractor      3    0
  HR              0    3
  Prod            1    2

So all departments except contractors, have male as majority. The function t() stands for transpose.

Let’s do a proportion of the data using prop.table()

> #proportion
> prop.table(freqDis)

         Admin Contractor  HR Prod
  female   0.0        0.3 0.0  0.1
  male     0.1        0.0 0.3  0.2

Rather than proportion, % of males and females would give us a better visibility. hence we multiply proportion by 100.

> #Percentage
> prop.table(freqDis)*100

         Admin Contractor HR Prod
  female     0         30  0   10
  male      10          0 30   20

Column sum and row sums are frequently asked in our day today life.

> #sum
> colSums(freqDis)
     Admin Contractor         HR       Prod
         1          3          3          3

Row sum shall be calculated as below.

> rowSums(freqDis)
female   male
     4      6

Salary is interesting part in our profession. I’d like to see who earns more using aggregate() function? Male or Female?

> #who earns more - male or female?
> aggregate(sal~gender, mean, data = mydataset)
  gender       sal
1 female  979.8754
2   male 1018.7868
> #sal is continuous variable
> #gender is categorical variable
> #mean is the function
> #data is our data source

We used mean salary above. Lets use sum now.

> aggregate(sal~gender, sum, data = mydataset)
  gender      sal
1 female 3919.501
2   male 6112.721

Similarly, we use standard deviation.

> aggregate(sal~gender, sd, data = mydataset)
  gender      sal
1 female 163.5837
2   male 175.1006

So salary package for females looks to be more consistent than that of males.

We calculated all the above functions individually. psych package helps to calculate everything in one command.

> install.packages("psych")
Installing package into ‘D:/gandhari/documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/psych_1.7.5.zip'
Content type 'application/zip' length 3966969 bytes (3.8 MB)
downloaded 3.8 MB

package ‘psych’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\pandian\AppData\Local\Temp\Rtmpy2L0Yq\downloaded_packages
> library(psych)
> describe (mydataset)
        vars  n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 10    5.50   3.03   5.50    5.50   3.71   1.00   10.00   9.00  0.00
gender*    2 10    1.60   0.52   2.00    1.62   0.00   1.00    2.00   1.00 -0.35
sal        3 10 1003.22 162.35 962.83  986.66 119.22 801.15 1337.77 536.62  0.71
dept*      4 10    2.80   1.03   3.00    2.88   1.48   1.00    4.00   3.00 -0.20
        kurtosis    se
sn         -1.56  0.96
gender*    -2.05  0.16
sal        -0.72 51.34
dept*      -1.42  0.33

n is total samples, sd is standard deviation etc. All the aggregated functions are calculated for serial numbmer, gender, salary and department.

In the above data, I get only the salary data, which is in 3rd row.

> describe (mydataset[,3])
   vars  n    mean     sd median trimmed    mad    min     max  range skew kurtosis
X1    1 10 1003.22 162.35 962.83  986.66 119.22 801.15 1337.77 536.62 0.71    -0.72
      se
X1 51.34

 

Lets group the above described data by gender, ie, mean, sd for males and females separately.

> describe.by(mydataset, mydataset$gender)

 Descriptive statistics by group
group: female
        vars n   mean     sd median trimmed    mad    min     max  range skew
sn         1 4   8.50   1.29   8.50    8.50   1.48   7.00   10.00   3.00 0.00
gender*    2 4   1.00   0.00   1.00    1.00   0.00   1.00    1.00   0.00  NaN
sal        3 4 979.88 163.58 967.69  979.88 166.64 801.15 1182.98 381.83 0.14
dept*      4 4   2.50   1.00   2.00    2.50   0.00   2.00    4.00   2.00 0.75
        kurtosis    se
sn         -2.08  0.65
gender*      NaN  0.00
sal        -2.04 81.79
dept*      -1.69  0.50
---------------------------------------------------------------
group: male
        vars n    mean     sd median trimmed    mad    min     max  range  skew
sn         1 6    3.50   1.87   3.50    3.50   2.22   1.00    6.00   5.00  0.00
gender*    2 6    2.00   0.00   2.00    2.00   0.00   2.00    2.00   0.00   NaN
sal        3 6 1018.79 175.10 962.83 1018.79 119.22 876.63 1337.77 461.14  0.83
dept*      4 6    3.00   1.10   3.00    3.00   0.74   1.00    4.00   3.00 -0.76
        kurtosis    se
sn         -1.80  0.76
gender*      NaN  0.00
sal        -1.02 71.48
dept*      -0.92  0.45
Warning message:
describe.by is deprecated.  Please use the describeBy function

 

 

 

Advertisements

Mean Deviation or Average Deviation

I spoke about positional measures in my previous post Measures of dispersion – Range, Quartile Deviation.

Mean Deviation

Let’s discuss about Mean Deviation, which is the arithmetic mean of the deviations of a data series computed from any measure of central tendency.

Mean Deviation = Σ |D| / N

Σ |D| is the total of all deviations and N is the number of samples.

Mean deviation is calculated by any measure of central tendency as an absolute value.

Coefficient of Mean Deviation

Coefficient of mean deviation is obtained by dividing the mean deviation by the average used for calculating the mean deviation.

How to calculate Mean Deviation?

  1. Calculate the mean, median or mode of the data series
  2. Take a deviation of items from average, ignoring the + – signs. Mark these deviations using |D|
  3. Compute the total of these deviations to get Σ |D|
  4. Divide the total by number of items.

Data set:

100, 150, 200, 250, 360, 490, 500, 600, 671

X |D| = x-x̄ |D| = X-Median
100 269 (360-100) 260 (360 – 100)
150 219 210
200 169 160
250 119 110
360 9 0
490 121 130
500 131 140
600 231 240
671 302 311
ΣX = 3321

Average or Mean is 3321/9 = 369

Median is 360

Σ |D| = 1570 Σ |D| = 1561

Mean Deviation from the mean = Σ |D|/N = 1570/9 = 174.44

Coefficient of mean deviation = Mean Deviation/Average Mean = 174.4/369 = 0.47

Mean Deviation from the median = Σ |D|/N  =1561/9 = 173.44

Coefficient of median deviation = Median deviation/Median = 173.44/360 = 0.48

DHLJOWjUQAEpxpg.jpg large

 

Measures of Central Tendency

A measure of central tendency gives us an idea about where the middle of the data lies.

389814_483519921673085_1070311293_n

We shall discuss about the following in this post.

  • Mean
  • Median
  • Mode

Mean

  • It shall be used against discrete and continuous data
  • This is equal to the sum of all values divided by the number of data
  • Usually it is represented by x-bar x̄.

x̄ = Σx/n

Lets take the following example to explain mean. Here is the data about the length of the runways in commercial airports of Tamil Nadu.

Airport Runway length in meters
Chennai – I 3,658
Chennai – II 2,925
Trichy 2,480
Madurai 2,285
Coimbatore 3,120
Tuticorin 1,351
Salem 1,806

Hence the mean length of runways

x̄  or μ = 3568+2925+2480+2285+3120+1351+1806 / 7
μ = 17,625/7
μ = 2518 meters!

You may use average function in Excel – =AVERAGE(D2:D8).

Median

When we sort the data in any order, the items found in the middle is called median.

Lets take the same example in ascending order.

Airport Runway length in meters Average
Tuticorin 1,351 2,518
Salem 1,806 2,518
Madurai 2,285 2,518
Trichy 2,480 2,518
Chennai – II 2,925 2,518
Coimbatore 3,120 2,518
Chennai – I 3,658 2,518

So 1351, 1806, 2285, 2480, 2925, 3120, 3658 we have 7 data. Lets take the data in the middle, which is 2480, which is our median.

If we have even number of samples, average of middle values is the median.

Excel function: =MEDIAN(D2:D8)

Mode

The observation data observed frequently in the population is called mode. Let’s modify the above given data slightly to explain mode.

Airport Runway length in meters
Tuticorin 1,000
Salem 2,000
Madurai 2,000
Trichy 2,000
Chennai – II 3,000
Coimbatore 3,000
Chennai – I 4,000

2000 is repeated thrice. Hence this is the mode of this data set.

Excel function: =MODE.SNGL(E2:E8)