The variance measures the average squared deviation from the mean. The variance for population data is denoted by $$\sigma^2$$ (read as sigma squared), and the variance calculated for sample data is denoted by $$s^2$$.

$\sigma^2 = \frac{\sum_{i=1}^N (x_i-\mu)^2}{N}$ and $s^2 = \frac{\sum_{i=1}^n (x_i-\bar x)^2}{n-1}$

where $$\sigma^2$$ is the population variance and $$s^2$$ is the sample variance. The quantity $$x_i-\mu$$ or $$x_i-\bar x$$ in the above formulas is called the deviation of the $$x_i$$ value ($$x_1, x_2,...,x_n$$) from the mean (Mann 2012).
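To make the formulas concrete, here is a minimal R sketch (using a small made-up vector, treated as a complete population) that computes the population variance by hand and compares the sample formula with R's built-in var(), which implements the sample version:

```r
# a small made-up data set, treated as a complete population
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(x)
mu <- mean(x)

# population variance: divide the sum of squared deviations by N
pop.var <- sum((x - mu)^2) / N

# sample variance: divide by n - 1 instead; this is what var() computes
samp.var <- sum((x - mu)^2) / (N - 1)

pop.var   # 4
samp.var  # 4.571429, identical to var(x)
```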

The standard deviation is the most widely used measure of dispersion. The value of the standard deviation tells how closely the values of a data set are clustered around the mean. In general, a lower value of the standard deviation for a data set indicates that the values of that data set are spread over a relatively smaller range around the mean. In contrast, a larger value of the standard deviation for a data set indicates that the values of that data set are spread over a relatively larger range around the mean (Mann 2012).

The standard deviation is obtained by taking the square root of the variance. Consequently, the standard deviation calculated for population data is denoted by $$\sigma$$ and the standard deviation calculated for sample data is denoted by $$s$$.

$\sigma = \sqrt{\frac{\sum_{i=1}^N (x_i-\mu)^2}{N}}$ and $s = \sqrt{\frac{\sum_{i=1}^n (x_i-\bar x)^2}{n-1}}$

where $$\sigma$$ is the standard deviation of the population and $$s$$ is the standard deviation of the sample.
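Correspondingly, R's built-in sd() is simply the square root of the sample variance; a quick check on the same kind of toy vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
# the sample standard deviation is the square root of the sample variance
sqrt(var(x))  # 2.13809
sd(x)         # identical, since sd() is defined as sqrt(var())
```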

As an exercise we compute the mean, the median, the variance, and the standard deviation for some numerical variables of interest in the students.quant data set, and present the results in a nicely formatted table.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")
quant.vars <- c("name", "age", "nc.score", "height", "weight")
students.quant <- students[quant.vars]
head(students.quant, 10)
##                    name age nc.score height weight
## 1   Gonzales, Christina  19     1.91    160   64.8
## 2        Lozano, T'Hani  19     1.56    172   73.0
## 3        Williams, Hanh  22     1.24    168   70.6
## 4           Nem, Denzel  19     1.37    183   79.7
## 5       Powell, Heather  21     1.46    175   71.4
## 6        Perez, Jadrian  19     1.34    189   85.8
## 7         Clardy, Anita  21     1.11    156   65.9
## 8  Allen, Rebecca Marie  21     2.03    167   65.7
## 9         Tracy, Robert  18     1.29    195   94.4
## 10       Nimmons, Laura  18     1.19    165   66.0
# exclude the non-numeric column "name"
numeric.cols <- !(colnames(students.quant) == "name")

# mean
students.quant.mean <- apply(students.quant[, numeric.cols], 2, mean)
# median
students.quant.median <- apply(students.quant[, numeric.cols], 2, median)
# variance
students.quant.var <- apply(students.quant[, numeric.cols], 2, var)
# standard deviation
students.quant.sd <- apply(students.quant[, numeric.cols], 2, sd)

# concatenate the vectors column-wise and round to 2 digits
students.quant.stats <- round(cbind(students.quant.mean,
                                    students.quant.median,
                                    students.quant.var,
                                    students.quant.sd), 2)
# rename column names
colnames(students.quant.stats) <- c('mean', 'median','variance', 'standard deviation')
students.quant.stats
##            mean median variance standard deviation
## age       22.54  21.00    36.79               6.07
## nc.score   2.17   2.04     0.66               0.81
## height   171.38 171.00   122.71              11.08
## weight    73.00  71.80    74.57               8.64

#### Use of the Standard Deviation

By using the mean and standard deviation, we can find the proportion or percentage of the total observations that fall within a given interval about the mean.

##### Chebyshev's Theorem

Chebyshev's theorem gives a lower bound for the area under a curve between two points that are on opposite sides of the mean and at the same distance from the mean.

For any number $$k$$ greater than 1, at least $$1-1/k^2$$ of the data values lie within $$k$$ standard deviations of the mean.

Let us use R to gain some intuition for Chebyshev's theorem.

k <- seq(1,4,by = 0.1)
auc <- 1-(1/k^2)
auc.percent <- round(auc*100)
cbind(k,auc.percent)
##         k auc.percent
##  [1,] 1.0           0
##  [2,] 1.1          17
##  [3,] 1.2          31
##  [4,] 1.3          41
##  [5,] 1.4          49
##  [6,] 1.5          56
##  [7,] 1.6          61
##  [8,] 1.7          65
##  [9,] 1.8          69
## [10,] 1.9          72
## [11,] 2.0          75
## [12,] 2.1          77
## [13,] 2.2          79
## [14,] 2.3          81
## [15,] 2.4          83
## [16,] 2.5          84
## [17,] 2.6          85
## [18,] 2.7          86
## [19,] 2.8          87
## [20,] 2.9          88
## [21,] 3.0          89
## [22,] 3.1          90
## [23,] 3.2          90
## [24,] 3.3          91
## [25,] 3.4          91
## [26,] 3.5          92
## [27,] 3.6          92
## [28,] 3.7          93
## [29,] 3.8          93
## [30,] 3.9          93
## [31,] 4.0          94

To put it in words: let us pick $$k = 2$$. Then at least 75% of the data values lie within 2 standard deviations of the mean.

Let us plot Chebyshev's theorem with R:

plot(k,
     auc.percent,
     col = 'blue',
     pch = 19,
     xlab = 'k',
     ylab = 'percent',
     main = "Chebyshev's theorem")

The theorem applies to both sample and population data. Note that Chebyshev's theorem is applicable to a distribution of any shape. However, Chebyshev's theorem can be used only for $$k>1$$. This is so because when $$k = 1$$, the value of $$(1-1/k^2)$$ is zero, and when $$k<1$$, the value of $$(1-1/k^2)$$ is negative (Mann 2012).
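Because the theorem makes no assumption about the shape of the distribution, the bound should also hold for strongly skewed data. A quick sketch using exponentially distributed (right-skewed) random numbers, which are not part of the students data set:

```r
set.seed(1)  # for reproducibility
y <- rexp(n = 100000, rate = 1)  # a strongly right-skewed distribution
m <- mean(y)
s <- sd(y)

k <- 2
# empirical proportion of values within k standard deviations of the mean
within.k <- mean(y > m - k * s & y < m + k * s)
within.k     # roughly 0.95, well above the bound
1 - 1/k^2    # Chebyshev's lower bound: 0.75
```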

##### Empirical Rule

Whereas Chebyshevâ€™s theorem is applicable to any kind of distribution, the empirical rule applies only to a specific type of distribution called a bell-shaped distribution or normal distribution. There are 3 rules:

For a bell-shaped distribution, approximately

1. 68% of the observations lie within one standard deviation of the mean.
2. 95% of the observations lie within two standard deviations of the mean.
3. 99.7% of the observations lie within three standard deviations of the mean.

By now we have sufficient hacking power to test whether the three rules are valid. (1) First, we explore the rnorm function in R to generate normally distributed data, and (2) second, we go back to our students data set and validate those rules on that data set.

The normal distribution belongs to the family of continuous distributions. R provides a large number of probability distributions. To generate data from a normal distribution, one may use the rnorm() function, which is a random number generator for the normal distribution.

We can sample n values from a normal distribution with a given mean (default is 0) and standard deviation (default is 1) using the rnorm() function: rnorm(n=1, mean=0, sd=1). Let us give it a try:

rnorm(n = 1, mean = 0, sd = 1)
## [1] 0.1678662
rnorm(n = 1, mean = 0, sd = 1)
## [1] 0.6837668
rnorm(n = 1, mean = 0, sd = 1)
## [1] 0.04708239
rnorm(n = 1, mean = 0, sd = 1)
## [1] -1.07926

We see that the rnorm() function returns (pseudo-)random numbers. We can just as easily ask the function to draw hundreds, thousands, or even more (pseudo-)random numbers:

rnorm(n = 10, mean = 0, sd = 1)
##  [1]  0.02096229  0.43151220 -1.40139388  0.15506183 -0.34292814
##  [6]  0.42154610 -1.46577763 -0.11357803  0.21557431  1.79858520
rnorm(n = 100, mean = 0, sd = 1)
##   [1]  0.858610699 -0.538871553  1.178545523 -0.232470570  0.256386652
##   [6] -0.150851862  1.372969736  0.794787592  0.806399330 -0.073286967
##  [11]  0.032188881  0.714342227 -0.319559109  0.119267069 -2.669279361
##  [16] -0.312693227  1.000857586 -0.245512068 -1.648206475  0.591647404
##  [21]  0.464694686 -0.432167902  0.006935236  0.778072647 -1.230160775
##  [26] -0.812220686 -0.148308240  1.679072456  0.708138675 -1.225619764
##  [31] -0.995486237 -1.548960815  0.094843793  2.464564258 -2.020292831
##  [36] -1.899247968  0.173424246  0.913551913 -0.207261283  0.324360659
##  [41]  1.801755154 -0.325521976  0.604237508  0.779840046 -0.370660439
##  [46] -0.532740999  2.841560934  0.256340221 -0.099680818  0.370023816
##  [51] -0.988085742  1.113187596 -0.148752654 -0.427376407 -0.285248692
##  [56] -2.518159900 -1.133915893  1.015535149  1.250742898  0.762621867
##  [61]  1.076042353  1.017807435 -1.769951942 -0.776556420 -1.086160475
##  [66]  2.112988581 -0.939667579 -1.302148621  0.214206592 -1.260815904
##  [71]  1.006562146  0.475639593  0.259057639 -1.247925217 -2.692915441
##  [76] -0.219901170  0.151754088 -1.577207232  1.035310082  0.720603595
##  [81]  1.550685175  1.341701434  0.356812319 -1.784904912 -0.734375596
##  [86]  1.243931033  1.099941851  0.069806131 -0.243730087  0.444888324
##  [91]  1.869267474  0.825093450 -0.737191306 -0.282257181 -0.458562232
##  [96] -1.269095739  0.696992303  0.340557613  0.492573235  1.331791117
y.norm <- rnorm(n= 100000, mean = 0, sd = 1)

If we plot a histogram of those numbers, we see the eponymous bell-shaped distribution.

hist(y.norm, breaks = 100, main = 'Normal distribution', xlab = '')

We already know the mean and the standard deviation of the y.norm vector, as we explicitly called the function rnorm() with mean = 0 and sd = 1. To validate the three rules claimed above, we simply count the values of y.norm that lie within 1, 2, and 3 standard deviations of the mean, and relate those counts to the length of the vector, in our case 100,000.

sd1 <- sum(y.norm >-1 & y.norm < 1) / length(y.norm) * 100
sd2 <- sum(y.norm >-2 & y.norm < 2) / length(y.norm) * 100
sd3 <- sum(y.norm >-3 & y.norm < 3) / length(y.norm) * 100

cbind(c('1sd','2sd','3sd'), c(sd1, sd2, sd3))
##      [,1]  [,2]
## [1,] "1sd" "68.203"
## [2,] "2sd" "95.485"
## [3,] "3sd" "99.771"

A near-perfect match! The three empirical rules obviously hold. To visualize our findings we re-plot the histogram and add some annotations. Please note that in the hist() function we set the argument freq = F, which is the same as freq = FALSE. As a consequence, the resulting histogram no longer shows counts on the y-axis, but density values (counts normalized by sample size and bin width), which means that the bar areas sum to 1.

h <- hist(y.norm, breaks = 100, plot = F)
# right = F: intervals are closed on the left and open on the right
cuts <- cut(h$breaks, c(-Inf, -3, -2, -1, 1, 2, 3, Inf), right = F)
plot(h,
     col = c("white", "4", "3", "2", "3", "4", "white")[cuts],
     main = 'Normal distribution',
     xlab = '',
     freq = F,
     ylim = c(0, 0.6))

lwd = 3
# horizontal lines
lines(x = c(2,-2), y = c(0.48,0.48), type = "l", col=3, lwd = lwd)
lines(x = c(3,-3), y = c(0.55,0.55), type = "l", col=4, lwd = lwd)
lines(x = c(1,-1), y = c(0.41,0.41), type = "l", col=2, lwd = lwd)
# vertical lines
lines(x = c(1,1), y = c(0,0.41), type = "l", col=2, lwd = lwd)
lines(x = c(-1,-1), y = c(0,0.41), type = "l", col=2, lwd = lwd)
lines(x = c(2,2), y = c(0,0.48), type = "l", col=3, lwd = lwd)
lines(x = c(-2,-2), y = c(0,0.48), type = "l", col=3, lwd = lwd)
lines(x = c(3,3), y = c(0,0.55), type = "l", col=4, lwd = lwd)
lines(x = c(-3,-3), y = c(0,0.55), type = "l", col=4, lwd = lwd)
# text
text(0, 0.44, "68%", cex = 1.5, col=2)
text(0, 0.51, "95%", cex = 1.5, col=3)
text(0, 0.58, "99.7%", cex = 1.5, col=4)

Now let us work on our second task: validating the three empirical rules on the students data set. For that we have to check whether any of the numeric variables in the students data set is normally distributed. We start by extracting the numeric variables of interest from the students data set. Then we plot histograms for each of them and assess whether the variable is normally distributed or not. But first, we inspect the data set by calling the function head().

cont.vars <- c("age", "nc.score", "height", "weight", "score1", "score2", "salary")
students.quant <- students[, cont.vars]
head(students.quant, 10)
##    age nc.score height weight score1 score2 salary
## 1   19     1.91    160   64.8     NA     NA     NA
## 2   19     1.56    172   73.0     NA     NA     NA
## 3   22     1.24    168   70.6     45     46     NA
## 4   19     1.37    183   79.7     NA     NA     NA
## 5   21     1.46    175   71.4     NA     NA     NA
## 6   19     1.34    189   85.8     NA     NA     NA
## 7   21     1.11    156   65.9     NA     NA     NA
## 8   21     2.03    167   65.7     58     62     NA
## 9   18     1.29    195   94.4     57     67     NA
## 10  18     1.19    165   66.0     NA     NA     NA

To get an overview of the shape of the distribution of each particular variable, we apply the histogram() function of the lattice package. If the lattice package is not yet installed on your computer, type install.packages("lattice") in your console. The syntax is a little different from that of standard histograms.

library(lattice)
histogram(~ height + age + weight + nc.score + score1 + score2 + salary,
          breaks = 50,
          type = "density",
          xlab = "",
          ylab = "density",
          layout = c(4, 2),
          scales = list(relation = "free"),
          col = 'black',
          data = students.quant)