In this section we discuss hypothesis tests for two population standard deviations, or in other words, methods of inference for the standard deviations of one variable from two different populations. These methods are based on the \(F\)-distribution, named in honor of Sir Ronald Aylmer Fisher.

The \(F\)-distribution is a right-skewed probability distribution with two shape parameters, \(v_1\) and \(v_2\), called the degrees of freedom for the numerator (\(v_1\)) and the degrees of freedom for the denominator (\(v_2\)).

\[ df = (v_1,v_2)\] As for any other density curve, the area under an \(F\)-curve corresponds to probabilities. The area under the curve, and thus the probability, for any given interval and given \(df\) is computed with software. Alternatively, one may look the values up in a table; in such tables the degrees of freedom for the numerator (\(v_1\)) are typically displayed along the top, whereas the degrees of freedom for the denominator (\(v_2\)) are displayed in the leftmost column.
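In R, for instance, such areas are obtained with the pf() function (and the corresponding quantiles with the qf() function, used further below). A minimal sketch, with arbitrarily chosen values for illustration:

# area under the F-curve with df = (9, 14) to the left of 3
pf(3, df1 = 9, df2 = 14)

# area to the right of 3 (upper tail)
pf(3, df1 = 9, df2 = 14, lower.tail = FALSE)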

In order to perform a hypothesis test for two population standard deviations, the value that corresponds to a specified area under an \(F\)-curve must be determined.

Given \(\alpha\), where \(\alpha\) corresponds to a probability between 0 and 1, \(F_{\alpha}\) denotes the value having an area of \(\alpha\) to its right under an \(F\)-curve.

For example, \(F_{0.05}\) for \(df=(9,14)\) evaluates to \(\approx 2.6458\).

One interesting property of \(F\)-curves is the reciprocal property: for an \(F\)-curve with \(df = (v_1, v_2)\), the \(F\)-value having the area \(\alpha\) to its left equals the reciprocal of the \(F\)-value having the area \(\alpha\) to its right for an \(F\)-curve with \(df = (v_2, v_1)\) (Weiss 2010). Applied to the example from above, where \(F_{0.05}\) for \(df=(9,14)\) evaluates to \(\approx 2.6458\), this means that \(F_{0.95}\) for \(df=(14,9)\) evaluates to \(\frac{1}{2.6458} \approx 0.378\).
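We may verify both the value from the example and the reciprocal property in R with the qf() function:

# F-value with area 0.05 to its right for df = (9, 14); approx. 2.6458
qf(0.05, df1 = 9, df2 = 14, lower.tail = FALSE)

# F-value with area 0.05 to its left for df = (14, 9); approx. 1/2.6458 = 0.378
qf(0.05, df1 = 14, df2 = 9)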


Interval Estimation of \(\sigma_1 / \sigma_2\)

The \(100(1-\alpha)\)% confidence interval for \(\sigma_1 / \sigma_2\) is

\[\frac{1}{\sqrt{F_{\alpha /2}}} \times \frac{s_1}{s_2} \le \frac{\sigma_1}{\sigma_2} \le \frac{1}{\sqrt{F_{1-\alpha /2}}} \times \frac{s_1}{s_2}\text{,} \] where \(s_1\) and \(s_2\) are the sample standard deviations and both \(F\)-values are based on \(df = (n_1 - 1,\; n_2 - 1)\).
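As a minimal sketch in R, assuming made-up sample values for illustration, the interval may be computed as follows:

# made-up sample standard deviations and sample sizes for illustration
s1 <- 6; s2 <- 8
n1 <- 25; n2 <- 25
alpha <- 0.05

# F-values with area alpha/2 and 1 - alpha/2 to their right, df = (n1 - 1, n2 - 1)
F.upper <- qf(alpha / 2, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)
F.lower <- qf(1 - alpha / 2, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)

# 100(1 - alpha) % confidence interval for sigma1 / sigma2 (here 95 %)
c(1 / sqrt(F.upper) * s1 / s2, 1 / sqrt(F.lower) * s1 / s2)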


Two-Standard-Deviations \(F\)-test

The hypothesis-testing procedure for two population standard deviations is called the two-standard-deviations \(F\)-test. It follows the same step-wise procedure as other hypothesis tests. \[ \begin{array}{ll} \hline \ \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \ \text{Step 4} & \text{Determine the p-value.} \\ \ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \ \text{Step 6} & \text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

The test statistic for a hypothesis test for a normally distributed variable and for independent samples of sizes \(n_1\) and \(n_2\) is given by

\[F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}\text{,} \] with \(df = (n_1 - 1,\; n_2 - 1)\).

If \(H_0: \sigma_1 = \sigma_2\) is true, then the equation simplifies to

\[F = \frac{s_1^2}{s_2^2} \]
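As a minimal sketch in R (again with made-up sample values), the test statistic and a two-sided p-value, here computed by the common convention of doubling the smaller tail area, look as follows:

# made-up sample standard deviations and sample sizes for illustration
s1 <- 6; s2 <- 8
n1 <- 25; n2 <- 25

# test statistic
F.stat <- s1^2 / s2^2

# two-sided p-value: twice the smaller of the two tail areas
2 * min(pf(F.stat, df1 = n1 - 1, df2 = n2 - 1),
        pf(F.stat, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE))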


Two-Standard-Deviations \(F\)-test: An example

In order to get some hands-on experience we apply the two-standard-deviations \(F\)-test in an exercise. For this purpose we load the students data set; the students.csv file is read directly from its URL below. Import the data set and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
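A quick check of these dimensions in R:

dim(students)
## [1] 8239   16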

In order to showcase the two-standard-deviations \(F\)-test we examine once again the height variable in the students data set. We compare the spread of the height of female students with the spread of the height of male students. We want to test whether the standard deviation of the height of female students (\(\sigma_1\)) differs from the standard deviation of the height of male students (\(\sigma_2\)).


Data preparation

We start with data preparation.

female <- subset(students, gender=='Female')  # all female students
male <- subset(students, gender=='Male')      # all male students

# draw random samples of size n = 25 from each group
# (note: without a fixed seed, repeated runs yield different samples)
n <- 25
female.sample <- sample(female$height, n)
male.sample <- sample(male$height, n)

sd.female <- sd(female.sample)
sd.female
## [1] 6.05475
sd.male <- sd(male.sample)
sd.male
## [1] 8.680054
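Based on these two sample standard deviations, the observed value of the test statistic is simply their squared ratio (approx. 0.487 for the sample values shown above):

# observed test statistic F = s1^2 / s2^2
sd.female^2 / sd.male^2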

Further, we check the normality assumption by plotting Q-Q plots. In R we apply the qqnorm() and qqline() functions to construct them.

par(mfrow = c(1,2), mar = c(5,5,4,2))

# Sample data females
qqnorm(female.sample, main = 'Q-Q plot for height of\n sampled female students', cex.main = 0.9)
qqline(female.sample, col = 3, lwd = 2)

# Sample data males
qqnorm(male.sample, main = 'Q-Q plot for height of\n sampled male students', cex.main = 0.9)
qqline(male.sample, col = 4, lwd = 2)
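Note that base R also ships with the var.test() function, which performs this \(F\)-test for two variances directly and may serve as a cross-check of the step-wise calculations:

# built-in F-test comparing the variances of the two samples
var.test(female.sample, male.sample)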