In this section we discuss hypothesis tests for two population standard deviations; in other words, we discuss methods of inference for the standard deviations of one variable from two different populations. These methods are based on the \(F\)-distribution, named in honor of Sir Ronald Aylmer Fisher.

The \(F\)-distribution is a right-skewed probability distribution with two shape parameters, \(v_1\) and \(v_2\), called the degrees of freedom for the numerator (\(v_1\)) and the degrees of freedom for the denominator (\(v_2\)).

\[ df = (v_1,v_2)\] As for any other density curve, the area under the curve of the \(F\)-distribution corresponds to probabilities. The area under the curve, and thus the probability, for any given interval and given \(df\) is computed with software. Alternatively, one may look the values up in a table. In such tables the degrees of freedom for the numerator (\(v_1\)) are generally displayed along the top, whereas the degrees of freedom for the denominator (\(v_2\)) are displayed in the leftmost column.
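As a minimal sketch, such areas can be computed in R with the `pf()` function; the value and degrees of freedom below are chosen only for illustration:

```r
# Area under the F-curve with df = (9, 14) to the left of 2.6458
pf(2.6458, df1 = 9, df2 = 14)

# Area under the same curve to the right of 2.6458
1 - pf(2.6458, df1 = 9, df2 = 14)
```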

In order to perform a hypothesis test for two population standard deviations, the value that corresponds to a specified area under an \(F\)-curve is calculated.

Given \(\alpha\), where \(\alpha\) corresponds to a probability between 0 and 1, \(F_{\alpha}\) denotes the value having an area \(\alpha\) to its right under an \(F\)-curve.

In the figure above, \(F_{0.05}\) for \(df=(9,14)\) evaluates to \(\approx 2.6458\).

One interesting property of \(F\)-curves is the **reciprocal property**. It says that for an \(F\)-curve with \(df = (v_1, v_2)\), the \(F\)-value having the area \(\alpha\) to its left equals the reciprocal of the \(F\)-value having the area \(\alpha\) to its right for an \(F\)-curve with \(df = (v_2, v_1)\) (Weiss 2010). Applied to the example from above, where \(F_{0.05}\) for \(df=(9,14)\) evaluates to \(\approx 2.6458\), this means that \(F_{0.95}\) for \(df=(14,9)\) evaluates to \(\frac{1}{2.6458} \approx 0.378\).
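These quantiles can be obtained in R with the `qf()` function. Note that `qf()` expects the area to the *left*, so \(F_{\alpha}\) corresponds to `qf(1 - alpha, ...)`. A short sketch verifying the reciprocal property for the example above:

```r
# F_0.05 for df = (9, 14): area 0.05 to the right = area 0.95 to the left
qf(0.95, df1 = 9, df2 = 14)   # approx. 2.6458

# Reciprocal property: F_0.95 for df = (14, 9) ...
qf(0.05, df1 = 14, df2 = 9)

# ... equals the reciprocal of F_0.05 for df = (9, 14)
1 / qf(0.95, df1 = 9, df2 = 14)   # approx. 0.378
```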

The \(100(1-\alpha)\)% confidence interval for the ratio \(\sigma_1 / \sigma_2\) is

\[\frac{1}{\sqrt{F_{\alpha /2}}} \times \frac{s_1}{s_2} \le \frac{\sigma_1}{\sigma_2} \le \frac{1}{\sqrt{F_{1-\alpha /2}}} \times \frac{s_1}{s_2}\text{,} \] where \(s_1\) and \(s_2\) are the sample standard deviations.
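As a sketch, this interval can be computed in R; the sample standard deviations and sample sizes below are made-up numbers for illustration:

```r
s1 <- 6.05; s2 <- 8.68         # hypothetical sample standard deviations
n1 <- 25; n2 <- 25             # hypothetical sample sizes
alpha <- 0.05
df1 <- n1 - 1; df2 <- n2 - 1

# F_{alpha/2} has area alpha/2 to its right, hence qf(1 - alpha/2, ...)
lower <- 1 / sqrt(qf(1 - alpha / 2, df1, df2)) * (s1 / s2)
upper <- 1 / sqrt(qf(alpha / 2, df1, df2)) * (s1 / s2)
c(lower, upper)   # 95% confidence interval for sigma1 / sigma2
```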

The hypothesis-testing procedure for two population standard deviations is called the **two standard deviations \(F\)-test**. It follows the same step-wise procedure as other hypothesis tests. \[
\begin{array}{ll}
\hline
\text{Step 1} & \text{State the null hypothesis } H_0 \text{ and the alternative hypothesis } H_A\text{.}\\
\text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\
\text{Step 3} & \text{Compute the value of the test statistic.} \\
\text{Step 4} & \text{Determine the p-value.} \\
\text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0\text{.} \\
\text{Step 6} & \text{Interpret the result of the hypothesis test.} \\
\hline
\end{array}
\]

The test statistic for a hypothesis test for a normally distributed variable, based on independent samples of sizes \(n_1\) and \(n_2\), is given by

\[F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}\text{,} \] with \(df = (n_1 - 1,\; n_2 - 1)\).

If \(H_0: \sigma_1 = \sigma_2\) is true, then the equation simplifies to

\[F = \frac{s_1^2}{s_2^2} \]
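A minimal sketch of computing this test statistic and a two-sided p-value in R; the sample standard deviations and sample sizes below are hypothetical values for illustration:

```r
s1 <- 6.05475; s2 <- 8.680054   # hypothetical sample standard deviations
n1 <- 25; n2 <- 25              # hypothetical sample sizes

# Test statistic under H0: sigma1 = sigma2
F.stat <- s1^2 / s2^2
df1 <- n1 - 1; df2 <- n2 - 1

# Two-sided p-value: twice the smaller tail area
p.value <- 2 * min(pf(F.stat, df1, df2), 1 - pf(F.stat, df1, df2))
F.stat
p.value
```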

In order to get some hands-on experience we apply the **two standard deviations \(F\)-test** in an exercise. Therefore we load the *students* data set. You may download the `students.csv` file here. Import the data set and assign a proper name to it.

`students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")`

The *students* data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: *stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary*.

In order to showcase the **two standard deviations \(F\)-test** we examine once again the `height` variable in the *students* data set. We compare the spread of the height of female students with the spread of the height of male students. **We want to test if the standard deviation of the height of female students (\(\sigma_1\)) differs from the standard deviation of the height of male students (\(\sigma_2\))**.

We start with data preparation.

- We subset the data set based on the variable `gender`.
- Then we sample 25 female students and 25 male students.
- Then we calculate the standard deviations of the variable of interest (height in cm) for both samples and assign them to the variables `sd.female` and `sd.male`.

```
female <- subset(students, gender=='Female')
male <- subset(students, gender=='Male')
n <- 25
female.sample <- sample(female$height, n)
male.sample <- sample(male$height, n)
sd.female <- sd(female.sample)
sd.female
```

`## [1] 6.05475`

```
sd.male <- sd(male.sample)
sd.male
```

`## [1] 8.680054`

Further, we check the normality assumption by plotting a Q-Q plot. In R we apply the `qqnorm()` and `qqline()` functions for plotting Q-Q plots.

```
par(mfrow = c(1,2), mar = c(5,5,4,2))
# Sample data females
qqnorm(female.sample, main = 'Q-Q plot for height of\n sampled female students', cex.main = 0.9)
qqline(female.sample, col = 3, lwd = 2)
# Sample data males
qqnorm(male.sample, main = 'Q-Q plot for height of\n sampled male students', cex.main = 0.9)
qqline(male.sample, col = 4, lwd = 2)
```
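The test itself can then be carried out with R's built-in `var.test()` function, which performs the \(F\)-test for the ratio of two variances. Below is a sketch on simulated data; the means and standard deviations are made-up values, chosen only so the example runs without the *students* data set:

```r
set.seed(1)  # for reproducibility of this sketch

# Hypothetical height samples in cm (assumed parameters, not from the data set)
female.sample <- rnorm(25, mean = 163, sd = 7)
male.sample   <- rnorm(25, mean = 176, sd = 9)

res <- var.test(female.sample, male.sample)
res$statistic   # F = s1^2 / s2^2
res$p.value     # two-sided p-value
```

If the p-value does not fall below the chosen significance level \(\alpha\), the null hypothesis of equal standard deviations is not rejected.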