The purpose of a **one-way ANOVA hypothesis test** is to
compare \(k\) population/group means,
\(\mu_1, \mu_2, \ldots, \mu_k\). The
following assumptions have to be fulfilled in order to apply a one-way
ANOVA (Weiss, 2010):

- Random samples
- Independent samples
- For each population, the variable under consideration is normally distributed
- The standard deviations of the variable under consideration are the same for all populations

A one-way ANOVA hypothesis test follows the same step-wise procedure as other hypothesis tests:

\[ \begin{array}{ll} \hline \ \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \ \text{Step 4} &\text{Determine the p-value.} \\ \ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

In order to get some hands-on experience we apply the **one-way
ANOVA hypothesis test** in an exercise. For this we import the
*students* data set, which you may also download here.

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explanatory variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.
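We may quickly verify these dimensions and variable names ourselves with the base R functions `dim()` and `names()` (a small sketch, assuming the `students` data frame was imported as above):

```r
# check the number of rows and columns of the data set
dim(students)
# list the variable names
names(students)
```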

In order to showcase the **one-way ANOVA hypothesis test** we examine the mean annual salary of graduates, grouped by their major study subject. In this example the classification/grouping is based on one variable, the `major` variable, the so-called *factor*. Thus, we are conducting a one-way ANOVA. In the following exercise **we want to test if the mean annual salary of graduates is different among graduates of different study subjects**.

We start our data analysis with the random sampling process. We randomly sample 275 students from the data set using the `sample()` function in R. We want to make sure to sample only from graduated students, which is why we first subset the data with R's `subset()` function. Further, we reduce our data set to the two variables of interest, the categorical variable `major` and the numeric variable `salary`. Then we display the first 10 rows of the data set.

```
graduated <- subset(students, graduated == 1)
n <- 275
data_idx <- sample(x = 1:nrow(graduated), size = n)
data <- graduated[data_idx, c("major", "salary")]
head(data, 10)
```

```
## major salary
## 4029 Mathematics and Statistics 56157.26
## 844 Social Sciences 34136.26
## 7372 Social Sciences 27884.85
## 5228 Biology 46318.75
## 1682 Biology 43634.65
## 5720 Mathematics and Statistics 36055.36
## 209 Political Science 30611.51
## 63 Environmental Sciences 32519.22
## 5380 Economics and Finance 45135.45
## 2663 Economics and Finance 69348.61
```

Just out of interest, we visualize the counts for each of the six different study subjects in our sample by plotting a bar plot. We use the `ggplot2` library for plotting:

```
library(ggplot2)
p <- ggplot(data, aes(major)) +
geom_bar(width = .7) +
coord_flip() +
theme(axis.title.y = element_blank())
p
```

We can see that the different study subjects are not equally distributed in our sample, but that is okay.
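The exact counts per group can also be obtained numerically with the base R `table()` function (a small sketch complementing the bar plot, using the `data` object sampled above):

```r
# tabulate the number of sampled students per major
table(data$major)
```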

Before we start with the hypothesis test, we check whether the assumptions for the one-way ANOVA hypothesis test are fulfilled. The samples are random and independent. We can check the normality assumption by plotting a normal probability plot (Q-Q plot) for each grouped variable:

```
p <- ggplot(data, aes(sample = salary, color = major)) +
  stat_qq() +
  theme_minimal()
p
```

Well, not perfect, but we may assume the data for each group falls roughly on a straight line.
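Besides the visual check, the Q-Q plots may be complemented by a formal normality test per group, e.g. the Shapiro-Wilk test via base R's `shapiro.test()` (a sketch continuing with the `data` object from above; note that such tests have limited power for small group sizes):

```r
# apply the Shapiro-Wilk test to the salaries of each group
# and extract the p-value for each major
normality_p <- tapply(
  X = data$salary,
  INDEX = data$major,
  FUN = function(x) shapiro.test(x)$p.value
)
normality_p
```

Large p-values would indicate no evidence against normality within the groups.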

Next, we test the assumption of equal standard deviations. Therefore, we calculate the standard deviation for each group. The programming language R provides some excellent functions to calculate statistical parameters across different groupings of a data set. These functions belong to the `apply()` function family. For our example we use the `tapply()` function to calculate the standard deviation for each group.

```
# calculate standard deviation
sd_groups <- tapply(X = data$salary, INDEX = data$major, FUN = sd)
sd_groups
```

```
## Biology Economics and Finance
## 8085.088 10202.188
## Environmental Sciences Mathematics and Statistics
## 6325.725 8671.149
## Political Science Social Sciences
## 9979.958 6703.441
```

As a rule of thumb, we consider the equal standard deviations assumption as fulfilled if the ratio of the largest to the smallest sample standard deviation is less than 2 (Weiss, 2010).

```
# calculate ratio of the largest to the smallest sample standard deviation
ratio <- max(as.vector(sd_groups)) / min(as.vector(sd_groups))
ratio
```

`## [1] 1.612809`

The ratio of the largest to the smallest sample standard deviation is 1.613. That is close to the threshold of 2, but for our toy example still acceptable. Thus, we conclude that all the assumptions are fulfilled.

Note: One-way ANOVA is robust to moderate violations of the normality assumption and the equal standard deviations assumption (Weiss, 2010).
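As an alternative to the rule of thumb, one could also run a formal test for homogeneity of variances, e.g. Bartlett's test, available in base R as `bartlett.test()` (a sketch using the `data` object from above; note that Bartlett's test is itself sensitive to departures from normality):

```r
# Bartlett's test of homogeneity of variances across the majors;
# a large p-value means no evidence against equal variances
bartlett.test(salary ~ major, data = data)
```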

In order to conduct the **one-way ANOVA hypothesis
test** we follow the step-wise implementation procedure for
hypothesis testing.

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that the mean annual salary is equal among all groups of graduates:

\[H_0: \quad \mu_1=\mu_2=\mu_3=\mu_4=\mu_5=\mu_6\]

Alternative hypothesis:

\[H_A: \quad \text{Not all the means are equal} \]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.01\]

`alpha <- 0.01`

**Step 3 and 4: Compute the value of the test statistic and the
p-value**

In order to calculate the \(F\) test
statistic, we need to determine several quantities beforehand. For
illustration purposes we manually compute the \(F\) test statistic in R. Step by step we
will populate the ANOVA table until we finally get the \(F\) test statistic and consequently the
*p*-value.

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & k-1 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & n-k & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline \ \text{Total} & n-1 & SST & & & \\ \hline \end{array} \]

We start with the first column and calculate the degrees of freedom:

```
k <- length(unique(data$major))
n <- nrow(data)
df_G <- k - 1
df_G
```

`## [1] 5`

```
df_E <- n - k
df_E
```

`## [1] 269`

```
df_T <- n - 1
df_T
```

`## [1] 274`

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & 269 & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline \ \text{Total} & 274 & SST & & & \\ \hline \end{array} \]

Further, we calculate the three sums of squares, \(SST\), \(SSG\) and \(SSE\). Recall the equations from the previous page:

\[SST = \sum_{i=1}^n(x_i-\bar x)^2\text{,}\]

where \(x_i\) corresponds to the observations in the samples and \(\bar x\) to the overall mean of all samples.

```
# calculate overall mean
x_bar <- mean(data$salary)
# observation data
xi <- data$salary
# calculate SST
SST <- sum((xi - x_bar)^2)
SST
```

`## [1] 32242968219`

\[SSG = \sum_{j=1}^k n_j(\bar x_j-\bar x)^2\]

Here, \(n_j\) denotes the sample size for group \(j\), \(\bar x_j\) denotes the mean of group \(j\) and \(\bar x\) denotes the overall mean of all samples.

```
# calculate sample size for each group
n_j <- tapply(X = data$salary, INDEX = data$major, FUN = length)
# calculate mean for each group
xi_bar <- tapply(X = data$salary, INDEX = data$major, FUN = mean)
# calculate SSG
SSG <- sum(n_j * (xi_bar - x_bar)^2)
SSG
```

`## [1] 13003942832`

\[SSE = \sum_{j=1}^k (n_j-1)s_j^2\text{,}\]

where \(n_j\) denotes the sample size for group \(j\) and \(s^2_j\) the variance of group \(j\).

```
# calculate the standard deviation for each group
s_j <- tapply(X = data$salary, INDEX = data$major, FUN = sd)
# calculate SSE; squaring the standard deviations yields the variances
SSE <- sum((n_j - 1) * s_j^2)
SSE
```

`## [1] 19239025388`

```
# alternatively one may calculate SSE this way
SSE <- SST - SSG
SSE
```

`## [1] 19239025388`

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & MSE=\frac{SSE}{n-k} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

Now, we calculate the two measures of mean variability, \(MSG\) and \(MSE\):

```
# calculate MSG
MSG <- SSG / (k - 1)
MSG
```

`## [1] 2600788566`

```
# calculate MSE
MSE <- SSE / (n - k)
MSE
```

`## [1] 71520540`

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

Finally, we obtain the \(F\)-statistic by the ratio of \(MSG\) and \(MSE\):

```
Fstat <- MSG / MSE
Fstat
```

`## [1] 36.36422`

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & 36.3642186 & p\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

In the last step we calculate the *p*-value by calling the `pf()` function in R. Recall how to calculate the degrees of freedom:

\[df = (k-1, n-k)\]

Because the null hypothesis is rejected only when the test statistic,
\(F\), is too large, a one-way ANOVA
test is **always right-tailed**.
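Equivalently to the p-value approach, one may compare the test statistic to the right-tail critical value of the F-distribution, obtained with the base R `qf()` function (a sketch continuing with `alpha`, `k`, `n` and `Fstat` computed above):

```r
# critical value of the F-distribution at significance level alpha
F_crit <- qf(p = alpha, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
F_crit
# reject H0 if the test statistic exceeds the critical value
Fstat > F_crit
```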

```
df1 <- k - 1
df2 <- n - k
p_value <- pf(q = Fstat, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value
```

`## [1] 2.133722e-28`

\(p = 2.1337218\times 10^{-28}\).

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & F = 36.3642186 & 2.1337218\times 10^{-28}\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

**Step 5: If \(p \le \alpha\),
reject \(H_0\); otherwise, do not
reject \(H_0\)**

`p_value <= alpha`

`## [1] TRUE`

The *p*-value is smaller than the specified significance level
of 0.01; we reject \(H_0\). The test
results are statistically significant at the 1 % level and provide very
strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis
test**

At the 1 % significance level the data provides very strong evidence to conclude that at least one pair of group means is different from each other.

We just completed a one-way ANOVA hypothesis test in R manually. Awesome, but now we will redo that example and make use of the R machinery to obtain the same result in just one line of code!

In order to conduct a one-way ANOVA hypothesis test in R we apply the `aov()` function. The `aov()` function expects the so-called *formula notation*. Thus, we provide our data by separating the two variables of interest by `~`. Further, we provide a data frame in which the variables specified in the formula are found.

```
anova <- aov(salary ~ major, data = data)
summary(anova)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## major 5 1.300e+10 2.601e+09 36.36 <2e-16 ***
## Residuals 269 1.924e+10 7.152e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Worked out fine! Compare the output of the `aov()` function to our result above. Again, we may conclude that at the 1 % significance level the data provides very strong evidence to conclude that at least one pair of group means is different from each other.
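The ANOVA only tells us that at least one pair of group means differs, not which pair. A common follow-up is a post-hoc test such as Tukey's Honest Significant Differences, available in base R as `TukeyHSD()` applied to the fitted `aov` object (a sketch using the `anova` object from above):

```r
# pairwise comparisons of all group means
# with family-wise error control at the 1 % level
TukeyHSD(anova, conf.level = 0.99)
```

Pairs whose adjusted p-value falls below 0.01 are those driving the significant ANOVA result.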

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*