The purpose of a one-way ANOVA hypothesis test is to compare \(k\) population/group means, \(\mu_1, \mu_2, \ldots, \mu_k\). The following assumptions have to be fulfilled in order to apply a one-way ANOVA (Weiss, 2010):

- Simple random samples: the samples taken from the \(k\) populations are simple random samples.
- Independent samples: the samples taken from the \(k\) populations are independent of one another.
- Normal populations: for each population, the variable under consideration is normally distributed.
- Equal standard deviations: the standard deviations of the variable under consideration are the same for all \(k\) populations.

A one-way ANOVA hypothesis test follows the same step-wise procedure as other hypothesis tests:

\[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} & \text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} & \text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]


One-way ANOVA Hypothesis Test: An example

In order to get some hands-on experience we apply the one-way ANOVA hypothesis test in an exercise. For this we import the students data set, which you may also download here.

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
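As a quick sanity check we may confirm these dimensions and variable names directly in R:

dim(students)
names(students)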

In order to showcase the one-way ANOVA hypothesis test we examine the mean annual salary of graduates, grouped by their major study subject. In this example the classification/grouping is based on one variable, the major variable, the so-called factor. Thus, we are conducting a one-way ANOVA. In the following exercise we want to test if the mean annual salary of graduates differs among graduates of different study subjects.


Data exploration and preparation

We start our data analysis with the random sampling process. We randomly sample 275 students from the data set using the sample() function in R. Since we want to sample only from graduated students, we first subset the data with R's subset() function. Further, we reduce our data set to the two variables of interest, the categorical variable major and the numeric variable salary. Then we display the first 10 rows of the data set.

graduated <- subset(students, graduated == 1)

n <- 275
data_idx <- sample(x = 1:nrow(graduated), size = n)
data <- graduated[data_idx, c("major", "salary")]
head(data, 10)
##                           major   salary
## 4029 Mathematics and Statistics 56157.26
## 844             Social Sciences 34136.26
## 7372            Social Sciences 27884.85
## 5228                    Biology 46318.75
## 1682                    Biology 43634.65
## 5720 Mathematics and Statistics 36055.36
## 209           Political Science 30611.51
## 63       Environmental Sciences 32519.22
## 5380      Economics and Finance 45135.45
## 2663      Economics and Finance 69348.61
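Note that sample() draws at random, so rerunning the code yields a different sample than the one shown above. If a reproducible sample is desired, one may fix the state of the random number generator before drawing; a minimal sketch (the seed value is arbitrary and was not used to produce the output above):

set.seed(123)  # arbitrary example seed; any fixed value makes the draw reproducible
data_idx <- sample(x = 1:nrow(graduated), size = n)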

Just as a matter of interest we visualize the counts for each of the 6 different study subjects in our sample by plotting a bar plot. We use the ggplot2 library for plotting:

library(ggplot2)
p <- ggplot(data, aes(major)) +
  geom_bar(width = .7) +
  coord_flip() +
  theme(axis.title.y = element_blank())
p

We can see that the different study subjects are not equally represented in our sample, but that is okay: one-way ANOVA does not require equal group sizes.
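The exact counts per group may also be obtained numerically with the table() function:

# count the number of sampled students per major
table(data$major)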

Before we start with the hypothesis test, we check whether the assumptions for the one-way ANOVA hypothesis test are fulfilled. The samples are random and independent. We can check the normality assumption by plotting a normal probability plot (Q-Q plot) for each grouped variable:

p <- ggplot(data, aes(sample = salary, color = major)) +
  geom_qq() +
  theme_minimal()
p

Well, not perfect, but we may assume that the data for each group fall roughly on a straight line.
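As an optional, more formal complement to the visual check, one could apply the Shapiro-Wilk normality test to each group; this is not part of the original workflow, and the Q-Q plots above remain our primary check:

# optional: Shapiro-Wilk normality test applied to each group
tapply(data$salary, data$major, shapiro.test)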

Next, we check the assumption of equal standard deviations. To do so, we calculate the standard deviation for each group. R provides some excellent functions to calculate statistical parameters across different groupings of a data set; these belong to the apply() function family. For our example we use the tapply() function to calculate the standard deviation for each group.

# calculate standard deviation
sd_groups <- tapply(X = data$salary, INDEX = data$major, FUN = sd)
sd_groups
##                    Biology      Economics and Finance 
##                   8085.088                  10202.188 
##     Environmental Sciences Mathematics and Statistics 
##                   6325.725                   8671.149 
##          Political Science            Social Sciences 
##                   9979.958                   6703.441

As a rule of thumb, we consider the equal standard deviations assumption as fulfilled, if the ratio of the largest to the smallest sample standard deviation is less than 2 (Weiss, 2010).

# calculate ratio of the largest to the smallest sample standard deviation
ratio <- max(as.vector(sd_groups)) / min(as.vector(sd_groups))
ratio
## [1] 1.612809

The ratio of the largest to the smallest sample standard deviation is 1.613, which is below the threshold of 2. Thus, we conclude that all the assumptions are fulfilled.

Note: One-way ANOVA is robust to moderate violations of the normality assumption and the equal standard deviations assumption (Weiss, 2010).
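As a further optional complement, the equal-standard-deviations assumption could be examined formally, for example with Bartlett's test from base R; in this exercise, however, we rely on the rule of thumb above:

# optional: formal test of homogeneity of variances across groups
bartlett.test(salary ~ major, data = data)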


Hypothesis Testing

In order to conduct the one-way ANOVA hypothesis test we follow the step-wise implementation procedure for hypothesis testing.

Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)

The null hypothesis states that the mean annual salary is equal among all groups of graduates:

\[H_0: \quad \mu_1=\mu_2=\mu_3=\mu_4=\mu_5=\mu_6\]

Alternative hypothesis:

\[H_A: \quad \text{Not all the means are equal} \]


Step 2: Decide on the significance level, \(\alpha\)

\[\alpha = 0.01\]

alpha <- 0.01

Steps 3 and 4: Compute the value of the test statistic and the p-value

In order to calculate the \(F\) test statistic, we need to determine several quantities beforehand. For illustration purposes we manually compute the \(F\) test statistic in R. Step by step we will populate the ANOVA table until we finally get the \(F\) test statistic and consequently the p-value.

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \text{Group} & k-1 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \text{Error} & n-k & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline \text{Total} & n-1 & SST & & & \\ \hline \end{array} \]

We start with the first column and calculate the degrees of freedom:

k <- length(unique(data$major))
n <- nrow(data)

df_G <- k - 1
df_G
## [1] 5
df_E <- n - k
df_E
## [1] 269
df_T <- n - 1
df_T
## [1] 274
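As a quick consistency check, the group and error degrees of freedom add up to the total degrees of freedom:

df_G + df_E == df_T
## [1] TRUE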

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \text{Group} & 5 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \text{Error} & 269 & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline \text{Total} & 274 & SST & & & \\ \hline \end{array} \]

Further, we calculate the three sums of squares, \(SST\), \(SSG\) and \(SSE\). Recall the equations from the previous page:

\[SST = \sum_{i=1}^n(x_i-\bar x)^2\text{,}\]

where \(x_i\) corresponds to the observations in the samples and \(\bar x\) to the overall mean of all samples.

# calculate overall mean
x_bar <- mean(data$salary)

# observation data
xi <- data$salary

# calculate SST
SST <- sum((xi - x_bar)^2)
SST
## [1] 32242968219
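As a side note, \(SST\) can equivalently be obtained from the overall sample variance, since \(SST = (n-1)\,s^2\):

# equivalent computation of SST via the overall sample variance
var(xi) * (n - 1)
## [1] 32242968219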

\[SSG = \sum_{j=1}^k n_j(\bar x_j-\bar x)^2\]

Here, \(n_j\) denotes the sample size for group \(j\), \(\bar x_j\) denotes the mean of group \(j\) and \(\bar x\) denotes the overall mean of all samples.

# calculate sample size for each group
n_j <- tapply(X = data$salary, INDEX = data$major, FUN = length)

# calculate mean for each group
xi_bar <- tapply(X = data$salary, INDEX = data$major, FUN = mean)

# calculate SSG
SSG <- sum(n_j * (xi_bar - x_bar)^2)
SSG
## [1] 13003942832

\[SSE = \sum_{j=1}^k (n_j-1)s_j^2\text{,}\]

where \(n_j\) denotes the sample size for group \(j\) and \(s^2_j\) the variance of group \(j\).

# calculate the standard deviation for each group
s_j <- tapply(X = data$salary, INDEX = data$major, FUN = sd)

# calculate SSE
SSE <- sum((n_j - 1) * s_j^2)
SSE
## [1] 19239025388
# alternatively one may calculate SSE this way
SSE <- SST - SSG
SSE
## [1] 19239025388

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \text{Group} & 5 & 1.3003943\times 10^{10} & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \text{Error} & 269 & 1.9239025\times 10^{10} & MSE=\frac{SSE}{n-k} & & \\ \hline \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

Now, we calculate the two measures of mean variability, \(MSG\) and \(MSE\):

# calculate MSG
MSG <- SSG / (k - 1)
MSG
## [1] 2600788566
# calculate MSE
MSE <- SSE / (n - k)
MSE
## [1] 71520540

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & F = \frac{MSG}{MSE} & p\\ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

Finally, we obtain the \(F\)-statistic by the ratio of \(MSG\) and \(MSE\):

Fstat <- MSG / MSE
Fstat
## [1] 36.36422

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & 36.3642186 & p\\ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]

In the last step we calculate the p-value by calling the pf() function in R. Recall how the degrees of freedom are calculated:

\[df = (k-1, n-k)\]

Because the null hypothesis is rejected only when the test statistic, \(F\), is too large, a one-way ANOVA test is always right-tailed.

df1 <- k - 1
df2 <- n - k

p_value <- pf(q = Fstat, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value
## [1] 2.133722e-28
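To illustrate why the test is right-tailed, we may sketch the density of the \(F\) distribution for \(df = (5, 269)\) together with the critical value at the 1 % level; a minimal base-R sketch (the plotting range is chosen for illustration only):

# plot the F density and mark the critical value for alpha = 0.01
curve(df(x, df1 = df1, df2 = df2), from = 0, to = 10,
      xlab = "F", ylab = "Density")
abline(v = qf(p = 1 - alpha, df1 = df1, df2 = df2), lty = 2)  # critical value

The observed test statistic, \(F \approx 36.36\), lies far to the right of the critical value, which explains the extremely small p-value.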

\(p = 2.1337218\times 10^{-28}\).

\[ \begin{array}{|l|c|c|c|c|c|} \hline \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & 36.3642186 & 2.1337218\times 10^{-28}\\ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]


Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\)

p_value <= alpha
## [1] TRUE

The p-value is smaller than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test

At the 1 % significance level, the data provides very strong evidence to conclude that at least one pair of group means differs.


Hypothesis Testing in R

We just completed a one-way ANOVA hypothesis test in R manually. Awesome, but now we will redo that example and make use of the R machinery to obtain the same result in just one line of code!

In order to conduct a one-way ANOVA hypothesis test in R we apply the aov() function. The aov() function expects the so-called formula notation: the response variable and the grouping variable are separated by ~. Further, we provide the data frame in which the variables specified in the formula are found.

anova <- aov(salary ~ major, data = data)
summary(anova)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## major         5 1.300e+10 2.601e+09   36.36 <2e-16 ***
## Residuals   269 1.924e+10 7.152e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Worked out fine! Compare the output of the aov() function to our result above. Again, at the 1 % significance level the data provides very strong evidence that at least one pair of group means differs.
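If we want to compare the numbers programmatically, we may extract the \(F\) value and the p-value from the summary object; summary() applied to an aov object returns a list whose first element is the ANOVA table:

# extract the F statistic and the p-value from the ANOVA table
anova_table <- summary(anova)[[1]]
anova_table[1, "F value"]  # compare with Fstat computed manually
anova_table[1, "Pr(>F)"]   # compare with p_value computed manually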


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.