The purpose of a one-way ANOVA hypothesis test is to compare \(k\) population/group means, \(\mu_1, \mu_2, \ldots, \mu_k\). The following assumptions have to be fulfilled in order to apply a one-way ANOVA (Weiss, 2010):
1. Simple random samples: The samples taken from the \(k\) populations are simple random samples.
2. Independent samples: The samples taken from the \(k\) populations are independent of one another.
3. Normal populations: For each population, the variable under consideration is normally distributed.
4. Equal standard deviations: The standard deviations of the variable under consideration are the same for all \(k\) populations.
A one-way ANOVA hypothesis test follows the same step-wise procedure as other hypothesis tests:
\[ \begin{array}{l} \hline \ \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \ \text{Step 4} &\text{Determine the p-value.} \\ \ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]
In order to get some hands-on experience, we apply the one-way ANOVA hypothesis test in an exercise. For this we import the students data set from its online source:
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
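As a quick sanity check (an optional addition, not part of the original workflow), we may confirm these dimensions directly in R:
# should return the 8239 rows and 16 columns described above
dim(students)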
In order to showcase the one-way ANOVA hypothesis test, we examine the mean annual salary of graduates, grouped by their major study subject. In this example the classification/grouping is done based on one variable, the major variable, the so-called factor. Thus, we are conducting a one-way ANOVA. In the following exercise we want to test whether the mean annual salary of graduates differs among graduates of different study subjects.
We start our data analysis with the random sampling process. We randomly sample 275 students from the data set using the sample() function in R. We want to make sure to sample only from graduated students, so beforehand we subset the data with R's subset() function. Further, we reduce our data set to the two variables of interest, the categorical variable major and the numeric variable salary. Then we display the first 10 rows of the data set.
# keep only students who already graduated
graduated <- subset(students, graduated == 1)
# draw a random sample of 275 students
n <- 275
data_idx <- sample(x = 1:nrow(graduated), size = n)
# reduce to the two variables of interest
data <- graduated[data_idx, c("major", "salary")]
head(data, 10)
## major salary
## 4029 Mathematics and Statistics 56157.26
## 844 Social Sciences 34136.26
## 7372 Social Sciences 27884.85
## 5228 Biology 46318.75
## 1682 Biology 43634.65
## 5720 Mathematics and Statistics 36055.36
## 209 Political Science 30611.51
## 63 Environmental Sciences 32519.22
## 5380 Economics and Finance 45135.45
## 2663 Economics and Finance 69348.61
Just as a matter of interest, we visualize the counts for each of the 6 different study subjects in our sample by plotting a bar plot. We use the ggplot2 library for plotting:
library(ggplot2)
p <- ggplot(data, aes(major)) +
  geom_bar(width = .7) +
  coord_flip() +
  theme(axis.title.y = element_blank())
p
We can see that the different study subjects are not equally distributed in our sample, but that is okay.
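If we are interested in the exact counts behind the bar plot (an optional addition to the original workflow), a simple tabulation suffices:
# count the number of sampled students per major
table(data$major)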
Before we start with the hypothesis test, we check whether the assumptions for the one-way ANOVA hypothesis test are fulfilled. The samples are random and independent. We can check the normality assumption by plotting a normal probability plot (Q-Q plot) of the salary variable for each group:
p <- ggplot(data, aes(sample = salary, color = major)) +
  stat_qq() +
  theme_minimal()
p
Well, not perfect, but we may assume the data for each group falls roughly on a straight line.
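As a complementary, more formal check (an optional addition; the original assessment relies on the Q-Q plots alone), we may apply the Shapiro-Wilk normality test to each group:
# Shapiro-Wilk normality test p-value for each group;
# large p-values are consistent with the normality assumption
tapply(X = data$salary, INDEX = data$major, FUN = function(x) shapiro.test(x)$p.value)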
Next, we test the assumption of equal standard deviations. To do so, we calculate the standard deviation for each group. The programming language R provides some excellent functions to calculate statistical parameters across different groupings of a data set. These functions belong to the apply() function family. For our example we use the tapply() function to calculate the standard deviation for each group.
# calculate standard deviation
sd_groups <- tapply(X = data$salary, INDEX = data$major, FUN = sd)
sd_groups
## Biology Economics and Finance
## 8085.088 10202.188
## Environmental Sciences Mathematics and Statistics
## 6325.725 8671.149
## Political Science Social Sciences
## 9979.958 6703.441
As a rule of thumb, we consider the equal standard deviations assumption as fulfilled if the ratio of the largest to the smallest sample standard deviation is less than 2 (Weiss, 2010).
# calculate ratio of the largest to the smallest sample standard deviation
ratio <- max(as.vector(sd_groups)) / min(as.vector(sd_groups))
ratio
## [1] 1.612809
The ratio of the largest to the smallest sample standard deviation is 1.613. That is close to the threshold of 2, but for our toy example still acceptable. Thus, we conclude that all the assumptions are fulfilled.
Note: One-way ANOVA is robust to moderate violations of the normality assumption and the equal standard deviations assumption (Weiss, 2010).
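If a more formal assessment is desired (an optional addition beyond the rule of thumb applied above), Bartlett's test of equal variances is available in base R; and Welch's one-way ANOVA, which does not assume equal standard deviations, offers a robust alternative:
# Bartlett's test of the null hypothesis of equal variances across groups
bartlett.test(salary ~ major, data = data)
# Welch's one-way ANOVA: does not require equal standard deviations
oneway.test(salary ~ major, data = data)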
In order to conduct the one-way ANOVA hypothesis test we follow the step-wise implementation procedure for hypothesis testing.
Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)
The null hypothesis states that the mean annual salary is equal among all groups of graduates:
\[H_0: \quad \mu_1=\mu_2=\mu_3=\mu_4=\mu_5=\mu_6\]
Alternative hypothesis:
\[H_A: \quad \text{Not all the means are equal} \]
Step 2: Decide on the significance level, \(\alpha\)
\[\alpha = 0.01\]
alpha <- 0.01
Step 3 and 4: Compute the value of the test statistic and the p-value
In order to calculate the \(F\) test statistic, we need to determine several quantities beforehand. For illustration purposes we manually compute the \(F\) test statistic in R. Step by step we will populate the ANOVA table until we finally get the \(F\) test statistic and consequently the p-value.
\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & k-1 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & n-k & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline \ \text{Total} & n-1 & SST & & & \\ \hline \end{array} \]
We start with the first column and calculate the degrees of freedom:
# number of groups
k <- length(unique(data$major))
# total number of observations
n <- nrow(data)
df_G <- k - 1
df_G
## [1] 5
df_E <- n - k
df_E
## [1] 269
df_T <- n - 1
df_T
## [1] 274
\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & 269 & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline \ \text{Total} & 274 & SST & & & \\ \hline \end{array} \]
Further, we calculate the three sums of squares, \(SST\), \(SSG\) and \(SSE\). Recall the equations from the previous page:
\[SST = \sum_{i=1}^n(x_i-\bar x)^2\text{,}\]
where \(x_i\) corresponds to the observations in the samples and \(\bar x\) to the overall mean of all samples.
# calculate overall mean
x_bar <- mean(data$salary)
# observation data
xi <- data$salary
# calculate SST
SST <- sum((xi - x_bar)^2)
SST
## [1] 32242968219
\[SSG = \sum_{j=1}^k n_j(\bar x_j-\bar x)^2\]
Here, \(n_j\) denotes the sample size for group \(j\), \(\bar x_j\) denotes the mean of group \(j\) and \(\bar x\) denotes the overall mean of all samples.
# calculate sample size for each group
n_j <- tapply(X = data$salary, INDEX = data$major, FUN = length)
# calculate mean for each group
xi_bar <- tapply(X = data$salary, INDEX = data$major, FUN = mean)
# calculate SSG
SSG <- sum(n_j * (xi_bar - x_bar)^2)
SSG
## [1] 13003942832
\[SSE = \sum_{j=1}^k (n_j-1)s_j^2\text{,}\]
where \(n_j\) denotes the sample size for group \(j\) and \(s^2_j\) the variance of group \(j\).
# calculate standard deviation for each group
s_j <- tapply(X = data$salary, INDEX = data$major, FUN = sd)
# calculate SSE
SSE <- sum((n_j - 1) * s_j^2)
SSE
## [1] 19239025388
# alternatively one may calculate SSE this way
SSE <- SST - SSG
SSE
## [1] 19239025388
\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & MSE=\frac{SSE}{n-k} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]
Now, we calculate the two measures of mean variability, \(MSG\) and \(MSE\):
# calculate MSG
MSG <- SSG / (k - 1)
MSG
## [1] 2600788566
# calculate MSE
MSE <- SSE / (n - k)
MSE
## [1] 71520540
\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & F = \frac{MSG}{MSE} & p\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]
Finally, we obtain the \(F\)-statistic by the ratio of \(MSG\) and \(MSE\):
Fstat <- MSG / MSE
Fstat
## [1] 36.36422
\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & 36.3642186 & p\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]
In the last step we calculate the p-value by calling the pf() function in R. Recall how to calculate the degrees of freedom:
\[df = (k-1, n-k)\]
Because the null hypothesis is rejected only when the test statistic, \(F\), is too large, a one-way ANOVA test is always right-tailed.
df1 <- k - 1
df2 <- n - k
p_value <- pf(q = Fstat, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value
## [1] 2.133722e-28
\(p = 2.1337218\times 10^{-28}\).
\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline \ \text{Group} & 5 & 1.3003943\times 10^{10} & 2.6007886\times 10^{9} & 36.3642186 & 2.1337218\times 10^{-28}\\ \ \text{Error} & 269 & 1.9239025\times 10^{10} & 7.152054\times 10^{7} & & \\ \hline \ \text{Total} & 274 & 3.2242968\times 10^{10} & & & \\ \hline \end{array} \]
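Equivalently (an optional cross-check, not part of the step-wise procedure above), we may compare the \(F\)-statistic against the critical value of the \(F\) distribution at the chosen significance level:
# critical value of the F distribution for alpha = 0.01
f_crit <- qf(p = 1 - alpha, df1 = df1, df2 = df2)
f_crit
# reject H0 if the test statistic exceeds the critical value
Fstat > f_crit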
Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\)
p_value <= alpha
## [1] TRUE
The p-value is smaller than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level the data provides very strong evidence to conclude that at least one pair of group means is different from each other.
We just completed a one-way ANOVA hypothesis test in R manually. Awesome, but now we will redo that example and make use of the R machinery to obtain the same result in just one line of code!
In order to conduct a one-way ANOVA hypothesis test in R, we apply the aov() function. The aov() function expects the so-called formula notation: we provide our data by separating the two variables of interest with ~. Further, we provide a data frame in which the variables specified in the formula are found.
anova <- aov(salary ~ major, data = data)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## major 5 1.300e+10 2.601e+09 36.36 <2e-16 ***
## Residuals 269 1.924e+10 7.152e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Worked out fine! Compare the output of the aov() function to our result above. Again, we may conclude that at the 1 % significance level the data provides very strong evidence that at least one pair of group means is different from each other.
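Note that the ANOVA only tells us that not all group means are equal; it does not identify which pairs differ. A natural follow-up (an optional addition beyond the scope of this example) is Tukey's honestly significant difference procedure, which compares all pairs of group means while controlling the family-wise error rate:
# pairwise comparisons of all group means based on the fitted aov object
TukeyHSD(anova)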
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.