One caveat of ANOVA is that if we reject the null hypothesis, we can only state that the population means under consideration are not all the same. We cannot determine which particular means differ, nor how the means relate to one another.
To deal with this issue we apply methods called multiple comparisons. The problem with multiple comparisons is that the more hypotheses are tested on a particular data set, the more likely it becomes to incorrectly reject a null hypothesis. Thus, methods of multiple comparison require a stricter (smaller) significance level \(\alpha\) for the individual comparisons, in order to compensate for the number of inferences being made.
A family of tests is the technical term for a series of tests performed on a data set. The family-wise error rate is the probability of making one or more false discoveries, or Type I errors, when performing multiple hypothesis tests.
Recall that at a significance level of \(\alpha = 0.05\), the probability of making a Type I error is equal to \(0.05\) or \(5 \%\). Consequently, the probability of not making a Type I error is \(1-\alpha = 1 - 0.05 = 0.95\). Moreover, the probability of observing two independent events is the product of their probabilities. Thus, if we conduct two independent tests, the probability of not making a Type I error on either the first or the second test is
\[(1-\alpha) \times (1-\alpha) = (1-\alpha)^2 \text{.}\] If \(\alpha = 0.05\), we get a probability of not making a Type I error on the first and the second test of \((1 - \alpha)^2 = (1-0.05)^2 = 0.95^2 \approx 0.902\).
Written more formally, for a family of \(C\) tests, the probability of not making a Type I error for the whole family is
\[(1-\alpha)^C\text{.}\]
Let us now consider \(C=10\) and \(\alpha = 0.05\). This means we make 10 multiple comparisons on one data set. Then, the probability of not making a Type I error on the family is
\[(1-\alpha)^C=(1-0.05)^{10} \approx 0.599\text{.}\]
Consequently, the probability of making one or more Type I errors on the family of tests is
\[1 - (1-\alpha)^C \text{.}\] For our example, we find
\[1 - (1-\alpha)^C = 1 - (1-0.05)^{10} \approx 0.401 \text{.}\]
Thus, with \(\alpha = 0.05\) for each of the 10 multiple comparisons, the probability of incorrectly rejecting the null hypothesis is 0.401 or 40.1 %.
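The computation above can be reproduced directly in R (a minimal sketch, with \(\alpha\) and \(C\) as in the example):

```r
# Family-wise error rate for a family of C independent tests,
# each conducted at significance level alpha
alpha <- 0.05
C <- 10
1 - (1 - alpha)^C # probability of at least one Type I error
## [1] 0.4012631
```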
Several statistical methods account for this problem. In this section we discuss the Bonferroni correction and the Tukey multiple-comparison method, also known as Tukey’s HSD (honest significant difference) test.
In this section we will redo the example from the previous page. There, we applied a one-way ANOVA to test if the mean annual salary of graduates is different among graduates of different study subjects. However, this time we will make multiple comparisons in order to analyze the relation among all the group means.
Reload the students data set (you may download the students.csv file here):
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
From the students data set we randomly sample 275 graduated students and reduce the data set to the two variables of interest, the categorical variable major and the numeric variable salary. For better readability in the forthcoming analysis we replace the major study subject names with corresponding abbreviations.
graduated <- subset(students, graduated == 1)
n <- 275
data_idx <- sample(x = 1:nrow(graduated), size = n)
data <- graduated[data_idx, c("major", "salary")]
data$major <- as.factor(data$major)
# rename class labels
levels(data$major) <- c(
"Bio", # Biology
"EcFi", # Economics and Finance
"EnS", # Environmental Sciences
"MaSt", # Mathematics and Statistics
"PoS", # Political Science
"SoS" # Social Sciences
)
Further, we conduct a one-way ANOVA hypothesis test in R by applying the aov() function:
anova <- aov(salary ~ major, data = data)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## major 5 1.206e+10 2.413e+09 36.75 <2e-16 ***
## Residuals 269 1.766e+10 6.566e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Bonferroni correction compensates for the increased likelihood of incorrectly rejecting a null hypothesis (making a Type I error) due to multiple comparisons by adjusting the significance level. The adjusted level is
\[\alpha^* = \frac{\alpha}{m}\text{,}\] where \(m\) corresponds to the number of pairwise comparisons, given by
\[m=\frac{k(k-1)}{2}\text{,}\] where \(k\) is the number of levels of the factor, i.e. the categorical classification variable.
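For the example analyzed below, the factor major has \(k = 6\) levels, so the number of pairwise comparisons and the Bonferroni-adjusted significance level can be computed as follows (a minimal sketch):

```r
k <- 6      # number of factor levels (study subjects)
alpha <- 0.05
m <- k * (k - 1) / 2 # number of pairwise comparisons
m
## [1] 15
alpha / m # Bonferroni-adjusted significance level
## [1] 0.003333333
```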
As mentioned in the previous section, a one-way analysis of variance is the generalization of the pooled t-test to more than two populations. Thus, the R function to perform multiple comparisons is called pairwise.t.test(). The pairwise.t.test() function takes a response vector (x), a grouping vector or factor (g) and a specified method for adjusting p-values (p.adjust.method) as arguments. We may type p.adjust.methods in the R console to list all available adjustment methods for that function:
p.adjust.methods
## [1] "holm" "hochberg" "hommel" "bonferroni" "BH"
## [6] "BY" "fdr" "none"
First, we conduct a pairwise t-test without any adjustment, thus increasing the probability of incorrectly rejecting the null hypothesis.
ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "none")
ptt
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$salary and data$major
##
## Bio EcFi EnS MaSt PoS
## EcFi 0.52 - - - -
## EnS 1.2e-13 1.4e-15 - - -
## MaSt 0.91 0.64 3.4e-12 - -
## PoS 1.0e-11 1.9e-13 0.69 1.4e-10 -
## SoS 1.2e-14 2.4e-16 0.19 2.0e-13 0.10
##
## P value adjustment method: none
The pairwise t-test shows that at a significance level of 5 % the means of 6 combinations are not significantly different. These combinations are EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS, SoS-EnS and SoS-PoS, with p-values of 0.521, 0.911, 0.635, 0.687, 0.192 and 0.105, respectively. For the remaining 9 combinations we reject \(H_0\): their means are different at a significance level of 5 %!
Second, we conduct a pairwise t-test with Bonferroni
adjustment. We use the same data as for the pairwise
t-test without adjustment, however this time we specify
p.adj = "bonferroni"
in the function call:
ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "bonferroni")
ptt
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$salary and data$major
##
## Bio EcFi EnS MaSt PoS
## EcFi 1 - - - -
## EnS 1.8e-12 2.1e-14 - - -
## MaSt 1 1 5.2e-11 - -
## PoS 1.6e-10 2.8e-12 1 2.1e-09 -
## SoS 1.8e-13 3.6e-15 1 3.0e-12 1
##
## P value adjustment method: bonferroni
The pairwise t-test with Bonferroni adjustment shows that at a significance level of 5 % the means of the same 6 combinations are not significantly different: EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS, SoS-EnS and SoS-PoS, each with an adjusted p-value of 1. (In the Bonferroni procedure \(\alpha\) is divided by the number of tests or, equivalently, each p-value is multiplied by that number; results greater than 1 are truncated back to 1, since a value larger than 1 would not be a probability.) For the remaining 9 combinations we reject \(H_0\): their means are different at a significance level of 5 %!
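The p-value inflation described above is implemented in base R by the p.adjust() function, which multiplies each p-value by the number of tests and caps the result at 1 (a short sketch with illustrative p-values, not taken from the output above):

```r
p <- c(0.52, 1.2e-13, 0.91) # hypothetical unadjusted p-values
p.adjust(p, method = "bonferroni")
# equivalent to pmin(p * length(p), 1)
```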
Please note that in this example the adjustment does not change the overall conclusion: both without and with Bonferroni adjustment we rejected the null hypothesis for the same 9 combinations.
The Tukey multiple-comparison method, also known as Tukey’s HSD (honest significant difference) test, is based on the studentized range distribution, sometimes referred to as the \(q\)-distribution. The \(q\)-distribution is a right-skewed probability distribution whose shape is described by two parameters, \(\kappa\) and \(\nu\). These parameters are given by
\[\kappa = k\] and \[\nu = n-k\text{,}\]
where \(n\) is the total number of observations and \(k\) the number of groups/classes.
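Quantiles of the \(q\)-distribution are available in base R via the qtukey() function. As a sketch, for the example below with \(k = 6\) groups and \(n = 275\) observations, the 0.95 quantile is obtained by:

```r
k <- 6   # number of groups
n <- 275 # total number of observations
# 0.95 quantile of the studentized range (q) distribution
qtukey(p = 0.95, nmeans = k, df = n - k)
```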
Tukey’s test compares the means of every group to the means of every other group. It obtains the confidence interval for each
\[\mu_i-\mu_j\text{.}\]
If the confidence interval for a pairwise comparison includes 0, \(H_0\) is not rejected and the two means are not considered significantly different. For all other pairs, whose confidence interval does not include 0, \(H_0\) is rejected and the means are significantly different.
In R, Tukey’s honest significant differences are computed with the TukeyHSD() function. The TukeyHSD() function expects a model object, specifically a fitted ANOVA model, as input argument. We already calculated such a model above; however, for illustration purposes we build the same model once again and assign it to the variable my_anova. Further, in order to specify the width of the confidence intervals, we provide the confidence level via the conf.level argument.
my_anova <- aov(salary ~ major, data = data)
alpha <- 0.05
tukey <- TukeyHSD(x = my_anova, conf.level = 1 - alpha)
tukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = salary ~ major, data = data)
##
## $major
## diff lwr upr p adj
## EcFi-Bio 997.0335 -3458.756 5452.823 0.9876679
## EnS-Bio -12376.2875 -16917.819 -7834.756 0.0000000
## MaSt-Bio 191.0597 -4696.611 5078.731 0.9999975
## PoS-Bio -11703.9859 -16427.655 -6980.317 0.0000000
## SoS-Bio -14767.0087 -19955.810 -9578.208 0.0000000
## EnS-EcFi -13373.3210 -17894.755 -8851.887 0.0000000
## MaSt-EcFi -805.9738 -5674.976 4063.028 0.9969709
## PoS-EcFi -12701.0195 -17405.369 -7996.670 0.0000000
## SoS-EcFi -15764.0422 -20935.262 -10592.823 0.0000000
## MaSt-EnS 12567.3472 7619.759 17514.935 0.0000000
## PoS-EnS 672.3015 -4113.339 5457.942 0.9986186
## SoS-EnS -2390.7212 -7636.001 2854.558 0.7803265
## PoS-MaSt -11895.0457 -17010.333 -6779.758 0.0000000
## SoS-MaSt -14958.0684 -20505.750 -9410.386 0.0000000
## SoS-PoS -3063.0227 -8466.771 2340.726 0.5813784
The output table shows the difference between each pair of means, the 95 % confidence intervals and the p-values of the pairwise comparisons. Look carefully at the table and you will see that for all 6 comparisons where the confidence interval includes 0, the p-value is higher than the significance level \(\alpha\). If \(p > \alpha\) we do not reject \(H_0\); thus there is no statistically significant difference between the means of these two groups. In contrast, for all pairs where \(p \le \alpha\), we reject \(H_0\) and state that there is a statistically significant difference between the means of these pairs.
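The pairs for which \(H_0\) is rejected can also be extracted programmatically by filtering the matrix returned by TukeyHSD() on the adjusted p-values. A minimal, self-contained sketch using R's built-in warpbreaks data (the same filtering applies to the tukey object above):

```r
fit <- aov(breaks ~ tension, data = warpbreaks) # built-in example data
tk <- TukeyHSD(fit, conf.level = 0.95)
res <- as.data.frame(tk$tension) # columns: diff, lwr, upr, p adj
res[res$`p adj` < 0.05, ]        # pairs with significantly different means
```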
The TukeyHSD() function also provides a convenient plotting functionality, which visualizes the confidence intervals for each pair:
par(mar = c(4, 6, 4, 4))
plot(tukey, las = 1, col = "brown")
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.