A caveat of ANOVA is that if we reject the null hypothesis, we only conclude that the means of the populations under consideration are not all the same. The test does not tell us which particular means differ, nor how the means relate to one another.

In order to deal with that issue we apply methods called multiple comparisons. The problem with multiple comparisons is that the more hypotheses are tested on a particular data set, the more likely it is that at least one null hypothesis is incorrectly rejected. Thus, multiple-comparison methods apply a stricter (lower) significance threshold (\(\alpha\)) to each individual comparison, in order to compensate for the number of inferences being made.


The Family-wise error rate

A family of tests is the technical term for a series of tests performed on a data set. The family-wise error rate is the probability of making one or more false discoveries, or Type I errors, when performing multiple hypothesis tests.

Recall that at a significance level of \(\alpha = 0.05\), the probability of making a Type I error is equal to \(0.05\) or \(5 \%\). Consequently, the probability of not making a Type I error is \(1-\alpha = 1 - 0.05 = 0.95\). Moreover, the probability of observing two independent events is the product of their probabilities. Thus, if we conduct two independent tests, the probability of not making a Type I error on both the first and the second test is

\[(1-\alpha) \times (1-\alpha) = (1-\alpha)^2 \text{.}\] If \(\alpha = 0.05\), we get a probability of not making a Type I error on the first and the second test of \((1 - \alpha)^2 = (1-0.05)^2 = 0.95^2 \approx 0.902\).

Written more formally, for a family of \(C\) tests, the probability of not making a Type I error for the whole family is

\[(1-\alpha)^C\text{.}\]

Let us now consider \(C=10\) and \(\alpha = 0.05\). This means we make 10 multiple comparisons on one data set. Then, the probability of not making a Type I error on the family is

\[(1-\alpha)^C=(1-0.05)^{10} \approx 0.599\text{.}\]

Consequently, the probability of making one or more Type I errors on the family of tests is

\[1 - (1-\alpha)^C \text{.}\] For our example, we find

\[1 - (1-\alpha)^C = 1 - (1-0.05)^{10} \approx 0.401 \text{.}\]

Thus, with \(\alpha = 0.05\) for each of the 10 multiple comparisons, the probability of incorrectly rejecting the null hypothesis is 0.401 or 40.1 %.
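The calculation above is easy to reproduce in R. The sketch below (the variable names are our own) computes the family-wise error rate for \(C = 1, \dots, 10\) tests at \(\alpha = 0.05\):

```r
alpha <- 0.05
C <- 1:10

# probability of making one or more Type I errors in a family of C tests
fwer <- 1 - (1 - alpha)^C
round(fwer, 3)
## [1] 0.050 0.098 0.143 0.185 0.226 0.265 0.302 0.337 0.370 0.401
```

The last entry reproduces the 40.1 % computed above; note how quickly the family-wise error rate grows with the number of tests.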

Several statistical methods account for this problem. In this section we discuss the Bonferroni correction and the Tukey Multiple-Comparison Method, also known as Tukey’s HSD (honest significant difference) test.


Example data

In this section we will redo the example from the previous page. There, we applied a one-way ANOVA to test if the mean annual salary of graduates is different among graduates of different study subjects. However, this time we will make multiple comparisons in order to analyze the relation among all the group means.

Reload the students data set (the students.csv file is available at the URL used below):

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

From the students data set we randomly sample 275 graduated students and reduce the data set to the two variables of interest, the categorical variable major and the numeric variable salary. For better readability in the forthcoming analysis we replace the major study subject names with corresponding abbreviations.

graduated <- subset(students, graduated == 1)

n <- 275
data_idx <- sample(x = 1:nrow(graduated), size = n)
data <- graduated[data_idx, c("major", "salary")]
data$major <- as.factor(data$major)
# rename class labels
levels(data$major) <- c(
  "Bio",  # Biology
  "EcFi", # Economics and Finance
  "EnS",  # Environmental Sciences
  "MaSt", # Mathematics and Statistics
  "PoS",  # Political Science
  "SoS"   # Social Sciences
)

Further, we conduct a one-way ANOVA hypothesis test in R, by applying the aov() function:

anova <- aov(salary ~ major, data = data)
summary(anova)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## major         5 1.206e+10 2.413e+09   36.75 <2e-16 ***
## Residuals   269 1.766e+10 6.566e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bonferroni correction

The Bonferroni correction compensates the increased likelihood of incorrectly rejecting a null hypothesis (making a Type I error) due to multiple comparisons by adjusting the significance level \(\alpha\) in the form of

\[\alpha^* = \frac{\alpha}{m}\text{,}\] where \(\alpha^*\) is the adjusted significance level and \(m\) corresponds to the number of pairwise comparisons, given by

\[m=\frac{k(k-1)}{2}\text{,}\] where \(k\) corresponds to the number of levels of the factor, which is the categorical classification variable.
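For our example with \(k = 6\) study majors, the corrected significance level can be computed directly (a small sketch; the variable names are our own):

```r
alpha <- 0.05
k <- 6                # number of factor levels (study majors)

m <- k * (k - 1) / 2  # number of pairwise comparisons
m
## [1] 15

alpha / m             # Bonferroni-corrected significance level
## [1] 0.003333333
```

Thus, each of the 15 pairwise tests is evaluated at a significance level of roughly 0.0033 instead of 0.05.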


The pairwise t-Test in R: Bonferroni correction

As mentioned in the previous section, a one-way analysis of variance is the generalization of the pooled t-test to more than two populations. Accordingly, the R function to perform multiple comparisons is called pairwise.t.test(). The pairwise.t.test() function takes a response vector (x), a grouping vector or factor (g) and a method for adjusting p-values (p.adjust.method) as arguments. We may type p.adjust.methods in the R console to list all available adjustment methods for that function:

p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

First, we conduct a pairwise t-test without any adjustment, thus increasing the probability of incorrectly rejecting the null hypothesis.

ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "none")
ptt
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$salary and data$major 
## 
##      Bio     EcFi    EnS     MaSt    PoS 
## EcFi 0.52    -       -       -       -   
## EnS  1.2e-13 1.4e-15 -       -       -   
## MaSt 0.91    0.64    3.4e-12 -       -   
## PoS  1.0e-11 1.9e-13 0.69    1.4e-10 -   
## SoS  1.2e-14 2.4e-16 0.19    2.0e-13 0.10
## 
## P value adjustment method: none

The pairwise t-test shows that, at a significance level of 5 %, the means of 6 combinations are not statistically significantly different. These combinations are EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS, SoS-EnS and SoS-PoS, with p-values of 0.521, 0.911, 0.635, 0.687, 0.192 and 0.105, respectively. For the remaining 9 combinations we reject \(H_0\); thus, their means differ at a significance level of 5 %.

Second, we conduct a pairwise t-test with Bonferroni adjustment. We use the same data as for the pairwise t-test without adjustment, however this time we specify p.adj = "bonferroni" in the function call:

ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "bonferroni")
ptt
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$salary and data$major 
## 
##      Bio     EcFi    EnS     MaSt    PoS
## EcFi 1       -       -       -       -  
## EnS  1.8e-12 2.1e-14 -       -       -  
## MaSt 1       1       5.2e-11 -       -  
## PoS  1.6e-10 2.8e-12 1       2.1e-09 -  
## SoS  1.8e-13 3.6e-15 1       3.0e-12 1  
## 
## P value adjustment method: bonferroni

The pairwise t-test with Bonferroni adjustment shows that, at a significance level of 5 %, the means of the same 6 combinations are not statistically significantly different: EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS, SoS-EnS and SoS-PoS, each with an adjusted p-value of 1. (In the Bonferroni procedure \(\alpha\) is divided by the number of tests or, equivalently, each p-value is multiplied by that number and truncated to 1 if the result exceeds 1, since it would otherwise not be a probability.) For the remaining 9 combinations we reject \(H_0\); thus, their means differ at a significance level of 5 %.
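The adjustment itself can be inspected with R's p.adjust() function, which pairwise.t.test() uses internally. As a small sketch, we apply it to three of the unadjusted p-values from the first table, with the n argument set to the total number of comparisons:

```r
# unadjusted p-values for EcFi-Bio, EnS-Bio and SoS-PoS from the table above
p_raw <- c(0.52, 1.2e-13, 0.10)

# Bonferroni: multiply each p-value by n = 15 and truncate at 1
p.adjust(p_raw, method = "bonferroni", n = 15)
# -> 1, 1.8e-12, 1 (values greater than 1 are truncated to 1)
```

This reproduces the pattern seen in the adjusted table: very small p-values remain highly significant, while the borderline ones are pushed up to 1.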

Please note that in this example both procedures lead to the same conclusion: with and without the Bonferroni adjustment we reject the null hypothesis for the same 9 combinations. In general, however, the Bonferroni correction is more conservative and may lead to fewer rejections.


Tukey Multiple-Comparison Method

The Tukey Multiple-Comparison Method, also known as Tukey’s HSD (honest significant difference) test, is based on the studentized range distribution, sometimes referred to as the \(q\)-distribution. The \(q\)-distribution is a right-skewed probability distribution with two parameters, \(\kappa\) and \(\nu\), describing its shape. These parameters are given by

\[\kappa = k\] and \[\nu = n-k\text{,}\]

where \(n\) is the total number of observations and \(k\) the number of groups/classes.
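R provides the studentized range distribution through the ptukey() and qtukey() functions. As a sketch, the critical value of the \(q\)-distribution for our example (\(k = 6\) groups and \(\nu = 275 - 6 = 269\) degrees of freedom) at a confidence level of 95 % can be obtained as follows:

```r
k <- 6          # number of groups
nu <- 275 - 6   # degrees of freedom, n - k

# 95 % quantile of the studentized range distribution
qtukey(p = 0.95, nmeans = k, df = nu)
```

Pairs of group means whose studentized range statistic exceeds this quantile are declared significantly different by Tukey's test.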

Tukey’s test compares the mean of every group with the mean of every other group. It obtains a confidence interval for each difference

\[\mu_i-\mu_j\text{.}\]

If the confidence interval for a pairwise comparison includes 0, \(H_0\) is not rejected and the two means are not considered significantly different. For all pairs whose confidence interval does not include 0, \(H_0\) is rejected: these means are significantly different.
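This decision rule is easy to apply programmatically. The sketch below uses R's built-in PlantGrowth data set (not the students data) to extract the pairs whose confidence interval includes 0:

```r
# Tukey's HSD on R's built-in PlantGrowth data (3 groups: ctrl, trt1, trt2)
tk <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
res <- tk$group

# pairs whose confidence interval includes 0: H0 is not rejected
rownames(res)[res[, "lwr"] < 0 & res[, "upr"] > 0]
## [1] "trt1-ctrl" "trt2-ctrl"
```

For this data set only the trt2-trt1 interval excludes 0, so only that pair of means is declared significantly different.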


Tukey’s honest significant difference test in R

In R, Tukey’s honest significant differences are computed with the TukeyHSD() function. The TukeyHSD() function expects a model object, specifically a fitted ANOVA model, as its input argument. We already calculated such a model above; however, for illustration purposes we build the same model once again and assign it to a variable called my_anova. Further, in order to specify the width of the confidence intervals, we provide the confidence level via the conf.level argument.

my_anova <- aov(salary ~ major, data = data)
alpha <- 0.05

tukey <- TukeyHSD(x = my_anova, conf.level = 1 - alpha)
tukey
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = salary ~ major, data = data)
## 
## $major
##                  diff        lwr        upr     p adj
## EcFi-Bio     997.0335  -3458.756   5452.823 0.9876679
## EnS-Bio   -12376.2875 -16917.819  -7834.756 0.0000000
## MaSt-Bio     191.0597  -4696.611   5078.731 0.9999975
## PoS-Bio   -11703.9859 -16427.655  -6980.317 0.0000000
## SoS-Bio   -14767.0087 -19955.810  -9578.208 0.0000000
## EnS-EcFi  -13373.3210 -17894.755  -8851.887 0.0000000
## MaSt-EcFi   -805.9738  -5674.976   4063.028 0.9969709
## PoS-EcFi  -12701.0195 -17405.369  -7996.670 0.0000000
## SoS-EcFi  -15764.0422 -20935.262 -10592.823 0.0000000
## MaSt-EnS   12567.3472   7619.759  17514.935 0.0000000
## PoS-EnS      672.3015  -4113.339   5457.942 0.9986186
## SoS-EnS    -2390.7212  -7636.001   2854.558 0.7803265
## PoS-MaSt  -11895.0457 -17010.333  -6779.758 0.0000000
## SoS-MaSt  -14958.0684 -20505.750  -9410.386 0.0000000
## SoS-PoS    -3063.0227  -8466.771   2340.726 0.5813784

The table shows the difference between each pair of means, the 95 % confidence intervals and the p-values of the pairwise comparisons. Look carefully at the table and you will see that for all 6 comparisons where the confidence interval includes 0, the p-value is higher than the significance level \(\alpha\). If \(p > \alpha\), we do not reject \(H_0\); thus there is no statistically significant difference between the means of the two groups. In contrast, for all pairs where \(p \le \alpha\), we reject \(H_0\) and state that there is a statistically significant difference between the means of the pair.

The TukeyHSD() function also provides a nice plotting functionality, which visualizes the confidence intervals for each pair:

par(mar = c(4, 6, 4, 4))
plot(tukey, las = 1, col = "brown")


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.