A caveat of ANOVA is that rejecting the null hypothesis only tells us that the means of the populations under consideration are not all the same. It does not tell us which particular means differ, nor how the means relate to one another.

In order to deal with that issue we apply methods called multiple comparisons. The problem with multiple comparisons is that the more hypotheses are tested on a particular data set, the more likely it is to incorrectly reject the null hypothesis. Thus, methods of multiple comparison apply a more stringent (lower) significance level (\(\alpha\)) to each individual comparison, in order to compensate for the number of inferences being made.


The Family-wise error rate

A family of tests is the technical term for a series of tests performed on a data set. The family-wise error rate is the probability of making one or more false discoveries, or Type I errors, when performing multiple hypothesis tests.

Recall that at a significance level of \(\alpha = 0.05\), the probability of making a Type I error is equal to \(0.05\) or \(5\)%. Consequently, the probability of not making a Type I error is \(1-\alpha = 1 - 0.05 = 0.95\). Moreover, the probability of observing two events that are independent is the product of their probabilities. Thus, if we conduct two independent tests, the probability of not making a Type I error on the first and the second tests is

\[(1-\alpha) \times (1-\alpha) = (1-\alpha)^2\] If \(\alpha = 0.05\), we get a probability of not making a Type I error on the first and the second test of \((1 - \alpha)^2 = (1-0.05)^2 = 0.95^2 \approx 0.902\).

Written more formally, for a family of \(C\) tests, the probability of not making a Type I error for the whole family is

\[(1-\alpha)^C\text{.}\]

Let us now consider \(C=10\) and \(\alpha = 0.05\). Thus, we make 10 multiple comparisons on one data set. Then, the probability of not making a Type I error on the family is

\[(1-\alpha)^C=(1-0.05)^{10} \approx 0.599\] Consequently, the probability of making one or more Type I errors on the family of tests is

\[1 - (1-\alpha)^C\] For our example, we find

\[1 - (1-\alpha)^C = 1 - (1-0.05)^{10} \approx 0.401\]

Thus, with \(\alpha = 0.05\) for each of the 10 multiple comparisons, the probability of incorrectly rejecting the null hypothesis is 0.401 or 40.1%.
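The arithmetic above is easy to reproduce in R. The following sketch (variable names are our own) computes the family-wise error rate for a growing number of tests in the family:

```r
alpha <- 0.05
C <- c(1, 2, 5, 10, 20)    # number of tests in the family

# probability of making at least one Type I error in the family
fwer <- 1 - (1 - alpha)^C
round(fwer, 3)
# for C = 10 this reproduces the 0.401 from above
```

Note how quickly the family-wise error rate grows: with 20 tests it already exceeds 60%.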

Several statistical methods exist to account for this problem. In this section we discuss the Bonferroni correction and the Tukey Multiple-Comparison Method, also known as Tukey’s HSD (honest significant difference) test.


Example data

In this section we redo the example from the previous section. There we applied a one-way ANOVA to test whether the mean annual salary of graduates differs among graduates of different study subjects. This time, however, we make multiple comparisons in order to analyse the relations among all the group means.

Reload the students data set (you may download the students.csv file here).

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

From the students data set we randomly sample 275 graduated students and reduce the data set to the two variables of interest, the categorical variable major and the numeric variable salary. For better readability in the forthcoming analysis we replace the major study subject names with corresponding abbreviations.

graduated <- subset(students, graduated == 1)

n <- 275
data.idx <- sample(x = 1:nrow(graduated), size = n)
data <- graduated[data.idx, c('major', 'salary')]
# ensure major is a factor (since R 4.0.0 read.csv no longer
# converts strings to factors by default)
data$major <- factor(data$major)
# rename class labels
levels(data$major) <- c('Bio',     # Biology
                        'EcFi',    # Economics and Finance
                        'EnS',     # Environmental Sciences
                        'MaSt',    # Mathematics and Statistics
                        'PoS',     # Political Science
                        'SoS')     # Social Sciences

Further, we conduct a one-way ANOVA hypothesis test in R by applying the aov() function.

anova <- aov(salary ~ major, data = data)
summary(anova)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## major         5 1.178e+10 2.357e+09   30.86 <2e-16 ***
## Residuals   269 2.054e+10 7.636e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bonferroni correction

The Bonferroni correction compensates the increased likelihood of incorrectly rejecting a null hypothesis (making a Type I error) due to multiple comparisons by adjusting the significance level \(\alpha\) in the form of

\[\alpha^* = \frac{\alpha}{m}\text{,}\] where \(\alpha^*\) is the adjusted significance level applied to each individual comparison and \(m\) corresponds to the number of pairwise comparisons, given by

\[m=\frac{k(k-1)}{2}\text{,}\] where \(k\) corresponds to the number of levels of the factor, which is the categorical classification variable.
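For our example with \(k = 6\) study subjects this works out as follows (a quick sketch in R; the variable names are our own):

```r
alpha <- 0.05
k <- 6                   # number of factor levels (study subjects)
m <- k * (k - 1) / 2     # number of pairwise comparisons
alpha.adj <- alpha / m   # Bonferroni-adjusted significance level

m           # 15 pairwise comparisons
alpha.adj   # roughly 0.0033
```

Thus each of the 15 pairwise tests must reach a p-value below roughly 0.0033 to be declared significant at a family-wise level of 5%.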


The pairwise t-Test in R: Bonferroni correction

As mentioned in the previous section, a one-way analysis of variance is the generalization of the pooled t-test to more than two populations. Accordingly, the R function to perform multiple comparisons is called pairwise.t.test(). The pairwise.t.test() function takes as arguments a response vector (x), a grouping vector or factor (g) and a specified method for adjusting p-values (p.adjust.method). We may type p.adjust.methods in the R console to list all available adjustment methods for that function.

p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

First, we conduct a pairwise t-test without any adjustment, thus increasing the probability of incorrectly rejecting the null hypothesis.

ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "none")
ptt
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$salary and data$major 
## 
##      Bio     EcFi    EnS     MaSt    PoS   
## EcFi 0.6625  -       -       -       -     
## EnS  7.0e-11 1.2e-09 -       -       -     
## MaSt 0.3825  0.2408  1.2e-07 -       -     
## PoS  6.4e-10 8.1e-09 0.6388  7.4e-07 -     
## SoS  < 2e-16 5.0e-16 0.0052  4.8e-14 0.0011
## 
## P value adjustment method: none

The pairwise t-test shows that at a significance level of 5% the means of 4 combinations are not statistically significantly different. These combinations are EcFi-Bio, MaSt-Bio, MaSt-EcFi and PoS-EnS, with p-values of 0.662, 0.383, 0.241 and 0.639, respectively. For the remaining 11 combinations we reject \(H_0\); thus, for 11 combinations the means are different at a significance level of 5%!

Second, we conduct a pairwise t-test with the Bonferroni adjustment. We use the same data as for the pairwise t-test without adjustment, however, this time we specify p.adj = "bonferroni" in the function call.

ptt <- pairwise.t.test(x = data$salary, g = data$major, p.adj = "bonferroni")
ptt
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$salary and data$major 
## 
##      Bio     EcFi    EnS     MaSt    PoS  
## EcFi 1.000   -       -       -       -    
## EnS  1.1e-09 1.8e-08 -       -       -    
## MaSt 1.000   1.000   1.7e-06 -       -    
## PoS  9.6e-09 1.2e-07 1.000   1.1e-05 -    
## SoS  < 2e-16 7.4e-15 0.079   7.2e-13 0.016
## 
## P value adjustment method: bonferroni

The pairwise t-test with the Bonferroni adjustment shows that at a significance level of 5% the means of 5 combinations are not statistically significantly different. These combinations are EcFi-Bio, MaSt-Bio, MaSt-EcFi, PoS-EnS and SoS-EnS, with p-values of 1, 1, 1, 1 and 0.079, respectively. (In the Bonferroni procedure \(\alpha\) is divided by the number of tests or, equivalently, each p-value is multiplied by that number and truncated to 1 if the result exceeds 1, since it would otherwise no longer be a probability.) For the remaining 10 combinations we reject \(H_0\); thus, for 10 combinations the means are different at a significance level of 5%!

Please note that without a p-value adjustment we rejected the null hypothesis for 11 combinations, whereas with the Bonferroni adjustment we rejected it for only 10 combinations.
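We can verify these counts directly from the p-values printed in the first table. The snippet below is self-contained: it hard-codes the 15 unadjusted p-values from the output above (with < 2e-16 replaced by its upper bound, 2e-16) and applies R's p.adjust() function with method = "bonferroni", which multiplies each p-value by the number of tests and truncates at 1:

```r
# unadjusted p-values from the pairwise t-test output above
p <- c(0.6625, 7.0e-11, 1.2e-09, 0.3825, 0.2408,
       1.2e-07, 6.4e-10, 8.1e-09, 0.6388, 7.4e-07,
       2e-16, 5.0e-16, 0.0052, 4.8e-14, 0.0011)
alpha <- 0.05

sum(p < alpha)                                    # 11 rejections, unadjusted
sum(p.adjust(p, method = "bonferroni") < alpha)   # 10 rejections, adjusted
```

The one comparison that changes its verdict is SoS-EnS: its unadjusted p-value of 0.0052 is below 0.05, but after multiplication by 15 it becomes 0.079 and is no longer significant.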


Tukey Multiple-Comparison Method

The Tukey Multiple-Comparison Method, also known as Tukey’s HSD (honest significant difference) test is based on the studentized range distribution, sometimes referred to as the \(q\)-distribution. The \(q\)-distribution is a right-skewed probability density curve, with two parameters, \(\kappa\) and \(\nu\), describing its shape. These parameters are given by

\[\kappa = k\] and \[\nu = n-k\text{,}\]

where \(n\) is the total number of observations and \(k\) the number of groups/classes.
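In R the studentized range distribution is available via the ptukey() and qtukey() functions. For our example with \(k = 6\) groups and \(n = 275\) observations, the 95% quantile of the \(q\)-distribution can be obtained as follows (a sketch; the TukeyHSD() function used below performs this step internally):

```r
k  <- 6        # number of groups
n  <- 275      # total number of observations
nu <- n - k    # residual degrees of freedom

# critical value of the studentized range distribution
q.crit <- qtukey(p = 0.95, nmeans = k, df = nu)
q.crit   # roughly 4.0
```

Any pair of group means whose studentized range statistic exceeds this critical value is declared significantly different.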

Tukey’s test compares the mean of every group with the mean of every other group. For each pair it obtains a confidence interval for the difference

\[\mu_i-\mu_j\text{.}\]

If the confidence interval for a pairwise comparison includes 0, \(H_0\) is not rejected and the two means are not assumed to be significantly different. For all other pairs, whose confidence intervals do not include 0, \(H_0\) is rejected and the means are significantly different.
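To make the connection between the \(q\)-distribution and these confidence intervals explicit, the following self-contained sketch builds the Tukey-Kramer interval for one pair by hand on a small set of simulated toy data (the group names and sizes are made up for illustration) and compares it with the interval returned by TukeyHSD():

```r
set.seed(1)

# hypothetical toy data: three groups of unequal size
g <- factor(rep(c("A", "B", "C"), times = c(10, 12, 8)))
y <- rnorm(length(g), mean = c(A = 0, B = 1, C = 2)[as.character(g)])

fit <- aov(y ~ g)
tk  <- TukeyHSD(fit)$g          # intervals computed by R

# Tukey-Kramer interval for the pair B-A, built by hand
k   <- nlevels(g)
nu  <- df.residual(fit)         # n - k
mse <- deviance(fit) / nu       # residual mean square
q   <- qtukey(0.95, nmeans = k, df = nu)
m   <- tapply(y, g, mean)       # group means
ns  <- table(g)                 # group sizes

diff.BA <- m[["B"]] - m[["A"]]
half    <- q / sqrt(2) * sqrt(mse * (1 / ns[["B"]] + 1 / ns[["A"]]))

c(diff.BA - half, diff.BA + half)   # matches tk["B-A", c("lwr", "upr")]
```

The half-width of the interval is the critical \(q\)-value scaled by the standard error of the difference of the two group means, which is why a wider interval (including 0) corresponds to a larger adjusted p-value.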


Tukey’s honest significant difference test in R

In R, Tukey’s honest significant differences are computed by the TukeyHSD() function. The TukeyHSD() function expects a model object, specifically a fitted ANOVA model, as an input argument. We already calculated such a model above. However, for illustration purposes we build the same model once again and assign it to a variable called my.anova. Further, in order to specify the width of the confidence intervals, we provide the confidence level via the conf.level argument.

my.anova <- aov(salary ~ major, data = data)
alpha <- 0.05

tukey <- TukeyHSD(x = my.anova, conf.level= 1-alpha)
tukey
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = salary ~ major, data = data)
## 
## $major
##                  diff        lwr         upr     p adj
## EcFi-Bio     835.2044  -4651.189   6321.5982 0.9979683
## EnS-Bio   -11563.6640 -16449.935  -6677.3926 0.0000000
## MaSt-Bio   -1554.3446  -6654.879   3546.1894 0.9522553
## PoS-Bio   -10746.0766 -15555.957  -5936.1963 0.0000000
## SoS-Bio   -16735.2579 -21835.792 -11634.7239 0.0000000
## EnS-EcFi  -12398.8684 -18047.094  -6750.6428 0.0000000
## MaSt-EcFi  -2389.5491  -8224.123   3445.0248 0.8481939
## PoS-EcFi  -11581.2810 -17163.552  -5999.0097 0.0000001
## SoS-EcFi  -17570.4623 -23405.036 -11735.8885 0.0000000
## MaSt-EnS   10009.3194   4735.101  15283.5380 0.0000017
## PoS-EnS      817.5874  -4176.097   5811.2718 0.9971248
## SoS-EnS    -5171.5939 -10445.813    102.6248 0.0581961
## PoS-MaSt   -9191.7320 -14395.258  -3988.2060 0.0000109
## SoS-MaSt  -15180.9133 -20654.229  -9707.5979 0.0000000
## SoS-PoS    -5989.1813 -11192.707   -785.6553 0.0136871

The output shows the difference between each pair of group means, the 95% confidence interval and the adjusted p-value of each pairwise comparison. Look carefully at the table and you will see that for all 5 comparisons whose confidence interval includes 0 the p-value is higher than the significance level \(\alpha\). If \(p > \alpha\) we do not reject \(H_0\); thus, there is no statistically significant difference between the means of the two groups. In contrast, for all pairs with \(p \le \alpha\) we reject \(H_0\) and state that there is a statistically significant difference between the means of these pairs. The TukeyHSD() function provides a nice plotting functionality, which visualizes the confidence interval for each pair.

par(mar = c(4,6,4,4))
plot(tukey, las = 1, col = "brown" )