The \(\chi^2\) independence test is an inferential method used to decide whether an association exists between two variables. As in other hypothesis tests, the null hypothesis states that the two variables are not associated, whereas the alternative hypothesis states that the two variables are associated.

Recall that statistically dependent variables are called associated variables, whereas non-associated variables are called statistically independent variables. Further recall the concept of contingency tables (also known as two-way tables, cross-tabulation tables or cross tabs), which display the frequency distributions of bivariate data.


\(\chi^2\) Independence Test

The basic idea behind the \(\chi^2\) independence test is to compare the observed frequencies in a contingency table with the expected frequencies, given the null hypothesis of non-association is true. The expected frequency for each cell of a contingency table is given by

\[ E = \frac{R\times C}{n} \text{,}\] where \(R\) is the row total, \(C\) is the column total, and \(n\) is the sample size.

Let us construct an example for a better understanding. We consider an exit poll in the form of a contingency table that displays the age of \(n = 1189\) people in the categories 18-29, 30-44, 45-64 and 65 and older, and their political affiliation, with the categories “Conservative”, “Socialist” and “Other”. This table corresponds to the observed frequencies.

Observed frequencies: \[ \begin{array}{|l|c|c|c|c|} \hline & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\ \hline \text{18-29} & 141 & 68 & 4 & 213\\ \text{30-44} & 179 & 159 & 7 & 345\\ \text{45-64} & 220 & 216 & 4 & 440\\ \text{65 and older} & 86 & 101 & 4 & 191\\ \hline \text{Total} & 626 & 544 & 19 & 1189\\ \hline \end{array} \]

Based on the equation given above we calculate the expected frequency for each cell.

Expected frequencies:

\[ \begin{array}{|l|c|c|c|c|} \hline & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\ \hline \text{18-29} & \frac{213 \times 626}{1189} \approx 112.14 & \frac{213 \times 544}{1189} \approx 97.45 & \frac{213 \times 19}{1189} \approx 3.40 & 213\\ \text{30-44} & \frac{345 \times 626}{1189} \approx 181.64 & \frac{345 \times 544}{1189} \approx 157.85 & \frac{345 \times 19}{1189} \approx 5.51 & 345\\ \text{45-64} & \frac{440 \times 626}{1189} \approx 231.66 & \frac{440 \times 544}{1189} \approx 201.31 & \frac{440 \times 19}{1189} \approx 7.03 & 440\\ \text{65 and older} & \frac{191 \times 626}{1189} \approx 100.56 & \frac{191 \times 544}{1189} \approx 87.39 & \frac{191 \times 19}{1189} \approx 3.05 & 191\\ \hline \text{Total} & 626 & 544 & 19 & 1189\\ \hline \end{array} \]
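If you want to reproduce these numbers in R, the following minimal sketch does so (the matrix observed simply re-enters the table from above, and outer() forms the products of the row and column totals):

# observed frequencies of the exit poll example
observed <- matrix(c(141,  68, 4,
                     179, 159, 7,
                     220, 216, 4,
                      86, 101, 4),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("18-29", "30-44", "45-64", "65 and older"),
                                   c("Conservative", "Socialist", "Other")))

# expected frequencies: E = (row total x column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)  # matches the table of expected frequencies above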

Once we know the expected frequencies we have to check two assumptions. First, we have to make sure that all expected frequencies are 1 or greater, and second, that at most 20% of the expected frequencies are less than 5. By looking at the table we may confirm that both assumptions are fulfilled.

The actual comparison is done based on the \(\chi^2\) test statistic for the observed frequencies and the expected frequencies. The \(\chi^2\) test statistic approximately follows the \(\chi^2\) distribution and is given by

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell and then summed up.

The number of degrees of freedom is given by

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values for the two variables under consideration.
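For the exit poll example above, with \(r = 4\) age categories and \(c = 3\) political affiliations, we obtain \(df = (4-1) \times (3-1) = 6\).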

Applied to the example from above, this leads to a rather long expression, which for the sake of brevity is only written out for the cells of the first and the last row of the contingency table.

\[\chi^2 = \frac{(141 - 112.14)^2}{112.14} + \frac{(68 - 97.45)^2}{97.45} + \frac{(4 - 3.40)^2}{3.40} + \dots + \frac{(86 - 100.56)^2}{100.56} + \frac{(101 - 87.39)^2}{87.39} + \frac{(4 - 3.05)^2}{3.05}\]
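Written out in full, this sum runs over all 12 cells. If the matrices observed and expected from the R sketch above are available, the whole sum reduces to one line (for this table the value comes out at roughly 24.4):

# chi-square test statistic: sum of (O - E)^2 / E over all cells
sum((observed - expected)^2 / expected)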

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the \(\chi^2\) test statistic, thus supporting \(H_0\). If, however, the value of the \(\chi^2\) test statistic is large, the data provides evidence against \(H_0\). In the next sections we further discuss how to assess the value of the \(\chi^2\) test statistic in the framework of hypothesis testing.


\(\chi^2\) Independence Test: An Example

In order to get some hands-on experience, we apply the \(\chi^2\) independence test in an exercise. For this we load the students data set. You may download the students.csv file here. Import the data set and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In this exercise we want to examine if there is an association between the variables gender and major, or in other words, we want to know whether male students favor different study subjects than female students do.


Data preparation

We start with the data preparation. We do not want to deal with the whole data set of 8239 entries, thus we randomly select 865 students from the data set. Note that, because we do not fix the random seed (e.g. with set.seed()), every run of the code draws a different sample, so your numbers will differ slightly from the output shown below. The first step of data preparation is to display our data of interest in the form of a contingency table. R provides the fancy table() function, which will do the job.

n <- 865
data.idx <- sample(1:nrow(students), n)
data <- students[data.idx,] 
conti.table <- table(data$major, data$gender)
conti.table
##                             
##                              Female Male
##   Biology                       112   58
##   Economics and Finance          36   93
##   Environmental Sciences         72   95
##   Mathematics and Statistics     25  114
##   Political Science              92   68
##   Social Sciences                63   37

Further, we determine the row sums and the column sums. The R software package provides the margin.table() function, which takes the table conti.table as its first argument and a margin index as its second argument: the integer \(1\) for the row sums, or \(2\) for the column sums.

row.sum <- margin.table(conti.table, 1)
row.sum
## 
##                    Biology      Economics and Finance 
##                        170                        129 
##     Environmental Sciences Mathematics and Statistics 
##                        167                        139 
##          Political Science            Social Sciences 
##                        160                        100
col.sum <- margin.table(conti.table, 2)
col.sum
## 
## Female   Male 
##    400    465

For visualization purposes we join the data in a matrix object. Moreover, we apply the as.vector() function to extract the numerical information from the table object. The resulting matrix corresponds to the observed frequencies.

conti.added <- cbind(conti.table, as.vector(row.sum))
conti.added <- rbind(conti.added , c(as.vector(col.sum), sum(conti.table))) 
conti.added
##                            Female Male    
## Biology                       112   58 170
## Economics and Finance          36   93 129
## Environmental Sciences         72   95 167
## Mathematics and Statistics     25  114 139
## Political Science              92   68 160
## Social Sciences                63   37 100
##                               400  465 865

In the next step we construct the expected frequencies. Recall the equation from above:

\[ E = \frac{R\times C}{n} \text{,}\] where \(R\) is the row total, \(C\) is the column total, and \(n\) is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we iterate over the rows of the matrix and, within each row, over its columns, and calculate the expected frequency \(E\) for each particular cell.

# initialize empty matrix
expected.frequencies <- matrix(data = NA, 
                               nrow = nrow(conti.table), 
                               ncol = ncol(conti.table))

# nested for-loop
for (row.idx in 1:nrow(conti.table)){
  for (col.idx in 1:ncol(conti.table)){
    expected.frequencies[row.idx, col.idx] <- (row.sum[row.idx]*col.sum[col.idx])/sum(conti.table)
      }
  }
expected.frequencies
##          [,1]     [,2]
## [1,] 78.61272 91.38728
## [2,] 59.65318 69.34682
## [3,] 77.22543 89.77457
## [4,] 64.27746 74.72254
## [5,] 73.98844 86.01156
## [6,] 46.24277 53.75723

Nice! For better readability we assign row and column names to the expected.frequencies matrix.

rownames(expected.frequencies) <- rownames(conti.table)
colnames(expected.frequencies) <- colnames(conti.table)
expected.frequencies
##                              Female     Male
## Biology                    78.61272 91.38728
## Economics and Finance      59.65318 69.34682
## Environmental Sciences     77.22543 89.77457
## Mathematics and Statistics 64.27746 74.72254
## Political Science          73.98844 86.01156
## Social Sciences            46.24277 53.75723
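As an aside, the same matrix can also be obtained without an explicit loop; a minimal sketch using the outer() function on the margin totals is given below. The result should agree with the loop above.

# vectorized alternative: outer product of row and column totals, divided by n
expected.alt <- outer(as.vector(row.sum), as.vector(col.sum)) / sum(conti.table)
all.equal(as.vector(expected.alt), as.vector(expected.frequencies))
## [1] TRUE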

Once we know the expected frequencies we have to check two assumptions. First, we have to make sure that all expected frequencies are 1 or greater, and second, that at most 20% of the expected frequencies are less than 5. By looking at the table we may confirm that both assumptions are fulfilled.
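Instead of just eyeballing the table, we may also let R confirm the two assumptions; both expressions should evaluate to TRUE:

# assumption 1: all expected frequencies are 1 or greater
all(expected.frequencies >= 1)
## [1] TRUE
# assumption 2: at most 20% of the expected frequencies are less than 5
mean(expected.frequencies < 5) <= 0.2
## [1] TRUE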

Now we have all the data we need to perform a \(\chi^2\) independence test.


Hypothesis testing

In order to conduct the \(\chi^2\) independence test we follow the same step-wise hypothesis-testing procedure as discussed in the previous sections. \[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} & \text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} & \text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)

The null hypothesis states that there is no association between the gender and the major study subject of students.

\[H_0: \text{No association between gender and major study subject}\]

Alternative hypothesis \[H_A: \quad \text{There is an association between gender and major study subject}\]


Step 2: Decide on the significance level, \(\alpha\)

\[\alpha = 0.05\]

alpha <- 0.05

Steps 3 and 4: Compute the value of the test statistic and the p-value

For illustration purposes we manually compute the test statistic in R. Recall the equation for the test statistic from above:

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \] where \(O\) represents the observed frequency and \(E\) represents the expected frequency.

To calculate the test statistic we perform calculations cell by cell. Thus, we again apply a nested for-loop. To make the code more readable we reassign the observed and expected frequencies to the variables of and ef respectively.

# Compute the value of the test statistic

of <- conti.table
ef <- expected.frequencies

# initialize test statistic to zero
x2 <- 0

# nested for-loop
for (row.idx in 1:nrow(of)){
  for (col.idx in 1:ncol(of)){
    cell <- (of[row.idx, col.idx]-ef[row.idx, col.idx])^2 / ef[row.idx, col.idx]
    x2 <- x2 + cell # update variable x2 cell by cell
      }
}
x2
## [1] 108.581

The numerical value of the test statistic is \(\approx 108.58\).
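As a cross-check, the same value is obtained without the loop, because R applies the arithmetic operators element-wise to the whole table:

sum((of - ef)^2 / ef)
## [1] 108.581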

In order to calculate the p-value we apply the pchisq() function. Recall how to calculate the degrees of freedom:

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values for the two variables under consideration.
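In our exercise the contingency table has \(r = 6\) rows (majors) and \(c = 2\) columns (genders), so \(df = (6-1) \times (2-1) = 5\).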

# Compute the degrees of freedom
df <- (nrow(of) - 1) * (ncol(of) - 1)
df
## [1] 5

# Compute the p-value
p <- pchisq(x2, df = df, lower.tail = FALSE)
p
## [1] 8.172e-22

Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\).

p <= alpha
## [1] TRUE

The p-value is less than the specified significance level of 0.05; we reject \(H_0\). The test results are statistically significant at the 5% level and provide very strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test.

\(p \approx 8.2 \times 10^{-22}\). At the 5% significance level, the data provides very strong evidence to conclude that there is an association between gender and the major study subject.


Hypothesis testing in R

We just completed a \(\chi^2\) independence test in R manually. We can do the same in R with just one line of code!

For this we apply the chisq.test() function. We either provide two vectors as data input, such as data$major and data$gender, or we provide the contingency table of directly.

# vector data
chisq.test(data$major, data$gender)
## 
##  Pearson's Chi-squared test
## 
## data:  data$major and data$gender
## X-squared = 108.58, df = 5, p-value < 2.2e-16
# table object
chisq.test(of)
## 
##  Pearson's Chi-squared test
## 
## data:  of
## X-squared = 108.58, df = 5, p-value < 2.2e-16

Perfect! Compare the output of the chisq.test() function with our result from above. Again, we may conclude that at the 5% significance level, the data provides very strong evidence to conclude that there is an association between gender and the major study subject.
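Note that the object returned by chisq.test() also stores the intermediate quantities, so the results of our manual computation can be extracted directly from it:

chi.sq.result <- chisq.test(of)
chi.sq.result$statistic   # the X-squared test statistic
chi.sq.result$parameter   # the degrees of freedom
chi.sq.result$p.value     # the p-value
chi.sq.result$expected    # the matrix of expected frequencies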