The **\(\chi^2\) independence
test** is an inferential method to decide whether an association
exists between two variables. As in other hypothesis tests, the null
hypothesis states that the two variables are not associated, whereas
the alternative hypothesis states that the two variables are
associated.

Recall that statistically **dependent variables** are
called **associated variables**, whereas
non-associated variables are called statistically independent variables.
Further, recall the concept of **contingency tables** (also known as
two-way tables, cross-tabulation tables or cross tabs), which display the
frequency distributions of bivariate data.

The basic idea behind the **\(\chi^2\) independence test** is to
compare the **observed frequencies** in a contingency table
with the **expected frequencies**, given the null
hypothesis of non-association is true. The expected frequency for each
cell of a contingency table is given by

\[ E = \frac{R\times C}{n} \text{,}\] where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.

Let us construct an example for better understanding. We consider an exit poll in the form of a contingency table that displays the age of \(n = 1189\) people in the categories 18-29, 30-44, 45-64 and 65 years and older, together with their political affiliation, which is "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:** \[
\begin{array}{|l|c|c|c|c|}
\hline
 & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\text{18-29} & 141 & 68 & 4 & 213\\
\text{30-44} & 179 & 159 & 7 & 345\\
\text{45-64} & 220 & 216 & 4 & 440\\
\text{65 and older} & 86 & 101 & 4 & 191\\
\hline
\text{Total} & 626 & 544 & 19 & 1189\\
\hline
\end{array}
\]

Based on the equation given above we calculate the expected frequency for each cell.

**Expected frequencies:**

\[
\begin{array}{|l|c|c|c|c|}
\hline
 & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\text{18-29} & \frac{213 \times 626}{1189} \approx 112.14 & \frac{213 \times 544}{1189} \approx 97.45 & \frac{213 \times 19}{1189} \approx 3.40 & 213\\
\text{30-44} & \frac{345 \times 626}{1189} \approx 181.64 & \frac{345 \times 544}{1189} \approx 157.85 & \frac{345 \times 19}{1189} \approx 5.51 & 345\\
\text{45-64} & \frac{440 \times 626}{1189} \approx 231.66 & \frac{440 \times 544}{1189} \approx 201.31 & \frac{440 \times 19}{1189} \approx 7.03 & 440\\
\text{65 and older} & \frac{191 \times 626}{1189} \approx 100.56 & \frac{191 \times 544}{1189} \approx 87.39 & \frac{191 \times 19}{1189} \approx 3.05 & 191\\
\hline
\text{Total} & 626 & 544 & 19 & 1189\\
\hline
\end{array}
\]

Once we know the expected frequencies we have to check two assumptions. First, all expected frequencies must be 1 or greater. Second, at most 20 % of the expected frequencies may be less than 5. By looking at the table we can confirm that both assumptions are fulfilled.

The actual comparison is based on the \(\chi^2\) test statistic, which contrasts the observed frequencies with the expected frequencies. The \(\chi^2\) test statistic approximately follows the \(\chi^2\) distribution and is given by

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell and then summed up.

The number of degrees of freedom is given by

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values (categories) of the two variables under consideration. In the exit-poll example \(r = 4\) and \(c = 3\), thus \(df = 3 \times 2 = 6\).

Applied to the above example this leads to a rather long expression, which, for the sake of brevity, is only written out for the first and the last row of the contingency tables of interest:

\[\chi^2 = \frac{(141 - 112.14)^2}{112.14} + \frac{(68 - 97.45)^2}{97.45} + \frac{(4 - 3.40)^2}{3.40} + \dots + \frac{(86 - 100.56)^2}{100.56} + \frac{(101 - 87.39)^2}{87.39} + \frac{(4 - 3.05)^2}{3.05}\]
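
Rather than typing this out by hand, we can let R do the bookkeeping. The following is a minimal sketch (the object names `O` and `E` are ours, not part of the original example) that reproduces the computation for the full exit-poll table; summing over all twelve cells it should evaluate to roughly \(24.4\):

```
# observed frequencies of the exit poll
# rows: 18-29, 30-44, 45-64, 65 and older
# columns: Conservative, Socialist, Other
O <- matrix(c(141,  68, 4,
              179, 159, 7,
              220, 216, 4,
               86, 101, 4),
            nrow = 4, byrow = TRUE)
# expected frequencies under the null hypothesis: E = R * C / n
E <- outer(rowSums(O), colSums(O)) / sum(O)
# chi-squared test statistic: sum of (O - E)^2 / E over all cells
sum((O - E)^2 / E)
```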

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the \(\chi^2\) test statistic and thus supporting \(H_0\). If, however, the value of the \(\chi^2\) test statistic is large, the data provides evidence against \(H_0\). In the next sections we further discuss how to assess the value of the \(\chi^2\) test statistic in the framework of hypothesis testing.

In order to get some hands-on experience we apply the **\(\chi^2\) independence test** in an
exercise. For this we import the *students* data set, which you
may also download here.

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explanatory variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.

In this exercise we want to examine **if there is an
association between the variables gender and
major, or in other words, we want to know if male students
favor different study subjects compared to female students**.

We start with the data preparation. We do not want to deal with the
whole data set of 8239 entries, thus we randomly select 865 students from
the data set. The first step of data preparation is to display our data
of interest in the form of a contingency table. R provides the fancy
`table()` function, which will do the job.

```
n <- 865
# randomly sample n students (row indices) from the full data set
data_idx <- sample(1:nrow(students), n)
data <- students[data_idx, ]
# cross-tabulate major (rows) against gender (columns)
conti_table <- table(data$major, data$gender)
conti_table
```

```
##
## Female Male
## Biology 94 71
## Economics and Finance 49 79
## Environmental Sciences 83 100
## Mathematics and Statistics 29 98
## Political Science 115 46
## Social Sciences 62 39
```

Further, we determine the column sums and the row sums. R provides the
`margin.table()` function, which takes the variable `conti_table` as its
first argument and, as an additional argument, either the integer \(1\),
for the row sums, or \(2\), for the column sums.

```
row_sum <- margin.table(conti_table, 1)
row_sum
```

```
##
## Biology Economics and Finance
## 165 128
## Environmental Sciences Mathematics and Statistics
## 183 127
## Political Science Social Sciences
## 161 101
```

```
col_sum <- margin.table(conti_table, 2)
col_sum
```

```
##
## Female Male
## 432 433
```

For visualization purposes we join the data in a `matrix` object.
Moreover, we apply the `as.vector()` function to extract the numerical
information from the `table` object. The resulting matrix corresponds
to the **observed frequencies**.

```
# append the row totals as an additional column
conti_added <- cbind(conti_table, as.vector(row_sum))
# append the column totals and the grand total as an additional row
conti_added <- rbind(conti_added, c(as.vector(col_sum), sum(conti_table)))
conti_added
```

```
## Female Male
## Biology 94 71 165
## Economics and Finance 49 79 128
## Environmental Sciences 83 100 183
## Mathematics and Statistics 29 98 127
## Political Science 115 46 161
## Social Sciences 62 39 101
## 432 433 865
```
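
As a side note, base R offers the `addmargins()` function, which produces the same augmented table in a single step (it labels the totals `Sum`):

```
# add row and column totals in one step
addmargins(conti_table)
```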

In the next step we construct the **expected
frequencies**. Recall the equation above:

\[ E = \frac{R\times C}{n} \text{,}\]

where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we iterate over all rows of the matrix and, within each row, over all columns, and calculate the expected frequency \(E\) for each particular cell.

```
# initialize empty matrix
expected_frequencies <- matrix(
  data = NA,
  nrow = nrow(conti_table),
  ncol = ncol(conti_table)
)
# nested for-loop: E = R * C / n for each cell
for (row.idx in 1:nrow(conti_table)) {
  for (col.idx in 1:ncol(conti_table)) {
    expected_frequencies[row.idx, col.idx] <-
      (row_sum[row.idx] * col_sum[col.idx]) / sum(conti_table)
  }
}
expected_frequencies
```

```
## [,1] [,2]
## [1,] 82.40462 82.59538
## [2,] 63.92601 64.07399
## [3,] 91.39422 91.60578
## [4,] 63.42659 63.57341
## [5,] 80.40694 80.59306
## [6,] 50.44162 50.55838
```
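
For what it is worth, the same matrix can also be computed without any loop, since the expected frequencies are simply the outer product of the row and column totals divided by the sample size. A minimal sketch (`expected_alt` is our own name):

```
# loop-free alternative: outer product of the margins
expected_alt <- outer(as.vector(row_sum), as.vector(col_sum)) / sum(conti_table)
expected_alt
```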

Nice! For better readability we assign column and row names to the
`expected_frequencies` matrix.

```
rownames(expected_frequencies) <- rownames(conti_table)
colnames(expected_frequencies) <- colnames(conti_table)
expected_frequencies
```

```
## Female Male
## Biology 82.40462 82.59538
## Economics and Finance 63.92601 64.07399
## Environmental Sciences 91.39422 91.60578
## Mathematics and Statistics 63.42659 63.57341
## Political Science 80.40694 80.59306
## Social Sciences 50.44162 50.55838
```

Once we know the expected frequencies we have to check two assumptions. First, all expected frequencies must be 1 or greater. Second, at most 20 % of the expected frequencies should be less than 5. By looking at the table we can confirm that both assumptions are fulfilled.
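
Instead of eyeballing the table, we can also verify both assumptions programmatically; both expressions should return `TRUE`:

```
# assumption 1: all expected frequencies are 1 or greater
all(expected_frequencies >= 1)
# assumption 2: at most 20 % of the expected frequencies are below 5
mean(expected_frequencies < 5) <= 0.2
```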

Now, we have all the data we need to perform a \(\chi^2\) independence test.

In order to conduct the **\(\chi^2\) independence test** we
follow the same step-wise implementation procedure for hypothesis
testing as discussed in the previous sections:

\[
\begin{array}{ll}
\hline
\text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\
\text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\
\text{Step 3} & \text{Compute the value of the test statistic.} \\
\text{Step 4} & \text{Determine the p-value.} \\
\text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0 \text{.} \\
\text{Step 6} & \text{Interpret the result of the hypothesis test.} \\
\hline
\end{array}
\]

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that there is no association between the gender and the major study subject of students:

\[H_0: \text{No association between gender and major study subject}\]

Alternative hypothesis:

\[H_A: \quad \text{There is an association between gender and major study subject}\]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.05\]

`alpha <- 0.05`

**Step 3 and 4: Compute the value of the test statistic and the
p-value**

For illustration purposes we manually compute the test statistic in R. Recall the equation for the test statistic from above:

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency.

To calculate the test statistic we perform the calculations cell by cell.
Thus, we again apply a nested for-loop. To make the code more readable
we reassign the observed and expected frequencies to the variables
`of` and `ef`, respectively.

```
# compute the value of the test statistic
of <- conti_table
ef <- expected_frequencies
# initialize test statistic to zero
x2 <- 0
# nested for-loop: accumulate (O - E)^2 / E over all cells
for (row.idx in 1:nrow(of)) {
  for (col.idx in 1:ncol(of)) {
    cell <- (of[row.idx, col.idx] - ef[row.idx, col.idx])^2 / ef[row.idx, col.idx]
    x2 <- x2 + cell # update variable x2 cell by cell
  }
}
x2
```

`## [1] 84.11274`

The numerical value of the test statistic is \(\approx 84.11\).
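
Since R applies arithmetic operators element-wise to matrices, the nested for-loop can be condensed into a single vectorized expression, which should return the same value:

```
# vectorized equivalent of the nested for-loop
sum((of - ef)^2 / ef)
```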

In order to calculate the *p*-value we apply the `pchisq()` function.
Recall how to calculate the degrees of freedom:

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values for the two variables under consideration.

```
# compute df
df <- (nrow(of) - 1) * (ncol(of) - 1)
# compute the p-value
p <- pchisq(x2, df = df, lower.tail = FALSE)
p
```

`## [1] 1.155201e-16`

\(p \approx 1.155 \times 10^{-16}\), consistent with the `p-value < 2.2e-16` reported by `chisq.test()` further below.

**Step 5: If \(p \le \alpha\),
reject \(H_0\); otherwise, do not
reject \(H_0\)**

`p <= alpha`

`## [1] TRUE`

The *p*-value is smaller than the specified significance level
of 0.05; we reject \(H_0\). The test
results are statistically significant at the 5 % level and provide very
strong evidence against the null hypothesis.
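
As an optional cross-check, not part of the step-wise procedure above, we can compare the test statistic against the critical value of the \(\chi^2\) distribution with \(df = 5\), which is roughly \(11.07\) at the 5 % level:

```
# critical value approach: reject H0 if the test statistic exceeds it
crit <- qchisq(1 - alpha, df = df)
x2 > crit
```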

**Step 6: Interpret the result of the hypothesis
test**

At the 5 % significance level the data provides very strong evidence to conclude that there is an association between gender and the major study subject.

We just completed a \(\chi^2\) independence test manually in R. We can do the same in just one line of code!

Therefore, we apply the `chisq.test()` function. We either provide two
vectors as input data, such as `data$major` and `data$gender`, or we
provide a contingency table such as `of` directly; note that a second
argument like `ef` is ignored when the first argument is already a
table.

```
# vector data
chisq.test(data$major, data$gender)
```

```
##
## Pearson's Chi-squared test
##
## data: data$major and data$gender
## X-squared = 84.113, df = 5, p-value < 2.2e-16
```

```
# table object (the second argument ef is ignored)
chisq.test(of, ef)
```

```
##
## Pearson's Chi-squared test
##
## data: of
## X-squared = 84.113, df = 5, p-value < 2.2e-16
```

Perfect! Compare the output of the `chisq.test()` function with our
result above. Again, we may conclude that at the 5 % significance level
the data provides very strong evidence for an association between
gender and the major study subject.
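
Incidentally, the object returned by `chisq.test()` also stores the intermediate results, so our manual computations can be verified directly. A small sketch (`res` is our own name):

```
# inspect the components of the returned htest object
res <- chisq.test(conti_table)
res$expected  # expected frequencies, as computed manually above
res$statistic # the chi-squared test statistic
res$p.value   # the p-value
```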

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*