The **\(\chi^2\) independence test** is an inferential method to decide whether an association exists between two variables. As with other hypothesis tests, the null hypothesis states that the two variables are not associated, whereas the alternative hypothesis states that the two variables are associated.

Recall that statistically **dependent variables** are called **associated variables**, whereas non-associated variables are called statistically independent variables. Further recall the concept of **contingency tables** (also known as two-way tables, cross-tabulation tables or cross tabs), which display the frequency distributions of bivariate data.

The basic idea behind the **\(\chi^2\) independence test** is to compare the **observed frequencies** in a contingency table with the **expected frequencies**, given that the null hypothesis of non-association is true. The expected frequency for each cell of a contingency table is given by

\[ E = \frac{R\times C}{n} \text{,}\] where \(R\) is the row total, \(C\) is the column total, and \(n\) is the sample size.

Let us construct an example for a better understanding. We consider an exit poll in the form of a contingency table that displays the age of \(n = 1189\) people in the categories 18-29, 30-44, 45-64 and 65 and older, and their political affiliation, which is either "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:** \[
\begin{array}{|l|c|c|c|c|}
\hline
 & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\text{18-29} & 141 & 68 & 4 & 213\\
\text{30-44} & 179 & 159 & 7 & 345\\
\text{45-64} & 220 & 216 & 4 & 440\\
\text{65 and older} & 86 & 101 & 4 & 191\\
\hline
\text{Total} & 626 & 544 & 19 & 1189\\
\hline
\end{array}
\]

Based on the equation given above we calculate the expected frequency for each cell.

**Expected frequencies:**

\[
\begin{array}{|l|c|c|c|c|}
\hline
 & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\text{18-29} & \frac{213 \times 626}{1189} \approx 112.14 & \frac{213 \times 544}{1189} \approx 97.45 & \frac{213 \times 19}{1189} \approx 3.40 & 213\\
\text{30-44} & \frac{345 \times 626}{1189} \approx 181.64 & \frac{345 \times 544}{1189} \approx 157.85 & \frac{345 \times 19}{1189} \approx 5.51 & 345\\
\text{45-64} & \frac{440 \times 626}{1189} \approx 231.66 & \frac{440 \times 544}{1189} \approx 201.31 & \frac{440 \times 19}{1189} \approx 7.03 & 440\\
\text{65 and older} & \frac{191 \times 626}{1189} \approx 100.56 & \frac{191 \times 544}{1189} \approx 87.39 & \frac{191 \times 19}{1189} \approx 3.05 & 191\\
\hline
\text{Total} & 626 & 544 & 19 & 1189\\
\hline
\end{array}
\]

Once we know the expected frequencies we have to check two assumptions: first, all expected frequencies must be 1 or greater, and second, at most 20% of the expected frequencies may be less than 5. Looking at the table, we can confirm that both assumptions are fulfilled.

The actual comparison is done based on the \(\chi^2\) test statistic for the observed and the expected frequencies. The \(\chi^2\) test statistic follows the \(\chi^2\) distribution and is given by

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell and then summed up.

The number of degrees of freedom is given by

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values for the two variables under consideration.
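
Applied to the exit poll above, with \(r = 4\) age categories and \(c = 3\) political affiliations, we obtain

\[df = (4-1) \times (3-1) = 6\text{.}\]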

Applied to the example from above, the \(\chi^2\) test statistic leads to a rather long expression, which for the sake of brevity is only given for the first and the last row of the contingency table of interest.

\[\chi^2 = \frac{(141 - 112.14)^2}{112.14} + \frac{(68 - 97.45)^2}{97.45} + \frac{(4 - 3.4)^2}{3.4} + \dots + \frac{(86 - 100.56)^2}{100.56} + \frac{(101 - 87.39)^2}{87.39} + \frac{(4 - 3.05)^2}{3.05}\]

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the \(\chi^2\) test statistic and thus supporting \(H_0\). If, however, the value of the \(\chi^2\) test statistic is large, the data provides evidence against \(H_0\). In the next sections we further discuss how to assess the value of the \(\chi^2\) test statistic in the framework of hypothesis testing.
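
To make the sum concrete, here is a minimal R sketch that evaluates the full expression for the exit-poll table; the object names `observed` and `expected` are chosen for illustration:

```
# observed frequencies of the exit poll
# (rows: age groups 18-29, 30-44, 45-64, 65 and older;
#  columns: Conservative, Socialist, Other)
observed <- matrix(c(141,  68, 4,
                     179, 159, 7,
                     220, 216, 4,
                      86, 101, 4),
                   nrow = 4, byrow = TRUE)

# expected frequencies: E = (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chi-squared test statistic: sum of (O - E)^2 / E over all cells
sum((observed - expected)^2 / expected)
```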

In order to get some hands-on experience we apply the **\(\chi^2\) independence test** in an exercise. Therefore we load the *students* data set. You may download the `students.csv` file here. Import the data set and assign a proper name to it.

`students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")`

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. These self-explanatory variables are: *stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary*.
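
As a quick sanity check after the import, one may inspect the dimensions and the column names; a minimal sketch:

```
dim(students)        # should return 8239 (rows) and 16 (columns)
colnames(students)   # the 16 variable names listed above
```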

In this exercise we want to examine **if there is an association between the variables gender and major, or in other words, we want to know if male students favor different study subjects compared to female students**.

We start with the data preparation. We do not want to deal with the whole data set of 8239 entries, thus we randomly select 865 students from the data set. The first step of data preparation is to display our data of interest in the form of a contingency table. R provides the fancy `table()` function, which will do the job.

```
# number of students to sample
n <- 865
# randomly draw n row indices (without a fixed seed the sample,
# and hence the results below, will vary from run to run)
data.idx <- sample(1:nrow(students), n)
data <- students[data.idx, ]
# cross-tabulate major against gender
conti.table <- table(data$major, data$gender)
conti.table
```

```
##
## Female Male
## Biology 112 58
## Economics and Finance 36 93
## Environmental Sciences 72 95
## Mathematics and Statistics 25 114
## Political Science 92 68
## Social Sciences 63 37
```

Further, we determine the column sums and the row sums. The R software package provides the `margin.table()` function, which takes the variable `conti.table` as an argument and additionally one more argument: either the integer \(1\) for the row sums or \(2\) for the column sums.

```
row.sum <- margin.table(conti.table, 1)
row.sum
```

```
##
## Biology Economics and Finance
## 170 129
## Environmental Sciences Mathematics and Statistics
## 167 139
## Political Science Social Sciences
## 160 100
```

```
col.sum <- margin.table(conti.table, 2)
col.sum
```

```
##
## Female Male
## 400 465
```

For visualization purposes we join the data in a `matrix` object. Moreover, we apply the `as.vector()` function to extract the numerical information from the `table` object. The resulting matrix corresponds to the **observed frequencies**.

```
conti.added <- cbind(conti.table, as.vector(row.sum))
conti.added <- rbind(conti.added , c(as.vector(col.sum), sum(conti.table)))
conti.added
```

```
## Female Male
## Biology 112 58 170
## Economics and Finance 36 93 129
## Environmental Sciences 72 95 167
## Mathematics and Statistics 25 114 139
## Political Science 92 68 160
## Social Sciences 63 37 100
## 400 465 865
```
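
As a side note, base R provides the `addmargins()` function, which produces the same augmented table in a single call; a minimal sketch:

```
# one-call alternative to the cbind()/rbind() construction above
addmargins(conti.table)
```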

In the next step we construct the **expected frequencies**. Recall the equation from above:

\[ E = \frac{R\times C}{n} \text{,}\] where \(R\) is the row total, \(C\) is the column total, and \(n\) is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we traverse the matrix row by row and column by column and calculate the expected frequency \(E\) for each particular cell.

```
# initialize empty matrix
expected.frequencies <- matrix(data = NA,
                               nrow = nrow(conti.table),
                               ncol = ncol(conti.table))

# nested for-loop over all cells: E = (row total * column total) / n
for (row.idx in 1:nrow(conti.table)) {
  for (col.idx in 1:ncol(conti.table)) {
    expected.frequencies[row.idx, col.idx] <-
      (row.sum[row.idx] * col.sum[col.idx]) / sum(conti.table)
  }
}
expected.frequencies
```

```
## [,1] [,2]
## [1,] 78.61272 91.38728
## [2,] 59.65318 69.34682
## [3,] 77.22543 89.77457
## [4,] 64.27746 74.72254
## [5,] 73.98844 86.01156
## [6,] 46.24277 53.75723
```

Nice! For better readability we assign column and row names to the `expected.frequencies` matrix.

```
rownames(expected.frequencies) <- rownames(conti.table)
colnames(expected.frequencies) <- colnames(conti.table)
expected.frequencies
```

```
## Female Male
## Biology 78.61272 91.38728
## Economics and Finance 59.65318 69.34682
## Environmental Sciences 77.22543 89.77457
## Mathematics and Statistics 64.27746 74.72254
## Political Science 73.98844 86.01156
## Social Sciences 46.24277 53.75723
```
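
As an aside, the same matrix can be constructed without loops using R's vectorized `outer()` function; a minimal sketch, assuming the `row.sum` and `col.sum` objects from above:

```
# vectorized alternative to the nested for-loop
expected.alt <- outer(as.vector(row.sum), as.vector(col.sum)) / sum(conti.table)
dimnames(expected.alt) <- dimnames(conti.table)
expected.alt
```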

As in the example above, we check the two assumptions: all expected frequencies are 1 or greater, and at most 20% of the expected frequencies are less than 5. Looking at the table, we can confirm that both assumptions are fulfilled.

Now we have all the data we need to perform a \(\chi^2\) independence test.

In order to conduct the **\(\chi^2\) independence test** we follow the step-wise implementation procedure for hypothesis testing discussed in the previous sections. \[
\begin{array}{ll}
\hline
\text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\
\text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\
\text{Step 3} & \text{Compute the value of the test statistic.} \\
\text{Step 4} & \text{Determine the p-value.} \\
\text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\
\text{Step 6} & \text{Interpret the result of the hypothesis test.} \\
\hline
\end{array}
\]

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that there is no association between the gender and the major study subject of students.

\[H_0: \text{No association between gender and major study subject}\]

**Alternative hypothesis** \[H_A: \quad \text{There is an association between gender and major study subject}\]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.05\]

`alpha <- 0.05`

**Step 3 and 4: Compute the value of the test statistic and the p-value.**

For illustration purposes we manually compute the test statistic in R. Recall the equation for the test statistic from above:

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \] where \(O\) represents the observed frequency and \(E\) represents the expected frequency.

To calculate the test statistic we perform the calculations cell by cell. Thus, we again apply a nested for-loop. To make the code more readable we reassign the observed and expected frequencies to the variables `of` and `ef`, respectively.

```
# compute the value of the test statistic
of <- conti.table            # observed frequencies
ef <- expected.frequencies   # expected frequencies

# initialize test statistic to zero
x2 <- 0

# nested for-loop: accumulate (O - E)^2 / E over all cells
for (row.idx in 1:nrow(of)) {
  for (col.idx in 1:ncol(of)) {
    cell <- (of[row.idx, col.idx] - ef[row.idx, col.idx])^2 / ef[row.idx, col.idx]
    x2 <- x2 + cell # update variable x2 cell by cell
  }
}
x2
```

`## [1] 108.581`

The numerical value of the test statistic is \(\approx 108.58\).
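
As a cross-check, R's vectorized arithmetic yields the same value in a single line; a minimal sketch using the `of` and `ef` objects from above:

```
# vectorized equivalent of the nested for-loop
sum((of - ef)^2 / ef)
```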

In order to calculate the *p*-value we apply the `pchisq()` function. Recall how to calculate the degrees of freedom:

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values for the two variables under consideration.

```
# compute the degrees of freedom: df = (r - 1) * (c - 1)
df <- (nrow(of) - 1) * (ncol(of) - 1)
# compute the p-value (upper-tail probability of the chi-squared distribution)
p <- pchisq(x2, df = df, lower.tail = FALSE)
p
```

`## [1] 1.141509e-17`

**Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\).**

`p <= alpha`

`## [1] TRUE`

The *p*-value is less than the specified significance level of 0.05; we reject \(H_0\). The test results are statistically significant at the 5% level and provide very strong evidence against the null hypothesis.
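
Equivalently, one may compare the test statistic with the critical value of the \(\chi^2\) distribution; a minimal sketch, assuming the `alpha`, `x2` and `df` objects from above:

```
# critical value approach: reject H0 if x2 exceeds the critical value
crit <- qchisq(p = alpha, df = df, lower.tail = FALSE)
x2 > crit
```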

**Step 6: Interpret the result of the hypothesis test.**

\(p = 1.1415093\times 10^{-17}\). At the 5% significance level, the data provides very strong evidence to conclude that there is an association between gender and the major study subject.

We just completed a \(\chi^2\) independence test in R manually. We can do the same in R with just one line of code!

Therefore we apply the `chisq.test()` function. We either provide two vectors as data input, such as `data$major` and `data$gender`, or we provide the contingency table object `of` directly; in that case the expected frequencies are computed internally, so `ef` is not needed.

```
# vector data
chisq.test(data$major, data$gender)
```

```
##
## Pearson's Chi-squared test
##
## data: data$major and data$gender
## X-squared = 108.58, df = 5, p-value < 2.2e-16
```

```
# contingency table object
chisq.test(of)
```

```
##
## Pearson's Chi-squared test
##
## data: of
## X-squared = 108.58, df = 5, p-value < 2.2e-16
```

Perfect! Compare the output of the `chisq.test()` function with our result from above. Again, we may conclude that at the 5% significance level, the data provides very strong evidence to conclude that there is an association between gender and the major study subject.
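
As a final remark, the object returned by `chisq.test()` stores the quantities we computed by hand, so they can be inspected directly; a brief sketch:

```
x2.test <- chisq.test(data$major, data$gender)
x2.test$statistic  # the chi-squared test statistic
x2.test$expected   # the expected frequencies under H0
x2.test$p.value    # the p-value
```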