The **\(\chi^2\) independence
test** is an inferential method to decide whether an association
exists between two variables. As in other hypothesis tests, the null
hypothesis states that the two variables are not associated, whereas
the alternative hypothesis states that the two variables are
associated.

Recall that statistically **dependent variables** are
called **associated variables**, whereas
non-associated variables are called statistically independent variables.
Further, recall the concept of **contingency tables** (also known as
two-way tables, cross-tabulation tables or cross tabs), which display the
frequency distributions of bivariate data.

The basic idea behind the **\(\chi^2\) independence test** is to
compare the **observed frequencies** in a contingency table
with the **expected frequencies**, given the null
hypothesis of non-association is true. The expected frequency for each
cell of a contingency table is given by

\[ E = \frac{R\times C}{n} \text{,}\] where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.

Let us construct an example for better understanding. We consider an exit poll in the form of a contingency table that displays the age of \(n = 1189\) people in the categories 18-29, 30-44, 45-64 and 65 years and older, together with their political affiliation, which is "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:** \[
\begin{array}{|l|c|c|c|c|}
\hline
 & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\text{18-29} & 141 & 68 & 4 & 213\\
\text{30-44} & 179 & 159 & 7 & 345\\
\text{45-64} & 220 & 216 & 4 & 440\\
\text{65 and older} & 86 & 101 & 4 & 191\\
\hline
\text{Total} & 626 & 544 & 19 & 1189\\
\hline
\end{array}
\]

Based on the equation given above we calculate the expected frequency for each cell.

**Expected frequencies:**

\[
\begin{array}{|l|c|c|c|c|}
\hline
 & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\text{18-29} & \frac{213 \times 626}{1189} \approx 112.14 & \frac{213 \times 544}{1189} \approx 97.45 & \frac{213 \times 19}{1189} \approx 3.40 & 213\\
\text{30-44} & \frac{345 \times 626}{1189} \approx 181.64 & \frac{345 \times 544}{1189} \approx 157.85 & \frac{345 \times 19}{1189} \approx 5.51 & 345\\
\text{45-64} & \frac{440 \times 626}{1189} \approx 231.66 & \frac{440 \times 544}{1189} \approx 201.31 & \frac{440 \times 19}{1189} \approx 7.03 & 440\\
\text{65 and older} & \frac{191 \times 626}{1189} \approx 100.56 & \frac{191 \times 544}{1189} \approx 87.39 & \frac{191 \times 19}{1189} \approx 3.05 & 191\\
\hline
\text{Total} & 626 & 544 & 19 & 1189\\
\hline
\end{array}
\]

Once we know the expected frequencies we have to check two assumptions. First, all expected frequencies must be 1 or greater. Second, at most 20 % of the expected frequencies may be less than 5. By looking at the table we can confirm that both assumptions are fulfilled.

The actual comparison is based on the \(\chi^2\) test statistic, which contrasts the observed frequencies with the expected frequencies. The \(\chi^2\) test statistic approximately follows the \(\chi^2\) distribution and is given by

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell and then summed up.

The number of degrees of freedom is given by

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values (categories) of the two variables under consideration. In the exit-poll example \(r = 4\) and \(c = 3\), thus \(df = 3 \times 2 = 6\).

Applied to the above example this leads to a rather long expression, which, for the sake of brevity, is only written out for the first and the last row of the contingency tables of interest:

\[\chi^2 = \frac{(141 - 112.14)^2}{112.14} + \frac{(68 - 97.45)^2}{97.45} + \frac{(4 - 3.40)^2}{3.40} + \dots + \frac{(86 - 100.56)^2}{100.56} + \frac{(101 - 87.39)^2}{87.39} + \frac{(4 - 3.05)^2}{3.05}\]
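
Rather than typing this out by hand, we can let R do the bookkeeping. The following is a minimal sketch (the object names `O` and `E` are ours, not part of the original example) that reproduces the computation for the full exit-poll table; summing over all twelve cells it should evaluate to roughly \(24.4\):

```
# observed frequencies of the exit poll
# rows: 18-29, 30-44, 45-64, 65 and older
# columns: Conservative, Socialist, Other
O <- matrix(c(141,  68, 4,
              179, 159, 7,
              220, 216, 4,
               86, 101, 4),
            nrow = 4, byrow = TRUE)
# expected frequencies under the null hypothesis: E = R * C / n
E <- outer(rowSums(O), colSums(O)) / sum(O)
# chi-squared test statistic: sum of (O - E)^2 / E over all cells
sum((O - E)^2 / E)
```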

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the \(\chi^2\) test statistic and thus supporting \(H_0\). If, however, the value of the \(\chi^2\) test statistic is large, the data provides evidence against \(H_0\). In the next sections we further discuss how to assess the value of the \(\chi^2\) test statistic in the framework of hypothesis testing.

In order to get some hands-on experience we apply the **\(\chi^2\) independence test** in an
exercise. For this we import the *students* data set, which you
may also download here.

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explanatory variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.

In this exercise we want to examine **if there is an
association between the variables gender and
major, or in other words, we want to know if male students
favor different study subjects compared to female students**.

We start with the data preparation. We do not want to deal with the
whole data set of 8239 entries, thus we randomly select 865 students from
the data set. The first step of data preparation is to display our data
of interest in the form of a contingency table. R provides the fancy
`table()` function, which will do the job.

```
n <- 865
# randomly sample n students (row indices) from the full data set
data_idx <- sample(1:nrow(students), n)
data <- students[data_idx, ]
# cross-tabulate major (rows) against gender (columns)
conti_table <- table(data$major, data$gender)
conti_table
```

```
##
## Female Male
## Biology 94 71
## Economics and Finance 49 79
## Environmental Sciences 83 100
## Mathematics and Statistics 29 98
## Political Science 115 46
## Social Sciences 62 39
```

Further, we determine the column sums and the row sums. R provides the
`margin.table()` function, which takes the variable `conti_table` as its
first argument and, as an additional argument, either the integer \(1\),
for the row sums, or \(2\), for the column sums.

```
row_sum <- margin.table(conti_table, 1)
row_sum
```

```
##
## Biology Economics and Finance
## 165 128
## Environmental Sciences Mathematics and Statistics
## 183 127
## Political Science Social Sciences
## 161 101
```

```
col_sum <- margin.table(conti_table, 2)
col_sum
```

```
##
## Female Male
## 432 433
```

For visualization purposes we join the data in a `matrix` object.
Moreover, we apply the `as.vector()` function to extract the numerical
information from the `table` object. The resulting matrix corresponds
to the **observed frequencies**.

```
# append the row totals as an additional column
conti_added <- cbind(conti_table, as.vector(row_sum))
# append the column totals and the grand total as an additional row
conti_added <- rbind(conti_added, c(as.vector(col_sum), sum(conti_table)))
conti_added
```

```
## Female Male
## Biology 94 71 165
## Economics and Finance 49 79 128
## Environmental Sciences 83 100 183
## Mathematics and Statistics 29 98 127
## Political Science 115 46 161
## Social Sciences 62 39 101
## 432 433 865
```
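
As a side note, base R offers the `addmargins()` function, which produces the same augmented table in a single step (it labels the totals `Sum`):

```
# add row and column totals in one step
addmargins(conti_table)
```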

In the next step we construct the **expected
frequencies**. Recall the equation above:

\[ E = \frac{R\times C}{n} \text{,}\]

where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we iterate over all rows of the matrix and, within each row, over all columns, and calculate the expected frequency \(E\) for each particular cell.

```
# initialize empty matrix
expected_frequencies <- matrix(
  data = NA,
  nrow = nrow(conti_table),
  ncol = ncol(conti_table)
)
# nested for-loop: E = R * C / n for each cell
for (row.idx in 1:nrow(conti_table)) {
  for (col.idx in 1:ncol(conti_table)) {
    expected_frequencies[row.idx, col.idx] <-
      (row_sum[row.idx] * col_sum[col.idx]) / sum(conti_table)
  }
}
expected_frequencies
```

```
## [,1] [,2]
## [1,] 82.40462 82.59538
## [2,] 63.92601 64.07399
## [3,] 91.39422 91.60578
## [4,] 63.42659 63.57341
## [5,] 80.40694 80.59306
## [6,] 50.44162 50.55838
```
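
For what it is worth, the same matrix can also be computed without any loop, since the expected frequencies are simply the outer product of the row and column totals divided by the sample size. A minimal sketch (`expected_alt` is our own name):

```
# loop-free alternative: outer product of the margins
expected_alt <- outer(as.vector(row_sum), as.vector(col_sum)) / sum(conti_table)
expected_alt
```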

Nice! For better readability we assign column and row names to the
`expected_frequencies` matrix.

```
rownames(expected_frequencies) <- rownames(conti_table)
colnames(expected_frequencies) <- colnames(conti_table)
expected_frequencies
```

```
## Female Male
## Biology 82.40462 82.59538
## Economics and Finance 63.92601 64.07399
## Environmental Sciences 91.39422 91.60578
## Mathematics and Statistics 63.42659 63.57341
## Political Science 80.40694 80.59306
## Social Sciences 50.44162 50.55838
```

Once we know the expected frequencies we have to check two assumptions. First, all expected frequencies must be 1 or greater. Second, at most 20 % of the expected frequencies should be less than 5. By looking at the table we can confirm that both assumptions are fulfilled.
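
Instead of eyeballing the table, we can also verify both assumptions programmatically; both expressions should return `TRUE`:

```
# assumption 1: all expected frequencies are 1 or greater
all(expected_frequencies >= 1)
# assumption 2: at most 20 % of the expected frequencies are below 5
mean(expected_frequencies < 5) <= 0.2
```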

Now, we have all the data we need to perform a \(\chi^2\) independence test.

In order to conduct the **\(\chi^2\) independence test** we
follow the same step-wise implementation procedure for hypothesis
testing as discussed in the previous sections:

\[
\begin{array}{ll}
\hline
\text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\
\text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\
\text{Step 3} & \text{Compute the value of the test statistic.} \\
\text{Step 4} & \text{Determine the p-value.} \\
\text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0 \text{.} \\
\text{Step 6} & \text{Interpret the result of the hypothesis test.} \\
\hline
\end{array}
\]

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that there is no association between the gender and the major study subject of students:

\[H_0: \text{No association between gender and major study subject}\]

Alternative hypothesis:

\[H_A: \quad \text{There is an association between gender and major study subject}\]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.05\]

`alpha <- 0.05`

**Step 3 and 4: Compute the value of the test statistic and the
p-value**

For illustration purposes we manually compute the test statistic in R. Recall the equation for the test statistic from above:

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency.

To calculate the test statistic we perform the calculations cell by cell.
Thus, we again apply a nested for-loop. To make the code more readable
we reassign the observed and expected frequencies to the variables
`of` and `ef`, respectively.

```
# compute the value of the test statistic
of <- conti_table
ef <- expected_frequencies
# initialize test statistic to zero
x2 <- 0
# nested for-loop: accumulate (O - E)^2 / E over all cells
for (row.idx in 1:nrow(of)) {
  for (col.idx in 1:ncol(of)) {
    cell <- (of[row.idx, col.idx] - ef[row.idx, col.idx])^2 / ef[row.idx, col.idx]
    x2 <- x2 + cell # update variable x2 cell by cell
  }
}
x2
```

`## [1] 84.11274`

The numerical value of the test statistic is \(\approx 84.11\).
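
Since R applies arithmetic operators element-wise to matrices, the nested for-loop can be condensed into a single vectorized expression, which should return the same value:

```
# vectorized equivalent of the nested for-loop
sum((of - ef)^2 / ef)
```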

In order to calculate the *p*-value we apply the `pchisq()` function.
Recall how to calculate the degrees of freedom:

\[df = (r-1) \times (c-1)\text{,}\]

where \(r\) and \(c\) are the number of possible values for the two variables under consideration.

```
# compute df
df <- (nrow(of) - 1) * (ncol(of) - 1)
# compute the p-value
p <- pchisq(x2, df = df, lower.tail = FALSE)
p
```

`## [1] 1.155201e-16`

\(p \approx 1.155 \times 10^{-16}\), consistent with the `p-value < 2.2e-16` reported by `chisq.test()` further below.

**Step 5: If \(p \le \alpha\),
reject \(H_0\); otherwise, do not
reject \(H_0\)**

`p <= alpha`

`## [1] TRUE`

The *p*-value is smaller than the specified significance level
of 0.05; we reject \(H_0\). The test
results are statistically significant at the 5 % level and provide very
strong evidence against the null hypothesis.
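
As an optional cross-check, not part of the step-wise procedure above, we can compare the test statistic against the critical value of the \(\chi^2\) distribution with \(df = 5\), which is roughly \(11.07\) at the 5 % level:

```
# critical value approach: reject H0 if the test statistic exceeds it
crit <- qchisq(1 - alpha, df = df)
x2 > crit
```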

**Step 6: Interpret the result of the hypothesis
test**

At the 5 % significance level the data provides very strong evidence to conclude that there is an association between gender and the major study subject.

We just completed a \(\chi^2\) independence test manually in R. We can do the same in just one line of code!

Therefore, we apply the `chisq.test()` function. We either provide two
vectors as input data, such as `data$major` and `data$gender`, or we
provide a contingency table such as `of` directly; note that a second
argument like `ef` is ignored when the first argument is already a
table.

```
# vector data
chisq.test(data$major, data$gender)
```

```
##
## Pearson's Chi-squared test
##
## data: data$major and data$gender
## X-squared = 84.113, df = 5, p-value < 2.2e-16
```

```
# table object (the second argument ef is ignored)
chisq.test(of, ef)
```

```
##
## Pearson's Chi-squared test
##
## data: of
## X-squared = 84.113, df = 5, p-value < 2.2e-16
```

Perfect! Compare the output of the `chisq.test()` function with our
result above. Again, we may conclude that at the 5 % significance level
the data provides very strong evidence for an association between
gender and the major study subject.
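
Incidentally, the object returned by `chisq.test()` also stores the intermediate results, so our manual computations can be verified directly. A small sketch (`res` is our own name):

```
# inspect the components of the returned htest object
res <- chisq.test(conti_table)
res$expected  # expected frequencies, as computed manually above
res$statistic # the chi-squared test statistic
res$p.value   # the p-value
```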

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*