The **$\chi^{2}$ independence test** is an inferential method to decide whether an association exists between two variables. Like other hypothesis tests, the null hypothesis states that the two variables are not associated. In contrast, the alternative hypothesis states that the two variables are associated.

Recall that statistically **dependent variables** are called **associated variables**. In contrast, non-associated variables are called statistically independent variables. Further, recall the concept of **contingency tables** (also known as two-way table, cross-tabulation table or cross tabs), which display the frequency distributions of bivariate data.

The basic idea behind the **$\chi^{2}$ independence test** is to compare the **observed frequencies** in a contingency table with the **expected frequencies**, given that the null hypothesis of non-association is true. The expected frequency for each cell of a contingency table is given by

$$E = \frac{R \times C}{n}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.

Let us construct an example for better understanding. We consider an exit poll in the form of a contingency table that displays the age of $n = 1189$ people in the categories 18-29, 30-44, 45-64 and >65 years and their political affiliation, which is "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:**

| | Conservative | Socialist | Other | $\sum$ |
|---|---|---|---|---|
| 18-29 | 141 | 68 | 4 | 213 |
| 30-44 | 179 | 159 | 7 | 345 |
| 45-64 | 220 | 216 | 4 | 440 |
| >65 | 86 | 101 | 4 | 191 |
| $\sum$ | 626 | 544 | 19 | 1189 |
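For readers following along in Python, the observed table can be reproduced as a small `pandas` DataFrame. This is a standalone sketch; the variable name `observed` is illustrative:

```python
import pandas as pd

# Observed exit-poll frequencies (rows: age groups, columns: affiliations)
observed = pd.DataFrame(
    {"Conservative": [141, 179, 220, 86],
     "Socialist": [68, 159, 216, 101],
     "Other": [4, 7, 4, 4]},
    index=["18-29", "30-44", "45-64", ">65"])

print(observed.to_numpy().sum())  # total sample size: 1189
```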

We calculate the expected frequency for each cell based on the above equation.

**Expected frequencies:**

| | Conservative | Socialist | Other | $\sum$ |
|---|---|---|---|---|
| 18-29 | $\frac {213 \times 626} {1189} \approx 112.14$ | $\frac {213 \times 544} {1189} \approx 97.45$ | $\frac {213 \times 19} {1189} \approx 3.4$ | 213 |
| 30-44 | $\frac {345 \times 626} {1189} \approx 181.64$ | $\frac {345 \times 544} {1189} \approx 157.85$ | $\frac {345 \times 19} {1189} \approx 5.51$ | 345 |
| 45-64 | $\frac {440 \times 626} {1189} \approx 231.66$ | $\frac {440 \times 544} {1189} \approx 201.31$ | $\frac {440 \times 19} {1189} \approx 7.03$ | 440 |
| >65 | $\frac {191 \times 626} {1189} \approx 100.56$ | $\frac {191 \times 544} {1189} \approx 87.39$ | $\frac {191 \times 19} {1189} \approx 3.05$ | 191 |
| $\sum$ | 626 | 544 | 19 | 1189 |
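The cell-wise formula $E = \frac{R \times C}{n}$ can be evaluated for all cells at once with an outer product. A minimal `numpy` sketch, hard-coding the marginal totals from the table above:

```python
import numpy as np

# Marginal totals from the observed exit-poll table
R = np.array([213, 345, 440, 191])  # row (age-group) totals
C = np.array([626, 544, 19])        # column (affiliation) totals
n = 1189                            # sample size

# E = (R * C) / n for every cell at once via the outer product
expected = np.outer(R, C) / n
print(np.round(expected[0, 0], 2))  # first cell: 213 * 626 / 1189 -> 112.14
```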

Once we know the expected frequencies, we have to check two assumptions:

- all expected frequencies must be one or greater,
- at most 20 % of the expected frequencies may be less than 5.

We may confirm that both assumptions are fulfilled by looking at the table.
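Both assumptions can also be verified programmatically. A sketch, recomputing the expected frequencies from the marginal totals above:

```python
import numpy as np

# Expected frequencies of the exit-poll example, E = (R * C) / n
expected = np.outer([213, 345, 440, 191], [626, 544, 19]) / 1189

# Assumption 1: all expected frequencies are one or greater
print((expected >= 1).all())          # True

# Assumption 2: at most 20 % of the expected frequencies are below 5
print((expected < 5).mean() <= 0.20)  # True (2 of 12 cells, ~16.7 %)
```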

The actual comparison is based on the $\chi^{2}$ test statistic for the observed and expected frequencies. The $\chi^{2}$ test statistic follows the $\chi^{2}$ distribution and is given by:

$$\chi^{2}= \sum {\frac {(O - E)^{2} } {E} }$$

where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac {(O - E)^{2} } {E}$ is evaluated for each cell and then summed up.

The number of degrees of freedom is given by:

$$df = (r - 1) \times (c - 1)$$

where $r$ and $c$ are the number of possible values for the two variables under consideration.

Applied to the above example, this leads to a somewhat lengthy expression which, for the sake of brevity, is given just for the first and the last row of the contingency tables of interest:

$$\chi^{2} = \frac {(141 - 112.14)^{2}} {112.14} + \frac {(68 - 97.45)^{2}} {97.45} + \frac {(4 - 3.4)^{2}} {3.4} + \dots + \frac {(86 - 100.56)^{2}} {100.56} + \frac {(101 - 87.39)^{2}} {87.39} + \frac {(4 - 3.05)^{2}} {3.05}$$

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the $\chi^{2}$ test statistic, thus supporting $H_{0}$. If, however, the value of the $\chi^{2}$ test statistic is large, the data provide evidence against $H_{0}$. In the following sections, we further discuss how to assess the value of the $\chi^{2}$ test statistic in the framework of hypothesis testing.
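As a sanity check, the whole exit-poll example can be run through `scipy.stats.chi2_contingency` (discussed in detail later in this section). A standalone sketch with the observed counts hard-coded; here $df = (4 - 1) \times (3 - 1) = 6$:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed exit-poll frequencies (rows: 18-29, 30-44, 45-64, >65;
# columns: Conservative, Socialist, Other)
observed = np.array([[141,  68, 4],
                     [179, 159, 7],
                     [220, 216, 4],
                     [ 86, 101, 4]])

statistic, pvalue, dof, expected = chi2_contingency(observed)
print(round(statistic, 2), dof)  # chi-squared statistic and df = 6
```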

In order to get some hands-on experience, we apply the **$\chi^{2}$ independence test** in an exercise. For this, we load the `students` *data set*. You may download the `students.csv` file here and import it from your local file system, or you load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `dataframe` object by using the `read_csv` method:

Note: Make sure the `numpy` and `pandas` packages are part of your `mamba` environment!

In [1]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary

In this exercise, we want to examine **if there is an association between the variables gender and major, or in other words, we want to know if male students favour different study subjects compared to female students**.

We start with data preparation. We only want to deal with a subset of the data set of 8239 entries; thus, we randomly select 865 students from the data set. The first step of data preparation is to display our data of interest as a contingency table. `pandas` provides the handy `crosstab()` function, which will do the job for us!

In [2]:

```
n = 865
sample = students.sample(n, random_state = 8)
observed_frequencies_table = pd.crosstab(sample.major, sample.gender, margins=False)
observed_frequencies_table
```

Out[2]:

| major \ gender | Female | Male |
|---|---|---|
| Biology | 102 | 66 |
| Economics and Finance | 53 | 82 |
| Environmental Sciences | 63 | 88 |
| Mathematics and Statistics | 33 | 93 |
| Political Science | 111 | 61 |
| Social Sciences | 78 | 35 |

The row and column totals (margins) are included by setting `margins = True`:

In [3]:

```
pd.crosstab(sample.major, sample.gender, margins=True)
```

Out[3]:

| major \ gender | Female | Male | All |
|---|---|---|---|
| Biology | 102 | 66 | 168 |
| Economics and Finance | 53 | 82 | 135 |
| Environmental Sciences | 63 | 88 | 151 |
| Mathematics and Statistics | 33 | 93 | 126 |
| Political Science | 111 | 61 | 172 |
| Social Sciences | 78 | 35 | 113 |
| All | 440 | 425 | 865 |

In the next step, we construct the **expected frequencies**. Recall the equation above:

$$E = \frac{R \times C}{n}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we go through all rows of the `dataframe`, column by column, and calculate the expected frequency $E$ for each cell.

In [4]:

```
n = 865
observed_frequencies_table = pd.crosstab(sample.major, sample.gender, margins=False)
# work on a float copy so that the fractional expected values are not truncated
expected_frequencies_table = observed_frequencies_table.astype(float)
for row in range(expected_frequencies_table.shape[0]):
    for column in range(expected_frequencies_table.shape[1]):
        exp = (np.sum(observed_frequencies_table.iloc[row, :]) * np.sum(observed_frequencies_table.iloc[:, column])) / n
        expected_frequencies_table.iloc[row, column] = exp
expected_frequencies_table
```

Out[4]:

| major \ gender | Female | Male |
|---|---|---|
| Biology | 85.456647 | 82.543353 |
| Economics and Finance | 68.670520 | 66.329480 |
| Environmental Sciences | 76.809249 | 74.190751 |
| Mathematics and Statistics | 64.092486 | 61.907514 |
| Political Science | 87.491329 | 84.508671 |
| Social Sciences | 57.479769 | 55.520231 |
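As an aside, the nested loop can be replaced by a single vectorized expression using an outer product of the marginal totals. A standalone sketch (the observed counts from Out[2] are hard-coded here so the snippet runs without the `students` data set):

```python
import numpy as np
import pandas as pd

# Observed frequencies from Out[2], hard-coded for a standalone sketch
observed = pd.DataFrame(
    {"Female": [102, 53, 63, 33, 111, 78],
     "Male":   [66, 82, 88, 93, 61, 35]},
    index=["Biology", "Economics and Finance", "Environmental Sciences",
           "Mathematics and Statistics", "Political Science", "Social Sciences"])

n = observed.to_numpy().sum()  # sample size: 865
# E = (row total * column total) / n for every cell at once
expected = pd.DataFrame(
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n,
    index=observed.index, columns=observed.columns)
print(expected.round(2))
```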

Once we know the expected frequencies, we have to check the same two assumptions:

- all expected frequencies must be one or greater,
- at most 20 % of the expected frequencies may be less than 5.

By looking at the table, we may confirm that both assumptions are fulfilled.

Now, we have all the data we need to perform a $\chi^{2}$ independence test.

In order to conduct the **$\chi^{2}$ independence test**, we follow the step-wise implementation procedure for hypothesis testing. The **$\chi^{2}$ independence test** follows the same step-wise procedure as discussed in the previous sections:

- State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$.
- Decide on the significance level, $\alpha$.
- Compute the value of the test statistic.
- Determine the *p*-value.
- If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$.
- Interpret the result of the hypothesis test.

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that there is no association between gender and the major study subject of students:

$$H_{0}: \text {No association between gender and major study subject}$$

The alternative hypothesis:

$$H_{A}: \text {There is an association between gender and major study subject}$$

**Step 2: Decide on the significance level, $\alpha$**

In [5]:

```
alpha = 0.05
```

**Step 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes, we first compute the test statistic manually with Python. Recall the equation for the test statistic from above:

$$\chi^{2}= \sum {\frac {(O - E)^{2}} {E} }$$

where $O$ represents the observed frequency and $E$ represents the expected frequency.

In [6]:

```
chi_squared = np.sum(np.sum(((observed_frequencies_table - expected_frequencies_table) ** 2) / expected_frequencies_table))
chi_squared
```

Out[6]:

77.31526633939147

The numerical value of the test statistic is $\approx 77.32$.

In order to calculate the *p*-value, we apply the `chi2.cdf` function provided by the `stats` module of the `scipy` package to calculate the probability of occurrence for the test statistic based on the *$\chi^{2}$ distribution*. To do so, we also need the *degrees of freedom*. Recall how to calculate the degrees of freedom:

$$df = (r - 1) \times (c - 1)$$

where $r$ and $c$ are the number of possible values for the two variables under consideration.

In [7]:

```
from scipy.stats import chi2
df = (observed_frequencies_table.shape[0] - 1) * (observed_frequencies_table.shape[1] - 1)
p = 1 - chi2.cdf(chi_squared, df = df)
p
```

Out[7]:

3.1086244689504383e-15

$p = 3.1086245 \times 10^{-15}$.
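A numerical side note: for *p*-values this close to zero, `1 - chi2.cdf(...)` suffers from floating-point cancellation, because `chi2.cdf` returns a number extremely close to 1. The survival function `chi2.sf` computes the upper-tail probability directly and avoids this loss of precision:

```python
from scipy.stats import chi2

chi_squared = 77.31526633939147  # test statistic computed above
df = 5

# upper-tail probability computed directly, avoiding the 1 - cdf cancellation
p = chi2.sf(chi_squared, df=df)
print(p)
```

This also explains why the *p*-value above differs in its last digits from the one reported by `scipy.stats.chi2_contingency` later in this section.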

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [8]:

```
# reject H0?
p < alpha
```

Out[8]:

True

The *p*-value is smaller than the significance level of 0.05; we reject $H_{0}$. The test results are statistically significant at the 5 % level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 5 % significance level, the data provide very strong evidence to conclude that there is an association between gender and the major study subject.

**The $\chi^{2}$ independence test with `scipy`**

We just manually completed a $\chi^{2}$ independence test in Python. We can do the same with just one line of code by using the power of Python's package universe, namely the `scipy` package!

In order to conduct a $\chi^{2}$ independence test in Python with the `stats` module from the `scipy` package, we apply the `chi2_contingency()` function. We only have to provide a contingency table of the **observed frequencies** as a `pandas` `dataframe` or `numpy` `array`. Additional information regarding the function's usage can be obtained directly from the `scipy` documentation.

In [9]:

```
from scipy.stats import chi2_contingency
test_result = chi2_contingency(observed_frequencies_table)
test_result
```

Out[9]:

Chi2ContingencyResult(statistic=77.31526633939146, pvalue=3.056046255717623e-15, dof=5, expected_freq=array([[85.4566474 , 82.5433526 ], [68.67052023, 66.32947977], [76.80924855, 74.19075145], [64.09248555, 61.90751445], [87.49132948, 84.50867052], [57.47976879, 55.52023121]]))

The `chi2_contingency()` function returns an `object` which provides all relevant information regarding the performed $\chi^{2}$ independence test. This includes the **test statistic** as well as the corresponding ***p*-value** of the test result. In detail, the `object` consists of the following properties:

- `<object>.statistic` holds the actual test statistic and represents the empirical test value.
- `<object>.pvalue` represents the *p*-value of the performed significance test.
- `<object>.dof` represents the degrees of freedom.
- `<object>.expected_freq` stores the contingency table of the **expected frequencies** as a `numpy` `array`.

Consequently, the test statistic ($\chi^{2}_{emp}$) is retrieved via:

In [10]:

```
test_result.statistic
```

Out[10]:

77.31526633939146

The ***p*-value** is retrieved via:

In [11]:

```
test_result.pvalue
```

Out[11]:

3.056046255717623e-15
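Likewise, the **expected frequencies** are retrieved via `<object>.expected_freq`, and they agree with our manual computation. A standalone sketch (observed counts from Out[2] hard-coded so the snippet runs on its own):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Out[2] (columns: Female, Male)
observed = np.array([[102, 66], [53, 82], [63, 88],
                     [33, 93], [111, 61], [78, 35]])

test_result = chi2_contingency(observed)
# expected frequencies as a numpy array, matching the manual table in Out[4]
print(np.round(test_result.expected_freq, 2))
```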

Lastly, we want to provide a nicely printed output of the test results:

In [12]:

```
print("Teststatistic = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 15)))
```

Teststatistic = 77.31527
p-value = 3e-15

Comparing the manually computed test statistic and *p*-value with this output, the results match almost perfectly. Again, we may conclude that at the 5 % significance level, the data provide very strong evidence of an association between gender and the major study subject.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*