The **$\chi^{2}$ independence test** is an inferential method to decide whether an association exists between two variables. Like other hypothesis tests, the null hypothesis states that the two variables are not associated. In contrast, the alternative hypothesis states that the two variables are associated.

Recall that statistically **dependent variables** are called **associated variables**. In contrast, non-associated variables are called statistically independent variables. Further, recall the concept of **contingency tables** (also known as two-way table, cross-tabulation table or cross tabs), which display the frequency distributions of bivariate data.

The basic idea behind the **$\chi^{2}$ independence test** is to compare the **observed frequencies** in a contingency table with the **expected frequencies**, given that the null hypothesis of non-association is true. The expected frequency for each cell of a contingency table is given by

$$E = \frac{R \times C}{n}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.

Let us construct an example for better understanding. We consider an exit poll in the form of a contingency table that displays the age of $n = 1189$ people in the categories 18-29, 30-44, 45-64 and >65 years and their political affiliation, which is "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:**

| | Conservative | Socialist | Other | $\sum$ |
|---|---|---|---|---|
| 18-29 | 141 | 68 | 4 | 213 |
| 30-44 | 179 | 159 | 7 | 345 |
| 45-64 | 220 | 216 | 4 | 440 |
| >65 | 86 | 101 | 4 | 191 |
| $\sum$ | 626 | 544 | 19 | 1189 |
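For readers following along in Python, the observed table can be reproduced as a small `pandas` DataFrame. This is a standalone sketch; the variable name `observed` is illustrative:

```python
import pandas as pd

# Observed exit-poll frequencies (rows: age groups, columns: affiliations)
observed = pd.DataFrame(
    {"Conservative": [141, 179, 220, 86],
     "Socialist": [68, 159, 216, 101],
     "Other": [4, 7, 4, 4]},
    index=["18-29", "30-44", "45-64", ">65"])

print(observed.to_numpy().sum())  # total sample size: 1189
```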

We calculate the expected frequency for each cell based on the above equation.

**Expected frequencies:**

| | Conservative | Socialist | Other | $\sum$ |
|---|---|---|---|---|
| 18-29 | $\frac {213 \times 626} {1189} \approx 112.14$ | $\frac {213 \times 544} {1189} \approx 97.45$ | $\frac {213 \times 19} {1189} \approx 3.4$ | 213 |
| 30-44 | $\frac {345 \times 626} {1189} \approx 181.64$ | $\frac {345 \times 544} {1189} \approx 157.85$ | $\frac {345 \times 19} {1189} \approx 5.51$ | 345 |
| 45-64 | $\frac {440 \times 626} {1189} \approx 231.66$ | $\frac {440 \times 544} {1189} \approx 201.31$ | $\frac {440 \times 19} {1189} \approx 7.03$ | 440 |
| >65 | $\frac {191 \times 626} {1189} \approx 100.56$ | $\frac {191 \times 544} {1189} \approx 87.39$ | $\frac {191 \times 19} {1189} \approx 3.05$ | 191 |
| $\sum$ | 626 | 544 | 19 | 1189 |
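The cell-wise formula $E = \frac{R \times C}{n}$ can be evaluated for all cells at once with an outer product. A minimal `numpy` sketch, hard-coding the marginal totals from the table above:

```python
import numpy as np

# Marginal totals from the observed exit-poll table
R = np.array([213, 345, 440, 191])  # row (age-group) totals
C = np.array([626, 544, 19])        # column (affiliation) totals
n = 1189                            # sample size

# E = (R * C) / n for every cell at once via the outer product
expected = np.outer(R, C) / n
print(np.round(expected[0, 0], 2))  # first cell: 213 * 626 / 1189 -> 112.14
```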

Once we know the expected frequencies, we have to check two assumptions:

- all expected frequencies must be one or greater,
- at most 20 % of the expected frequencies may be less than 5.

We may confirm that both assumptions are fulfilled by looking at the table.
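Both assumptions can also be verified programmatically. A sketch, recomputing the expected frequencies from the marginal totals above:

```python
import numpy as np

# Expected frequencies of the exit-poll example, E = (R * C) / n
expected = np.outer([213, 345, 440, 191], [626, 544, 19]) / 1189

# Assumption 1: all expected frequencies are one or greater
print((expected >= 1).all())          # True

# Assumption 2: at most 20 % of the expected frequencies are below 5
print((expected < 5).mean() <= 0.20)  # True (2 of 12 cells, ~16.7 %)
```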

The actual comparison is based on the $\chi^{2}$ test statistic for the observed and expected frequencies. The $\chi^{2}$ test statistic follows the $\chi^{2}$ distribution and is given by:

$$\chi^{2}= \sum {\frac {(O - E)^{2} } {E} }$$

where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac {(O - E)^{2} } {E}$ is evaluated for each cell and then summed up.

The number of degrees of freedom is given by:

$$df = (r - 1) \times (c - 1)$$

where $r$ and $c$ are the number of possible values for the two variables under consideration.

Applied to the above example, this leads to a somewhat lengthy expression which, for the sake of brevity, is given just for the first and the last row of the contingency tables of interest:

$$\chi^{2} = \frac {(141 - 112.14)^{2}} {112.14} + \frac {(68 - 97.45)^{2}} {97.45} + \frac {(4 - 3.4)^{2}} {3.4} + \dots + \frac {(86 - 100.56)^{2}} {100.56} + \frac {(101 - 87.39)^{2}} {87.39} + \frac {(4 - 3.05)^{2}} {3.05}$$

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the $\chi^{2}$ test statistic, thus supporting $H_{0}$. If, however, the value of the $\chi^{2}$ test statistic is large, the data provide evidence against $H_{0}$. In the following sections, we further discuss how to assess the value of the $\chi^{2}$ test statistic in the framework of hypothesis testing.
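As a sanity check, the whole exit-poll example can be run through `scipy.stats.chi2_contingency` (discussed in detail later in this section). A standalone sketch with the observed counts hard-coded; here $df = (4 - 1) \times (3 - 1) = 6$:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed exit-poll frequencies (rows: 18-29, 30-44, 45-64, >65;
# columns: Conservative, Socialist, Other)
observed = np.array([[141,  68, 4],
                     [179, 159, 7],
                     [220, 216, 4],
                     [ 86, 101, 4]])

statistic, pvalue, dof, expected = chi2_contingency(observed)
print(round(statistic, 2), dof)  # chi-squared statistic and df = 6
```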

In order to get some hands-on experience, we apply the **$\chi^{2}$ independence test** in an exercise. For this, we load the `students` *data set*. You may download the `students.csv` file here and import it from your local file system, or you load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `dataframe` object by using the `read_csv` method:

Note: Make sure the `numpy` and `pandas` packages are part of your `mamba` environment!

In [1]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary

In this exercise, we want to examine **if there is an association between the variables gender and major, or in other words, we want to know if male students favour different study subjects compared to female students**.

We start with data preparation. We only want to deal with a subset of the data set of 8239 entries; thus, we randomly select 865 students from the data set. The first step of data preparation is to display our data of interest as a contingency table. `pandas` provides the handy `crosstab()` function, which will do the job for us!

In [2]:

```
n = 865
sample = students.sample(n, random_state = 8)
observed_frequencies_table = pd.crosstab(sample.major, sample.gender, margins=False)
observed_frequencies_table
```

Out[2]:

| major \ gender | Female | Male |
|---|---|---|
| Biology | 102 | 66 |
| Economics and Finance | 53 | 82 |
| Environmental Sciences | 63 | 88 |
| Mathematics and Statistics | 33 | 93 |
| Political Science | 111 | 61 |
| Social Sciences | 78 | 35 |

The row and column totals (margins) are included by setting `margins = True`:

In [3]:

```
pd.crosstab(sample.major, sample.gender, margins=True)
```

Out[3]:

| major \ gender | Female | Male | All |
|---|---|---|---|
| Biology | 102 | 66 | 168 |
| Economics and Finance | 53 | 82 | 135 |
| Environmental Sciences | 63 | 88 | 151 |
| Mathematics and Statistics | 33 | 93 | 126 |
| Political Science | 111 | 61 | 172 |
| Social Sciences | 78 | 35 | 113 |
| All | 440 | 425 | 865 |

In the next step, we construct the **expected frequencies**. Recall the equation above:

$$E = \frac{R \times C}{n}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we go through all rows of the `dataframe`, column by column, and calculate the expected frequency $E$ for each cell.

In [4]:

```
n = 865
observed_frequencies_table = pd.crosstab(sample.major, sample.gender, margins=False)
# work on a float copy so that the fractional expected values are not truncated
expected_frequencies_table = observed_frequencies_table.astype(float)
for row in range(expected_frequencies_table.shape[0]):
    for column in range(expected_frequencies_table.shape[1]):
        exp = (np.sum(observed_frequencies_table.iloc[row, :]) * np.sum(observed_frequencies_table.iloc[:, column])) / n
        expected_frequencies_table.iloc[row, column] = exp
expected_frequencies_table
```

Out[4]:

| major \ gender | Female | Male |
|---|---|---|
| Biology | 85.456647 | 82.543353 |
| Economics and Finance | 68.670520 | 66.329480 |
| Environmental Sciences | 76.809249 | 74.190751 |
| Mathematics and Statistics | 64.092486 | 61.907514 |
| Political Science | 87.491329 | 84.508671 |
| Social Sciences | 57.479769 | 55.520231 |
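As an aside, the nested loop can be replaced by a single vectorized expression using an outer product of the marginal totals. A standalone sketch (the observed counts from Out[2] are hard-coded here so the snippet runs without the `students` data set):

```python
import numpy as np
import pandas as pd

# Observed frequencies from Out[2], hard-coded for a standalone sketch
observed = pd.DataFrame(
    {"Female": [102, 53, 63, 33, 111, 78],
     "Male":   [66, 82, 88, 93, 61, 35]},
    index=["Biology", "Economics and Finance", "Environmental Sciences",
           "Mathematics and Statistics", "Political Science", "Social Sciences"])

n = observed.to_numpy().sum()  # sample size: 865
# E = (row total * column total) / n for every cell at once
expected = pd.DataFrame(
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n,
    index=observed.index, columns=observed.columns)
print(expected.round(2))
```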

Once we know the expected frequencies, we have to check the same two assumptions:

- all expected frequencies must be one or greater,
- at most 20 % of the expected frequencies may be less than 5.

By looking at the table, we may confirm that both assumptions are fulfilled.

Now, we have all the data we need to perform a $\chi^{2}$ independence test.

In order to conduct the **$\chi^{2}$ independence test**, we follow the step-wise implementation procedure for hypothesis testing. The **$\chi^{2}$ independence test** follows the same step-wise procedure as discussed in the previous sections:

- State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$.
- Decide on the significance level, $\alpha$.
- Compute the value of the test statistic.
- Determine the *p*-value.
- If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$.
- Interpret the result of the hypothesis test.

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that there is no association between gender and the major study subject of students:

$$H_{0}: \text {No association between gender and major study subject}$$

The alternative hypothesis:

$$H_{A}: \text {There is an association between gender and major study subject}$$

**Step 2: Decide on the significance level, $\alpha$**

In [5]:

```
alpha = 0.05
```

**Step 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes, we first compute the test statistic manually with Python. Recall the equation for the test statistic from above:

$$\chi^{2}= \sum {\frac {(O - E)^{2}} {E} }$$

where $O$ represents the observed frequency and $E$ represents the expected frequency.

In [6]:

```
chi_squared = np.sum(np.sum(((observed_frequencies_table - expected_frequencies_table) ** 2) / expected_frequencies_table))
chi_squared
```

Out[6]:

77.31526633939147

The numerical value of the test statistic is $\approx 77.32$.

In order to calculate the *p*-value, we apply the `chi2.cdf` function provided by the `stats` module of the `scipy` package to calculate the probability of occurrence for the test statistic based on the *$\chi^{2}$ distribution*. To do so, we also need the *degrees of freedom*. Recall how to calculate the degrees of freedom:

$$df = (r - 1) \times (c - 1)$$

where $r$ and $c$ are the number of possible values for the two variables under consideration.

In [7]:

```
from scipy.stats import chi2
df = (observed_frequencies_table.shape[0] - 1) * (observed_frequencies_table.shape[1] - 1)
p = 1 - chi2.cdf(chi_squared, df = df)
p
```

Out[7]:

3.1086244689504383e-15

$p = 3.1086245 \times 10^{-15}$.
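A numerical side note: for *p*-values this close to zero, `1 - chi2.cdf(...)` suffers from floating-point cancellation, because `chi2.cdf` returns a number extremely close to 1. The survival function `chi2.sf` computes the upper-tail probability directly and avoids this loss of precision:

```python
from scipy.stats import chi2

chi_squared = 77.31526633939147  # test statistic computed above
df = 5

# upper-tail probability computed directly, avoiding the 1 - cdf cancellation
p = chi2.sf(chi_squared, df=df)
print(p)
```

This also explains why the *p*-value above differs in its last digits from the one reported by `scipy.stats.chi2_contingency` later in this section.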

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [8]:

```
# reject H0?
p < alpha
```

Out[8]:

True

The *p*-value is smaller than the significance level of 0.05; we reject $H_{0}$. The test results are statistically significant at the 5 % level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 5 % significance level, the data provide very strong evidence to conclude that there is an association between gender and the major study subject.

**The $\chi^{2}$ independence test with `scipy`**

We just manually completed a $\chi^{2}$ independence test in Python. We can do the same with just one line of code by using the power of Python's package universe, namely the `scipy` package!

In order to conduct a $\chi^{2}$ independence test in Python with the `stats` module from the `scipy` package, we apply the `chi2_contingency()` function. We only have to provide a contingency table of the **observed frequencies** as a `pandas` `dataframe` or `numpy` `array`. Additional information regarding the function's usage can be obtained directly from the `scipy` documentation.

In [9]:

```
from scipy.stats import chi2_contingency
test_result = chi2_contingency(observed_frequencies_table)
test_result
```

Out[9]:

Chi2ContingencyResult(statistic=77.31526633939146, pvalue=3.056046255717623e-15, dof=5, expected_freq=array([[85.4566474 , 82.5433526 ], [68.67052023, 66.32947977], [76.80924855, 74.19075145], [64.09248555, 61.90751445], [87.49132948, 84.50867052], [57.47976879, 55.52023121]]))

The `chi2_contingency()` function returns an `object` which provides all relevant information regarding the performed $\chi^{2}$ independence test. This includes the **test statistic** as well as the corresponding ***p*-value** of the test result. In detail, the `object` consists of the following properties:

- `<object>.statistic` holds the actual test statistic and represents the empirical test value.
- `<object>.pvalue` represents the *p*-value of the performed significance test.
- `<object>.dof` represents the degrees of freedom.
- `<object>.expected_freq` stores the contingency table of the **expected frequencies** as a `numpy` `array`.

Consequently, the test statistic ($\chi^{2}_{emp}$) is retrieved via:

In [10]:

```
test_result.statistic
```

Out[10]:

77.31526633939146

The ***p*-value** is retrieved via:

In [11]:

```
test_result.pvalue
```

Out[11]:

3.056046255717623e-15
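Likewise, the **expected frequencies** are retrieved via `<object>.expected_freq`, and they agree with our manual computation. A standalone sketch (observed counts from Out[2] hard-coded so the snippet runs on its own):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Out[2] (columns: Female, Male)
observed = np.array([[102, 66], [53, 82], [63, 88],
                     [33, 93], [111, 61], [78, 35]])

test_result = chi2_contingency(observed)
# expected frequencies as a numpy array, matching the manual table in Out[4]
print(np.round(test_result.expected_freq, 2))
```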

Lastly, we want to provide a nicely printed output of the test results:

In [12]:

```
print("Teststatistic = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 15)))
```

Teststatistic = 77.31527
p-value = 3e-15

Comparing the manually computed test statistic and *p*-value with this output, the results match almost perfectly. Again, we may conclude that at the 5 % significance level, the data provide very strong evidence of an association between gender and the major study subject.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*