The **$\chi^{2}$ independence test** is an inferential method to decide whether an association exists between two variables. Like other hypothesis tests, the null hypothesis states that the two variables are not associated. In contrast, the alternative hypothesis states that the two variables are associated.

Recall that statistically **dependent variables** are called **associated variables**. In contrast, non-associated variables are called statistically independent variables. Further, recall the concept of **contingency tables** (also known as two-way table, cross-tabulation table or cross tabs), which display the frequency distributions of bivariate data.

The basic idea behind the **$\chi^{2}$ independence test** is to compare the **observed frequencies** in a contingency table with the **expected frequencies**, given that the null hypothesis of non-association is true. The expected frequency for each cell of a contingency table is given by

$$E = \frac{R \times C}{n}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.

Let us construct an example for better understanding. We consider an exit poll in the form of a contingency table that displays the age of $n = 1189$ people in the categories 18-29, 30-44, 45-64 and >65 years and their political affiliation, which is "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:**

| | Conservative | Socialist | Other | $\sum$ |
|---|---|---|---|---|
| 18-29 | 141 | 68 | 4 | 213 |
| 30-44 | 179 | 159 | 7 | 345 |
| 45-64 | 220 | 216 | 4 | 440 |
| >65 | 86 | 101 | 4 | 191 |
| $\sum$ | 626 | 544 | 19 | 1189 |
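For readers following along in Python, the observed table can be reproduced as a small `pandas` DataFrame. This is a standalone sketch; the variable name `observed` is illustrative:

```python
import pandas as pd

# Observed exit-poll frequencies (rows: age groups, columns: affiliations)
observed = pd.DataFrame(
    {"Conservative": [141, 179, 220, 86],
     "Socialist": [68, 159, 216, 101],
     "Other": [4, 7, 4, 4]},
    index=["18-29", "30-44", "45-64", ">65"])

print(observed.to_numpy().sum())  # total sample size: 1189
```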

We calculate the expected frequency for each cell based on the above equation.

**Expected frequencies:**

| | Conservative | Socialist | Other | $\sum$ |
|---|---|---|---|---|
| 18-29 | $\frac {213 \times 626} {1189} \approx 112.14$ | $\frac {213 \times 544} {1189} \approx 97.45$ | $\frac {213 \times 19} {1189} \approx 3.4$ | 213 |
| 30-44 | $\frac {345 \times 626} {1189} \approx 181.64$ | $\frac {345 \times 544} {1189} \approx 157.85$ | $\frac {345 \times 19} {1189} \approx 5.51$ | 345 |
| 45-64 | $\frac {440 \times 626} {1189} \approx 231.66$ | $\frac {440 \times 544} {1189} \approx 201.31$ | $\frac {440 \times 19} {1189} \approx 7.03$ | 440 |
| >65 | $\frac {191 \times 626} {1189} \approx 100.56$ | $\frac {191 \times 544} {1189} \approx 87.39$ | $\frac {191 \times 19} {1189} \approx 3.05$ | 191 |
| $\sum$ | 626 | 544 | 19 | 1189 |
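The cell-wise formula $E = \frac{R \times C}{n}$ can be evaluated for all cells at once with an outer product. A minimal `numpy` sketch, hard-coding the marginal totals from the table above:

```python
import numpy as np

# Marginal totals from the observed exit-poll table
R = np.array([213, 345, 440, 191])  # row (age-group) totals
C = np.array([626, 544, 19])        # column (affiliation) totals
n = 1189                            # sample size

# E = (R * C) / n for every cell at once via the outer product
expected = np.outer(R, C) / n
print(np.round(expected[0, 0], 2))  # first cell: 213 * 626 / 1189 -> 112.14
```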

Once we know the expected frequencies, we have to check two assumptions:

- all expected frequencies must be one or greater,
- at most 20 % of the expected frequencies may be less than 5.

We may confirm that both assumptions are fulfilled by looking at the table.
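Both assumptions can also be verified programmatically. A sketch, recomputing the expected frequencies from the marginal totals above:

```python
import numpy as np

# Expected frequencies of the exit-poll example, E = (R * C) / n
expected = np.outer([213, 345, 440, 191], [626, 544, 19]) / 1189

# Assumption 1: all expected frequencies are one or greater
print((expected >= 1).all())          # True

# Assumption 2: at most 20 % of the expected frequencies are below 5
print((expected < 5).mean() <= 0.20)  # True (2 of 12 cells, ~16.7 %)
```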

The actual comparison is based on the $\chi^{2}$ test statistic for the observed and expected frequencies. The $\chi^{2}$ test statistic follows the $\chi^{2}$ distribution and is given by:

$$\chi^{2}= \sum {\frac {(O - E)^{2} } {E} }$$

where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac {(O - E)^{2} } {E}$ is evaluated for each cell and then summed up.

The number of degrees of freedom is given by:

$$df = (r - 1) \times (c - 1)$$

where $r$ and $c$ are the number of possible values for the two variables under consideration.

Applied to the above example, this leads to a somewhat lengthy expression which, for the sake of brevity, is given just for the first and the last row of the contingency tables of interest:

$$\chi^{2} = \frac {(141 - 112.14)^{2}} {112.14} + \frac {(68 - 97.45)^{2}} {97.45} + \frac {(4 - 3.4)^{2}} {3.4} + \dots + \frac {(86 - 100.56)^{2}} {100.56} + \frac {(101 - 87.39)^{2}} {87.39} + \frac {(4 - 3.05)^{2}} {3.05}$$

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the $\chi^{2}$ test statistic, thus supporting $H_{0}$. If, however, the value of the $\chi^{2}$ test statistic is large, the data provide evidence against $H_{0}$. In the following sections, we further discuss how to assess the value of the $\chi^{2}$ test statistic in the framework of hypothesis testing.
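As a sanity check, the whole exit-poll example can be run through `scipy.stats.chi2_contingency` (discussed in detail later in this section). A standalone sketch with the observed counts hard-coded; here $df = (4 - 1) \times (3 - 1) = 6$:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed exit-poll frequencies (rows: 18-29, 30-44, 45-64, >65;
# columns: Conservative, Socialist, Other)
observed = np.array([[141,  68, 4],
                     [179, 159, 7],
                     [220, 216, 4],
                     [ 86, 101, 4]])

statistic, pvalue, dof, expected = chi2_contingency(observed)
print(round(statistic, 2), dof)  # chi-squared statistic and df = 6
```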

In order to get some hands-on experience, we apply the **$\chi^{2}$ independence test** in an exercise. For this, we load the `students` *data set*. You may download the `students.csv` file here and import it from your local file system, or you load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `dataframe` object by using the `read_csv` method:

Note: Make sure the `numpy` and `pandas` packages are part of your `mamba` environment!

In [1]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary

In this exercise, we want to examine **if there is an association between the variables gender and major, or in other words, we want to know if male students favour different study subjects compared to female students**.

We start with data preparation. We only want to deal with a subset of the data set of 8239 entries; thus, we randomly select 865 students from the data set. The first step of data preparation is to display our data of interest as a contingency table. `pandas` provides the handy `crosstab()` function, which will do the job for us!

In [2]:

```
n = 865
sample = students.sample(n, random_state = 8)
observed_frequencies_table = pd.crosstab(sample.major, sample.gender, margins=False)
observed_frequencies_table
```

Out[2]:

| major \ gender | Female | Male |
|---|---|---|
| Biology | 102 | 66 |
| Economics and Finance | 53 | 82 |
| Environmental Sciences | 63 | 88 |
| Mathematics and Statistics | 33 | 93 |
| Political Science | 111 | 61 |
| Social Sciences | 78 | 35 |

The row and column totals (margins) are included by setting `margins = True`:

In [3]:

```
pd.crosstab(sample.major, sample.gender, margins=True)
```

Out[3]:

| major \ gender | Female | Male | All |
|---|---|---|---|
| Biology | 102 | 66 | 168 |
| Economics and Finance | 53 | 82 | 135 |
| Environmental Sciences | 63 | 88 | 151 |
| Mathematics and Statistics | 33 | 93 | 126 |
| Political Science | 111 | 61 | 172 |
| Social Sciences | 78 | 35 | 113 |
| All | 440 | 425 | 865 |

In the next step, we construct the **expected frequencies**. Recall the equation above:

$$E = \frac{R \times C}{n}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.

We compute the expected frequencies cell-wise by implementing a nested for-loop: we go through all rows of the `dataframe`, column by column, and calculate the expected frequency $E$ for each cell.

In [4]:

```
n = 865
observed_frequencies_table = pd.crosstab(sample.major, sample.gender, margins=False)
# work on a float copy so that the fractional expected values are not truncated
expected_frequencies_table = observed_frequencies_table.astype(float)
for row in range(expected_frequencies_table.shape[0]):
    for column in range(expected_frequencies_table.shape[1]):
        exp = (np.sum(observed_frequencies_table.iloc[row, :]) * np.sum(observed_frequencies_table.iloc[:, column])) / n
        expected_frequencies_table.iloc[row, column] = exp
expected_frequencies_table
```

Out[4]:

| major \ gender | Female | Male |
|---|---|---|
| Biology | 85.456647 | 82.543353 |
| Economics and Finance | 68.670520 | 66.329480 |
| Environmental Sciences | 76.809249 | 74.190751 |
| Mathematics and Statistics | 64.092486 | 61.907514 |
| Political Science | 87.491329 | 84.508671 |
| Social Sciences | 57.479769 | 55.520231 |
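As an aside, the nested loop can be replaced by a single vectorized expression using an outer product of the marginal totals. A standalone sketch (the observed counts from Out[2] are hard-coded here so the snippet runs without the `students` data set):

```python
import numpy as np
import pandas as pd

# Observed frequencies from Out[2], hard-coded for a standalone sketch
observed = pd.DataFrame(
    {"Female": [102, 53, 63, 33, 111, 78],
     "Male":   [66, 82, 88, 93, 61, 35]},
    index=["Biology", "Economics and Finance", "Environmental Sciences",
           "Mathematics and Statistics", "Political Science", "Social Sciences"])

n = observed.to_numpy().sum()  # sample size: 865
# E = (row total * column total) / n for every cell at once
expected = pd.DataFrame(
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n,
    index=observed.index, columns=observed.columns)
print(expected.round(2))
```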

Once we know the expected frequencies, we have to check the same two assumptions:

- all expected frequencies must be one or greater,
- at most 20 % of the expected frequencies may be less than 5.

By looking at the table, we may confirm that both assumptions are fulfilled.

Now, we have all the data we need to perform a $\chi^{2}$ independence test.

In order to conduct the **$\chi^{2}$ independence test**, we follow the step-wise implementation procedure for hypothesis testing. The **$\chi^{2}$ independence test** follows the same step-wise procedure as discussed in the previous sections:

- State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$.
- Decide on the significance level, $\alpha$.
- Compute the value of the test statistic.
- Determine the *p*-value.
- If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$.
- Interpret the result of the hypothesis test.

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that there is no association between gender and the major study subject of students:

$$H_{0}: \text {No association between gender and major study subject}$$

The alternative hypothesis:

$$H_{A}: \text {There is an association between gender and major study subject}$$

**Step 2: Decide on the significance level, $\alpha$**

In [5]:

```
alpha = 0.05
```

**Step 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes, we first compute the test statistic manually with Python. Recall the equation for the test statistic from above:

$$\chi^{2}= \sum {\frac {(O - E)^{2}} {E} }$$

where $O$ represents the observed frequency and $E$ represents the expected frequency.

In [6]:

```
chi_squared = np.sum(np.sum(((observed_frequencies_table - expected_frequencies_table) ** 2) / expected_frequencies_table))
chi_squared
```

Out[6]:

77.31526633939147

The numerical value of the test statistic is $\approx 77.32$.

In order to calculate the *p*-value, we apply the `chi2.cdf` function provided by the `stats` module of the `scipy` package to calculate the probability of occurrence for the test statistic based on the *$\chi^{2}$ distribution*. To do so, we also need the *degrees of freedom*. Recall how to calculate the degrees of freedom:

$$df = (r - 1) \times (c - 1)$$

where $r$ and $c$ are the number of possible values for the two variables under consideration.

In [7]:

```
from scipy.stats import chi2
df = (observed_frequencies_table.shape[0] - 1) * (observed_frequencies_table.shape[1] - 1)
p = 1 - chi2.cdf(chi_squared, df = df)
p
```

Out[7]:

3.1086244689504383e-15

$p = 3.1086245 \times 10^{-15}$.
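A numerical side note: for *p*-values this close to zero, `1 - chi2.cdf(...)` suffers from floating-point cancellation, because `chi2.cdf` returns a number extremely close to 1. The survival function `chi2.sf` computes the upper-tail probability directly and avoids this loss of precision:

```python
from scipy.stats import chi2

chi_squared = 77.31526633939147  # test statistic computed above
df = 5

# upper-tail probability computed directly, avoiding the 1 - cdf cancellation
p = chi2.sf(chi_squared, df=df)
print(p)
```

This also explains why the *p*-value above differs in its last digits from the one reported by `scipy.stats.chi2_contingency` later in this section.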

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [8]:

```
# reject H0?
p < alpha
```

Out[8]:

True

The *p*-value is smaller than the significance level of 0.05; we reject $H_{0}$. The test results are statistically significant at the 5 % level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 5 % significance level, the data provide very strong evidence to conclude that there is an association between gender and the major study subject.

**The $\chi^{2}$ independence test with `scipy`**

We just manually completed a $\chi^{2}$ independence test in Python. We can do the same with just one line of code by using the power of Python's package universe, namely the `scipy` package!

In order to conduct a $\chi^{2}$ independence test in Python with the `stats` module from the `scipy` package, we apply the `chi2_contingency()` function. We only have to provide a contingency table of the **observed frequencies** as a `pandas` `dataframe` or `numpy` `array`. Additional information regarding the function's usage can be obtained directly from the `scipy` documentation.

In [9]:

```
from scipy.stats import chi2_contingency
test_result = chi2_contingency(observed_frequencies_table)
test_result
```

Out[9]:

Chi2ContingencyResult(statistic=77.31526633939146, pvalue=3.056046255717623e-15, dof=5, expected_freq=array([[85.4566474 , 82.5433526 ], [68.67052023, 66.32947977], [76.80924855, 74.19075145], [64.09248555, 61.90751445], [87.49132948, 84.50867052], [57.47976879, 55.52023121]]))

The `chi2_contingency()` function returns an `object` which provides all relevant information regarding the performed $\chi^{2}$ independence test. This includes the **test statistic** as well as the corresponding ***p*-value** of the test result. In detail, the `object` consists of the following properties:

- `<object>.statistic` holds the actual test statistic and represents the empirical test value.
- `<object>.pvalue` represents the *p*-value of the performed significance test.
- `<object>.dof` represents the degrees of freedom.
- `<object>.expected_freq` stores the contingency table of the **expected frequencies** as a `numpy` `array`.

Consequently, the test statistic ($\chi^{2}_{emp}$) is retrieved via:

In [10]:

```
test_result.statistic
```

Out[10]:

77.31526633939146

The ***p*-value** is retrieved via:

In [11]:

```
test_result.pvalue
```

Out[11]:

3.056046255717623e-15
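Likewise, the **expected frequencies** are retrieved via `<object>.expected_freq`, and they agree with our manual computation. A standalone sketch (observed counts from Out[2] hard-coded so the snippet runs on its own):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Out[2] (columns: Female, Male)
observed = np.array([[102, 66], [53, 82], [63, 88],
                     [33, 93], [111, 61], [78, 35]])

test_result = chi2_contingency(observed)
# expected frequencies as a numpy array, matching the manual table in Out[4]
print(np.round(test_result.expected_freq, 2))
```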

Lastly, we want to provide a nicely printed output of the test results:

In [12]:

```
print("Teststatistic = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 15)))
```

Teststatistic = 77.31527
p-value = 3e-15

Comparing the manually computed test statistic and *p*-value with this output, the results match almost perfectly. Again, we may conclude that at the 5 % significance level, the data provide very strong evidence of an association between gender and the major study subject.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*