The $\chi^{2}$ goodness-of-fit test is applied to perform hypothesis tests on the distribution of a qualitative (categorical) variable or a discrete quantitative variable that has only finitely many possible values.

The basic logic of the $\chi^{2}$ goodness-of-fit test is to compare a sample's observed frequencies with the frequencies we would expect if the hypothesized distribution were true.

Consider a simple example:

On September 22, 2013, the German federal election was held. More than 44 million people turned out to vote. 41.5 % of German voters decided to vote for the Christian Democratic Union (CDU) and 25.7 % for the Social Democratic Party (SPD). For simplicity, we subsume the remaining percentage of votes (32.8 %) as Others.

Based on that data, we may build a frequency table:

Party    Percentage    Relative frequency
CDU      41.5          0.415
SPD      25.7          0.257
Others   32.8          0.328
$\sum$   100           1

The third column of the table above corresponds to the relative frequencies of the German population/voters. For this exercise, we took a random sample: we asked 123 students of FU Berlin about their party affiliation and recorded their answers. Afterwards, we counted the occurrence of each category (party) in our sample. These quantities are the observed frequencies; the actual counts are:

Party Observed sample frequencies
CDU 43
SPD 36
Others 44
$\sum$ 123

In the next step we compute the expected frequency, denoted $E$, for each category:

$$E = n \times p \text{,}$$

where $n$ is the sample size and $p$ is the corresponding relative population frequency taken from election results given in the table above. Applying this information, we expect the following absolute frequencies per party:

$$E_{CDU} = n \times p = 123 \times 0.415 = 51.045$$

$$E_{SPD} = n \times p = 123 \times 0.257 = 31.611$$

$$E_{Others} = n \times p = 123 \times 0.328 = 40.344$$

Note: Although we deal with individual counts, represented by integer values, the expected frequency, $E$, is a floating point number. That is fine.
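Since $E = n \times p$ is plain arithmetic, we can reproduce these numbers with a few lines of Python. This is a minimal sketch; the dictionary simply restates the election shares from the frequency table above:

n = 123                                            # sample size
p = {"CDU": 0.415, "SPD": 0.257, "Others": 0.328}  # relative population frequencies

# expected frequency per party: E = n * p
expected = {party: n * share for party, share in p.items()}
print(expected)  # approximately {'CDU': 51.045, 'SPD': 31.611, 'Others': 40.344}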

Now, we put the observed frequencies and the expected frequencies together into one table:

Party    Observed sample frequencies    Expected sample frequencies
CDU      43                             51.045
SPD      36                             31.611
Others   44                             40.344
$\sum$   123                            123

Great! Once we have the expected frequencies, we have to check two assumptions:

  1. We have to ensure that all expected frequencies are one or greater.
  2. At most, 20 % of the expected frequencies should be less than 5.

By looking at the table, we may confirm that both assumptions are fulfilled.
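Both checks are also easy to automate. A minimal sketch, assuming the expected frequencies from above are collected in a list:

expected = [51.045, 31.611, 40.344]

# assumption 1: all expected frequencies are one or greater
print(all(e >= 1 for e in expected))                        # True

# assumption 2: at most 20 % of the expected frequencies are less than 5
print(sum(e < 5 for e in expected) / len(expected) <= 0.2)  # True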

Now we have all the ingredients we need to perform a $\chi^{2}$ goodness-of-fit test, except for the test statistic itself.

The $\chi^{2}$ test statistic for a goodness-of-fit is given by:

$$\chi^{2} = \sum \frac {(O - E)^{2}} {E}$$

where $O$ corresponds to the observed frequencies and $E$ to the expected frequencies. If the null hypothesis is true, the test statistic $\chi^{2}$ approximately follows a chi-square distribution.

The number of degrees of freedom is one less than the number of possible values (categories) for the variable under consideration. Hence:

$$df = c - 1$$

Based on the observed and expected frequencies given in the table above, it is fairly straightforward to calculate the $\chi^{2}$-value. However, to make the calculation procedure easier, we put all the necessary computational steps into one table. The observed sample frequencies are abbreviated as $O$ and the expected sample frequencies as $E$:

Party    $O$    $E$       $O - E$    $(O - E)^{2}$    $\frac {(O - E)^{2}} {E}$
CDU      43     51.045    -8.045     64.722           1.2679
SPD      36     31.611    4.389      19.263           0.6094
Others   44     40.344    3.656      13.366           0.3313
$\sum$   123    123       0          -                2.2086

Thus, the $\chi^{2}$ test statistic for the goodness-of-fit evaluates to approximately 2.2086 for our sample data.

$$\chi^{2} = \sum \frac {(O - E)^{2}} {E} \approx 2.209$$
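As a cross-check, the hand calculation is easily reproduced in Python. A minimal sketch, restating the observed counts and expected frequencies from the table above:

observed = [43, 36, 44]
expected = [51.045, 31.611, 40.344]

# chi-squared test statistic: sum of (O - E)^2 / E over all categories
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)  # approximately 2.2086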

If the null hypothesis is true, the observed and expected frequencies are roughly equal. This results in a small value of the $\chi^{2}$ test statistic, thus supporting $H_{0}$. If, however, the value of the $\chi^{2}$ test statistic is large, the data provide evidence against $H_{0}$.

In our case, we may compare the empirical $\chi^{2}$ test statistic with the corresponding critical $\chi^{2}$ value at a significance level of $\alpha = 0.05$, with degrees of freedom given by 3 categories minus 1, i.e. $df = 2$. To derive the critical value with Python, we apply the chi2.ppf function from the stats module of the scipy package:

Note: Make sure the scipy package is part of your mamba environment!

In [1]:
from scipy.stats import chi2

chi2.ppf(0.95, df = 2)
Out[1]:
5.991464547107979

Since our empirical $\chi^{2}$ value ($\approx 2.209$) is smaller than the critical $\chi^{2}$ value ($\approx 5.991$), we cannot reject the null hypothesis!
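Equivalently, we may compute the p-value of the empirical test statistic directly. The chi2.sf function from scipy.stats is the survival function ($1 - \text{cdf}$) and returns the upper-tail probability:

from scipy.stats import chi2

# upper-tail probability of the empirical test statistic (df = 2)
p_value = chi2.sf(2.2086, df = 2)
print(p_value)  # approximately 0.33, well above 0.05, so H0 is not rejected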

$\chi^{2}$ goodness-of-fit test: An Example¶

In order to get some hands-on experience, we apply the $\chi^{2}$ goodness-of-fit test in an exercise. For this, we load the students data set. You may download the students.csv file here and import it from your local file system, or load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object using the read_csv() function:

Note: Make sure the numpy and pandas packages are part of your mamba environment!

In [2]:
import pandas as pd
import numpy as np

students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

  • stud.id
  • name
  • gender
  • age
  • height
  • weight
  • religion
  • nc.score
  • semester
  • major
  • minor
  • score1
  • score2
  • online.tutorial
  • graduated
  • salary

Recall $\chi^{2}$ goodness-of-fit tests are applied for qualitative (categorical) or discrete quantitative variables. There are several categorical variables in the students data set, such as gender, religion, major, minor and graduated.
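If you want to verify which columns pandas actually stores as non-numeric, one option is the select_dtypes() method; a quick sketch:

# columns read in as strings ("object" dtype) are the categorical candidates here
print(students.select_dtypes(include = "object").columns.tolist())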

In order to showcase the $\chi^{2}$ goodness-of-fit test, we examine whether the distribution of religious affiliation among students matches the distribution of religions among the population of the European Union. The data on the continental scale is provided in the report "Discrimination in the EU in 2012" (European Union: European Commission, Special Eurobarometer, 393, p. 233).

The report provides data for eight categories of how people ascribed themselves:

  • 48 % as Catholic
  • 16 % as Non believers/Agnostic
  • 12 % as Protestant
  • 8 % as Orthodox
  • 7 % as Atheist
  • 4 % as Other Christian
  • 3 % as Other religion/None stated
  • 2 % as Muslim.

We plot the data in the form of a pie chart for a better understanding:

Note: Make sure the matplotlib and the seaborn package are part of your mamba environment!

In [3]:
import seaborn as sns

data = [48, 16, 12, 8, 7, 4, 3, 2]
religions = ["Catholic", "Non believer/\nAgnostic", "Protestant", 
             "Orthodox", "Atheist", "Other Christian",
             "Other religion/None stated", "Muslim"]

data = pd.Series(data, index = religions)

data.plot.pie(colors = sns.color_palette("Set3", 8))
Out[3]:
<Axes: >

Data preparation¶

We start with data exploration and data preparation.

First, we want to know which categories are available in the students data set for the column religion. Therefore, we apply the unique() method, which provides access to the levels (categories) of a variable:

In [4]:
print(students["religion"].unique())
['Muslim' 'Other' 'Protestant' 'Catholic' 'Orthodox']

Obviously, there are 5 different categories in the students data set, compared to the 8 categories provided by the EU report. Thus, in order to make comparisons, we aggregate the categories of the EU report into 5 categories:

  1. "Catholic"
  2. "Muslim"
  3. "Orthodox"
  4. "Protestant"
  5. "Other"

Be careful not to mix up categories during that step!

In [5]:
data_raw = [48, 2, 8, (16 + 7 + 4 + 3), 12]
religions = ["Catholic", "Muslim", "Orthodox", "Other", "Protestant"]

data = pd.Series(data_raw, index = religions, name = "relative_frequency") / 100
data.to_frame()
Out[5]:
relative_frequency
Catholic 0.48
Muslim 0.02
Orthodox 0.08
Other 0.30
Protestant 0.12

Now, we take a random sample from the students data set. The sample size is $n = 256$. Afterwards, we count the number of students in each religion category using the groupby() method. Recall that this quantity corresponds to the observed frequencies.

In [6]:
n = 256

sample = students.sample(n, random_state = 8).groupby(["religion"])
sample.size().to_frame("Observed Frequencies")
Out[6]:
Observed Frequencies
religion
Catholic 80
Muslim 12
Orthodox 21
Other 104
Protestant 39

Let's combine both pieces of information into a nice-looking table by converting it to a pandas dataframe object:

In [7]:
df = pd.DataFrame({'relative frequencies' : data,
                   'observed frequencies' : sample.size()})
df
Out[7]:
relative frequencies observed frequencies
Catholic 0.48 80
Muslim 0.02 12
Orthodox 0.08 21
Other 0.30 104
Protestant 0.12 39

In the next step we calculate the expected frequencies and add the information as a separate column to our existing dataframe df. Recall the equation:

$$E = n \times p$$
In [8]:
df["expected frequencies"] = df["relative frequencies"] * 256
df
Out[8]:
relative frequencies observed frequencies expected frequencies
Catholic 0.48 80 122.88
Muslim 0.02 12 5.12
Orthodox 0.08 21 20.48
Other 0.30 104 76.80
Protestant 0.12 39 30.72

Once we know the expected frequencies, we must check for two assumptions.

  1. We must ensure that all expected frequencies are one or greater.
  2. At most, 20 % of the expected frequencies should be less than 5.

We may confirm that both assumptions are fulfilled by looking at the table.
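As before, the two checks can also be verified programmatically on the dataframe column (a small sketch):

# assumption 1: all expected frequencies are one or greater
print((df["expected frequencies"] >= 1).all())         # True

# assumption 2: at most 20 % of the expected frequencies are less than 5
print((df["expected frequencies"] < 5).mean() <= 0.2)  # True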

Perfect, now we are done with the preparation! The data set can be analyzed with the $\chi^{2}$ goodness-of-fit test. Recall the question we are interested in: Does the distribution of religious affiliation among students match the distribution of religions among the population of the European Union?

Hypothesis Testing¶

In order to conduct the $\chi^{2}$ goodness-of-fit test, we follow the same step-wise, generalized test scheme as for other hypothesis tests:


  1. State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$.
  2. Decide on the significance level, $\alpha$.
  3. Compute the value of the test statistic.
  4. Determine the p-value.
  5. If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$.
  6. Interpret the result of the hypothesis test.

Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$

The null hypothesis states that the religion is equally distributed among students compared to the distribution of the religion among the population of the European Union:

$$H_{0}: \quad \text {The variable has the specified distribution}$$

Alternative hypothesis:

$$H_{A}: \quad \text {The variable does not have the specified distribution}$$

Step 2: Decide on the significance level, $\alpha$

$$\alpha = 0.01$$
In [9]:
alpha = 0.01

Step 3 and 4: Compute the value of the test statistic and the p-value

For illustration purposes, we first compute the test statistic manually in Python. Recall the equation for the test statistic from above:

$$\chi^{2} = \sum \frac {(O - E)^{2}} {E}$$
In [10]:
O_E = (df["observed frequencies"] - df["expected frequencies"]) ** 2
chi_squared = np.sum(O_E / df["expected frequencies"])
chi_squared
Out[10]:
36.086588541666664

The numerical value of the test statistic is $\approx 36.0866$.

In order to calculate the p-value, we apply the chi2.cdf function from the stats module of the scipy package: the p-value is the probability, under the $\chi^{2}$ distribution, of observing a test statistic at least as extreme as the one computed above, i.e. $1 - \text{cdf}(\chi^{2}_{emp})$. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:

$$df = (c - 1)$$
In [11]:
from scipy.stats import chi2

p = 1 - chi2.cdf(chi_squared, df = df.shape[0] - 1)
p
Out[11]:
2.7774032629324097e-07

$p = 2.77740326 \times 10^{-7}$.
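A side note: for very small p-values the expression 1 - chi2.cdf(...) can lose precision to floating-point cancellation. The survival function chi2.sf from scipy.stats computes the same upper-tail probability directly:

# identical to 1 - chi2.cdf(...), but numerically more robust in the far tail
p = chi2.sf(chi_squared, df = df.shape[0] - 1)
print(p)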


Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$

In [12]:
# reject H0?
p < alpha
Out[12]:
True

The p-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test

At the 1 % significance level, the data provide very strong evidence to conclude that the religion distribution among students differs from the religion distribution of the population of the European Union.

Hypothesis testing in Python with scipy¶

We manually completed a $\chi^{2}$ goodness-of-fit test in Python. Very cool, but now we redo that example and harness the power of Python's package universe, namely the scipy package, to obtain the same result as above in just one line of code!

In order to conduct a $\chi^{2}$ goodness-of-fit test in Python with the stats module from the scipy package, we apply the chisquare() function. We only have to provide the observed and the expected frequencies, which we pass as column selections from our dataframe. Additional information regarding the function's usage can be found in the scipy documentation.

In [13]:
from scipy import stats

test_result = stats.chisquare(df["observed frequencies"], df["expected frequencies"])

test_result
Out[13]:
Power_divergenceResult(statistic=36.086588541666664, pvalue=2.777403262517103e-07)

The chisquare() function returns an object, which provides the test statistic as well as the corresponding p-value of the test result. Those values can be retrieved via the following attributes:

  • <object>.statistic holds the test statistic, i.e. the empirical test value.
  • <object>.pvalue represents the p-value of the performed significance test.

Consequently, the test statistic ($\chi^{2}_{emp}$) is retrieved via:

In [14]:
test_result.statistic
Out[14]:
36.086588541666664

The p-value is retrieved via:

In [15]:
test_result.pvalue
Out[15]:
2.777403262517103e-07

Lastly, we want to provide a nicely formatted output of the test results:

In [16]:
print("Teststatistic = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 7)))
Test statistic = 36.08659
p-value = 3e-07

These values match the manually calculated $\chi^{2}_{emp}$ value and p-value perfectly. Again, at the 1 % significance level, the data provide very strong evidence to conclude that the religion distribution among students differs from the religion distribution of the population of the European Union.

Exercise: With his famous pea plant experiments, Augustinian monk Gregor Mendel discovered the inheritance law of recessive and dominant traits in genes. His results show a 1:3 ratio of green to yellow peas from cross-bred seeds. Assume we repeated his experiment and got 123 green and 355 yellow pea plants. Does our observation confirm Mendel's inheritance law? Perform a test at the 5 % significance level!


In [17]:
observed = [123, 355]
expected = [np.sum(observed) * 0.25, np.sum(observed) * 0.75 ]

test_result = stats.chisquare(observed, expected)

print("Chi_squared = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 5)))

print("Because the p value ({}) is less than alpha (0.05) we do not have any evidence to reject H0.".
      format(round(test_result.pvalue, 3)))
Chi_squared = 0.13668
p-value = 0.7116
Because the p value (0.712) is greater than alpha (0.05), we cannot reject H0.

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.