The **$\chi^{2}$ goodness-of-fit test** is applied to perform hypothesis tests on the distribution of a qualitative (categorical) variable or a discrete quantitative variable that has only finitely many possible values.

The basic logic of the $\chi^{2}$ goodness-of-fit test is to compare a sample's **observed frequencies** with the **expected frequencies**, i.e. the frequencies we would expect if the variable followed the hypothesized distribution.

Consider a simple example:

On September 22, 2013, the German Federal Election 2013 was held. More than 44 million people turned up to vote. 41.5 % of German voters decided to vote for the *Christian Democratic Union (CDU)* and 25.7 % for the *Social Democratic Party (SPD)*. For simplicity, we subsume the remaining percentage of votes (32.8 %) as *Others*.

Based on that data, we may build a frequency table:

Party | Percentage | Relative frequency |
---|---|---|
CDU | 41.5 | 0.415 |
SPD | 25.7 | 0.257 |
Others | 32.8 | 0.328 |
$\sum$ | 100 | 1 |

The third column of the table above corresponds to the **relative frequencies** of the German population/voters.
For this exercise, we take a random sample. We asked 123 students of FU Berlin about their party affiliation and recorded their answers. Afterwards, we counted the occurrence of each category (party) in our sample. These quantities are the **observed frequencies**. The actual corresponding counts are:

Party | Observed sample frequencies |
---|---|
CDU | 43 |
SPD | 36 |
Others | 44 |
$\sum$ | 123 |

In the next step we compute the **expected frequency**, denoted $E$, for each category:

$$E = n \times p$$

where $n$ is the sample size and $p$ is the corresponding relative population frequency taken from the election results given in the table above. Applying this information, we expect the following absolute frequencies per party:

$$E_{CDU} = n \times p = 123 \times 0.415 = 51.045$$

$$E_{SPD} = n \times p = 123 \times 0.257 = 31.611$$

$$E_{Others} = n \times p = 123 \times 0.328 = 40.344$$

Note: Although we deal with individual counts, represented by integer values, the expected frequency, $E$, is in general a floating point number. That is fine.
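The expected frequencies can also be computed directly in Python. A minimal sketch, using the sample size and the election-result proportions from the example above:

```
# sample size and relative population frequencies from the election results
n = 123
p = {"CDU": 0.415, "SPD": 0.257, "Others": 0.328}

# expected frequency per party: E = n * p
expected = {party: n * rel_freq for party, rel_freq in p.items()}
print(expected)
```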

Now, we put the **observed frequencies** and the **expected frequencies** together into one table:

Party | Observed sample frequencies | Expected sample frequencies |
---|---|---|
CDU | 43 | 51.045 |
SPD | 36 | 31.611 |
Others | 44 | 40.344 |
$\sum$ | 123 | 123 |

Note that the expected frequencies sum to the sample size $n = 123$, as they must.

Great! Once we have the expected frequencies, we have to check two assumptions:

- All expected frequencies must be one or greater.
- At most 20 % of the expected frequencies may be less than 5.

By looking at the table, we may confirm that both assumptions are fulfilled.
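Such assumption checks can also be scripted. A minimal sketch, using the expected frequencies $E = n \times p$ for this sample:

```
# expected frequencies E = n * p for CDU, SPD and Others
expected = [123 * p for p in (0.415, 0.257, 0.328)]

# assumption 1: all expected frequencies are one or greater
print(all(E >= 1 for E in expected))

# assumption 2: at most 20 % of the expected frequencies are less than 5
print(sum(E < 5 for E in expected) / len(expected) <= 0.2)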

Now we have all the ingredients we need to perform a $\chi^{2}$ goodness-of-fit test, except for the test statistic itself.

The $\chi^{2}$ test statistic for a goodness-of-fit is given by:

$$\chi^{2} = \sum \frac {(O - E)^{2}} {E}$$

where $O$ corresponds to the observed frequencies and $E$ to the expected frequencies. If the null hypothesis is true, the test statistic $\chi^{2}$ approximately follows a *chi-square* distribution.

The number of degrees of freedom is one less than the number of possible values (categories) for the variable under consideration. Hence:

$$df = c - 1$$

where $c$ is the number of categories. Based on the observed and expected frequencies given in the table above, it is fairly straightforward to calculate the $\chi^{2}$-value. However, to make the calculation procedure easier, we put all the necessary computational steps into one table. The **observed sample frequencies** are abbreviated as $O$ and the **expected sample frequencies** as $E$:

Party | $O$ | $E$ | $O - E$ | $(O - E)^{2}$ | $\frac {(O - E)^{2}} {E}$ |
---|---|---|---|---|---|
CDU | 43 | 51.045 | -8.045 | 64.722 | 1.2679 |
SPD | 36 | 31.611 | 4.389 | 19.263 | 0.6094 |
Others | 44 | 40.344 | 3.656 | 13.366 | 0.3313 |
$\sum$ | 123 | 123 | 0 | - | 2.2086 |

Conclusively, the $\chi^{2}$ test statistic for the goodness-of-fit test evaluates to approximately 2.2086 for our sample data.

$$\chi^{2} = \sum \frac {(O - E)^{2}} {E} \approx 2.2086$$

If the null hypothesis is true, the observed and expected frequencies are roughly equal. This results in a small value of the $\chi^{2}$ test statistic, thus supporting $H_{0}$. If, however, the value of the $\chi^{2}$ test statistic is large, the data provide evidence against $H_{0}$.

In our case, we may compare the empirical $\chi^{2}$ test statistic with the corresponding critical $\chi^{2}$ value at a significance level of 5 %, with $df = 3 - 1 = 2$ degrees of freedom. To derive the critical value with Python, we apply the `chi2.ppf` function from the `stats` module of the `scipy` package:

Note: Make sure the `scipy` package is part of your `mamba` environment!

In [1]:

```
from scipy.stats import chi2
chi2.ppf(0.95, df = 2)
```

Out[1]:

5.991464547107979
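Since the empirical test statistic for our election sample stays below this critical value, we do not reject $H_{0}$: the party preferences in the sample are consistent with the election results. A minimal end-to-end check (counts and proportions taken from the example above):

```
import numpy as np
from scipy.stats import chi2

O = np.array([43, 36, 44])                  # observed frequencies
E = 123 * np.array([0.415, 0.257, 0.328])   # expected frequencies E = n * p
chi_squared = np.sum((O - E) ** 2 / E)      # empirical test statistic

# do not reject H0 if the statistic does not exceed the critical value
print(chi_squared <= chi2.ppf(0.95, df=2))
```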

In order to get some hands-on experience, we apply the **$\chi^{2}$ goodness-of-fit test** in an exercise. For this, we load the `students` *data set*. You may download the `students.csv` file here and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `DataFrame` object by using the `read_csv` function:

Note: Make sure the `numpy` and `pandas` packages are part of your `mamba` environment!

In [2]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary

Recall that $\chi^{2}$ goodness-of-fit tests are applied to qualitative (categorical) or discrete quantitative variables. There are several categorical variables in the *students* data set, such as `gender`, `religion`, `major`, `minor` and `graduated`.

In order to showcase the **$\chi^{2}$ goodness-of-fit test**, we examine if religions are equally distributed among students compared to the distribution of religions among the population of the European Union. The data on the continental scale is provided in the report "Discrimination in the EU in 2012" (European Union: European Commission, Special Eurobarometer, 393, p. 233).

The report provides data for eight categories of how people ascribed themselves:

- 48 % as Catholic
- 16 % as Non believers/Agnostic
- 12 % as Protestant
- 8 % as Orthodox
- 7 % as Atheist
- 4 % as Other Christian
- 3 % as Other religion/None stated
- 2 % as Muslim.

We plot the data in the form of a pie chart for a better understanding:

Note: Make sure the `matplotlib` and `seaborn` packages are part of your `mamba` environment!

In [3]:

```
import seaborn as sns

data = [48, 16, 12, 8, 7, 4, 3, 2]
religions = ["Catholic", "Non believer/\nAgnostic", "Protestant",
             "Orthodox", "Atheist", "Other Christian",
             "Other religion/None stated", "Muslim"]
data = pd.Series(data, index=religions)
data.plot.pie(colors=sns.color_palette("Set3", 8))
```

Out[3]:

<Axes: >

We start with data exploration and data preparation.

First, we want to know which categories are available in the *students* data set for the column `religion`. Therefore, we apply the `unique()` method, which provides access to the levels (categories) of a variable:

In [4]:

```
print(students["religion"].unique())
```

['Muslim' 'Other' 'Protestant' 'Catholic' 'Orthodox']

Obviously, in the `students` *data set* there are 5 different categories, compared to the 8 categories provided by the EU report. Thus, in order to make comparisons, we summarize the categories of the EU report into the 5 categories `"Catholic"`, `"Muslim"`, `"Orthodox"`, `"Protestant"` and `"Other"`. Be careful not to mix up categories during that step!

In [5]:

```
data_raw = [48, 2, 8, (16 + 7 + 4 + 3), 12]
religions = ["Catholic", "Muslim", "Orthodox", "Other", "Protestant"]
data = pd.Series(data_raw, index = religions, name = "relative_frequency") / 100
data.to_frame()
```

Out[5]:

| relative_frequency |
---|---|
Catholic | 0.48 |
Muslim | 0.02 |
Orthodox | 0.08 |
Other | 0.30 |
Protestant | 0.12 |

Next, we draw a random sample of size $n = 256$ from the `students` *data set*. Afterwards, we count the number of students in each particular `religion` category using the `groupby()` method.
Recall that this quantity corresponds to the **observed frequencies**.

In [6]:

```
n = 256
sample = students.sample(n, random_state=8).groupby(["religion"])
sample.size().to_frame("Observed Frequencies")
```

Out[6]:

religion | Observed Frequencies |
---|---|
Catholic | 80 |
Muslim | 12 |
Orthodox | 21 |
Other | 104 |
Protestant | 39 |

Then, we merge the relative population frequencies and the observed sample frequencies into one `pandas` `DataFrame` object:

In [7]:

```
df = pd.DataFrame({'relative frequencies' : data,
'observed frequencies' : sample.size()})
df
```

Out[7]:

relative frequencies | observed frequencies | |
---|---|---|

Catholic | 0.48 | 80 |

Muslim | 0.02 | 12 |

Orthodox | 0.08 | 21 |

Other | 0.30 | 104 |

Protestant | 0.12 | 39 |

In the next step we calculate the **expected frequencies** and add the information as a separate column to our existing `dataframe` `df`. Recall the equation:

$$E = n \times p$$

In [8]:

```
df["expected frequencies"] = df["relative frequencies"] * 256
df
```

Out[8]:

relative frequencies | observed frequencies | expected frequencies | |
---|---|---|---|

Catholic | 0.48 | 80 | 122.88 |

Muslim | 0.02 | 12 | 5.12 |

Orthodox | 0.08 | 21 | 20.48 |

Other | 0.30 | 104 | 76.80 |

Protestant | 0.12 | 39 | 30.72 |

Once we know the expected frequencies, we must check for two assumptions.

- We must ensure that all expected frequencies are one or greater.
- At most, 20 % of the expected frequencies should be less than 5.

We may confirm that both assumptions are fulfilled by looking at the table.
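Again, these checks can be scripted. A sketch, using the expected frequencies from the table above:

```
import pandas as pd

# expected frequencies E = n * p for the five religion categories
E = pd.Series([122.88, 5.12, 20.48, 76.80, 30.72],
              index=["Catholic", "Muslim", "Orthodox", "Other", "Protestant"])

print((E >= 1).all())         # all expected frequencies are >= 1
print((E < 5).mean() <= 0.2)  # at most 20 % are below 5
```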

Perfect, now we are done with the preparation! The data set can be analyzed with the $\chi^{2}$ goodness-of-fit test. Recall the question we are interested in: **Is religion equally distributed among students compared to the distribution of religion among the population of the European Union?**

In order to conduct the **$\chi^{2}$ goodness-of-fit test**, we follow the step-wise, generalized test scheme for hypothesis tests:

- State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$.
- Decide on the significance level, $\alpha$.
- Compute the value of the test statistic.
- Determine the *p*-value.
- If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$.
- Interpret the result of the hypothesis test.
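The computational steps of this scheme can be sketched as a small helper function (a sketch; the name `gof_test` is ours, and it covers steps 3 to 5 only):

```
import numpy as np
from scipy.stats import chi2

def gof_test(observed, expected, alpha):
    """Steps 3-5 of the scheme: statistic, p-value, decision on H0."""
    O, E = np.asarray(observed, float), np.asarray(expected, float)
    statistic = np.sum((O - E) ** 2 / E)        # step 3: chi-square statistic
    pvalue = chi2.sf(statistic, df=len(O) - 1)  # step 4: p-value, df = c - 1
    return statistic, pvalue, pvalue <= alpha   # step 5: reject H0?

# observed and expected religion frequencies from the tables above
stat, p, reject = gof_test([80, 12, 21, 104, 39],
                           [122.88, 5.12, 20.48, 76.80, 30.72], alpha=0.01)
print(stat, p, reject)
```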

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that the religion is equally distributed among students compared to the distribution of the religion among the population of the European Union:

$$H_{0}: \quad \text {The variable has the specified distribution}$$

Alternative hypothesis:

$$H_{A}: \quad \text {The variable does not have the specified distribution}$$

**Step 2: Decide on the significance level, $\alpha$**

In [9]:

```
alpha = 0.01
```

**Step 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes, we will first compute the test statistic manually with Python. Recall the equation for the test statistic from above:

$$\chi^{2} = \sum \frac {(O - E)^{2}} {E}$$

In [10]:

```
O_E = (df["observed frequencies"] - df["expected frequencies"]) ** 2
chi_squared = np.sum(O_E / df["expected frequencies"])
chi_squared
```

Out[10]:

36.086588541666664

The numerical value of the test statistic is $\approx 36.0866$.

In order to calculate the *p*-value, we apply the `chi2.cdf` function from the `stats` module of the `scipy` package to calculate the probability of occurrence for the test statistic based on the *$\chi^{2}$ distribution*. To do so, we also need the *degrees of freedom*. Recall how to calculate the degrees of freedom:

$$df = c - 1 = 5 - 1 = 4$$

In [11]:

```
from scipy.stats import chi2
p = 1 - chi2.cdf(chi_squared, df = df.shape[0] - 1)
p
```

Out[11]:

2.7774032629324097e-07

$p = 2.77740326 \times 10^{-7}$.
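As a side note, `chi2.sf` (the survival function, $1 - F(x)$) yields the same result and is numerically more robust for very small *p*-values than subtracting the CDF from one:

```
from scipy.stats import chi2

# survival function: P(X > chi_squared) for df = 4
p = chi2.sf(36.086588541666664, df=4)
print(p)
```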

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [12]:

```
# reject H0?
p < alpha
```

Out[12]:

True

The *p*-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 1 % significance level, the data provide very strong evidence to conclude that the religion distribution among students differs from the religion distribution of the population of the European Union.

**The $\chi^{2}$ goodness-of-fit test with `scipy`**

We manually completed a $\chi^{2}$ goodness-of-fit test in Python. Very cool, but now we redo that example and harness the power of Python's package universe, namely the `scipy` package, to obtain the same result as above in just one line of code!

In order to conduct a $\chi^{2}$ goodness-of-fit test in Python with the `stats` module from the `scipy` package, we apply the `chisquare()` function. We only have to provide the **observed** and the **expected frequencies**, both of which are stored in our `dataframe` `df`. Additional information regarding the function's usage can be obtained directly from the `scipy` documentation.

In [13]:

```
from scipy import stats
test_result = stats.chisquare(df["observed frequencies"], df["expected frequencies"])
test_result
```

Out[13]:

Power_divergenceResult(statistic=36.086588541666664, pvalue=2.777403262517103e-07)

The `chisquare()` function returns an object, which provides the **test statistic** as well as the corresponding ***p*-value** of the test result. Those values can be retrieved via the following attributes:

- `<object>.statistic` holds the empirical value of the test statistic.
- `<object>.pvalue` holds the *p*-value of the performed significance test.

Consequently, the test statistic ($\chi^{2}_{emp}$) is retrieved via:

In [14]:

```
test_result.statistic
```

Out[14]:

36.086588541666664

The ***p*-value** is retrieved via:

In [15]:

```
test_result.pvalue
```

Out[15]:

2.777403262517103e-07

Lastly, we want to provide a nicely printed output of the test results:

In [16]:

```
print("Teststatistic = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 7)))
```

Teststatistic = 36.08659
p-value = 3e-07

Comparing the manually computed values for the test statistic and the *p*-value with the output of `chisquare()`, they match perfectly. Again, at the 1 % significance level, the data provide very strong evidence to conclude that the religion distribution among students differs from the religion distribution of the population of the European Union.

Exercise: With his famous pea plant experiments, Augustinian monk Gregor Mendel discovered the inheritance law of recessive and dominant traits in genes. His results show a 1:3 ratio of green to yellow peas from cross-bred seeds. Assume we repeated his experiment and got 123 green and 355 yellow pea plants. Does our observation confirm Mendel's inheritance law? Perform a test at the 5 % significance level!

In [17]:

```
observed = [123, 355]
expected = [np.sum(observed) * 0.25, np.sum(observed) * 0.75]
test_result = stats.chisquare(observed, expected)
print("Chi_squared = {}".format(round(test_result.statistic, 5)))
print("p-value = {}".format(round(test_result.pvalue, 5)))
print("Because the p-value ({}) is greater than alpha (0.05), we have no evidence to reject H0.".
      format(round(test_result.pvalue, 3)))
```

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*