The **\(\chi^2\)
goodness-of-fit test** is used to perform hypothesis tests about
the distribution of a qualitative (categorical) variable or of a discrete
quantitative variable that has only finitely many possible values.

The basic logic of the \(\chi^2\)
goodness-of-fit test is to compare the **observed frequencies** of a
sample with the **expected frequencies** under a hypothesized
distribution.

Consider a simple example:

On September 22, 2013, the German federal election was held. More than 44
million people turned out to vote. 41.5 % of German voters decided to
vote for the *Christian Democratic Union (CDU)* and 25.7 % for
the *Social Democratic Party (SPD)*. For the sake of simplicity
we subsume the remaining 32.8 % of the votes under
*Others*.

Based on that data, we may build a frequency table:

\[ \begin{array}{|l|c|c|} \hline \ \text{Party} & \text{Percentage} & \text{Relative frequency}\\ \hline \ \text{CDU} & 41.5 & 0.415 \\ \ \text{SPD} & 25.7 & 0.257 \\ \ \text{Others} & 32.8 & 0.328 \\ \hline \ & 100 & 1 \\ \hline \end{array} \]

The third column of the table above corresponds to the
**relative frequencies** of the German population/voters.
For this exercise we take a random sample. We ask 123 students of FU
Berlin about their party affiliation and record the following
answers:

```
## [1] "CDU" "SPD" "Others" "SPD" "SPD" "CDU" "Others" "SPD"
## [9] "Others" "Others" "SPD" "Others" "Others" "Others" "CDU" "SPD"
## [17] "CDU" "CDU" "CDU" "SPD" "SPD" "Others" "Others" "SPD"
## [25] "Others" "SPD" "Others" "Others" "CDU" "CDU" "SPD" "SPD"
## [33] "Others" "SPD" "CDU" "Others" "SPD" "CDU" "CDU" "CDU"
## [41] "CDU" "Others" "Others" "CDU" "CDU" "CDU" "CDU" "Others"
## [49] "CDU" "SPD" "CDU" "Others" "SPD" "CDU" "Others" "CDU"
## [57] "CDU" "SPD" "SPD" "Others" "Others" "CDU" "Others" "CDU"
## [65] "SPD" "Others" "SPD" "SPD" "SPD" "Others" "SPD" "Others"
## [73] "SPD" "CDU" "Others" "CDU" "Others" "Others" "CDU" "CDU"
## [81] "CDU" "Others" "Others" "SPD" "CDU" "Others" "SPD" "SPD"
## [89] "SPD" "CDU" "CDU" "Others" "CDU" "Others" "CDU" "CDU"
## [97] "SPD" "CDU" "Others" "Others" "Others" "CDU" "Others" "SPD"
## [105] "Others" "SPD" "SPD" "Others" "Others" "CDU" "SPD" "CDU"
## [113] "CDU" "SPD" "SPD" "CDU" "Others" "SPD" "Others" "Others"
## [121] "Others" "CDU" "CDU"
```

In the next step we count the occurrence of each category (party) in
our sample. These quantities are the **observed
frequencies**:

```
## sample_FUB
## CDU Others SPD
## 43 44 36
```
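The counts above can be reproduced with the `table()` function. In the sketch below we rebuild a vector with the same composition as the survey sample (the order of answers differs from the listing above, which does not affect the counts):

```r
# rebuild a vector with the same party counts as the survey sample
sample_FUB <- rep(c("CDU", "Others", "SPD"), times = c(43, 44, 36))
# count the occurrences of each category (party)
table(sample_FUB)
```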

In the next step we compute the **expected frequency**,
denoted \(E\), for each category:

\[E = n \times p\text{,}\]

where \(n\) is the sample size and \(p\) is the corresponding relative frequency taken from the table above.

\[E_{CDU} = n\times p = 123 \times 0.415 = 51.045\]

\[E_{SPD} = n\times p = 123 \times 0.257 = 31.611\]

\[E_{Others} = n\times p = 123 \times 0.328 = 40.344\]

Note: Although we deal with individual counts, represented by integer values, the expected frequency, \(E\), is a floating point number. That is fine.
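In R the three expected frequencies can be computed in one vectorized step, taking the relative frequencies from the table above:

```r
n <- 123                                           # sample size
p <- c(CDU = 0.415, SPD = 0.257, Others = 0.328)   # relative frequencies
E <- n * p                                         # expected frequencies
E   # CDU 51.045, SPD 31.611, Others 40.344
```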

Now, we put the **observed frequencies** and the
**expected frequencies** together into one table:

\[ \begin{array}{|l|c|c|} \hline \ \text{Party} & \text{Observed frequency} & \text{Expected frequency}\\ \hline \ \text{CDU} & 43 & 51.045 \\ \ \text{SPD} & 36 & 31.611 \\ \ \text{Others} & 44 & 40.344 \\ \hline \ & 123 & 123 \\ \hline \end{array} \]

Great! Once we have the expected frequencies, we have to check two assumptions: first, all expected frequencies must be 1 or greater, and second, at most 20 % of the expected frequencies may be less than 5. By looking at the table we may confirm that both assumptions are fulfilled.
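Both assumptions can also be checked programmatically; a minimal sketch using the expected frequencies computed above:

```r
E <- c(51.045, 31.611, 40.344)  # expected frequencies
all(E >= 1)                     # assumption 1: all expected frequencies are 1 or greater
mean(E < 5) <= 0.2              # assumption 2: at most 20 % of them are less than 5
```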

Now we have all the ingredients we need to perform a \(\chi^2\) goodness-of-fit test, except the test statistic itself.

The \(\chi^2\) test statistic for a goodness-of-fit is given by

\[\chi^2=\sum
\frac{(O-E)^2}{E}\text{,}\] where \(O\) corresponds to the observed frequencies
and \(E\) to the expected frequencies.
The test statistic \(\chi^2\)
approximately follows a *chi-square* distribution if the null hypothesis
is true. The number of degrees of freedom is 1 less than the number of
possible values (categories), \(c\), of the variable under consideration:

\[df = c-1\]

Based on the observed and expected frequencies given in the table above it is fairly straightforward to calculate the \(\chi^2\)-value. However, to make the calculation procedure easier to follow, we put together all the necessary computational steps into one table:

\[ \begin{array}{|l|c|c|c|c|c|} \hline \ \text{Party} & \text{Observed} & \text{Expected} & \text{Difference} & \text{Square of difference} & \chi^2\text{ subtotal}\\ \ & \text{frequency} & \text{frequency} & O-E & (O-E)^2 & (O-E)^2/E\\ \hline \ \text{CDU} & 43 & 51.045 & -8.045 & 64.722025 & 1.2679405\\ \ \text{SPD} & 36 & 31.611 & 4.389 & 19.263321 & 0.6093866\\ \ \text{Others} & 44 & 40.344 & 3.656 & 13.366336 & 0.3313091\\ \hline \ & 123 & 123 & \approx 0 & & 2.2086363\\ \hline \end{array} \]

In our example the \(\chi^2\) test statistic for a goodness-of-fit evaluates to

\[\chi^2=\sum \frac{(O-E)^2}{E} \approx 2.209\]
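The same computation in R, using the observed and expected frequencies from the table above:

```r
O <- c(43, 36, 44)               # observed frequencies (CDU, SPD, Others)
E <- c(51.045, 31.611, 40.344)   # expected frequencies
chi2 <- sum((O - E)^2 / E)       # chi-square test statistic
round(chi2, 3)                   # 2.209
```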

If the null hypothesis is true, the observed and expected frequencies are roughly equal. This results in a small value of the \(\chi^2\) test statistic, thus, supporting \(H_0\). If, however, the value of the \(\chi^2\) test statistic is large, the data provides evidence against \(H_0\).

In our case, we may compare the empirical \(\chi^2\) test statistic with the corresponding critical \(\chi^2\) value at the 5 % significance level (the 0.95 quantile of the \(\chi^2\) distribution), with degrees of freedom equal to the 3 categories minus 1:

`qchisq(0.95, df = 3 - 1)`

`## [1] 5.991465`

Since our empirical \(\chi^2\) value is smaller than the critical \(\chi^2\) value, we cannot reject the null hypothesis!
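Equivalently, we may compute the *p*-value directly with `pchisq()`; for \(\chi^2 \approx 2.209\) and 2 degrees of freedom it lies well above 0.05:

```r
# p-value: upper-tail probability of the chi-square distribution
p <- pchisq(2.209, df = 3 - 1, lower.tail = FALSE)
p   # approximately 0.33
```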

In order to get some hands-on experience we apply the **\(\chi^2\) goodness-of-fit test** in
an exercise. For this we load the *students* data set.

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explanatory variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.

Recall, \(\chi^2\) goodness-of-fit
tests are applied to qualitative (categorical) variables or discrete
quantitative variables. There are several categorical variables in the
*students* data set, such as `gender`, `religion`, `major`, `minor`
and `graduated`, among others.

In order to showcase the **\(\chi^2\) goodness-of-fit test** we
examine whether religion is distributed among students in the same way as
among the population of the European
Union. The data at the continental scale is provided in the report
“Discrimination in the EU in 2012” (European Union: European Commission, Special
Eurobarometer, 393, p. 233). The report provides data for 8
categories: 48 % of the people are described as Catholic, 16 % as Non
believer/Agnostic, 12 % as Protestant, 8 % as Orthodox, 7 % as Atheist, 4
% as Other Christian, 3 % as Other religion/None stated and 2 % as
Muslim. We plot the data in the form of a pie chart for a better
overview:

```
data <- c(48, 16, 12, 8, 7, 4, 3, 2)
data_labels <- c(
  "Catholic", "Non believer/\nAgnostic", "Protestant",
  "Orthodox", "Atheist", "Other Christian",
  "Other religion/None stated", "Muslim"
)
par(mar = c(3, 2, 3, 2))
library(RColorBrewer)
cols <- brewer.pal(length(data), "Set3")
pie(
  x = data,
  labels = data_labels,
  col = cols,
  radius = 1
)
```

We start with data exploration and data preparation.

First, we want to know which categories are available in the
*students* data set. Therefore, we apply the
`unique()` function, which returns the distinct values
(categories) of a variable.

`unique(students$religion)`

`## [1] "Muslim" "Other" "Protestant" "Catholic" "Orthodox"`

Obviously, in the *students* data set there are 5 different
categories, compared to the 8 categories provided by the report of the EU.
Thus, in order to make comparisons, we summarize the categories of the EU
report into 5 categories: “Catholic”, “Muslim”, “Orthodox”, “Protestant”
and “Other”. Be careful not to mix up categories during that step!

```
# set category names
data_labels <- c("Catholic", "Muslim", "Orthodox", "Other", "Protestant")
# recode European data according to category names
data_raw <- c(48, 2, 8, sum(16, 7, 4, 3), 12)
# generate a data.frame object
data <- data.frame(data_raw / 100)
row.names(data) <- data_labels
colnames(data) <- "relative_frequency"
data
```

```
## relative_frequency
## Catholic 0.48
## Muslim 0.02
## Orthodox 0.08
## Other 0.30
## Protestant 0.12
```

Now, we take a random sample. We randomly pick 256 students and count
the number of students in each particular category of the
`religion` variable using the `table()` function. Recall
that this quantity corresponds to the **observed
frequencies**. (Since `sample()` draws at random, your counts will differ
unless you set a seed.)

```
n <- 256
students_sample <- sample(students$religion, n)
O_frequencies <- table(students_sample)
O_frequencies
```

```
## students_sample
## Catholic Muslim Orthodox Other Protestant
## 83 15 16 82 60
```

With one line of code we insert the observed frequencies into
`data`, the `data.frame` we constructed above.

```
data$observed_frequencies <- O_frequencies
data
```

```
## relative_frequency observed_frequencies
## Catholic 0.48 83
## Muslim 0.02 15
## Orthodox 0.08 16
## Other 0.30 82
## Protestant 0.12 60
```

In the next step we calculate the **expected
frequencies**. Recall the equation:

\[E = n \times p\]

We insert the expected frequencies as a new column in
`data`.

```
n <- 256
data$expected_frequencies <- n * data$relative_frequency
data
```

```
## relative_frequency observed_frequencies expected_frequencies
## Catholic 0.48 83 122.88
## Muslim 0.02 15 5.12
## Orthodox 0.08 16 20.48
## Other 0.30 82 76.80
## Protestant 0.12 60 30.72
```

Once we know the expected frequencies, we have to check for two assumptions. First, we have to make sure that all expected frequencies are 1 or greater. Second, at most 20 % of the expected frequencies should be less than 5. By looking at the table we may confirm that both assumptions are fulfilled.
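Again, these checks can be carried out in code; a sketch using the expected frequencies from the table above:

```r
E <- c(122.88, 5.12, 20.48, 76.80, 30.72)  # expected frequencies
all(E >= 1)          # TRUE: all expected frequencies are 1 or greater
mean(E < 5) <= 0.2   # TRUE: none of them are less than 5
```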

Perfect, now we are done with the preparation! The data set is ready
to be analyzed with the \(\chi^2\)
goodness-of-fit test. Recall the question we are interested in:
**Is religion distributed among students in the same way as
among the population of the European
Union?**

In order to conduct the **\(\chi^2\) goodness-of-fit test** we
follow the step-wise implementation procedure for hypothesis testing.
The \(\chi^2\) goodness-of-fit test
follows the same step-wise procedure as hypothesis tests for the
population mean:

\[ \begin{array}{|l|l|} \hline \ \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \ \text{Step 4} &\text{Determine the p-value.} \\ \ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that the religion is equally distributed among students compared to the distribution of the religion among the population of the European Union:

\[H_0: \quad \text{The variable has the specified distribution}\]

Alternative hypothesis:

\[H_A: \quad \text{The variable does not have the specified distribution} \]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.01\]

`alpha <- 0.01`

**Step 3 and 4: Compute the value of the test statistic and the
p-value**

For illustration purposes we manually compute the test statistic in R. Recall the equation for the test statistic from above:

\[\chi^2=\sum \frac{(O-E)^2}{E}\]

```
# compute the value of the test statistic
x2 <- sum((data$observed_frequencies - data$expected_frequencies)^2 / data$expected_frequencies)
x2
```

`## [1] 61.24772`

The numerical value of the test statistic is \(\approx 61.25\).

In order to calculate the *p*-value we apply the
`pchisq()` function. Recall how to calculate the degrees of
freedom:

\[df = c - 1\]

```
# compute df
df <- nrow(data) - 1
# compute the p-value
p <- pchisq(q = x2, df = df, lower.tail = FALSE)
p
```

`## [1] 1.585774e-12`

\(p = 1.5857736\times 10^{-12}\).

**Step 5: If \(p \le \alpha\),
reject \(H_0\); otherwise, do not
reject \(H_0\)**

`p <= alpha`

`## [1] TRUE`

The *p*-value is smaller than the specified significance level
of 0.01; we reject \(H_0\). The test
results are statistically significant at the 1 % level and provide very
strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis
test**

At the 1 % significance level the data provides very strong evidence to conclude that the religion distribution among students differs from the religion distribution among the population of the European Union.

We just manually completed a \(\chi^2\) goodness-of-fit test in R. Very cool, but now we redo that example and make use of the R machinery to obtain the same result as above in just one line of code!

In order to conduct a \(\chi^2\)
goodness-of-fit test in R we apply the `chisq.test()`
function. We provide the observed frequencies,
`data$observed_frequencies`, as input data and the hypothesized
probabilities, `data$relative_frequency`, via the argument `p`.

`chisq.test(data$observed_frequencies, p = data$relative_frequency)`

```
##
## Chi-squared test for given probabilities
##
## data: data$observed_frequencies
## X-squared = 61.248, df = 4, p-value = 1.586e-12
```

Worked out fine! Compare the output of the `chisq.test()`
function with our result above. Again, we may conclude that at the 1 %
significance level the data provides very strong evidence to conclude
that the religion distribution among students differs from the religion
distribution among the population of the European Union.

Exercise: With his famous pea plant experiments the Augustinian monk Gregor Mendel discovered the inheritance law of recessive and dominant traits in genes. His results show a 1:3 ratio of green to yellow peas from cross-bred seeds. Assume we repeated his experiment and got 123 green and 355 yellow pea plants. Does our observation confirm Mendel’s inheritance law? Perform a test at the 5 % significance level!

`### your code here`

`chisq.test(c(123, 355), p = c(0.25, 0.75))`

```
##
## Chi-squared test for given probabilities
##
## data: c(123, 355)
## X-squared = 0.13668, df = 1, p-value = 0.7116
```

Since the *p*-value (0.7116) is larger than \(\alpha = 0.05\), we cannot reject the null hypothesis: our observation is consistent with Mendel’s 1:3 ratio.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*