The **contingency coefficient**, $C$, is a $\chi^2$-based measure of association for categorical data.
It relies on the **$\chi^2$ test for independence**.
The $\chi^2$ statistic allows us to assess whether there is a statistical relationship between the variables of a **contingency table** (also known as a two-way table, cross-tabulation table or crosstab).
In this kind of table the distribution of the variables is shown in matrix format.

In order to calculate the **contingency coefficient** $C$, we first have to determine the $\chi^2$ statistic.

The $\chi^2$ statistic is given by

$$\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,}$$

where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac{(O-E)^2}{E}$ is evaluated for each cell of the contingency table and then summed.

To explain the calculation of the $\chi^2$ statistic in more depth, we work through an example based on categorical observation data. Consider an exam at the end of the semester. There are three groups of students: students have either passed, not passed or not participated in the exam. Further, there were exercises for the students to work on throughout the semester. We categorize the number of exercises each particular student completed into four groups: none, less than half $(<0.5)$, at least half $(\ge 0.5)$ and all of them.

The resulting contingency table looks like this:

In [2]:

```
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tabulate as tab
```

$$
\begin{array}{l|cccc}
\hline
\ & \text{None} & <0.5 & >0.5 & \text{all} \\
\hline
\ \text{passed} & 12 & 13 & 24 & 14 \\
\ \text{not passed} & 22 & 11 & 8 & 6 \\
\ \text{not participated} & 11 & 14 & 6 & 7 \\
\hline
\end{array}
$$

First, let us construct a `DataFrame` object and assign it the name `obs` to remind us that this data corresponds to the **observed frequencies**. Please note that we will have to add an index and column name to get a proper contingency table.

In [3]:

```
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])  # observed frequencies
```

In [4]:

```
obs = pd.DataFrame(
    data,
    columns=["passed", "not passed", "not participated"],
    index=["None", "<0.5", ">0.5", "all"],
)
obs.index.name = "Homework"  # the rows hold the homework categories
obs.columns.name = "Exam"    # the columns hold the exam outcomes
obs
```

Out[4]:

| Homework | passed | not passed | not participated |
|---|---|---|---|
| None | 12 | 22 | 11 |
| <0.5 | 13 | 11 | 14 |
| >0.5 | 24 | 8 | 6 |
| all | 14 | 6 | 7 |

Perfect, now we have a proper representation of our data. However, one piece is still missing to complete the contingency table: the row and column sums.

There are several ways to compute the row and column sums in Python. We simply apply the pandas `sum()` method and set the `axis` argument, which selects row-wise (`axis=1`) or column-wise (`axis=0`) summation.

In [5]:

```
# Sum each column:
margin_col = obs.sum(axis=0)
margin_col
```

Out[5]:

```
Exam
passed              63
not passed          47
not participated    38
dtype: int64
```

In [6]:

```
# Sum each row:
margin_row = obs.sum(axis=1)
margin_row
```

Out[6]:

```
Homework
None    45
<0.5    38
>0.5    38
all     27
dtype: int64
```

Putting all the pieces together, the contingency table looks like this:

$$
\begin{array}{l|cccc|c}
\hline
\ & \text{None} & <0.5 & >0.5 & \text{all} & \text{row sum} \\
\hline
\ \text{passed} & 12 & 13 & 24 & 14 & 63 \\
\ \text{not passed} & 22 & 11 & 8 & 6 & 47 \\
\ \text{not participated} & 11 & 14 & 6 & 7 & 38 \\
\hline
\ \text{column sum} & 45 & 38 & 38 & 27 & 148 \\
\end{array}
$$

Great, now we have a table filled with the observed frequencies.
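As an aside, the margins can also be appended directly in pandas. The following sketch (the names `full` and `total` are our own choices, not part of the workflow above) reproduces the complete contingency table:

```python
import numpy as np
import pandas as pd

data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])
obs = pd.DataFrame(
    data,
    columns=["passed", "not passed", "not participated"],
    index=["None", "<0.5", ">0.5", "all"],
)

# Append a margin column and a margin row to obtain the full table
full = obs.copy()
full["total"] = full.sum(axis=1)      # row sums: 45, 38, 38, 27
full.loc["total"] = full.sum(axis=0)  # column sums: 63, 47, 38, plus grand total 148
print(full)
```

Note that the `DataFrame` is transposed relative to the LaTeX table above (homework groups as rows), so its row sums correspond to the table's column sums and vice versa.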
In the next step we calculate the **expected frequencies**.
To calculate the expected frequencies $(E)$ we apply this equation:

$$E=\frac{R \cdot C}{n}\text{,}$$

where $R$ is the row total, $C$ is the column total and $n$ is the sample size.
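As a quick sanity check of this formula, take the cell for students who passed and completed none of the exercises: from the contingency table above, $R=63$, $C=45$ and $n=148$:

```python
# Expected frequency for the (passed, None) cell: E = R * C / n
R = 63   # row total for "passed"
C = 45   # column total for "None"
n = 148  # total number of students
E = R * C / n
print(round(E, 6))  # 19.155405
```

This matches the corresponding entry of the expected-frequency table computed below.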

Please note that we have to calculate the expected frequency for each particular table entry, thus we have to do $3 \times 4 = 12$ calculations.

Again, Python provides several ways to achieve this task. Building a nested for loop that visits every cell and performs the calculation step by step is definitely fine!

A much simpler way is the direct calculation using the given formula on the observed frequencies. The input is the underlying NumPy array of observed frequencies (`obs.values`, or our `data` array), and the `columns` and `index` arguments specify the column and row labels of the resulting `DataFrame`.

However, we can also use the `expected_freq()` function from the `scipy.stats.contingency` module. This function computes the expected frequencies (output) directly from a contingency table (input). We assign the result to a variable denoted `expected`, to remind us that this table corresponds to the **expected frequencies**.

In [7]:

```
## solution one: calculation using the given formula on the NumPy array
pd.DataFrame(
    (data.sum(0) * data.sum(1)[:, None]) / data.sum(),  # outer product of row and column totals, divided by n
    columns=obs.columns,
    index=obs.index,
)
```

Out[7]:

| Homework | passed | not passed | not participated |
|---|---|---|---|
| None | 19.155405 | 14.290541 | 11.554054 |
| <0.5 | 16.175676 | 12.067568 | 9.756757 |
| >0.5 | 16.175676 | 12.067568 | 9.756757 |
| all | 11.493243 | 8.574324 | 6.932432 |

In [8]:

```
## solution two: employing the scipy stats package
from scipy.stats.contingency import expected_freq
expected = pd.DataFrame(expected_freq(obs), columns=obs.columns, index=obs.index)
expected
```

Out[8]:

| Homework | passed | not passed | not participated |
|---|---|---|---|
| None | 19.155405 | 14.290541 | 11.554054 |
| <0.5 | 16.175676 | 12.067568 | 9.756757 |
| >0.5 | 16.175676 | 12.067568 | 9.756757 |
| all | 11.493243 | 8.574324 | 6.932432 |

Now, we can calculate the $\chi^2$ statistic. Recall the equation:

$$\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,}$$

where $O$ represents the observed frequency and $E$ represents the expected frequency.

In [9]:

```
E = expected_freq(obs)  # expected frequencies as a NumPy array
chisqVal = np.sum((data - E) ** 2 / E)
chisqVal
```

Out[9]:

17.344387665138406

The $\chi^2$ statistic evaluates to 17.3444.
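As a cross-check, the `chi2_contingency()` function from `scipy.stats` returns the same statistic in one call, together with the p-value, the degrees of freedom and the expected frequencies:

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])
stat, p, dof, exp = chi2_contingency(data)
print(round(stat, 4), dof)  # 17.3444 6  (dof = (4 - 1) * (3 - 1))
```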

Before we finally calculate the contingency coefficient, let us visualize the categorical data. The `mosaic()` function from the `statsmodels` package visualizes contingency tables and helps to assess the distribution of the data and possible dependencies. We apply the `stack()` method so that `mosaic()` can consume our contingency table directly, which avoids recomputing it.

In [10]:

```
from statsmodels.graphics.mosaicplot import mosaic
mosaic(obs.stack(), title="Observations")
plt.show()
```

The contingency coefficient, denoted as $C^*$, adjusts the $\chi^2$ statistic by the sample size, $n$. It can be written as

$$C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}$$

where $\chi^2$ corresponds to the $\chi^2$ statistic and $n$ corresponds to the number of observations.

When there is no relationship between the two variables, $C^*$ is close to $0$. The contingency coefficient $C^*$ can never exceed $1$, but it may be less than $1$ even when two variables are perfectly related to each other. Since this is not desirable, $C^*$ is adjusted so that it reaches a maximum of $1$ when there is complete association in a table of any number of rows and columns. This maximum is denoted $C^*_{max}$ and calculated as follows:

$$C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}$$

where $k$ is the number of rows or the number of columns, whichever is less, $k=\min(\text{rows}, \text{columns})$.

Then the adjusted contingency coefficient is computed by

$$C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}$$

In the section above the $\chi^2$ statistic was assigned to the variable `chisqVal` and evaluated to 17.3444. Now, we plug that value into the equation for the contingency coefficient, $C^*$.

In [11]:

```
C_star = np.sqrt(chisqVal / (np.sum(data) + chisqVal))
C_star
```

Out[11]:

0.3238804670641156

The contingency coefficient $C^*$ evaluates to approximately 0.3239.

Finally, we apply the equation for the adjusted contingency coefficient, $C$.

In [12]:

```
r, c = obs.shape  # number of rows and columns
k = min(r, c)     # k = min(rows, columns)
C_star_max = np.sqrt((k - 1) / k)
C = C_star / C_star_max
C
```

Out[12]:

0.39667094098068806

In [13]:

```
round(C, 2)
```

Out[13]:

0.4
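As a sketch to verify the closed-form identity $C=\sqrt{k\chi^2/\big((k-1)(n+\chi^2)\big)}$ stated above, we can plug the numbers in directly (values taken from the cells above):

```python
import numpy as np

chisq = 17.344387665138406  # chi-squared statistic computed above
n, k = 148, 3               # sample size and k = min(rows, columns)
C_direct = np.sqrt(k * chisq / ((k - 1) * (n + chisq)))
print(round(C_direct, 4))  # 0.3967, matching the stepwise result
```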

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*