The contingency coefficient, $C$, is a $\chi^2$-based measure of association for categorical data. It builds on the $\chi^2$ test for independence. The $\chi^2$ statistic allows us to assess whether there is a statistical relationship between the variables of a contingency table (also known as a two-way table, cross-tabulation table or crosstab). In this kind of table the distribution of the variables is shown in matrix format.
In order to calculate the contingency coefficient $C$ we first have to determine the $\chi^2$ statistic.
The $\chi^2$ statistic is given by
$$\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} $$where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac{(O-E)^2}{E}$ is evaluated for each cell of a contingency table and then summed up.
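Before turning to the full example below, the cell-wise evaluation of this formula can be sketched on a small, made-up $2 \times 2$ table. The numbers here are purely illustrative:

```python
import numpy as np

# Hypothetical observed and expected frequencies for a 2x2 table
O = np.array([[10, 20], [20, 10]])
E = np.array([[15, 15], [15, 15]])

# Evaluate (O - E)^2 / E for each cell, then sum over all cells
chi2 = ((O - E) ** 2 / E).sum()
```

Each cell contributes $(10-15)^2/15 = 25/15$ or $(20-15)^2/15 = 25/15$, so the four cells sum to $100/15 \approx 6.67$.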
To explain the calculation of the $\chi^2$ statistic on categorical observation data in more depth, we work through an example. Consider an exam at the end of the semester. There are three groups of students: those who passed, those who did not pass, and those who did not participate in the exam. Further, there were exercises for the students to work on throughout the semester. We categorize the number of exercises each particular student completed into four groups: none, less than half $(<0.5)$, at least half $(\ge 0.5)$, and all of them.
The resulting contingency table looks like this:
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tabulate as tab
First, let us construct a DataFrame object and assign it the name obs to remind us that this data corresponds to the observed frequencies. Please note that we also have to add an index name and a column name to get a properly labeled contingency table.
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]]) # data
obs = pd.DataFrame(
data,
columns=["passed", "not passed", "not participated"],
index=["None", "<0.5", ">0.5", "all"],
)
obs.index.name = "Homework"
obs.columns.name = "Exam"
obs
Exam | passed | not passed | not participated |
---|---|---|---|
Homework | |||
None | 12 | 22 | 11 |
<0.5 | 13 | 11 | 14 |
>0.5 | 24 | 8 | 6 |
all | 14 | 6 | 7 |
Perfect, now we have a proper representation of our data. However, one piece is still missing to complete the contingency table: the row sums and column sums.
There are several ways to compute the row and column sums in Python. We will simply apply the pandas method sum() and pass the axis argument: axis=0 sums over the rows and yields the column sums, while axis=1 sums over the columns and yields the row sums.
# Sum each column:
margin_col = obs.sum(axis=0)
margin_col
Exam
passed              63
not passed          47
not participated    38
dtype: int64
# Sum each row:
margin_row = obs.sum(axis=1)
margin_row
Homework
None    45
<0.5    38
>0.5    38
all     27
dtype: int64
Putting all pieces together the contingency table looks like this:
$$ \begin{array}{l|ccc|c} \hline \ & \text{passed} & \text{not passed} & \text{not participated} & \text{row sum} \\ \hline \ \text{None} & 12 & 22 & 11 & 45 \\ \ <0.5 & 13 & 11 & 14 & 38 \\ \ >0.5 & 24 & 8 & 6 & 38 \\ \ \text{all} & 14 & 6 & 7 & 27 \\ \hline \ \text{column sum} & 63 & 47 & 38 & 148 \\ \end{array} $$Great, now we have a table filled with the observed frequencies. In the next step we calculate the expected frequencies $(E)$ by applying this equation:
$$ E = \frac{R\times C}{n} \text{,}$$where $R$ is the row total, $C$ is the column total and $n$ is the sample size.
Please note that we have to calculate the expected frequency for each particular table entry, thus we have to do $4 \times 3 = 12$ calculations.
Again, Python provides several ways to achieve this task. One option is to build a nested for loop that visits every cell of the table and does the calculation step by step; that is definitely fine!
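For illustration, such a nested loop might look as follows, using the observation data from above:

```python
import numpy as np

# observed frequencies from the example above
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])

n = data.sum()                 # sample size
row_totals = data.sum(axis=1)  # R, one total per row
col_totals = data.sum(axis=0)  # C, one total per column

# visit every cell and apply E = (R * C) / n
expected_loop = np.zeros(data.shape)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        expected_loop[i, j] = row_totals[i] * col_totals[j] / n
```

Note that the expected frequencies sum to the same sample size $n = 148$ as the observed frequencies.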
A much simpler way is the direct, vectorized calculation of the given formula on the observed frequencies. The input is the NumPy array underlying obs (the array data from above, also available as obs.values). The columns and index arguments specify the column and row labels of the resulting DataFrame.
However, we can also use the expected_freq() function from the scipy.stats.contingency module. This function simply computes the expected frequencies (output) from a contingency table (input). We assign the result to a variable denoted as expected, in order to remind us that this table corresponds to the expected frequencies.
## solution one: Calculation using the given formula on the numpy array
pd.DataFrame(
(data.sum(0) * data.sum(1)[:, None]) / data.sum(),
columns=obs.columns,
index=obs.index,
)
Exam | passed | not passed | not participated |
---|---|---|---|
Homework | |||
None | 19.155405 | 14.290541 | 11.554054 |
<0.5 | 16.175676 | 12.067568 | 9.756757 |
>0.5 | 16.175676 | 12.067568 | 9.756757 |
all | 11.493243 | 8.574324 | 6.932432 |
## solution two: employing the scipy stats package
from scipy.stats.contingency import expected_freq
expected = pd.DataFrame(expected_freq(obs), columns=obs.columns, index=obs.index)
expected
Exam | passed | not passed | not participated |
---|---|---|---|
Homework | |||
None | 19.155405 | 14.290541 | 11.554054 |
<0.5 | 16.175676 | 12.067568 | 9.756757 |
>0.5 | 16.175676 | 12.067568 | 9.756757 |
all | 11.493243 | 8.574324 | 6.932432 |
Now, we can calculate the $\chi^2$ statistic. Recall the equation:
$$\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} $$where $O$ represents the observed frequency and $E$ represents the expected frequency.
chisqVal = np.sum((data - expected.values) ** 2 / expected.values)
chisqVal
17.344387665138406
The $\chi^2$ statistic evaluates to 17.3444.
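As a cross-check, the chi2_contingency() function from scipy.stats computes the same statistic directly from the observed table, together with the $p$-value, the degrees of freedom and the expected frequencies:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed frequencies from the example above
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])

# returns the chi-square statistic, the p-value, the degrees of
# freedom and the table of expected frequencies
chi2, p, dof, exp_freq = chi2_contingency(data)
```

For a $4 \times 3$ table the degrees of freedom are $(4-1)(3-1) = 6$, and the statistic matches our manual result of 17.3444.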
Before we finally calculate the contingency coefficient, let us visualize the categorical data. The mosaic() function from the statsmodels package visualizes contingency tables and helps to assess the distribution of the data and possible dependencies. We call the stack() method on obs to reshape the DataFrame into a Series with a hierarchical index, which mosaic() can plot directly; this way we avoid recomputing the contingency table.
from statsmodels.graphics.mosaicplot import mosaic
mosaic(obs.stack(), title="Observations")
plt.show()
The contingency coefficient, denoted as $C^*$, adjusts the $\chi^2$ statistic by the sample size, $n$. It can be written as
$$C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}$$where $\chi^2$ corresponds to the $\chi^2$ statistic and $n$ corresponds to the number of observations.
When there is no relationship between two variables, $C^*$ is close to $0$. The contingency coefficient $C^*$ can never exceed $1$, but it may be less than $1$ even when two variables are perfectly related to each other. Since this is not desirable, $C^*$ is adjusted so that it reaches a maximum of $1$ when there is complete association in a table with any number of rows and columns. The maximum is denoted as $C^*_{max}$ and calculated as follows:
$$C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}$$where $k$ is the number of rows or the number of columns, whichever is smaller: $k=\min(\text{rows},\text{columns})$.
Then the adjusted contingency coefficient is computed by
$$C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}$$In the section above the $\chi^2$ statistic was assigned to the variable chisqVal and evaluated to 17.3444. Now we plug that value into the equation for the contingency coefficient, $C^*$.
C_star = np.sqrt(chisqVal / (np.sum(data) + chisqVal))
C_star
0.3238804670641156
The contingency coefficient $C^*$ evaluates to roughly 0.3239.
Finally, we apply the equation for the adjusted contingency coefficient, $C$.
count_row = obs.shape[0] # number of rows
count_col = obs.shape[1] # number of columns
# Or, more concisely
r, c = obs.shape
k = min(r, c)
C_star_max = np.sqrt((k - 1) / k)
C = C_star / C_star_max
C
0.39667094098068806
round(C, 2)
0.4
The adjusted contingency coefficient $C$ evaluates to 0.3967. Recall that the contingency coefficient ranges from $0$ to $1$. A contingency coefficient of roughly $0.4$ indicates a moderate, but not a strong, relation between the exam results and the students' willingness to complete exercises during the semester.
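As a final cross-check: if your SciPy version is 1.7 or newer, the association() function in scipy.stats.contingency computes several $\chi^2$-based association measures directly from a contingency table. With method="pearson" it returns the unadjusted contingency coefficient $C^*$ from above:

```python
import numpy as np
from scipy.stats.contingency import association

# observed frequencies from the example above
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])

# Pearson's contingency coefficient C* = sqrt(chi2 / (n + chi2))
C_star_check = association(data, method="pearson")
```

The same function also offers method="cramer" (Cramér's $V$) and method="tschuprow" (Tschuprow's $T$), two alternative $\chi^2$-based measures of association.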
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.