The contingency coefficient, $C$, is a $\chi^2$-based measure of association for categorical data. It builds on the $\chi^2$ test for independence. The $\chi^2$ statistic allows us to assess whether there is a statistical relationship between the variables of a contingency table (also known as a two-way table, cross-tabulation table or crosstab). In this kind of table the distribution of the variables is shown in matrix format.
In order to calculate the contingency coefficient $C$ we first have to determine the $\chi^2$ statistic.
The $\chi^2$ statistic is given by
$$\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} $$where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac{(O-E)^2}{E}$ is evaluated for each cell of a contingency table and then summed up.
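Before turning to the full example below, the cell-wise evaluation of this formula can be sketched on a small, made-up $2 \times 2$ table. The numbers here are purely illustrative:

```python
import numpy as np

# Hypothetical observed and expected frequencies for a 2x2 table
O = np.array([[10, 20], [20, 10]])
E = np.array([[15, 15], [15, 15]])

# Evaluate (O - E)^2 / E for each cell, then sum over all cells
chi2 = ((O - E) ** 2 / E).sum()
```

Each cell contributes $(10-15)^2/15 = 25/15$ or $(20-15)^2/15 = 25/15$, so the four cells sum to $100/15 \approx 6.67$.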
To explain the calculation of the $\chi^2$ statistic on categorical observation data in more depth, we work through an example. Consider an exam at the end of the semester. There are three groups of students: those who passed, those who did not pass, and those who did not participate in the exam. Further, there were exercises for the students to work on throughout the semester. We categorize the number of exercises each particular student completed into four groups: none, less than half $(<0.5)$, at least half $(\ge 0.5)$, and all of them.
The resulting contingency table looks like this:
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tabulate as tab
First, let us construct a DataFrame object and assign it the name obs to remind us that this data corresponds to the observed frequencies. Please note that we also have to add an index name and a column name to get a properly labeled contingency table.
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]]) # data
obs = pd.DataFrame(
data,
columns=["passed", "not passed", "not participated"],
index=["None", "<0.5", ">0.5", "all"],
)
obs.index.name = "Homework"
obs.columns.name = "Exam"
obs
Exam | passed | not passed | not participated |
---|---|---|---|
Homework | |||
None | 12 | 22 | 11 |
<0.5 | 13 | 11 | 14 |
>0.5 | 24 | 8 | 6 |
all | 14 | 6 | 7 |
Perfect, now we have a proper representation of our data. However, one piece is still missing to complete the contingency table: the row sums and column sums.
There are several ways to compute the row and column sums in Python. We will simply apply the pandas method sum() and pass the axis argument: axis=0 sums over the rows and yields the column sums, while axis=1 sums over the columns and yields the row sums.
# Sum each column:
margin_col = obs.sum(axis=0)
margin_col
Exam
passed              63
not passed          47
not participated    38
dtype: int64
# Sum each row:
margin_row = obs.sum(axis=1)
margin_row
Homework
None    45
<0.5    38
>0.5    38
all     27
dtype: int64
Putting all pieces together the contingency table looks like this:
$$ \begin{array}{l|ccc|c} \hline \ & \text{passed} & \text{not passed} & \text{not participated} & \text{row sum} \\ \hline \ \text{None} & 12 & 22 & 11 & 45 \\ \ <0.5 & 13 & 11 & 14 & 38 \\ \ >0.5 & 24 & 8 & 6 & 38 \\ \ \text{all} & 14 & 6 & 7 & 27 \\ \hline \ \text{column sum} & 63 & 47 & 38 & 148 \\ \end{array} $$Great, now we have a table filled with the observed frequencies. In the next step we calculate the expected frequencies $(E)$ by applying this equation:
$$ E = \frac{R\times C}{n} \text{,}$$where $R$ is the row total, $C$ is the column total and $n$ is the sample size.
Please note that we have to calculate the expected frequency for each particular table entry, thus we have to do $4 \times 3 = 12$ calculations.
Again, Python provides several ways to achieve this task. One option is to build a nested for loop that visits every cell of the table and does the calculation step by step; that is definitely fine!
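For illustration, such a nested loop might look as follows, using the observation data from above:

```python
import numpy as np

# observed frequencies from the example above
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])

n = data.sum()                 # sample size
row_totals = data.sum(axis=1)  # R, one total per row
col_totals = data.sum(axis=0)  # C, one total per column

# visit every cell and apply E = (R * C) / n
expected_loop = np.zeros(data.shape)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        expected_loop[i, j] = row_totals[i] * col_totals[j] / n
```

Note that the expected frequencies sum to the same sample size $n = 148$ as the observed frequencies.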
A much simpler way is the direct, vectorized calculation of the given formula on the observed frequencies. The input is the NumPy array underlying obs (the array data from above, also available as obs.values). The columns and index arguments specify the column and row labels of the resulting DataFrame.
However, we can also use the expected_freq() function from the scipy.stats.contingency module. This function simply computes the expected frequencies (output) from a contingency table (input). We assign the result to a variable denoted as expected, in order to remind us that this table corresponds to the expected frequencies.
## solution one: Calculation using the given formula on the numpy array
pd.DataFrame(
(data.sum(0) * data.sum(1)[:, None]) / data.sum(),
columns=obs.columns,
index=obs.index,
)
Exam | passed | not passed | not participated |
---|---|---|---|
Homework | |||
None | 19.155405 | 14.290541 | 11.554054 |
<0.5 | 16.175676 | 12.067568 | 9.756757 |
>0.5 | 16.175676 | 12.067568 | 9.756757 |
all | 11.493243 | 8.574324 | 6.932432 |
## solution two: employing the scipy stats package
from scipy.stats.contingency import expected_freq
expected = pd.DataFrame(expected_freq(obs), columns=obs.columns, index=obs.index)
expected
Exam | passed | not passed | not participated |
---|---|---|---|
Homework | |||
None | 19.155405 | 14.290541 | 11.554054 |
<0.5 | 16.175676 | 12.067568 | 9.756757 |
>0.5 | 16.175676 | 12.067568 | 9.756757 |
all | 11.493243 | 8.574324 | 6.932432 |
Now, we can calculate the $\chi^2$ statistic. Recall the equation:
$$\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} $$where $O$ represents the observed frequency and $E$ represents the expected frequency.
chisqVal = np.sum((data - expected.values) ** 2 / expected.values)
chisqVal
17.344387665138406
The $\chi^2$ statistic evaluates to 17.3444.
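As a cross-check, the chi2_contingency() function from scipy.stats computes the same statistic directly from the observed table, together with the $p$-value, the degrees of freedom and the expected frequencies:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed frequencies from the example above
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])

# returns the chi-square statistic, the p-value, the degrees of
# freedom and the table of expected frequencies
chi2, p, dof, exp_freq = chi2_contingency(data)
```

For a $4 \times 3$ table the degrees of freedom are $(4-1)(3-1) = 6$, and the statistic matches our manual result of 17.3444.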
Before we finally calculate the contingency coefficient, let us visualize the categorical data. The mosaic() function from the statsmodels package visualizes contingency tables and helps to assess the distribution of the data and possible dependencies. We call the stack() method on obs to reshape the DataFrame into a Series with a hierarchical index, which mosaic() can plot directly; this way we avoid recomputing the contingency table.
from statsmodels.graphics.mosaicplot import mosaic
mosaic(obs.stack(), title="Observations")
plt.show()
The contingency coefficient, denoted as $C^*$, adjusts the $\chi^2$ statistic by the sample size, $n$. It can be written as
$$C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}$$where $\chi^2$ corresponds to the $\chi^2$ statistic and $n$ corresponds to the number of observations.
When there is no relationship between two variables, $C^*$ is close to $0$. The contingency coefficient $C^*$ can never exceed $1$, but it may be less than $1$ even when two variables are perfectly related to each other. Since this is not desirable, $C^*$ is adjusted so that it reaches a maximum of $1$ when there is complete association in a table with any number of rows and columns. The maximum is denoted as $C^*_{max}$ and calculated as follows:
$$C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}$$where $k$ is the number of rows or the number of columns, whichever is smaller: $k=\min(\text{rows},\text{columns})$.
Then the adjusted contingency coefficient is computed by
$$C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}$$In the section above the $\chi^2$ statistic was assigned to the variable chisqVal and evaluated to 17.3444. Now we plug that value into the equation for the contingency coefficient, $C^*$.
C_star = np.sqrt(chisqVal / (np.sum(data) + chisqVal))
C_star
0.3238804670641156
The contingency coefficient $C^*$ evaluates to roughly 0.3239.
Finally, we apply the equation for the adjusted contingency coefficient, $C$.
count_row = obs.shape[0] # number of rows
count_col = obs.shape[1] # number of columns
# Or, more concisely
r, c = obs.shape
k = min(r, c)
C_star_max = np.sqrt((k - 1) / k)
C = C_star / C_star_max
C
0.39667094098068806
round(C, 2)
0.4
The adjusted contingency coefficient $C$ evaluates to 0.3967. Recall that the contingency coefficient ranges from $0$ to $1$. A contingency coefficient of roughly $0.4$ indicates a moderate, but not a strong, relation between the exam results and the students' willingness to complete exercises during the semester.
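As a final cross-check: if your SciPy version is 1.7 or newer, the association() function in scipy.stats.contingency computes several $\chi^2$-based association measures directly from a contingency table. With method="pearson" it returns the unadjusted contingency coefficient $C^*$ from above:

```python
import numpy as np
from scipy.stats.contingency import association

# observed frequencies from the example above
data = np.array([[12, 22, 11], [13, 11, 14], [24, 8, 6], [14, 6, 7]])

# Pearson's contingency coefficient C* = sqrt(chi2 / (n + chi2))
C_star_check = association(data, method="pearson")
```

The same function also offers method="cramer" (Cramér's $V$) and method="tschuprow" (Tschuprow's $T$), two alternative $\chi^2$-based measures of association.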
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.