The contingency coefficient, \(C\), is a \(\chi^2\)-based measure of association for categorical data. It relies on the \(\chi^2\) test for independence. The \(\chi^2\) statistic allows us to assess whether there is a statistical relationship between the variables of a contingency table (also known as a two-way table, cross-tabulation table or cross tab). In this kind of table the distribution of the variables is shown in matrix format.
In order to calculate the contingency coefficient \(C\) we first have to determine the \(\chi^2\) statistic.
The \(\chi^2\) statistic is given by
\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]
where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell of a contingency table and then summed up.
We showcase an example to explain the calculation of the \(\chi^2\) statistic based on categorical observation data in more depth. Consider an exam at the end of the semester. There are three groups of students: students either passed the exam, did not pass, or did not participate. Further, there have been exercises for the students to work on throughout the semester. We categorize the number of exercises each particular student completed into four groups: none, less than half \((<0.5)\), more than half \((>0.5)\), or all of them.
The resulting contingency table looks like this:
\[ \begin{array}{l|cccc} \hline \ & \text{None} & <0.5 & > 0.5 & \text{all} \\ \hline \ \text{passed} & 12 & 13 & 24 & 14 \\ \ \text{not passed} & 22 & 11 & 8 & 6 \\ \ \text{not participated} & 11 & 14 & 6 & 7 \\ \hline \end{array} \]
First, let us construct a table object and assign it the name obs to remind us that this data corresponds to the observed frequencies:
data <- c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7) # data
obs <- matrix(data, nrow = 3, byrow = TRUE) # structure in a matrix
obs <- as.table(obs) # transform into table
dimnames(obs) <- list(
"Exam" = c("passed", "not passed", "not participated"),
"HomeWork" = c("None", "<0.5", ">0.5", "all")
) # name axes
obs
## HomeWork
## Exam None <0.5 >0.5 all
## passed 12 13 24 14
## not passed 22 11 8 6
## not participated 11 14 6 7
Perfect, now we have a proper representation of our data in R. However, one piece is still missing to complete the contingency table: the row sums and column sums.
There are several ways to compute the row and column sums in R. One option is to use a function of the apply() family, or to make use of the rowSums() and colSums() functions. Here we will use R's built-in margin.table() function. The function takes two arguments: the data and a number indicating whether to sum row-wise (1) or column-wise (2).
margin_row <- margin.table(obs, 1) # row-wise
margin_row
## Exam
## passed not passed not participated
## 63 47 38
margin_col <- margin.table(obs, 2) # column-wise
margin_col
## HomeWork
## None <0.5 >0.5 all
## 45 38 38 27
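As mentioned above, the marginal sums can also be obtained with a function of the apply() family or with the rowSums() and colSums() shortcuts. A small sketch, assuming the obs matrix from the example, confirming that all approaches agree:

```r
# observed frequencies from the example above
obs <- as.table(matrix(c(12, 13, 24, 14,
                         22, 11,  8,  6,
                         11, 14,  6,  7), nrow = 3, byrow = TRUE))

apply(obs, 1, sum)  # row sums, equivalent to margin.table(obs, 1)
rowSums(obs)        # same result
apply(obs, 2, sum)  # column sums, equivalent to margin.table(obs, 2)
colSums(obs)        # same result
```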
Putting all pieces together the contingency table looks like this:
\[ \begin{array}{l|cccc|c} \hline \ & \text{None} & <0.5 & > 0.5 & \text{all} & \text{row sum} \\ \hline \ \text{passed} & 12 & 13 & 24 & 14 & 63\\ \ \text{not passed} & 22 & 11 & 8 & 6 & 47\\ \ \text{not participated} & 11 & 14 & 6 & 7 & 38\\ \hline \ \text{column sum} & 45 & 38 & 38 & 27 & \\ \end{array} \]
Great, now we have a table filled with the observed frequencies. In the next step we calculate the expected frequencies. To calculate the expected frequencies \((E)\) we apply this equation:
\[ E = \frac{R\times C}{n} \text{,}\]
where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.
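For example, for the cell (passed, None) the row total is \(R = 63\), the column total is \(C = 45\) and the sample size is \(n = 148\); a quick check of that single cell:

```r
# expected frequency for the cell (passed, None): E = R * C / n
E_passed_none <- (63 * 45) / 148
round(E_passed_none, 5)  # 19.15541
```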
Please note that we have to calculate the expected frequency for each particular table entry, thus we have to do \(3 \times 4 = 12\) calculations. Again, R provides several ways to achieve that task. The option of building a nested for loop to go through every cell and do the calculations step-wise is definitely fine! However, we can also use the vectorized, and thus much faster, outer() function in combination with the rowSums() and colSums() functions. We assign the result to a variable denoted expected, to remind us that this table corresponds to the expected frequencies.
expected <- outer(rowSums(obs), colSums(obs), FUN = "*") / sum(data)
expected
## None <0.5 >0.5 all
## passed 19.15541 16.175676 16.175676 11.493243
## not passed 14.29054 12.067568 12.067568 8.574324
## not participated 11.55405 9.756757 9.756757 6.932432
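For completeness, the nested for loop mentioned above would produce the same table as outer(); a sketch, assuming the observed frequencies from the example:

```r
obs <- matrix(c(12, 13, 24, 14,
                22, 11,  8,  6,
                11, 14,  6,  7), nrow = 3, byrow = TRUE)
n <- sum(obs)

# fill the expected-frequency table cell by cell: E = R * C / n
expected_loop <- matrix(0, nrow = nrow(obs), ncol = ncol(obs))
for (i in seq_len(nrow(obs))) {
  for (j in seq_len(ncol(obs))) {
    expected_loop[i, j] <- sum(obs[i, ]) * sum(obs[, j]) / n
  }
}

# agrees (up to floating point) with the vectorized version
all.equal(expected_loop, outer(rowSums(obs), colSums(obs)) / n)
```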
Now, we can calculate the \(\chi^2\) statistic. Recall the equation:
\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]
where \(O\) represents the observed frequency and \(E\) represents the expected frequency.
chisqVal <- sum((obs - expected)^2 / expected)
chisqVal
## [1] 17.34439
The \(\chi^2\) statistic evaluates to 17.3443877.
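As a sanity check, the manually computed value can be compared against R's built-in chisq.test() function, which reports the same Pearson \(\chi^2\) statistic for an \(r \times c\) table:

```r
obs <- matrix(c(12, 13, 24, 14,
                22, 11,  8,  6,
                11, 14,  6,  7), nrow = 3, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
chisqVal <- sum((obs - expected)^2 / expected)

# cross-check against the built-in test of independence
unname(chisq.test(obs)$statistic)  # matches chisqVal
```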
Before we finally calculate the contingency coefficient, note that R offers a nice built-in plotting function to visualize categorical data. The mosaicplot() function visualizes contingency tables and helps to assess the distributions of the data and possible dependencies.
mosaicplot(obs, main = "Observations")
The contingency coefficient, denoted as \(C^*\), adjusts the \(\chi^2\) statistic by the sample size, \(n\). It can be written as
\[C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}\]
where \(\chi^2\) corresponds to the \(\chi^2\) statistic and \(n\) corresponds to the number of observations.
When there is no relationship between the two variables, \(C^*\) is close to \(0\). The contingency coefficient \(C^*\) can never exceed \(1\), but it may be less than \(1\) even when two variables are perfectly related to each other. Since this is not desirable, \(C^*\) is adjusted so that it reaches a maximum of \(1\) when there is complete association in a table of any number of rows and columns. The adjustment factor is denoted \(C^*_{max}\) and calculated as follows:
\[C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}\]
where \(k\) is the number of rows or the number of columns, whichever is smaller, \(k=\min(\text{rows},\text{columns})\).
Then the adjusted contingency coefficient is computed by
\[C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}\text{.}\]
In the section above the \(\chi^2\) statistic was assigned to the variable chisqVal and was calculated as 17.3443877. Now, we plug that value into the equation for the contingency coefficient, \(C^*\).
C_star <- sqrt(chisqVal / (sum(obs) + chisqVal))
C_star
## [1] 0.3238805
The contingency coefficient \(C^*\) evaluates to 0.3238805.
Finally, we apply the equation for the adjusted contingency coefficient, \(C\).
k <- min(nrow(obs), ncol(obs))
C_star_max <- sqrt((k - 1) / k)
C <- C_star / C_star_max
C
## [1] 0.3966709
The adjusted contingency coefficient \(C\) evaluates to 0.3966709. Recall that the contingency coefficient ranges from \(0\) to \(1\). A coefficient of roughly \(0.4\) does not indicate a strong relation between the exam results and the students' willingness to complete exercises during the semester.
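The whole procedure can be wrapped into a small helper function; a sketch, where the function name adjusted_C is our own choice rather than part of any package:

```r
# adjusted contingency coefficient C = C* / C*_max for a contingency table
adjusted_C <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n  # E = R * C / n per cell
  chisq <- sum((tab - expected)^2 / expected)        # chi-squared statistic
  C_star <- sqrt(chisq / (n + chisq))                # contingency coefficient
  k <- min(nrow(tab), ncol(tab))
  C_star / sqrt((k - 1) / k)                         # divide by C*_max
}

obs <- matrix(c(12, 13, 24, 14,
                22, 11,  8,  6,
                11, 14,  6,  7), nrow = 3, byrow = TRUE)
round(adjusted_C(obs), 7)  # 0.3966709
```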
Before we end this section we want to point out the DescTools package (Tools for Descriptive Statistics and Exploratory Data Analysis). The package is a collection of miscellaneous basic statistic functions and convenience wrappers for efficient data description. It is a toolbox which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results.
Feel free to download the package (install.packages("DescTools")) and play around with it. In the code segment below the Assocs() function from the DescTools package is applied to our observation data set from above. The function returns several association measures simultaneously.
library(DescTools)
Assocs(obs)
## estimate lwr.ci upr.ci
## Contingency Coeff. 0.3239 - -
## Cramer V 0.2421 0.0661 0.3258
## Kendall Tau-b -0.1648 -0.2981 -0.0315
## Goodman Kruskal Gamma -0.2319 -0.4182 -0.0457
## Stuart Tau-c -0.1720 -0.3105 -0.0335
## Somers D C|R -0.1759 -0.3187 -0.0331
## Somers D R|C -0.1545 -0.2791 -0.0298
## Pearson Correlation -0.1767 -0.3286 -0.0158
## Spearman Correlation -0.1940 -0.3445 -0.0337
## Lambda C|R 0.1456 0.0083 0.2830
## Lambda R|C 0.1294 0.0000 0.2974
## Lambda sym 0.1383 0.0072 0.2693
## Uncertainty Coeff. C|R 0.0418 0.0024 0.0813
## Uncertainty Coeff. R|C 0.0532 0.0031 0.1034
## Uncertainty Coeff. sym 0.0469 0.0027 0.0910
## Mutual Information 0.0827 - -
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.