The contingency coefficient, \(C\), is a \(\chi^2\)-based measure of association for categorical data. It relies on the \(\chi^2\) test for independence. The \(\chi^2\) statistic allows us to assess whether or not there is a statistical relationship between the variables in a contingency table (also known as a two-way table, cross-tabulation table or cross tab). In this kind of table the distribution of the variables is shown in matrix format.

In order to calculate the contingency coefficient \(C\) we first have to determine the \(\chi^2\) statistic.


Calculation of the \(\chi^2\) statistic

The \(\chi^2\) statistic is given by

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell and then summed up.

We showcase an example to explain the calculation of the \(\chi^2\) statistic based on categorical observation data in more depth. Consider an exam at the end of the semester. There are three groups of students: students who passed, students who did not pass, and students who did not participate in the exam. Further, we categorize the number of exercises each particular student completed during the semester into four groups: none, less than half \((<0.5)\), half or more \((\ge 0.5)\), and all of them.

The resulting contingency table looks like this:

\[ \begin{array}{l|cccc} \hline & \text{None} & <0.5 & \ge 0.5 & \text{all} \\ \hline \text{passed} & 12 & 13 & 24 & 14 \\ \text{not passed} & 22 & 11 & 8 & 6 \\ \text{not participated} & 11 & 14 & 6 & 7 \\ \hline \end{array} \]

First, let us construct a table object and assign it the name obs, to remind us that this data corresponds to the observed frequencies:

data  <- c(12,13,24,14,22,11,8,6,11,14,6,7) # data
obs <-  matrix(data, nrow=3, byrow = TRUE) # structure in a matrix
obs <- as.table(obs) # transform into table
dimnames(obs) <- list('Exam' = c('passed', 'not passed', 'not participated'),
                      'HomeWork' = c('None', '<0.5', '>0.5', 'all')) # name axes
obs
##                   HomeWork
## Exam               None <0.5 >0.5 all
##   passed             12   13   24  14
##   not passed         22   11    8   6
##   not participated   11   14    6   7

Perfect, now we have a proper representation of our data in R. However, one piece is still missing to complete the contingency table: the row and column sums.

There are several ways to compute the row and column sums in R. One may use a function of the apply() family, or make use of the rowSums() and colSums() functions. However, R also ships with a built-in function called margin.table(). The function takes two arguments: the data and an index indicating the margin (1 for row sums, 2 for column sums).

margin.row <- margin.table(obs, 1) # row-wise
margin.row
## Exam
##           passed       not passed not participated 
##               63               47               38
margin.col <- margin.table(obs, 2) # column-wise
margin.col
## HomeWork
## None <0.5 >0.5  all 
##   45   38   38   27
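
For comparison, the same margins can be obtained with rowSums() and colSums(). They return plain named vectors, without the dimension labels that margin.table() preserves:

rowSums(obs) # row margins as a named vector
##           passed       not passed not participated 
##               63               47               38
colSums(obs) # column margins as a named vector
## None <0.5 >0.5  all 
##   45   38   38   27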

Putting all pieces together, the contingency table looks like this:

\[ \begin{array}{l|cccc|c} \hline & \text{None} & <0.5 & \ge 0.5 & \text{all} & \text{row sum} \\ \hline \text{passed} & 12 & 13 & 24 & 14 & 63 \\ \text{not passed} & 22 & 11 & 8 & 6 & 47 \\ \text{not participated} & 11 & 14 & 6 & 7 & 38 \\ \hline \text{column sum} & 45 & 38 & 38 & 27 & 148 \\ \end{array} \]

Great, now we have a table filled with the observed frequencies. In the next step we calculate the expected frequencies \((E)\), applying this equation:

\[ E = \frac{R\times C}{n} \text{,}\]

where \(R\) is the row total, \(C\) is the column total, and \(n\) is the sample size. Please note that we have to calculate the expected frequency for each particular table entry, thus we have to do \(3 \times 4 = 12\) calculations. Again, R provides several ways to achieve that task. One may build a nested for loop to go through every cell and do the calculations step-wise, which is definitely fine (a sketch of that approach follows below). However, we may also use the vectorized, and thus much faster, outer() function in combination with the rowSums() and colSums() functions. We assign the result to a variable denoted expected, in order to remind us that this table corresponds to the expected frequencies.
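
For instance, for the cell in the first row and first column (students who passed and completed no exercises), the expected frequency is

\[E = \frac{63 \times 45}{148} \approx 19.16\text{,}\]

which we can use to verify the first entry of the table computed below.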

expected <- outer(rowSums(obs), colSums(obs), FUN = '*') / sum(data)
expected 
##                      None      <0.5      >0.5       all
## passed           19.15541 16.175676 16.175676 11.493243
## not passed       14.29054 12.067568 12.067568  8.574324
## not participated 11.55405  9.756757  9.756757  6.932432
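
As mentioned above, the same table can be built with a nested for loop. Here is a minimal sketch of that approach; it reproduces the expected table cell by cell, just less elegantly:

expected.loop <- matrix(0, nrow = nrow(obs), ncol = ncol(obs),
                        dimnames = list(rownames(obs), colnames(obs)))
for (i in 1:nrow(obs)) {      # loop over rows
  for (j in 1:ncol(obs)) {    # loop over columns
    expected.loop[i, j] <- rowSums(obs)[i] * colSums(obs)[j] / sum(obs)
  }
}
all.equal(expected, expected.loop)
## [1] TRUE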

In the next step we finally calculate the \(\chi^2\) statistic. Recall the equation:

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency.

chisqVal <- sum((obs-expected)^2 / expected) 
chisqVal
## [1] 17.34439

The \(\chi^2\) statistic evaluates to 17.3443877.
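
As a quick cross-check, the chisq.test() function built into R reports the same value:

chisq.test(obs)$statistic # Pearson's chi-squared statistic
## X-squared 
##  17.34439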

Before we finally calculate the contingency coefficient, note that there is a nice plotting function for categorical data built into R. The mosaicplot() function visualizes contingency tables and helps to assess the distributions of the data and possible dependencies.

mosaicplot(obs, main = "Observations")


Calculation of the contingency coefficient \(C\)

The contingency coefficient, denoted \(C^*\), normalizes the \(\chi^2\) statistic by the sample size \(n\). It can be written as

\[C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}\]

where \(\chi^2\) corresponds to the \(\chi^2\) statistic and \(n\) corresponds to the number of observations. When there is no relationship between the two variables, \(C^* = 0\). The contingency coefficient \(C^*\) cannot exceed \(1\), but it may be less than \(1\) even when two variables are perfectly related to each other. This is undesirable, so \(C^*\) is adjusted such that it reaches a maximum of \(1\) when there is complete association in a table of any number of rows and columns. Therefore, we calculate \(C^*_{max}\), which is

\[C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}\]

where \(k\) is the number of rows or the number of columns, whichever is less, \(k=\min(\text{rows}, \text{columns})\). The adjusted contingency coefficient is then computed as

\[C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}\text{.}\]

In the section above the \(\chi^2\) statistic was assigned to the variable chisqVal and was calculated as 17.3443877. Now we plug that value into the equation for the contingency coefficient, \(C^*\).

C.star <- sqrt(chisqVal/(sum(obs)+chisqVal))
C.star
## [1] 0.3238805

The contingency coefficient \(C^*\) evaluates to 0.3238805.

Now we apply the equation for the adjusted contingency coefficient, \(C\).

k <- min(nrow(obs), ncol(obs))
C.star.max <- sqrt((k-1)/k)
C <- C.star/C.star.max
C
## [1] 0.3966709

The adjusted contingency coefficient \(C\) evaluates to 0.3966709. Recall that the contingency coefficient ranges from 0 to 1. A contingency coefficient of roughly 0.4 does not indicate a strong relation between the results of the exam and the willingness of students to complete exercises during the semester.
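
As a final sanity check, we can evaluate the closed-form expression for \(C\) from above in a single line:

sqrt(k * chisqVal / ((k - 1) * (sum(obs) + chisqVal))) # closed form for C
## [1] 0.3966709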


Before we end this section I want to point to the recently published DescTools package (Tools for Descriptive Statistics and Exploratory Data Analysis). The package is a collection of miscellaneous basic statistic functions and convenience wrappers for efficiently describing data. It is a toolbox which facilitates the (notoriously time-consuming) first descriptive tasks in data analysis: calculating descriptive statistics, drawing graphical summaries and reporting the results.

Feel free to download the package and play with it. In the code segment below the Assocs() function from the DescTools package is applied to our example data set from above. The function returns several association measures simultaneously.

library(DescTools) 
Assocs(obs)
##                           estimate      lwr.ci      upr.ci
## Phi Coeff.              3.4230e-01           -           -
## Contingency Coeff.      3.2390e-01           -           -
## Cramer V                2.4210e-01  6.6100e-02  3.2580e-01
## Goodman Kruskal Gamma  -2.3190e-01 -4.1820e-01 -4.5700e-02
## Kendall Tau-b          -1.6480e-01 -2.9810e-01 -3.1500e-02
## Stuart Tau-c           -1.7200e-01 -3.1050e-01 -3.3500e-02
## Somers D C|R           -1.7590e-01 -3.1870e-01 -3.3100e-02
## Somers D R|C           -1.5450e-01 -2.7910e-01 -2.9800e-02
## Pearson Correlation    -1.7670e-01 -3.2860e-01 -1.5800e-02
## Spearman Correlation   -1.9400e-01 -3.4450e-01 -3.3700e-02
## Lambda C|R              1.4560e-01  8.3000e-03  2.8300e-01
## Lambda R|C              1.2940e-01      0.0000  2.9740e-01
## Lambda sym              1.3830e-01  7.2000e-03  2.6930e-01
## Uncertainty Coeff. C|R  4.1800e-02  2.4000e-03  8.1300e-02
## Uncertainty Coeff. R|C  5.3200e-02  3.1000e-03  1.0340e-01
## Uncertainty Coeff. sym  4.6900e-02  2.7000e-03  9.1000e-02
## Mutual Information      8.2700e-02           -           -
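
DescTools also ships a dedicated ContCoef() function which, if I read its documentation correctly, accepts a correct argument applying Sakoda's adjustment, so both coefficients can be obtained directly:

ContCoef(obs)                 # C*, matches our C.star
## [1] 0.3238805
ContCoef(obs, correct = TRUE) # adjusted coefficient, matches our C
## [1] 0.3966709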