The **contingency coefficient**, \(C\), is a \(\chi^2\)-based measure of association for
categorical data. It relies on the **\(\chi^2\) test for independence**. The \(\chi^2\) statistic allows us to assess whether
or not there is a statistical relationship between the variables of a **contingency table** (also known as a
two-way table, cross-tabulation table or cross tab). In this kind of
table the distribution of the variables is shown in matrix format.
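In R, such a contingency table can be built directly from raw categorical data with the built-in `table()` function. A minimal sketch with made-up vectors (the variable names and values below are hypothetical, not part of the example that follows):

```
# hypothetical raw observations of two categorical variables
exam <- c("passed", "passed", "not passed", "passed", "not passed")
homework <- c("all", "None", "None", "all", "all")
table(exam, homework) # cross-tabulates the two variables
```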

In order to calculate the **contingency coefficient**
\(C\) we first have to determine the \(\chi^2\) statistic.

The \(\chi^2\) statistic is given by

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell of a contingency table and then summed up.

We showcase an example to explain the calculation of the \(\chi^2\) statistic based on categorical observation data in more depth. Consider an exam at the end of the semester. There are three groups of students: students either passed, did not pass or did not participate in the exam. Further, there were exercises for the students to work on throughout the semester. We categorize the number of exercises each particular student completed into four groups: None, less than half \((<0.5)\), more than half \((>0.5)\), all of them.

The resulting contingency table looks like this:

\[ \begin{array}{l|cccc} \hline \ & \text{None} & <0.5 & >0.5 & \text{all} \\ \hline \ \text{passed} & 12 & 13 & 24 & 14 \\ \ \text{not passed} & 22 & 11 & 8 & 6 \\ \ \text{not participated} & 11 & 14 & 6 & 7 \\ \hline \end{array} \]

First, let us construct a `table` object and assign it the name `obs` to remind us that this data corresponds to the **observed frequency**:

```
data <- c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7) # data
obs <- matrix(data, nrow = 3, byrow = TRUE) # structure in a matrix
obs <- as.table(obs) # transform into table
dimnames(obs) <- list(
"Exam" = c("passed", "not passed", "not participated"),
"HomeWork" = c("None", "<0.5", ">0.5", "all")
) # name axes
obs
```

```
##                   HomeWork
## Exam               None <0.5 >0.5 all
##   passed             12   13   24  14
##   not passed         22   11    8   6
##   not participated   11   14    6   7
```

Perfect, now we have a proper representation of our data in R. However, one piece is still missing to complete the contingency table: the row sums and column sums.

There are several ways to compute the row and column sums in R. One
option is to use a function of the `apply()` family, or to make use of the
`rowSums()` and `colSums()` functions. Here we will use R's built-in function
`margin.table()`. The function takes two arguments: the data
and an index indicating whether to sum row-wise (`1`) or column-wise (`2`).

```
margin_row <- margin.table(obs, 1) # row-wise
margin_row
```

```
## Exam
##           passed       not passed not participated 
##               63               47               38
```

```
margin_col <- margin.table(obs, 2) # column-wise
margin_col
```

```
## HomeWork
## None <0.5 >0.5  all 
##   45   38   38   27
```
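Alternatively, base R's `addmargins()` appends both the row and the column sums in a single step, producing the complete contingency table at once:

```
# obs as constructed above; rebuilt here so this chunk runs on its own
obs <- as.table(matrix(c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7),
                       nrow = 3, byrow = TRUE))
addmargins(obs) # appends a "Sum" row and a "Sum" column
```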

Putting all pieces together the contingency table looks like this:

\[ \begin{array}{l|cccc|c} \hline \ & \text{None} & <0.5 & > 0.5 & \text{all} & \text{row sum} \\ \hline \ \text{passed} & 12 & 13 & 24 & 14 & 63\\ \ \text{not passed} & 22 & 11 & 8 & 6 & 47\\ \ \text{not participated} & 11 & 14 & 6 & 7 & 38\\ \hline \ \text{column sum} & 45 & 38 & 38 & 27 & \\ \end{array} \]

Great, now we have a table filled with the observed frequencies. In
the next step we calculate the **expected frequencies**. To
calculate the expected frequencies \((E)\) we apply this equation:

\[ E = \frac{R\times C}{n} \text{,}\]

where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.
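For example, for the cell (passed, None) the row total is \(R=63\), the column total is \(C=45\) and the sample size is \(n=148\), so

\[ E = \frac{63 \times 45}{148} \approx 19.16\text{.}\]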

Please note that we have to calculate the expected frequency for each
particular table entry, thus we have to do \(3
\times 4 = 12\) calculations. Again, R provides several ways to
achieve that task. Building a nested for loop to go
through every cell and do the calculations step-wise is perfectly fine.
However, we can also use the vectorized, and thus much faster,
`outer()` function in combination with the
`rowSums()` and `colSums()` functions. We assign
the result to a variable named `expected`, to
remind us that this table corresponds to the expected frequencies.

```
expected <- outer(rowSums(obs), colSums(obs), FUN = "*") / sum(data)
expected
```

```
##                      None      <0.5      >0.5       all
## passed           19.15541 16.175676 16.175676 11.493243
## not passed       14.29054 12.067568 12.067568  8.574324
## not participated 11.55405  9.756757  9.756757  6.932432
```
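For illustration, the nested for loop approach mentioned above yields the same matrix of expected frequencies:

```
# obs as constructed above; rebuilt here so this chunk runs on its own
obs <- as.table(matrix(c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7),
                       nrow = 3, byrow = TRUE))
expected_loop <- matrix(NA, nrow = nrow(obs), ncol = ncol(obs))
for (i in seq_len(nrow(obs))) {
  for (j in seq_len(ncol(obs))) {
    # E = (row total * column total) / sample size, cell by cell
    expected_loop[i, j] <- rowSums(obs)[i] * colSums(obs)[j] / sum(obs)
  }
}
expected_loop # identical values to the outer() result
```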

Now, we can calculate the \(\chi^2\) statistic. Recall the equation:

\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]

where \(O\) represents the observed frequency and \(E\) represents the expected frequency.

```
chisqVal <- sum((obs - expected)^2 / expected)
chisqVal
```

`## [1] 17.34439`

The \(\chi^2\) statistic evaluates to 17.3443877.
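As a sanity check, base R's `chisq.test()` function computes the same statistic (along with the degrees of freedom and a p-value) directly from the table of observed frequencies:

```
# obs as constructed above; rebuilt here so this chunk runs on its own
obs <- as.table(matrix(c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7),
                       nrow = 3, byrow = TRUE))
chisq.test(obs)$statistic # X-squared, matches chisqVal
```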

Before we finally calculate the contingency coefficient, let us look at a
nice built-in plotting function for visualizing categorical data. The
`mosaicplot()` function visualizes contingency tables and
helps to assess the distributions of the data and possible
dependencies.

`mosaicplot(obs, main = "Observations")`

The contingency coefficient, denoted as \(C^*\), adjusts the \(\chi^2\) statistic by the sample size, \(n\). It can be written as

\[C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}\]

where \(\chi^2\) corresponds to the \(\chi^2\) statistic and \(n\) corresponds to the number of observations.

When there is no relationship between two variables, \(C^*\) is close to \(0\). The contingency coefficient \(C^*\) can never exceed \(1\), but it may be less than \(1\) even when two variables are perfectly related to each other. Since this is not desirable, \(C^*\) is adjusted so that it reaches a maximum of \(1\) when there is complete association in a table of any number of rows and columns. The maximum attainable value, denoted \(C^*_{max}\), is calculated as follows:

\[C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}\]

where \(k\) is the number of rows or the number of columns, whichever is smaller: \(k=\min(\text{rows},\text{columns})\).

Then the adjusted contingency coefficient is computed by

\[C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}\text{.}\]

In the section above the \(\chi^2\) statistic was assigned to the variable `chisqVal` and
evaluated to 17.3443877. Now, we plug that value into the equation for
the contingency coefficient, \(C^*\).

```
C_star <- sqrt(chisqVal / (sum(obs) + chisqVal))
C_star
```

`## [1] 0.3238805`

The contingency coefficient \(C^*\) evaluates to 0.3238805.

Finally, we apply the equation for the adjusted contingency coefficient, \(C\).

```
k <- min(nrow(obs), ncol(obs))
C_star_max <- sqrt((k - 1) / k)
C <- C_star / C_star_max
C
```

`## [1] 0.3966709`

The adjusted contingency coefficient \(C\) evaluates to 0.3966709. Recall that the contingency coefficient ranges from \(0\) to \(1\). A contingency coefficient of roughly \(0.4\) does not indicate a strong relation between the exam results and the students' willingness to complete exercises during the semester.
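The steps above can be collected into a small helper function. Note that `contingency_coeff` is our own name for this sketch, not a standard R function:

```
# adjusted contingency coefficient C for a two-way table
contingency_coeff <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab) # expected frequencies
  chisq <- sum((tab - expected)^2 / expected)              # chi^2 statistic
  k <- min(dim(tab))                                       # min(rows, columns)
  C_star <- sqrt(chisq / (sum(tab) + chisq))               # unadjusted C*
  C_star / sqrt((k - 1) / k)                               # divide by C*_max
}

# obs as constructed above; rebuilt here so this chunk runs on its own
obs <- as.table(matrix(c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7),
                       nrow = 3, byrow = TRUE))
contingency_coeff(obs) # matches the step-by-step result above
```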

Before we end this section we want to point out the **DescTools** package (*Tools for Descriptive Statistics and Exploratory
Data Analysis*). The package is a collection of miscellaneous
basic statistic functions and convenience wrappers for efficient data
description. It is a toolbox which facilitates the (notoriously time
consuming) first descriptive tasks in data analysis: calculating
descriptive statistics, drawing graphical summaries and
reporting the results.

Feel free to install the package
(`install.packages("DescTools")`) and play around with it. In
the code segment below the `Assocs()` function from the
`DescTools` package is applied to our observation data set
from above. The function returns several association measures
simultaneously.

```
library(DescTools)
Assocs(obs)
```

```
##                         estimate  lwr.ci  upr.ci
## Contingency Coeff.        0.3239       -       -
## Cramer V                  0.2421  0.0661  0.3258
## Kendall Tau-b            -0.1648 -0.2981 -0.0315
## Goodman Kruskal Gamma    -0.2319 -0.4182 -0.0457
## Stuart Tau-c             -0.1720 -0.3105 -0.0335
## Somers D C|R             -0.1759 -0.3187 -0.0331
## Somers D R|C             -0.1545 -0.2791 -0.0298
## Pearson Correlation      -0.1767 -0.3286 -0.0158
## Spearman Correlation     -0.1940 -0.3445 -0.0337
## Lambda C|R                0.1456  0.0083  0.2830
## Lambda R|C                0.1294  0.0000  0.2974
## Lambda sym                0.1383  0.0072  0.2693
## Uncertainty Coeff. C|R    0.0418  0.0024  0.0813
## Uncertainty Coeff. R|C    0.0532  0.0031  0.1034
## Uncertainty Coeff. sym    0.0469  0.0027  0.0910
## Mutual Information        0.0827       -       -
```
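If all you need is the contingency coefficient itself, `DescTools` also offers the dedicated `ContCoef()` function. As a sketch (assuming the package is installed), its `correct` argument applies the \(C^*_{max}\) adjustment shown above:

```
library(DescTools) # assumes DescTools is installed

# obs as constructed above; rebuilt here so this chunk runs on its own
obs <- as.table(matrix(c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7),
                       nrow = 3, byrow = TRUE))
ContCoef(obs)                 # unadjusted C*
ContCoef(obs, correct = TRUE) # adjusted C
```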

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*