The contingency coefficient, \(C\), is a \(\chi^2\)-based measure of association for categorical data. It relies on the \(\chi^2\) test for independence. The \(\chi^2\) statistic allows us to assess whether there is a statistical relationship between the variables of a contingency table (also known as a two-way table, cross-tabulation table or cross tab). In this kind of table the distribution of the variables is shown in matrix format.
In order to calculate the contingency coefficient \(C\) we first have to determine the \(\chi^2\) statistic.
The \(\chi^2\) statistic is given by
\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]
where \(O\) represents the observed frequency and \(E\) represents the expected frequency. Please note that \(\frac{(O-E)^2}{E}\) is evaluated for each cell of a contingency table and then summed up.
We showcase an example to explain the calculation of the \(\chi^2\) statistic based on categorical observation data in more depth. Consider an exam at the end of the semester. There are three groups of students: students either passed the exam, did not pass, or did not participate. Further, there have been exercises for the students to work on throughout the semester. We categorize the number of exercises each particular student completed into four groups: none, less than half \((<0.5)\), more than half \((>0.5)\), or all of them.
The resulting contingency table looks like this:
\[ \begin{array}{l|cccc} \hline \ & \text{None} & <0.5 & > 0.5 & \text{all} \\ \hline \ \text{passed} & 12 & 13 & 24 & 14 \\ \ \text{not passed} & 22 & 11 & 8 & 6 \\ \ \text{not participated} & 11 & 14 & 6 & 7 \\ \hline \end{array} \]
First, let us construct a table object and assign it the name obs to remind us that this data corresponds to the observed frequencies:
data <- c(12, 13, 24, 14, 22, 11, 8, 6, 11, 14, 6, 7) # data
obs <- matrix(data, nrow = 3, byrow = TRUE) # structure in a matrix
obs <- as.table(obs) # transform into table
dimnames(obs) <- list(
"Exam" = c("passed", "not passed", "not participated"),
"HomeWork" = c("None", "<0.5", ">0.5", "all")
) # name axes
obs
## HomeWork
## Exam None <0.5 >0.5 all
## passed 12 13 24 14
## not passed 22 11 8 6
## not participated 11 14 6 7
Perfect, now we have a proper representation of our data in R. However, one piece is still missing to complete the contingency table: the row sums and column sums.
There are several ways to compute the row and column sums in R. One option is to use a function of the apply() family, or to make use of the rowSums() and colSums() functions. Here we will use R's built-in margin.table() function. The function takes two arguments: the data and a number indicating whether to sum row-wise (1) or column-wise (2).
margin_row <- margin.table(obs, 1) # row-wise
margin_row
## Exam
## passed not passed not participated
## 63 47 38
margin_col <- margin.table(obs, 2) # column-wise
margin_col
## HomeWork
## None <0.5 >0.5 all
## 45 38 38 27
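As mentioned above, the marginal sums can also be obtained with a function of the apply() family or with the rowSums() and colSums() shortcuts. A small sketch, assuming the obs matrix from the example, confirming that all approaches agree:

```r
# observed frequencies from the example above
obs <- as.table(matrix(c(12, 13, 24, 14,
                         22, 11,  8,  6,
                         11, 14,  6,  7), nrow = 3, byrow = TRUE))

apply(obs, 1, sum)  # row sums, equivalent to margin.table(obs, 1)
rowSums(obs)        # same result
apply(obs, 2, sum)  # column sums, equivalent to margin.table(obs, 2)
colSums(obs)        # same result
```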
Putting all pieces together the contingency table looks like this:
\[ \begin{array}{l|cccc|c} \hline \ & \text{None} & <0.5 & > 0.5 & \text{all} & \text{row sum} \\ \hline \ \text{passed} & 12 & 13 & 24 & 14 & 63\\ \ \text{not passed} & 22 & 11 & 8 & 6 & 47\\ \ \text{not participated} & 11 & 14 & 6 & 7 & 38\\ \hline \ \text{column sum} & 45 & 38 & 38 & 27 & \\ \end{array} \]
Great, now we have a table filled with the observed frequencies. In the next step we calculate the expected frequencies. To calculate the expected frequencies \((E)\) we apply this equation:
\[ E = \frac{R\times C}{n} \text{,}\]
where \(R\) is the row total, \(C\) is the column total and \(n\) is the sample size.
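For example, for the cell (passed, None) the row total is \(R = 63\), the column total is \(C = 45\) and the sample size is \(n = 148\); a quick check of that single cell:

```r
# expected frequency for the cell (passed, None): E = R * C / n
E_passed_none <- (63 * 45) / 148
round(E_passed_none, 5)  # 19.15541
```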
Please note that we have to calculate the expected frequency for each particular table entry, thus we have to do \(3 \times 4 = 12\) calculations. Again, R provides several ways to achieve that task. The option of building a nested for loop to go through every cell and do the calculations step-wise is definitely fine! However, we can also use the vectorized, and thus much faster, outer() function in combination with the rowSums() and colSums() functions. We assign the result to a variable denoted expected, to remind us that this table corresponds to the expected frequencies.
expected <- outer(rowSums(obs), colSums(obs), FUN = "*") / sum(data)
expected
## None <0.5 >0.5 all
## passed 19.15541 16.175676 16.175676 11.493243
## not passed 14.29054 12.067568 12.067568 8.574324
## not participated 11.55405 9.756757 9.756757 6.932432
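For completeness, the nested for loop mentioned above would produce the same table as outer(); a sketch, assuming the observed frequencies from the example:

```r
obs <- matrix(c(12, 13, 24, 14,
                22, 11,  8,  6,
                11, 14,  6,  7), nrow = 3, byrow = TRUE)
n <- sum(obs)

# fill the expected-frequency table cell by cell: E = R * C / n
expected_loop <- matrix(0, nrow = nrow(obs), ncol = ncol(obs))
for (i in seq_len(nrow(obs))) {
  for (j in seq_len(ncol(obs))) {
    expected_loop[i, j] <- sum(obs[i, ]) * sum(obs[, j]) / n
  }
}

# agrees (up to floating point) with the vectorized version
all.equal(expected_loop, outer(rowSums(obs), colSums(obs)) / n)
```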
Now, we can calculate the \(\chi^2\) statistic. Recall the equation:
\[\chi^2= \sum{\frac{(O-E)^2}{E}}\text{,} \]
where \(O\) represents the observed frequency and \(E\) represents the expected frequency.
chisqVal <- sum((obs - expected)^2 / expected)
chisqVal
## [1] 17.34439
The \(\chi^2\) statistic evaluates to 17.3443877.
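As a sanity check, the manually computed value can be compared against R's built-in chisq.test() function, which reports the same Pearson \(\chi^2\) statistic for an \(r \times c\) table:

```r
obs <- matrix(c(12, 13, 24, 14,
                22, 11,  8,  6,
                11, 14,  6,  7), nrow = 3, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
chisqVal <- sum((obs - expected)^2 / expected)

# cross-check against the built-in test of independence
unname(chisq.test(obs)$statistic)  # matches chisqVal
```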
Before we finally calculate the contingency coefficient, note that R offers a nice built-in plotting function to visualize categorical data. The mosaicplot() function visualizes contingency tables and helps to assess the distributions of the data and possible dependencies.
mosaicplot(obs, main = "Observations")
The contingency coefficient, denoted as \(C^*\), adjusts the \(\chi^2\) statistic by the sample size, \(n\). It can be written as
\[C^*=\sqrt{\frac{\chi^2}{n+\chi^2}}\text{,}\]
where \(\chi^2\) corresponds to the \(\chi^2\) statistic and \(n\) corresponds to the number of observations.
When there is no relationship between the two variables, \(C^*\) is close to \(0\). The contingency coefficient \(C^*\) can never exceed \(1\), but it may be less than \(1\) even when two variables are perfectly related to each other. Since this is not desirable, \(C^*\) is adjusted so that it reaches a maximum of \(1\) when there is complete association in a table of any number of rows and columns. The adjustment factor is denoted \(C^*_{max}\) and calculated as follows:
\[C^*_{max}=\sqrt{\frac{k-1}{k}}\text{,}\]
where \(k\) is the number of rows or the number of columns, whichever is smaller, \(k=\min(\text{rows},\text{columns})\).
Then the adjusted contingency coefficient is computed by
\[C=\frac{C^*}{C^*_{max}}=\sqrt{\frac{k \cdot \chi^2}{(k-1)(n+\chi^2)}}\text{.}\]
In the section above the \(\chi^2\) statistic was assigned to the variable chisqVal and was calculated as 17.3443877. Now, we plug that value into the equation for the contingency coefficient, \(C^*\).
C_star <- sqrt(chisqVal / (sum(obs) + chisqVal))
C_star
## [1] 0.3238805
The contingency coefficient \(C^*\) evaluates to 0.3238805.
Finally, we apply the equation for the adjusted contingency coefficient, \(C\).
k <- min(nrow(obs), ncol(obs))
C_star_max <- sqrt((k - 1) / k)
C <- C_star / C_star_max
C
## [1] 0.3966709
The adjusted contingency coefficient \(C\) evaluates to 0.3966709. Recall that the contingency coefficient ranges from \(0\) to \(1\). A coefficient of roughly \(0.4\) does not indicate a strong relation between the exam results and the students' willingness to complete exercises during the semester.
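The whole procedure can be wrapped into a small helper function; a sketch, where the function name adjusted_C is our own choice rather than part of any package:

```r
# adjusted contingency coefficient C = C* / C*_max for a contingency table
adjusted_C <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n  # E = R * C / n per cell
  chisq <- sum((tab - expected)^2 / expected)        # chi-squared statistic
  C_star <- sqrt(chisq / (n + chisq))                # contingency coefficient
  k <- min(nrow(tab), ncol(tab))
  C_star / sqrt((k - 1) / k)                         # divide by C*_max
}

obs <- matrix(c(12, 13, 24, 14,
                22, 11,  8,  6,
                11, 14,  6,  7), nrow = 3, byrow = TRUE)
round(adjusted_C(obs), 7)  # 0.3966709
```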
Before we end this section we want to point out the DescTools package (Tools for Descriptive Statistics and Exploratory Data Analysis). The package is a collection of miscellaneous basic statistic functions and convenience wrappers for efficient data description. It is a toolbox which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results.
Feel free to download the package (install.packages("DescTools")) and play around with it. In the code segment below the Assocs() function from the DescTools package is applied to our observation data set from above. The function returns several association measures simultaneously.
library(DescTools)
Assocs(obs)
## estimate lwr.ci upr.ci
## Contingency Coeff. 0.3239 - -
## Cramer V 0.2421 0.0661 0.3258
## Kendall Tau-b -0.1648 -0.2981 -0.0315
## Goodman Kruskal Gamma -0.2319 -0.4182 -0.0457
## Stuart Tau-c -0.1720 -0.3105 -0.0335
## Somers D C|R -0.1759 -0.3187 -0.0331
## Somers D R|C -0.1545 -0.2791 -0.0298
## Pearson Correlation -0.1767 -0.3286 -0.0158
## Spearman Correlation -0.1940 -0.3445 -0.0337
## Lambda C|R 0.1456 0.0083 0.2830
## Lambda R|C 0.1294 0.0000 0.2974
## Lambda sym 0.1383 0.0072 0.2693
## Uncertainty Coeff. C|R 0.0418 0.0024 0.0813
## Uncertainty Coeff. R|C 0.0532 0.0031 0.1034
## Uncertainty Coeff. sym 0.0469 0.0027 0.0910
## Mutual Information 0.0827 - -
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.