Contingency tables (also called cross tabulations) are tables showing the intersections of two variables. For example, there are two variants of the preposition toward(s) (“in the direction of”): one with an s at the end and one without. There are several national varieties of English, very prominent among them British English and American English. Both variants of the preposition occur in both varieties. Thus, we have two variables: Variant of the Preposition (with the values toward and towards) and Variety of English (with the values British and American). This gives us four intersections: British ∩ toward, British ∩ towards, American ∩ toward and American ∩ towards. If we check the frequency of these intersections in the LOB and BROWN corpora and represent the results as a contingency table, we get the following:
| | toward | towards | Total |
|---|---|---|---|
| British | 318 | 14 | 332 |
| American | 64 | 386 | 450 |
| Total | 382 | 400 | 782 |
There are two ways of creating a contingency table in R: you can enter the values manually, or you can create the table from a raw data list in the form of a data frame.
In order to create a contingency table manually, you first have to create a vector (see Vectors) containing the values, and store this vector in a variable. For the table above, this vector would look like this (if we call the variable myvector – of course, we can give it any name we want):
c(318, 64, 14, 386) -> myvector
These are the values in the first column followed by those in the second column – the totals are not part of the table – if we need them, we can have R calculate them later.
The first step in transforming this vector to a table is to use the command matrix(), which takes a vector as input and transforms it to a table with a certain number of columns, specified using the ncol option. In our case, this would look as follows (if we call the variable mytable):
matrix(myvector, ncol=2) -> mytable
If you display this variable (by typing mytable and hitting return), you get the following:
     [,1] [,2]
[1,]  318   14
[2,]   64  386
The values are displayed in the right way, but the rows and columns do not have names yet. You can refer to them by using the indices shown: for example, to display the first row of the table, type mytable[1,]; to display the second column, type mytable[,2]; and to display a specific cell, give both the row and the column number, e.g. mytable[1,2] to display the second cell in the first row.
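The steps so far can be collected in one short sketch. The byrow argument of matrix() is not used in the text above; it is shown here only as an alternative for readers who prefer to type the values row by row, and the variable name mytable_byrow is made up for this illustration.

```r
# Values entered column by column, as in the text above
myvector <- c(318, 64, 14, 386)
mytable <- matrix(myvector, ncol = 2)

mytable[1, ]   # first row: 318 14
mytable[, 2]   # second column: 14 386
mytable[1, 2]  # second cell in the first row: 14

# Alternative (not used in the text): enter the values row by row
mytable_byrow <- matrix(c(318, 14, 64, 386), ncol = 2, byrow = TRUE)
identical(mytable, mytable_byrow)  # TRUE
```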
Strictly speaking, this is all you need to use this contingency table in other contexts, but you may want to add row and column labels so that you and others know what information is contained in this table. To add row and column labels, you use the functions rownames() and colnames(): as their names suggest, these functions provide access to the parts of a contingency table that contain the row and column labels, so you can simply construct a vector that contains the correct number of text strings and assign this vector to the relevant part of the table:
rownames(mytable) <- c("British", "American")
colnames(mytable) <- c("toward", "towards")
If you now display the table (by typing mytable and hitting return), you get the following:
         toward towards
British     318      14
American     64     386
You can still refer to the columns, rows and cells in the way just described, but you can also use the labels instead of numbers (you need to put them in quotation marks, as they are text strings). For example, to display the first row of the table, you can type mytable["British",]; to display the second column, you can type mytable[,"towards"]; and to display a specific cell, give both the row and the column label, e.g. mytable["British","towards"] to display the second cell in the first row.
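As a possible shortcut, the labels can also be supplied when the matrix is created, using the dimnames argument of matrix(). This is a sketch of an alternative to the rownames() and colnames() route described above, not a replacement for it; the result should be the same labeled table.

```r
# Build the labeled table in a single step via the dimnames argument
mytable <- matrix(
  c(318, 64, 14, 386),
  ncol = 2,
  dimnames = list(
    c("British", "American"),  # row labels
    c("toward", "towards")     # column labels
  )
)

mytable["British", ]           # first row, with column labels
mytable[, "towards"]           # second column, with row labels
mytable["British", "towards"]  # single cell: 14
mytable[1, "towards"]          # numbers and labels can be mixed: 14
```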
If you have imported a raw data table as a data frame (see Data Frames), you can cross-tabulate two columns of this data frame to create a contingency table. There is a sample csv file containing the distribution of the word forms toward and towards across different genres in British and American English (from the LOB and BROWN corpora) here: data-towards.csv. Import it into a data frame called Toward (as described in Importing Data).
You can now create a contingency table using the table() command, which needs two columns from the data frame as input. To see which columns the data frame contains, first use the command head() to display its first few rows:
head(Toward)
You will see the following:
  Variety           Genre Variant
1 British Press_Reportage towards
2 British Press_Reportage towards
3 British Press_Reportage  toward
4 British Press_Reportage towards
5 British Press_Reportage towards
6 British Press_Reportage towards
The first and the third column are relevant to our contingency table. They can be referred to by Toward$Variety and Toward$Variant, so the following command will produce a contingency table:
table(Toward$Variety,Toward$Variant) -> mytable
Type mytable to display it:
         toward towards
American    386      64
British      14     318
As you can see, this is the same table you created manually in the preceding section, but the rows and columns are ordered differently: the table() command orders rows and columns alphabetically. If you don't like this order, you can reorder them (see below).
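For readers who do not have data-towards.csv at hand, the behaviour of table() can be tried out on a small data frame. The sketch below uses invented toy data (the names toy and toytable and the five rows are hypothetical, not taken from the corpora) and also shows that naming the arguments adds the variable names to the printout.

```r
# Hypothetical toy data; the real data come from data-towards.csv
toy <- data.frame(
  Variety = c("British", "British", "American", "American", "American"),
  Variant = c("towards", "toward", "toward", "toward", "towards")
)

# Cross-tabulate two columns; rows and columns come out alphabetically
toytable <- table(toy$Variety, toy$Variant)
toytable

# Naming the arguments adds the variable names to the printout
table(Variety = toy$Variety, Variant = toy$Variant)
```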
You can add rows or columns to an existing matrix, no matter how you created it. For rows, this is done by using the rbind() command. First, create a variable containing a vector with the numbers you want to add as a row, and name this variable as you want the new row to be named. For example, to add the frequencies of toward and towards in Indian English to the table (the data are from the KOLHAPUR corpus):
Indian <- c(18,337)
The rbind() command needs two arguments: the matrix to which you want to add a row, and the vector containing the row you want to add. Let us write the result to the same variable mytable:
rbind(mytable, Indian) -> mytable
If you now type mytable, you will get the following:
         toward towards
American    386      64
British      14     318
Indian       18     337
The cbind() command works in the same way. For example, to add a column containing the frequencies of the expression in the direction of, we create a corresponding variable (the frequencies are from BROWN, LOB and KOLHAPUR):
in_the_direction_of <- c(11,12,11)
We then add this column to our table:
cbind(mytable, in_the_direction_of) -> mytable
Typing mytable gives us:
         toward towards in_the_direction_of
American    386      64                  11
British      14     318                  12
Indian       18     337                  11
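If you prefer not to create a separate variable just to carry the label, rbind() and cbind() also accept named arguments and use the argument name as the new row or column label. The sketch below rebuilds the two-by-two table by hand (an assumption made here so that the example runs on its own) and then adds the same row and column as above.

```r
# Start from the table as produced by table() in the text
mytable <- matrix(
  c(386, 14, 64, 318),
  ncol = 2,
  dimnames = list(c("American", "British"), c("toward", "towards"))
)

# A named argument supplies the new row or column label directly
mytable <- rbind(mytable, Indian = c(18, 337))
mytable <- cbind(mytable, in_the_direction_of = c(11, 12, 11))

dim(mytable)  # 3 rows, 3 columns
mytable
```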
As mentioned above, your contingency table should contain only the intersections of your variables (the frequencies in the body of the table in the introduction), not the row totals, column totals and table total: statistical procedures expect a matrix to contain only data; if the totals are needed, they will be calculated internally. Also, if you want to create a box plot from a contingency table (see Box Plots), it should not contain any totals. However, when you show a table in a research report, it should contain totals, so here is how to add them.
R has special commands for creating these totals: rowSums() and colSums(), which take a matrix as an argument and produce a variable containing the row or column totals. For example, typing rowSums(mytable) produces the following output:
American  British   Indian
     461      344      366
Let us add these row totals to our table using the cbind() command (note that the row totals must be added as a column and vice versa) and store the result in a new variable mytable_totals:
cbind(mytable,rowSums(mytable)) -> mytable_totals
Typing mytable_totals displays the following:
         toward towards in_the_direction_of
American    386      64                  11 461
British      14     318                  12 344
Indian       18     337                  11 366
If you want to add the column name Total, use the colnames() command introduced above – since you only want to change the fourth position, attach [4] to the end:
colnames(mytable_totals)[4] <- "Total"
Now, let us add the column totals to mytable_totals as a new row, storing the result in the same variable:
rbind(mytable_totals,colSums(mytable_totals)) -> mytable_totals
Let's add the row name Total using rownames():
rownames(mytable_totals)[4] <- "Total"
Typing mytable_totals now gives us the following:
         toward towards in_the_direction_of Total
American    386      64                  11   461
British      14     318                  12   344
Indian       18     337                  11   366
Total       418     719                  34  1171
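The whole procedure can be written more compactly, again using named arguments for the labels. As an addition to the text, the sketch below also shows addmargins(), a base R function that adds both margins in one call; note that it labels them Sum rather than Total by default. The table is rebuilt by hand so that the example runs on its own, and the names with_totals and with_margins are made up for this illustration.

```r
# The 3 x 3 table from the text, without totals
mytable <- matrix(
  c(386, 14, 18, 64, 318, 337, 11, 12, 11),
  ncol = 3,
  dimnames = list(
    c("American", "British", "Indian"),
    c("toward", "towards", "in_the_direction_of")
  )
)

# Manual route from the text, in compact form
with_totals <- cbind(mytable, Total = rowSums(mytable))
with_totals <- rbind(with_totals, Total = colSums(with_totals))

# Alternative (an addition to the text): addmargins() adds both margins
# at once; it labels them "Sum" by default
with_margins <- addmargins(as.table(mytable))

with_totals["Total", "Total"]  # grand total
with_margins["Sum", "Sum"]     # same grand total
```

The original mytable is left untouched in both cases, so it can still be passed to statistical procedures that expect a table without totals.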
As mentioned above, table() puts your rows and columns in alphabetical order. If you want them in a different order, there are various ways of doing so – none of them very straightforward, but also not very complicated. The easiest way is to exploit the possibility of accessing individual cells by giving the row and column number in square brackets, as shown above.
Instead of giving an individual row and column number, you can give a vector of numbers. For example, to display the top left cell of our table (the cell in the first row and first column), you type mytable[1,1] – so, to display all cells of our table, you type mytable[c(1,2,3), c(1,2,3)] (try it).
Now, if you want rows and/or columns displayed in a different order, you simply order them differently in the vectors. For example, the order in the variable mytable is as follows:
         toward towards in_the_direction_of
American    386      64                  11
British      14     318                  12
Indian       18     337                  11
If you want to change the order of rows to British, Indian, American, and the order of columns to in_the_direction_of, towards, toward, you type:
mytable[c(2,3,1), c(3,2,1)]
This gives you:
         in_the_direction_of towards toward
British                   12     318     14
Indian                    11     337     18
American                  11      64    386
Of course, you can store this new order in a variable, if you want.
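Since rows and columns can be addressed by label as well as by number, the same reordering can presumably be written with vectors of labels, which is easier to read and does not depend on the current positions. The following sketch rebuilds the table by hand so that it runs on its own; the names row_order, col_order and reordered are made up for this illustration.

```r
# The 3 x 3 table from the text
mytable <- matrix(
  c(386, 14, 18, 64, 318, 337, 11, 12, 11),
  ncol = 3,
  dimnames = list(
    c("American", "British", "Indian"),
    c("toward", "towards", "in_the_direction_of")
  )
)

# Reorder by label instead of by position
row_order <- c("British", "Indian", "American")
col_order <- c("in_the_direction_of", "towards", "toward")
reordered <- mytable[row_order, col_order]

# Same result as the numeric version used in the text
identical(reordered, mytable[c(2, 3, 1), c(3, 2, 1)])  # TRUE
```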
Matrices in R behave pretty much like matrices in the mathematical sense, for example in operations such as addition, subtraction, multiplication etc. To transpose a matrix (that is, to swap its rows and columns), use the command t(), with the matrix as argument.
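A brief sketch of these operations, using the two-by-two table rebuilt by hand. One detail worth knowing: the * operator in R multiplies cell by cell, while the matrix product in the mathematical sense is written %*%. The name transposed is made up for this illustration.

```r
mytable <- matrix(
  c(386, 14, 64, 318),
  ncol = 2,
  dimnames = list(c("American", "British"), c("toward", "towards"))
)

mytable + mytable    # element-wise addition
mytable * 2          # every cell multiplied by 2
mytable * mytable    # element-wise product, not the matrix product
mytable %*% mytable  # matrix product in the mathematical sense

# t() swaps rows and columns
transposed <- t(mytable)
transposed["toward", "British"]  # same cell as mytable["British", "toward"]: 14
```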