[ Collection: Introduction to CQP ]

4c. Collocate lists and tables

This section introduces a method for creating collocate lists and tables from a concordance. It presupposes that you have read Section 1, Section 2 and Section 3f. Attention: This is not a built-in function of the Corpus Workbench, but an extension in the form of a script (a small program) provided by our workgroup. If you want to work with it outside of our installation, read the instructions at the end of this section.

Two ways of summarizing collocates

Many concordancers offer some way of quickly summarizing the collocates of a search term – the words occurring in a certain span around the search term.

One type of summary is an (alphabetical) collocate list that gives the frequency of every word in the span on the left of the search term, the span on the right, and every individual position within these spans, as in the following example (for the lemma love in the BNC):

For example, within a span of four words to the left and four words to the right of the lemma love, the word form ability occurs nine times – seven times on the left and twice on the right. Specifically, it occurs three times at the position four words to the left and four times at the position two words to the left, and once each at the positions three and four words to the right.
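The counting behind such a collocate list can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of collocates.pl: the function name, the data layout (one pair of left and right token lists per concordance line) and the toy concordance are assumptions made for this example.

```python
from collections import Counter, defaultdict

SPAN = 4  # four words on each side of the search term, as in the example above


def collocate_list(lines, span=SPAN):
    """Count every word per position; `lines` holds (left_tokens, right_tokens) pairs."""
    counts = defaultdict(Counter)  # word -> Counter keyed by signed position
    for left, right in lines:
        # Position -1 is the word immediately to the left of the search term.
        for offset, word in enumerate(reversed(left[-span:]), start=1):
            counts[word][-offset] += 1
        # Position 1 is the word immediately to the right of the search term.
        for offset, word in enumerate(right[:span], start=1):
            counts[word][offset] += 1

    rows = []
    for word in sorted(counts):  # alphabetical order, as in a collocate list
        by_position = counts[word]
        left_total = sum(n for pos, n in by_position.items() if pos < 0)
        right_total = sum(n for pos, n in by_position.items() if pos > 0)
        rows.append((word, left_total + right_total, left_total, right_total,
                     dict(by_position)))
    return rows


# Invented toy concordance (not BNC data); the search term itself is left out.
toy = [
    (["the", "ability", "to", "truly"], ["and", "be", "loved", "."]),
    (["I", "have", "always", "been", "in"], ["with", "the", "ability", "to"]),
]

for row in collocate_list(toy):
    print(row)
```

Each row holds the word, its total frequency in the span, the totals for the left and right side, and the counts at the individual positions (negative numbers for the left, positive numbers for the right).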

Another type of summary is a collocate table, with the words at each position in the span ordered in decreasing frequency:

For example, the word form I is the most frequent token at the position one word to the left, followed by in and of; the comma is the most frequent token one word to the right, followed by the period and to.
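As a rough sketch of how such a table is put together, the following Python snippet counts the words separately for each position and then sorts each position by decreasing frequency. It is not the code of collocates.pl; the function name, the data layout and the toy concordance are invented for illustration, and breaking ties alphabetically is an assumption.

```python
from collections import Counter

SPAN = 4  # four positions on each side of the search term


def collocate_table(lines, span=SPAN):
    """Return, for each position, the words ordered by decreasing frequency."""
    positions = list(range(-span, 0)) + list(range(1, span + 1))
    per_position = {pos: Counter() for pos in positions}
    for left, right in lines:
        for offset, word in enumerate(reversed(left[-span:]), start=1):
            per_position[-offset][word] += 1
        for offset, word in enumerate(right[:span], start=1):
            per_position[offset][word] += 1
    # Most frequent word first; ties are broken alphabetically (an assumption).
    return {
        pos: sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        for pos, counter in per_position.items()
    }


# Invented toy concordance with one word of context on each side.
toy = [
    (["I"], [","]), (["I"], [","]), (["in"], ["."]),
    (["I"], ["to"]), (["in"], [","]), (["of"], ["."]),
]

table = collocate_table(toy)
print(table[-1])  # words one position to the left
print(table[1])   # words one position to the right
```

Unlike the collocate list, this view makes it easy to see what is typical at a particular position, but it no longer shows the total frequency of a word across the whole span.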

Summarizing collocates in CQP

So, how do we create such summaries in CQP? We can't – the Corpus Workbench does not offer such a function. However, in our installation, it is provided as part of our collection of scripts for transforming the output of the CWB: instead of exporting a concordance to an external file directly, we can apply a script called collocates.pl to the output, and create the two types of summary shown above.

1. Preparing the concordance

The script creates summaries for a span of four words to the left and to the right of a search term, so the first step is to make sure that our concordance has a context of four words to the left and to the right. This is done using the set Context command (see Section 3f):

set Context 4 words
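To make clear what a context of four words means, here is a small Python sketch that cuts the same window out of a list of tokens. CQP does this for us; the function name and the example sentence are made up, and the sketch assumes that the match consists of a single token.

```python
def context(tokens, match_index, width=4):
    """Return up to `width` tokens to the left and to the right of a one-token match."""
    left = tokens[max(0, match_index - width):match_index]
    right = tokens[match_index + 1:match_index + 1 + width]
    return left, right


# Invented example sentence; "love" is the token at index 6.
tokens = "She said that she would always love him more than anything".split()
left, right = context(tokens, tokens.index("love"))
print(left, right)
```

Near the beginning or end of a text, the window simply contains fewer than four tokens.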

After having set the context to four words, let us create a concordance of the lemma love, and save it to a variable called Love (there is no song containing the line a variable called Love, but maybe there should be):

Love = [hw="love"]

2. Transforming and exporting the concordance

We then export the concordance as described in Section 3f, but instead of sending the output directly to a file, we send it to the script collocates.pl and then to the file. If our concordance is called Love, we type:

cat Love  > "| collocates.pl > love.csv"

This writes the output to a csv (comma-separated values) file that can be opened with a spreadsheet editor such as LibreOffice Calc, Apple Numbers, Microsoft Excel or Google Sheets. Depending on how we run the script, we can get either a collocate table or a collocate list.

Collocate tables

If we run the script as just shown, it will create a collocate table – in our view, if you want a simple summary of collocates, this is the best format. The table will be case sensitive, i.e., In will be counted separately from in, Of separately from of, etc. Usually, we will want our collocate table to be case insensitive, i.e., to ignore case. To do this, we add -c after the name of the script:

cat Love  > "| collocates.pl -c > love-case-insensitive.csv"
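The effect of ignoring case can be illustrated with a short Python sketch. It assumes that case-insensitive counting amounts to lowercasing every word before counting; how collocates.pl folds case internally is not documented here, and the word list is invented.

```python
from collections import Counter

# Invented sample of words found at some position in the span.
words = ["In", "in", "Of", "of", "in", "I"]

case_sensitive = Counter(words)                        # "In" and "in" stay separate
case_insensitive = Counter(w.lower() for w in words)   # both are counted as "in"

print(case_sensitive)
print(case_insensitive)
```

Note that folding case also merges forms we might want to keep apart, such as the pronoun I and a lowercase i.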

Collocate lists

If we want the list format instead of the table format, we add -l (for list) after the name of the script:

cat Love  > "| collocates.pl -l > love-list.csv"

This will produce a list like that shown above. Again, this list will be case sensitive. If we do not want this, we combine the two options, -l (for list) and -c (for case-insensitive), as -lc:

cat Love  > "| collocates.pl -lc > love-list-case-insensitive.csv"

Summary and outlook

This section introduced an extension to our CWB installation that allows you to create collocate lists and tables. You can now read the following sections in any order:

Running the script outside of our installation

The script collocates.pl is a simple Perl script, so you can run it on any computer that has Perl installed (i.e., all computers running Linux or macOS; Windows users have to install Perl separately). You can download the script here.
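If you have saved a concordance to a file, one way of running the script from outside CQP is sketched below in Python. The sketch rests on several assumptions: that the script reads the concordance from standard input and writes the CSV to standard output (as the pipe used in CQP suggests), that the file has the same format that cat produces in CQP with a context of four words, and that the script is in the current directory. The function names and file names are hypothetical.

```python
import subprocess


def build_command(script="collocates.pl", as_list=False, ignore_case=False):
    """Build the command line using the options described in this section."""
    flags = ""
    if as_list:
        flags += "l"   # -l: collocate list instead of collocate table
    if ignore_case:
        flags += "c"   # -c: ignore case
    command = ["perl", script]
    if flags:
        command.append("-" + flags)
    return command


def run_collocates(concordance_path, csv_path, **options):
    """Feed a saved concordance to the script and write its output to csv_path."""
    with open(concordance_path, "rb") as source, open(csv_path, "wb") as target:
        subprocess.run(build_command(**options), stdin=source, stdout=target,
                       check=True)


print(build_command(as_list=True, ignore_case=True))
```

A call such as run_collocates("love.txt", "love-list.csv", as_list=True) would then correspond to the -l example above.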


cqp/collocates.txt · Last modified: 2024/01/21 07:09 by aamoakuh