--- cqp:collocates [2020/04/20 16:57] – created astefanowitsch
+++ cqp:collocates [2024/06/20 13:53] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+**[ [[cqp:introduction|Collection: Introduction to CQP]] ]**
+====== 4c. Collocate lists and tables ======
+//This section introduces a method for creating collocate lists and tables from a concordance. It presupposes that you have read [[cqp:corpus-structure|Section 1]], [[cqp:simple-queries|Section 2]] and [[cqp:concordances|Section 3f]]. **Attention:** This is not a built-in function of the Corpus Workbench, but an extension in the form of a script (a small program) provided by our workgroup. If you want to work with it outside of our installation, read the instructions at the end of this section.//
+===== Two ways of summarizing collocates =====
+Many concordancers offer some way of quickly summarizing the collocates of a search term -- the words occurring in a certain span around the search term.
+One type of summary is an (alphabetical) **collocate list** that gives the frequency of every word in the span on the left of the search term, the span on the right, and every individual position within these spans, as in the following example (for the lemma //love// in the BNC):
+{{ :cqp:collocate-list.png?nolink |}}
+For example, within a span of four words to the left and four words to the right of the lemma //love//, the word form //ability// occurs nine times -- seven times on the left and twice on the right. Specifically, it occurs three times at the position four words to the left and four times at the position two words to the left, and once each at the positions three and four words to the right.
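+The counting behind such a list can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of our script: the function names ''collocate_list'' and ''summarize'' are hypothetical, and the three concordance lines are invented toy data rather than BNC results.

```python
from collections import Counter, defaultdict

SPAN = 4  # four words to the left and to the right, as in the example above


def collocate_list(hits, span=SPAN):
    """Count every word at every position around the search term.

    hits: iterable of (left_tokens, right_tokens) pairs, one per concordance line.
    Returns {word: Counter({position: frequency})}, where positions run from
    -span to -1 (left of the search term) and from 1 to span (right of it).
    """
    counts = defaultdict(Counter)
    for left, right in hits:
        # The last token of the left context is position -1, the one before it -2, ...
        for offset, word in enumerate(reversed(left[-span:]), start=1):
            counts[word][-offset] += 1
        # The first token of the right context is position 1, the next one 2, ...
        for offset, word in enumerate(right[:span], start=1):
            counts[word][offset] += 1
    return counts


def summarize(positions):
    """Return (total, left, right) frequencies for one word."""
    left = sum(n for pos, n in positions.items() if pos < 0)
    right = sum(n for pos, n in positions.items() if pos > 0)
    return left + right, left, right


# Invented toy concordance lines (left context, right context); not real corpus data.
hits = [
    (["I", "fell", "in"], ["with", "her", "."]),
    (["we", "are", "in"], [",", "I", "think"]),
    (["I"], ["you", "."]),
]

counts = collocate_list(hits)
for word in sorted(counts):  # alphabetical order, as in a collocate list
    total, left, right = summarize(counts[word])
    print(word, total, left, right, dict(counts[word]))
```

+Each row of the output corresponds to one row of a collocate list: the word, its total frequency in the span, its frequency on the left and on the right, and its frequency at each individual position.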
+Another type of summary is a **collocate table**, with the words at each position in the span ordered in decreasing frequency:
+{{ :cqp:collocate-table.png?nolink |}}
+For example, the word form //I// is the most frequent token at the position one word to the left, followed by //in// and //of//; the comma is the most frequent token one word to the right, followed by the period and //to//.
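+A collocate table uses the same counts, but groups them by position and sorts each position by decreasing frequency. The following Python sketch illustrates this under the same assumptions as before: ''collocate_table'' is a hypothetical name, the data are invented, and the alphabetical tie-breaking is our own choice for the example, not necessarily what the script does.

```python
from collections import Counter


def collocate_table(hits, span=4):
    """Return {position: [(word, frequency), ...]} sorted by decreasing frequency.

    hits: iterable of (left_tokens, right_tokens) pairs, one per concordance line.
    """
    by_position = {pos: Counter() for pos in range(-span, span + 1) if pos != 0}
    for left, right in hits:
        for offset, word in enumerate(reversed(left[-span:]), start=1):
            by_position[-offset][word] += 1
        for offset, word in enumerate(right[:span], start=1):
            by_position[offset][word] += 1
    # Sort by frequency (descending); ties are broken alphabetically (an assumption).
    return {
        pos: sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        for pos, counter in by_position.items()
    }


# Invented toy data: one word of context on each side of the search term.
table_hits = [
    (["I"], [","]),
    (["I"], ["."]),
    (["in"], [","]),
]

table = collocate_table(table_hits)
print(table[-1])  # words one position to the left, most frequent first
print(table[1])   # words one position to the right, most frequent first
```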
+===== Summarizing collocates in CQP =====
+So, how do we create such summaries in CQP? We can't -- the Corpus Workbench does not offer such a function. However, in our installation, it is provided as part of our collection of scripts for transforming the output of the CWB: instead of exporting a concordance to an external file directly, we can apply a script called ''collocates.pl'' to the output, and create the two types of summary shown above.
+==== 1. Preparing the concordance ====
+The script creates summaries for a span of four words to the left and to the right of a search term, so the first step is to make sure that our concordance has a context of four words to the left and to the right. This is done using the ''set Context'' command (see [[cqp:concordances|Section 3f]]):
+	set Context 4 words
+After having set the context to four words, let us create a concordance of the lemma //love//, and save it to a variable called ''Love'' (there is no song containing the line //a variable called Love//, but maybe there should be):
+	Love = [hw="love"]
+==== 2. Transforming and exporting the concordance ====
+We then export the concordance as described in [[cqp:concordances#exporting_a_concordance_to_an_external_file|Section 3f]], but instead of sending the output directly to a file, we send it to the script ''collocates.pl'' and then to the file. If our concordance is called ''Love'', we type:
+	cat Love  > "| collocates.pl > love.csv"
+This writes the output to a ''csv'' (comma-separated values) file that can be opened with a spreadsheet editor such as LibreOffice Calc, Apple Numbers, Microsoft Excel or Google Sheets. Depending on how we run the script, we can get either a collocate table or a collocate list.
+===== Collocate tables =====
+If we run the script as just shown, it will create a collocate table -- in our view, if you want a simple summary of collocates, this is the best format. The table will be case sensitive, i.e., //In// will be counted separately from //in//, //Of// separately from //of//, etc. Usually, we will want our collocate table to be case insensitive, i.e., to ignore case. To do this, we add ''-c'' after the name of the script:
+	cat Love  > "| collocates.pl -c > love-case-insensitive.csv"
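+The effect of ignoring case can be illustrated with a small Python sketch. It assumes that case-insensitive counting amounts to lowercasing every token before counting; how the script implements ''-c'' internally is not documented here, and the token list is invented.

```python
from collections import Counter

# Invented tokens, as they might appear at one position in the span.
tokens = ["In", "in", "Of", "of", "in"]

# Case sensitive: "In" and "in" are counted separately.
case_sensitive = Counter(tokens)

# Case insensitive: every token is lowercased before counting (an assumption).
case_insensitive = Counter(token.lower() for token in tokens)

print(case_sensitive)
print(case_insensitive)
```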
+===== Collocate lists =====
+If we want the list format instead of the table format, we add ''-l'' (for //list//) after the name of the script:
+	cat Love  > "| collocates.pl -l > love-list.csv"
+This will produce a list like that shown above. Again, this list will be case sensitive. If we do not want this, we combine both options, ''-l'' (for //list//) and ''-c'' (for //case-insensitive//), as ''-lc'':
+	cat Love  > "| collocates.pl -lc > love-list-case-insensitive.csv"
+===== Summary and outlook =====
+This section introduced an extension to our CWB installation that allows you to create collocate lists and tables. You can now read the following two sections in any order:
+  * [[cqp:frequency-lists|Section 4a]]: Frequency lists
+  * [[cqp:grouping-data|Section 4b]]: Grouping data
+===== Running the script outside of our installation =====
+The script ''collocates.pl'' is a simple Perl script, so you can run it on any computer that has Perl installed (i.e., all computers running Linux or macOS; Windows users have to install Perl separately). You can download the script {{ :cqp:collocates.pl.zip | here}}.
+**[ Introduction to CQP: [[cqp:corpus-structure|Section 1]] -- [[cqp:simple-queries|Section 2]] -- [[cqp:advanced-querying|Section 3]] -- [[cqp:beyond-queries|Section 4]] -- [[cqp:expert-tricks|Section 5]]  -- [[cqp:exercises|Section 6]] ]**