[ Collection: Introduction to CQP ]

3g. Sorting and sampling

This section explains how to sort a concordance and how to create random samples from a concordance. It presupposes that you have read Section 1 and Section 2. The last subsection introduces more ways to create samples from a concordance, and presupposes that you have read Section 3f.

Sorting

When we create a concordance, it contains the results in the order in which they occur in the corpus. Frequently, we want to sort them in some way – for example, by the first word (or lemma, or part of speech) to the right of the match (i.e. following it) or by the first word on the left (i.e. preceding it).

This is done using the sort command, but before we can use this, we need to know how to refer to the match itself and to the tokens to the left and to the right. The convention is simple: the match itself (or, in a complex match, the first token of the match) is called match[0]. The words to the right (i.e., following match[0]) are numbered by positive numbers – the first word to the right is called match[1], the second match[2], and so on. The words to the left (i.e., preceding match[0]) are numbered by negative numbers – the first word to the left is called match[-1], the second word to the left match[-2], and so on.
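The numbering convention can be illustrated outside CQP with a short Python sketch. The token list, the match position and the helper context_token are hypothetical; they only mimic how match[x] picks out a token relative to the match and say nothing about how CQP works internally.

```python
# Illustration only: mimics the match[x] offsets on a plain Python list.
tokens = ["I", "would", "love", "to", "see", "it"]  # invented example sentence
match_pos = 2  # position of the match ("love") in the token list


def context_token(tokens, match_pos, offset):
    """Return the token at match[offset], or None if it lies outside the text."""
    pos = match_pos + offset
    if 0 <= pos < len(tokens):
        return tokens[pos]
    return None


print(context_token(tokens, match_pos, 0))   # match[0]: the match itself
print(context_token(tokens, match_pos, 1))   # match[1]: first word to the right
print(context_token(tokens, match_pos, -1))  # match[-1]: first word to the left
```

Positive offsets move to the right of the match and negative offsets to the left, exactly as in the convention described above.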

The sort command is used as follows: sort Name by attribute on match[x]. For example, to sort the concordance we have saved as Love by the first word to the left, we type:

sort Love by word%c on match[-1]

Try it – the concordance will now be displayed in the specified order. This is useful, for example, if we are interested in what modal verbs precede the verb love or what adjectives precede the noun love. Or, we could sort the concordance by the part of speech of the first word to the right as follows:

sort Love by pos on match[1]

Try it – note that now all hits where love is followed by an adjective are listed first (with a few errors where the part-of-speech tag is wrong). This is because the BNC uses AJ0 as the tag for uninflected adjectives, and this is the first tag alphabetically.
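What sorting by an attribute of a neighbouring token amounts to can be sketched in Python. The three hits, their tags and the helper sort_key are invented for illustration. The sketch assumes that the %c flag makes the comparison ignore case, and it uses Python's default string ordering, which may differ from the collation CQP applies.

```python
# Illustration only: sorting hits by an attribute of a neighbouring token.
# Each hit is (list of (word, pos) pairs, index of the match); all data are invented.
hits = [
    ([("I", "PNP"), ("love", "VVB"), ("you", "PNP")], 1),
    ([("my", "DPS"), ("love", "NN1"), ("eternal", "AJ0")], 1),
    ([("They", "PNP"), ("love", "VVB"), ("the", "AT0"), ("Beatles", "NP0")], 1),
]


def sort_key(hit, attribute, offset, ignore_case=False):
    """Return the value of `attribute` at match[offset] ("" if outside the hit)."""
    tokens, match_pos = hit
    pos = match_pos + offset
    if not 0 <= pos < len(tokens):
        return ""
    word, tag = tokens[pos]
    value = word if attribute == "word" else tag
    return value.lower() if ignore_case else value


# Counterpart of: sort Love by pos on match[1]
by_pos_right = sorted(hits, key=lambda h: sort_key(h, "pos", 1))

# Counterpart of: sort Love by word%c on match[-1] (assumed: %c ignores case)
by_word_left = sorted(hits, key=lambda h: sort_key(h, "word", -1, ignore_case=True))

# The same sort without ignoring case, for comparison
by_word_left_cs = sorted(hits, key=lambda h: sort_key(h, "word", -1))

for tokens, _ in by_pos_right:
    print(" ".join(word for word, _ in tokens))
```

In this toy data the hit with AJ0 to the right of the match comes first, because AJ0 precedes AT0 and PNP alphabetically. Without case folding, Python would place capitalised They before lower-case my, which is the kind of effect the case-insensitive sort avoids.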

We can also sort a concordance randomly, by typing sort Name randomize, for example:

sort Love randomize

Careful: this will obviously give us a different order every time we execute this command.

Randomized Sampling

The Open Corpus Workbench offers a convenient but relatively limited way of creating samples. For more advanced options, see Section 5f.

Often, a concordance is too large for a given research project. Let us assume that we want to know what love is (to cite the famous Foreigner song) – i.e., how the word love is used in English. The BNC contains 28480 examples of this word, which is clearly too many for a single person to look at (too much love will kill you, as the Queen song warns us). Instead, we might want to look at 100, 150 or 500 randomly selected lines, or at 10 or 20 percent of the concordance.

This is done with the reduce command, which takes either a number (reduce Name to x) or a percentage (reduce Name to x%). To reduce the concordance named Love to a 15% sample, we type:

reduce Love to 15%

Try it. Note that the concordance is now irreversibly reduced – our original concordance no longer exists. Thus, it is a good idea to create a copy of the original concordance, and to then reduce the copy, leaving the original intact:

LoveSample = Love
reduce LoveSample to 15%

To create a sample of exactly 100 matches from the concordance named Love, we again reduce a copy:

LoveSample = Love
reduce LoveSample to 100
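The difference between reducing to a percentage and reducing to a fixed number, and the reason for working on a copy, can be sketched in Python. The function reduce_to is hypothetical and is not CQP's implementation: how CQP rounds percentages and whether it keeps the remaining lines in corpus order are assumptions made here for the sake of the example.

```python
import random


def reduce_to(hits, target):
    """Return a random sample of hits as a new list.

    target is either an int (number of lines) or a string such as "15%".
    Rounding and keeping the original order are assumptions of this sketch.
    """
    if isinstance(target, str) and target.endswith("%"):
        n = round(len(hits) * float(target[:-1]) / 100)
    else:
        n = int(target)
    n = min(n, len(hits))  # cannot keep more lines than there are
    chosen = sorted(random.sample(range(len(hits)), n))
    return [hits[i] for i in chosen]


love = list(range(28480))              # stand-in for the 28480 concordance lines
love_sample = reduce_to(love, "15%")   # like: LoveSample = Love, then reduce to 15%
love_hundred = reduce_to(love, 100)    # like: LoveSample = Love, then reduce to 100

print(len(love), len(love_sample), len(love_hundred))
```

Because the function returns a new list, the original stays intact. This mirrors the advice above to copy the concordance first and reduce only the copy.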

In both cases, the sample is a random one. This means that if we reduce the same concordance several times, we get a different sample each time: CQP creates a random number before reducing the concordance, and uses this number to decide which lines to keep. If we want to create a reproducible random sample, we can tell CQP which number to use by issuing the command randomize x (where x is a positive number) right before the reduce command. If someone else now uses the same number, or if we use the same number a second time, they (or we) will get exactly the same random sample again. For example, we could set the random number to 50 (because there are fifty ways to leave your lover, as Paul Simon has correctly observed):

LoveSample = Love
randomize 50;
reduce LoveSample to 100

Try executing these commands several times – you will get the same sample every time. So, slip out the back, Jack, make a new plan, Stan, use CQP, Lee, this software is free.
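The principle behind reproducible sampling, fixing the starting number of the random number generator, can be illustrated with Python's standard random module. The function seeded_sample is hypothetical; it shows the general idea only and will not reproduce the lines that CQP selects for the same number.

```python
import random


def seeded_sample(hits, n, seed):
    """Draw n lines at random, using a generator started from a fixed seed."""
    rng = random.Random(seed)  # same seed, same sequence of random numbers
    chosen = sorted(rng.sample(range(len(hits)), n))
    return [hits[i] for i in chosen]


love = list(range(28480))  # stand-in for the 28480 concordance lines

first = seeded_sample(love, 100, seed=50)
second = seeded_sample(love, 100, seed=50)

print(first == second)  # True: the same seed yields the same sample
```

Anyone who runs the sketch with the same seed and the same input gets the same 100 lines, which is what makes a sample reportable and checkable by others.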

Summary and outlook

This section has shown you how to sort concordances and how to create simple random samples from a large concordance (more advanced sampling techniques are described in Section 5f). Building on this, you can look at the following sections in any order.

[ Introduction to CQP: Section 1 | Section 2 | Section 3 | Section 4 | Section 5 | Section 6 ]

cqp/sorting-sampling.txt · Last modified: 2024/01/21 07:15 by aamoakuh