5f. Advanced sampling

This section explains how to create samples from concordances using more advanced techniques than those provided by the Open Corpus Workbench. It requires sending your concordance to a small program we provide as part of INLET: csample.pl. If you want to use this script on your own computer or in your own installation of the Open Corpus Workbench, you can download it here.

Types of sampling

When drawing a sample from a larger dataset, we would like the sample to be representative of the larger dataset, such that any analytical results we derive from the former are likely to be true of the latter. There are two main strategies for drawing such a sample. First, we can select a certain number of cases at random, in such a way that every case has the same chance of ending up in the sample – let us call this a (simple) random sample. Second, if the data set is ordered (or can be ordered based on external criteria), we can choose a starting point and then go through the data set and choose every nth case (let us call this a systematic sample).

Both types of sampling have advantages and disadvantages. Random sampling has the obvious advantage that it makes no assumptions about the data set and leaves the inclusion of a case in the sample completely to chance. Thus, we minimise the possibility of researcher bias. On the other hand, random sampling does not preclude the possibility that some parts of the original data contribute more cases to the sample than others and will therefore be overrepresented. Concordances often have structure: they contain hits in the order in which they occur in the corpus, so examples that occur next to each other are more likely to come from the same text than examples that occur further apart. Here, systematic sampling has the advantage that the cases included in the sample will be distributed evenly throughout the corpus – examples that occur next to each other in the concordance will never both be included. The disadvantage is that any bias in the construction of the corpus will definitely also end up in the sample.

Drawing samples from concordances

The Open Corpus Workbench does not offer much flexibility – the command random (described in Section 3g) can only be used to draw a simple random sample of a given size. If you need advanced sampling options, however, you can send your concordance to the script csample.pl before you save it – just as you can send your concordance to tidycwb.pl to clean it up (see Section 5d) or to collocates.pl to create collocate lists and tables (see Section 4c). As always, this is done as follows (see below for concrete examples):

cat [CONCORDANCE] > "| csample.pl [OPTIONS] > [OUTPUT FILE]"

The options are what makes the script powerful: you can use them to specify the type of sampling that you want to perform as well as to control the format of the output.

The script csample.pl offers both types of sampling just described, and it offers each of them in two versions. Let us look at this in more detail.

After specifying the sampling method (simple random sampling or systematic sampling), you can proceed in two ways. First, you can specify a basis on which lines should be included in the sample. In the case of random sampling, this basis is the probability with which a concordance line will be included in the sample – for example, a probability of 0.01 (i.e. one percent), 0.33 (i.e. thirty-three percent), etc. In the case of systematic sampling, it is the distance between lines – e.g., 100 (i.e. every one-hundredth line), 3 (i.e. every third line), etc. In both cases, the size of your sample will be a consequence of the value you provide.

Second, you can instead specify a sample size. In this case, the probability (for random samples) or the distance between lines (for systematic samples) will be calculated to ensure that the sample will have the desired size.

You choose the method using the option --method (or -m) and supply the value for the selection criterion using the option --basis (or -b). There are four combinations of method and selection criterion.

Systematic sampling with a given distance between selected lines

The method sp (or systematicproportion) provides a systematic selection of every nth case, with --basis (or -b) specifying the distance between lines. For example,

cat Love  > "| csample.pl --method sp --basis 7 > love-sample.txt"

will select every seventh line (if you leave out the --basis option, every second line will be selected).

You can specify which line should be the first line in your sample. If you run the command as just shown, the script starts counting on the first line of the concordance, which means that the seventh line will be the first one to be included in your sample, followed by the 14th, the 21st, the 28th, and so on. In order to create different samples from the same file, you can tell the script which line should be the first in your sample using the option --offset followed by the number of that line. For example, if you want the first line of the original concordance to be the first line of your sample, the command should look like this:

cat Love  > "| csample.pl --method sp --basis 7 --offset 1 > love-sample.txt"

In this case, your sample will contain lines 1, 8, 15, 22, 29 etc. If you use --offset 10, the 10th line of the concordance will be the first line of your sample, and so on.
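The selection logic described in this section can be illustrated with a minimal Python sketch. This is not the code of csample.pl; the function name systematic_by_distance and its parameters are hypothetical and only mirror the behaviour described above (distance set by the basis, default distance 2, optional offset for the first line).

```python
def systematic_by_distance(lines, distance=2, offset=None):
    """Return every `distance`-th line of a concordance.

    `offset` is the 1-based number of the first line to include.
    Without an offset, counting starts on the first line, so the first
    selected line is line number `distance` (assumed from the text above).
    """
    if distance < 1:
        raise ValueError("distance must be a positive integer")
    first = distance if offset is None else offset
    if first < 1:
        raise ValueError("offset must be a positive integer")
    # Convert the 1-based line number into a 0-based list index.
    return lines[first - 1::distance]
```

For a concordance of 30 lines, systematic_by_distance(lines, 7) returns lines 7, 14, 21 and 28, while systematic_by_distance(lines, 7, offset=1) returns lines 1, 8, 15, 22 and 29.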

Systematic sampling with a given sample size

The method sf (or systematicfixed) provides a systematic selection of every nth case, with --basis (or -b) specifying the size of the sample. For example,

cat Love  > "| csample.pl --method sf --basis 30 > love-sample.txt"

will produce a sample of 30 lines, by first determining what distance between lines will result in such a sample and then selecting lines based on this distance. This method only works if the concordance has a standard CWB header (which, unless you have made deliberate changes to your settings, should always be the case on our INLET system). If no basis is specified, a sample of 50 cases will be produced.
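How the distance can be derived from a desired sample size is sketched below in Python. The text does not say how csample.pl performs this calculation, so the integer division and the truncation of surplus lines are assumptions, and the function name systematic_fixed_size is hypothetical. The sketch holds the whole concordance in memory and therefore knows its length directly, whereas the script presumably takes the number of hits from the CWB header.

```python
def systematic_fixed_size(lines, size=50):
    """Return `size` evenly spaced lines from a concordance.

    Assumption: the distance is the total number of lines divided by the
    desired size, rounded down; surplus lines at the end are dropped.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    if size >= len(lines):
        # The concordance is not larger than the requested sample.
        return list(lines)
    distance = len(lines) // size
    # Counting starts on the first line, so the first selected line
    # is line number `distance` (0-based index distance - 1).
    selected = lines[distance - 1::distance]
    # Rounding down can yield a few lines too many; keep the first `size`.
    return selected[:size]
```

With 100 lines and a size of 30, the distance is 3, which yields 33 candidate lines; the sketch keeps the first 30 of them.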

Random sampling with a given probability

The method rp (or randomproportion) provides a random sample, with --basis specifying the probability that any given line will be included in the sample. For example,

cat Love  > "| csample.pl --method rp --basis 0.33 > love-sample.txt"

will go through the concordance and give every line a chance of 0.33 (thirty-three percent) to be included. Thus, the final sample will be roughly one-third the size of the concordance (since the selection is random, the size will vary). If no basis is specified, the probability of inclusion will be 0.5.
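The per-line decision described here can be written down in a few lines of Python. This is only an illustrative sketch, not the implementation of csample.pl; the function name random_by_probability and the optional rng parameter (which makes a sample reproducible) are hypothetical.

```python
import random


def random_by_probability(lines, probability=0.5, rng=None):
    """Include each line independently with the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    # rng.random() returns a float in [0, 1), so a line is kept with
    # exactly the requested probability; the original order is preserved.
    return [line for line in lines if rng.random() < probability]
```

Because every line is decided independently, two runs over the same concordance will usually produce samples of slightly different sizes, as noted above.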

Random sampling with a given sample size

The method rf (or randomfixed) provides a random sample with the size specified by --basis. For example,

cat Love  > "| csample.pl --method rf --basis 30 > love-sample.txt"

will produce a random sample of thirty cases, with every example getting the same chance to be included. If no basis is specified, a sample of fifty cases will be produced.
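One common way to draw a random sample of a fixed size is to pick the line positions without replacement, as in the Python sketch below. The text does not describe how csample.pl achieves the fixed size internally, so this stands in for the idea rather than reproducing the script; the function name random_fixed_size and the rng parameter are hypothetical.

```python
import random


def random_fixed_size(lines, size=50, rng=None):
    """Return `size` randomly chosen lines, kept in corpus order."""
    if size < 0:
        raise ValueError("size must not be negative")
    if rng is None:
        rng = random.Random()
    if size >= len(lines):
        # The concordance is not larger than the requested sample.
        return list(lines)
    # Draw distinct positions so that no line is selected twice,
    # then sort them to preserve the order of the concordance.
    positions = sorted(rng.sample(range(len(lines)), size))
    return [lines[i] for i in positions]
```

Sorting the chosen positions is a design choice of this sketch: it keeps the sample in the order of the original concordance, which makes it easier to compare with the source data.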

Additional options

If you run the commands as shown above, the output will consist only of the sample. If you want to keep the header of your concordance (which includes information about the corpus used, the query you ran, etc.), simply add --header (or -h) to your command. For example:

cat Love  > "| csample.pl --method sp --basis 3 --header > love-sample.txt"

Your output will now include the header, with added information about the sample size in relation to the size of the original concordance and about the sampling method used.

Finally, you can transform your sample into a CSV file, as described in Section 5d. You don't need a separate script to do so; simply add the option --format csv:

cat Love  > "| csample.pl --method sp --basis 3 --format csv > love-sample.csv"

This is useful if you want to import your sample into a spreadsheet.

cqp/advanced-sampling.txt · Last modified: 2021/01/22 10:22 by astefanowitsch