cqp:sorting-sampling
cqp:sorting-sampling [2020/04/19 15:17] – created astefanowitsch | cqp:sorting-sampling [2024/06/20 13:53] (current) – external edit 127.0.0.1 | ||
+ | ====== 3g. Sorting and sampling ====== | ||
+ | |||
+ | //This section explains how to sort a concordance and how to create random samples from a concordance. It presupposes that you have read [[cqp: | ||
+ | |||
+ | |||
+ | ===== Sorting ===== | ||
+ | |||
+ | When we create a concordance, | ||
+ | |||
+ | This is done using the '' | ||
+ | |||
+ | The '' | ||
+ | |||
+ | sort Love by word%c on match[-1] | ||
+ | |||
+ | Try it -- the concordance will now be displayed in the specified order. This is useful, for example, if we are interested in what modal verbs precede the verb //love// or what adjectives precede the noun //love//. Or, we could sort the concordance by the part of speech of the first word to the right as follows: | ||
+ | |||
+ | sort Love by pos on match[1] | ||
+ | |||
+ | Try it -- note that now all hits where //love// is followed by an adjective are listed first (with a few errors where the part-of-speech tag is wrong). This is because the BNC uses '' | ||
+ | |||
+ | We can also sort a concordance randomly, by typing '' | ||
+ | |||
+ | sort Love randomize | ||
+ | |||
+ | Careful: this will obviously give us a different order every time we execute this command. | ||
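To make the logic of the two sort commands above more tangible, here is a minimal Python sketch. It is not how CQP works internally; the four-line toy concordance and the variable names are invented for illustration, and only the tags are modelled on the BNC tagset. The first sort mimics sorting by ''word'' on ''match[-1]'' while ignoring case (which is what ''%c'' is assumed to do here), the second mimics sorting by ''pos'' on ''match[1]''.

```python
# Toy concordance (invented data): each hit is
# (word to the left, match, word to the right, POS tag of the word to the right).
hits = [
    ("would", "love", "to", "TO0"),
    ("True", "love", "never", "AV0"),
    ("in", "love", "with", "PRP"),
    ("Could", "love", "be", "VBI"),
]

# Analogue of sorting by word on match[-1], ignoring case:
# the sort key is the lower-cased token immediately before the match.
by_left_word = sorted(hits, key=lambda hit: hit[0].lower())

# Analogue of sorting by pos on match[1]:
# the sort key is the POS tag of the token immediately after the match.
by_right_pos = sorted(hits, key=lambda hit: hit[3])

for hit in by_left_word:
    print(hit)
```

Note that without lower-casing the key, //Could// and //True// would be sorted before //in// and //would//, because upper-case letters precede lower-case letters in the default string ordering.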
+ | |||
+ | ===== Randomized Sampling ===== | ||
+ | |||
+ | //The Open Corpus Workbench offers a convenient but relatively restricted way of creating samples. For more advanced options, see [[cqp: | ||
+ | |||
+ | Often, a concordance is too large for a given research project. Let us assume that we want to know what love is (to cite the famous Foreigner song) -- i.e., how the word //love// is used in English. The BNC contains 28480 examples of this word -- clearly too many for a single person to look at (too much love will kill you, as the Queen song warns us). Instead, we might want to look at 100, 150 or 500 randomly selected lines, or at 10 or 20 percent of the concordance. | ||
+ | |||
+ | This is done by the command '' | ||
+ | |||
+ | reduce Love to 15% | ||
+ | |||
+ | Try it. Note that the concordance is now irreversibly reduced -- our original concordance no longer exists. Thus, it is a good idea to create a copy of the original concordance, | ||
+ | |||
+ | LoveSample = Love | ||
+ | reduce LoveSample to 15% | ||
+ | |||
+ | To reduce a copy of the concordance named //Love// to exactly 100 matches, we type: | ||
+ | |||
+ | LoveSample = Love | ||
+ | reduce LoveSample to 100 | ||
+ | |||
+ | In both cases, the sample is a random one. This means that if we reduce the same concordance several times, we get a different sample each time: CQP creates a random number before reducing the concordance, | ||
+ | |||
+ | LoveSample = Love | ||
+ | randomize 50; | ||
+ | reduce LoveSample to 100 | ||
+ | |||
+ | Try executing these commands several times -- you will get the same sample every time. So, slip out the back, Jack, make a new plan, Stan, use CQP, Lee, this software is free. | ||
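The effect of fixing the random seed can be illustrated with a small Python sketch. This is an analogy only, not CQP's actual implementation: the function name ''reduce_to'' is hypothetical, the concordance is simulated as a list of 28480 hit numbers, and keeping the sampled hits in their original order is an assumption of the sketch.

```python
import random

def reduce_to(hits, n, seed=None):
    """Return a random sample of n hits, kept in their original order.

    With seed=None every call draws a different sample; with a fixed
    seed the same hits are selected on every call.
    """
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(hits)), n))
    return [hits[i] for i in chosen]

# Simulated concordance: one entry per hit of "love" in the BNC.
concordance = list(range(28480))

# Fixed seed (like setting the seed to 50 before reducing): identical samples.
sample_a = reduce_to(concordance, 100, seed=50)
sample_b = reduce_to(concordance, 100, seed=50)
print(sample_a == sample_b)

# A 15 percent sample is simply a sample of a computed size.
fifteen_percent = len(concordance) * 15 // 100
sample_c = reduce_to(concordance, fifteen_percent, seed=50)
print(len(sample_c))
```

Recording the seed together with the sample size is what makes a sample reproducible: anyone who starts from the same concordance and uses the same seed can recreate exactly the same sample.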
+ | |||
+ | ===== Summary and outlook ===== | ||
+ | |||
+ | This section has shown you how to sort concordances and how to create simple random samples from a large concordance (more advanced sampling techniques are described in [[cqp: | ||
+ | |||
+ | * [[cqp: | ||
+ | * [[cqp: | ||
+ | * [[cqp: | ||
+ | * [[cqp: | ||
+ | * [[cqp: | ||
+ | * [[cqp: | ||
+ | |||
+ | **[ Introduction to CQP: [[cqp: |