--- cqp:sorting-sampling [2020/04/19 15:17] – created astefanowitsch
+++ cqp:sorting-sampling [2024/06/20 13:53] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+**[ [[cqp:introduction|Collection: Introduction to CQP]] ]**
+====== 3g. Sorting and sampling ======
+//This section explains how to sort a concordance and how to create random samples from a concordance. It presupposes that you have read [[cqp:corpus-structure|Section 1]] and [[cqp:simple-queries|Section 2]]. The last subsection introduces more ways to create samples from a concordance, and presupposes that you have read [[cqp:concordances|Section 3f]].//
+===== Sorting =====
+When we create a concordance, it contains the results in the order in which they occur in the corpus. Frequently, we want to sort them in some way -- for example, by the first word (or lemma, or part of speech) to the right of the match (i.e. following it) or by the first word on the left (i.e. preceding it).
+This is done using the ''sort'' command, but before we can use it, we need to know how to refer to the match itself and to the tokens to its left and right. The convention is simple: the match itself (or, in a complex match, the first token of the match) is called ''match[0]''. The words to the right (i.e., following ''match[0]'') are numbered with positive numbers -- the first word to the right is called ''match[1]'', the second ''match[2]'', and so on. The words to the left (i.e., preceding ''match[0]'') are numbered with negative numbers -- the first word to the left is called ''match[-1]'', the second word to the left ''match[-2]'', and so on.
+The ''sort'' command is used as follows: ''sort //Name// by //attribute// on match[x]''. For example, to sort the concordance we have saved as ''Love'' by the first word to the left, ignoring case (this is what the ''%c'' flag does), we type:
+	sort Love by word%c on match[-1]
+Try it -- the concordance will now be displayed in the specified order. This is useful, for example, if we are interested in what modal verbs precede the verb //love// or what adjectives precede the noun //love//. Alternatively, we could sort the concordance by the part of speech of the first word to the right as follows:
+	sort Love by pos on match[1]
+Try it -- note that now all hits where //love// is followed by an adjective are listed first (with a few errors where the part-of-speech tag is wrong). This is because the BNC uses ''AJ0'' as the tag for uninflected adjectives, and this is the first tag alphabetically.
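+As a further illustration, and assuming the concordance is still saved under the name ''Love'', the same pattern should work with other positions and attributes, for example sorting by the second word to the left or by the part of speech of the first word to the left:
+	sort Love by word%c on match[-2]
+	sort Love by pos on match[-1]
+The second command would group the hits by the type of word preceding //love//, so that preceding modal verbs or adjectives end up next to each other.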
+We can also sort a concordance randomly, by typing ''sort //Name// randomize'', for example:
+	sort Love randomize
+Careful: this will obviously give us a different order every time we execute this command.
+===== Randomized Sampling =====
+//The Open Corpus Workbench offers a convenient but relatively restricted way of creating samples. For more advanced options, see [[cqp:advanced-sampling|Section 5f]].//
+Often, a concordance is too large for a given research project. Let us assume that we want to know what love is (to cite the famous Foreigner song) -- i.e., how the word //love// is used in English. The BNC contains 28480 examples of this word, which is clearly too many for a single person to look at (too much love will kill you, as the Queen song warns us). Instead, we might want to look at 100, 150 or 500 randomly selected lines, or at 10 or 20 percent of the concordance.
+This is done with the command ''reduce'', which can be used with a number (''reduce //Name// to x'') or with a percentage (''reduce //Name// to x%''). To reduce the concordance named ''Love'' to a 15% sample, we type:
+	reduce Love to 15%
+Try it. Note that the concordance is now irreversibly reduced -- our original concordance no longer exists. Thus, it is a good idea to create a copy of the original concordance, and to then reduce the copy, leaving the original intact:
+	LoveSample = Love
+	reduce LoveSample to 15%
+To reduce the concordance named ''Love'' to exactly 100 matches, we type:
+	LoveSample = Love
+	reduce LoveSample to 100
+In both cases, the sample is a random one. This means that if we reduce the same concordance several times, we get a different sample each time: CQP creates a random number before reducing the concordance, and uses this number to decide which lines to keep. If we want to create a reproducible random sample, we can tell CQP which number to use by issuing the command ''randomize x'' (where x is a positive number) right before the ''reduce'' command. If we use the same number a second time, or if someone else uses it, the result will be exactly the same random sample. For example, we could set the random number to 50 (because there are fifty ways to leave your lover, as Paul Simon has correctly observed):
+	LoveSample = Love
+	randomize 50;
+	reduce LoveSample to 100
+Try executing these commands several times -- you will get the same sample every time. So, slip out the back, Jack, make a new plan, Stan, use CQP, Lee, this software is free.
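+Putting the commands from this section together, a reproducible 15% sample that is then sorted by the first word to the right could be created roughly as follows. This is a sketch, not a fixed recipe: the name ''LoveSample'' and the number 50 are arbitrary choices carried over from the examples above, and we assume that a reduced copy can be sorted like any other named concordance:
+	LoveSample = Love
+	randomize 50
+	reduce LoveSample to 15%
+	sort LoveSample by word%c on match[1]
+Sorting only changes the order in which the lines are displayed, so the sample itself should stay the same as long as the same number is used with ''randomize''.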
+===== Summary and outlook =====
+This section has shown you how to sort concordances and how to create simple random samples from a large concordance (more advanced sampling techniques are described in [[cqp:advanced-sampling|Section 5f]]). Building on this, you can look at the following sections in any order.
+  * [[cqp:extending-queries-combinations|Section 3a]]: Extending simple queries: Alternative attributes and values
+  * [[cqp:extending-queries-alternatives|Section 3b]]: Extending simple queries: Combinations of attributes and values
+  * [[cqp:complex-queries|Section 3c]]: Complex Queries
+  * [[cqp:metadata|Section 3d]]: Metadata
+  * [[cqp:regular-expressions-basics|Section 3e]]: Regular expressions (basics)
+  * [[cqp:concordances|Section 3f]]: Working with concordances
+**[ Introduction to CQP: [[cqp:corpus-structure|Section 1]] -- [[cqp:simple-queries|Section 2]] -- [[cqp:advanced-querying|Section 3]] -- [[cqp:beyond-queries|Section 4]] -- [[cqp:expert-tricks|Section 5]] -- [[cqp:exercises|Section 6]] ]**