cqp:advanced-sampling
cqp:advanced-sampling [2021/01/22 10:22] – created astefanowitsch | cqp:advanced-sampling [2024/06/20 13:53] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
+ | ====== 5f. Advanced sampling ====== | ||
+ | |||
+ | //This section explains how to create samples from concordances using more advanced techniques than those provided by the Open Corpus Workbench. It requires sending your concordance to a small program we provide as part of INLET: '' | ||
+ | |||
+ | ===== Types of sampling ===== | ||
+ | |||
+ | When drawing a sample from a larger dataset, we would like the sample to be representative of the larger dataset, such that any analytical results we derive from the former are likely to be true of the latter. There are two main strategies for drawing such a sample. First, we can select a certain number of cases at random, in such a way that every case has the same chance of ending up in the sample – let us call this a //(simple) random sample//. Second, if the dataset is ordered (or can be ordered based on external criteria), we can choose a starting point and then go through the dataset and choose every nth case (let us call this a // | ||
+ | |||
+ | Both types of sampling have advantages and disadvantages: | ||
+ | |||
+ | ===== Drawing samples from concordances ===== | ||
+ | |||
+ | The Open Corpus Workbench does not offer much flexibility – the command '' | ||
+ | |||
+ | cat [CONCORDANCE] > "| csample.pl [OPTIONS] > [OUTPUT FILE]" | ||
+ | |||
+ | The options are what makes the script powerful: you can use them to specify the type of sampling that you want to perform as well as to control the format of the output. | ||
+ | |||
+ | The script '' | ||
+ | |||
+ | After specifying the sampling method (simple random sampling or systematic sampling), you can proceed in two ways. First, you can specify a basis on which lines should be included in the sample. In the case of random sampling, this basis is the probability with which a concordance line will be included in the sample -- for example, 0.01 (i.e. one percent) or 0.33 (i.e. thirty-three percent). In the case of systematic sampling, it is the distance between lines -- for example, 100 (i.e. every hundredth line) or 3 (i.e. every third line). In both cases, the size of your sample will be a consequence of the value you provide. | ||
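To make the relation between the basis and the resulting sample size more concrete, here is a small Python sketch. It is not part of ''csample.pl''; the function names, the parameter ''first'' and the concordance size of 1,000 lines are assumptions made purely for illustration.

```python
def expected_size_random(n_lines, probability):
    """Expected (average) sample size if each line is kept with `probability`."""
    return n_lines * probability


def size_systematic(n_lines, distance, first=1):
    """Number of lines selected when taking every `distance`-th line,
    beginning with line number `first` (1-based)."""
    if first > n_lines:
        return 0
    return (n_lines - first) // distance + 1


n = 1000  # hypothetical concordance size

# Random sampling: the sample size is only an expectation and varies per run.
print(expected_size_random(n, 0.01))
print(expected_size_random(n, 0.33))

# Systematic sampling: the sample size follows exactly from the distance.
print(size_systematic(n, 100))
print(size_systematic(n, 3))
```

With random sampling the numbers are averages over many runs, whereas with systematic sampling the same concordance and distance always yield the same number of lines.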
+ | |||
+ | Second, alternatively, | ||
+ | |||
+ | You choose the method using the option '' | ||
+ | |||
+ | ====Systematic sampling with a given distance between selected lines==== | ||
+ | |||
+ | The option '' | ||
+ | |||
+ | cat Love > "| csample.pl --method sp --basis 7 > love-sample.txt" | ||
+ | |||
+ | will select every seventh line (if you leave out the '' | ||
+ | |||
+ | You can specify which line should be the first line in your sample. If you run the command as just shown, it will start counting lines on the first line of the concordance, | ||
+ | |||
+ | cat Love > "| csample.pl --method sp --basis 7 --offset 1 > love-sample.txt" | ||
+ | |||
+ | In this case, your sample will contain lines 1, 8, 15, 22, 29 etc. If you use '' | ||
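The selection logic of systematic sampling with a distance of 7 and a first line of 1 can be sketched in Python as follows. This is an illustration only, not the code of ''csample.pl''; the function name ''systematic_sample'' and the 30-line dummy concordance are hypothetical.

```python
def systematic_sample(lines, distance, first=1):
    """Return every `distance`-th line, beginning with line number `first`
    (line numbers are 1-based, as in the text)."""
    if distance < 1 or first < 1:
        raise ValueError("distance and first must be positive integers")
    return lines[first - 1::distance]


# Hypothetical concordance with 30 lines, labelled by their line number.
concordance = [f"line {i}" for i in range(1, 31)]

sample = systematic_sample(concordance, distance=7, first=1)
print(sample)  # ['line 1', 'line 8', 'line 15', 'line 22', 'line 29']
```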
+ | |||
+ | ====Systematic sampling with a given sample size==== | ||
+ | |||
+ | The option '' | ||
+ | |||
+ | cat Love > "| csample.pl --method sf --basis 30 > love-sample.txt" | ||
+ | |||
+ | will produce a sample of 30 lines, by first determining what distance between lines will result in such a sample and then selecting lines based on this distance. This method only works if the concordance has a standard CWB header (which, unless you have made deliberate changes to your settings, should always be the case on our INLET system). If no basis is specified, a sample of 50 cases will be produced. | ||
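One simple way of deriving a distance from a desired sample size is shown in the following Python sketch. How ''csample.pl'' actually calculates the distance is not documented here, so the integer division used below is an assumption, as are the function name and the dummy data.

```python
def systematic_sample_fixed_size(lines, size):
    """Select `size` evenly spaced lines by deriving the distance from
    the total number of lines (assumed rule: distance = total // size)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    total = len(lines)
    if size >= total:
        return list(lines)  # concordance is not larger than the requested sample
    distance = total // size
    # Slicing may yield slightly more than `size` lines; keep the first `size`.
    return lines[::distance][:size]


# Hypothetical concordance with 1,000 lines.
concordance = [f"line {i}" for i in range(1, 1001)]

sample = systematic_sample_fixed_size(concordance, 30)
print(len(sample))   # 30
print(sample[:3])    # ['line 1', 'line 34', 'line 67']
```

In this sketch the total number of lines is simply the length of the list, so the distance can be determined before any line is selected.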
+ | |||
+ | ====Random sampling with a given probability==== | ||
+ | |||
+ | The option '' | ||
+ | |||
+ | cat Love > "| csample.pl --method rp --basis 0.33 > love-sample.txt" | ||
+ | |||
+ | will go through the concordance and give every line a chance of 0.33 (thirty-three percent) of being included. Thus, the final sample will be roughly one-third the size of the concordance (since the selection is random, the exact size will vary from run to run). If no basis is specified, the probability of inclusion will be 0.5. | ||
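The idea of giving every line an independent chance of inclusion can be sketched in Python as follows. This is a minimal illustration, not the implementation used by ''csample.pl''; the function name, the ''seed'' parameter and the dummy concordance are hypothetical.

```python
import random


def random_sample_by_probability(lines, probability, seed=None):
    """Keep each line independently with the given probability.
    `seed` is only there to make a run repeatable."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < probability]


# Hypothetical concordance with 10,000 lines.
concordance = [f"line {i}" for i in range(1, 10001)]

sample = random_sample_by_probability(concordance, 0.33, seed=42)
# The size is close to, but usually not exactly, one third of the concordance.
print(len(sample), "of", len(concordance), "lines selected")
```

Running the sketch with different seeds (or none) shows how the sample size fluctuates around one-third of the concordance.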
+ | |||
+ | ====Random sampling with a given sample size==== | ||
+ | |||
+ | The option '' | ||
+ | |||
+ | cat Love > "| csample.pl --method rf --basis 30 > love-sample.txt" | ||
+ | |||
+ | will produce a random sample of thirty cases, with every line having the same chance of being included. If no basis is specified, a sample of fifty cases will be produced. | ||
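Drawing a random sample of a fixed size can be sketched in Python as follows. Again, this is only an illustration and not the code of ''csample.pl''; the function name and the ''seed'' parameter are hypothetical, and keeping the selected lines in their original order is an assumption of this sketch.

```python
import random


def random_sample_fixed_size(lines, size, seed=None):
    """Draw `size` distinct lines, each line being equally likely to be chosen.
    The selected lines are returned in their original concordance order."""
    rng = random.Random(seed)
    size = min(size, len(lines))  # cannot draw more lines than there are
    positions = sorted(rng.sample(range(len(lines)), size))
    return [lines[i] for i in positions]


# Hypothetical concordance with 1,000 lines.
concordance = [f"line {i}" for i in range(1, 1001)]

sample = random_sample_fixed_size(concordance, 30, seed=42)
print(len(sample))  # 30
```

Unlike sampling with a given probability, this approach always returns exactly the requested number of lines, as long as the concordance is large enough.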
+ | |||
+ | =====Additional options===== | ||
+ | |||
+ | If you run the commands as shown above, the output will consist only of the sample. If you want to keep the header of your concordance (which includes information about the corpus used, the query you ran, etc.), simply add '' | ||
+ | |||
+ | cat Love > "| csample.pl --method sp --basis 3 --header > love-sample.txt" | ||
+ | |||
+ | Your output will now include the header, with added information about the sample size in relation to the size of the original concordance and about the sampling method used. | ||
+ | |||
+ | Finally, you can transform your sample into a csv file, as described in [[cqp: | ||
+ | |||
+ | cat Love > "| csample.pl --method sp --basis 3 --format csv > love-sample.csv" | ||
+ | |||
+ | This is useful if you want to import your sample into a spreadsheet. | ||