[ Collection: Introduction to CQP ]

4a. Frequency lists

This section explains how to create frequency lists from a concordance. It presupposes that you have read Section 1, Section 2 and Section 3 (in particular, Section 3c and Section 3g).

Simple word counts

“My love for you is immeasurable”, sings Rod Stewart, but there are many things we can measure about the word love (and other expressions), and in corpus linguistics, measuring typically means counting. Many research projects will require us to export a concordance, categorize the examples in various ways and then quantify our categorization decisions. Sometimes, however, we just need simple frequency information about words or larger expressions, and CQP offers an easy way of getting it: the count command, which can be applied to the last concordance created (count Last) or to a concordance saved in a variable (e.g. count Love). We always have to specify the column of our corpus that the count should be applied to, e.g. word, pos, or hw (or lemma), as in count Last by word.

For example, if we have a concordance called Love containing the lemma love (which we can create using the command Love = [hw="love"], see Section 3f), we can count the different word forms for this lemma (attaching %c to ignore the differences in upper and lower case):

count Love by word%c

We get a result like the following:

22315   love  [#0-#22314]
4313    loved  [#22315-#26627]
1341    loves  [#26628-#27968]
510     loving  [#27970-#28479]
1       lovest  [#27969]

The first column gives the frequency, the second column the word form, and the third column the line number(s) in the sorted concordance corresponding to the form given in the second column. We can display just those lines by adding them to the cat command. For example, to display just the lines containing the form loving, we type:

cat Love 27970 28479

As just hinted at, applying the count command automatically sorts the concordance (display it using cat Love and you will see that it is sorted by word form). To undo this and put the concordance back into the original order (i.e., the order in which the examples occur in the corpus), simply type sort Love.
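If we want to process such output outside CQP, the three columns are easy to take apart. The following Python sketch is only an illustration and not part of CQP: it assumes that the output has exactly the layout shown above (frequency, form, bracketed line range), and the names parse_count_line and cat_command are made up for this example. It turns one line of count output into the corresponding cat command.

```python
import re

# Assumed layout of one line of count output, as in the examples above:
# frequency, counted form, then "[#first-#last]" or "[#only]".
COUNT_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s+\[#(\d+)(?:-#(\d+))?\]\s*$")


def parse_count_line(line):
    """Return (frequency, form, first_line, last_line) for one output line."""
    m = COUNT_LINE.match(line)
    if m is None:
        raise ValueError(f"not a count output line: {line!r}")
    freq, form, first, last = m.groups()
    first = int(first)
    # A single line such as [#27969] has no end; treat it as a one-line range.
    last = int(last) if last is not None else first
    return int(freq), form, first, last


def cat_command(name, line):
    """Build the cat command that displays the lines behind one output row."""
    _, _, first, last = parse_count_line(line)
    return f"cat {name} {first} {last}"


print(cat_command("Love", "510     loving  [#27970-#28479]"))
# cat Love 27970 28479
print(parse_count_line("1       lovest  [#27969]"))
# (1, 'lovest', 27969, 27969)
```

Note that the length of each range (last minus first, plus one) equals the frequency in the first column, which is a quick way to check that a line was parsed correctly.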

We can also count the concordance by part of speech, by typing the following:

count Love by pos

This gives us a result like this:

11437   NN1  [#1-#11437]
3788    VVB  [#14273-#18060]
3028    VVI  [#23072-#26099]
2879    VVD  [#19428-#22306]
2650    NN1-VVB  [#11438-#14087]
1367    VVB-NN1  [#18061-#19427]
1149    VVZ  [#27279-#28427]
1040    VVN  [#26100-#27139]
265     VVG-AJ0  [#22807-#23071]
252     VVD-VVN  [#22310-#22561]
245     VVG  [#22562-#22806]
113     VVN-VVD  [#27166-#27278]
81      NN2  [#14088-#14168]
59      NN2-VVZ  [#14169-#14227]
52      VVZ-NN2  [#28428-#28479]
43      NP0  [#14228-#14270]
26      VVN-AJ0  [#27140-#27165]
3       VVD-AJ0  [#22307-#22309]
2       UNC  [#14271-#14272]
1       AJS  [#0]

The structure is the same as before, except that we now have the pos tags in the second column instead of the word forms.
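Since both lists count the same concordance, just by different columns, their frequencies must add up to the same total. The following Python sketch (an illustration outside CQP; the dictionary names and the helper shares are made up for this example) checks this for the figures shown above and converts raw frequencies into proportions, which are often easier to compare.

```python
# Frequencies copied from the two count outputs above.
word_counts = {"love": 22315, "loved": 4313, "loves": 1341,
               "loving": 510, "lovest": 1}
pos_counts = {
    "NN1": 11437, "VVB": 3788, "VVI": 3028, "VVD": 2879, "NN1-VVB": 2650,
    "VVB-NN1": 1367, "VVZ": 1149, "VVN": 1040, "VVG-AJ0": 265, "VVD-VVN": 252,
    "VVG": 245, "VVN-VVD": 113, "NN2": 81, "NN2-VVZ": 59, "VVZ-NN2": 52,
    "NP0": 43, "VVN-AJ0": 26, "VVD-AJ0": 3, "UNC": 2, "AJS": 1,
}


def shares(counts):
    """Map each item to its proportion of the summed frequencies."""
    total = sum(counts.values())
    return {item: freq / total for item, freq in counts.items()}


print(sum(word_counts.values()))  # 28480
print(sum(pos_counts.values()))   # 28480
print(f"{shares(pos_counts)['NN1']:.1%}")  # 40.2%
```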

More complex word counts

As discussed in Section 3c, we can (and often do) query expressions that are longer than one word. For example, inspired by the love song “She drives me crazy” by the Fine Young Cannibals, we may wonder what adjectives can occur in the pattern [drive someone ADJECTIVE]. We could construct a query like the following:

[hw="drive"] [pos="PNP"] [pos="AJ0"]

This should find all instances of the lemma drive, followed by a personal pronoun (me, you, etc.), followed by an uninflected adjective. Try it, and you will see that it does indeed. We can now apply the count command:

count Last by word%c

This will give us a result like this, …

15      driving me mad  [#96-#110]
14      driving me crazy  [#77-#90]
8       drive you mad  [#28-#35]
8       drove him mad  [#117-#124]
7       drive me mad  [#11-#17]
7       drives me mad  [#52-#58]
6       drive you crazy  [#21-#26]
5       drive me crazy  [#1-#5]
5       drives me crazy  [#44-#48]
4       drives you mad  [#68-#71]

… which is not really what we want: we are interested in the adjectives and their frequencies, but due to the different forms of the verb drive and the different pronouns, we do not get a clean list. The problem of the different verb forms could be solved by counting by hw instead of word%c, but the different pronouns would continue to confound the picture. Instead, we would like to be able to count just the adjectives – and we can: CQP allows us to specify a position at which to produce a frequency list, which will then ignore the rest of the match. As described in Section 3g, the first token of the match always has position 0 (match[0]), and the tokens on the right are numbered with increasing positive numbers, the tokens on the left with increasing negative numbers. Our query consists of three tokens, with the adjective in second position to the right of match[0], i.e., match[2]. Thus, we can use the count command as follows:

count Last by word%c on match[2]

This will give us the following, which is exactly what we want:

66      mad  [#66-#131]
37      crazy  [#9-#45]
13      insane  [#53-#65]
5       batty  [#1-#5]
3       daft  [#46-#48]
2       dotty  [#50-#51]
1       barmy  [#0]
1       blue  [#6]
1       clear  [#7]
1       crackers  [#8]
1       demented  [#49]
1       frantic  [#52]
1       onshore  [#132]
1       quackers  [#133]
1       spare  [#134]
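To make the effect of on match[2] concrete, here is a minimal Python sketch of the idea. It is not how CQP is implemented: the matches are a small hand-made list (not corpus data), count_by_position is a hypothetical helper, and lower-casing is only a rough stand-in for what %c does.

```python
from collections import Counter

# Toy matches for [hw="drive"] [pos="PNP"] [pos="AJ0"]; invented for
# illustration, not taken from a corpus. Index 0 corresponds to match[0].
matches = [
    ("driving", "me", "mad"),
    ("drives", "me", "crazy"),
    ("Drove", "him", "Mad"),
    ("drive", "you", "crazy"),
    ("drive", "me", "mad"),
]


def count_by_position(matches, position, ignore_case=True):
    """Count only the token at match[position], ignoring the rest of the match."""
    counts = Counter()
    for tokens in matches:
        form = tokens[position]
        counts[form.lower() if ignore_case else form] += 1
    return counts.most_common()  # most frequent first


print(count_by_position(matches, 2))
# [('mad', 3), ('crazy', 2)]
```

Counting the whole match would give five different rows for these five examples; restricting the count to position 2 collapses them into two adjectives.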

Even more complex word counts

But what if we want to create a frequency list of an expression that is larger than a single word but smaller than the entire match? For example, we might notice that the expression drive someone crazy also has a variant drive someone to distraction, with a preposition and a noun instead of an adjective. We could construct the following query to find such cases:

[hw="drive"] [pos="PNP"] [pos="PRP"] [pos="NN1"]

Now, we would like a frequency list of the sequence of preposition and noun at the end of the match. Fortunately, this is not a problem, as CQP allows us to select not just a single position from a match, but also a range of positions. This is done by giving the first and last position of the range, separated by two periods: on match[x] .. match[y]. In our case:

count Last by word%c on match[2] .. match[3]

Try it. The first few lines of the frequency list should look like this:

7       to distraction  [#19-#25]
5       to hospital  [#30-#34]
5       to suicide  [#41-#45]
2       into opposition  [#10-#11]
2       to madness  [#37-#38]
2       to school  [#39-#40]
2       to victory  [#46-#47]
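The range notation can be sketched in Python in the same spirit. Again, this is only an illustration under assumptions: the matches are invented, count_by_range is a hypothetical helper, and lower-casing merely approximates %c. The point to notice is that match[x] .. match[y] includes both end points, whereas a Python slice excludes its end, hence the end + 1.

```python
from collections import Counter

# Toy matches for [hw="drive"] [pos="PNP"] [pos="PRP"] [pos="NN1"];
# invented for illustration, not taken from a corpus.
matches = [
    ("drove", "him", "to", "distraction"),
    ("drives", "me", "to", "distraction"),
    ("drive", "her", "to", "hospital"),
    ("driven", "them", "into", "opposition"),
    ("Driving", "us", "To", "distraction"),
]


def count_by_range(matches, start, end):
    """Count the token sequence from match[start] to match[end], inclusive."""
    counts = Counter()
    for tokens in matches:
        # end + 1 because Python slices stop before the end index.
        counts[" ".join(tokens[start:end + 1]).lower()] += 1
    return counts.most_common()  # most frequent first


print(count_by_range(matches, 2, 3))
# [('to distraction', 3), ('to hospital', 1), ('into opposition', 1)]
```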

In this case, the range contains two tokens that are next to each other, but the same notation works for larger ranges. For example, we may notice the expression drive someone up the wall, and wonder if there are other cases like this, with an article between the preposition and the noun. We could construct a query like the following to capture such cases:

[hw="drive"] [pos="PNP"] [pos="PRP"] [pos="AT0"] [pos="N.*"]

We can then produce a frequency list of the last three tokens of the match like this:

count Last by word%c on match[2] .. match[4]

This will give us a list like the following:

13      to the airport  [#68-#80]
13      up the wall  [#132-#144]
11      to the station  [#110-#120]
4       into the arms  [#24-#27]
2       into the street  [#42-#43]
2       to the conclusion  [#89-#90]
2       to the hospital  [#98-#99]
2       to the meeting  [#101-#102]
2       to the police  [#105-#106]

The phrase up the wall seems to be the only case of the expression we are looking for (but at least drive someone into the arms (of …) reminds us of Smokie's famous love song “Lay back in the arms of someone”) – all other cases seem to be instances where someone literally drives someone to some place. Although, come to think of it, drive someone to the meeting could be a great idiom for insanity, as anyone who has ever participated in a meeting can affirm.

Summary and outlook

This section introduced frequency lists. You can now read the following sections in any order:

[ Introduction to CQP: Section 1, Section 2, Section 3, Section 4, Section 5, Section 6 ]

cqp/frequency-lists.txt · Last modified: 2024/01/21 07:16 by aamoakuh