[ Collection: Introduction to CQP ]
This section explains how to create complex frequency lists from a concordance, where the frequencies of one variable are grouped by a second variable. It presupposes that you have read Section 1, Section 2, Section 3c and Section 4a.
The frequency lists introduced in Section 4a are very useful as long as we are interested in the frequency of single words or connected sequences of words. But sometimes, we may want to determine the frequency of word combinations that do not occur in connected sequences.
An example: Inspired by the line You tricked me into loving you from Alexandra Burke's song You broke my heart, we might wonder what verbs co-occur in the expression [VERB someone into VERBING]. The following query should find most cases of this expression in the BNC:
[pos="VV."] [] [word="into"%c] [pos="VVG"]
It looks for any lexical verb (i.e. any verb except be, have and do), followed by any word (this will capture the pronouns and proper names that most typically occur in this position in the expression), followed by the word form into, followed by a lexical verb in the ing-form.
If we create a lemmatized frequency list of this query, it looks as follows:
count Last by hw;
3 fool you into think [#132-#134]
2 aggravate he into produce [#0-#1]
2 bully i into say [#39-#40]
2 con i into post [#66-#67]
2 delude themselves into believe [#92-#93]
2 delude yourself into think [#95-#96]
2 hoodwink we into believe [#190-#191]
2 mislead you into think [#253-#254]
2 talk he into give [#361-#362]
2 talk he into let [#365-#366]
2 talk i into come [#371-#372]
1 fool anyone into think [#144]
...
This is not useful, because we are not interested in the specific content of the second slot: for example, the first and the last lines of the excerpt shown here are examples of the same combination of verbs, fool and think, and we would like their frequencies to be combined.
We can create a frequency list of just the verb combinations by using the group command as follows:
group Last matchend hw by match[0] hw;
This command creates a special type of frequency list where the last word in our query (matchend) is combined with the first word (match[0]) and then counted. The result looks like this, giving us a much clearer picture:
#---------------------------------------------------------------------
fool think 12
mislead think 6
deceive think 5
delude believe 4
mislead believe 4
delude think 3
fool believe 3
force make 3
talk give 3
go 3
trick believe 3
...
(Note that the gap in the first column after talk indicates that the preceding verb is repeated.)
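The pair count that the group command performs can be illustrated outside CQP. The following Python sketch uses a small invented sample of hits (not real BNC data) standing in for the match[0] hw and matchend hw values of a concordance:

```python
from collections import Counter

# Invented sample: (first lemma, last lemma) of each hypothetical hit,
# standing in for match[0] hw and matchend hw in a real concordance.
matches = [
    ("fool", "think"), ("fool", "think"), ("fool", "believe"),
    ("talk", "give"), ("talk", "give"), ("mislead", "think"),
]

# Count each (first verb, last verb) combination, as the group command does.
pairs = Counter(matches)

# List combinations in order of descending frequency.
for (first, last), freq in pairs.most_common():
    print(first, last, freq)
```

This reproduces only the counting logic; in CQP the extraction and counting happen in a single step.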
Now, this does not really look like a “grouped” frequency list yet, because the combinations are still listed in order of descending frequency. We can actually group the list by modifying the command slightly and repeating the word group before the word by:
group Last matchend hw group by match[0] hw;
If we run this command, we get an output like the following, which is sorted by the first verb (the grouping variable) first, and then by order of frequency:
#---------------------------------------------------------------------
talk do 3
give 3
go 3
accept 2
come 2
let 2
take 2
abandon 1
act 1
become 1
believe 1
buy 1
care 1
have 1
join 1
leave 1
make 1
offer 1
play 1
realize 1
release 1
ring 1
stay 1
try 1
use 1
wear 1
#---------------------------------------------------------------------
force make 3
accept 1
agree 1
choose 1
commit 1
concede 1
discard 1
do 1
give 1
have 1
hide 1
lose 1
probe 1
see 1
speak 1
switch 1
try 1
use 1
work 1
...
Note that many combinations occur only once; we might not be interested in those. We can add cut at the end of the command to specify the minimum frequency that a combination must have in order to be included in the list. For example:
group Last matchend hw group by match[0] hw cut 2;
This gives us the following, more readable output that we could now use to think about why we talk people into actions like giving, going, accepting, etc., but fool them into thinking or believing something:
#---------------------------------------------------------------------
talk give 3
go 3
accept 2
come 2
let 2
take 2
#---------------------------------------------------------------------
fool think 12
believe 3
#---------------------------------------------------------------------
force make 3
#---------------------------------------------------------------------
bully get 2
say 2
take 2
#---------------------------------------------------------------------
trick believe 3
#---------------------------------------------------------------------
mislead think 6
believe 4
#---------------------------------------------------------------------
provoke make 2
#---------------------------------------------------------------------
deceive think 5
believe 2
#---------------------------------------------------------------------
delude believe 4
think 3
#---------------------------------------------------------------------
lead define 2
#---------------------------------------------------------------------
trap make 2
#---------------------------------------------------------------------
con post 2
#---------------------------------------------------------------------
lure make 2
#---------------------------------------------------------------------
go make 2
#---------------------------------------------------------------------
tempt make 2
#---------------------------------------------------------------------
lull believe 2
#---------------------------------------------------------------------
aggravate produce 2
#---------------------------------------------------------------------
hoodwink believe 2
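Conceptually, combining group by with cut amounts to bucketing the pair frequencies by the first verb and dropping anything below the threshold. A minimal Python sketch, with invented frequencies loosely modelled on the output above:

```python
from collections import defaultdict

# Invented pair frequencies standing in for the real BNC counts.
pair_freq = {
    ("talk", "give"): 3, ("talk", "go"): 3, ("talk", "accept"): 2,
    ("talk", "abandon"): 1, ("fool", "think"): 12, ("fool", "believe"): 3,
    ("force", "make"): 3, ("force", "agree"): 1,
}

cut = 2  # minimum frequency, like CQP's "cut 2"

# Bucket surviving pairs by the grouping variable (the first verb).
groups = defaultdict(list)
for (first, last), freq in pair_freq.items():
    if freq >= cut:
        groups[first].append((last, freq))

# Within each group, list second verbs in order of descending frequency.
for first, pairs in groups.items():
    for last, freq in sorted(pairs, key=lambda p: -p[1]):
        print(first, last, freq)
```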
Of course, you can switch the order of variables to group the first verb by the second instead:
group Last match[0] hw group by matchend hw cut 2;
Look at the output and think about what this can show you as opposed to the table above.
We can also use the group command to create frequency lists that are grouped by metadata – in fact, this is probably the more typical way of using it.
For example, we might be interested in whether there is a connection between social class and the usage of so-called “non-standard” language features such as the form ain't. If our corpus includes metadata about class, we can answer this question using a simple query and the group command as follows.
First, we create a concordance of the form ain't and the two standard forms that it can represent – isn't and aren't. We limit this query to utterances that contain information about class, which in the BNC is recorded in the attribute u_class using the labels AB (roughly ‘upper and middle middle class’), C1 (roughly ‘lower middle class’), C2 (roughly ‘skilled working class’) and DE (roughly ‘unskilled working class’):
[word="(am|are|is|ai)"%c] [word="n't"%c] :: match.u_class="(AB|C1|C2|DE)"
We can then group the frequency of the first word of the query by the values of the variable u_class:
group Last match word group by match u_class;
The result looks like this:
#---------------------------------------------------------------------
C1 is 1107
are 362
ai 218
Is 45
Are 32
Ai 13
#---------------------------------------------------------------------
C2 is 690
ai 592
are 344
Is 44
Are 34
Ai 31
#---------------------------------------------------------------------
AB is 968
are 335
Is 73
Are 52
ai 49
Ai 6
#---------------------------------------------------------------------
DE is 362
ai 330
are 175
Ai 21
Are 17
Is 11
Unfortunately, the group command does not allow the %c flag, so we have to do a bit of manual counting, adding up the frequencies of upper case and lower case instances of each word. This gives us the following table:
| Class | isn't | aren't | ain't |
|---|---|---|---|
| AB | 1041 | 387 | 55 |
| C1 | 1152 | 394 | 231 |
| C2 | 734 | 378 | 623 |
| DE | 373 | 192 | 351 |
We can now calculate the percentage of cases of ain't in each group of speakers by dividing the frequency of ain't by the overall frequency of all three forms in each class and multiplying the result by 100. This gives us percentages of 3.71% for the upper and middle middle class, 13.00% for the lower middle class, 35.91% for the skilled working class and 38.32% for the unskilled working class. In other words: the lower the social status, the higher the usage of ain't. Ain't No Love in the Heart of the City, as Bobby Bland sang, but there sure is a lot of ain't in the highrises on the outskirts.
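The case-folding and the percentage calculation can be checked with a few lines of Python, using the raw frequencies from the group output above:

```python
# Raw frequencies from the group output, per social class.
raw = {
    "AB": {"is": 968, "Is": 73, "are": 335, "Are": 52, "ai": 49, "Ai": 6},
    "C1": {"is": 1107, "Is": 45, "are": 362, "Are": 32, "ai": 218, "Ai": 13},
    "C2": {"is": 690, "Is": 44, "are": 344, "Are": 34, "ai": 592, "Ai": 31},
    "DE": {"is": 362, "Is": 11, "are": 175, "Are": 17, "ai": 330, "Ai": 21},
}

for cls, freqs in raw.items():
    # Add up upper- and lower-case instances of each form.
    isnt = freqs["is"] + freqs["Is"]
    arent = freqs["are"] + freqs["Are"]
    aint = freqs["ai"] + freqs["Ai"]
    # Percentage of ain't among all three forms in this class.
    pct = 100 * aint / (isnt + arent + aint)
    print(f"{cls}: {pct:.2f}% ain't")
```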