[ Collection: Introduction to CQP ]

4b. Grouping

This section explains how to create complex frequency lists from a concordance, where the frequencies of one variable are grouped by a second variable. It presupposes that you have read Section 1, Section 2, Section 3c and Section 4a.

Grouping by structural features

The frequency lists introduced in Section 4a are very useful as long as we are interested in the frequency of single words or connected sequences of words. But sometimes, we may want to determine the frequency of word combinations that do not occur in connected sequences.

An example: Inspired by the line You tricked me into loving you from Alexandra Burke's song You broke my heart, we might wonder what verbs co-occur in the expression [VERB someone into VERBING]. The following query should find most cases of this expression in the BNC:

[pos="VV."] [] [word="into"%c] [pos="VVG"]

It looks for any lexical verb (i.e. any verb except be, have and do), followed by any word (this will capture pronouns and proper names, which most typically occur in this position in the expression), followed by the word form into, followed by a lexical verb in the ing-form.

If we create a lemmatized frequency list of this query, it looks as follows:

count Last by hw;
3       fool you into think  [#132-#134]
2       aggravate he into produce  [#0-#1]
2       bully i into say  [#39-#40]
2       con i into post  [#66-#67]
2       delude themselves into believe  [#92-#93]
2       delude yourself into think  [#95-#96]
2       hoodwink we into believe  [#190-#191]
2       mislead you into think  [#253-#254]
2       talk he into give  [#361-#362]
2       talk he into let  [#365-#366]
2       talk i into come  [#371-#372]
1       fool anyone into think  [#144]
...

This is not useful, because we are not interested in the specific content of the second slot. For example, the first and the last line of the excerpt shown here are instances of the same combination of verbs, fool and think, and we would like their frequencies to be combined.

We can create a frequency list of just the verb combinations by using the group command as follows:

group Last matchend hw by match[0] hw;

This command creates a special type of frequency list where the last word in our query (matchend) is combined with the first word (match[0]) and then counted. The result looks like this, giving us a much clearer picture:

#---------------------------------------------------------------------
fool                          think                                 12
mislead                       think                                  6
deceive                       think                                  5
delude                        believe                                4
mislead                       believe                                4
delude                        think                                  3
fool                          believe                                3
force                         make                                   3
talk                          give                                   3
                              go                                     3
trick                         believe                                3
...

(Note that the gap in the first column after talk indicates that the preceding verb is repeated.)
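To make the counting logic concrete, here is a minimal Python sketch of what the grouped list amounts to: each hit is reduced to its first and last lemma, and the resulting pairs are counted. The hits below are invented for illustration (they are not BNC data), and the function name pair_counts is hypothetical; this is not how CQP itself is implemented.

```python
from collections import Counter

# Invented concordance hits: (first verb lemma, filler, last verb lemma).
hits = [
    ("fool", "you", "think"),
    ("fool", "anyone", "think"),
    ("talk", "he", "give"),
    ("fool", "you", "think"),
    ("talk", "i", "come"),
]


def pair_counts(rows):
    """Count (first, last) lemma pairs, ignoring the middle slot."""
    return Counter((first, last) for first, _filler, last in rows)


counts = pair_counts(hits)

# Print in descending order of frequency, similar to the listing above.
for (first, last), n in counts.most_common():
    print(f"{first:<30}{last:<30}{n:>10}")
```

Because the middle slot is dropped before counting, fool you into think and fool anyone into think end up in the same row.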

Now, this does not really look like a “grouped” frequency list yet, because the combinations are still listed in order of descending frequency. We can actually group the list by modifying the command slightly and repeating the word group before the word by:

group Last matchend hw group by match[0] hw;

If we run this command, we get an output like the following, which is sorted by the first verb (the grouping variable) first, and then by order of frequency:

#---------------------------------------------------------------------
talk                          do                                     3
                              give                                   3
                              go                                     3
                              accept                                 2
                              come                                   2
                              let                                    2
                              take                                   2
                              abandon                                1
                              act                                    1
                              become                                 1
                              believe                                1
                              buy                                    1
                              care                                   1
                              have                                   1
                              join                                   1
                              leave                                  1
                              make                                   1
                              offer                                  1
                              play                                   1
                              realize                                1
                              release                                1
                              ring                                   1
                              stay                                   1
                              try                                    1
                              use                                    1
                              wear                                   1
#---------------------------------------------------------------------
force                         make                                   3
                              accept                                 1
                              agree                                  1
                              choose                                 1
                              commit                                 1
                              concede                                1
                              discard                                1
                              do                                     1
                              give                                   1
                              have                                   1
                              hide                                   1
                              lose                                   1
                              probe                                  1
                              see                                    1
                              speak                                  1
                              switch                                 1
                              try                                    1
                              use                                    1
                              work                                   1
...
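The difference between the flat list and the grouped list can be sketched in Python as follows. The pair frequencies are copied from the flat list shown earlier in this section; the function name group_by_first is hypothetical. Two choices are assumptions of this sketch rather than documented CQP behaviour: ties within a group are broken alphabetically, and groups appear in the order in which their first verb is first encountered.

```python
from collections import defaultdict

# Pair frequencies taken from the flat (ungrouped) list above.
pair_freq = {
    ("fool", "think"): 12,
    ("mislead", "think"): 6,
    ("delude", "believe"): 4,
    ("mislead", "believe"): 4,
    ("delude", "think"): 3,
    ("fool", "believe"): 3,
}


def group_by_first(freq):
    """Collect the second verbs under each first verb."""
    groups = defaultdict(list)
    for (first, last), n in freq.items():
        groups[first].append((last, n))
    # Within each group: descending frequency, ties alphabetical (assumption).
    for members in groups.values():
        members.sort(key=lambda item: (-item[1], item[0]))
    return dict(groups)


grouped = group_by_first(pair_freq)

for first, members in grouped.items():
    print("#" + "-" * 69)
    for i, (last, n) in enumerate(members):
        label = first if i == 0 else ""  # leave a gap for repeated verbs
        print(f"{label:<30}{last:<30}{n:>10}")
```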

Note that many combinations occur only once, and we might not be interested in those. We can add the keyword cut at the end of the command to specify the minimum frequency that a combination must have in order to be included in the list. For example:

group Last matchend hw group by match[0] hw cut 2;

This gives us the following, more readable output that we could now use to think about why we talk people into actions like giving, going, accepting, etc., but fool them into thinking or believing something:

#---------------------------------------------------------------------
talk                          give                                   3
                              go                                     3
                              accept                                 2
                              come                                   2
                              let                                    2
                              take                                   2
#---------------------------------------------------------------------
fool                          think                                 12
                              believe                                3
#---------------------------------------------------------------------
force                         make                                   3
#---------------------------------------------------------------------
bully                         get                                    2
                              say                                    2
                              take                                   2
#---------------------------------------------------------------------
trick                         believe                                3
#---------------------------------------------------------------------
mislead                       think                                  6
                              believe                                4
#---------------------------------------------------------------------
provoke                       make                                   2
#---------------------------------------------------------------------
deceive                       think                                  5
                              believe                                2
#---------------------------------------------------------------------
delude                        believe                                4
                              think                                  3
#---------------------------------------------------------------------
lead                          define                                 2
#---------------------------------------------------------------------
trap                          make                                   2
#---------------------------------------------------------------------
con                           post                                   2
#---------------------------------------------------------------------
lure                          make                                   2
#---------------------------------------------------------------------
go                            make                                   2
#---------------------------------------------------------------------
tempt                         make                                   2
#---------------------------------------------------------------------
lull                          believe                                2
#---------------------------------------------------------------------
aggravate                     produce                                2
#---------------------------------------------------------------------
hoodwink                      believe                                2
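The effect of the frequency threshold is simply a filter on the counted pairs. The following Python sketch illustrates this with a few values taken from the talk and force groups shown earlier; the function name cut is chosen to mirror the CQP keyword and is otherwise hypothetical.

```python
# A few pair frequencies from the talk and force groups shown earlier.
pair_freq = {
    ("talk", "give"): 3,
    ("talk", "go"): 3,
    ("talk", "accept"): 2,
    ("talk", "abandon"): 1,
    ("force", "make"): 3,
    ("force", "accept"): 1,
}


def cut(freq, minimum):
    """Keep only the pairs that occur at least `minimum` times."""
    return {pair: n for pair, n in freq.items() if n >= minimum}


kept = cut(pair_freq, 2)

for (first, last), n in kept.items():
    print(f"{first:<30}{last:<30}{n:>10}")
```

With a threshold of 2, the pairs talk/abandon and force/accept are dropped, while every pair with a frequency of 2 or more is kept.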

Of course, you can switch the order of variables to group the first verb by the second instead:

group Last match[0] hw group by matchend hw cut 2;

Look at the output and think about what this can show you as opposed to the table above.

Grouping by metadata

We can also use the group command to create frequency lists that are grouped by metadata – in fact, this is probably the more typical way of using it.

For example, we might be interested in whether there is a connection between social class and the usage of so-called “non-standard” language features such as the form ain't. If our corpus includes metadata about class, we can answer this question using a simple query and the group command as follows.

First, we create a concordance of the form ain't and the two standard forms that it can represent – isn't and aren't. We limit this query to utterances that contain information about class, which in the BNC is recorded in the attribute u_class using the labels AB (roughly ‘upper and middle middle class’), C1 (roughly ‘lower middle class’), C2 (roughly ‘skilled working class’) and DE (roughly ‘unskilled working class’):

[word="(am|are|is|ai)"%c] [word="n't"%c] :: match.u_class="(AB|C1|C2|DE)"

We can then group the frequency of the first word of the query by the values of the variable u_class:

group Last match word group by match u_class;

The result looks like this:

#---------------------------------------------------------------------
C1                            is                                  1107
                              are                                  362
                              ai                                   218
                              Is                                    45
                              Are                                   32
                              Ai                                    13
#---------------------------------------------------------------------
C2                            is                                   690
                              ai                                   592
                              are                                  344
                              Is                                    44
                              Are                                   34
                              Ai                                    31
#---------------------------------------------------------------------
AB                            is                                   968
                              are                                  335
                              Is                                    73
                              Are                                   52
                              ai                                    49
                              Ai                                     6
#---------------------------------------------------------------------
DE                            is                                   362
                              ai                                   330
                              are                                  175
                              Ai                                    21
                              Are                                   17
                              Is                                    11

Unfortunately, the group command does not allow the %c flag, so we have to do a bit of manual counting, adding up the frequencies of upper case and lower case instances of each word. This gives us the following table:

Class    isn't    aren't    ain't
AB        1041       387       55
C1        1152       394      231
C2         734       378      623
DE         373       192      351
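If you prefer not to add up the upper-case and lower-case variants by hand, the same step can be done outside CQP. The following Python sketch uses the frequencies from the grouped output above and merges forms that differ only in case; the function name fold_case is hypothetical.

```python
from collections import Counter

# Frequencies copied from the grouped output above, per social class.
raw = {
    "AB": {"is": 968, "are": 335, "Is": 73, "Are": 52, "ai": 49, "Ai": 6},
    "C1": {"is": 1107, "are": 362, "ai": 218, "Is": 45, "Are": 32, "Ai": 13},
    "C2": {"is": 690, "ai": 592, "are": 344, "Is": 44, "Are": 34, "Ai": 31},
    "DE": {"is": 362, "ai": 330, "are": 175, "Ai": 21, "Are": 17, "Is": 11},
}


def fold_case(counts):
    """Add up the frequencies of forms that differ only in case."""
    folded = Counter()
    for form, n in counts.items():
        folded[form.lower()] += n
    return folded


table = {cls: fold_case(forms) for cls, forms in raw.items()}

for cls, row in table.items():
    print(f"{cls:<6}{row['is']:>8}{row['are']:>8}{row['ai']:>8}")
```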

We can now calculate the percentage of cases of ain't in each group of speakers by dividing the frequency of ain't by the overall frequency of all three forms in each class and multiplying the result by 100. This gives us percentages of 3.71% for the upper and middle middle class, 13.00% for the lower middle class, 35.91% for the skilled working class and 38.32% for the unskilled working class. In other words: the lower the social status, the higher the usage of ain't. Ain't No Love in the Heart of the City, as Bobby Bland sang, but sure is a lot of ain't in the highrises on the outskirts.
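The percentage calculation can be checked with a few lines of Python. The sketch below uses the frequencies from the table above; the function name aint_share is hypothetical.

```python
# Frequencies from the table above (upper and lower case already combined).
table = {
    "AB": {"isn't": 1041, "aren't": 387, "ain't": 55},
    "C1": {"isn't": 1152, "aren't": 394, "ain't": 231},
    "C2": {"isn't": 734, "aren't": 378, "ain't": 623},
    "DE": {"isn't": 373, "aren't": 192, "ain't": 351},
}


def aint_share(row):
    """Percentage of ain't among all three forms in one class."""
    total = sum(row.values())
    return 100 * row["ain't"] / total


for cls, row in table.items():
    print(f"{cls}: {aint_share(row):.2f}%")
# AB: 3.71%
# C1: 13.00%
# C2: 35.91%
# DE: 38.32%
```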

Summary and outlook

This section introduced the group command. You can now read Section 4c and then continue on to the expert tricks!


cqp/grouping-data.txt · Last modified: by astefanowitsch
