**[ [[cqp:introduction|Collection: Introduction to CQP]] ]**

====== 4b. Grouping ======

//This section explains how to create complex frequency lists from a concordance, where the frequencies of one variable are grouped by a second variable. It presupposes that you have read [[cqp:corpus-structure|Section 1]], [[cqp:simple-queries|Section 2]], [[cqp:complex-queries|Section 3c]] and [[cqp:frequency-lists|Section 4a]].//

===== Grouping by structural features =====

The frequency lists introduced in [[cqp:frequency-lists|Section 4a]] are very useful as long as we are interested in the frequency of single words or connected sequences of words. But sometimes we may want to determine the frequency of word combinations that do not occur in connected sequences. An example: Inspired by the line //You tricked me into loving you// from Alexandra Burke's song //You broke my heart//, we might wonder what verbs co-occur in the expression [VERB someone into VERBING]. The following query should find most cases of this expression in the BNC:

<code>
[pos="VV."] [] [word="into"%c] [pos="VVG"];
</code>

It looks for any lexical verb (i.e. any verb except //be//, //have// and //do//), followed by any word (this will capture the pronouns and proper names that most typically occur in this position in the expression), followed by the word form //into//, followed by a lexical verb in the //ing//-form.

If we create a lemmatized frequency list of this query, it looks as follows:

<code>
count Last by hw;
</code>

<code>
3   fool you into think              [#132-#134]
2   aggravate he into produce       [#0-#1]
2   bully i into say                [#39-#40]
2   con i into post                 [#66-#67]
2   delude themselves into believe  [#92-#93]
2   delude yourself into think      [#95-#96]
2   hoodwink we into believe        [#190-#191]
2   mislead you into think          [#253-#254]
2   talk he into give               [#361-#362]
2   talk he into let                [#365-#366]
2   talk i into come                [#371-#372]
1   fool anyone into think          [#144]
...
</code>

This is not useful, because we are not interested in the specific content of the second slot -- for example, the first and the last line of the excerpt shown here are examples of the same combination of verbs, //fool// and //think//, and we would like their frequencies to be combined. We can create a frequency list of just the verb combinations by using the ''group'' command as follows:

<code>
group Last matchend hw by match[0] hw;
</code>

This command creates a special type of frequency list where the last word in our query (''matchend'') is combined with the first word (''match[0]'') and then counted. The result looks like this, giving us a much clearer picture:

<code>
#---------------------------------------------------------------------
fool      think     12
mislead   think      6
deceive   think      5
delude    believe    4
mislead   believe    4
delude    think      3
fool      believe    3
force     make       3
talk      give       3
          go         3
trick     believe    3
...
</code>

(Note that the gap in the first column after //talk// indicates that the preceding verb is repeated.)

Now, this does not really look like a “grouped” frequency list yet, because the combinations are still listed in order of descending frequency.
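By the way, if you want to process such a list of pairs outside of CQP, you should be able to redirect the output of the ''group'' command into a file, just as with ''cat'' and ''count''. A minimal sketch -- the file name ''verb_pairs.txt'' is, of course, just an example:

<code>
group Last matchend hw by match[0] hw > "verb_pairs.txt";
</code>

The pairs should end up in the file as a plain-text table (one combination with its frequency per line), which you can then load into a spreadsheet or statistics program.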
We can //actually// group the list by modifying the command slightly and repeating the word ''group'' before the word ''by'':

<code>
group Last matchend hw group by match[0] hw;
</code>

If we run this command, we get an output like the following, which is sorted by the first verb (the grouping variable) first, and then by order of frequency:

<code>
#---------------------------------------------------------------------
talk      do         3
          give       3
          go         3
          accept     2
          come       2
          let        2
          take       2
          abandon    1
          act        1
          become     1
          believe    1
          buy        1
          care       1
          have       1
          join       1
          leave      1
          make       1
          offer      1
          play       1
          realize    1
          release    1
          ring       1
          stay       1
          try        1
          use        1
          wear       1
#---------------------------------------------------------------------
force     make       3
          accept     1
          agree      1
          choose     1
          commit     1
          concede    1
          discard    1
          do         1
          give       1
          have       1
          hide       1
          lose       1
          probe      1
          see        1
          speak      1
          switch     1
          try        1
          use        1
          work       1
...
</code>

Note that many combinations occur only once; we might not be interested in those. We can add the keyword ''cut'' at the end to specify the minimum frequency that a combination must have in order to be included in the list. For example:

<code>
group Last matchend hw group by match[0] hw cut 2;
</code>

This gives us the following, more readable output that we could now use to think about why we //talk// people into actions like //giving//, //going//, //accepting//, etc., but //fool// them into //thinking// or //believing// something:

<code>
#---------------------------------------------------------------------
talk       give      3
           go        3
           accept    2
           come      2
           let       2
           take      2
#---------------------------------------------------------------------
fool       think    12
           believe   3
#---------------------------------------------------------------------
force      make      3
#---------------------------------------------------------------------
bully      get       2
           say       2
           take      2
#---------------------------------------------------------------------
trick      believe   3
#---------------------------------------------------------------------
mislead    think     6
           believe   4
#---------------------------------------------------------------------
provoke    make      2
#---------------------------------------------------------------------
deceive    think     5
           believe   2
#---------------------------------------------------------------------
delude     believe   4
           think     3
#---------------------------------------------------------------------
lead       define    2
#---------------------------------------------------------------------
trap       make      2
#---------------------------------------------------------------------
con        post      2
#---------------------------------------------------------------------
lure       make      2
#---------------------------------------------------------------------
go         make      2
#---------------------------------------------------------------------
tempt      make      2
#---------------------------------------------------------------------
lull       believe   2
#---------------------------------------------------------------------
aggravate  produce   2
#---------------------------------------------------------------------
hoodwink   believe   2
</code>

Of course, you can switch the order of the variables to group the first verb by the second instead:

<code>
group Last match[0] hw group by matchend hw cut 2;
</code>

Look at the output and think about what this can show you as opposed to the table above.

===== Grouping by metadata =====

We can also use the ''group'' command to create frequency lists that are grouped by metadata -- in fact, this is probably the more typical way of using it.
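Before we look at a concrete example, here is the general pattern: we group a positional attribute (such as ''word'' or ''hw'') at one of the anchors by a structural attribute that carries the metadata. The following sketch assumes a corpus with a structural attribute ''text_genre'' -- a hypothetical name, so substitute whatever attributes your corpus actually defines (''show cd;'' will list them):

<code>
A = [word="(while|whilst)"%c];
group A match word group by match text_genre;
</code>

This would show, for each genre, how often each of the two forms occurs -- structurally the same procedure as the social-class example that follows.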
For example, we might be interested in whether there is a connection between social class and the usage of so-called “non-standard” language features such as the form //ain't//. If our corpus includes metadata about class, we can answer this question using a simple query and the ''group'' command as follows.

First, we create a concordance of the form //ain't// and the standard forms that it can represent -- //isn't// and //aren't// (the query below also allows for the vanishingly rare //amn't//). We limit this query to utterances that contain information about class, which in the BNC is recorded in the attribute ''u_class'' using the labels ''AB'' (roughly ‘upper and middle middle class’), ''C1'' (roughly ‘lower middle class’), ''C2'' (roughly ‘skilled working class’) and ''DE'' (roughly ‘unskilled working class’):

<code>
[word="(am|are|is|ai)"%c] [word="n't"%c] :: match.u_class="(AB|C1|C2|DE)";
</code>

We can then group the frequency of the first word of the query by the values of the variable ''u_class'':

<code>
group Last match word group by match u_class;
</code>

The result looks like this:

<code>
#---------------------------------------------------------------------
C1   is    1107
     are    362
     ai     218
     Is      45
     Are     32
     Ai      13
#---------------------------------------------------------------------
C2   is     690
     ai     592
     are    344
     Is      44
     Are     34
     Ai      31
#---------------------------------------------------------------------
AB   is     968
     are    335
     Is      73
     Are     52
     ai      49
     Ai       6
#---------------------------------------------------------------------
DE   is     362
     ai     330
     are    175
     Ai      21
     Are     17
     Is      11
</code>

Unfortunately, the ''group'' command does not allow the ''%c'' flag, so we have to do a bit of manual counting, adding up the frequencies of the upper-case and lower-case instances of each word. This gives us the following table:

^ Class ^ //isn't// ^ //aren't// ^ //ain't// ^
| AB | 1041 | 387 | 55 |
| C1 | 1152 | 394 | 231 |
| C2 | 734 | 378 | 623 |
| DE | 373 | 192 | 351 |

We can now calculate the percentage of cases of //ain't// in each group of speakers by dividing the frequency of //ain't// by the overall frequency of all three forms in that class and multiplying the result by 100 -- for class AB, for example, 55 ÷ (1041 + 387 + 55) × 100 = 3.71%. This gives us percentages of 3.71% for the upper and middle middle class, 13.00% for the lower middle class, 35.91% for the skilled working class and 38.32% for the unskilled working class. In other words: the lower the social status, the higher the usage of //ain't// -- //Ain't No Love in the Heart of the City//, as Bobby Bland sang, but there sure is a lot of //ain't// in the highrises on the outskirts.

===== Summary and outlook =====

This section introduced the ''group'' command. You can now read [[cqp:collocates|Section 4c]] and then continue on to the expert tricks!

**[ Introduction to CQP: [[cqp:corpus-structure|Section 1]] -- [[cqp:simple-queries|Section 2]] -- [[cqp:advanced-querying|Section 3]] -- [[cqp:beyond-queries|Section 4]] -- [[cqp:expert-tricks|Section 5]] -- [[cqp:exercises|Section 6]] ]**