5d. Tidying up the output

The concordances and the output of the count and group commands in CQP can be saved to text files and viewed in a text editor. Often, however, you need a more structured format that you can import into spreadsheet programs (like LibreOffice Calc, MS Excel or Apple Numbers) or into statistics programs like R. For this purpose, all three types of output can be converted to CSV files using a small program we provide as part of INLET: tidycwb.pl. Regardless of which type of output you are dealing with, you can simply send it to this program before saving it to a file; the script recognizes the type of output and converts it accordingly.

Concordances

Let us assume you have created a concordance of the lemma love in the BNC and saved it in a variable called ''Love''. Instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:

cat Love > " | tidycwb.pl > love.csv"

The script will create a CSV file with the corpus in the first column and the corpus position in the second column, followed by any metadata you have chosen to display using the PrintStructures setting (each in its own column), followed by the left context, the hit and the right context, again each in its own column. For example, if you have activated the PrintStructures text_id and text_author_sex, the regular concordance would have looked like this:

 2233: <text_id A00><text_author_sex unknown>: s have provided much <love> and care to many hu
 8920: <text_id A01><text_author_sex unknown>:  ‘ I think I 'm in <love> … ’ ‘ How do 
13733: <text_id A01><text_author_sex unknown>:  stress to those you <love> most . Not to have 
15915: <text_id A01><text_author_sex unknown>: demonstration of the <love> of Jesus shown by y
38084: <text_id A03><text_author_sex mixed>: oes , the people all <love> the King so much it
42738: <text_id A04><text_author_sex male>: tion and imaginative <loves> , the return of the
47797: <text_id A04><text_author_sex male>: ything , has neither <love> nor hate , and volu
60042: <text_id A04><text_author_sex male>: ted on it because he <loved> it , and he thereby
61421: <text_id A04><text_author_sex male>:  . The reader with a <love> of art is not alway
69613: <text_id A04><text_author_sex male>: <80><99> John Constable 's <love> for the work of van

In contrast, the tidied concordance now looks like this:

"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
"BNC","8920","A01","unknown","‘ I think I 'm in","love","… ’ ‘ How do "
"BNC","13733","A01","unknown","stress to those you","love","most . Not to have "
"BNC","15915","A01","unknown","demonstration of the","love","of Jesus shown by y"
"BNC","38084","A03","mixed","oes , the people all","love","the King so much it"
"BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
"BNC","47797","A04","male","ything , has neither","love","nor hate , and volu"
"BNC","60042","A04","male","ted on it because he","loved","it , and he thereby"
"BNC","61421","A04","male",". The reader with a","love","of art is not alway"
"BNC","69613","A04","male","<80><99> John Constable 's","love","for the work of van"

When imported into a spreadsheet program, this file is displayed as follows. You can now add further columns for your own annotation of the hits:

| BNC | 2233 | A00 | unknown | s have provided much | love | and care to many hu |
| BNC | 8920 | A01 | unknown | ‘ I think I 'm in | love | … ’ ‘ How do |
| BNC | 13733 | A01 | unknown | stress to those you | love | most . Not to have |
| BNC | 15915 | A01 | unknown | demonstration of the | love | of Jesus shown by y |
| BNC | 38084 | A03 | mixed | oes , the people all | love | the King so much it |
| BNC | 42738 | A04 | male | tion and imaginative | loves | , the return of the |
| BNC | 47797 | A04 | male | ything , has neither | love | nor hate , and volu |
| BNC | 60042 | A04 | male | ted on it because he | loved | it , and he thereby |
| BNC | 61421 | A04 | male | . The reader with a | love | of art is not alway |
| BNC | 69613 | A04 | male | <80><99> John Constable 's | love | for the work of van |
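
If you would rather prepare the annotation column programmatically than by hand, the tidied concordance can be read with any CSV library. The following is a minimal Python sketch and not part of INLET. It assumes the column order described above (corpus, corpus position, one column per PrintStructure, left context, hit, right context). The column names and the helper add_annotation_column are made up for illustration, and two rows from the example above are embedded so that the snippet runs on its own.

```python
import csv
import io

# Two rows copied from the tidied concordance shown above.
SAMPLE = '''"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
"BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
'''

# Hypothetical column names: the example output above has no header row.
COLUMNS = ["corpus", "position", "text_id", "text_author_sex", "left", "hit", "right"]


def add_annotation_column(lines, name="annotation"):
    """Read tidied concordance rows and return dicts with an empty extra column."""
    rows = []
    for fields in csv.reader(lines):
        row = dict(zip(COLUMNS, fields))
        row[name] = ""  # left empty, to be filled in manually or by a later step
        rows.append(row)
    return rows


rows = add_annotation_column(io.StringIO(SAMPLE))
for row in rows:
    print(row["position"], row["hit"], repr(row["annotation"]))
```

For a real file, you would pass an open file handle (for example open("love.csv", newline="", encoding="utf-8")) instead of the embedded sample, and adapt COLUMNS to the PrintStructures you activated.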

Frequency lists

Let us assume you have created a concordance of the lemma love in the BNC and you want to create and save a frequency list of the word forms. Again, instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:

count Love by word > " | tidycwb.pl > love.csv"

The script will create a frequency list with the word form in the first column and the frequency in the second column. Saving the output directly would have given you the following output:

20160   love  [#2308-#22467]
4253    loved  [#22468-#26720]
1969    Love  [#207-#2175]
1295    loves  [#26721-#28015]
463     loving  [#28017-#28479]
186     LOVE  [#0-#185]
51      Loved  [#2176-#2226]
41      Loves  [#2227-#2267]
40      Loving  [#2268-#2307]
9       LOVED  [#186-#194]
7       LOVING  [#200-#206]
5       LOVES  [#195-#199]
1       lovest  [#28016]

In contrast, the tidied frequency list looks like this:

"love",20160
"loved",4253
"Love",1969
"loves",1295
"loving",463
"LOVE",186
"Loved",51
"Loves",41
"Loving",40
"LOVED",9
"LOVING",7
"LOVES",5
"lovest",1

Or, imported into a spreadsheet:

| love | 20160 |
| loved | 4253 |
| Love | 1969 |
| loves | 1295 |
| loving | 463 |
| LOVE | 186 |
| Loved | 51 |
| Loves | 41 |
| Loving | 40 |
| LOVED | 9 |
| LOVING | 7 |
| LOVES | 5 |
| lovest | 1 |
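
Because the frequency list counts word forms case-sensitively, love, Love and LOVE appear as separate rows. If you want one total per form regardless of capitalization, the tidied file is easy to post-process. The following Python sketch is an illustration only and not part of INLET; the function name merge_case_variants is made up, and the rows from the example above are embedded so that the snippet runs on its own.

```python
import csv
import io
from collections import Counter

# The tidied frequency list from the example above.
SAMPLE = '''"love",20160
"loved",4253
"Love",1969
"loves",1295
"loving",463
"LOVE",186
"Loved",51
"Loves",41
"Loving",40
"LOVED",9
"LOVING",7
"LOVES",5
"lovest",1
'''


def merge_case_variants(lines):
    """Sum the frequencies of word forms that differ only in capitalization."""
    totals = Counter()
    for word, freq in csv.reader(lines):
        totals[word.lower()] += int(freq)
    return totals


totals = merge_case_variants(io.StringIO(SAMPLE))
# Print the merged list, most frequent form first.
for word, freq in totals.most_common():
    print(word, freq)
```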

Output of the ''group'' command

Let us assume you have created a concordance of the lemma love in the BNC and you want to group the part of speech (using the class tag) by the text mode. Again, instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:

group Love match class by match text_mode  > " | tidycwb.pl > love.csv"

If you had saved the output directly, it would have looked like this:

#---------------------------------------------------------------------
written                       SUBST                              13041
                              VERB                               12364
spoken                        VERB                                1831
                              SUBST                               1190
---                           SUBST                                 39
                              VERB                                  12
spoken                        UNC                                    2
written                       ADJ                                    1

Instead, the tidied output looks like this:

"written","SUBST",13041
"written","VERB",12364
"spoken","VERB",1831
"spoken","SUBST",1190
"---","SUBST",39
"---","VERB",12
"spoken","UNC",2
"written","ADJ",1

Or, imported into a spreadsheet, like this:

| written | SUBST | 13041 |
| written | VERB | 12364 |
| spoken | VERB | 1831 |
| spoken | SUBST | 1190 |
| --- | SUBST | 39 |
| --- | VERB | 12 |
| spoken | UNC | 2 |
| written | ADJ | 1 |

Note that the default output does not repeat the contents of the first column if they would be the same as in the preceding row. This means you cannot sort it, because rows with an empty first column would no longer be next to the row that identifies them. The tidied output repeats the contents of the first column in every row, so you do not lose any information when you sort it.
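
To illustrate why the repeated first column matters, the following Python sketch sorts the tidied group output by part of speech. It is an illustration only and not part of INLET; the function name read_group_rows is made up, and the rows from the example above are embedded so that the snippet runs on its own. Every row remains interpretable after sorting because it still carries its own text mode.

```python
import csv
import io

# The tidied group output from the example above.
SAMPLE = '''"written","SUBST",13041
"written","VERB",12364
"spoken","VERB",1831
"spoken","SUBST",1190
"---","SUBST",39
"---","VERB",12
"spoken","UNC",2
"written","ADJ",1
'''


def read_group_rows(lines):
    """Return (text_mode, class, frequency) tuples from tidied group output."""
    return [(mode, pos, int(freq)) for mode, pos, freq in csv.reader(lines)]


rows = read_group_rows(io.StringIO(SAMPLE))

# Sort by part of speech, then by descending frequency within each class.
by_pos = sorted(rows, key=lambda r: (r[1], -r[2]))
for mode, pos, freq in by_pos:
    print(pos, mode, freq)
```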


cqp/cleaning-output.txt · Last modified: 2020/07/16 23:23 by astefanowitsch