5d. Tidying up the output
The concordances and the output created by the count and group commands in CQP can be saved to text files and viewed using a text editor, but often you need a more structured format that you can import into spreadsheet programs (like LibreOffice Calc, MS Excel or Apple Numbers) or into statistics programs like R. For this purpose, all three types of output can be converted to CSV files using a small program we provide as part of INLET: tidycwb.pl. Regardless of which type of output you are dealing with, you can simply send it to this program before saving it to a file; it will recognize the type of output and convert it accordingly.
Concordances
Let us assume you have created a concordance of the lemma love in the BNC and saved it in a variable called Love. Instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:
cat Love > " | tidycwb.pl > love.csv"
The script will create a CSV file with the corpus in the first column and the corpus position in the second column, followed by any metadata you have displayed using the PrintStructures option (each in its own column), followed by the left context, the hit and the right context, again each in its own column. For example, if you have activated the PrintStructures text_id and text_author_sex, the regular concordance would have looked like this:
2233: <text_id A00><text_author_sex unknown>: s have provided much <love> and care to many hu
8920: <text_id A01><text_author_sex unknown>: ‘ I think I 'm in <love> … ’ ‘ How do
13733: <text_id A01><text_author_sex unknown>: stress to those you <love> most . Not to have
15915: <text_id A01><text_author_sex unknown>: demonstration of the <love> of Jesus shown by y
38084: <text_id A03><text_author_sex mixed>: oes , the people all <love> the King so much it
42738: <text_id A04><text_author_sex male>: tion and imaginative <loves> , the return of the
47797: <text_id A04><text_author_sex male>: ything , has neither <love> nor hate , and volu
60042: <text_id A04><text_author_sex male>: ted on it because he <loved> it , and he thereby
61421: <text_id A04><text_author_sex male>: . The reader with a <love> of art is not alway
69613: <text_id A04><text_author_sex male>: <80><99> John Constable 's <love> for the work of van
In contrast, the tidied concordance now looks like this:
"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
"BNC","8920","A01","unknown","‘ I think I 'm in","love","… ’ ‘ How do "
"BNC","13733","A01","unknown","stress to those you","love","most . Not to have "
"BNC","15915","A01","unknown","demonstration of the","love","of Jesus shown by y"
"BNC","38084","A03","mixed","oes , the people all","love","the King so much it"
"BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
"BNC","47797","A04","male","ything , has neither","love","nor hate , and volu"
"BNC","60042","A04","male","ted on it because he","loved","it , and he thereby"
"BNC","61421","A04","male",". The reader with a","love","of art is not alway"
"BNC","69613","A04","male","<80><99> John Constable 's","love","for the work of van"
When imported into a spreadsheet program, this file is displayed as follows. You can now add further columns for your own annotation of the hits:
BNC | 2233 | A00 | unknown | s have provided much | love | and care to many hu |
BNC | 8920 | A01 | unknown | ‘ I think I 'm in | love | … ’ ‘ How do |
BNC | 13733 | A01 | unknown | stress to those you | love | most . Not to have |
BNC | 15915 | A01 | unknown | demonstration of the | love | of Jesus shown by y |
BNC | 38084 | A03 | mixed | oes , the people all | love | the King so much it |
BNC | 42738 | A04 | male | tion and imaginative | loves | , the return of the |
BNC | 47797 | A04 | male | ything , has neither | love | nor hate , and volu |
BNC | 60042 | A04 | male | ted on it because he | loved | it , and he thereby |
BNC | 61421 | A04 | male | . The reader with a | love | of art is not alway |
BNC | 69613 | A04 | male | <80><99> John Constable 's | love | for the work of van |
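If you would rather add the annotation column programmatically than in a spreadsheet, the following Python sketch shows one possible way. It is only an illustration: the column names and the helper functions are our own labels and not part of tidycwb.pl, and it assumes that the tidied file has no header row, as in the example above. Two rows from the example are embedded as a string so that the sketch runs on its own; in practice you would open love.csv instead.

```python
import csv
import io

# Two rows copied from the tidied concordance above (stand-in for love.csv).
TIDIED = '''"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
"BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
'''

# Assumption: these column names are our own labels; the file has no header row.
COLUMNS = ["corpus", "cpos", "text_id", "text_author_sex", "left", "hit", "right"]


def read_concordance(handle):
    """Return one dict per concordance line, keyed by COLUMNS."""
    return [dict(zip(COLUMNS, row)) for row in csv.reader(handle)]


def add_annotation_column(rows, name="annotation"):
    """Add an empty column that can be filled in later."""
    for row in rows:
        row[name] = ""
    return rows


rows = add_annotation_column(read_concordance(io.StringIO(TIDIED)))

# Write the rows back out, now with a header row and the extra column.
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=COLUMNS + ["annotation"],
    quoting=csv.QUOTE_ALL,
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue(), end="")
```

Because the contexts are quoted, commas inside them (as in the second row) stay inside their field when the file is read.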
Frequency lists
Let us assume you have created a concordance of the lemma love in the BNC and you want to create and save a frequency list of the word forms. Again, instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:
count Love by word > " | tidycwb.pl > love.csv"
The script will create a frequency list with the word form in the first column and the frequency in the second column. Saving the output directly would have given you the following output:
20160 love [#2308-#22467]
4253 loved [#22468-#26720]
1969 Love [#207-#2175]
1295 loves [#26721-#28015]
463 loving [#28017-#28479]
186 LOVE [#0-#185]
51 Loved [#2176-#2226]
41 Loves [#2227-#2267]
40 Loving [#2268-#2307]
9 LOVED [#186-#194]
7 LOVING [#200-#206]
5 LOVES [#195-#199]
1 lovest [#28016]
In contrast, the tidied frequency list looks like this:
"love",20160
"loved",4253
"Love",1969
"loves",1295
"loving",463
"LOVE",186
"Loved",51
"Loves",41
"Loving",40
"LOVED",9
"LOVING",7
"LOVES",5
"lovest",1
Or, imported into a spreadsheet:
love | 20160 |
loved | 4253 |
Love | 1969 |
loves | 1295 |
loving | 463 |
LOVE | 186 |
Loved | 51 |
Loves | 41 |
Loving | 40 |
LOVED | 9 |
LOVING | 7 |
LOVES | 5 |
lovest | 1 |
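To make the difference between the two formats concrete, here is a minimal Python sketch of the kind of conversion involved. It is not the code of tidycwb.pl, only an illustration under assumptions: the regular expression, the function names and the exact spacing of the raw lines are ours. It reads lines of the form "frequency, word form, range of match numbers in square brackets" and writes the word form and the frequency as two CSV fields.

```python
import csv
import io
import re

# A few lines in the shape of the raw count output shown above
# (assumption: the exact spacing between the fields may differ in practice).
RAW = """\
20160  love  [#2308-#22467]
 4253  loved  [#22468-#26720]
 1969  Love  [#207-#2175]
    1  lovest  [#28016]
"""

# frequency, word form, then a match range such as [#0-#185] or [#28016]
COUNT_LINE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s+\[#\d+(?:-#\d+)?\]\s*$")


def parse_count_output(raw):
    """Return (word form, frequency) pairs; lines that do not match are skipped."""
    rows = []
    for line in raw.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            rows.append((match.group(2), int(match.group(1))))
    return rows


def to_csv(rows):
    """Quote the word forms but leave the frequencies as bare numbers."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


rows = parse_count_output(RAW)
print(to_csv(rows), end="")
```

The match ranges in square brackets are dropped because they only refer to positions within the sorted query result and are of no use in a spreadsheet.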
Output of the group command
Let us assume you have created a concordance of the lemma love in the BNC and you want to group the part of speech (using the class tag) by the text mode. Again, instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:
group Love match class by match text_mode > " | tidycwb.pl > love.csv"
If you had saved the output directly, it would have looked like this:
#---------------------------------------------------------------------
written        SUBST      13041
               VERB       12364
spoken         VERB        1831
               SUBST       1190
---            SUBST         39
               VERB          12
spoken         UNC            2
written        ADJ            1
Instead, the tidied output looks like this:
"written","SUBST",13041
"written","VERB",12364
"spoken","VERB",1831
"spoken","SUBST",1190
"---","SUBST",39
"---","VERB",12
"spoken","UNC",2
"written","ADJ",1
Or, imported into a spreadsheet, like this:
written | SUBST | 13041 |
written | VERB | 12364 |
spoken | VERB | 1831 |
spoken | SUBST | 1190 |
--- | SUBST | 39 |
--- | VERB | 12 |
spoken | UNC | 2 |
written | ADJ | 1 |
Note that the default output does not repeat the value in the first column when it is the same as in the row above, which means you cannot sort it without losing information. The tidied output repeats the value of the first column in every row, so you can sort it without losing anything.
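The following Python sketch illustrates why repeating the first column matters. It is not the code of tidycwb.pl: the function name is ours, the column spacing of the raw output is an assumption, and the sketch assumes that no value contains whitespace. Rows with only two fields inherit the first column from the row above, after which the rows can be sorted freely.

```python
# Raw group output in the shape shown above (assumption: exact spacing may differ).
RAW = """\
#---------------------------------------------------------------------
written        SUBST      13041
               VERB       12364
spoken         VERB        1831
               SUBST       1190
---            SUBST         39
               VERB          12
spoken         UNC            2
written        ADJ            1
"""


def fill_down(raw):
    """Return (text_mode, class, frequency) rows with the first column repeated."""
    rows = []
    current = None  # last value seen in the first column
    for line in raw.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the separator line
        fields = line.split()
        if len(fields) == 3:
            # A full row: remember its first column for the rows below it.
            current, fields = fields[0], fields[1:]
        rows.append((current, fields[0], int(fields[1])))
    return rows


rows = fill_down(RAW)

# Sorting by part of speech is now safe: every row carries its own text mode.
for text_mode, pos, freq in sorted(rows, key=lambda row: row[1]):
    print(text_mode, pos, freq)
```

Had the rows been sorted before filling in the first column, the two-field rows would have been separated from the row that supplied their text mode.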