5d. Tidying up the output

Concordances and the output of the count and group commands in CQP can be saved to text files and viewed in a text editor. Often, however, you need a more structured format that you can import into spreadsheet programs (like LibreOffice Calc, MS Excel or Apple Numbers) or into statistics programs like R. For this purpose, all three types of output can be converted to CSV files using a small program we provide as part of INLET: tidycwb.pl. Regardless of the type of output, you can simply send it to this program before saving it to a file; it will recognize which type of output it has received and convert it accordingly.

Concordances

Let us assume you have created a concordance of the lemma love in the BNC and saved it in a variable called ''Love''. Instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:

cat Love > " | tidycwb.pl > love.csv"

The script will create a CSV file with the corpus in the first column and the corpus position in the second column, followed by any metadata you have displayed using PrintStructures (each attribute in its own column), followed by the left context, the hit and the right context, each in its own column. For example, if you have activated the PrintStructures text_id and text_author_sex, the regular concordance would have looked like this:

 2233: <text_id A00><text_author_sex unknown>: s have provided much <love> and care to many hu
 8920: <text_id A01><text_author_sex unknown>:  ‘ I think I 'm in <love> … ’ ‘ How do 
13733: <text_id A01><text_author_sex unknown>:  stress to those you <love> most . Not to have 
15915: <text_id A01><text_author_sex unknown>: demonstration of the <love> of Jesus shown by y
38084: <text_id A03><text_author_sex mixed>: oes , the people all <love> the King so much it
42738: <text_id A04><text_author_sex male>: tion and imaginative <loves> , the return of the
47797: <text_id A04><text_author_sex male>: ything , has neither <love> nor hate , and volu
60042: <text_id A04><text_author_sex male>: ted on it because he <loved> it , and he thereby
61421: <text_id A04><text_author_sex male>:  . The reader with a <love> of art is not alway
69613: <text_id A04><text_author_sex male>: ’ John Constable 's <love> for the work of van

In contrast, the tidied concordance now looks like this:

"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
"BNC","8920","A01","unknown","‘ I think I 'm in","love","… ’ ‘ How do "
"BNC","13733","A01","unknown","stress to those you","love","most . Not to have "
"BNC","15915","A01","unknown","demonstration of the","love","of Jesus shown by y"
"BNC","38084","A03","mixed","oes , the people all","love","the King so much it"
"BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
"BNC","47797","A04","male","ything , has neither","love","nor hate , and volu"
"BNC","60042","A04","male","ted on it because he","loved","it , and he thereby"
"BNC","61421","A04","male",". The reader with a","love","of art is not alway"
"BNC","69613","A04","male","’ John Constable 's","love","for the work of van"

When imported into a spreadsheet program, this file is displayed as follows. You can now add further columns for your own annotation of the hits:

BNC	2233	A00	unknown	s have provided much	love	and care to many hu
BNC	8920	A01	unknown	‘ I think I 'm in	love	… ’ ‘ How do
BNC	13733	A01	unknown	stress to those you	love	most . Not to have
BNC	15915	A01	unknown	demonstration of the	love	of Jesus shown by y
BNC	38084	A03	mixed	oes , the people all	love	the King so much it
BNC	42738	A04	male	tion and imaginative	loves	, the return of the
BNC	47797	A04	male	ything , has neither	love	nor hate , and volu
BNC	60042	A04	male	ted on it because he	loved	it , and he thereby
BNC	61421	A04	male	. The reader with a	love	of art is not alway
BNC	69613	A04	male	’ John Constable 's	love	for the work of van
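
If you prefer to add annotation columns with a script rather than in a spreadsheet, the tidied file can be read with any CSV library. The following is a minimal Python sketch, not part of INLET. It assumes the column layout shown above; the column names are hypothetical labels chosen for this example, since the sample output shows no header row, and the three sample rows stand in for the contents of love.csv.

```python
import csv
import io

# Three rows copied from the tidied concordance above; in practice you would
# open love.csv instead of using an in-memory string.
SAMPLE = '''"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
"BNC","38084","A03","mixed","oes , the people all","love","the King so much it"
"BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
'''

# Hypothetical column names: the sample output above has no header row.
COLUMNS = ["corpus", "position", "text_id", "author_sex", "left", "hit", "right"]


def read_concordance(handle):
    """Return one dict per concordance line, with an empty 'annotation' field."""
    rows = []
    for fields in csv.reader(handle):
        row = dict(zip(COLUMNS, fields))
        row["position"] = int(row["position"])  # corpus position as a number
        row["annotation"] = ""  # to be filled in by hand or by a later step
        rows.append(row)
    return rows


rows = read_concordance(io.StringIO(SAMPLE))
for row in rows:
    print(row["position"], row["text_id"], row["hit"])
```

If you display different PrintStructures, the number of metadata columns changes, and the list of column names has to be adjusted to match.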

Frequency lists

Let us assume you have created a concordance of the lemma love in the BNC and you want to create and save a frequency list of the word forms. Again, instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:

count Love by word > " | tidycwb.pl > love.csv"

The script will create a frequency list with the word form in the first column and the frequency in the second column. Saving the output directly would have given you the following output:

20160   love  [#2308-#22467]
4253    loved  [#22468-#26720]
1969    Love  [#207-#2175]
1295    loves  [#26721-#28015]
463     loving  [#28017-#28479]
186     LOVE  [#0-#185]
51      Loved  [#2176-#2226]
41      Loves  [#2227-#2267]
40      Loving  [#2268-#2307]
9       LOVED  [#186-#194]
7       LOVING  [#200-#206]
5       LOVES  [#195-#199]
1       lovest  [#28016]

In contrast, the tidied frequency list looks like this:

"love",20160
"loved",4253
"Love",1969
"loves",1295
"loving",463
"LOVE",186
"Loved",51
"Loves",41
"Loving",40
"LOVED",9
"LOVING",7
"LOVES",5
"lovest",1

Or, imported into a spreadsheet:

love	20160
loved	4253
Love	1969
loves	1295
loving	463
LOVE	186
Loved	51
Loves	41
Loving	40
LOVED	9
LOVING	7
LOVES	5
lovest	1
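
Once the frequency list is in this two-column form, it is easy to process further. As an illustration only, the following Python sketch adds up the frequencies of word forms that differ only in capitalization. It assumes the two-column layout shown above and embeds the sample data directly instead of reading love.csv; the function name is made up for this example.

```python
import csv
import io
from collections import Counter

# The tidied frequency list from above; in practice, open love.csv instead.
SAMPLE = '''"love",20160
"loved",4253
"Love",1969
"loves",1295
"loving",463
"LOVE",186
"Loved",51
"Loves",41
"Loving",40
"LOVED",9
"LOVING",7
"LOVES",5
"lovest",1
'''


def merge_case_variants(handle):
    """Sum the frequencies of word forms that differ only in capitalization."""
    totals = Counter()
    for form, freq in csv.reader(handle):
        totals[form.lower()] += int(freq)
    return totals


totals = merge_case_variants(io.StringIO(SAMPLE))
for form, freq in totals.most_common():
    print(form, freq)
```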

Output of the ''group'' command

Let us assume you have created a concordance of the lemma love in the BNC and you want to group the part of speech (using the class tag) by the text mode. Again, instead of saving it directly, you can send it to the script tidycwb.pl using the | operator and then save it:

group Love match class by match text_mode > " | tidycwb.pl > love.csv"

If you had saved the output directly, it would have looked like this:

#---------------------------------------------------------------------
written                       SUBST                              13041
                              VERB                               12364
spoken                        VERB                                1831
                              SUBST                               1190
---                           SUBST                                 39
                              VERB                                  12
spoken                        UNC                                    2
written                       ADJ                                    1

Instead, the tidied output looks like this:

"written","SUBST",13041
"written","VERB",12364
"spoken","VERB",1831
"spoken","SUBST",1190
"---","SUBST",39
"---","VERB",12
"spoken","UNC",2
"written","ADJ",1

Or, imported into a spreadsheet, like this:

written	SUBST	13041
written	VERB	12364
spoken	VERB	1831
spoken	SUBST	1190
---	SUBST	39
---	VERB	12
spoken	UNC	2
written	ADJ	1

Note that the default output does not repeat the content of the first column when it is the same as in the row above, which means you cannot sort it without losing track of which group a row belongs to. The tidied output repeats the content of the first column in every row, so if you sort it, you don't lose any information!
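
To make the difference concrete, here is a minimal Python sketch of how the first column could be filled in and the rows then sorted safely. It is an illustration under stated assumptions, not the actual implementation of tidycwb.pl: it assumes that the values contain no whitespace and that rows with an empty first column belong to the most recent row that has one. The function name is hypothetical.

```python
# The default output of the group command, copied from above.
RAW = """\
#---------------------------------------------------------------------
written                       SUBST                              13041
                              VERB                               12364
spoken                        VERB                                1831
                              SUBST                               1190
---                           SUBST                                 39
                              VERB                                  12
spoken                        UNC                                    2
written                       ADJ                                    1
"""


def fill_first_column(text):
    """Repeat the first-column value on rows where the default output omits it."""
    rows = []
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the separator line
        fields = line.split()
        if len(fields) == 3:
            # Complete row: remember its first-column value for later rows.
            current, value, freq = fields
        elif len(fields) == 2 and current is not None:
            # First column omitted: reuse the value from the row above.
            value, freq = fields
        else:
            continue  # unexpected layout; ignored in this sketch
        rows.append((current, value, int(freq)))
    return rows


rows = fill_first_column(RAW)

# Every row now carries its own first-column value, so sorting is safe.
# Here: sort by part of speech, then by descending frequency.
by_class = sorted(rows, key=lambda r: (r[1], -r[2]))
for mode, pos, freq in by_class:
    print(mode, pos, freq)
```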
