[ Collection: Introduction to CQP ]
3d. Metadata
This section introduces two important ways of working with metadata. It presupposes that you have read Section 1 and Section 2.
As explained in Section 1, corpora often contain metadata in the form of XML tags like the following (simplified) example:
<text id="TBA" genre="novel" author_sex="female" date="2000"> … </text>
When the corpus is compiled, the base name of the tag (in this case, text) and the attributes (in this case, id, genre, etc.) are combined into attributes like text_id, text_genre, etc. If you want to know which metadata is available for a given corpus, check the info file (simply select the corpus, type info and hit the RETURN key, then use the SPACE key to scroll down, the b key to scroll up, and the q key to close the info file). These metadata can be used in different ways.
Displaying metadata in the concordance
The simplest way in which we may want to use metadata is to have certain types of information displayed as part of the concordance. This is done using the command set PrintStructures: before running a query, we type this command followed by the relevant metadata attribute(s) in quotation marks. For example, in the BNC every text has a three-letter id with the attribute text_id. To display this text id, type the following and hit RETURN:
set PrintStructures "text_id"
When you now run a query like [hw="love"%c], the text id will be given at the beginning of every concordance line. Try it. You will see that often more than one example is from the same text.
You can also have the Corpus Workbench display more than one type of metadata. For example, to display the text id and the genre, type:
set PrintStructures "text_id text_genre"
i.e., list the metadata attributes that you want displayed, separated by spaces.
To turn off the display, simply type:
set PrintStructures ""
Your settings for displaying metadata will also be reset when you switch to a different corpus or when you quit CQP.
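Put together, a short session that uses only the commands introduced above might look as follows. This is a sketch: it assumes that the BNC is selected and that text_genre is available in your installation, so check the info file first.

```
set PrintStructures "text_id text_genre"
[hw="love"%c]
set PrintStructures ""
```

The first line switches on the display of both attributes, the second runs the query, and the third switches the display off again so that later concordances are not affected.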
Displaying metadata without attributes
Recall that there may be metadata tags that do not have any attribute-value pairs, like <p> … </p> enclosing paragraphs or <s> … </s> enclosing sentences. If you try to display these tags using the set PrintStructures command, you get a warning:
BNC> set PrintStructures "p";
Warning: Structure ``p'' does not have any values.
This is because there is nothing that could be displayed at the beginning of the line. However, you may want to display these tags inside the concordance line, i.e., at the point where they occur in the text (for example, to see where a new paragraph begins). To do this, use the show command, followed by a plus sign and the tag you want to display. For example, to display the paragraph tags, type:
show +p
Try it: you will see that <p> or </p> tags are now occasionally displayed in a concordance line. To turn the display off again, use the same command with a minus sign, i.e., type:
show -p
Displaying linguistic annotation
As also discussed in Section 3f, the show command can be used to display information from the columns following the word column. For example, if your corpus contains pos tags (as the BNC does), you can type the following before creating your concordance:
show +pos
Try it: you will see that for each word in the concordance line, its pos tag is now shown, separated from the word by a slash. To turn the display off again, use the same command with a minus sign, i.e., type:
show -pos
Restricting a search by metadata
A more sophisticated way of using metadata is to restrict your search to examples that match a particular value of a metadata attribute. This is done by attaching the expression :: match.attribute="value" to the end of your query, where attribute stands for the name of the metadata attribute and value for the value you want to match.
For example, you may be interested in whether men and women use the word love differently. The BNC contains the attribute text_author_sex with the values male, female and unknown. To find only uses of the word love produced by men, type the following and hit RETURN:
[hw="love"] :: match.text_author_sex="male"
To find only uses of the word love produced by women, type the following and hit RETURN:
[hw="love"] :: match.text_author_sex="female"
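To check that a restriction has worked as intended, you can combine it with the display setting from the first part of this section. The following sketch uses only commands shown above and assumes the BNC attribute text_author_sex:

```
set PrintStructures "text_author_sex"
[hw="love"] :: match.text_author_sex="female"
set PrintStructures ""
```

With the display switched on, every concordance line returned by the second command should be marked with the value female. The last line turns the display off again.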
Summary and outlook
This section has shown you how to work with metadata. Building on this, you can look at the following sections in any order:
- Section 3a: Extending simple queries: Alternative attributes and values
- Section 3b: Extending simple queries: Combinations of attributes and values
- Section 3c: Complex Queries
- Section 3e: Regular expressions (basics)
- Section 3f: Working with concordances
- Section 3g: Sorting and sampling