1. Corpus Structure

This section introduces the general structure of corpora and the way that this structure is represented in the .vrt file format used in the Corpus Workbench. This is necessary in order to understand how queries are constructed in the Corpus Query Processor language (CQP).

What is a corpus?

Put simply, a corpus is a collection of authentic samples of language use (written texts or transcribed spoken language). Authentic means that the language was produced by speakers in natural settings, without any knowledge that it would be used for scientific research. In the case of written language, the samples are usually taken from existing published or unpublished sources, e.g. newspapers, magazines, books (fiction and non-fiction), weblogs, government reports, academic papers, letters, emails etc. In the case of spoken language, the samples are sometimes taken from existing sources, e.g. television or radio programs, and sometimes specifically recorded for the purpose of creating a corpus, e.g. everyday conversations.

Since corpora tend to be quite large – small corpora typically contain between 500 000 and 1 million words, and large corpora can contain hundreds of millions of words – they have to be transformed into an electronic format that is accessible to computing devices. As part of this process, corpus creators typically add helpful information of two kinds. First, they add linguistic information to every word included in the corpus, e.g., what part of speech this word represents or which lemma it belongs to. Second, they add so-called “meta-information” – external information about the text (or, in a conversation, about the individual utterances), for example, the publication or recording date, the sex, age and social status of the writer or speaker, the text type, and so on. This makes it possible to investigate how a particular linguistic expression is used by particular speakers or in particular settings, or how usage changes over time.

Representing information in corpora: the .vrt format

Over the years, researchers have used different formats to represent the type of information just mentioned, and there is still no standard way of doing this. We use a suite of programs called the Open Corpus Workbench, and in particular its component Corpus Query Processor (CQP), which uses a format called .vrt (VeRticalized Text) that represents the information in a very clean and structured way. A more detailed description of this format is available as part of the instructions for creating a corpus of one's own – the following is just meant to give you enough of an idea to later understand how corpora can be searched for different types of information.

The text itself and the linguistic annotation are represented in columns. The first column contains every unit of the text, which (for written language) includes the words, punctuation marks and other symbols. Each of these units is called a “token”, and every token has its own line. The second column could, for example, contain the part of speech, and the third column the lemma to which the word form belongs.

For example, the sentence Touch comes before sight, before speech. might look like this:

Touch     NN1     touch
comes     VVZ     come
before    PRP     before
sight     NN1     sight
,         PUN     ,
before    PRP     before
speech    NN1     speech
.         PUN     .
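To make the column layout concrete, the following Python sketch reads such lines into (word, part of speech, lemma) triples. It is only an illustration and not part of the Corpus Workbench: the function name read_tokens is made up for this example, and it assumes that columns are separated by whitespace as in the display above (actual .vrt files normally use tab characters between columns).

```python
# Illustrative sketch only: read verticalized lines into (word, pos, lemma) triples.
# Assumption: columns are separated by whitespace, as in the display above.

SAMPLE = """\
Touch     NN1     touch
comes     VVZ     come
before    PRP     before
sight     NN1     sight
,         PUN     ,
before    PRP     before
speech    NN1     speech
.         PUN     .
"""


def read_tokens(vrt_text):
    """Return one (word, pos, lemma) tuple per non-empty line."""
    tokens = []
    for line in vrt_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, pos, lemma = line.split()
        tokens.append((word, pos, lemma))
    return tokens


tokens = read_tokens(SAMPLE)
print(len(tokens))                          # tokens include punctuation marks
print(" ".join(word for word, _, _ in tokens))  # first column, joined by spaces
```

Note that the punctuation marks count as tokens in their own right, so the token count of a corpus is usually higher than its word count.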

Meta-information is represented in the form of XML tags – a standardized way of representing information. Each XML tag starts with an opening pointed bracket < followed by a base name – for example, text; it can then contain attribute-value pairs of the form attribute="value", where the attribute specifies what type of information is given and the value specifies the information itself. The sentence just cited, for example, is from the novel The Blind Assassin by Margaret Atwood, so we might want to add an attribute-value pair like genre="novel". What type of meta-information is included in the form of such attribute-value pairs varies very much from corpus to corpus. Each XML tag then ends with a closing pointed bracket >. (Remember the notion of an attribute-value pair: it will be important in various places throughout this and many of the following sections.)

For example, the paragraph from which the sentence cited above is taken could be represented as follows, where id is a shorthand name of the text, genre describes the genre of the text, etc.:

<text id="TBA" genre="novel" author_sex="female" date="2000">
<p>
<s>
Touch     NN1     touch
comes     VVZ     come
before    PRP     before
sight     NN1     sight
,         PUN     ,
before    PRP     before
speech    NN1     speech
.         PUN     .
</s>
<s>
It        PNP     it
is        VBZ     be
the       AT0     the
first     ORD     first
language  NN1     language
and       CJC     and
the       AT0     the
last      ORD     last
,         PUN     ,
and       CJC     and
it        PNP     it
always    AV0     always
tells     VVZ     tell
the       AT0     the
truth     NN1     truth
.         PUN     .
</s>
</p>
</text>

Note that XML tags do not have to contain attribute-value pairs, and that they always come in pairs: one tag stands at the beginning of the unit to which it refers, and one, with an additional slash, at the end. For example, in the passage above, sentences are enclosed by the tags <s> … </s> and paragraphs by the tags <p> … </p>. Of course, the <text> … </text> tags would stand at the beginning and end of the entire sample, which would typically be longer than a single paragraph.
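The pairing of tags and the attribute-value pairs inside them can be illustrated with a short Python sketch. This is a simplified reader written for this tutorial, not the way the Corpus Workbench itself processes .vrt files: the function name read_structure and the regular expressions are assumptions made for the example, and it expects every tag to stand alone on its own line with values in straight double quotes.

```python
import re

# Illustrative sketch only: collect tags with their attribute-value pairs
# and check that every opening tag has a matching closing tag.

SAMPLE = """\
<text id="TBA" genre="novel" author_sex="female" date="2000">
<p>
<s>
Touch     NN1     touch
.         PUN     .
</s>
</p>
</text>
"""

OPEN_TAG = re.compile(r'^<([A-Za-z_][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*>$')
CLOSE_TAG = re.compile(r'^</([A-Za-z_][\w-]*)>$')
ATTRIBUTE = re.compile(r'([\w-]+)="([^"]*)"')


def read_structure(vrt_text):
    """Return (regions, token_count); raise ValueError on unpaired tags."""
    stack = []       # names of the tags that are currently open
    regions = []     # (tag name, attribute dict) for every opening tag
    token_count = 0
    for line in vrt_text.splitlines():
        line = line.strip()
        if not line:
            continue
        opening = OPEN_TAG.match(line)
        closing = CLOSE_TAG.match(line)
        if opening:
            name, attr_text = opening.groups()
            regions.append((name, dict(ATTRIBUTE.findall(attr_text))))
            stack.append(name)
        elif closing:
            # The closing tag must match the most recently opened tag.
            if not stack or stack.pop() != closing.group(1):
                raise ValueError(f"unpaired closing tag: {line}")
        else:
            token_count += 1  # any other line is a token line
    if stack:
        raise ValueError(f"unclosed tags: {stack}")
    return regions, token_count


regions, token_count = read_structure(SAMPLE)
for name, attributes in regions:
    print(name, attributes)
print(token_count)
```

Tags without attribute-value pairs, such as <p> and <s>, simply yield an empty set of attributes here.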

Before we can work with a corpus consisting of text samples that have been transformed to a .vrt file, this corpus must be compiled into a database. When the corpus is compiled, the columns are given names. The first column is always called word; additional columns can have freely chosen names, but frequent ones are pos for “part of speech” and hw or lemma for “headword/lemma”. These column names then also become attributes that can be searched for using attribute-value pairs, as explained in detail in Sections 2 and 3a/b of this tutorial.
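The idea that column names become searchable attributes can be pictured with a small Python sketch. It does not show how the Corpus Workbench stores or searches a compiled corpus: the helper names name_columns and matches are invented for this illustration, and the column names word, pos and lemma are just the typical choices mentioned above.

```python
# Illustrative sketch only: give each column a name, then select tokens
# by an attribute-value pair such as pos="NN1".

COLUMN_NAMES = ["word", "pos", "lemma"]  # "word" is fixed; the rest are typical choices

LINES = [
    "Touch     NN1     touch",
    "comes     VVZ     come",
    "before    PRP     before",
    "sight     NN1     sight",
]


def name_columns(line, names=COLUMN_NAMES):
    """Turn one token line into a dict that maps column names to values."""
    return dict(zip(names, line.split()))


def matches(token, **conditions):
    """True if the token has the given value for every named attribute."""
    return all(token.get(attr) == value for attr, value in conditions.items())


tokens = [name_columns(line) for line in LINES]
nouns = [token["word"] for token in tokens if matches(token, pos="NN1")]
print(nouns)
```

Each condition passed to matches is an attribute-value pair in the sense introduced above: the attribute is a column name, and the value is what that column must contain.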

Accessing corpora in the Corpus Workbench

Before you can work with a corpus that has been compiled into a database, this corpus must be uploaded to a server running the Corpus Workbench. You can actually create your own corpora and upload them (see Creating corpora), but most of the time you won't have to – you can simply use one of the precompiled corpora that are available as part of our installation.

If you have correctly set up your LOGIN account, you can start the Corpus Query Processor (the part of the Corpus Workbench that is typically used to access corpora and search them for linguistic phenomena) by signing into your account, typing cqp, and hitting the RETURN key.

If you want to know which corpora are available, type the following at the prompt and hit the RETURN key:

show

The Corpus Workbench will display the corpora you have access to. Depending on how you have set up your account, this list may be shorter or longer.

To choose a corpus, simply type its name as shown on this list, and hit the RETURN key. For example, to choose the British National Corpus (BNC), type BNC and hit RETURN. The prompt will now change from

[no corpus]>

to

BNC>

So your prompt always tells you which corpus you are currently working with. To switch to a different corpus, simply type the name of that corpus, and the prompt will change to reflect this.

Summary and outlook

This section has discussed what a corpus is, how corpora are represented in the Corpus Workbench, how you can check which corpora are available to you, and how you can select a corpus to work with. Building on this, Section 2 will show you how to perform simple searches (called “queries”) in a corpus using the Corpus Query Language.
