**[ [[cqp:introduction|Collection: Introduction to CQP]] ]**
====== 1. Corpus Structure ======
//This section introduces the general structure of corpora and the way that this structure is represented in the .vrt file format used in the Corpus Workbench. This is necessary in order to understand how queries are constructed in the Corpus Query Processor language (CQP).//
===== What is a corpus? =====
Put simply, a corpus is a collection of authentic samples of language use (written texts or transcribed spoken language). //Authentic// means that the language was produced by speakers in natural settings, without any knowledge that it would be used for scientific research. In the case of written language, the samples are usually taken from existing published or unpublished sources, e.g. newspapers, magazines, books (fiction and non-fiction), weblogs, government reports, academic papers, letters, emails etc. In the case of spoken language, the samples are sometimes taken from existing sources, e.g. television or radio programs, and sometimes specifically recorded for the purpose of creating a corpus, e.g. everyday conversations.
Since corpora tend to be quite large -- small corpora typically contain between 500 000 and 1 million words, and large corpora can contain hundreds of millions of words -- they have to be transformed into an electronic format that is accessible to computing devices. As part of this process, corpus creators typically add helpful information of two kinds. First, they add linguistic information to every word included in the corpus, e.g., what part of speech this word represents or which lemma it belongs to. Second, they add so-called “meta-information” -- external information about the text (or, in a conversation, about the individual utterances), for example, the publication or recording date, the sex, age and social status of the writer or speaker, the text type, and so on. This makes it possible to investigate how a particular linguistic expression is used by particular speakers or in particular settings, or how usage changes over time.
===== Representing information in corpora: the .vrt format =====
Over the years, researchers have used different formats to represent the type of information just mentioned, and there is still no standard way of doing this. We use a suite of programs called the Open Corpus Workbench and, in particular, one of its components, the Corpus Query Processor (CQP). CQP uses a format called .vrt (VeRticalized Text), which represents the information in a very clean and structured way. A more detailed description of this format is available [[cqp:creating-corpora-vrt|here]] as part of the instructions for creating a corpus of one's own -- the following is just meant to give you enough of an idea to understand later how corpora can be searched for different types of information.
The text itself and the linguistic annotation are represented in columns. The first column contains every unit of the text, which (for written language) includes the words, punctuation marks and other symbols. Each of these units is called a “token”, and, as you can see below, every token has its own line. The second column could, for example, contain the part of speech, and the third column the lemma to which the word form belongs.
For example, the sentence //Touch comes before sight, before speech.// might look like this:
Touch NN1 touch
comes VVZ come
before PRP before
sight NN1 sight
, PUN ,
before PRP before
speech NN1 speech
. PUN .
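To make the column structure concrete, here is a minimal sketch in Python that reads the token lines shown above into (word, part of speech, lemma) triples. It is not part of the Corpus Workbench; the names ''Token'' and ''parse_vrt_tokens'' are made up for this illustration, and the sketch assumes exactly three columns per line, separated by a tab or, failing that, by whitespace.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One line of the verticalized text: a token plus its annotation."""
    word: str
    pos: str
    lemma: str


def parse_vrt_tokens(text: str) -> list[Token]:
    """Turn verticalized text into a list of Token triples.

    Assumption for this sketch: every non-empty line has exactly three
    columns (word, part of speech, lemma).
    """
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Prefer tab-separated columns; fall back to any whitespace.
        columns = line.split("\t") if "\t" in line else line.split()
        if len(columns) != 3:
            raise ValueError(f"expected 3 columns, got {len(columns)}: {line!r}")
        tokens.append(Token(*columns))
    return tokens


SAMPLE = """\
Touch NN1 touch
comes VVZ come
before PRP before
sight NN1 sight
, PUN ,
before PRP before
speech NN1 speech
. PUN .
"""

tokens = parse_vrt_tokens(SAMPLE)
# Reading one column top to bottom gives the running text or its annotation.
words = [t.word for t in tokens]
lemmas = [t.lemma for t in tokens]
```

Reading the first column from top to bottom gives back the original sentence, token by token; reading the third column gives the corresponding lemmas.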
Meta-information is represented in the form of xml tags -- a standardized way of representing information. Each xml tag starts with an opening pointed bracket ''<'' followed by a base name -- for example, ''text''; it can then contain attribute-value pairs of the form ''attribute="value"'', where the attribute specifies what type of information is given and the value specifies the information itself. The sentence just cited, for example, is from the novel //The Blind Assassin// by Margaret Atwood, so we might want to add an attribute-value pair like ''genre="novel"''. Which types of meta-information are included in the form of such attribute-value pairs varies considerably from corpus to corpus. Each xml tag then ends with a closing pointed bracket ''>''. (Remember the notion of an attribute-value pair; it will be important in various places throughout this and many of the following sections.)
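The anatomy of such a tag can be sketched in a few lines of Python. This is only an illustration, not how CQP itself reads tags: the function name ''parse_opening_tag'' and the ''id'' value ''example01'' are invented for the example, and the sketch assumes that all values are enclosed in double quotes.

```python
import re

# An opening tag: "<", a base name, optional attribute-value pairs, ">".
TAG_PATTERN = re.compile(r'<(?P<name>[A-Za-z_][\w.-]*)(?P<attrs>[^>]*)>')
# One attribute-value pair of the form attribute="value".
ATTR_PATTERN = re.compile(r'([A-Za-z_][\w.-]*)="([^"]*)"')


def parse_opening_tag(tag: str) -> tuple[str, dict[str, str]]:
    """Split an opening xml tag into its base name and its attribute-value pairs."""
    match = TAG_PATTERN.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not an opening xml tag: {tag!r}")
    attributes = dict(ATTR_PATTERN.findall(match.group("attrs")))
    return match.group("name"), attributes


# The id value is hypothetical; genre="novel" is the pair discussed in the text.
name, attributes = parse_opening_tag('<text id="example01" genre="novel">')
print(name)        # base name of the tag
print(attributes)  # attribute-value pairs as a dictionary
```

The dictionary makes the idea of an attribute-value pair explicit: each attribute (''genre'') is a key that points to exactly one value (''novel'').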
For example, the paragraph from which the sentence cited above is taken could be represented as follows, where ''id'' is a shorthand name of the text, ''genre'' describes the genre of the text, etc.:
Touch NN1 touch
comes VVZ come
before PRP before
sight NN1 sight
, PUN ,
before PRP before
speech NN1 speech
. PUN .
It PNP it
is VBZ be
the AT0 the
first ORD first
language NN1 language
and CJC and
the AT0 the
last ORD last
, PUN ,
and CJC and
it PNP it
always AV0 always
tells VVZ tell
the AT0 the
truth NN1 truth
. PUN .
… '' in the passage above, and ''
…
''. Of course, the ''