cqp:corpus-structure
cqp:corpus-structure [2020/04/21 14:07] – [Representing information in corpora: the .vrt format] astefanowitsch | cqp:corpus-structure [2024/10/26 22:21] (current) – [Summary and outlook] add section 7 aamoakuh | ||
+ | **[ [[cqp: | ||
+ | ====== 1. Corpus Structure ====== | ||
+ | |||
+ | //This section introduces the general structure of corpora and the way that this structure is represented in the .vrt file format used in the Corpus Workbench. This is necessary in order to understand how queries are constructed in the Corpus Query Processor language (CQP).// | ||
+ | |||
+ | |||
+ | ===== What is a corpus? ===== | ||
+ | |||
+ | Put simply, a corpus is a collection of authentic samples of language use (written texts or transcribed spoken language). // | ||
+ | |||
+ | Since corpora tend to be quite large -- small corpora typically contain between 500 000 and 1 million words, and large corpora can contain hundreds of millions of words -- they have to be transformed into an electronic format that is accessible to computing devices. As part of this process, corpus creators typically add helpful information of two kinds. First, they add linguistic information to every word included in the corpus, e.g., what part of speech this word represents or which lemma it belongs to. Second, they add so-called “meta-information” -- external information about the text (or, in a conversation, | ||
+ | |||
+ | ===== Representing information in corpora: the .vrt format ===== | ||
+ | |||
+ | Over the years, researchers have used different formats to represent the type of information just mentioned, and there is still no standard way of doing this. We use a suite of programs called the Open Corpus Workbench, in particular a component called the Corpus Query Processor (CQP). This program uses a format called .vrt (VeRticalized Text), which represents the information in a very clean and structured way. There is a more detailed description of this format available [[cqp: | ||
+ | |||
+ | The text itself and the linguistic annotation are represented in columns, with the first column containing every unit of the text, which (for written language) includes the words, punctuation marks and other symbols. Each one of these units is called a “token”, | ||
+ | |||
+ | For example, the sentence //Touch comes before sight, before speech.// might look like this: | ||
+ | |||
+ | Touch | ||
+ | comes | ||
+ | before | ||
+ | sight | ||
+ | , | ||
+ | before | ||
+ | speech | ||
+ | . | ||
+ | |||
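The one-token-per-line layout can be sketched in a few lines of Python. This is only an illustration: the function name `verticalize` is made up here, and the sentence is split into tokens by hand, whereas real corpora are tokenized by dedicated tools.

```python
def verticalize(tokens):
    """Return the tokens as text with exactly one token per line."""
    return "\n".join(tokens)


# Tokens split by hand for this example (an assumption, not a real tokenizer).
tokens = ["Touch", "comes", "before", "sight", ",", "before", "speech", "."]

vertical = verticalize(tokens)
print(vertical)
```

Note that punctuation marks such as the comma and the full stop end up on lines of their own, just like the words.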
+ | Meta-information is represented in the form of xml tags -- a standardized way of representing information. Each xml tag starts with an opening pointed bracket ''<'' | ||
+ | |||
+ | For example, the paragraph from which the sentence cited above is taken could be represented as follows, where '' | ||
+ | |||
+ | <text id=" | ||
+ | <p> | ||
+ | <s> | ||
+ | Touch | ||
+ | comes | ||
+ | before | ||
+ | sight | ||
+ | , | ||
+ | before | ||
+ | speech | ||
+ | . | ||
+ | </s> | ||
+ | <s> | ||
+ | It PNP it | ||
is VBZ be
+ | the | ||
+ | first | ||
+ | language | ||
+ | and | ||
+ | the | ||
+ | last ORD last | ||
+ | , , | ||
+ | and | ||
+ | it PNP it | ||
+ | always | ||
+ | tells | ||
+ | the | ||
+ | truth | ||
+ | . | ||
+ | </s> | ||
+ | </p> | ||
+ | </ | ||
+ | |||
+ | Note that xml tags do not have to contain attribute-value pairs, and that they always come in pairs: one tag stands at the beginning of the unit to which it refers, and one, with an additional slash, at the end. For example, sentences are enclosed by the tags ''< | ||
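The rule that tags come in pairs can be checked mechanically. The following is a minimal sketch, not part of the Corpus Workbench: the function name `check_tag_pairs` is hypothetical, the `id` value `example` is invented for the illustration, and the sketch assumes that every line starting with `<` and ending with `>` is a tag while everything else is a token line.

```python
def check_tag_pairs(lines):
    """Return True if every opening tag is closed again in the right order."""
    open_tags = []  # stack of tag names that have been opened but not closed
    for raw in lines:
        line = raw.strip()
        # Assumption: tags start with "<", end with ">" and have a name inside.
        if len(line) < 3 or not (line.startswith("<") and line.endswith(">")):
            continue  # a token line (or an empty line): nothing to check
        if line.startswith("</"):
            name = line[2:-1].strip()
            # A closing tag must match the most recently opened tag.
            if not open_tags or open_tags.pop() != name:
                return False
        else:
            # Keep only the tag name, dropping attribute-value pairs.
            open_tags.append(line[1:-1].split()[0])
    return not open_tags  # anything left on the stack was never closed


# The id value "example" is made up for this sketch.
sample = [
    '<text id="example">',
    "<p>",
    "<s>",
    "Touch",
    "comes",
    "</s>",
    "</p>",
    "</text>",
]
print(check_tag_pairs(sample))
```

A stack is used because the units are nested: a sentence has to be closed before the paragraph that contains it.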
+ | |||
+ | Before we can work with a corpus consisting of text samples that have been transformed to a .vrt file, this corpus must be compiled into a database. When the corpus is compiled, the columns are given names -- the first column is always called '' | ||
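The idea of giving names to columns can be illustrated as follows. This is a rough sketch of the concept only, not of how the Corpus Workbench compiles a corpus: the function `name_columns` and the labels `token`, `pos` and `lemma` are placeholders chosen for this example rather than the names the Corpus Workbench assigns, and the columns are assumed to be separated by tab characters.

```python
def name_columns(token_line, column_names):
    """Pair each column value of one token line with a column name."""
    values = token_line.split("\t")  # assumption: columns are tab-separated
    return dict(zip(column_names, values))


# Placeholder labels for this sketch only.
COLUMN_NAMES = ["token", "pos", "lemma"]

row = name_columns("It\tPNP\tit", COLUMN_NAMES)
print(row)
```

Once every column has a name, a query can refer to a specific kind of information (for example the part of speech) instead of a column position.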
+ | |||
+ | ===== Accessing corpora in the Corpus Workbench ===== | ||
+ | |||
+ | Before you can work with a corpus that has been compiled into a database, this corpus must be uploaded to a server running the Corpus Workbench. You can actually create your own corpora and upload them (see [[inlet: | ||
+ | |||
+ | If you have [[inlet: | ||
+ | |||
+ | If you want to know which corpora are available, type the following at the prompt and hit the '' | ||
+ | |||
+ | show | ||
+ | |||
+ | The Corpus Workbench will display the corpora you have access to. This may look as follows (this is only an example; you may have fewer or more corpora, depending on how you have set up your account): | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | To choose a corpus, simply type its name as shown on this list, and hit the '' | ||
+ | |||
+ | [no corpus]> | ||
+ | |||
+ | to | ||
+ | |||
+ | BNC> | ||
+ | |||
+ | So your prompt always tells you which corpus you are currently working with. To switch to a different corpus, simply type the name of that corpus, and the prompt will change to reflect this. | ||
+ | |||
+ | |||
+ | ===== Summary and outlook ===== | ||
+ | |||
+ | This section has discussed what a corpus is, how corpora are represented in the Corpus Workbench, how you can check which corpora are available to you, and how to select a corpus to work with. Building on this, [[cqp: | ||
+ | |||
+ | **[ Introduction to CQP: [[cqp: |