Tagsets
Many corpora are annotated for word class – for every word form in the corpus, there is a “pos” tag describing what part of speech it is (see Corpus Structure for more information).
There is no generally agreed-upon set of word classes and no generally agreed-upon way of referring to them, so different corpora use different sets of tags. You therefore always have to check the info file or the manual of a corpus before constructing your queries.
Some widely used tagsets for English are the following:
- The CLAWS 5 tagset, used, for example, for the British National Corpus
- The CLAWS 7 tagset, used, for example, in the BNC 2014
- The CLAWS 7 tagset (COCA/COHA version) – a variant of CLAWS 7 used in the Corpus of Contemporary American English and the Corpus of Historical American English
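To illustrate why the tagset matters when constructing a query, here is a minimal Python sketch. It assumes a CQP-style query syntax in which the tag is stored in the “pos” attribute; the helper name `adjective_query` and the dictionary are hypothetical and only serve the illustration. A general adjective is tagged AJ0 in CLAWS 5 but JJ in CLAWS 7, so the same search needs a different query depending on the corpus.

```python
# Tag for a general (positive) adjective in two of the tagsets listed above.
ADJECTIVE_TAG = {
    "CLAWS5": "AJ0",  # CLAWS 5: adjective, general or positive
    "CLAWS7": "JJ",   # CLAWS 7: general adjective
}


def adjective_query(tagset: str) -> str:
    """Build a CQP-style query for one adjective token (hypothetical helper)."""
    return f'[pos="{ADJECTIVE_TAG[tagset]}"]'


print(adjective_query("CLAWS5"))  # [pos="AJ0"]
print(adjective_query("CLAWS7"))  # [pos="JJ"]
```

A query written for one tagset will typically return nothing, or the wrong words, in a corpus that uses another, which is why checking the corpus documentation comes first.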
Most German corpora use (some version of) the STTS tagset.
corpora/tagsets.txt · Last modified: 2024/06/20 13:53 by 127.0.0.1