Tagsets

Many corpora are annotated for word class – for every word form in the corpus, there is a “pos” tag describing what part of speech it is (see Corpus Structure for more information).

There is no generally agreed-upon set of word classes and no generally agreed-upon way of referring to word classes – different corpora use different sets of tags. Thus, you always have to check the info file or the manual of a corpus in order to construct your queries.
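As a rough illustration of why this matters, the sketch below looks up the tag for a singular common noun in a few tagsets and builds a query from it. The names `SINGULAR_COMMON_NOUN` and `pos_query` are hypothetical, and the CQP-style `[pos="..."]` syntax is an assumption about your query tool; the attribute name and the exact tags should always be confirmed against the manual of the corpus at hand.

```python
# Tag used for a singular common noun in some common tagsets.
# Illustrative only: verify against the documentation of your corpus.
SINGULAR_COMMON_NOUN = {
    "penn": "NN",     # Penn: noun, singular or mass
    "claws5": "NN1",  # CLAWS 5: singular noun
    "claws7": "NN1",  # CLAWS 7: singular common noun
    "stts": "NN",     # STTS: common noun (the tag does not mark number)
}


def pos_query(tagset: str) -> str:
    """Build a CQP-style query for singular common nouns.

    The query syntax and the attribute name "pos" are assumptions;
    some corpora name the attribute differently.
    """
    key = tagset.lower()
    if key not in SINGULAR_COMMON_NOUN:
        raise ValueError(f"unknown tagset {tagset!r}; check the corpus manual")
    return f'[pos="{SINGULAR_COMMON_NOUN[key]}"]'


if __name__ == "__main__":
    for name in ("penn", "claws5", "stts"):
        print(name, pos_query(name))
```

Note that the same query string can mean different things in different corpora: `[pos="NN"]` matches only singular or mass nouns under the Penn tagset, but both singular and plural common nouns under STTS.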

Some widely used tagsets for English are the following:

The Penn tagset
The Historical English Penn Treebank tagset
The TreeTagger tagset
The CLAWS 5 tagset, used, for example, for the British National Corpus
The CLAWS 7 tagset, used, for example, in the BNC 2014
The CLAWS 7 tagset (COCA/COHA version) – a variant of CLAWS 7 used in the Corpus of Contemporary American English and the Corpus of Historical American English

Most German corpora use (some version of) the STTS tagset:

The original STTS tagset
