Tagsets: CLAWS 7 (COCA/COHA)
The CLAWS 7 tagset as used in the Corpus of Historical American English (COHA) and the Corpus of Contemporary American English (COCA) differs from the standard CLAWS 7 tagset in that it contains two additions and several systematic errors. In addition, all tags except for one (OM) are in lowercase! For the standard CLAWS 7 tagset, see the page CLAWS 7 tagset.
| Tag | Description |
|---|---|
| appge | possessive pronoun, pre-nominal (my, your, his, her, its, our, their) |
| at | article (e.g. the, no) |
| at1 | singular article (e.g. a, an, every) |
| bcl | before-clause marker (e.g. in order (that), in order (to); see comment about "ditto tags" at the end of the list!) |
| cc | coordinating conjunction (e.g. and, or) |
| ccb | adversative coordinating conjunction (but) |
| cs | subordinating conjunction (e.g. if, because, unless, so, for) |
| csa | as (when used as conjunction) |
| csn | than (when used as a conjunction) |
| cst | that (when used as a conjunction) |
| csw | whether (when used as a conjunction) |
| da | after-determiner or post-determiner capable of pronominal function (e.g. such, former, same) |
| da1 | singular after-determiner (e.g. little, much) |
| da2 | plural after-determiner (e.g. few, several, many) |
| dar | comparative after-determiner (e.g. more, less, fewer) |
| dat | superlative after-determiner (e.g. most, least, fewest) |
| db | before determiner or pre-determiner capable of pronominal function (all, half) |
| db2 | plural before-determiner (both) |
| dd | determiner (capable of pronominal function) (e.g. any, some) |
| dd1 | singular determiner (e.g. this, that, another) |
| dd2 | plural determiner (these, those) |
| ddq | wh-determiner (which, what) |
| ddqge | wh-determiner, genitive (whose) |
| ddqv | wh-ever determiner (whichever, whatever) |
| ex | existential there |
| fo | formula |
| fu | unclassified word |
| fw | foreign word (e.g. de, la, aqua, chakra) |
| ge | Germanic genitive marker (' or 's) |
| if | for (when used as a preposition) |
| ii | general preposition (all prepositions except for, of, with, without) |
| io | of (when used as a preposition) |
| iw | with, without (when used as prepositions) |
| jj | general adjective (e.g. good, nice, lovely, different) |
| jjr | general comparative adjective (e.g. better, nicer) |
| jjt | general superlative adjective (e.g. best, nicest) |
| jk | catenative adjective (able in be able to, willing in be willing to) |
| mc | cardinal number, neutral for number (two, three, sixteen, …) |
| mc1 | singular cardinal number (one) |
| mc2 | plural cardinal number (e.g. sixes, sevens, twenties) |
| mcge | genitive cardinal number, neutral for number (two's, 100's) [Note: this does not occur in COCA/COHA, presumably because the sequence 's has been erroneously split from the stem and misanalyzed as a form of the verb be or a possessive] |
| mcmc | hyphenated number (5-10, 1914-1918) |
| md | ordinal number (e.g. first, second, next, last) |
| mf | fraction, neutral for number (e.g. quarters, two-thirds) |
| nd1 | singular noun of direction (e.g. north, southwest) |
| nn | common noun, neutral for number (e.g. people, staff, tuna, aircraft, series, ethics) |
| nn1 | singular common noun (e.g. horse, girl, love, democracy) |
| nn2 | plural common noun (e.g. horses, girls, democracies) |
| nna | following noun of title (e.g. M.A.) |
| nnb | preceding noun of title (e.g. Mrs., Prof.) |
| nnl1 | singular locative noun (e.g. Lake, Street, Hill) |
| nnl2 | plural locative noun (e.g. Lakes, Streets, Hills) |
| nno | numeral noun, neutral for number (e.g. dozen, hundred) |
| nno2 | numeral noun, plural (e.g. hundreds, thousands) |
| nnt1 | temporal noun, singular (e.g. day, week, year) |
| nnt2 | temporal noun, plural (e.g. days, weeks, years) |
| nnu | unit of measurement, neutral for number (e.g. mm, sec) |
| nnu1 | singular unit of measurement (e.g. millimetre, second) |
| nnu2 | plural unit of measurement (e.g. ins., feet) |
| np | proper noun, neutral for number (e.g. Philippines, Mercedes) |
| np1 | singular proper noun (e.g. Europe, BBC, Sarah) |
| np2 | plural proper noun (e.g. Himalayas, Beatles, Tudors) |
| npd1 | singular weekday noun (e.g. Friday) |
| npd2 | plural weekday noun (e.g. Fridays) |
| npm1 | singular month noun (e.g. September) |
| npm2 | plural month noun (e.g. Septembers) |
| OM | general tag for omitted words (not part of CLAWS7) |
| pn | indefinite pronoun, neutral for number (none) |
| pn1 | indefinite pronoun, singular (e.g. anyone, everything, nobody, one) |
| pnqo | objective wh-pronoun (whom) |
| pnqs | subjective wh-pronoun (who) |
| pnqv | wh-ever pronoun (whoever) |
| pnx1 | reflexive indefinite pronoun (oneself) |
| ppge | nominal possessive personal pronoun (e.g. mine, yours) |
| pph1 | 3rd person sing. neuter personal pronoun (it) |
| ppho1 | 3rd person sing. objective personal pronoun (him, her) |
| ppho2 | 3rd person plural objective personal pronoun (them) |
| pphs1 | 3rd person sing. subjective personal pronoun (he, she) |
| pphs2 | 3rd person plural subjective personal pronoun (they) |
| ppio1 | 1st person sing. objective personal pronoun (me) |
| ppio2 | 1st person plural objective personal pronoun (us) |
| ppis1 | 1st person sing. subjective personal pronoun (I) |
| ppis2 | 1st person plural subjective personal pronoun (we) |
| ppx1 | singular reflexive personal pronoun (e.g. yourself, itself) |
| ppx2 | plural reflexive personal pronoun (e.g. yourselves, themselves) |
| ppy | 2nd person personal pronoun (you) |
| ra | adverb, after nominal head (e.g. ago, am, pm) |
| rex | adverb introducing appositional constructions (namely, i.e.) |
| rg | degree adverb (very, so, too) |
| rgq | wh- degree adverb (how) |
| rgqv | wh-ever degree adverb (however) |
| rgr | comparative degree adverb (more, less) |
| rgt | superlative degree adverb (most, least) |
| rl | locative adverb (e.g. somewhere, forward, upstairs) |
| rp | prep. adverb, particle (e.g. up, out, back) |
| rpk | prep. adv., catenative (e.g. about in be about to) |
| rr | general adverb (e.g. just, actually, always) |
| rrq | wh- general adverb (where, when, why, how) |
| rrqv | wh-ever general adverb (wherever, whenever) |
| rrr | comparative general adverb (e.g. more, better, earlier) |
| rrt | superlative general adverb (e.g. most, best, earliest) |
| rt | quasi-nominal adverb of time (e.g. now, tomorrow) |
| to | infinitive marker (to) |
| uh | interjection (e.g. oh, yes, um) |
| vb0 | be, base form (finite i.e. imperative, subjunctive) |
| vbdr | were |
| vbdz | was |
| vbg | being |
| vbi | be, infinitive (e.g. in I'll be wrapped around your finger, to be honest) |
| vbm | am |
| vbn | been |
| vbr | are |
| vbz | is |
| vd0 | do, base form (finite) |
| vdd | did |
| vdg | doing |
| vdi | do, infinitive (e.g. in I could do…, To do…) |
| vdn | done |
| vdz | does |
| vh0 | have, base form (finite) |
| vhd | had (past tense) |
| vhg | having |
| vhi | have, infinitive |
| vhn | had (past participle) |
| vhz | has |
| vm | modal auxiliary (can, will, would, etc.) |
| vmk | modal catenative (ought, used) |
| vv0 | base form of lexical verb (e.g. say, love) |
| vvd | past tense of lexical verb (e.g. said, loved) |
| vvg | -ing participle of lexical verb (e.g. saying, loving) |
| vvgk | -ing participle catenative (going in be going to) |
| vvi | infinitive (e.g. to say…, I will always love you…) |
| vvn | past participle of lexical verb (e.g. given, worked) |
| vvnk | past participle catenative (e.g. bound in be bound to) |
| vvz | -s form of lexical verb (e.g. says, loves) |
| xx | not, n't |
| y | major punctuation marks, specifically , . " : ( ) ? ; ! (not part of CLAWS 7) |
| zz1 | singular letter of the alphabet (e.g. A, b) |
| zz2 | plural letter of the alphabet (e.g. A's, b's) |
| zzc, zzq | speaker labels in transcripts of spoken language (not part of CLAWS 7) |
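Because standard CLAWS 7 tags are conventionally written in uppercase (e.g. APPGE, NN1), tags have to be converted when moving between the standard tagset and COCA/COHA. The following is a minimal sketch of such a conversion; the function names and the `COCA_ONLY_TAGS` set are illustrative assumptions, and the code does not repair the malformed tags described in the Errors section.

```python
from typing import Optional

# Tags that the table above marks as "not part of CLAWS 7".
COCA_ONLY_TAGS = {"OM", "y", "zzc", "zzq"}


def to_coca_tag(standard_tag: str) -> str:
    """Lowercase a standard (uppercase) CLAWS 7 tag for a COCA/COHA query."""
    return standard_tag.lower()


def to_standard_tag(coca_tag: str) -> Optional[str]:
    """Uppercase a COCA/COHA tag, or return None if it has no CLAWS 7 counterpart."""
    if coca_tag in COCA_ONLY_TAGS:
        return None
    return coca_tag.upper()


print(to_coca_tag("APPGE"))    # appge
print(to_standard_tag("nn1"))  # NN1
print(to_standard_tag("OM"))   # None
```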
Errors (COCA)
In COCA, some tags are truncated or otherwise malformed (typically the last character is missing), so the affected tokens will not be found by a standard query for the correct tag:
| Tag | Correct Tag | Comment |
|---|---|---|
| a | at | occurs only with ze as a representation of the in a foreign accent |
| c | cs | occurs only with cept as a clipped variant of except |
| d | dd1 | occurs only with an-other as a variant of another |
| f | fw | occurs only with de, las, dos |
| j | jj | occurs with 657 types with 35663 tokens |
| m | mc | occurs with 188 types with 391150 tokens |
| m1 | mc1 | occurs with 145 types with 63588 tokens |
| n | nn, nn1, nn2 | occurs with 15 types with 1588 tokens |
| npx | np, np1, np2 | occurs with 243 types with 65782 tokens |
| null | | occurs with tokenization errors where the token was too long for the parser and with xml entities (approx. 1200 types with 58897 tokens) |
| p | ppho2 | occurs with some cases of them and what-all |
| ./. | ppy | occurs with some cases of y'all |
| vd | vdn | occurs with don as a clipped form of done (1763 tokens) |
| x | | occurs with various symbols, xml entities, tokenization errors, … (approx. 5900 types with 895086 tokens) |
| xxy | xx | occurs with tokenization errors involving the word no (11 types with 186 tokens) |
| zz | zz1 | occurs with various symbols, abbreviations, tokenization errors (approx. 1200 types with 596697 tokens) |
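One way to work around these errors is to query for the malformed variants alongside the correct tag. The sketch below encodes the unambiguous rows of the table as a lookup; the names `TRUNCATED_TO_CORRECT`, `AMBIGUOUS` and `query_variants` are illustrative assumptions, the row whose tag is printed as ./. is left out, and the tags null and x are omitted because the table gives no correct tag for them.

```python
from typing import Set

# Malformed COCA tag -> correct tag, taken from the unambiguous table rows.
TRUNCATED_TO_CORRECT = {
    "a": "at",
    "c": "cs",
    "d": "dd1",
    "f": "fw",
    "j": "jj",
    "m": "mc",
    "m1": "mc1",
    "p": "ppho2",
    "vd": "vdn",
    "xxy": "xx",
    "zz": "zz1",
}

# Malformed tags that may stand for any of several correct tags.
AMBIGUOUS = {
    "n": ("nn", "nn1", "nn2"),
    "npx": ("np", "np1", "np2"),
}


def query_variants(tag: str) -> Set[str]:
    """Return the tag plus every malformed tag that may stand in for it."""
    variants = {tag}
    for bad, good in TRUNCATED_TO_CORRECT.items():
        if good == tag:
            variants.add(bad)
    for bad, candidates in AMBIGUOUS.items():
        if tag in candidates:
            variants.add(bad)
    return variants


print(sorted(query_variants("jj")))   # ['j', 'jj']
print(sorted(query_variants("nn1")))  # ['n', 'nn1']
```

Note that a query for a malformed tag such as zz also returns the symbols and tokenization errors listed in the comment column, so the results still need manual checking.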
Ditto Tags
The CLAWS 7 tagset uses so-called “ditto” tags for certain sequences of tokens that are analyzed as belonging to a single lexical unit. For example, in terms of is analyzed as a preposition (in CLAWS 5, by comparison, it is analyzed as a sequence of a preposition, a noun and another preposition).
In such cases, all words are given the same tag (in the case of in terms of, the tag ii for general preposition) followed by two digits: the first one specifies the length of the sequence, the second one specifies the position of the element in the sequence. For example:
in/ii31 terms/ii32 of/ii33

at_rr21 length_rr22

a_dd21 lot_dd22
This is unfortunate, as it forces analytical decisions on us that are not at all uncontroversial, but we have to live with it!
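When processing tagged text, it can be convenient to split a ditto tag into its base tag, sequence length and position, and to merge the tokens of a sequence back into one unit. The following is a minimal sketch under two assumptions: no regular tag ends in two digits (which holds for the list above), and the input is already a list of (word, tag) pairs. The names `parse_ditto` and `merge_ditto` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Ditto(NamedTuple):
    base: str      # tag without the two ditto digits, e.g. "ii"
    length: int    # number of tokens in the sequence
    position: int  # 1-based position of this token in the sequence


# Base tag (letters plus an optional digit, as in "nn1"), then two digits.
DITTO_RE = re.compile(r"^([a-z]+\d?)([2-9])([1-9])$")


def parse_ditto(tag: str) -> Optional[Ditto]:
    """Return the parts of a ditto tag, or None for a regular tag."""
    match = DITTO_RE.match(tag)
    if match is None:
        return None
    base, length, position = match.group(1), int(match.group(2)), int(match.group(3))
    if position > length:
        return None  # not a plausible ditto tag
    return Ditto(base, length, position)


def merge_ditto(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Join the tokens of each ditto sequence into one (words, base tag) pair."""
    merged: List[Tuple[str, str]] = []
    for word, tag in tokens:
        parsed = parse_ditto(tag)
        if parsed is None:
            merged.append((word, tag))
        elif parsed.position == 1 or not merged:
            merged.append((word, parsed.base))
        else:
            previous_word, previous_tag = merged[-1]
            merged[-1] = (previous_word + " " + word, previous_tag)
    return merged


print(parse_ditto("ii31"))  # Ditto(base='ii', length=3, position=1)
print(parse_ditto("nn1"))   # None
print(merge_ditto([("in", "ii31"), ("terms", "ii32"), ("of", "ii33"), ("the", "at")]))
# [('in terms of', 'ii'), ('the', 'at')]
```

The merge step does not check that a sequence is complete, so a ditto sequence cut off by a tokenization error is merged as far as it goes.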
