The CLAWS 7 tagset as used in the Corpus of Historical American English (COHA) and the Corpus of Contemporary American English (COCA) contains two additions to the standard CLAWS 7 tagset and several systematic errors. In addition, all tags except for one are in lowercase! For the standard CLAWS 7 tagset, see the page CLAWS 7 tagset.
Tag | Description |
---|---|
appge | possessive pronoun, pre-nominal (my, your, his, her, its, our, their) |
at | article (e.g. the, no) |
at1 | singular article (e.g. a, an, every) |
bcl | before-clause marker (e.g. in order (that), in order (to); see comment about "ditto tags" at the end of the list!) |
cc | coordinating conjunction (e.g. and, or) |
ccb | adversative coordinating conjunction (but) |
cs | subordinating conjunction (e.g. if, because, unless, so, for) |
csa | as (when used as conjunction) |
csn | than (when used as a conjunction) |
cst | that (when used as a conjunction) |
csw | whether (when used as a conjunction) |
da | after-determiner or post-determiner capable of pronominal function (e.g. such, former, same) |
da1 | singular after-determiner (e.g. little, much) |
da2 | plural after-determiner (e.g. few, several, many) |
dar | comparative after-determiner (e.g. more, less, fewer) |
dat | superlative after-determiner (e.g. most, least, fewest) |
db | before determiner or pre-determiner capable of pronominal function (all, half) |
db2 | plural before-determiner (both) |
dd | determiner (capable of pronominal function) (e.g. any, some) |
dd1 | singular determiner (e.g. this, that, another) |
dd2 | plural determiner (these, those) |
ddq | wh-determiner (which, what) |
ddqge | wh-determiner, genitive (whose) |
ddqv | wh-ever determiner (whichever, whatever) |
ex | existential there |
fo | formula |
fu | unclassified word |
fw | foreign word (e.g. de, la, aqua, chakra) |
ge | Germanic genitive marker (' or 's) |
if | for (when used as a preposition) |
ii | general preposition (all prepositions except for, of, with, without) |
io | of (when used as a preposition) |
iw | with, without (when used as prepositions) |
jj | general adjective (e.g. good, nice, lovely, different) |
jjr | general comparative adjective (e.g. better, nicer) |
jjt | general superlative adjective (e.g. best, nicest) |
jk | catenative adjective (able in be able to, willing in be willing to) |
mc | cardinal number, neutral for number (two, three, sixteen, …) |
mc1 | singular cardinal number (one) |
mc2 | plural cardinal number (e.g. sixes, sevens, twenties) |
mcge | genitive cardinal number, neutral for number (two's, 100's) [Note: this does not occur in COCA/COHA, presumably because the sequence 's has been erroneously split from the stem and misanalyzed as a form of the verb be or a possessive] |
mcmc | hyphenated number (5-10, 1914-1918) |
md | ordinal number (e.g. first, second, next, last) |
mf | fraction, neutral for number (e.g. quarters, two-thirds) |
nd1 | singular noun of direction (e.g. north, southwest) |
nn | common noun, neutral for number (e.g. people, staff, tuna, aircraft, series, ethics) |
nn1 | singular common noun (e.g. horse, girl, love, democracy) |
nn2 | plural common noun (e.g. horses, girls, democracies) |
nna | following noun of title (e.g. M.A.) |
nnb | preceding noun of title (e.g. Mrs., Prof.) |
nnl1 | singular locative noun (e.g. Lake, Street, Hill) |
nnl2 | plural locative noun (e.g. Lakes, Streets, Hills) |
nno | numeral noun, neutral for number (e.g. dozen, hundred) |
nno2 | numeral noun, plural (e.g. hundreds, thousands) |
nnt1 | temporal noun, singular (e.g. day, week, year) |
nnt2 | temporal noun, plural (e.g. days, weeks, years) |
nnu | unit of measurement, neutral for number (e.g. mm, sec) |
nnu1 | singular unit of measurement (e.g. millimetre, second) |
nnu2 | plural unit of measurement (e.g. ins., feet) |
np | proper noun, neutral for number (e.g. Philippines, Mercedes) |
np1 | singular proper noun (e.g. Europe, BBC, Sarah) |
np2 | plural proper noun (e.g. Himalayas, Beatles, Tudors) |
npd1 | singular weekday noun (e.g. Friday) |
npd2 | plural weekday noun (e.g. Fridays) |
npm1 | singular month noun (e.g. September) |
npm2 | plural month noun (e.g. Septembers) |
OM | general tag for omitted words (not part of CLAWS7) |
pn | indefinite pronoun, neutral for number (none) |
pn1 | indefinite pronoun, singular (e.g. anyone, everything, nobody, one) |
pnqo | objective wh-pronoun (whom) |
pnqs | subjective wh-pronoun (who) |
pnqv | wh-ever pronoun (whoever) |
pnx1 | reflexive indefinite pronoun (oneself) |
ppge | nominal possessive personal pronoun (e.g. mine, yours) |
pph1 | 3rd person sing. neuter personal pronoun (it) |
ppho1 | 3rd person sing. objective personal pronoun (him, her) |
ppho2 | 3rd person plural objective personal pronoun (them) |
pphs1 | 3rd person sing. subjective personal pronoun (he, she) |
pphs2 | 3rd person plural subjective personal pronoun (they) |
ppio1 | 1st person sing. objective personal pronoun (me) |
ppio2 | 1st person plural objective personal pronoun (us) |
ppis1 | 1st person sing. subjective personal pronoun (I) |
ppis2 | 1st person plural subjective personal pronoun (we) |
ppx1 | singular reflexive personal pronoun (e.g. yourself, itself) |
ppx2 | plural reflexive personal pronoun (e.g. yourselves, themselves) |
ppy | 2nd person personal pronoun (you) |
ra | adverb, after nominal head (e.g. ago, am, pm) |
rex | adverb introducing appositional constructions (namely, i.e.) |
rg | degree adverb (very, so, too) |
rgq | wh- degree adverb (how) |
rgqv | wh-ever degree adverb (however) |
rgr | comparative degree adverb (more, less) |
rgt | superlative degree adverb (most, least) |
rl | locative adverb (e.g. somewhere, forward, upstairs) |
rp | prep. adverb, particle (e.g. up, out, back) |
rpk | prep. adv., catenative (e.g. about in be about to) |
rr | general adverb (e.g. just, actually, always) |
rrq | wh- general adverb (where, when, why, how) |
rrqv | wh-ever general adverb (wherever, whenever) |
rrr | comparative general adverb (e.g. more, better, earlier) |
rrt | superlative general adverb (e.g. most, best, earliest) |
rt | quasi-nominal adverb of time (e.g. now, tomorrow) |
to | infinitive marker (to) |
uh | interjection (e.g. oh, yes, um) |
vb0 | be, base form (finite i.e. imperative, subjunctive) |
vbdr | were |
vbdz | was |
vbg | being |
vbi | be, infinitive (e.g. in I'll be wrapped around your finger, to be honest) |
vbm | am |
vbn | been |
vbr | are |
vbz | is |
vd0 | do, base form (finite) |
vdd | did |
vdg | doing |
vdi | do, infinitive (e.g. in I could do…, To do…) |
vdn | done |
vdz | does |
vh0 | have, base form (finite) |
vhd | had (past tense) |
vhg | having |
vhi | have, infinitive |
vhn | had (past participle) |
vhz | has |
vm | modal auxiliary (can, will, would, etc.) |
vmk | modal catenative (ought, used) |
vv0 | base form of lexical verb (e.g. say, love) |
vvd | past tense of lexical verb (e.g. said, loved) |
vvg | -ing participle of lexical verb (e.g. saying, loving) |
vvgk | -ing participle catenative (going in be going to) |
vvi | infinitive (e.g. to say…, I will always love you…) |
vvn | past participle of lexical verb (e.g. given, worked) |
vvnk | past participle catenative (e.g. bound in be bound to) |
vvz | -s form of lexical verb (e.g. says, loves) |
xx | not, n't |
y | major punctuation marks, specifically : , . " : ( ) ? ; ! (not part of CLAWS 7) |
zz1 | singular letter of the alphabet (e.g. A, b) |
zz2 | plural letter of the alphabet (e.g. A's, b's) |
zzc, zzq | speaker labels in transcripts of spoken language (not part of CLAWS 7) |
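
Since the standard CLAWS 7 documentation writes its tags in uppercase, comparing corpus output against it requires converting between the two spellings. The following is a minimal sketch of such a conversion; the function names are illustrative and not part of any corpus tool, and the set of non-standard tags is taken from the rows marked "not part of CLAWS 7" in the table above.

```python
# Tags that the table above marks as not being part of standard CLAWS 7.
NON_STANDARD_TAGS = {"om", "y", "zzc", "zzq"}


def to_standard_claws7(tag):
    """Return the uppercase CLAWS 7 spelling of a corpus tag.

    Returns None for corpus-specific additions, which have no
    counterpart in the standard tagset.
    """
    if tag.lower() in NON_STANDARD_TAGS:
        return None
    return tag.upper()


def to_corpus_case(tag):
    """Return the tag as spelled in the corpus: lowercase, except for OM."""
    if tag.upper() == "OM":
        return "OM"
    return tag.lower()
```

For example, `to_corpus_case("NN1")` gives `"nn1"`, which is the form a corpus query would need.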
Errors (COCA)
In COCA, some tags are missing their last character or are otherwise malformed, so the affected tokens will not be found by a standard query:
Tag | Correct Tag | Comment |
---|---|---|
a | at | occurs only with ze as a representation of the in a foreign accent |
c | cs | occurs only with cept as a clipped variant of except |
d | dd1 | occurs only with an-other as a variant of another |
f | fw | occurs only with de, las, dos |
j | jj | occurs with 657 types with 35663 tokens |
m | mc | occurs with 188 types with 391150 tokens |
m1 | mc1 | occurs with 145 types with 63588 tokens |
n | nn, nn1, nn2 | occurs with 15 types with 1588 tokens |
npx | np, np1, np2 | occurs with 243 types with 65782 tokens |
null | | occurs with tokenization errors where the token was too long for the parser and with xml entities (approx. 1200 types with 58897 tokens) |
p | ppho2 | occurs with some cases of them and what-all |
./. | ppy | occurs with some cases of y'all |
vd | vdn | occurs with don as a clipped form of done (1763 tokens) |
x | | occurs with various symbols, xml entities, tokenization errors, … (approx. 5900 types with 895086 tokens) |
xxy | xx | occurs with tokenization errors involving the word no (11 types with 186 tokens) |
zz | zz1 | occurs with various symbols, abbreviations, tokenization errors (approx. 1200 types with 596697 tokens) |
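
To avoid missing the affected tokens, a query for a given tag can be expanded to include the erroneous tags that may stand in for it. The sketch below encodes the rows of the table above that pair an erroneous tag with one or more intended tags; rows without a clear pairing (null, x, and the y'all row) are left out. The names `ERRONEOUS_TAGS` and `variants_for` are hypothetical, and how the resulting list is turned into an actual query depends on the interface used.

```python
# Erroneous tag -> intended tag(s), copied from the table above.
ERRONEOUS_TAGS = {
    "a": ["at"],
    "c": ["cs"],
    "d": ["dd1"],
    "f": ["fw"],
    "j": ["jj"],
    "m": ["mc"],
    "m1": ["mc1"],
    "n": ["nn", "nn1", "nn2"],
    "npx": ["np", "np1", "np2"],
    "p": ["ppho2"],
    "vd": ["vdn"],
    "xxy": ["xx"],
    "zz": ["zz1"],
}


def variants_for(tag):
    """Return the intended tag followed by every erroneous tag
    that may have been assigned in its place."""
    tag = tag.lower()
    found = [tag]
    for wrong, intended in ERRONEOUS_TAGS.items():
        if tag in intended and wrong not in found:
            found.append(wrong)
    return found
```

Note that the erroneous tags are ambiguous in the other direction: a hit for n may be an nn, nn1 or nn2, so results for the expanded query still need manual checking.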
Ditto Tags
The CLAWS 7 tagset uses so-called “ditto” tags for certain sequences of tokens that are analyzed as belonging to a single lexical unit. For example, "in terms of" is analyzed as a single preposition (in CLAWS 5, by comparison, it is analyzed as a sequence of a preposition, a noun and another preposition).
In such cases, all words are given the same tag (in the case of "in terms of", the tag ii for preposition) followed by two digits: the first one specifies the length of the sequence, the second one the position of the element in the sequence, for example:
in_ii31 terms_ii32 of_ii33; at_rr21 length_rr22; a_dd21 lot_dd22
This is unfortunate, as it forces analytical decisions on us that are not at all uncontroversial, but we have to live with it!
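
If the multi-word analysis gets in the way, the two-digit suffix can at least be split off mechanically. The following is a minimal sketch, assuming tokens arrive as word and tag joined by an underscore or a slash, as in the example above; `parse_ditto` and `merge_ditto` are illustrative names, not part of any corpus tool.

```python
import re

# Base tag (letters plus an optional digit, as in nn1), then one digit
# for the sequence length and one digit for the position.
DITTO = re.compile(r"^([a-z]+\d?)(\d)(\d)$")


def parse_ditto(tag):
    """Return (base_tag, length, position) for a ditto tag, else None."""
    match = DITTO.match(tag.lower())
    if match is None:
        return None
    base = match.group(1)
    length = int(match.group(2))
    position = int(match.group(3))
    if length < 2 or not 1 <= position <= length:
        return None
    return base, length, position


def merge_ditto(tagged_text):
    """Join ditto-tagged sequences into single (words, base_tag) units."""
    units = []
    for token in tagged_text.split():
        match = re.match(r"^(.+)[_/]([^_/]+)$", token)
        if match is None:
            raise ValueError("not a word_tag token: " + token)
        word, tag = match.group(1), match.group(2)
        ditto = parse_ditto(tag)
        if ditto is None:
            units.append((word, tag))
        elif ditto[2] == 1 or not units:
            # First element of a sequence starts a new unit.
            units.append((word, ditto[0]))
        else:
            # Later elements are appended to the unit in progress.
            previous_word, previous_tag = units[-1]
            units[-1] = (previous_word + " " + word, previous_tag)
    return units
```

Ordinary tags that end in a single digit, such as nn1 or mc1, are not mistaken for ditto tags because the pattern requires two trailing digits after the base tag.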