Using Corpora in Historical Linguistics

Available corpora

The Penn Corpora

Resources

penn-cheatsheet.pdf (created by Alhadji Jallow, Jan Reimer and Georg Hartisch in 2017, used with permission)
penn-tagset.pdf
exercises_penn-corpora.pdf

About

The Penn corpora are:

The PPCME2 (Kroch, Anthony & Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English. Department of Linguistics, University of Pennsylvania.)
The PPCEME (Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English. Department of Linguistics, University of Pennsylvania.)
The PPCEEC (Nurmi, Arja, Ann Taylor, Anthony Warner, Susan Pintzuk and Terttu Nevalainen. 2006. Parsed Corpus of Early English Correspondence. York: University of York and Helsinki: University of Helsinki.)
The PPCMBE (the Penn Parsed Corpus of Modern British English)

Notes

The Penn corpora are especially convenient because the same queries work for all of them, which makes results directly comparable. There are a few things to watch out for, however:

When working with historical corpora, it is especially useful to query POS tags, since these corpora are not lemmatized and the texts follow no standard orthography. So, whenever possible, use POS tags, e.g. to find forms of the auxiliary do. If that is not possible, consult the OED (Oxford English Dictionary) to get an idea of the possible spelling variants of the words you are interested in.
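When a word has to be searched by spelling rather than by tag, the variants can be collected into a single token query. The following is only a minimal sketch in Python: it assumes the corpora are queried with CQP-style syntax (as the ::match notation on this page suggests), that the word-form attribute is called word, and that the %c flag (ignore case) is available in your interface. The function name word_query is hypothetical, and the spellings in the usage example are illustrative placeholders, not a list taken from the OED.

```python
import re


def word_query(variants):
    """Build one CQP-style token query that matches any of the given spellings.

    Assumption: the word-form attribute is named "word" and the query
    interface accepts the %c (ignore case) flag.
    """
    # Lower-case, de-duplicate and escape the spellings so that characters
    # such as "." are matched literally, then sort for a stable result.
    escaped = sorted({re.escape(v.lower()) for v in variants})
    return '[word="(' + "|".join(escaped) + ')"%c]'


# Illustrative spellings only; look up the real variants in the OED.
print(word_query(["do", "doe", "doo"]))
```

The resulting string can be pasted into the query box. If the word attribute or the %c flag is not available in your interface, adapt the last line of the function accordingly.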

If you need to restrict your query to a specific sub-corpus, remember that the command is [yourquery]::match.text_[anything]="[anything]" for the PPCME2, the PPCEME and the PPCMBE, but [yourquery]::match.letter_[anything]="[anything]" for the PPCEEC. The available text attributes can be looked up in the cheatsheet (forthcoming). Sub-periods (M4, E1, E2 etc.) are sometimes capitalized and sometimes not, so try both spellings if you get no results. The PPCMBE can be restricted to centuries (18th/19th).
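The difference between the text_ and letter_ prefixes is easy to get wrong, so here is a small Python sketch that assembles the restriction for you. It is not part of any corpus tool: the function names restrict_query and case_variants are hypothetical, and so are the attribute name period and the token query [pos="DOD"] in the usage example. Look up the real attribute names in the cheatsheet and the real tags in penn-tagset.pdf.

```python
def restrict_query(query, corpus, attribute, value):
    """Append a sub-corpus restriction to a query string.

    The PPCEEC uses letter_* attributes; the PPCME2, the PPCEME and the
    PPCMBE use text_* attributes.
    """
    prefix = "letter" if corpus.upper() == "PPCEEC" else "text"
    return f'{query}::match.{prefix}_{attribute}="{value}"'


def case_variants(value):
    """Return the upper- and lower-case spellings of a sub-period label."""
    # Sub-periods such as M4 or E1 are sometimes capitalized, sometimes not,
    # so both spellings are worth trying when a query returns nothing.
    return sorted({value.upper(), value.lower()})


# Hypothetical attribute name ("period") and token query ('[pos="DOD"]').
for label in case_variants("e1"):
    print(restrict_query('[pos="DOD"]', "PPCEME", "period", label))
```

Running the loop prints one query per spelling of the sub-period label, which you can then try in turn.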

Related corpora

Other corpora with broadly the same tagset include:

the YCOE (Taylor, Ann, Anthony Warner, Susan Pintzuk and Frank Beths. 2003. The York-Toronto-Helsinki Parsed Corpus of Old English Prose. Department of Language and Linguistic Science, University of York.)
the YCOE-P (Pintzuk, Susan & Leendert Plug. 2001. The York-Helsinki Parsed Corpus of Old English Poetry. http://ota.ox.ac.uk/; http://www-users.york.ac.uk/~lang18/pcorpus.html.)
the PCMEP (Zimmermann, Richard. The Parsed Corpus of Middle English Poetry.)
the HeliPaD (Walkden, George. 2015. HeliPaD: the Heliand Parsed Database [the Corpus of Historical [i.e. Old] Low German]. Version 0.9.)
the IcePaHC (Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. [Old Norse-]Icelandic Parsed Historical Corpus. Version 0.9.)

Please contact your lecturer for information on how to access and use these corpora, should you be interested.
