Using Corpora in Historical Linguistics

Available corpora

The Penn Corpora

Resources

About

The Penn corpora are

Notes

The Penn corpora are convenient to work with because the same queries can be run on all of them, which makes results directly comparable. There are, however, a few things to watch out for:

When working with historical corpora, it is especially useful to search by part-of-speech (POS) tags, since these corpora are not lemmatized and the texts follow no standard orthography. So, whenever possible, use POS tags, e.g. to find forms of the auxiliary "do". If that is not possible, consult the OED to get an idea of the possible spelling variants of the words you are interested in.
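
If you do have to search by word form, a small script can turn a list of spellings into a single query. The following is only a sketch: it assumes that the query interface follows CQP conventions (a word attribute called word, regular expressions inside the quotes, and %c for case-insensitive matching), and the spellings in the usage line are illustrative placeholders, not a list taken from the OED.

```python
import re


def word_query(variants):
    """Build one case-insensitive, CQP-style query matching any listed spelling.

    Assumption: the corpus exposes word forms under the attribute name "word".
    """
    # Deduplicate case-insensitively and sort so the result is stable.
    unique = sorted({v.lower() for v in variants})
    # Escape regex metacharacters so each spelling is matched literally.
    alternation = "|".join(re.escape(v) for v in unique)
    return f'[word="({alternation})"%c]'


# Placeholder spellings; replace them with the variants you find in the OED.
print(word_query(["do", "doo", "Doe", "do"]))
```

The spellings are joined into one alternation so that a single query returns every variant, which keeps the hit counts in one result set.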

If you need to restrict your query to a specific sub-corpus, remember that the command is [yourquery]::match.text_[anything]="[anything]" for the PPCME2, the PPCEME and the PPCMBE, but [yourquery]::match.letter_[anything]="[anything]" for the PPCEEC. Type the quotation marks as plain straight quotes. The available text attributes can be looked up in the cheatsheet (forthcoming). Sub-period labels (M4, E1, E2 etc.) are sometimes capitalized and sometimes not; try both if you get no results. The PPCMBE can be restricted to centuries (18th/19th).
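
The restriction pattern described above can be sketched as a small helper that picks the right prefix for each corpus and produces both capitalizations of a value. This is an illustration only: the attribute name period and the query [pos="DOD"] in the usage lines are hypothetical, so check the actual attribute and tag names for your installation.

```python
# Corpora whose restriction attributes use the "text_" prefix.
TEXT_PREFIX_CORPORA = {"PPCME2", "PPCEME", "PPCMBE"}


def restrict(query, corpus, attribute, value):
    """Append a sub-corpus restriction to a query string."""
    corpus = corpus.upper()
    if corpus in TEXT_PREFIX_CORPORA:
        prefix = "text"
    elif corpus == "PPCEEC":
        prefix = "letter"
    else:
        raise ValueError(f"unknown corpus: {corpus}")
    return f'{query}::match.{prefix}_{attribute}="{value}"'


def case_variants(query, corpus, attribute, value):
    """Return the restricted query with the value in upper and lower case."""
    variants = []
    for v in (value.upper(), value.lower()):
        q = restrict(query, corpus, attribute, v)
        if q not in variants:  # skip duplicates
            variants.append(q)
    return variants


# Hypothetical attribute name "period"; hypothetical query [pos="DOD"].
for q in case_variants('[pos="DOD"]', "PPCME2", "period", "M4"):
    print(q)
```

If the first variant returns no results, try the second one before concluding that the sub-period contains no hits.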

Other corpora with broadly the same tagset include

Please contact your lecturer for information on how to access and use these corpora, should you be interested.