====== Using Corpora in Historical Linguistics ======

===== Available corpora =====

==== The Penn Corpora ====

=== Resources ===

  * {{ :corpora:penn-cheatsheet.pdf |}} (created by Alhadji Jallow, Jan Reimer and Georg Hartisch in 2017, used with permission)
  * {{ :corpora:penn-tagset.pdf |}}
  * {{ :corpora:exercises_penn-corpora.pdf |}}

=== About ===

The Penn corpora are

  * The **PPCME2** (Kroch, Anthony & Ann Taylor. 2000. //The Penn-Helsinki Parsed Corpus of **Middle English**//. Department of Linguistics, University of Pennsylvania.)
  * The **PPCEME** (Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. //The Penn-Helsinki Parsed Corpus of **Early Modern English**//. Department of Linguistics, University of Pennsylvania.)
  * The **PPCEEC** (Nurmi, Arja, Ann Taylor, Anthony Warner, Susan Pintzuk and Terttu Nevalainen. 2006. //Parsed Corpus of **Early English Correspondence**//. York: University of York and Helsinki: University of Helsinki.)
  * The **PPCMBE** (**Modern British English**)

=== Notes ===

A major advantage of the Penn corpora is that the exact same queries work for all of them, which makes results directly comparable. There are, however, a few things to watch out for:

When working with historical corpora, it is especially useful to work with POS tags, since these corpora are **not lemmatized** and the texts follow **no standard orthography**. So, whenever possible, use POS tags, e.g. to find forms of the auxiliary //do//. If that is not possible, consult the OED to get an idea of the possible spelling variants of the words you are interested in.

If you need to restrict your query to a specific sub-corpus, remember that the command is [yourquery]**::match.text_**[anything]="[anything]" for the PPCME2, the PPCEME and the PPCMBE, but [yourquery]**::match.letter_**[anything]="[anything]" for the PPCEEC. The available text attributes can be looked up in the cheatsheet (forthcoming). Sub-periods (M4, E1, E2 etc.)
are sometimes capitalized and sometimes not; try both if you get no results. The PPCMBE can be restricted to centuries (18th/19th).

=== Related corpora ===

Other corpora with broadly the same tagset include

  * the **[[https://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm|YCOE]]** (Taylor, Ann, Anthony Warner, Susan Pintzuk and Frank Beths. 2003. //The York-Toronto-Helsinki Parsed Corpus of **Old English Prose**//. Department of Language and Linguistic Science, University of York.)
  * the **[[https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2425|YCOE-P]]** (Pintzuk, Susan & Leendert Plug. 2001. //The York-Helsinki Parsed Corpus of **Old English Poetry**//. http://ota.ox.ac.uk/; http://www-users.york.ac.uk/~lang18/pcorpus.html.)
  * the **[[https://pcmep.net/links.php|PCMEP]]** (Zimmermann, Richard. //The Parsed Corpus of **Middle English Poetry**//.)
  * the **[[http://www.chlg.ac.uk/helipad/index.html|HeliPaD]]** (Walkden, George. 2015. //HeliPaD: the Heliand Parsed Database// [the Corpus of **Historical [i.e. Old] Low German**]. Version 0.9.)
  * the **[[http://www.linguist.is/icelandic_treebank|IcePaHC]]** (Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. //**[Old Norse-]Icelandic** Parsed Historical Corpus//. Version 0.9.)

If you are interested in these corpora, please contact your lecturer for information on how to access and use them.
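
=== Example: assembling a restricted query ===

The sub-corpus restriction described in the Notes on the Penn corpora follows a fixed pattern, so it can be assembled mechanically. The Python sketch below is purely illustrative and not part of any corpus tool: the function names ''restrict_query'' and ''period_variants'', the attribute name ''period'' and the placeholder query ''[pos="DOD"]'' are assumptions made for the example. Check the cheatsheet for the attribute names that actually exist.

```python
# Illustrative helper only; all names and example values are assumptions,
# not part of the corpus interface.

# Corpora whose restrictions use match.text_...; the PPCEEC uses match.letter_...
TEXT_CORPORA = {"PPCME2", "PPCEME", "PPCMBE"}
LETTER_CORPORA = {"PPCEEC"}


def restrict_query(query: str, corpus: str, attribute: str, value: str) -> str:
    """Append a sub-corpus restriction to a query string."""
    name = corpus.upper()
    if name in TEXT_CORPORA:
        prefix = "text"
    elif name in LETTER_CORPORA:
        prefix = "letter"
    else:
        raise ValueError(f"unknown corpus: {corpus}")
    return f'{query}::match.{prefix}_{attribute}="{value}"'


def period_variants(period: str) -> list[str]:
    """Return the capitalized and the lower-case spelling of a sub-period label."""
    return [period.upper(), period.lower()]


if __name__ == "__main__":
    # "period" is a hypothetical attribute name; [pos="DOD"] is a placeholder query.
    for label in period_variants("E1"):
        print(restrict_query('[pos="DOD"]', "PPCEME", "period", label))
```

Generating both spellings of the sub-period label mirrors the advice to try the capitalized and the lower-case form when a query returns no results.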