**[ [[cqp:introduction|Collection: Introduction to CQP]] ]**

====== 7. Available Corpora ======

Once you have set up your access to CQP on the university server (the INLET corpus system), you will have a selection of corpora at your disposal. This list introduces some of them that may be of interest to you. If you don't have access to CQP yet, see the [[inlet:setup|INLET site]] to install the system on your account. For more information on the INLET system, visit [[inlet:overview|this site]].

For more detailed information on any of these corpora, select the corpus in CQP, type ''info'' and press ''ENTER''.

===== BNC =====

==== BRITISH NATIONAL CORPUS ====

**Size:** 112,156,361 tokens

**Text publication dates:** 1960-1993 (split up into 3 periods)

**Tagset:** [[corpora:tagset-claws5|CLAWS-5]]

**Cite as:** BNC Consortium. 2007. The British National Corpus, version 3 (BNC XML Edition). Oxford: Bodleian Libraries, University of Oxford. URL: http://www.natcorp.ox.ac.uk/

**Corpus documentation:** http://www.natcorp.ox.ac.uk/

===== BNC-BABY =====

==== BRITISH NATIONAL CORPUS (a smaller version) ====

**Size:** 4,644,834 tokens

**Tagset:** [[corpora:tagset-claws5|CLAWS-5]]

**Corpus documentation:** http://www.natcorp.ox.ac.uk/corpus/baby/manual.pdf

===== BNC2014-S =====

==== Spoken British National Corpus 2014 ====

**Size:** 11,422,615 tokens

**Text publication dates:** 2012-2016

**Tagset:** [[https://ucrel.lancs.ac.uk/claws6tags.html|CLAWS-6]]

**Corpus documentation:** http://corpora.lancs.ac.uk/bnc2014/documentation.php

**Cite as:** Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina & Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics 22(3). 319–344. https://doi.org/10.1075/ijcl.22.3.02lov.
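
For the BNC corpora listed above, the ''info'' lookup mentioned in the introduction might look roughly like the following CQP session. This is only a sketch: it assumes that the corpus ID used in CQP matches the heading of the entry (here ''BNC''), which may differ on your installation, and the output depends on what the corpus maintainers have provided.

<code>
# list the corpora available on this system
show corpora;
# activate a corpus by typing its ID (assumed here to match the heading above)
BNC;
# display the information file of the active corpus
info;
</code>

The same three steps should work for any other corpus in this list once its ID is substituted in the second command.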

===== CLMET =====

==== CORPUS OF LATE MODERN ENGLISH TEXTS ====

**Size:** 40,340,760 tokens

**Text publication dates:** 1710-1920 (split up into 3 periods)

**Tagset:** [[corpora:tagset-penn|PENN Corpora]]

**Corpus documentation:** https://perswww.kuleuven.be/~u0044428/clmet3_0.htm

**Cite as:** De Smet, Hendrik, Susanne Flach, Jukka Tyrkkö & Hans-Jürgen Diller. 2015. The Corpus of Late Modern English (CLMET), version 3.1: Improved tokenization and linguistic annotation. KU Leuven, FU Berlin, U Tampere, RU Bochum.

===== BROWN-LEGACY =====

==== The Standard Corpus of Present-Day Edited American English ====

**Size:** 1,137,466 tokens (approx. 1m words)

**Text publication dates:** 1961

**Corpus documentation:** https://varieng.helsinki.fi/CoRD/corpora/BROWN/index.html

**Cite as:** A Standard Corpus of Present-Day Edited American English, for use with Digital Computers (Brown). 1964, 1971, 1979. Compiled by W. N. Francis and H. Kučera. Brown University. Providence, Rhode Island.

===== FROWN-LEGACY =====

==== The Freiburg-Brown corpus of American English ====

**Size:** 1,180,152 tokens (approx. 1m words)

**Text publication dates:** 1992

**Corpus documentation:** https://varieng.helsinki.fi/CoRD/corpora/FROWN/index.html

**Cite as:** The Freiburg-Brown Corpus (‘Frown’) (POS-tagged version) compiled by Christian Mair, Albert Ludwigs-Universität Freiburg, and Geoffrey Leech, University of Lancaster

===== LOB-LEGACY =====

==== The Lancaster-Oslo/Bergen Corpus ====

**Size:** 1,157,496 tokens (approx. 1m words)

**Text publication dates:** 1961

**Corpus documentation:** https://varieng.helsinki.fi/CoRD/corpora/LOB/index.html

**Cite as:** The LOB Corpus, POS-tagged version (1981–1986), compiled by Geoffrey Leech, Lancaster University, Stig Johansson, University of Oslo (project leaders), Roger Garside, Lancaster University, and Knut Hofland, University of Bergen (heads of computing).

===== FLOB-LEGACY =====

==== The Freiburg–LOB Corpus of British English ====

**Size:** 1,165,747 tokens (approx. 1m words)

**Text publication dates:** 1991

**Corpus documentation:** https://varieng.helsinki.fi/CoRD/corpora/FLOB/index.html

**Cite as:** The Freiburg-LOB Corpus (‘F-LOB’) (POS-tagged version) compiled by Christian Mair, Albert Ludwigs-Universität Freiburg, and Geoffrey Leech, University of Lancaster

===== ICLE =====

==== INTERNATIONAL CORPUS OF LEARNER ENGLISH ====

**Size:** 2,518,276 tokens

**Authors' first languages:** Bulgarian, Czech, Dutch (Netherlands), Dutch (Belgium), French, German, Italian, Polish, Russian, etc.

**Corpus documentation:** https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html

**Cite as:** Granger, Sylviane, Estelle Dagneaux & Fanny Meunier. 2002. //International Corpus of Learner English (ICLE)//. Louvain: Presses Universitaires de Louvain.

===== COCA-S =====

==== CORPUS OF CONTEMPORARY AMERICAN ENGLISH (COCA) ====

**Size:** 542,341,719 tokens (440m words)

**Text publication dates:** 1990-2012

**Tagset:** [[corpora:tagset-claws7-coxa|CLAWS-7]]

**Corpus documentation:** http://corpus.byu.edu/coca

**Cite as:** Davies, Mark. 2008. //The Corpus of Contemporary American English: 450 Million Words, 1990-2012//. http://corpus.byu.edu/coca.

===== COHA-S =====

==== CORPUS OF HISTORICAL AMERICAN ENGLISH (COHA) ====

**Size:** 471,427,380 tokens (400m words)

**Tagset:** [[corpora:tagset-claws7-coxa|CLAWS-7]]

**Corpus documentation:** http://corpus.byu.edu/coha/

**Cite as:** Davies, Mark. 2010. //The Corpus of Historical American English: 400 million words, 1810-2009//. http://corpus.byu.edu/coha/.

===== PPCME2 =====

==== PENN-HELSINKI PARSED CORPUS OF MIDDLE ENGLISH (Version 2) ====

**Size:** 1,354,926 tokens

**Text publication dates:** 1150-1500 (split up into 9 periods)

**Tagset:** [[corpora:tagsets|PENN Corpora]]

**Corpus documentation:** https://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-4/index.html , https://github.com/beatrice57/ppche-2024/tree/main/PPCME2-RELEASE-5/docs

**Cite as:** Anthony Kroch and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, (http://www.ling.upenn.edu/hist-corpora/).

===== PPCEME =====

==== PENN-HELSINKI PARSED CORPUS OF EARLY MODERN ENGLISH ====

**Size:** 1,968,483 tokens

**Text publication dates:** 1500-1710

**Tagset:** [[corpora:tagsets|PENN Corpora]]

**Corpus documentation:** https://github.com/beatrice57/ppche-2024/tree/main/PPCEME-RELEASE-4/docs , https://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-3/index.html

**Cite as:** Anthony Kroch, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, (http://www.ling.upenn.edu/hist-corpora/).

===== PPCMBE =====

==== PENN-HELSINKI PARSED CORPUS OF MODERN BRITISH ENGLISH ====

**Size:** 1,095,044 tokens

**Text publication dates:** 1700-1914

**Tagset:** [[corpora:tagsets|PENN Corpora]]

**Corpus documentation:** https://github.com/beatrice57/ppche-2024/tree/main/PPCMBE2-RELEASE-2/docs , https://www.ling.upenn.edu/hist-corpora/PPCMBE2-RELEASE-1/index.html

**Cite as:** Anthony Kroch, Beatrice Santorini, and Lauren Delfs. 2010. //The Penn-Helsinki Parsed Corpus of Modern British English (PPCMBE)//. Department of Linguistics, University of Pennsylvania. CD-ROM, first edition. (http://www.ling.upenn.edu/hist-corpora/).

===== PPCEEC =====

==== PENN-HELSINKI PARSED CORPUS OF EARLY ENGLISH CORRESPONDENCE ====

**Size:** 2,371,920 tokens

**Text publication dates:** 1350-1710 (split up into 5 periods)

**Tagset:** [[corpora:tagsets|PENN Corpora]]

**Corpus documentation:** http://www-users.york.ac.uk/~lang22/PCEEC-manual/corpus_description/index.htm

**Cite as:** Parsed Corpus of Early English Correspondence, tagged version. 2006. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.

===== CED =====

==== Corpus of English Dialogues 1560-1760 ====

**Size:** 1,458,700 tokens

**Text publication dates:** 1560-1760 (split up into 5 periods)

**Tagset:** untagged

**Corpus documentation:** https://data.ldaca.edu.au/collection?id=arcp%3A%2F%2Fname%2Chdl10.26180~23961609&_crateId=arcp%3A%2F%2Fname%2Chdl10.26180~23961609

**Cite as:** A Corpus of English Dialogues 1560–1760. 2006. Compiled under the supervision of Merja Kytö (Uppsala University) and Jonathan Culpeper (Lancaster University).

===== COOEE =====

==== Corpus of Oz Early English ====

**Size:** 2,243,235 tokens

**Text publication dates:** 1788-1900

**Tagset:** [[corpora:tagset-treetagger|TreeTagger]]

**Corpus documentation:** https://varieng.helsinki.fi/CoRD/corpora/COOEE/index.html

**Cite as:** Fritz, Clemens W. A. 2012. From English in Australia to Australian English: 1788-1900. Frankfurt am Main: Peter Lang.

**[ Introduction to CQP: [[cqp:corpus-structure|Section 1]] -- [[cqp:simple-queries|Section 2]] -- [[cqp:advanced-querying|Section 3]] -- [[cqp:beyond-queries|Section 4]] -- [[cqp:expert-tricks|Section 5]] -- [[cqp:exercises|Section 6]] -- [[cqp:list-of-coprora|Section 7]] ]**