RESOURCES FOR CORPUS LINGUISTICS

Corpus Linguistics © 2003 Anatol Stefanowitsch anatol@alumni.rice.edu

1. (FIRST GENERATION) WRITTEN CORPORA

Brown Corpus
  Properties: 1m words, written, American, 1961
  Design: original (see Corpus Design handout)
  Availability: ICAME-CD, free online access at LDC

Lancaster-Oslo/Bergen (LOB)
  Properties: 1m words, written, British, 1961
  Design: based on BROWN
  Availability: ICAME-CD

Kolhapur Corpus of Indian English
  Properties: 1m words, written, Indian, 1978
  Design: roughly based on BROWN (less specific fiction, more general fiction)
  Availability: ICAME-CD

Wellington Corpus of Written New Zealand English
  Properties: 1m words, written, NZ, 1986-90
  Design: roughly based on BROWN (no subcategories within Fiction)
  Availability: ICAME-CD, Wellington Corpus CD

Australian Corpus of English (ACE)
  Properties: 1m words, written, Australian, 1986
  Design: roughly based on BROWN
  Availability: ICAME-CD, ACE-CD

London-Lund (LLC)
  Properties: 0.5m words, spoken, British, 1959ff
  Design: derived from SEU
  Availability: ICAME-CD

Freiburg-Brown (FROWN)
  Properties: 1m words, written, American, 1991
  Design: based on BROWN
  Availability: ICAME-CD

Freiburg-LOB (FLOB)
  Properties: 1m words, written, British, 1991
  Design: based on LOB (i.e. BROWN)
  Availability: ICAME-CD

2. MEGA CORPORA

Cobuild Bank of English
  Properties: > 300m words, mostly written, mostly British (some American)
  Design: opportunistic
  Availability: HarperCollins, online demo at Cobuild web site

British National Corpus (BNC)
  Properties: 100m words, spoken (10m) and written (90m), 1990s
  Design: original (see Corpus Design handout)
  Availability: BNC CD, online demo at BNC web site

3. VARIOUS SPECIALIZED CORPORA

Spoken Corpora

Corpus of Spoken American English
  Properties: growing, spoken, American
  Design: free conversation among friends in natural setting, tape recorded by participants
  Availability: free download at Talkbank

Switchboard (SWB)
  Properties: ca. 3m words, spoken, American, 1990s
  Design: telephone conversations between strangers on predetermined topics
  Availability: free online access at LDC

Spoken Portion of BNC
  Properties: 10m words, British, 1990s
  Design: see BNC
  Availability: BNC-World CD
  Note: Spoken language files are distributed through all subdirectories of the BNC; they can be extracted by searching for files that contain the string "stext" in the header.

Corpus of Spoken Professional American English
  Properties: ca. 2m words, spoken, American, 1990s
  Design: faculty meetings, committee meetings, White House press conferences
  Availability: CSPAE-CD (Athelstan)

Michigan Corpus of Academic Spoken English (MICASE)
  Properties: 1.7m words, spoken, academic English, University of Michigan, 1997-2001
  Design: representative of speech in academic settings
  Availability: free online access at MICASE web site

Diachronic

Complete Corpus of Old English
  Properties: written, Old English
  Design: full-text (contains all surviving Old English texts)
  Availability: University of Toronto

Helsinki Corpus of English Texts, Diachronic Part
  Properties: 1.5m words, written, Old English to Early Modern English
  Design: original
  Availability: ICAME-CD

Language Acquisition

Child Language Data Exchange System (CHILDES)
  Properties: broad range of corpora of child language
  Design: varies according to corpus
  Availability: CHILDES web site

4. TEXT ARCHIVES

Project Gutenberg
  http://gutenberg.net/
  Web interface: http://clwww.essex.ac.uk/w3c/corpus_ling/content/search_engine.html

University of Virginia Electronic Text Center
  http://etext.virginia.edu/

5. USING THE INTERNET AS A CORPUS

  http://www.webcorp.org.uk/

6. WEB PAGES OF MAJOR CORPORA OF ENGLISH

Talkbank
  www.talkbank.org

ICAME
  http://www.hit.uib.no/icame.html

British National Corpus (BNC)
  http://www.hcu.ox.ac.uk/BNC/
  Demo access: http://sara.natcorp.ox.ac.uk/lookup.html

COBUILD Bank of English
  http://www.cobuild.collins.co.uk/boe_info.html
  Demo access: http://www.cobuild.collins.co.uk/form.html

Linguistic Data Consortium (LDC)
  http://www.ldc.upenn.edu
  Demo access to BROWN and SWITCHBOARD: http://www.ldc.upenn.edu/lol

International Corpus of English (ICE)
  http://www.ucl.ac.uk/english-usage/ice/

Michigan Corpus of Academic Spoken English (MICASE)
  http://www.hti.umich.edu/m/micase/
  Online access: http://www.hti.umich.edu/cgi/m/micase/micase-idx?type=revise

Child Language Data Exchange System (CHILDES)
  http://childes.psy.cmu.edu/

Bergen Corpus of London Teenage Language (COLT)
  http://nora.hd.uib.no/colt/
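
Appendix to the note on the spoken portion of the BNC (section 3): the note says that spoken files can be found by searching for the string stext in the header. The following is a minimal sketch of that search in Python, not part of the BNC distribution. The function name find_spoken_files, the marker and max_bytes parameters, and the idea of treating the first 64 KB of each file as "the header" are all assumptions made for illustration; adjust them to the layout of your own copy of the corpus.

```python
import os


def find_spoken_files(root, marker=b"stext", max_bytes=65536):
    """Return a sorted list of files under `root` whose first `max_bytes`
    bytes contain `marker`.

    Assumption: the header (and the marker string) falls within the first
    `max_bytes` bytes of each file. Increase the value if files are missed.
    """
    matches = []
    # os.walk visits every subdirectory, which matters here because the
    # spoken files are spread across all subdirectories of the corpus.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read raw bytes so that no text encoding has to be guessed.
                with open(path, "rb") as handle:
                    head = handle.read(max_bytes)
            except OSError:
                continue  # skip files that cannot be opened or read
            if marker in head:
                matches.append(path)
    return sorted(matches)
```

Calling find_spoken_files with the top-level directory of the corpus (for example a hypothetical path such as "/data/bnc") returns the paths of the matching files, which can then be copied or processed separately from the written portion.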