RESOURCES FOR CORPUS LINGUISTICS

1. (FIRST GENERATION) WRITTEN CORPORA

Brown Corpus
  Properties: 1m words, written, American, 1961
  Design: original (see Corpus Design handout)
  Availability: ICAME-CD, free online access at LDC

Lancaster-Oslo/Bergen (LOB)
  Properties: 1m words, written, British, 1961
  Design: based on BROWN
  Availability: ICAME-CD

Kolhapur Corpus of Indian English
  Properties: 1m words, written, Indian, 1978
  Design: roughly based on BROWN (less specific fiction, more general fiction)
  Availability: ICAME-CD

Wellington Corpus of Written New Zealand English
  Properties: 1m words, written, NZ, 1986-90
  Design: roughly based on BROWN (no subcategories within Fiction)
  Availability: ICAME-CD, Wellington Corpus CD

Australian Corpus of English (ACE)
  Properties: 1m words, written, Australian, 1986
  Design: roughly based on BROWN
  Availability: ICAME-CD, ACE-CD

London-Lund (LLC)
  Properties: 0.5m words, spoken, British, 1959ff
  Design: derived from SEU
  Availability: ICAME-CD

Freiburg-Brown (FROWN)
  Properties: 1m words, written, American, 1991
  Design: based on BROWN
  Availability: ICAME-CD

Freiburg-LOB (FLOB)
  Properties: 1m words, written, British, 1991
  Design: based on LOB (i.e. BROWN)
  Availability: ICAME-CD

2. MEGA CORPORA

Cobuild Bank of English
  Properties: > 300m words, mostly written, mostly British (some American)
  Design: opportunistic
  Availability: HarperCollins, online demo at Cobuild web site

British National Corpus (BNC)
  Properties: 100m words, spoken (10m) and written (90m), 1990s
  Design: original (see Corpus Design handout)
  Availability: BNC CD, online demo at BNC web site

Corpus Linguistics © 2003 Anatol Stefanowitsch

3. VARIOUS SPECIALIZED CORPORA

Spoken Corpora

Corpus of Spoken American English
  Properties: growing, spoken, American
  Design: free conversation among friends in natural setting, tape recorded by participants
  Availability: free download at Talkbank

Switchboard (SWB)
  Properties: ca. 3m words, spoken, American, 1990s
  Design: telephone conversations between strangers on predetermined topics
  Availability: free online access at LDC

Spoken Portion of BNC
  Properties: 10m words, British, 1990s
  Design: see BNC
  Availability: BNC-World CD
  Note: Spoken language files are distributed through all subdirectories of the BNC; they can be extracted by searching for files that contain the string stext in the header.

Corpus of Spoken Professional American English
  Properties: ca. 2m words, spoken, American, 1990s
  Design: faculty meetings, committee meetings, White House press conferences
  Availability: CSPAE-CD (Athelstan)

Michigan Corpus of Academic Spoken English (MICASE)
  Properties: 1.7m words, spoken, academic English, University of Michigan, 1997-2001
  Design: representative of speech in academic settings
  Availability: free online access at MICASE web site

Diachronic

Complete Corpus of Old English
  Properties: written, Old English
  Design: full-text (contains all surviving Old English texts)
  Availability: University of Toronto

Helsinki Corpus of English Texts, Diachronic Part
  Properties: 1.5m words, written, Old English to Early Modern English
  Design: original
  Availability: ICAME-CD

Language Acquisition

Child Language Data Exchange System (CHILDES)
  Properties: broad range of corpora of child language
  Design: varies according to corpus
  Availability: CHILDES web site

4. TEXT ARCHIVES

Project Gutenberg
  Web interface:
University of Virginia Electronic Text Center

5. USING THE INTERNET AS A CORPUS

6. WEB PAGES OF MAJOR CORPORA OF ENGLISH

Talkbank
ICAME
British National Corpus (BNC)
  Demo Access:
COBUILD Bank of English
  Demo Access:
Linguistic Data Consortium (LDC)
  Demo Access to BROWN and SWITCHBOARD:
International Corpus of English (ICE)
Michigan Corpus of Academic Spoken English (MICASE)
  Online access:
Child Language Data Exchange System (CHILDES)
Bergen Corpus of London Teenage Language (COLT)
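
Example: extracting the spoken files from the BNC

The note on the spoken portion of the BNC (section 3) says that the spoken files can be found by searching for files that contain the string stext in the header. The following Python lines are a minimal sketch of such a search, not a tested procedure for any particular BNC release. The function names, the `header_bytes` cut-off (how much of each file is treated as "the header"), and the example directory name are assumptions made for illustration only.

```python
from pathlib import Path


def is_spoken(path, marker=b"stext", header_bytes=65536):
    """Return True if `marker` occurs in the first `header_bytes` bytes of a file.

    The cut-off is an assumption: the handout only says the string occurs
    "in the header", so raise the value if files are being missed.
    """
    with open(path, "rb") as f:
        return marker in f.read(header_bytes)


def find_spoken_files(root, marker=b"stext", header_bytes=65536):
    """Search `root` and all of its subdirectories; return matching files, sorted."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and is_spoken(p, marker, header_bytes)
    )


# Hypothetical usage (the directory name is a placeholder, not a real BNC path):
# for path in find_spoken_files("bnc_texts"):
#     print(path)
```

Files are read as bytes so that the search does not depend on the character encoding of the corpus files. The list that is returned can then be passed to a concordancer or copied into a separate directory.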