RESOURCES FOR CORPUS LINGUISTICS

1. (FIRST GENERATION) WRITTEN CORPORA
Brown Corpus
Properties: 1m words, written, American, 1961
Design: original (see Corpus Design handout)
Availability: ICAME-CD, Free online access at LDC
Lancaster-Oslo/Bergen (LOB)
Properties: 1m words, written, British, 1961
Design: based on BROWN
Availability: ICAME-CD
Kolhapur Corpus of Indian English
Properties: 1m words, written, Indian, 1978
Design: roughly based on BROWN (less specific fiction, more general fiction)
Availability: ICAME-CD
Wellington Corpus of Written New Zealand English
Properties: 1m words, written, NZ, 1986-90
Design: roughly based on BROWN (no subcategories within Fiction)
Availability: ICAME-CD, Wellington Corpus CD
Australian Corpus of English (ACE)
Properties: 1m words, written, Australian, 1986
Design: roughly based on BROWN
Availability: ICAME-CD, ACE-CD
London-Lund (LLC)
Properties: 0.5m words, spoken, British, 1959 ff.
Design: derived from SEU
Availability: ICAME-CD
Freiburg-Brown (FROWN)
Properties: 1m words, written, American, 1991
Design: based on BROWN
Availability: ICAME-CD
Freiburg-LOB (FLOB)
Properties: 1m words, written, British, 1991
Design: based on LOB (i.e. BROWN)
Availability: ICAME-CD
2. MEGA CORPORA
Cobuild Bank of English
Properties: > 300m words, mostly written, mostly British (some American)
Design: opportunistic
Availability: HarperCollins, online demo at Cobuild web site
British National Corpus (BNC)
Properties: 100m words, spoken (10m) and written (90m), 1990s
Design: original (see Corpus Design handout)
Availability: BNC CD, online demo at BNC web site
Corpus Linguistics
© 2003 Anatol Stefanowitsch
anatol@alumni.rice.edu
3. VARIOUS SPECIALIZED CORPORA
Spoken Corpora
Corpus of Spoken American English
Properties: growing, spoken, American
Design: free conversation among friends in natural settings, tape-recorded by participants
Availability: Free download at Talkbank
Switchboard (SWB)
Properties: ca. 3m words, spoken, American, 1990s
Design: Telephone conversations between strangers on predetermined topics
Availability: Free online access at LDC
Spoken Portion of BNC
Properties: 10m words, British, 1990s
Design: see BNC
Availability: BNC-World CD
Note: Spoken language files are distributed through all subdirectories of the BNC; they can be extracted by searching for files that contain the string "stext" in the header.
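The file search described in this note could be sketched in Python roughly as follows. This is a minimal sketch, not part of the BNC distribution: the function name find_spoken_files, the marker and head_bytes parameters, and the 64 KB default for how much of each file counts as "the header" are all assumptions made for illustration.

```python
import os


def find_spoken_files(root, marker=b"stext", head_bytes=65536):
    """Return sorted paths under `root` whose first `head_bytes` bytes contain `marker`.

    Hypothetical helper: the name, the parameters, and the default cutoff
    are illustrative assumptions, not anything defined by the BNC itself.
    """
    matches = []
    # Walk every subdirectory, since spoken files are not kept in one place.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read in binary mode so the file's encoding does not matter.
                with open(path, "rb") as handle:
                    head = handle.read(head_bytes)
            except OSError:
                continue  # skip files that cannot be opened or read
            if marker in head:
                matches.append(path)
    return sorted(matches)
```

Calling find_spoken_files with the top-level directory of a local BNC copy (the location depends on the installation) returns the list of matching file paths, which can then be copied or passed on to a concordancer. Only the beginning of each file is read, so a full-size corpus is scanned without loading whole texts into memory.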
Corpus of Spoken Professional American English
Properties: ca. 2m words, spoken, American, 1990s
Design: faculty meetings, committee meetings, White House press conferences
Availability: CSPAE-CD (Athelstan)
Michigan Corpus of Academic Spoken English (MICASE)
Properties: 1.7m words, spoken, academic English (University of Michigan), 1997-2001
Design: representative of speech in academic settings
Availability: Free online access at MICASE web site
Diachronic
Complete Corpus of Old English
Properties: Written, Old English
Design: full-text (contains all surviving Old English texts)
Availability: University of Toronto
Helsinki Corpus of English Texts, Diachronic Part
Properties: 1.5m words, written, Old English to Early Modern English
Design: original
Availability: ICAME-CD
Language Acquisition
Child Language Data Exchange System (CHILDES)
Properties: Broad range of corpora of child language
Design: varies according to corpus
Availability: CHILDES web site
4. TEXT ARCHIVES
Project Gutenberg
http://gutenberg.net/
Web interface: http://clwww.essex.ac.uk/w3c/corpus_ling/content/search_engine.html
University of Virginia Electronic Text Center
http://etext.virginia.edu/
5. USING THE INTERNET AS A CORPUS
http://www.webcorp.org.uk/
6. WEB PAGES OF MAJOR CORPORA OF ENGLISH
Talkbank
www.talkbank.org
ICAME
http://www.hit.uib.no/icame.html
British National Corpus (BNC)
http://www.hcu.ox.ac.uk/BNC/
Demo Access: http://sara.natcorp.ox.ac.uk/lookup.html
COBUILD Bank of English
http://www.cobuild.collins.co.uk/boe_info.html
Demo Access: http://www.cobuild.collins.co.uk/form.html
Linguistic Data Consortium (LDC)
http://www.ldc.upenn.edu
Demo Access to BROWN and SWITCHBOARD: http://www.ldc.upenn.edu/lol
International Corpus of English (ICE)
http://www.ucl.ac.uk/english-usage/ice/
Michigan Corpus of Academic Spoken English (MICASE)
http://www.hti.umich.edu/m/micase/
Online access: http://www.hti.umich.edu/cgi/m/micase/micase-idx?type=revise
Child Language Data Exchange System (CHILDES)
http://childes.psy.cmu.edu/
Bergen Corpus of London Teenage Language (COLT)
http://nora.hd.uib.no/colt/