What is a corpus

Przemysław Kaszubski
IFA UAM Poznań
The English Day (10 Dec., 2002)
Instytut Neofilologii
Państwowa Wyższa Szkoła Zawodowa w Koninie
The use of electronic text, or corpora, in the teaching of English
What is a corpus?
"a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" (Sinclair 1996 - EAGLES96).
naturally-occurring / authentic text/discourse (NOT citations)
usually machine-readable / electronic / computer-stored and -processable
compiled according to design criteria
(ideally) representative of the sampled language variety
Why bother with a corpus?
Expert speakers have only partial knowledge
Expert speakers think of what is possible
Expert speakers cannot quantify their knowledge
Expert speakers cannot make up natural examples
Corpus is more comprehensive and balanced
Corpus shows us what is common and typical
Corpus can give us fairly accurate statistics
Corpus can give us many natural examples
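As a rough illustration of the "fairly accurate statistics" point, the sketch below counts word frequencies in a text. The sample sentences and the function name word_frequencies are invented for illustration and are not taken from any real corpus or tool; the tokenisation is deliberately naive.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word tokens (hypothetical helper, naive tokenisation)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Toy sample standing in for a real corpus (invented for illustration).
sample = (
    "The cat sat on the mat. The dog sat on the log. "
    "A cat and a dog met on the mat."
)

freqs = word_frequencies(sample)
total = sum(freqs.values())

# Print the three most frequent words with raw and relative frequency.
for word, count in freqs.most_common(3):
    print(f"{word}\t{count}\t{count / total:.1%}")
```

With a real corpus the same counts would be computed over millions of words, which is what makes the resulting figures more trustworthy than an expert speaker's estimate.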
Some basic types of (monolingual) corpora
written / spoken
Other corpora
bilingual and multilingual (comparable & parallel)
special / non-standard (e.g. child language)
(non-native) learner (or interlanguage)
Representativeness. Why are the design criteria important? (Meyer 2002)
whose language (range of text sources; time-frame; sociolinguistic variables: gender, age, education, dialect, social
production or reception
spoken / written medium
genre / text-type
general or specialised
size vs balance
sample size vs whole texts
Useful & reliable corpus annotation
ELT: some questions asked of corpora
does an item exist (in general; in a genre; typicality/variation)
are the synonyms interchangeable
typical lexical/grammatical context(s) for an item
teaching grammar through lexis
false-friends or true friends (bilingual corpora)
study of literary texts through concordancing
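Several of the questions above (does an item exist, what are its typical contexts, are two synonyms interchangeable) are usually approached through a concordance. The following is a minimal keyword-in-context (KWIC) sketch, not the method of any tool listed in this handout; the function name kwic, the width parameter, and the sample text are assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one KWIC line per whole-word match of keyword (hypothetical helper)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines


# Invented sample text; a real study would read a corpus file instead.
sample = (
    "She made a strong case for reform. He drinks strong tea every morning. "
    "They built a powerful engine. Strong winds are expected, "
    "but a powerful argument is rare."
)

for keyword in ("strong", "powerful"):
    print(f"--- {keyword} ---")
    for line in kwic(sample, keyword):
        print(line)
```

Comparing the two sets of lines side by side is one simple way to see which nouns each near-synonym tends to occur with.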
Pedagogical approaches to using corpora
teacher-controlled use of corpus-based resources (dictionaries, coursebooks, exercises)
data-driven-learning / classroom-concordancing
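For classroom concordancing, one common kind of activity is a gapped concordance, where learners guess the missing word from its contexts. The sketch below shows one possible way to produce such an exercise; the function name gap_fill, the blank marker, and the example sentences are all hypothetical and not taken from the tools mentioned in this handout.

```python
import re


def gap_fill(sentences, keyword, blank="_____"):
    """Blank out whole-word occurrences of keyword, keeping only sentences that contain it."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [pattern.sub(blank, s) for s in sentences if pattern.search(s)]


# Invented example sentences standing in for lines retrieved from a corpus.
sentences = [
    "He drinks strong tea every morning.",
    "They built a powerful engine.",
    "Strong winds are expected tonight.",
]

exercise = gap_fill(sentences, "strong")
for number, line in enumerate(exercise, start=1):
    print(f"{number}. {line}")
```

The teacher would normally select and edit the resulting lines by hand before giving them to a class.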
Advantages of small (<0.5M) over large (>100M) corpora (Aston 1997)
easier to manage
more fully analysable
easier to become familiar with
easier to interpret
easier to construct
easier to reconstruct
more clearly patterned
limits are clearer
Where are the corpora? Where are the tools? [Some demos.]
Free online corpora access
BNC Online Sampler: http://sara.natcorp.ox.ac.uk/lookup.html
COBUILD Concordance & Collocation Sampler: http://titania.cobuild.collins.co.uk/form.html
WebCorp: http://www.webcorp.org.uk/
Korpus PWN: http://korpus.pwn.pl/
IPI-PAN pseudo-corpus: http://www.ipipan.waw.pl/~corpus/
Free online text resources for offline research
Project Gutenberg: http://www.promo.net/pg/
Oxford Text Archive: http://ota.ahds.ac.uk/
Miscellaneous online sources (press, encyclopedias)
PWN links: http://slowniki.pwn.pl/korpus/linki.php
Bilingual documents: http://www.zbiordokumentow.pl/
Other affordable text resources
CD-ROMs (encyclopedias, newspapers)
Free tools for offline use
concordancer: Concordancer for Windows: http://www.linglit.tu-darmstadt.de/wconcord.htm
XCLOZE & CONTEXTS: http://web.bham.ac.uk/johnstf/timcall.htm
TestBuilder: <soon available, e-mail me for info>
Conclusions: the contributions of corpora to remember
Lexical and lexicogrammatical access to text
Supersede human intuition of commonness/variation
Anyone can do research
WWW info on corpora (selection):
David Lee's Bookmarks for Corpus-Based Linguists: http://devoted.to/corpora
M. Barlow's Corpus Linguistics page: http://www.ruf.rice.edu/~barlow/corpus.html
Tim Johns CALL Page: http://web.bham.ac.uk/johnstf/timcall.htm
P. Kaszubski's (Learner) corpora page: http://main.amu.edu.pl/~przemka
PELCRA Home Page (Polish-English Language Corpora for Research and Applications):
