Pedagogic uses of a corpus of student writing and their implications for sampling and annotation
Alois Heuboeck
University of Reading, UK
The British Academic Written English (BAWE) corpus of student writing
Project in progress at the universities of Reading, Warwick and Oxford Brookes
Funded by the Economic and Social Research Council (project no. RES-000-23-0800)
Outline
• Corpora in language teaching (LT): uses and purposes
• Accessing corpus information: interfaces
• Building corpora: requirements and decisions - the BAWE corpus
Using corpora in language pedagogy
pedagogic uses
• classroom
• materials
• description
purposes
• “motivational”
• “linguistic”
Interfaces (1): the concordance
typical query options
• word form
• lemma
• wildcards (e.g. “investigat*”)
• grammatical (e.g. POS)
• patterns
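The word-form and wildcard query options above can be illustrated with a minimal keyword-in-context (KWIC) sketch. It is not the interface of any particular concordancer: the function names `wildcard_to_regex` and `concordance`, the context width and the sample sentence are all assumptions made for illustration.

```python
import re

def wildcard_to_regex(query):
    """Turn a wildcard query such as 'investigat*' into a whole-word regex."""
    pattern = re.escape(query).replace(r"\*", r"\w*")
    return re.compile(rf"\b{pattern}\b", re.IGNORECASE)

def concordance(text, query, width=30):
    """Return KWIC tuples: (left context, matched word form, right context)."""
    regex = wildcard_to_regex(query)
    lines = []
    for match in regex.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines

# Invented sample text, not taken from the BAWE corpus.
sample = ("This essay investigates student writing. Earlier studies "
          "investigated essays; our investigation uses a corpus.")
hits = concordance(sample, "investigat*")
for left, key, right in hits:
    print(f"{left:>30} | {key} | {right}")
```

Lemma and POS queries would additionally need an annotated corpus; the sketch only covers surface word forms.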
Information & interfaces (2)
statistics
• Frequencies, ratios
• e.g. word list, key words
• ad hoc statistics
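As a rough sketch of the statistics listed above, the following builds a word list with frequencies and ranks candidate key words against a reference text. The smoothed log-ratio used here is only one simple option chosen for illustration (corpus tools often use other measures such as log-likelihood); the function names and the toy texts are assumptions.

```python
import math
import re
from collections import Counter

def word_list(text):
    """Frequency list of lower-cased word forms."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def relative_frequency(counts, word, per=1000):
    """Occurrences of `word` per `per` running words."""
    return per * counts[word] / sum(counts.values())

def log_ratio(word, study, reference):
    """Smoothed log2 ratio of relative frequencies (positive = more typical of `study`)."""
    s = (study[word] + 0.5) / (sum(study.values()) + 0.5)
    r = (reference[word] + 0.5) / (sum(reference.values()) + 0.5)
    return math.log2(s / r)

# Invented toy texts standing in for a study corpus and a reference corpus.
study = word_list("the corpus shows the corpus data")
reference = word_list("the cat sat on the mat")

key_words = sorted(study, key=lambda w: log_ratio(w, study, reference), reverse=True)
print(study.most_common(3))
print(key_words[0], round(log_ratio(key_words[0], study, reference), 2))
```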
corpus items
• macrostructural properties and choices
• generic types, e.g. CARS model (Swales 1990)
Requirements: a “good corpus” for language pedagogy
• Representative: target variety
• Relevant: information, annotation
• Usable: e.g. interface, size
Representativeness
The corpus as a representative sample should reflect:
– distribution and quantitative relations: quantitative representativeness
– range of features: qualitative representativeness
Conflicting principles
A trade-off: stratified sampling
Frame 1: the university
Frame 2: 4 disciplinary groups (AH: Arts and Humanities, PS: Physical Sciences, SS: Social Sciences, LS: Life Sciences), 768 assignments each
Frame 3: 4x6 disciplines, 128 assignments each
Frame 4: 4 levels per discipline, 32 assignments each
Corpus: Σ=3,072 assignments
Disciplines: Linguistics, Classics, Archaeology, History of Art, Physics, Business, Politics, Anthropology, Publishing, Medicine, Meteorology, Mathematics, Computer Science, Engineering, Biochemistry, Agriculture, Food Sciences, Health & Social Care, Chemistry, History, Law, Biological Sciences, English, Sociology
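The arithmetic of the sampling frames can be checked with a short sketch: 32 assignments per level, 4 levels per discipline, 6 disciplines per group and 4 groups give 3,072 assignments. The `stratified_sample` helper, its parameters and the toy pool are hypothetical and only illustrate the idea of equal quotas per (discipline, level) cell; they do not describe the project's actual collection procedure.

```python
import random
from collections import defaultdict

GROUPS = ["AH", "LS", "PS", "SS"]   # disciplinary groups
DISCIPLINES_PER_GROUP = 6
LEVELS = 4
PER_CELL = 32                        # assignments per discipline and level

per_discipline = LEVELS * PER_CELL                   # 128
per_group = DISCIPLINES_PER_GROUP * per_discipline   # 768
corpus_total = len(GROUPS) * per_group               # 3072

def stratified_sample(pool, quota, seed=0):
    """Draw up to `quota` items from every (discipline, level) cell.

    `pool` is a list of (discipline, level, assignment_id) tuples.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for discipline, level, item in pool:
        cells[(discipline, level)].append(item)
    return {cell: rng.sample(items, min(quota, len(items)))
            for cell, items in cells.items()}

# Toy pool: 2 disciplines x 4 levels x 50 submitted assignments each.
pool = [(d, lvl, f"{d}-{lvl}-{i}")
        for d in ["History", "Physics"]
        for lvl in range(1, LEVELS + 1)
        for i in range(50)]
picked = stratified_sample(pool, PER_CELL)
print(per_discipline, per_group, corpus_total, len(picked))
```

Equal quotas per cell favour coverage of the full range of disciplines and levels over proportionality to actual student numbers, which is the trade-off named in the slide title.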
Representativeness (2): the BAWE corpus
Relevance
Relevant information in corpus
Significant query
Corpus annotation
Features: lexicogrammatical, structural etc.
Relevance (2): features annotated in the BAWE corpus
• “grammatical”
• textual: structure of “running text”
• typographical (lay-out)
• metatextual: numbering
• other “interesting” features
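To show how annotated structural features become queryable, here is a minimal sketch over an invented XML fragment. The element names (`div`, `head`, `p`, `list`, `item`) and the attribute `n` are assumptions for illustration and should not be read as the BAWE annotation scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated text: sections, headings, paragraphs, a list.
doc = """<text>
  <div type="section" n="1">
    <head>Introduction</head>
    <p>This essay examines corpus use.</p>
    <p>It has two aims.</p>
  </div>
  <div type="section" n="2">
    <head>Method</head>
    <p>We sampled assignments.</p>
    <list><item>Step one</item><item>Step two</item></list>
  </div>
</text>"""

root = ET.fromstring(doc)

# Textual structure: section headings and paragraphs per section.
headings = [h.text for h in root.iter("head")]
paragraphs_per_section = {d.get("n"): len(d.findall("p")) for d in root.iter("div")}

# Typographical feature: items of bulleted or numbered lists.
list_items = [i.text for i in root.findall(".//list/item")]

print(headings, paragraphs_per_section, list_items)
```

Only features that have been annotated in this way can be retrieved by a query, which is why the choice of annotated features determines the relevance of the corpus.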
Corpus size
“For the pedagogical analysis of many common grammatical phenomena a full-size research corpus is much too large.” (Osborne 2000)
Modularity: subcorpora
Specialised corpora
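Modularity can be sketched as selecting a subcorpus by metadata, so that a teacher works with a small, specialised sample instead of the full corpus. The `Assignment` record, its fields and the toy data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Assignment:
    discipline: str
    level: int
    text: str

def subcorpus(corpus, discipline=None, level=None):
    """Keep only assignments matching the given metadata (None = no restriction)."""
    return [a for a in corpus
            if (discipline is None or a.discipline == discipline)
            and (level is None or a.level == level)]

def size_in_words(texts):
    """Rough size: whitespace-separated tokens."""
    return sum(len(a.text.split()) for a in texts)

# Invented toy corpus.
corpus = [
    Assignment("History", 1, "The essay discusses sources"),
    Assignment("History", 3, "This dissertation evaluates archival evidence in detail"),
    Assignment("Physics", 1, "The report presents results"),
]
history = subcorpus(corpus, discipline="History")
first_year = subcorpus(corpus, level=1)
print(len(history), size_in_words(history), len(first_year))
```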
Conclusion: 3 views
• Qualitative vs. quantitative representation: corpus as representation of a (set of) target variety/varieties
• Corpus annotation and interfaces: query instances of lexicogrammatical (etc.) features and phenomena
• Corpus size: modularity, balanced samples of target variety/varieties
Alois Heuboeck
University of Reading, UK
a.heuboeck@reading.ac.uk
The British Academic Written English corpus
http://www.warwick.ac.uk/go/BAWE/overview