4 - University of Reading

Corpora: Resources for the study of language
Paul Thompson
Applied Linguistics
(p.a.thompson@reading.ac.uk)
British Academic Spoken English corpus
(BASE)



160 lectures, 39 seminars
Transcripts, video and audio
199 XML files:





Transcripts with detailed annotation
Metadata included in header
160 lecture transcripts are tagged for Part-of-Speech
www.reading.ac.uk/AcaDepts/ll/base_corpus/
Funded by AHRB, Euralex, BALEAP and university
sources
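
A minimal sketch of how header metadata might be read from one XML transcript file. The element names (`transcript`, `header`, `eventType`, `discipline`, `u`) and the sample content are hypothetical placeholders, not the actual BASE schema; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcript: element and attribute names are illustrative
# assumptions, not the real BASE markup.
SAMPLE = """<transcript>
  <header>
    <title>Sample lecture</title>
    <eventType>lecture</eventType>
    <discipline>Example Studies</discipline>
  </header>
  <body>
    <u who="speaker1">good morning everyone</u>
    <u who="speaker1">today we look at corpora</u>
  </body>
</transcript>"""


def read_header(xml_text):
    """Return the header metadata as a dict mapping tag name to text."""
    root = ET.fromstring(xml_text)
    header = root.find("header")
    if header is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in header}


def count_words(xml_text):
    """Count whitespace-separated tokens across all utterance elements."""
    root = ET.fromstring(xml_text)
    return sum(len((u.text or "").split()) for u in root.iter("u"))


meta = read_header(SAMPLE)
print(meta["eventType"], count_words(SAMPLE))
```

With real files, the same two functions could be applied to each of the 199 XML files in turn to filter by event type before counting.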
British Academic Written English corpus
(BAWE)



A corpus of assessed student writing at
university level
Texts collected at Warwick, Reading and
Oxford Brookes University
Funded by Economic and Social Research
Council of England (ESRC)
RES-000-23-0800
BAWE figures
6.5 million words
 2,896 texts
 2,761 assignments


XML files, POS-tagged
30+ disciplines
 4 levels of study

Query interface: Sketch Engine
Commercial service: Applied Linguistics pays annual subscription
BAWE: it BE ADJ that (eg, ‘it is important that’)

Level   Raw   Rel %
3       225   121.7
2       275   107.7
1       255   96.0
PG      66    62.1
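
The pattern search above can be sketched in Python over POS-tagged tokens. This is an illustration only, not the Sketch Engine query that produced the figures: it assumes tokens arrive as (word, tag) pairs, that adjective tags begin with `JJ` (a Penn-style convention, which may differ from the tagset used for BAWE), and that raw counts are normalised per million words. How the relative column in the table was actually calculated is not stated in the slides.

```python
# Forms of BE to match; the list is an assumption and may be incomplete.
BE_FORMS = {"is", "was", "be", "been", "being", "are", "were", "'s"}


def count_it_be_adj_that(tagged):
    """Count 'it BE ADJ that' sequences in a list of (word, tag) pairs."""
    hits = 0
    for i in range(len(tagged) - 3):
        (w0, _), (w1, _), (_, t2), (w3, _) = tagged[i:i + 4]
        if (w0.lower() == "it" and w1.lower() in BE_FORMS
                and t2.startswith("JJ") and w3.lower() == "that"):
            hits += 1
    return hits


def per_million(raw, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return raw * 1_000_000 / corpus_size


# Tiny made-up sample, for illustration only.
TAGGED = [
    ("It", "PRP"), ("is", "VBZ"), ("important", "JJ"), ("that", "IN"),
    ("students", "NNS"), ("read", "VBP"), (".", "."),
    ("It", "PRP"), ("was", "VBD"), ("clear", "JJ"), ("that", "IN"),
    ("it", "PRP"), ("rained", "VBD"),
]

raw = count_it_be_adj_that(TAGGED)
print(raw, per_million(raw, len(TAGGED)))
```

Normalising matters here because the subcorpora for each level of study are unlikely to be the same size, so raw counts alone cannot be compared across levels.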
Further possibilities

BASE: Linking audio and video to the
transcripts, either online or on hard drives

Insertion of timestamp data into transcripts


Example
Why?


Access to temporal, spatial, paralinguistic,
phonological information
Studies of speech rate, for example
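
As a rough illustration of what timestamp data would make possible, the sketch below computes speech rate in words per minute. It assumes each transcript segment is available as a (start, end, text) triple with times in seconds; that representation and the sample segments are hypothetical, not the BASE format.

```python
def words_per_minute(segments):
    """Return speech rate over (start_sec, end_sec, text) segments."""
    total_words = sum(len(text.split()) for _, _, text in segments)
    total_seconds = sum(end - start for start, end, _ in segments)
    if total_seconds <= 0:
        raise ValueError("segments must cover a positive duration")
    return total_words * 60 / total_seconds


# Made-up segments, for illustration only.
SEGMENTS = [
    (0.0, 3.0, "good morning everyone"),
    (3.0, 6.0, "today we look at corpora and their uses"),
]

print(words_per_minute(SEGMENTS))
```

Summing segment durations rather than taking the span from first start to last end leaves pauses between segments out of the rate; whether pauses should count is a design choice that depends on the study.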
Uses of corpora






Comparison between languages
Historical linguistics
Stylistics
Studies of language in use
Specialised language use [eg, doctor-patient interactions]
Investigations of multimodality
Projects in mind

PhD thesis corpus
  Electronic submission
Academic speech events
  Seminars, tutorials, etc
Student use of computers in preparing assignments [video and text]
Reading and writing of undergraduates
Desiderata

Hosting corpus resources at Reading or other
university – preferably on Linux servers – with
customisable interfaces




BASE, BAWE, and other corpora that Reading
possesses
For use by all departments at Reading and also
elsewhere
Varied levels of user access
Centralised support needed – lack of continuity
with project staff