IA902 Practical Description of English : Session 4
Corpora
What is a corpus?
“A principled collection of texts available for qualitative and quantitative analysis”? (Biber, Conrad,
Reppen, 1998)
Corpora timeline
1700s: Dr Johnson wrote the first comprehensive dictionary of English, compiled by manually collating
samples of language from 1560-1660.
1960s: Brown Corpus of Standard American English, the first of the modern, computer-readable, general corpora
1980s: John Sinclair & colleagues, Collins Birmingham University International Language Database (COBUILD)
1987: Collins COBUILD English Dictionary
1990: Willis, The Lexical Syllabus
2007: Cambridge International Corpus => 1 billion words
Corpora referred to in this handout
BNC : British National Corpus
BROWN : Brown Corpus of Standard American English
CIC : Cambridge International Corpus
CANCODE: Cambridge and Nottingham Corpus of Discourse in English
COBUILD : Collins Birmingham University International Language Database
MGH / IA902 2012
Frequency of individual words
From O’Keeffe et al. (2007):
O’Keeffe et al. (2007) divide the 2000 most frequently occurring words in the CIC and CANCODE corpora into 4 sub-lists:
A = 1-500
B = 501-1000
C = 1001-1500
D = 1501-2000.
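The banding above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure O’Keeffe et al. used: the function name `frequency_bands`, the simple tokeniser and the toy sample text are all assumptions made for the example.

```python
import re
from collections import Counter

def frequency_bands(text, band_size=500, n_bands=4):
    """Rank words by frequency and cut the ranking into equal-sized bands.

    With the defaults, band A holds ranks 1-500, B 501-1000, and so on.
    """
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = [w for w, _ in Counter(words).most_common(band_size * n_bands)]
    bands = {}
    for k in range(n_bands):
        label = chr(ord("A") + k)  # A, B, C, D ...
        bands[label] = ranked[k * band_size:(k + 1) * band_size]
    return bands

# Toy text standing in for a corpus; tiny bands so the split is visible.
sample = "the the the the of of of a a bit"
print(frequency_bands(sample, band_size=2, n_bands=2))
```

On a real corpus the text would be millions of words long and the default band size of 500 would apply.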
Can you identify the most frequently-occurring word in each set below?
Set                                      A    B    C    D
possibly, must, seem
just, clearly, honestly, pretty
house, TV, cheese, kids
sad, brilliant, lovely, terrible
eventually, always, usually, generally
explain, accept, help, listen
Collocations and Lexical Bundles
Biber et al (2002):
Collocation : “a combination of lexical words which frequently co-occur in texts”
Lexical Bundle : “a sequence of words which is used repeatedly in texts”
Alternatives:
Collocation:
- “just the way we say it”?
- “the occurrence of two or more words within a short space of each other in a text” (Sinclair, 1991)
- “the relationship a lexical item has with items that appear with greater than random probability in its (textual) context” (Hoey, 1991)
- “a psychological association between words (rather than lemmas) up to four words apart … evidenced by their occurrence together in corpora more often than is explicable in terms of random distribution” (Hoey, 2005)
- “the lexical company that words keep” (Hoey, 2011)
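The idea of words co-occurring “more often than is explicable in terms of random distribution” can be made concrete by comparing an observed count with an expected one. The sketch below is one common approximation (a mutual-information style score), not a method taken from Hoey or Sinclair; the function name `collocation_score`, the window of four words either side and the toy sentence are assumptions for illustration.

```python
import math

def collocation_score(tokens, node, collocate, span=4):
    """Compare observed and expected co-occurrence of two words.

    observed: times `collocate` appears within `span` tokens of `node`.
    expected: rough chance estimate if words were scattered at random,
              f(node) * f(collocate) * window size / corpus size.
    Returns (observed, expected, log2(observed / expected)).
    """
    n = len(tokens)
    node_positions = [i for i, t in enumerate(tokens) if t == node]
    observed = 0
    for i in node_positions:
        lo, hi = max(0, i - span), min(n, i + span + 1)
        observed += sum(1 for j in range(lo, hi)
                        if j != i and tokens[j] == collocate)
    expected = len(node_positions) * tokens.count(collocate) * 2 * span / n
    if observed == 0 or expected == 0:
        return observed, expected, float("-inf")
    return observed, expected, math.log2(observed / expected)

# Invented sample, far too small for real conclusions.
text = "it was a bit of a mess and a bit of a problem but the house was a mess"
print(collocation_score(text.split(), "bit", "of", span=1))
```

A score above zero means the pair turns up together more often than the chance estimate predicts; with a corpus this small the figure is only illustrative.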
Lexical bundles and frequency
Put the following items in order of the frequency with which they are used in spoken English:
a) a bit of
b) and things like that
c) regularly
d) since
e) this that and the other
f) twice
Six-word lexical bundles
1. Do you know _______ _______ _______?
2. At the end _______ _______ _______
3. And all the rest _______ _______
4. And all that sort _______ _______
5. I don’t know _______ _______ _______
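Lexical bundles of a given length are usually found by counting every run of n consecutive words and keeping the runs that recur. The following is a minimal sketch under that assumption; the function name `lexical_bundles`, the `min_count` threshold and the sample text are invented for the example and do not come from Biber et al. or from CANCODE.

```python
from collections import Counter

def lexical_bundles(tokens, n, min_count=2):
    """Return (bundle, count) pairs for every n-word sequence seen
    at least `min_count` times, most frequent first."""
    # zip over n staggered copies of the token list yields each n-gram.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return [(g, c) for g, c in counts.most_common() if c >= min_count]

# Invented sample; a real search would run over a whole spoken corpus.
tokens = "a bit of a mess and a bit of a problem".split()
for bundle, count in lexical_bundles(tokens, n=3):
    print(count, bundle)
```

Setting `n=6` would search for six-word bundles of the kind in the exercise above, though these only surface in corpora of considerable size.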
Bit: what corpora can tell us about a word
From the Oxford Dictionary Online:
Results from the BNC:
“A bit” is the 24th most common two-word chunk in CANCODE. But… what does “a bit” mean? Does it have any meaning by itself? How meaningful is “a bit” as a quantifier? What about its “hedging” function?
It also belongs to several “frames”:
e.g. it was a bit of a mess / problem / performance / hassle / nuisance / bargain
Colligation : Where lexis meets grammar?
Data on language usage tells us that:
• “a bit” is more likely than “the bit”
• “a bit” is likely to be followed by “of” + NP
• “a bit” is more likely to be used in an object position than a subject position
Instances of DIFFERENT TO and DIFFERENT FROM across three corpora:
               DIFFERENT TO    DIFFERENT FROM
Brown                0               35
BNC Written          4               22
BNC Spoken          21               12
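Because the three corpora may differ in size, the raw counts are easier to compare as proportions within each corpus. The short sketch below uses only the figures in the table; the function name `share_of_variant` is an assumption for the example.

```python
# Counts copied from the table above.
counts = {
    "Brown": {"different to": 0, "different from": 35},
    "BNC Written": {"different to": 4, "different from": 22},
    "BNC Spoken": {"different to": 21, "different from": 12},
}

def share_of_variant(counts, variant):
    """For each corpus, the proportion of all instances taken by `variant`."""
    shares = {}
    for corpus, row in counts.items():
        total = sum(row.values())
        shares[corpus] = row[variant] / total if total else 0.0
    return shares

for corpus, share in share_of_variant(counts, "different to").items():
    print(f"{corpus}: {share:.0%}")
```

On these figures DIFFERENT TO accounts for none of the Brown instances, about 15% of those in BNC Written (4 of 26) and about 64% of those in BNC Spoken (21 of 33).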
ILLUSTRATE and DRAW
• DRAW is a more frequent item than ILLUSTRATE.
• Both verbs are frequently preceded by “to”.
• Relatively speaking, ILLUSTRATE occurs significantly more frequently with “to” than DRAW does.
• ILLUSTRATE is frequently used in infinitive clauses.
To illustrate this, we can compare concordance lists of each word using any of the websites
linked to on the IA902 blog.
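For readers who would rather build a concordance themselves than use a website, the comparison can be sketched as follows. This is a minimal illustration, not how any particular concordancer works: the function names `concordance` and `share_preceded_by`, the set-of-forms approach to matching a verb and the sample sentence are all assumptions.

```python
def concordance(tokens, forms, width=4):
    """Key-word-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() in forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def share_preceded_by(tokens, forms, word="to"):
    """Proportion of keyword occurrences immediately preceded by `word`."""
    hits = [i for i, t in enumerate(tokens) if t.lower() in forms]
    if not hits:
        return 0.0
    preceded = sum(1 for i in hits if i > 0 and tokens[i - 1].lower() == word)
    return preceded / len(hits)

# Invented sample sentence; real figures need a full corpus.
tokens = ("To illustrate the point we draw a graph and then "
          "draw a conclusion to draw attention").split()
for left, key, right in concordance(tokens, {"draw"}, width=3):
    print(f"{left:>25} [{key}] {right}")
print(share_preceded_by(tokens, {"illustrate"}))
print(share_preceded_by(tokens, {"draw"}))
```

Only the base forms are matched here, since “to” + base form is the pattern of interest; other forms such as “drew” or “illustrated” would need to be added to the set to cover the whole lemma.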
Differences in spoken and written English:
- data on spoken English reflects an orientation to the “speaker-listener world in conversation” (I, you)
- spoken discourse markers (well, right)
- high-frequency items that are arguably not words at all (yeah, oh, er)
References and further reading
Adolphs, S. 2010 “Using a corpus to study spoken language” in Hunston, S. and Oakey, D. (eds) Introducing Applied Linguistics, Routledge
Baker, P. et al. 2006 A Glossary of Corpus Linguistics, Edinburgh University Press
Biber, D. et al (eds) 2002 Longman Student Grammar of Spoken and Written English, Longman
Hoey, M. 2005 Lexical Priming, Routledge
Hoey, M. et al. 2007 (eds) Text, Discourse and Corpora : Theory and Analysis, Continuum
Hunston, S. and Francis, G. 2000 Pattern Grammar, John Benjamins Publishing Company
O’Keeffe, A. et al. 2007 From Corpus to Classroom, Cambridge University Press
Thornbury, S. 2004 Natural Grammar, Oxford University Press
Willis, D. 2003 Rules, Patterns and Words, Cambridge University Press