IA902 Practical Description of English : Session 4
MGH / IA902 2012

Corpora

What is a corpus?

“A principled collection of texts available for qualitative and quantitative analysis”? (Biber, Conrad and Reppen, 1998)

Corpora timeline

1700s: Dr Johnson wrote the first comprehensive dictionary of English, compiled by manually collating samples of language from 1560-1660.
1960s: Brown Corpus of Standard American English: the first of the modern, computer-readable, general corpora
1980s: John Sinclair & colleagues: Collins Birmingham University International Language Database (COBUILD)
1987: Collins COBUILD English Dictionary
1990: Willis: The Lexical Syllabus
2007: Cambridge International Corpus => 1 billion words

Corpora referred to in this handout

BNC : British National Corpus
BROWN : Brown Corpus of Standard American English
CIC : Cambridge International Corpus
CANCODE : Cambridge and Nottingham Corpus of Discourse in English
COBUILD : Collins Birmingham University International Language Database

Frequency of individual words

From O’Keeffe et al (2007): [figure not reproduced]

O’Keeffe et al (2007) divide the 2000 most frequently occurring words in the CIC and CANCODE corpora into 4 sub-lists:

A = 1-500
B = 501-1000
C = 1001-1500
D = 1501-2000

Can you identify the most frequently-occurring word in each set below?

Word set                                  A   B   C   D
possibly, must, seem
just, clearly, honestly, pretty
house, TV, cheese, kids
sad, brilliant, lovely, terrible
Eventually, always, usually, generally
explain, accept, help, listen

Collocations and Lexical Bundles

Biber et al (2002):
Collocation : “a combination of lexical words which frequently co-occur in texts”
Lexical Bundle : “a sequence of words which is used repeatedly in texts”

Alternatives:
Collocation:
- “just the way we say it”?
- “the occurrence of two or more words within a short space of each other in a text” (Sinclair, 1991)
- “the relationship a lexical item has with items that appear with greater than random probability in its (textual) context” (Hoey, 1991)
- “a psychological association between words (rather than lemmas) up to four words apart … evidenced by their occurrence together in corpora more often than is explicable in terms of random distribution” (Hoey, 2005)
- “the lexical company that words keep” (Hoey, 2011)

Lexical bundles and frequency

Put the following items in order of the frequency with which they are used in spoken English:

a) a bit of
b) and things like that
c) regularly
d) since
e) this that and the other
f) twice

Six-word lexical bundles

1. Do you know _______ _______ _______?
2. At the end _______ _______ _______
3. And all the rest _______ _______
4. And all that sort _______ _______
5. I don’t know _______ _______ _______

Bit: what corpora can tell us about a word

From the Oxford Dictionary Online: [entry not reproduced]

Results from the BNC: [results not reproduced]

“A bit” is the 24th most common two-word chunk in CANCODE.

But what does “a bit” mean? Does it have any meaning by itself? How meaningful is “a bit” as a quantifier? What about its “hedging” function?

It also belongs to several “frames”:

e.g. it was a bit of a mess / problem / performance / hassle / nuisance / bargain

Colligation : Where lexis meets grammar?
Data on language usage tells us that:
• “a bit” is more likely than “the bit”
• “a bit” is likely to be followed by “of” + NP
• “a bit” is more likely to be used in an object position than a subject position

Instances of DIFFERENT TO and DIFFERENT FROM across three corpora:

               DIFFERENT TO   DIFFERENT FROM
Brown                     0               35
BNC Written               4               22
BNC Spoken               21               12

ILLUSTRATE and DRAW

DRAW is a more frequent item than ILLUSTRATE.
Both verbs are frequently preceded by “to”.
Relatively speaking, ILLUSTRATE occurs significantly more frequently with “to” than DRAW does.
ILLUSTRATE is frequently used in an INFINITIVE CLAUSE.

To illustrate this, we can compare concordance lists of each word using any of the websites linked to on the IA902 blog.

Differences in spoken and written English:
- data on spoken English reflects an orientation to the “speaker-listener world in conversation” (I, you)
- spoken discourse markers (well, right)
- high frequency items that are arguably not words at all (yeah, oh, er)

References and further reading

Adolphs, S. 2010 “Using a corpus to study spoken language” in Hunston, S. and Oakey, D. (eds) Introducing Applied Linguistics, Routledge
Baker, P. et al. 2006 A Glossary of Corpus Linguistics, Edinburgh University Press
Biber, D. et al. (eds) 2002 Longman Student Grammar of Spoken and Written English, Longman
Hoey, M. 2005 Lexical Priming, Routledge
Hoey, M. et al. (eds) 2007 Text, Discourse and Corpora : Theory and Analysis, Continuum
Hunston, S. and Francis, G. 2000 Pattern Grammar, John Benjamins Publishing Company
O’Keeffe, A. et al. 2007 From Corpus to Classroom, Cambridge University Press
Thornbury, S. 2004 Natural Grammar, Oxford University Press
Willis, D. 2003 Rules, Patterns and Words, Cambridge University Press
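
Worked example: DIFFERENT TO vs DIFFERENT FROM

The raw counts in the DIFFERENT TO / DIFFERENT FROM table are hard to compare directly, because the three corpora differ in size. One rough way round this is to look at the share of each variant within a single corpus. The sketch below does only that, using the counts from the handout; it assumes the first figure in each row is DIFFERENT TO and the second DIFFERENT FROM. The names `counts`, `share_of_to` and `shares` are illustrative and do not come from any corpus tool.

```python
# Counts copied from the DIFFERENT TO / DIFFERENT FROM table in this handout
# (assumed column order: DIFFERENT TO first, DIFFERENT FROM second).
counts = {
    "Brown": {"different to": 0, "different from": 35},
    "BNC Written": {"different to": 4, "different from": 22},
    "BNC Spoken": {"different to": 21, "different from": 12},
}


def share_of_to(row):
    """Proportion of 'different to' among the two variants in one corpus."""
    total = row["different to"] + row["different from"]
    return row["different to"] / total if total else 0.0


# One proportion per corpus, so corpora of different sizes can be compared.
shares = {name: share_of_to(row) for name, row in counts.items()}

for name, share in shares.items():
    print(f"{name:<12} 'different to' = {share:.0%} of the two variants")
```

On these figures, “different to” accounts for none of the Brown instances, about 15% of the BNC Written instances and about 64% of the BNC Spoken instances. This says nothing about statistical significance; it is only a first step before looking at concordance lines.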