Introduction to Corpus Linguistics

1. What is a corpus?

Latin corpus, "body"; plural corpora.
A body of text – a collection of texts – a large database of texts/linguistic examples.

Characteristics of a corpus according to McEnery and Wilson (2001):
- Machine-readable form – stored on and analysed by a computer
- Usually very large – but of a finite size
- Representative sample – corpora are big so that they can be a "representative sample" of a language
  o large enough to establish norms about the form of language being studied
  o large enough to reveal many cases of rare or unusual uses of language that we might not find in a smaller sample
  o a standard reference or 'benchmark'
- Often annotated – extra linguistic information added to the raw data
  o annotation is also known as TAGGING
  o e.g. the grammatical class of each word in a text (noun, verb, etc.)

A corpus could be described as "a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration." (McEnery and Wilson 2001: 32)

2. Corpus linguistics

Characteristics of corpus linguistics according to Biber, Conrad & Reppen (1998: 4):
- uses a corpus
- uses computers for analysis
- empirical – analysing actual patterns of language use
- depends on quantitative and qualitative analytical techniques

Corpus linguistics is the study of language using corpora.
- The basic idea: by analysing VERY large amounts of textual data, we can find out things about language that we cannot find out any other way
- Allows us to test theories about language
- Helps to remove bias
- Helps us to spot common and rare language phenomena

Corpora may contain different sorts of data:
- Written data – extracts from books, magazines, newspapers, websites...
- Spoken data – transcripts of meetings, lectures, radio programmes, everyday conversations...
- Some corpora contain a mixture of both (so both sorts of language are represented)

3. Empiricism and the history of corpus linguistics

Corpus linguistics is an empirical method – it is based on real-world data. However, at the time when the first very large corpora were being compiled, empirical methods were out of fashion in linguistics. This was due to the Chomskyan revolution in linguistics, in which methods based on introspection and intuition were much more important.
- Empirical approaches pay less attention to the internal mind of an individual native speaker, and recognise that people's intuitions about their own language may not be accurate
- So the only way to see how language is used is to look at examples of ordinary, everyday language (written or spoken)

However, after some years the tide turned and corpora began to be used more widely. Some important milestones include:

1960s: The Brown corpus
- 1 million word sample of American English
- Compiled by Francis and Kucera, Brown University
- Classed as "a waste of time" by some of their contemporaries
- http://icame.uib.no/brown/bcm.html

1970s: The LOB corpus
- "Lancaster-Oslo/Bergen": a British English match for Brown
- Compiled by Geoffrey Leech at Lancaster and others
- http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM

1980: The London-Lund corpus
- 500K words of spoken British English
- http://clu.uni.no/icame/manuals/LONDLUND/INDEX.HTM

1986: The Kolhapur Corpus (and other international corpora)
- 1 million words of Indian English
- http://clu.uni.no/icame/manuals/KOLHAPUR/INDEX.HTM

1980s–1996: The Helsinki Corpus
- Texts from 750 CE to 1700 CE
- Diachronic corpus (includes text across time); 1.5 million words
- http://clu.uni.no/icame/manuals/HC/INDEX.HTM

Later corpora have tended to get bigger and bigger:

1990s: The British National Corpus
- 100 million words of British English (produced by a consortium including Lancaster)
- http://www.natcorp.ox.ac.uk/

The Bank of English
- 524 million words (and counting)
- http://www.titania.bham.ac.uk/

2000s: The English Gigaword
- 1,756 million words, released 2003; consists mostly of newswire data
- http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

However, corpus linguistics also uses smaller, more specialised corpora, for example:

The Lancaster Newsbooks Corpus (Tony McEnery and Andrew Hardie)
- 800,000 words of text from newspapers published in the first half of 1654
- Used to study the origins of English journalism
- http://www.lancs.ac.uk/fass/projects/newsbooks/

Note that the lists above are just a sample of a few of the very many corpora that have been collected in a huge range of languages and dialects. See the URLs below for a few more examples:
http://clu.uni.no/icame/manuals/
http://ucrel.lancs.ac.uk/corpora.html

4. The British National Corpus (BNC)

Currently one of the best contemporary UK English corpora:
- 100 million words from the early 1990s
- represents a wide range of both spoken and written modern British English
- Written data
  o 90 million words
  o includes extracts from newspapers, academic books, popular fiction, letters and university essays
- Spoken data
  o 10 million words
  o includes demographic data and context-governed data
  o the demographic part: transcripts of about 900 everyday unscripted spoken conversations
  o the context-governed part: spoken language collected in public contexts – e.g. radio phone-ins, government meetings, classroom interactions

5. Exploiting a corpus

5.1 Wordlists (also known as frequency lists)

All of the words in a corpus, presented in alphabetical order or sorted according to their frequency.
- Basic word counts: how often does each word occur in the corpus?
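The basic word count just described is easy to sketch in code. Here is a minimal illustration in Python (the tokeniser is deliberately crude – real corpus tools use much more careful tokenisation rules – and the sample sentence is invented):

```python
from collections import Counter
import re

def frequency_list(text):
    """Tokenise a text naively and count how often each word occurs."""
    # Lower-case the text and pull out runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = frequency_list("The cat sat on the mat. The cat slept.")
# Sorted by frequency, most common first
for word, count in freqs.most_common(3):
    print(word, count)   # "the" comes out top, with 3 occurrences
```

Sorting the same counts alphabetically instead of by frequency gives the other common wordlist layout mentioned above.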
- Frequency lists are compiled by computer
- This can give us more useful information than you might suspect...
- ... but the frequency statistics still need interpreting

5.2 Concordances

A computerised search for all the examples of a word or a phrase in a corpus.
- Concordance lines show the results of the search and include the search term plus some of the surrounding context for each example
- Usually shown in a sort of table with the search term highlighted in some way
- This informs us about the ways in which a word is usually used in actual sentences

5.3 Collocation

Collocation is the relationship between words that tend to occur together.
- Words which occur more frequently in the vicinity of word X than they do elsewhere in the corpus are called the collocates of X
- Based on frequency (how frequent separately vs. how frequent together)
- Some computer tools can discover non-obvious collocates using statistics
- Collocation forms part of a word's meaning
  o "You shall know a word by the company it keeps" (Firth 1957: 11)
  o the company a word keeps: implicit associations or assumptions
  o near-synonyms often differ in terms of their collocations...

5.4 Statistical significance

Sometimes just looking at frequency in a text can be a bit useless...
- We are really interested in words or other linguistic features that are unusually prominent...
- ... that is, more frequent in the text than in the language as a whole
- These are often the features that have most importance for a text
- But how do we know what counts as "unusually prominent"?
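One standard answer is the log-likelihood statistic widely used in corpus linguistics for exactly this text-versus-reference comparison. The sketch below shows the calculation itself; the word counts are invented for illustration and are not from any real corpus:

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a text of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the text
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented counts: "whale" occurs 120 times in a 50,000-word text
# but only 30 times in a 1,000,000-word reference corpus.
ll = log_likelihood(120, 30, 50_000, 1_000_000)
# A value above 3.84 is significant at p < 0.05 (one degree of freedom);
# this one is far higher, so "whale" is unusually prominent in the text.
print(round(ll, 1))
```

Run over every word in a text, with the significant ones kept, this comparison is exactly how the keywords of section 5.5 are extracted.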
- We can use mathematical procedures called significance tests
- These let us work out whether a difference in the statistics is significant or not
- A test often used in the humanities is the chi-square test, but in corpus linguistics we often use one called log-likelihood

5.5 Keywords

A keyword is a word that is unusually frequent in a text or corpus.
- Based on comparison with a benchmark or reference corpus
- The difference must be statistically significant
- Keywords are often important to the meaning or theme(s) of the text
- They can be a good guide to what would be interesting to investigate further

6. Corpus annotation

Annotation = extra information added to the text (also known as TAGGING).
E.g. the grammatical class of each word in a text (noun, verb, etc.) – POS tagging, see below.

6.1 Part-of-speech (POS) tagging

POS tagging is the most basic type of grammatical analysis of a corpus.

An example tagset: the C7 tagset (devised at Lancaster). See: http://ucrel.lancs.ac.uk/claws7tags.html
For example – verbs:
  o V = all verbs
  o VV, VB, VH, VD = lexical and auxiliary verbs
  o VV0 = root form: "go", "see"
  o VVZ = -s form: "goes", "sees"
  o VVG = -ing form: "going", "seeing"
  o VVD = past tense: "went", "saw"

Tags are useful for improved corpus searches. For instance, can:
  o can as verb – 'He can do it'
  o can as noun – 'That's opened a can of worms!'
So we can use tags to narrow the search.

Tagging is usually done not by hand but by a specialised computer program. An example POS tagger program: the CLAWS system (developed at Lancaster). See: http://ucrel.lancs.ac.uk/claws/trial.html

6.2 How do taggers work?

- Some words are looked up in a computerised dictionary or lexicon
  o Other words can have multiple tags: e.g. "dogs" – plural noun (NN2) or third person singular verb (VVZ)?
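As a preview of how a probabilistic tagger resolves this kind of ambiguity, here is a toy sketch. The tag names follow the C7 tagset, but every probability is invented for illustration; this is not the CLAWS algorithm, just the general idea of combining word-tag and tag-sequence likelihoods:

```python
# P(tag | word): how often "dogs" carries each tag overall (invented)
lexical = {"NN2": 0.9, "VVZ": 0.1}

# P(tag | previous tag): how often each tag follows the previous one (invented)
transition = {
    ("AT", "NN2"): 0.5,      # article -> plural noun is common ("the dogs")
    ("AT", "VVZ"): 0.01,     # article -> -s verb is rare
    ("PPHS1", "NN2"): 0.02,  # pronoun -> plural noun is rare
    ("PPHS1", "VVZ"): 0.6,   # pronoun -> -s verb is common ("he dogs me")
}

def best_tag(prev_tag):
    """Pick the tag maximising P(tag | word) * P(tag | previous tag)."""
    return max(lexical, key=lambda t: lexical[t] * transition[(prev_tag, t)])

print(best_tag("AT"))     # after "the", as in "the dogs" -> NN2
print(best_tag("PPHS1"))  # after "he", as in "he dogs my footsteps" -> VVZ
```

A real probabilistic tagger estimates these numbers from a large annotated corpus rather than writing them in by hand.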
  o Words like this are disambiguated on the basis of the surrounding tags, by one of two methods:
    - Rule-based: the computer is given rules about which tags can and cannot follow one another, and uses these to work out the correct tag
    - Probabilistic (or stochastic): the computer uses information about how frequently tags follow one another to work out what the tag is likely to be
- Tagger programs make mistakes: they are not 100% accurate (CLAWS is 96–97% accurate)
- Note that, particularly in probabilistic tagging, the computer extracts information from a corpus and uses that information to help it understand/analyse/process other examples of language

6.3 Semantic tagging

Computers can, to some extent, understand the meaning of words in the same way they understand grammar.
- Each word is given a label to say what type of meaning that word has
- As with POS tagging, semantic tags can be added to text by a computer program
- At Lancaster we use 21 basic semantic categories. See: http://ucrel.lancs.ac.uk/usas/
- These have more detailed subcategories. See: http://ucrel.lancs.ac.uk/usas/semtags.txt

6.4 Why annotate?

- Training the computer to process language better
- Helping searches in corpora
- Allowing us to search for / count up patterns of grammar or meaning, rather than just the words
- There can be an "upward spiral" effect: a machine-tagged corpus can be checked, improved, then fed back in again to produce an even better tagging program!

7. Applications of corpora and corpus linguistics

7.1 Corpora in lexicography

- Early lexicography: the lexicographer's brain – Samuel Johnson's dictionary
- Later lexicography: readers' observations – The Oxford English Dictionary (1st ed.)
- Modern lexicography is based largely on corpora

7.2 Other applications

- The study of grammar
- Variation studies
- Historical studies
- Comparing different languages
- Stylistics research
- And there are "real-world" applications. E.g. machine translation requires both dictionary lookup (a corpus-based lexicon) and grammar analysis for good translation

8. Conclusion

- Corpus linguistics is an empirical methodology
- It relies on scanning large computerised bodies of text
- Corpus methods were frowned upon during the "Chomskyan revolution" of the 1960s...
- ... but in the 70s, 80s and 90s they came into their own
- There are different ways to get information from a corpus: frequency lists, concordances, etc.
- Corpora can be used in, for example, lexicography, the study of grammar, etc.
- Computers can be programmed to process language to some degree – often by absorbing linguistic knowledge from annotated corpora
- The computer can't do it all for us – we still have to analyse the results and ask "but what does all this mean?"

References

Biber, D., Conrad, S. & Reppen, R. (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Firth, J.R. (1957) A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis (pp. 1–32). Special Volume, Philological Society.
McEnery, T. & Wilson, A. (2001) Corpus Linguistics (2nd ed.). Edinburgh: Edinburgh University Press.

Further Reading

Aijmer, K. (1991) English Corpus Linguistics. London: Longman.
Archer, D. (2007) Computer-assisted literary stylistics: the state of the field. In Lambrou, M. and Stockwell, P. (eds) Contemporary Stylistics. London: Continuum, pp. 244–256.
Baker, P., Hardie, A. and McEnery, T. (2006) A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.
Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. London: Longman.
Mahlberg, M. (2007) 'A corpus stylistic perspective on Dickens's Great Expectations'. In Lambrou, M. and Stockwell, P. (eds) Contemporary Stylistics. London: Routledge, pp. 19–31.
McEnery, T., Xiao, Z. & Tono, Y. (2005) Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge.
Meyer, C. (2002) English Corpus Linguistics. Cambridge: Cambridge University Press.
Semino, E. and Short, M. (2004) Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.
Tribble, C. (2000) Genres, keywords, teaching: towards a pedagogic account of the language of project proposals. In Burnard, L. and McEnery, T. (eds) Rethinking Language Pedagogy from a Corpus Perspective. Berlin: Peter Lang, pp. 74–90.

Some Internet Resources

- Short course on corpus linguistics on the Lancaster University website: http://www.lancs.ac.uk/fss/courses/ling/corpus/
- The research centre for corpus linguistics at Lancaster University: http://ucrel.lancs.ac.uk/
- BNC frequencies: http://ucrel.lancs.ac.uk/bncfreq/
- BNC query tool: http://corpus.byu.edu/bnc
- Log-likelihood information/calculator: http://ucrel.lancs.ac.uk/llwizard.html