Corpus Stylistics
Outline:
• Background and introduction to current work
• Methodology in Corpus Stylistics
• Applications of Corpus Stylistics
• References

Corpus Stylistics
Background: What is Corpus Stylistics?
• The statistical study of style, i.e. study of the relative frequency of elements in a text
  – Augustus de Morgan, 1851: suggested that disputes about the authenticity of some of the writings of St Paul might be settled by measuring the length of the words used in the various Epistles
  – T.C. Mendenhall, 1887: analysis of several authors’ frequency distributions of word-length

Corpus Stylistics
• Corpus: a body or collection of linguistic data for use in research
• Since the early 1960s: interest in computer corpora or machine readable corpora
• Statements about the relative frequency of various linguistic items in a corpus have become very accurate

Corpus Stylistics
• Some uses of statistical analysis of style through corpora:
  – Education, e.g. EFL textbook writing
  – Establishment of authorship, e.g. of unascribed manuscripts
  – Interpretive stylistics, e.g. study of the writer’s ideology and point of view

Corpus Stylistics Methodology
• Simple things may characterise different styles
  – average sentence length
  – average word length
  – type:token ratio (vocabulary richness)
    • number of types = number of different words
    • number of tokens = total number of words
  – vocabulary growth (homogeneity of text)
    • number of new types in 1st, 2nd, …, nth 1000 words
    • in rich varied text, number will climb steadily
• Especially when used comparatively

Corpus Stylistics Methodology (cont’d)
• More complex analyses can give a more interesting picture
  – specific syntactic structures
  – degree of modification in NPs
  – types of verbs (e.g. verbs of persuasion, speech verbs, action verbs, descriptive verbs)
  – distribution of pronouns (1st/2nd/3rd person)
  – etc … (anything you can think of!)
• Quite sophisticated mathematical techniques can give an overall picture
  – e.g.
    factor analysis: identifies, from a (big) range of variables, which ones best identify/characterise differences

Corpus Stylistics Methodology (cont’d)
Multidimensional analysis
• Collect a huge range of measures of many different kinds
  – some simple word counts
  – syntactic features
  – classes and subclasses of N, V, Adj, Adv
• Factor analysis
  – choose a range of features to measure, see which ones are correlated
~150 features in all

Corpus Stylistics Methodology (cont’d)
• Example: work based on corpora trying to quantify and characterise genre and register differences
• Work pioneered by Douglas Biber*
• Biber used statistical measures to identify stylistic factors that co-occurred, and could therefore be definitional of text types and genres
  – E.g. conjuncts like therefore, nevertheless and use of passive together indicate more formal style
*D. Biber, S. Conrad & R. Reppen, Corpus Linguistics: Investigating Language Structure and Use, Ch 5: the study of discourse characteristics

Corpus Stylistics Methodology (cont’d)
• Corpora useful not only for counting frequencies of features, but also:
• Concordancing
  – Lists occurrences of word in context
  – Identify syntactic use of word
  – Identify range of meanings
  – Identify relative frequency of different uses/meanings
• Collocation
  – What words occur together?
  – Compare distribution of close synonyms

Corpus Stylistics Methodology (cont’d)
Vocabulary in context
• “Concordance”, also known as KWIC list (key word in context)
• Allows us to see the (immediate) environment in which a word appears
• Listings can be customised to show what you want more clearly, e.g.
  – sorted according to next or previous word
  – showing more or less context

Corpus Stylistics Methodology (cont’d)
Collocation
• Term coined by J R Firth (1957) to characterise (part of) his theory of meaning
• “You shall know a word by the company it keeps”
• “The occurrence of two or more words within a short space of each other in a text” (Sinclair 1991)
• “The relationship a lexical item has with items that appear with greater than random probability in its (textual) context” (Hoey 1991)

Style and Corpora Methodology (cont’d)
Collocation, text type and style – example:
• Distinguish between general and more usual collocations vs. technical and more personal ones
• e.g. in a general corpus time collocates with save, spend, waste, fritter away, …
• but in a corpus of sports reports time collocates with half, full, extra, injury, first, second, third, …

Style and Corpora Applications
Stylometry
• An attempt to capture the essence of the style of a particular author by reference to a variety of quantitative criteria, usually lexical, called discriminators.
• Study of frequently occurring features (word/sentence length; choice and frequency of words; vocabulary richness)
• The ideal situation for authorship studies is
  – when there are large amounts of undisputed text, or
  – few contenders for the authorship of the disputed text(s).

Style and Corpora Applications (cont’d)
Author attribution
Establishing the author of an unascribed manuscript:
• Build corpora
  – A: works definitely by author A
  – B: works definitely by author B
  – C: works of disputed authorship, but probably written by A or B
• Then select discriminators and associated measures
• When the technique has been shown to discriminate effectively between A and B, then try it on C (M.
  Oakes: ‘Computational Stylometry’, in Handbook of Corpus Linguistics)

Style and Corpora Applications (cont’d)
Language Learning
• Frequency, in particular word frequency, had a role in language learning in the days before electronic corpora existed.
• The ‘corpus revolution’ made frequency information about language use available in a totally unprecedented way
• Frequency dictionaries and frequency-based grammatical information are becoming more and more available, and new sources of frequency information from the Web are being tapped
• Various kinds of knowledge found in present-day language textbooks (grammatical, collocational, semantic) are increasingly frequency-based.
• In general, corpora represent real usage of language
• In addition, “more frequent” can equal “more important” in many aspects of language learning

Style and Corpora Applications (cont’d)
Interpretive stylistics
• Programmes like WordSmith Tools and other Windows-based applications allow researchers to derive a list of keywords (words which occur significantly more often than expected in texts when compared to a reference corpus).
• Keywords are a powerful and quick means of analysis, and they have been used to examine discourses relating to specific social and cultural issues, and the ideology behind authors / texts
• See e.g. work by P. Baker on gender and sexual identity

Reading
Leech, G. Language in Literature: Style and Foregrounding (Longman, 2008), ch. 11
Leech, G. and Short, M. Style in Fiction (Routledge, 2007), ch. 2 and 3
Semino, E. and Short, M. Corpus Stylistics: Speech, writing and thought presentation in a corpus of English writing (Routledge, 2004)
Short, M. Exploring the language of poems, plays, and prose (Longman, 1996), ch. 11
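
Worked sketch: type:token ratio and vocabulary growth
The simple measures listed under Methodology (number of types, number of tokens, new types per block of text) can be illustrated with a few lines of Python. This is a minimal sketch, not a reference implementation: the tokeniser is deliberately crude, the function names are made up for illustration, and the sample sentence is invented. The slides count new types per 1000 words; the example uses blocks of 5 only because the sample is tiny.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes (a crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Types (different words) divided by tokens (total words)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def vocabulary_growth(tokens, block_size=1000):
    """Number of types seen for the first time in each successive block of tokens."""
    seen = set()
    new_per_block = []
    for start in range(0, len(tokens), block_size):
        block = set(tokens[start:start + block_size])
        new_per_block.append(len(block - seen))  # types not met in earlier blocks
        seen |= block
    return new_per_block

# Invented sample text, for illustration only.
sample = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(sample)
print(len(set(tokens)), "types /", len(tokens), "tokens")
print("type:token ratio:", round(type_token_ratio(tokens), 3))
print("new types per block of 5:", vocabulary_growth(tokens, block_size=5))
```

As the slides note, such figures are most informative when used comparatively, e.g. computed on equal-sized samples from two authors, since the type:token ratio falls as a text gets longer.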
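
Worked sketch: a KWIC concordance
The concordance (KWIC) listing described under "Vocabulary in context" can be sketched as below, including the two customisations the slide mentions: more or less context, and sorting by the next word. This is an illustrative sketch under simple assumptions (whitespace-and-punctuation tokenising, context counted in words, context allowed to run across sentence boundaries); the function names and the sample text are invented.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword.

    width is the number of words of context shown on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def sort_by_next_word(lines):
    """Sort concordance lines alphabetically by their right-hand context."""
    return sorted(lines, key=lambda line: line[2])

# Invented sample text, for illustration only.
text = "They waste time talking. We spend time reading. Extra time was played."
for left, key, right in sort_by_next_word(kwic(text, "time", width=2)):
    print(f"{left:>15} [{key}] {right}")
```

Sorting on `line[0]` reversed word by word would give the "previous word" ordering instead; that variant is left out to keep the sketch short.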
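
Worked sketch: counting collocates in a window
The contrast drawn in the collocation slides (time + spend/waste in general text vs. time + extra/injury in sports reports) can be approximated by counting which words fall within a short span of a node word, in the spirit of Sinclair's definition. The sketch below only counts raw co-occurrences; it does not test whether they are "greater than random", which Hoey's definition would require. The function name, window size, and both mini "corpora" are invented for illustration.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words occurring within `window` words either side of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Two invented one-sentence "corpora", for illustration only.
general = "do not waste time and do not spend time idly"
sports = "the goal came in extra time after injury time in the first half"

print("general:", collocates(general, "time", window=1).most_common())
print("sports: ", collocates(sports, "time", window=1).most_common())
```

On real corpora the raw counts would be dominated by very frequent words such as "the" and "in", which is why collocation studies normally add a statistical association measure on top of counts like these.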
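
Worked sketch: keywords against a reference corpus
The keyword idea from the interpretive stylistics slide (words occurring more often than expected when a text is compared with a reference corpus) can be sketched with a log-likelihood score, one statistic commonly used for this purpose. This is an assumption-laden illustration, not a description of how WordSmith Tools or any other package computes keyness internally; the function names and the two token lists are invented.

```python
import math
from collections import Counter

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood score comparing a word's frequency in two corpora."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

def keywords(study_tokens, ref_tokens, top=5):
    """Words relatively more frequent in the study text, ranked by score.

    Assumes both token lists are non-empty.
    """
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    scored = []
    for word, freq in study.items():
        # Keep only words over-represented in the study text.
        if freq / n_study > ref[word] / n_ref:
            scored.append((word, log_likelihood(freq, n_study, ref[word], n_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Invented token lists, for illustration only.
study = "goal goal goal the match the goal".split()
reference = "the the the cat sat on the mat".split()
for word, score in keywords(study, reference):
    print(f"{word:10} {score:.2f}")
```

A real keyword analysis would also apply a significance threshold and a minimum frequency, so that only words occurring "significantly more often than expected" are reported.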