Corpus Stylistics

• Background and introduction to current
• Methodology in Corpus Stylistics
• Applications of Corpus Stylistics
• References
Background: What is Corpus Stylistics?
• The statistical study of style, i.e. study of
the relative frequency of elements in a text
– Augustus de Morgan, 1851: suggested that disputes
about the authenticity of some of the writings of St
Paul might be settled by measuring the length
of the words used in the various Epistles
– T.C. Mendenhall, 1887: analysis of several
authors’ frequency distributions of word-length
• Corpus: a body or collection of linguistic
data for use in research
• Since the early 1960s: interest in computer
corpora or machine readable corpora
• Statements about the relative frequency of
various linguistic items in a corpus have
become very accurate
• Some uses of statistical analysis of style
through corpora:
– Education, e.g. EFL textbook writing
– Establishment of authorship, e.g. of
unascribed manuscripts
– Interpretive stylistics, e.g. study of the
writer’s ideology and point of view
Methodology in Corpus Stylistics
• Simple things may characterise different styles
– average sentence length
– average word length
– type:token ratio (vocabulary richness)
• number of types = number of different words
• number of tokens = total number of words
– vocabulary growth (homogeneity of text)
• number of new types in 1st, 2nd, …, nth 1000 words
• in rich varied text, number will climb steadily
• Especially when used comparatively
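The simple measures above can be sketched in a few lines of Python. This is only an illustration: the tokeniser and sentence splitter are deliberately crude assumptions, and the function names and sample text are invented for the example.

```python
import re

def tokenize(text):
    """Lower-case word tokens (crude assumption: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (a simplification)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_measures(text):
    """Average sentence length, average word length and type:token ratio."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    types = set(tokens)  # types = different words; tokens = all words
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(types) / len(tokens),
    }

def vocabulary_growth(tokens, chunk_size=1000):
    """Number of new types in the 1st, 2nd, ..., nth chunk of chunk_size tokens."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), chunk_size):
        chunk = set(tokens[start:start + chunk_size])
        growth.append(len(chunk - seen))
        seen |= chunk
    return growth

# Toy text, invented for illustration; real use needs far longer samples.
sample = "The cat sat on the mat. The dog sat on the log."
measures = simple_measures(sample)
growth = vocabulary_growth(tokenize(sample), chunk_size=6)
print(measures)
print(growth)
```

Since the measures are most informative when used comparatively, the same functions would normally be run over two or more texts and the results set side by side.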
Methodology (cont’d)
• More complex analyses can give a more interesting picture:
– specific syntactic structures
– degree of modification in NPs
– types of verbs (e.g. verbs of persuasion, speech verbs, action
verbs, descriptive verbs)
– distribution of pronouns (1st/2nd/3rd person)
– etc … (anything you can think of!)
• Quite sophisticated mathematical techniques can give an
overall picture
– e.g. factor analysis: identifies from a (big) range of variables
which ones best identify/characterise differences
Methodology (cont’d)
Multidimensional analysis
• Collect a huge range of measures of a
wide variety
– some simple word counts
– syntactic features
– classes and subclasses of N, V, Adj, Adv
• Factor analysis
– choose a range of features to measure, see
which ones are correlated
~150 features in all
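The step "see which ones are correlated" can be illustrated with a plain Pearson correlation over a small feature table. This sketch covers only that first step, not factor analysis itself, which goes on to group correlated features into factors. The feature names and all the numbers are invented toy values, not results from any study.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_table(features):
    """Correlation for every pair of features in the table."""
    return {
        (a, b): pearson(features[a], features[b])
        for a, b in combinations(features, 2)
    }

# Hypothetical rates per 1000 words in four texts (toy numbers, invented).
features = {
    "conjuncts": [1.0, 2.0, 6.0, 7.0],
    "passives": [3.0, 5.0, 12.0, 14.0],
    "second_person": [20.0, 15.0, 2.0, 1.0],
}

table = correlation_table(features)
for pair, r in table.items():
    print(pair, round(r, 2))
```

In this toy table conjuncts and passives rise and fall together while second-person pronouns move the other way, which is the kind of co-occurrence pattern a factor analysis would pick up across a much larger set of features.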
Methodology (cont’d)
• Example: work based on corpora trying to
quantify and characterise genre and register
• Work pioneered by Douglas Biber*
• Biber used statistical measures to identify
stylistic factors that co-occurred, and could
therefore be definitional of text types and genres
– E.g. conjuncts like therefore, nevertheless and use of passive
together indicate more formal style
*D. Biber, S. Conrad & R. Reppen, Corpus Linguistics: Investigating
Language Structure and Use, Ch 5: the study of discourse characteristics
Methodology (cont’d)
• Corpora useful not only for counting frequencies
of features, but also:
• Concordancing
– Lists occurrences of a word in context
– Identify syntactic use of the word
– Identify range of meanings
– Identify relative frequency of different uses/meanings
• Collocation
– What words occur together?
– Compare distribution of close synonyms
Methodology (cont’d)
Vocabulary in context
• “Concordance”, also known as KWIC list
(key word in context)
• Allows us to see the (immediate)
environment in which a word appears
• Listings can be customised to show what
you want more clearly, e.g.
– sorted according to next or previous word
– showing more or less context
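A KWIC listing of this kind can be sketched as follows. The function names, the `width` and `sort_by` parameters and the sample text are all assumptions made for the illustration; dedicated concordancers offer far more options.

```python
import re

def kwic(text, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) for each occurrence.

    width   = number of words of context shown on each side
    sort_by = "next" or "previous" to sort on the neighbouring word
    """
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, token, right))
    if sort_by == "next":
        rows.sort(key=lambda row: row[2][:1])
    elif sort_by == "previous":
        rows.sort(key=lambda row: row[0][-1:])
    return rows

def format_kwic(rows):
    """Line up the keyword in a column so the contexts are easy to scan."""
    lines = []
    for left, node, right in rows:
        lines.append(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
    return "\n".join(lines)

# Toy text, invented for illustration.
text = "We had time to spare. She lost time again. No time was wasted."
rows = kwic(text, "time", width=2, sort_by="next")
print(format_kwic(rows))
```

Changing `width` shows more or less context, and `sort_by` groups together lines that share the same following or preceding word.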
Methodology (cont’d)
Collocation
• Term coined by J R Firth (1957) to characterise
(part of) his theory of meaning
• “You shall know a word by the company it
keeps”
• “The occurrence of two or more words within a
short space of each other in a text” (Sinclair)
• “The relationship a lexical item has with items
that appear with greater than random probability
in its (textual) context” (Hoey 1991)
Methodology (cont’d)
Collocation, text type and style – example:
• Distinguish between general and more
usual collocations vs. technical and more
personal ones
• e.g. in a general corpus time collocates
with save, spend, waste, fritter away, …
• but in a corpus of sports reports time
collocates with half, full, extra, injury, first,
second, third, …
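The contrast between collocates of time in different text types can be sketched by counting the words that fall within a short span of the node word. The two mini "corpora" below are invented, and the expected-frequency formula used for the association score is one common approximation, taken here as an assumption rather than as the method of any particular tool.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case word tokens; punctuation is dropped, so spans cross sentences."""
    return re.findall(r"[a-z]+", text.lower())

def collocates(tokens, node, span=1):
    """Count words occurring within `span` words either side of the node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

def association(tokens, node, collocate, span=1):
    """log2(observed / expected): above 0 means more often than chance.

    Expected co-occurrence is approximated (assumption) as
    f(node) * f(collocate) * window size / corpus size.
    """
    freq = Counter(tokens)
    observed = collocates(tokens, node, span)[collocate]
    if observed == 0:
        return float("-inf")
    expected = freq[node] * freq[collocate] * 2 * span / len(tokens)
    return math.log2(observed / expected)

# Toy texts, invented for illustration.
general = tokenize("Do not waste time. Try to save time and spend time wisely.")
sports = tokenize("They scored in injury time. Extra time followed full time.")

general_coll = collocates(general, "time")
sports_coll = collocates(sports, "time")
print(general_coll.most_common())
print(sports_coll.most_common())
print(round(association(sports, "time", "extra"), 2))
```

Even on these tiny samples, the general text yields waste, save and spend while the sports text yields injury, extra and full.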
Applications of Corpus Stylistics
• An attempt to capture the essence of the style of
a particular author by reference to a variety of
quantitative criteria, usually lexical, called
discriminants
• Study of frequently occurring features
(word/sentence length; choice and frequency of
words; vocabulary richness)
• The ideal situation for authorship studies is
– when there are large amounts of undisputed text, or
– few contenders for the authorship of the disputed
text
Applications (cont’d)
Author attribution
Establishing the author of an unascribed
manuscript
• Build corpora
– A - works definitely by author A
– B - works definitely by author B
– C - works of disputed authorship, but probably written
by A or B
• Then select discriminants and associated measures
• When the technique has been shown to discriminate
effectively between A and B, then try it on C
(M. Oakes: ‘Computational Stylometry’, in Handbook of
Corpus Linguistics)
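The A / B / C procedure above can be sketched as follows. Everything here is an assumption made for illustration: the discriminants (a short hypothetical list of function words), the measure (rate per 1000 words), the distance (mean absolute difference) and the toy texts. It is not the method described by Oakes, only a minimal instance of the same workflow.

```python
import re
from collections import Counter

# Hypothetical discriminants chosen for the toy example.
DISCRIMINANTS = ["the", "of", "and", "to", "in", "upon", "while", "whilst"]

def profile(text):
    """Rate per 1000 words of each discriminant in one text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [1000 * counts[w] / len(tokens) for w in DISCRIMINANTS]

def mean_profile(texts):
    """Average profile over all texts in a corpus."""
    profiles = [profile(t) for t in texts]
    return [sum(column) / len(column) for column in zip(*profiles)]

def distance(p, q):
    """Mean absolute difference between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def attribute(text, corpus_a, corpus_b):
    """Assign the text to whichever corpus has the nearer mean profile."""
    d_a = distance(profile(text), mean_profile(corpus_a))
    d_b = distance(profile(text), mean_profile(corpus_b))
    return "A" if d_a < d_b else "B"

def leave_one_out_accuracy(corpus_a, corpus_b):
    """Check the technique on texts of known authorship before using it on C."""
    correct = total = 0
    for i, text in enumerate(corpus_a):
        rest = corpus_a[:i] + corpus_a[i + 1:]
        correct += attribute(text, rest, corpus_b) == "A"
        total += 1
    for i, text in enumerate(corpus_b):
        rest = corpus_b[:i] + corpus_b[i + 1:]
        correct += attribute(text, corpus_a, rest) == "B"
        total += 1
    return correct / total

# Toy corpora, invented for illustration.
corpus_a = ["He relied upon the law whilst the court sat.",
            "Upon the bench the judge spoke whilst we waited."]
corpus_b = ["He relied on the law while the court sat.",
            "On the bench the judge spoke while we waited."]
disputed = "The clerk waited upon the judge whilst the jury talked."

accuracy = leave_one_out_accuracy(corpus_a, corpus_b)
verdict = attribute(disputed, corpus_a, corpus_b)
print(accuracy, verdict)
```

The leave-one-out check mirrors the step of showing that the technique separates A from B before it is tried on C; with real data it would need many more and much longer texts.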
Applications (cont’d)
Language Learning
• Frequency - in particular, word frequency - had a role in
language learning in the days before electronic corpora
• The 'corpus revolution' made available frequency information
about language use in a totally unprecedented way
• Frequency dictionaries and frequency-based grammatical
information are becoming more and more available and new
sources of frequency information from the Web are being
• Various kinds of knowledge found in present-day language
textbooks (grammatical, collocational, semantic) are getting to
be frequency-based.
• In general, corpora represent real usage of language
• In addition, “more frequent” can equal “more important” in
many aspects of language learning
Applications (cont’d)
Interpretive stylistics
• Programmes like WordSmith Tools and other
Windows-based applications allow researchers
to derive a list of keywords (words which occur
significantly more often than expected in texts
when compared to a reference corpus).
• Keywords are a powerful and quick means of
analysis, and they have been used to examine
discourses relating to specific social and cultural
issues, and the ideology behind authors / texts
• See e.g. work by P. Baker on gender and sexual
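A keyword list of this kind can be approximated with a log-likelihood comparison between a study text and a reference corpus. Log-likelihood is one commonly used keyness statistic; whether it matches the settings used in any particular tool or study is an assumption, and the function names and toy texts are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case word tokens (crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c study tokens, b times in d reference tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_study)
    if b > 0:
        ll += b * math.log(b / expected_ref)
    return 2 * ll

def keywords(study_text, reference_text, top=5):
    """Words relatively more frequent in the study text, ranked by keyness."""
    study = Counter(tokenize(study_text))
    reference = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference[word]
        if a / c > b / d:  # keep only words over-represented in the study text
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [(word, round(score, 2)) for score, word in scored[:top]]

# Toy texts, invented for illustration.
study_text = "Goal goal goal the match the goal"
reference_text = "The cat the dog the bird the fish"
result = keywords(study_text, reference_text)
print(result)
```

A word with the same relative frequency in both corpora scores 0, so the ranking brings out what is distinctive about the study text rather than what is simply frequent.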
References
Leech, G. Language in Literature: Style and
Foregrounding (Longman, 2008), ch. 11
Leech, G. and Short, M. Style in Fiction
(Routledge, 2007), ch. 2 and 3
Semino, E. & M. Short, Corpus Stylistics: Speech,
writing and thought presentation in a corpus of
English writing (Routledge, 2004)
Short, M. Exploring the language of poems, plays,
and prose (Longman, 1996), ch. 11
