Corpus Stylistics

• Background and introduction to current
• Methodology in Corpus Stylistics
• Applications of Corpus Stylistics
• References
Background: What is Corpus Stylistics?
• The statistical study of style, i.e. study of
the relative frequency of elements in a text
– Augustus de Morgan, 1851: suggested that disputes
about the authenticity of some of the writings of St
Paul might be settled by measuring the length
of the words used in the various Epistles
– T.C. Mendenhall, 1887: analysis of several
authors’ frequency distributions of word-length
• Corpus: a body or collection of linguistic
data for use in research
• Since the early 1960s: interest in computer
corpora or machine readable corpora
• Statements about the relative frequency of
various linguistic items in a corpus have
become very accurate
• Some uses of statistical analysis of style
through corpora:
– Education, e.g. EFL textbook writing
– Establishment of authorship, e.g. of
unascribed manuscripts
– Interpretive stylistics, e.g. study of the
writer’s ideology and point of view
Methodology
• Simple things may characterise different styles
– average sentence length
– average word length
– type:token ratio (vocabulary richness)
• number of types = number of different words
• number of tokens = total number of words
– vocabulary growth (homogeneity of text)
• number of new types in 1st, 2nd, …, nth 1000 words
• in rich varied text, number will climb steadily
• These measures are especially informative when used comparatively
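The simple measures above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the tokenizer, the sentence splitter, the function names and the sample text are all assumptions made for the example, and a block size of 4 stands in for the 1000-word blocks mentioned above.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def simple_measures(text):
    """Average sentence length, average word length and type:token ratio."""
    tokens = tokenize(text)
    # Naive sentence split on runs of '.', '!' or '?' (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    types = set(tokens)
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(types) / len(tokens),
    }

def vocabulary_growth(tokens, block_size=1000):
    """Number of new types introduced in each successive block of tokens."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), block_size):
        block = set(tokens[start:start + block_size])
        growth.append(len(block - seen))
        seen |= block
    return growth

# Invented sample text, for illustration only.
sample = "The cat sat. The cat ran away! Did the dog run?"
measures = simple_measures(sample)
growth = vocabulary_growth(tokenize(sample), block_size=4)
print(measures)
print(growth)
```

Because the type:token ratio falls as texts get longer, figures like these are best compared across samples of similar size.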
Methodology (cont’d)
• More complex analyses can give a more interesting picture
– specific syntactic structures
– degree of modification in NPs
– types of verbs (e.g. verbs of persuasion, speech verbs, action
verbs, descriptive verbs)
– distribution of pronouns (1st/2nd/3rd person)
– etc … (anything you can think of!)
• Quite sophisticated mathematical techniques can give an
overall picture
– e.g. factor analysis: identifies from a (big) range of variables
which ones best identify/characterise differences
Methodology (cont’d)
Multidimensional analysis
• Collect a huge range of measures of a
wide variety
– some simple word counts
– syntactic features
– classes and subclasses of N, V, Adj, Adv
• Factor analysis
– choose a range of features to measure, see
which ones are correlated
~150 features in all
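The "see which ones are correlated" step can be illustrated with a plain Pearson correlation between feature rates measured across several texts. This is only the correlation step that a factor analysis builds on, not factor analysis itself. The feature names and all figures below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical rates per 1,000 words in four texts (invented figures).
features = {
    "passives": [12.0, 10.0, 3.0, 2.0],
    "conjuncts": [6.0, 5.5, 1.5, 1.0],
    "first_person_pronouns": [5.0, 8.0, 30.0, 35.0],
}

names = list(features)
correlations = {
    (a, b): pearson(features[a], features[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

In this toy data passives and conjuncts rise and fall together while first-person pronouns move the other way, which is the kind of co-occurrence pattern a factor analysis would group into one dimension.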
Methodology (cont’d)
• Example: work based on corpora trying to
quantify and characterise genre and register
• Work pioneered by Douglas Biber*
• Biber used statistical measures to identify
stylistic factors that co-occurred, and could
therefore be definitional of text types and genres
– E.g. conjuncts like therefore, nevertheless and use of passive
together indicate more formal style
*D. Biber, S. Conrad & R. Reppen, Corpus Linguistics: Investigating
Language Structure and Use, Ch 5: the study of discourse characteristics
Methodology (cont’d)
• Corpora useful not only for counting frequencies
of features, but also:
• Concordancing
– Lists occurrences of a word in context
– Identify syntactic use of a word
– Identify range of meanings
– Identify relative frequency of different uses/meanings
• Collocation
– What words occur together?
– Compare distribution of close synonyms
Methodology (cont’d)
Vocabulary in context
• “Concordance”, also known as KWIC list
(key word in context)
• Allows us to see the (immediate)
environment in which a word appears
• Listings can be customised to show what
you want more clearly, e.g.
– sorted according to next or previous word
– showing more or less context
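A KWIC listing with the two customisations just mentioned (sorting by next or previous word, and adjustable context) might look like the sketch below. The function names, the `width` and `sort_by` parameters and the sample text are assumptions made for this example; real concordancers offer far more options.

```python
import re

def kwic(text, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, token, right))
    if sort_by == "next":
        # Sort on the first word to the right of the keyword.
        hits.sort(key=lambda hit: hit[2][:1])
    elif sort_by == "previous":
        # Sort on the last word to the left of the keyword.
        hits.sort(key=lambda hit: hit[0][-1:])
    return hits

def show(hits):
    """Print hits with the keyword aligned in a column."""
    for left, key, right in hits:
        print(f"{' '.join(left):>20}  {key}  {' '.join(right)}")

# Invented sample text, for illustration only.
text = "We waste time. They save time and spend time wisely. Time flies."
lines = kwic(text, "time", width=2, sort_by="next")
show(lines)
```

Raising `width` shows more context per line; `sort_by="previous"` groups hits by the word before the keyword instead.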
Methodology (cont’d)
• Collocation: term coined by J R Firth (1957) to characterise
(part of) his theory of meaning
• “You shall know a word by the company it keeps”
• “The occurrence of two or more words within a
short space of each other in a text” (Sinclair)
• “The relationship a lexical item has with items
that appear with greater than random probability
in its (textual) context” (Hoey 1991)
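Sinclair's "short space" is usually operationalised as a window of a few words either side of the node word. The sketch below simply counts the words found in that window. It reports raw counts only and does not test whether a collocate occurs "with greater than random probability" as Hoey's definition requires; that would need an association measure on top. The function name, the window size and both sample texts are assumptions for illustration.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each `node` hit."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # The window ignores sentence boundaries (a simplification).
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample texts, for illustration only.
general = "Do not waste time. Save time where you can and spend time well."
sports = ("They scored in injury time. At half time it was level, "
          "and extra time followed full time.")

general_counts = collocates(general, "time", window=1)
sports_counts = collocates(sports, "time", window=1)
print(general_counts.most_common(3))
print(sports_counts.most_common(3))
```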
Style and Corpora
Methodology (cont’d)
Collocation, text type and style – example:
• Distinguish between general and more
usual collocations vs. technical and more
personal ones
• e.g. in a general corpus time collocates
with save, spend, waste, fritter away, …
• but in a corpus of sports reports time
collocates with half, full, extra, injury, first,
second, third, …
Applications
• An attempt to capture the essence of the style of
a particular author by reference to a variety of
quantitative criteria, usually lexical, called
stylometry
• Study of frequently occurring features:
word/sentence length; choice and frequency of
words; vocabulary richness
• The ideal situation for authorship studies is
– when there are large amounts of undisputed text, or
– few contenders for the authorship of the disputed text
Applications (cont’d)
Author attribution
Establishing the author of an unascribed
• Build corpora
– A - works definitely by author A
– B - works definitely by author B
– C - works of disputed authorship, but probably written
by A or B
• Then select discriminants and associated measures
• When the technique has been shown to discriminate
effectively between A and B, then try it on C
(M. Oakes: ‘Computational Stylometry’, in Handbook of
Corpus Linguistics)
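The procedure above (build corpora A, B and C, select discriminants, check that they separate A from B, then apply them to C) can be sketched as follows. This is a toy illustration under stated assumptions: the discriminants are a hypothetical list of function words, the measure is a simple sum of absolute differences in relative frequency, and every text is invented. It is not the specific method described by Oakes.

```python
import re

# Hypothetical discriminants: a handful of function words.
DISCRIMINANTS = ["the", "and", "in", "upon", "whilst"]

def profile(text):
    """Relative frequency of each discriminant word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) / len(tokens) for word in DISCRIMINANTS]

def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))

def attribute(text, known_a, known_b):
    """Return 'A' or 'B', whichever known corpus the text is closer to."""
    target = profile(text)
    d_a = distance(target, profile(known_a))
    d_b = distance(target, profile(known_b))
    return "A" if d_a < d_b else "B"

# Invented corpora: works definitely by A, definitely by B, and disputed C.
a_known = "upon the hill and upon the shore the men stood upon the wall"
b_known = "whilst the rain fell the men slept whilst the fire burned in the hall"
a_held_out = "upon the road the king rode upon the horse"
b_held_out = "whilst the wind blew the ship sailed whilst the crew sang"
c_disputed = "the army marched upon the city and camped upon the plain"

# Step 1: check that the technique discriminates on held-out known texts.
validated = (attribute(a_held_out, a_known, b_known) == "A"
             and attribute(b_held_out, a_known, b_known) == "B")

# Step 2: only then try it on the disputed text.
verdict = attribute(c_disputed, a_known, b_known) if validated else None
print(validated, verdict)
```

With real data the known corpora would be far larger, and the held-out check would be repeated over many samples before trusting any verdict on C.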
Applications (cont’d)
Language Learning
• Frequency - in particular, word frequency - had a role in
language learning in the days before electronic corpora
• The 'corpus revolution' made available frequency information
about language use in a totally unprecedented way
• Frequency dictionaries and frequency-based grammatical
information are becoming more and more available and new
sources of frequency information from the Web are being
• Various kinds of knowledge found in present-day language
textbooks (grammatical, collocational, semantic) are increasingly
frequency-based.
• In general, corpora represent real usage of language
• In addition, “more frequent” can equal “more important” in
many aspects of language learning
Applications (cont’d)
Interpretive stylistics
• Programs like WordSmith Tools and other
Windows-based applications allow researchers
to derive a list of keywords (words which occur
significantly more often than expected in texts
when compared to a reference corpus).
• Keywords are a powerful and quick means of
analysis, and they have been used to examine
discourses relating to specific social and cultural
issues, and the ideology behind authors / texts
• See e.g. work by P. Baker on gender and sexual
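Keyword extraction of the kind described above can be sketched with a log-likelihood statistic, a measure commonly used for keyness. The code compares each word's frequency in a study text with its frequency in a reference corpus and keeps the words that are significantly over-represented. The function names, the 3.84 cut-off (the chi-square critical value for p < 0.05 with one degree of freedom) and the sample texts are assumptions for this example; it does not claim to reproduce the exact procedure of WordSmith Tools.

```python
import math
import re
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word with frequency a in a study corpus of
    size c and frequency b in a reference corpus of size d."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total

def keywords(study_text, reference_text, threshold=3.84):
    """Words significantly more frequent in the study text, best first."""
    study = Counter(re.findall(r"\w+", study_text.lower()))
    reference = Counter(re.findall(r"\w+", reference_text.lower()))
    c, d = sum(study.values()), sum(reference.values())
    found = []
    for word, a in study.items():
        b = reference[word]
        if a / c > b / d:  # keep only words over-represented in the study text
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                found.append((word, score))
    return sorted(found, key=lambda pair: pair[1], reverse=True)

# Invented toy texts, for illustration only.
study_text = "war war war war war war the"
reference_text = "the the the peace garden the the"
result = keywords(study_text, reference_text)
print(result)
```

A real reference corpus would be much larger than the study text, and the resulting keyword list is a starting point for interpretation rather than a finding in itself.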
References
Leech, G. Language in Literature: Style and Foregrounding (Longman, 2008), ch. 11
Leech, G. and Short, M. Style in Fiction (Routledge, 2007), ch. 2 and 3
Semino, E. and Short, M. Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Writing (Routledge, 2004)
Short, M. Exploring the Language of Poems, Plays and Prose (Longman, 1996), ch. 11