Corpora in literary and stylistic studies

Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com
Aims of this session
• Lecture
– An overview of applications of corpora in literary and
stylistic studies
– Case study: Culpeper’s (2002) keyword analysis of six
characters in Romeo and Juliet
• Lab session
– To duplicate Culpeper’s (2002) study
Corpora vs. literary stylistics
• Stylistic shifts in usage may be observed with
reference to features associated with either
particular situations of use or particular groups of
speakers (cf. Schilling-Estes 2002: 375)
– In this sense, similar to registers and genres or dialects
and language varieties
– …but stylisticians are typically more interested in
individual works by individual authors rather than
language or language variety as such
• The use of corpora in stylistics and literary studies is
presently very limited
Potential uses of corpora
• Study of prose style
• Study of individual authorial styles
• Authorship attribution
• Literary appreciation and criticism
• Teaching of stylistics
• Study of literariness in discourses other than
literary texts (e.g. Carter 1999)
Study of prose style
• In stylistics, there is a long tradition of
focusing on the representation of speech and
thought in fiction
• Leech and Short’s (1981) influential model of
speech and thought presentation
– Style in Fiction, Longman, 1981
• Further refined in Short, Semino and Culpeper
(1996), and Semino, Short and Culpeper
(1997)
S&TP: Lancaster Speech, Thought and
Writing Presentation Corpus
• Developed during 1994-2003
– Written: 260,000 words in size, three narrative genres:
prose fiction, newspaper reportage and
(auto)biography, which are further divided into
‘serious’ and ‘popular’ sections
– Spoken: created with the express aim of comparing
S&TP in spoken and written languages systematically,
260,000 words, 60 samples from BNCdemo, and 60
samples from oral history archives in the Centre for
North West Regional Studies at Lancaster
• Download: http://ota.ahds.ac.uk/headers/2464.xml
S&TP categories
• Direct category, e.g.
– direct speech, direct thought and direct writing
• Free direct category, e.g.
– free direct speech, free direct thought, free direct writing
• Indirect category, e.g.
– indirect speech, indirect thought, indirect writing
• Free indirect category
– free indirect speech, free indirect thought, free indirect writing
• Representation of speech/thought/writing act category
• Representation of voice/internal state/writing category
• Report category, e.g.
– report of speech, report of thought, report of writing
Authorial styles of individual authors
• Typically specialized corpora of the works of
individual authors, e.g.
– A corpus composed of their early and later works to track any
stylistic shift over time
– A corpus composed of their works belonging to different genres
(e.g. plays and essays) to compare their styles across genres
– A corpus composed of works by different authors to compare
their different authorial styles
• Large general corpora can provide ‘a means of
establishing a norm for comparison when discussing
features of literary style’ (Hunston 2002: 128)
Techniques of studying authorial styles
• Corpus stylistics goes well beyond simple
counting and relies heavily on
sophisticated statistical approaches
– MDA (e.g. Watson 1994)
– Principal Component Analysis (e.g. Binongo and Smith
1999)
– Multivariate analysis (or more specifically, cluster
analysis, e.g. Watson 1999; Hoover 2003)
• Stylistics + computation + statistics
– stylometry, stylometrics, computational stylistics,
statistical stylistics, corpus stylistics
Authorship attribution
• Is the work by Shakespeare or Marlowe?
• Cluster analysis of frequent words, frequent word
sequences, and frequent collocations provides an accurate
and robust method for authorship attribution (Hoover
2001, 2002, 2003a, 2003b)
• Corpus-based authorship attribution has been used as
linguistic evidence in court (“forensic linguistics”)
– Confession/witness statements (e.g. Coulthard 1993)
– Blackmail/ransom/suicide notes (Baldauf 1999)
• Plagiarism detection in academic and education settings (e.g.
Turnitin UK)
The Derek Bentley case
• Derek Bentley was hanged in the UK in 1953 for allegedly
encouraging his young companion Chris Craig (a minor)
to shoot a policeman
– The evidence that weighed against him was a confession
statement which he signed in police custody; at the trial he
claimed that the police had ‘helped’ him to produce it
• The case was re-opened in 1993, 40 years after Derek
was hanged
– Malcolm Coulthard, a forensic linguist, was commissioned
by Bentley’s family to examine the confession as part of an
appeal to get a posthumous pardon for Derek
The Derek Bentley case
• The appeal was initially rejected by the Home
Secretary
• In 1998, another court of appeal overturned
the original conviction and found Derek
Bentley innocent
• In 1999 the Home Secretary awarded
compensation to the Bentley family
The Derek Bentley case
• In Bentley’s confession, the word then was unusually
frequent
– It occurred 10 times in his 582-word confession statement,
ranking as the 8th most frequent word in the statement
– It ranked 58th in a corpus of spoken English, and 83rd in the Bank
of English (on average once every 500 words)
• Six witness statements
– 3 made by other witnesses: then occurs just once in 980 words
– 3 by police officers, including two involved in the Bentley case:
then occurs 29 times – once in every 78 words!
The Derek Bentley case
• The position of then
– Subject + then (e.g. I then, Chris then) was
unusually frequent in Bentley’s confession
• I then occurs three times (once every 190 words)
• In a 1.5-million-word corpus of spoken English, the sequence
occurs just nine times (once every 165,000 words)
• No instance of I then was found in ordinary witness
statements
• Nine occurrences were found in the police statement
• In the spoken BoE, then I was 10 times as frequent as I then
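The comparisons on the last two slides rest on simple relative frequencies. A minimal sketch of that arithmetic, using only the counts quoted above (the function names are illustrative, not part of any tool used in the case):

```python
def words_per_occurrence(occurrences, total_words):
    # Average number of running words per single occurrence of an item.
    return total_words / occurrences


def per_thousand(occurrences, total_words):
    # Normalised frequency: occurrences per 1,000 running words.
    return 1000 * occurrences / total_words


# (occurrences, text size in words), as quoted on the slides
samples = {
    "'then' in Bentley's confession": (10, 582),
    "'then' in the other witnesses' statements": (1, 980),
    "'I then' in Bentley's confession": (3, 582),
    "'I then' in the 1.5-million-word spoken corpus": (9, 1_500_000),
}

for label, (n, size) in samples.items():
    print(f"{label}: once every {words_per_occurrence(n, size):,.0f} words, "
          f"{per_thousand(n, size):.3f} per 1,000 words")
```

Normalising in this way is what makes a 582-word statement comparable with a corpus of millions of words; the slide figures (e.g. "once every 190 words") are rounded versions of these ratios.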
The Derek Bentley case
• The sequence subject + then was
characteristic of the police statement
• Although the police denied Bentley’s claim
and said that the statement was a verbatim
record of what Bentley had actually said, the
unusual frequency of then and its abnormal
position could be taken to be indicative of
some intrusion of the policemen’s register in
the statement
Culpeper (2002)
• Culpeper, Jonathan (2002) Computers,
language and characterisation: An analysis of
six characters in Romeo and Juliet. In U.
Melander-Marttala, C. Östman and Merja Kytö
(eds.), Conversation in Life and in Literature.
Uppsala: Universitetstryckeriet, pp.11-30.
– www.lexically.net/wordsmith/corpus_linguistics_links/Keywords-Culpeper.pdf
Aim of Culpeper (2002)
• ‘The broad aim of this paper is to show how the
study of an important area within “stylistics”, namely
characterisation, can benefit from an empirical
approach, specifically, a methodology for identifying
what might be the “key” words of a text …
Such an approach can reveal significant lexical and
grammatical patterns without reliance on
speculations about what the relevant dimensions
are’ (Culpeper 2002: 12)
Keywords vs. style-markers
• Enkvist (1964: 29)
– ‘Style is concerned with frequencies of linguistic items in a given
context, and thus with contextual probabilities.’
– ‘To measure the style of a passage, the frequencies of its linguistic items
[…] must be compared with the corresponding features in another text
or corpus which is regarded as a norm and which has a definite
relationship with this passage.’
• Style as a matter of ‘frequencies’, ‘probabilities’ and ‘norms’
– ‘We may […] define style markers as those linguistic items that only
appear, or are most or least frequent in, one group of contexts. In other
words, style markers are contextually bound linguistic elements…’ (ibid.
34-5)
– ‘Elements that are not style markers are stylistically neutral.’ (ibid. 35)
• ‘Style-markers…are words whose frequencies differ significantly
from their frequencies in a norm’ (Culpeper 2002: 13)
– Keywords (positive and negative)
Preparing the text
• Problem 1: Which text to use … original version or
modern version?
– Culpeper opted for a modern edition (to get round
problem of spelling variation: sweet vs. sweete, etc.)
• Problem 2: Shakespeare plays are full of dialogue
– How can we get the tool to distinguish between different
characters?
– Culpeper used a simple tagging scheme, e.g.
<ROM>…<\ROM>
<JUL>…<\JUL>
Who is worth concentrating on …?
• Culpeper chose his characters based on the number of words that they “spoke”

Character        Total no. of words spoken
Romeo            5031
Juliet           4564
Friar Lawrence   2901
Nurse            2369
Capulet          2292
Mercutio         2254
Benvolio         1293
Choosing a reference corpus
• Culpeper opted to make 6 reference corpora – one for
each character, e.g.
– RC for Romeo = whole play minus Romeo’s contributions
– RC for Juliet = whole play minus Juliet’s contributions
– RC for Nurse = whole play minus Nurse’s contributions
– …
• Why use a reference corpus of the same play?
– ‘Characters are partly shaped by their context. Thus, it makes
little sense to compare, say, the characters of Romeo and Juliet
with the characters of Macbeth or Anthony and Cleopatra, since
the fictional worlds of Italy, Scotland and Egypt provide very
different contextual influences. Furthermore, characters, like
people, are partly perceived in terms of whom they interact
with …’ (Culpeper 2002: 16)
Alternative reference corpora …?
• Scott and Tribble (2006) have compared Romeo and Juliet
against
– The Complete Works of Shakespeare
– Plays only
– Tragedies only
– The BNC
• Interestingly … they found that
– A ‘robust core’ of keywords occurs whichever reference corpus is
used. These include personal and place names like “Benvolio”,
“Romeo”, “Juliet” and “Mantua” but also terms like “banished”,
“county”, “love” and “night”
• In contrast to Scott and Tribble (2006), Culpeper (2002) found
that his results were more meaningful - in terms of
characterisation - when using the other Romeo and Juliet
characters (minus the target character) as a reference corpus
Making wordlists for each character
• Making the characters’ word lists
– Involves telling Wordsmith to only include <…> … <\…>
– Procedure …
• Wordlist – Settings – Wordlist specific – Tags – Only part of
file – Sections to keep – [specifying start/end tags]
• Making the reference corpora
– Involves telling Wordsmith to exclude anything between
<…> … <\…>
– Procedure …
• Wordlist – Settings – Wordlist specific – Tags – Only part of
file – Sections to cut out – [specifying start/end tags]
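Outside WordSmith, the same "sections to keep" / "sections to cut out" logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_by_tag` and `wordlist` are hypothetical, the tags follow the `<ROM>…<\ROM>` scheme shown earlier, and the tokenisation is deliberately crude.

```python
import re
from collections import Counter


def split_by_tag(text, tag):
    # Match <TAG> ... <\TAG>; the closing tag uses a backslash, as in the scheme above.
    pattern = re.compile(
        "<" + re.escape(tag) + ">(.*?)<" + re.escape("\\") + re.escape(tag) + ">",
        re.DOTALL,
    )
    kept = " ".join(m.group(1) for m in pattern.finditer(text))  # sections to keep
    cut = pattern.sub(" ", text)                                 # sections cut out
    return kept, cut


def wordlist(text):
    # Drop any remaining markup, then count lower-cased word forms.
    text = re.sub(r"<[^>]*>", " ", text)
    return Counter(re.findall(r"[a-z']+", text.lower()))


sample = (r"<ROM> Is the day so young? <\ROM> "
          r"<BEN> But new struck nine. <\BEN> "
          r"<ROM> Ay me! sad hours seem long. <\ROM>")

target_text, reference_text = split_by_tag(sample, "ROM")
romeo_tc = wordlist(target_text)     # Romeo's own words
romeo_rc = wordlist(reference_text)  # everything except Romeo's words
print(sum(romeo_tc.values()), sum(romeo_rc.values()))
```

The design mirrors the slide: one pass keeps only the tagged sections (the character's wordlist), the other removes them (the reference corpus for that character).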
Top 10 on wordlists (frequency)
Rank  ROMEO  JULIET  CAPULET  NURSE  MERCUTIO  FRIAR L  PLAY  SPOKEN  WRITTEN
1     AND    I       TO       I      A         AND      AND   THE     THE
2     I      TO      YOU      A      THE       THE      THE   I       OF
3     THE    AND     AND      AND    OF        TO       I     YOU     AND
4     TO     MY      A        THE    AND       IN       TO    AND     A
5     MY     THE     MY       YOU    TO        THY      A     IT      IN
6     THAT   THAT    I        TO     THAT      THOU     OF    A       TO (INF)
7     A      THOU    IS       IT     I         OF       MY    'S      IS
8     OF     IS      THE      IS     IS        IS       THAT  TO      TO (PR)
9     ME     A       HER      MY     IN        THAT     IS    OF      WAS
10    IN     BE      NOT      O      THOU      A        IN    THAT    IT
(SPOKEN, WRITTEN = present-day spoken and written English)
Q: Do they tell us anything interesting/worthwhile and, if so, what?
Positive keywords for the six characters
Romeo: Beauty, Blessed, Love, Eyes, More, Mine, Rich, Dear, Yonder, Farewell, Me, Sick, Lips, Stars, Fair, Thine, Hand, Banished
Juliet: If, Or, Sweet, Be, News, My, Night, I, Would, Yet, Thou, Words, Name, Nurse, Tybalt, Send, Husband, That, Swear
Capulet: Go, Wife, Thank, Ha, You, Thursday, Her, Child, Welcome, We, Haste, Gentlemen, Tis, Our, Make, Now, Daughter, Well
Nurse: Day, He’s, You, Quoth, Woeful, God, Warrant, Madam, Lord, Lady, Hie, It, Your, Faith, Said, Ay, She, About, Ever, Sir, Marry, Ah, Fall, Well
Mercutio: A, Hare, Very, Of, He, The, O’er
Friar L: Thy, From, Thyself, Mantua, Part, Heaven, Forth, Her, Alone, Time, Married, Letter
Q: What differences can you spot between the results here and the results on the previous table?
What key words can tell us about
characterisation …
• Romeo’s top three key words – ‘beauty’, ‘blessed’, ‘love’
• Expected? Surprising? … the lover of the play
– Other keywords related to ‘love talk’ = ‘dear’, ‘stars’, ‘fair’
– Keywords relating to body parts – ‘eyes’, ‘lips’, ‘hand’ – obsessed
with the physical?
• Juliet’s top key words – ‘if’, ‘or’, ‘be’, ‘yet’, ‘would’ (conditional
+ modals)
– Reflecting her state of mind – anxiety and uncertainty?
• Capulet’s most ‘key’ key word – ‘go’
– Context reveals that mostly used as an imperative command …
Capulet as head of the household to direct other people (see
also ‘make’ and ‘haste’), e.g.
• Go wake Juliet, go and trim her up…
• Nurse’s keywords are surge features (i.e. reflecting outbursts
of emotion) – ‘god’, ‘warrant’, ‘woeful’, ‘faith’, ‘marry’, ‘ah’
Negative key words for the six characters
Romeo: You, Romeo, He, Go, Her
Juliet: The, You, And, Go
Capulet: Thou, That, The, Of, And, With
Nurse: Thou
Mercutio: My, I, What
Friar L: I, You, A, Have, My
IMPORTANT
These represent words that are used unusually infrequently
(statistically speaking) by these characters.
Do you notice anything interesting?
Use of Pronouns within Romeo and Juliet
Character   POS                         NEG
Juliet      MY, I, THOU                 YOU
Romeo       ME, MINE, THINE             YOU, HE
Capulet     YOU, WE, TIS, OUR           THOU
Nurse       HE’S, YOU, IT, YOUR, SHE    THOU
Mercutio    HE                          MY, I
Friar L     THY, THYSELF                I, YOU, MY

• Romeo and Juliet use first and second person pronouns
– Expected? - “at the heart of the social interaction in the play”
• But compare Romeo’s use of ‘me/mine’ with Juliet’s use of ‘I’ …
– Culpeper’s (2002) conclusion: ‘Juliet spends much time in the play bearing her
soul … whereas Romeo is much more conscious of his own role as a lover and
of the effect of the circumstances upon him’ (ibid: 24)
• What about Capulet? – “you”, “we”, “our”, why?
• Thou-forms vs. you-forms to be covered
Culpeper’s Conclusion (2002: 27)
• “In some cases, my analysis provided solid evidence for
what one might have guessed (e.g. Romeo’s keywords
‘beauty’ and ‘love’) …”
• “… in others, it revealed what I think would be very
difficult to guess but fits well a possible interpretation
(e.g. Juliet’s keywords ‘if’ and ‘yet’).”
• “… keywords analysis also offers a way into analysing
function words, such as pronouns, and accounting for
their contribution to style and meaning”
What should we take note of …?
• How he was able to come to his conclusions
– The importance of having the right reference
corpus
– The need to use mark-up (as a means of
identifying the different characters)
– Knowing how to use Wordsmith …
• To make the different wordlists
• To make the keyword lists
Any potential weaknesses …
• It did not lemmatize the word forms, so that, for
example, ‘loves’ was counted separately rather than
as part of the word count of ‘love’ (Culpeper 2002: 27)
• Contractions (e.g. I’ll) would also have been counted
separately
• Key word analysis …
– makes us focus on ‘statistical deviations from a relative
norm, and ignores the significance of relatively infrequent
deviations from absolute norms’ (i.e. what your given texts
may have in common)
– ignores one-off occurrences of words
Now it’s your turn…
Duplicating Culpeper (2002)
The Romeo text
• Download the “Oxford Shakespeare” version of
Romeo and Juliet
– http://www.bartleby.com/70/index38.html
– Local copy available
• Using tags to separate stage directions from
dialogues
– Did Culpeper do this?
• Tag words spoken by each character
• Alternatively, you can use a local version I have
prepared
Sample of tagged text
<Exeunt MONTAGUE and LADY. ROMEO. >
<Ben.> Good morrow, cousin. <\Ben.>
<Rom.> Is the day so young? <\Rom.>
<Ben.> But new struck nine. <\Ben.>
<Rom.> Ay me! sad hours seem long. Was that my father that went hence
so fast? <\Rom.>
<Ben.> It was. What sadness lengthens Romeo’s hours? <\Ben.>
<Rom.> Not having that, which having, makes them short. <\Rom.>
<Ben.> In love? <\Ben.>
<Rom.> Out— <\Rom.>
<Ben.> Of love? <\Ben.>
<Rom.> Out of her favour, where I am in love. <\Rom.>
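To see what the tagging buys us, the sample above can be parsed into speeches per speaker. The sketch below is an assumption-laden illustration, not the lab procedure itself: the function names are hypothetical, it assumes every speech is closed with a matching `<\Name.>` tag, and it treats any unpaired tag (such as the stage direction on the first line) as material to ignore.

```python
import re
from collections import defaultdict

# <Name.> speech <\Name.> : group 1 is the speaker tag, group 2 the speech.
# An unpaired tag (e.g. a stage direction) has no matching close and is skipped.
SPEECH = re.compile(r"<([^<>\\]+)>(.*?)<\\\1>", re.DOTALL)


def speeches_by_speaker(tagged):
    speeches = defaultdict(list)
    for speaker, speech in SPEECH.findall(tagged):
        speeches[speaker].append(speech.strip())
    return dict(speeches)


def running_words(speeches):
    # Crude token count: runs of letters, keeping apostrophes inside words.
    return sum(len(re.findall(r"[A-Za-z'’]+", s)) for s in speeches)


sample = r"""<Exeunt MONTAGUE and LADY. ROMEO. >
<Ben.> Good morrow, cousin. <\Ben.>
<Rom.> Is the day so young? <\Rom.>
<Ben.> But new struck nine. <\Ben.>
<Rom.> Ay me! sad hours seem long. <\Rom.>"""

by_speaker = speeches_by_speaker(sample)
for speaker, speeches in by_speaker.items():
    print(speaker, running_words(speeches))
```

Counts produced this way depend on the tokenisation rule, which is one reason running-word totals can differ between studies of the same text.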
Separating words by apostrophes
clear ‘ from this box and press OK
Making a wordlist for each character
• Start wordlist function
• Load the text
• Setting – Tags – Only part of File - “Sections to keep” – type in
the start/end tags given below
– Ignore <*> is default setting – ignore stage directions
• Make a wordlist for
– Romeo_TC (<Rom.>…<\Rom.>)
– Juliet_TC (<Jul.>…<\Jul.>)
– Capulet_TC (<Cap.>…<\Cap.>)
– Nurse_TC (<Nurse.>…<\Nurse.>)
– Mercutio_TC (<Mer.>…<\Mer.>)
– Friar_L_TC (<Fri._L.>…<\Fri._L.>)
Tag and markup
Only Part of file
Making a reference list for each character
• Setting – Tags – Only part of File - “Sections to cut
out” – type in the start/end tags given below
– Excluding what is said by the target character
• Make a wordlist for
– Romeo_RC (<Rom.>…<\Rom.>)
– Juliet_RC (<Jul.>…<\Jul.>)
– Capulet_RC (<Cap.>…<\Cap.>)
– Nurse_RC (<Nurse.>…<\Nurse.>)
– Mercutio_RC (<Mer.>…<\Mer.>)
– Friar_L_RC (<Fri._L.>…<\Fri._L.>)
Running words
Character        In our file   Culpeper (2002)
Romeo            4842          5031
Juliet           4438          4564
Friar Lawrence   2860          2901
Capulet          2282          2292
Nurse            2250          2369
Mercutio         2169          2254
Discrepancies: Some explanations
• Different tagging
– We ignored stage directions
– We tried what Culpeper (2002) suggested at the end of his
paper, treating contracted words such as “I’ll” as two
words
• A potential problem of this approach with Shakespearean texts
– danc’d, disturb’d, and rais’d etc all became two words!
– Is there a need to annotate the text?
• Not done here or in Culpeper (2002), but worth the effort
– the city’s side
– let’s away
– Where’s this girl?
• Want to have a try?
– http://ucrel.lancs.ac.uk/claws/trial.html
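The effect of the apostrophe setting on word counts can be illustrated with a small sketch. The `tokenize` function and the test line are made up for illustration (the line simply combines forms mentioned above); it does not reproduce WordSmith's own tokeniser.

```python
import re


def tokenize(text, split_on_apostrophe):
    # Normalise curly apostrophes so both spellings are handled alike.
    text = text.lower().replace("’", "'")
    if split_on_apostrophe:
        pattern = r"[a-z]+"               # the apostrophe acts as a word separator
    else:
        pattern = r"[a-z]+(?:'[a-z]+)*"   # the apostrophe stays inside the word
    return re.findall(pattern, text)


line = "I'll say she danc'd; where's this girl?"  # invented test line
as_one_word = tokenize(line, split_on_apostrophe=False)
as_two_words = tokenize(line, split_on_apostrophe=True)

print(len(as_one_word), as_one_word)
print(len(as_two_words), as_two_words)
```

With the separator setting, "I'll" splits as intended, but "danc'd" also becomes "danc" plus "d", which is the side effect noted on the slide and a source of stray items such as "d" in the wordlists.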
Top 10 on wordlists
[Screenshots: top-10 wordlists for Romeo, Juliet, Capulet, Nurse, Mercutio, Friar L and the whole play]
Keyword settings
Selected statistic formula
Cutoff p value
Min. Frequency
Making a keyword list per character
• Romeo_kw
– Romeo_TC + Romeo_RC
• Juliet_kw
– Juliet_TC + Juliet_RC
• Capulet_kw
– Capulet_TC + Capulet_RC
• Nurse_kw
– Nurse_TC + Nurse_RC
• Mercutio_kw
– Mercutio_TC + Mercutio_RC
• Friar_L_kw
– Friar_L_TC + Friar_L_RC
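Each keyword list compares a target wordlist (TC) with its reference wordlist (RC) using the statistic, p-value cutoff and minimum frequency chosen in the settings. The sketch below assumes the log-likelihood statistic, one commonly used keyness measure; the slides do not say which formula or thresholds were selected, so the default `cutoff=3.84` (p < 0.05 at one degree of freedom), `min_freq=2` and the toy frequencies are all illustrative assumptions.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    # a, b: frequency of the word in target and reference;
    # c, d: total running words in target and reference.
    e1 = c * (a + b) / (c + d)   # expected frequency in target
    e2 = d * (a + b) / (c + d)   # expected frequency in reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target, reference, min_freq=2, cutoff=3.84):
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        if a + b < min_freq:      # assumption: threshold on combined frequency
            continue
        ll = log_likelihood(a, b, c, d)
        if ll >= cutoff:
            # Positive keyword if relatively more frequent in the target.
            sign = 1 if a / c > b / d else -1
            scored.append((word, sign * ll))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy frequencies, invented for illustration only
toy_tc = Counter({"love": 10, "the": 20, "go": 1})
toy_rc = Counter({"love": 2, "the": 60, "go": 20})
toy_kw = dict(keywords(toy_tc, toy_rc))
print(toy_kw)
```

Words with a positive score are unusually frequent in the target (positive keywords); words with a negative score are unusually infrequent (negative keywords); words whose score falls below the cutoff, like "the" in the toy data, are left out.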
Romeo’s keywords by keyness
Positive keywords
– Aboutness: beauty, love, blessed, dream, joy, sin, kiss, death, poison, soul …
– Love talk: dear, farewell, stars
– Body parts: eyes, lips, hand
– Pronouns: mine, me, thine, thee, my
Negative keywords
– Himself: Romeo, he, him
– Both: you, we
– Movement: come, go, up
Juliet’s keywords by keyness
Positive keywords
– People in interaction: nurse, Romeo, sweet, husband, mother, father
– State of mind: if, or, be, yet, would
– Pronouns: my, I, thou
– Aboutness: news, words, night, swear, send, tongue, speak
– Why “nurse” and “husband”? (vocative function)
Negative keywords
– Herself: her
– Both: we, you
– Movement: here, go
You-forms vs. thou-forms
• You-forms vs. thou-forms
– Plural: ye, you, your, yours, yourself
– Singular: thou, thee, thy, thine, thyself
• You-forms vs. thou-forms (thou, thine, thee) – sociopragmatic implications
– Romeo and Juliet prefer thou-forms (positive) and avoid you-forms
(negative)
• High status social equals use you-forms
• You-forms are dispassionate and emotionally unmarked
• Thou-forms are strongly expressive: positive (affection and love) or
negative (anger and contempt) – intimacy, love talk
– Friar Laurence prefers thou-forms: He is engaged in intimate and
emotionally charged discourse
– Capulet and the Nurse prefer you-forms: used among social superiors,
or individuals of low status talking to people of high social status
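A quick way to profile a character's preference is to count the two sets of forms listed above. This is a minimal sketch; `pronoun_profile` is a hypothetical helper, the first test line is Juliet's well-known speech, and the second is an invented sentence used only to show the contrast.

```python
import re
from collections import Counter

YOU_FORMS = {"ye", "you", "your", "yours", "yourself"}
THOU_FORMS = {"thou", "thee", "thy", "thine", "thyself"}


def pronoun_profile(text):
    # Count you-forms and thou-forms among the lower-cased word tokens.
    counts = Counter({"you-forms": 0, "thou-forms": 0})
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in YOU_FORMS:
            counts["you-forms"] += 1
        elif token in THOU_FORMS:
            counts["thou-forms"] += 1
    return counts


juliet_line = ("O Romeo, Romeo! wherefore art thou Romeo? "
               "Deny thy father and refuse thy name")
invented_line = "I thank you, sir, for your pains"  # made-up example, not from the play

print(pronoun_profile(juliet_line))
print(pronoun_profile(invented_line))
```

Raw counts like these only show a tendency; whether a preference is statistically marked is what the keyword comparison against the reference corpus tests.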
Capulet’s keywords by keyness
Negative keywords
– Pronouns: thy, thou
– Others: the, of, that, etc. [full of actions, not a ‘nouny’ style]
Positive keywords
– Directions: go, haste, make, now, look (imperatives)
– Pronouns: you, we, her, our (directing and speaking on behalf of the household)
– etc…
[you vs. thou: imperative; less emotional]
Nurse’s keywords by keyness
Positive keywords
– Emotional: ay, ah, O, God, woeful, warrant, faith
– Pronouns: you, your, he, I
– Address terms: lady, madam, lord, sir
– Why “day”? - “O day! O day! O day! O hateful day!”
Negative keywords
– Pronouns: thou
– Why “’d”? Culpeper might have made the correct decision to treat contractions as one word?
Mercutio’s keywords by keyness
Positive keywords
– “Noun-y” style: a, of, the, an – akin to written, less interactive
Negative keywords
– Less interactive style:
lack of question word: what
lack of 1st person pronouns: I, my
Friar L’s keywords by keyness
Positive keywords
– Pronouns: thy, thyself, thou - involved in
intimate and emotionally charged discourse,
"emotional mirror"
– A man of the Church: heaven, from
(heaven)
– Roles he played in facilitating the plot:
Mantua, letter
Negative keywords
– Less emotional (than Nurse): O
– Pronouns: my, you, I
Planning your own study …
What should I do first …?
– Choose your data and/or tool
– Determine what interests you about the data
– Come up with some “hypotheses” that you’d like to test
out
• This can be data-driven (what seems to “jump out” at you from
your data)
• This can be theory-oriented (i.e. testing out something about the
language that’s taken for granted)