Stylistics – case study - Personal Webpages (The University of

Stylistics – case study
see last slide for websites used to get
numerical information from texts
Stylistic analysis
• Literary vs linguistic stylistics
– Lit crit focuses on effect on the reader,
intended or otherwise, so largely intuitive and
subjective
– Linguistic stylistics looking for
characterisations of style (including literary
style) in terms of linguistic phenomena at the
various levels of linguistic description
Stylistic analysis
• Inventory of linguistic devices and their effect
– usually in a contrastive way:
– in contrast with other texts of a similar genre
– in contrast with other genres
• Linguistic devices described in terms of the
usual linguistic levels of description: phonology,
morphology, lexis, grammar, etc.
Example
• Newspaper reporting of a similar story
• Sun vs Independent
– readership by social class
– Sun: widely read (c. 5m), mostly by lower
class and lower middle class
– Independent: circulation 0.25m, educated
middle class
• How would you expect this different
readership to be reflected in the styles?
Sun vs Independent
• Targeted readership largely dictates subject
matter and the angle of coverage
• From a purely linguistic point of view we might
expect differences in …
– vocabulary
– complexity of sentence structure
• Other differences might include literary
• But (compared to other texts) features of the
genre (newspaper story) may be shared
Independent: http://www.independent.co.uk/news/world/asia/hawker-family-make-new-plea-799964.html
Sun: http://www.thesun.co.uk/sol/homepage/news/justice/article952630.ece
Some differences
Sun: Family bid for Lindsay's killer
Independent: Hawker family make new plea

Sun: THE family of an English teacher murdered in Japan today appealed for her killer to be found – a year after her death.
Independent: The family of a young British teacher murdered in Japan were yesterday flying to Tokyo to launch a fresh appeal on the first anniversary of her death.

Sun: Lindsay Ann Hawker, 22, was found dead in a sand-filled bath on a balcony of a flat belonging to one of her students.
Independent: Lindsay Ann Hawker, 22, was found dead in a bath filled with sand on the balcony of a flat in Ichikawa, east of Tokyo, on March 27 last year.

Sun: Parents Bill and Julia and their daughters Lisa, 26, and Louise, 23 have flown to capital Tokyo to “get justice for Lindsay”.
Independent: Miss Hawker's parents, Bill and Julia, and her two sisters, Lisa and Louise, will leave London's Heathrow Airport this morning to travel to the Japanese capital to renew their appeal to find her killer.

Sun: A poster campaign aims to help catch suspect Tatsuya Ichihashi, 29, who fled from cops.
Independent: Detectives are still hunting 29-year-old suspect Tatsuya Ichihashi, who lived at the flat and fled when approached by officers for questioning.

Sun: More than 20,000 people joined a tribute page on website Facebook
Independent: A webpage set up on social networking site Facebook, called "Don't forget Lindsay Hawker, Please remember this Face", now has more than 20,000 members.
Some differences
• Differences of detail
– [Some are due to slightly different publication times, before or
after the press conference]
– What elements are of interest?
• Differences of vocabulary
– cops vs officers, dad vs father, year after vs anniversary
• Differences of explication
– capital of Japan, Facebook
• Differences of syntax
– surprisingly few
– but a possible stylistic trademark of the red top is the internal structure of
noun phrases …
Appositive noun phrases
Sun vs Independent:
• a sand-filled bath vs a bath filled with sand
• Parents Bill and Julia vs Miss Hawker's parents, Bill and Julia
• capital Tokyo vs Tokyo; the Japanese capital
• suspect Tatsuya Ichihashi, 29 vs 29-year-old suspect Tatsuya Ichihashi
• website Facebook vs [the] social networking site Facebook
Numerical comparison
• Thanks to computers it is now (relatively) easy to
count things
• What should we count?
– easy to count number of paragraphs, sentences,
words, letters
– may give a measure of complexity
• average sentence length (words/sentence)
• average word length
• percentage of long words
– type:token ratio (vocabulary richness)
• number of types = number of different words
• number of tokens = total number of words
• Hapax legomena = number of words that occur only once
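The counts listed above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for these slides: the function name `vocabulary_stats` and the tokenizing rule (a word is a run of letters, optionally joined by a hyphen or apostrophe, with digits ignored) are assumptions made for the example. The sample is one sentence from the Independent story.

```python
import re
from collections import Counter

def vocabulary_stats(text):
    """Return token, type and hapax counts for a text."""
    # Assumption: a word is a run of letters, optionally joined by
    # an apostrophe or hyphen; digits and punctuation are ignored.
    tokens = re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: words that occur exactly once in the text.
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapax_legomena": n_hapax,
        "average_word_length": sum(map(len, tokens)) / n_tokens if n_tokens else 0.0,
    }

sample = ("Lindsay Ann Hawker, 22, was found dead in a bath filled with sand "
          "on the balcony of a flat in Ichikawa, east of Tokyo, on March 27 last year.")
stats = vocabulary_stats(sample)
print(stats)
```

A different tokenizing rule (for instance, counting numbers as words) would give slightly different figures, which is one reason tools disagree.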
Normalization and significance
• Always important to compare like with like
– It is usual when counting things to “normalize” over
the length of the text
– If one text is longer than the other, of course you
would expect higher frequencies of everything
• Issue of statistical significance
– Small differences may not really tell you anything
– Various measures can confirm whether difference is
statistically significant or due to random fluctuation
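As a rough illustration of both points, the sketch below normalizes a raw count to a rate per 1,000 words and computes a Pearson chi-square statistic for a 2×2 table. The choice of chi-square (without continuity correction) is an assumption; the slides do not say which significance measure is meant, and the test treats words as independent observations, which is only approximately true of running text. The function names are hypothetical. The figures are the complex-word counts and word totals from the Numerical comparison slide.

```python
def per_thousand(count, total_words):
    """Normalize a raw count to a rate per 1,000 words."""
    return 1000.0 * count / total_words

def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-square statistic (no continuity correction) for a feature
    seen count_a times in total_a words vs count_b times in total_b words."""
    a, b = count_a, total_a - count_a
    c, d = count_b, total_b - count_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

# Complex words: Sun 19 of 262 words, Indy 36 of 257 words.
sun_rate = per_thousand(19, 262)
indy_rate = per_thousand(36, 257)
chi2 = chi_square_2x2(19, 262, 36, 257)
# With 1 degree of freedom, the conventional 5% critical value is 3.841.
print(round(sun_rate, 1), round(indy_rate, 1), round(chi2, 2))
```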
How to count
• How to recognize paragraph breaks?
• How to recognize sentence breaks?
– Headlines don’t end in a full stop
– Not all sentences end in a full stop
– Not all full stops are sentence ending (abbreviations)
• How to count words
– Hyphenated words, contractions e.g. don’t
• How to measure word-length/complexity
– length only roughly corresponds to complexity
– number of characters vs number of syllables
– cf. through vs idea
– counting syllables implies either a dictionary or an algorithm
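The decisions above can be made concrete in a short sketch. Everything here is an assumption chosen for illustration: the abbreviation list is deliberately tiny, hyphenated words and contractions are counted as one word each, and syllables are estimated by counting vowel groups rather than by dictionary lookup. Headlines with no final punctuation are not handled.

```python
import re

# Assumption: a small illustrative list; a real tool needs a much longer one.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "e.g", "i.e", "etc"}

def split_sentences(text):
    """Split after . ! or ? plus whitespace, unless the stop ends an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        preceding = text[start:match.start()].split()
        last_word = preceding[-1].lower() if preceding else ""
        if match.group().startswith(".") and last_word in ABBREVIATIONS:
            continue  # full stop belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def count_words(text):
    """Count words; hyphenated words and contractions count as one word each."""
    return len(re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text))

def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a silent final e."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

print(split_sentences("Dr. Smith arrived. He left."))
print(count_words("Don't use the sand-filled bath"))
print(count_syllables("through"), count_syllables("idea"))
```

The heuristic gets "through" right (one syllable) but gives two for "idea", which has three. That kind of error is why a pronunciation dictionary is the more reliable option.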
Numerical comparison
                                 Sun        Indy
sentences                        13         10
words                            262        257
letters/numbers                  1166       1213
complex words                    19 (7%)    36 (14%)
syllables                        356        378
av’ge word length (characters)   4.45       4.72
av’ge word length (syllables)    1.36       1.47
av’ge sentence length (words)    20.15      25.7
short sentences                  6 (42%)    4 (40%)
long sentences                   2 (14%)    1 (10%)
types                            156        165
type-token ratio                 0.60       0.62
Hapax legomena                   110        128
• texts are roughly the same length
• Hard to know if any differences are statistically significant with such a small amount of data, but …
• Indy does have more complex words …
• and higher AWL and ASL …
• and higher ratio of short:long sentences …
• and richer vocabulary
Word length
• Comparison of distribution of words by length only tells us that the two texts are very similar
• correlation ρ = 0.977
[Chart: number of words (total, 0 to 60) by word length (1 to 11) for Indy and Sun]
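A comparison of this kind can be sketched as follows. The slides do not say which correlation coefficient ρ refers to; this example assumes the Pearson product-moment correlation, measures word length in characters, and puts words longer than 11 characters in the last bin. The function names are hypothetical, and the two samples are single sentences from the stories, so the result is not the figure quoted above.

```python
import re
from collections import Counter
from math import sqrt

def length_distribution(text, max_length=11):
    """Count words of each length 1..max_length (longer words go in the last bin)."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_length) for w in words)
    return [counts.get(n, 0) for n in range(1, max_length + 1)]

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant distribution")
    return cov / sqrt(var_x * var_y)

sun = "A poster campaign aims to help catch suspect Tatsuya Ichihashi, 29, who fled from cops."
indy = ("Detectives are still hunting 29-year-old suspect Tatsuya Ichihashi, "
        "who lived at the flat and fled when approached by officers for questioning.")
rho = pearson(length_distribution(sun), length_distribution(indy))
print(round(rho, 3))
```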
Syntactic information
                                 Sun        Indy
questions                        0          0
passives                         8 (57%)    6 (60%)
longest sentence                 33 words   43 words
shortest sentence                4 words    7 words
use of verb to be                8          8
use of auxiliary                 1          3
conjunctions                     3 (8%)     4 (10%)
pronouns                         7 (19%)    4 (11%)
prepositions                     13 (34%)   14 (38%)
nominalizations                  0          1 (2%)
Sentence beginnings: pronouns    6          1
                     article     1          4
                     conjunction 0          0
                     preposition 0          0
• Again, hard to know if differences are significant
• This kind of measure more useful to distinguish different genres
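One of these measures, the word class that opens each sentence, can be approximated without a parser. The sketch below is only an illustration: the word lists are small and hypothetical, several words are ambiguous between classes, and a real analysis would use a part-of-speech tagger. The sample sentences are invented for the example, not taken from either newspaper.

```python
from collections import Counter

# Assumption: small illustrative closed-class lists, not a full tagger.
OPENER_CLASSES = {
    "pronoun": {"he", "she", "it", "they", "we", "i", "you"},
    "article": {"a", "an", "the"},
    "conjunction": {"and", "but", "or"},
    "preposition": {"in", "on", "at", "by", "from", "with", "after", "before"},
}

def classify_openers(sentences):
    """Count sentences by the class of their first word ('other' if unlisted)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # Strip opening quotes and brackets before looking the word up.
        first = words[0].strip("\"'“”‘’(").lower()
        label = next(
            (name for name, members in OPENER_CLASSES.items() if first in members),
            "other",
        )
        counts[label] += 1
    return counts

# Hypothetical sample sentences.
samples = [
    "The family appealed today.",
    "They flew to Tokyo.",
    "In Tokyo, posters went up.",
    "But police found nothing.",
    "Detectives are still hunting.",
]
openers = classify_openers(samples)
print(dict(openers))
```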
Readability
• Big interest from teachers, publishers and
researchers in quantifying the appropriate
reading age for a text
– i.e. what level of education do you need to
understand this text? (reader-oriented view)
– or: for what age of readership is this text appropriate
(text-oriented view)
• Most measures based on combination of
average word length (measured in characters or
syllables), and average sentence length
– some additionally take into account proportion of
long/short words
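Readability indexes
$$\text{Kincaid} = 11.8 \times \frac{\text{syllables}}{\text{words}} + 0.39 \times \frac{\text{words}}{\text{sentences}} - 15.59$$
$$\text{ARI} = 4.71 \times \frac{\text{characters}}{\text{words}} + 0.5 \times \frac{\text{words}}{\text{sentences}} - 21.43$$
$$\text{CLI} = 5.89 \times \frac{\text{characters}}{\text{words}} - 0.3 \times \frac{\text{sentences}}{100\ \text{words}} - 15.8$$
$$\text{Flesch} = 206.835 - 84.6 \times \frac{\text{syllables}}{\text{words}} - 1.015 \times \frac{\text{words}}{\text{sentences}}$$
$$\text{FOG} = 0.4 \times \left( \frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{words} \geq 3\ \text{syllables}}{\text{words}} \right)$$
$$\text{Lix} = \frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{words} \geq 6\ \text{chars}}{\text{words}}$$
$$\text{SMOG} = 1.0430 \times \sqrt{30 \times \frac{\text{words} \geq 3\ \text{syllables}}{\text{sentences}}} + 3.1291$$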
Readability indexes
• Most give a (US) school grade:
– Kincaid – best for technical material; short sentences, eg in dialogues,
will lower the score: gives a grade level
– ARI (Automated Readability Index)
– Coleman-Liau – counts characters rather than syllables, so easier to
implement
– SMOG (Simple Measure of Gobbledygook) (McLaughlin 1969) – can be
estimated by sampling e.g. three 10-sentence segments; said to give best
correlation with its criterion. See http://www.harrymclaughlin.com/SMOG.htm
– FOG (Gunning 1952) – gives a school grade. Score >12 means “too
hard to read”!
• A few give a raw score:
– Flesch Reading Ease (here labelled Flesch-Kincaid) – widely used, simple calculation; the higher the score,
the easier it is to read. Highest possible score is 121 (text made up of
one-word one-syllable sentences). Score around 100 means OK for an 11-year-old.
Time magazine ~52, Harvard Law Review low 30s.
– Lix (Björnsson) – originally developed for Swedish, raw score <24
suitable for children, >55 very hard.
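As a sketch of how the indexes are computed from raw counts, the function below applies the standard coefficients. Three readings are assumptions: 84.6 is taken as the standard Flesch syllable coefficient, the Coleman-Liau sentence term is read as sentences per 100 words, and "complex" and "long" words are taken to mean at least 3 syllables and at least 6 characters. The function name is hypothetical. The example uses the Sun counts from the Numerical comparison slide; no long-word count is given there, so Lix is omitted.

```python
from math import sqrt

def readability(sentences, words, characters, syllables, complex_words, long_words=None):
    """Readability indexes from raw counts.

    complex_words: words of 3 or more syllables.
    long_words: words of 6 or more characters (Lix is skipped when None).
    """
    words_per_sentence = words / sentences
    scores = {
        "Kincaid": 11.8 * syllables / words + 0.39 * words_per_sentence - 15.59,
        "ARI": 4.71 * characters / words + 0.5 * words_per_sentence - 21.43,
        # Assumption: the sentence term is sentences per 100 words.
        "Coleman-Liau": 5.89 * characters / words - 0.3 * (100 * sentences / words) - 15.8,
        "Flesch": 206.835 - 84.6 * syllables / words - 1.015 * words_per_sentence,
        "FOG": 0.4 * (words_per_sentence + 100 * complex_words / words),
        "SMOG": 1.0430 * sqrt(30 * complex_words / sentences) + 3.1291,
    }
    if long_words is not None:
        scores["Lix"] = words_per_sentence + 100 * long_words / words
    return scores

# Sun: 13 sentences, 262 words, 1166 characters, 356 syllables, 19 complex words.
sun_scores = readability(13, 262, 1166, 356, 19)
for name, value in sun_scores.items():
    print(f"{name}: {value:.1f}")
```

Scores computed this way need not match a given online tool exactly, since tools differ in how they count sentences, words and syllables.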
Readability
                 Sun     Indy
Kincaid          9       11.1
ARI              10.4    13.3
Coleman-Liau     9       11.1
Flesch-Kincaid   69.7    62.8
Gunning FOG      11.6    13.8
Lix              38.9    46.1
SMOG             10.4    13.1
Conversion: add 1 to US grade to give British school year, e.g. 11th grade = year 12
Note: with Flesch-Kincaid, lower score means harder to read
http://www.editcentral.com/gwt/com.editcentral.EC/EC.html – also suggests where improvements can be made!
Also used (give slightly different figures, probably depending on how they count things):
http://www.readability.info/
http://www.online-utility.org/english/readability_test_and_improve.jsp