Computational Extraction of Social and
Interactional Meaning
SSLST, Summer 2011
Dan Jurafsky
Lecture 4: Sarcasm, Alzheimer's, and Distributional Semantics
Why Sarcasm Detection
 Sarcasm causes problems in sentiment analysis
 One task: Summarization of Reviews
 Identify features (size/weight, zoom, battery life, pic quality…)
 Identify sentiment and polarity of sentiment for each feature
(great battery life, insufficient zoom, distortion close to
boundaries)
 Average the sentiment for each feature
“Perfect size, fits great in your pocket”
+ “got to love this pocket size camera, you just need a
porter to carry it for you”
= ?!
slide adapted from Tsur, Davidov, Rappoport 2010
What is sarcasm
 “Sarcasm is verbal irony that expresses
negative and critical attitudes toward
persons or events” (Kreuz and Glucksberg, 1989)
 “a form of irony that attacks a person or belief through
harsh and bitter remarks that often mean the opposite
of what they say”
 “The activity of saying or writing the opposite of what
you mean in a way intended to make someone else feel
stupid or show them that you are angry.”
Examples from Tsur et al.
 “Great for insomniacs.” (book)
 “Just read the book.” (book/movie review)
 “thank you Janet Jackson for yet another year of Super
Bowl classic rock!”
 “Great idea, now try again with a real product
development team.” (e-reader)
 “make sure to keep the purchase receipt” (smart
phone)
slide adapted from Tsur, Davidov, Rappoport 2010
Examples
 “Wow GPRS data speeds are blazing fast.” (over-exaggeration)
 @UserName That must suck. I can't express how much I love shopping on Black Friday.
 @UserName that's what I love about Miami.
Attention to detail in preserving historic
landmarks of the past.
Negative meaning expressed with positive words
 “[I] Love The Cover” (book)
“don’t judge a book by its cover”
Twitter #sarcasm issues
 Problems: (Davidov et al 2010)
 Used infrequently
 Used in non-sarcastic cases, e.g. to clarify a previous
tweet (it was #Sarcasm)
 Used when sarcasm is otherwise ambiguous (prosody
surrogate?) – biased towards the most difficult cases
 GMW11 (González-Ibáñez, Muresan & Wacholder 2011) argue that the non-sarcastic
cases can be filtered out by keeping only tweets with #sarcasm at the end
Davidov, Tsur, Rappoport 2010: Data
 Twitter: 5.9M tweets, unconstrained
context
 Amazon: 66k reviews, known
product context
 Books (fiction, non-fiction, children's)
 Electronics (mp3 players, digital cameras, mobile
phones, GPS devices, …)
 Mechanical Turk annotation
 K= 0.34 on Amazon, K = 0.41 on
Twitter
Amazon review of Kindle e-reader
[screenshot of an Amazon Kindle review omitted]
slide adapted from Tsur, Davidov, Rappoport 2010
Twitter
 140 characters
 Free text
 Lacking context
 Hashtags: #hashtag
 URL addresses
 References to other users: @user
slide adapted from Tsur, Davidov, Rappoport 2010
Star Sentiment Baseline (Amazon)
 “Saying or writing the opposite of what you
mean...”
 Identify unhappy reviewers (1-2 stars)
 Identify extremely-positive sentiment
words (Best, exciting, top, great, …)
 Classify these sentences as sarcastic
slide adapted from Tsur, Davidov, Rappoport 2010
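A minimal sketch of this baseline, assuming a hand-picked list of strongly positive words and naive sentence splitting (both are illustrative choices, not the authors' resources):

```python
# Star-sentiment baseline sketch: within 1-2 star reviews, flag sentences that
# contain extremely positive words.  Word list and splitting are toy choices.
STRONG_POSITIVE = {"best", "exciting", "top", "great", "perfect", "love"}

def star_sentiment_baseline(review_text, star_rating):
    """Return the sentences this baseline would call sarcastic."""
    if star_rating > 2:                      # only unhappy reviewers considered
        return []
    sarcastic = []
    for sentence in review_text.split("."):
        words = {w.strip('!?,"').lower() for w in sentence.split()}
        if words & STRONG_POSITIVE:          # very positive wording + low rating
            sarcastic.append(sentence.strip())
    return sarcastic

print(star_sentiment_baseline("Great for insomniacs. The plot never moves.", 1))
```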
SASI: Semi-supervised Algorithm
for Sarcasm Identification
 Small seed of sarcastic-tagged
sentences. Tags 1,...,5:
 1: not sarcastic at all
 5: clearly sarcastic
slide adapted from Tsur, Davidov, Rappoport 2010
SASI:
 Extract features from all training sentences.
 Represent training sentences in a feature vector
space.
 Features:
 Pattern based features
 Punctuation based features
 Given a new sentence: use weighted-kNN to
classify it.
 Majority vote (over k>0)
slide adapted from Tsur, Davidov, Rappoport 2010
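A rough sketch of the classification step, assuming sentences have already been mapped to feature vectors. This is not the SASI code, and it uses a distance-weighted average of the neighbours' 1-5 labels rather than a strict majority vote:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_knn_label(x, training, k=5):
    """training: list of (feature_vector, label) pairs; label in 1..5."""
    neighbours = sorted(training, key=lambda item: euclidean(x, item[0]))[:k]
    # closer neighbours get larger weights; small constant avoids division by 0
    weights = [1.0 / (euclidean(x, fv) + 1e-6) for fv, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)

seed = [([1.0, 0.1, 0.0], 5), ([0.9, 0.0, 0.1], 4), ([0.0, 0.0, 1.0], 1)]
print(round(weighted_knn_label([0.8, 0.1, 0.0], seed, k=2), 2))
```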
Hashtag classifier
 #sarcasm hashtag
 Not very common
 Use this tag as a label for supervised
learning.
slide adapted from Tsur, Davidov, Rappoport 2010
Preprocessing
 [author], [title], [product], [company]
 [url], [usr], [hashtag]
 “Silly me, the Kindle and the Sony eBook can’t read
these protected formats. Great!”
 “Silly me, the Kindle and the [company] [product]
can’t read these protected formats. Great!”
slide adapted from Tsur, Davidov, Rappoport 2010
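A sketch of this normalization step; the name tables and regular expressions below are hypothetical stand-ins for whatever resources the authors actually used:

```python
import re

# toy lookup tables (assumptions for illustration)
PRODUCTS = {"Kindle": "[product]", "eBook": "[product]"}
COMPANIES = {"Sony": "[company]", "Amazon": "[company]"}

def preprocess(sentence):
    for name, tag in {**PRODUCTS, **COMPANIES}.items():
        sentence = sentence.replace(name, tag)
    sentence = re.sub(r"https?://\S+", "[url]", sentence)   # URLs
    sentence = re.sub(r"@\w+", "[usr]", sentence)           # @user references
    sentence = re.sub(r"#\w+", "[hashtag]", sentence)       # hashtags
    return sentence

print(preprocess("Silly me, the Sony eBook can't read these protected formats. Great!"))
```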
Pattern based features
 Davidov & Rappoport 2006, 2008
 High Frequency Words (HFW): relative corpus frequency > 0.0001
 Content Words (CW): relative corpus frequency < 0.001
 Pattern: ordered sequence of high frequency
words and slots for content words
 Restrictions:
 2-6 HFW
 1-5 slots for CW
 Minimal pattern: [HFW] [CW slot] [HFW]
slide adapted from Tsur, Davidov, Rappoport 2010
Pattern extraction from the training (seed)
“Garmin apparently does not care much about product
quality or customer support”
 [company] CW does not CW much
 does not CW much about CW CW or
 not CW much
 about CW CW or CW CW.
slide adapted from Tsur, Davidov, Rappoport 2010
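A toy approximation of the extraction step: abstract each token to an HFW or a CW slot and enumerate short spans. The small HFW set and the span-enumeration shortcut are assumptions; the paper's procedure is more constrained (2-6 HFWs, 1-5 CW slots):

```python
HFW = {"does", "not", "much", "about", "or", "the", "a", "[company]", "[product]"}

def to_pattern(tokens):
    return " ".join(t if t in HFW else "CW" for t in tokens)

def extract_patterns(sentence, min_len=3, max_len=6):
    """Candidate patterns: spans of 3-6 tokens with >=2 HFWs and >=1 CW slot."""
    tokens = sentence.lower().split()
    patterns = set()
    for i in range(len(tokens)):
        for j in range(i + min_len, min(i + max_len, len(tokens)) + 1):
            span = tokens[i:j]
            if sum(t in HFW for t in span) >= 2 and "CW" in to_pattern(span).split():
                patterns.add(to_pattern(span))
    return patterns

sent = "[company] apparently does not care much about product quality or customer support"
print(sorted(extract_patterns(sent)))
```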
Weights of pattern based features
 1 : exact match.
 α : sparse match – extra elements are found
between components.
 γ*n/N: incomplete match – only n of the pattern's N
components are found.
 0 : no match.
slide adapted from Tsur, Davidov, Rappoport 2010
Match example
“Garmin apparently does not care much about product
quality or customer support”
 [company] CW does not CW much: exact match: 1
 [company] CW not: sparse match: 0.1
 Insertion of the word does
 [company] CW CW does not: incomplete match: 0.08
 One of five components (the CW) is missing:
0.1*4/5=0.08
slide adapted from Tsur, Davidov, Rappoport 2010
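The weighting scheme can be sketched as below, with α = γ = 0.1 as in the example. Using a longest-common-subsequence count for the number of matched components is an approximation of the paper's matching procedure, but it reproduces the three values above:

```python
def lcs_len(a, b):
    """Length of the longest order-preserving common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def pattern_weight(pattern, sentence_pattern, alpha=0.1, gamma=0.1):
    p, s = pattern.split(), sentence_pattern.split()
    if any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1)):
        return 1.0                                  # exact, contiguous match
    n = lcs_len(p, s)
    if n == len(p):
        return alpha                                # sparse: extra words in between
    return gamma * n / len(p) if n else 0.0         # incomplete: n of N components

sent = "[company] CW does not CW much about CW CW or CW CW"
print(pattern_weight("[company] CW does not CW much", sent))   # 1.0
print(pattern_weight("[company] CW not", sent))                # 0.1
print(pattern_weight("[company] CW CW does not", sent))        # 0.08
```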
Punctuation based features
 Number of !
 Number of ?
 Number of quotes
 Number of CAPITALIZED words/letters
slide adapted from Tsur, Davidov, Rappoport 2010
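These counts are straightforward to compute; a tiny sketch (quote counting here also picks up apostrophes, which a real implementation would handle more carefully):

```python
def punctuation_features(sentence):
    words = sentence.split()
    return {
        "num_exclaim": sentence.count("!"),
        "num_question": sentence.count("?"),
        "num_quotes": sentence.count('"') + sentence.count("'"),
        "num_allcaps": sum(1 for w in words if w.isupper() and len(w) > 1),
    }

print(punctuation_features('WOW, "blazing" fast GPRS!!'))
```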
Davidov et al. (2010) Results
Amazon/Twitter SASI results for evaluation paradigms (F-score):
 Amazon - Turk: 0.79
 Twitter - Turk: 0.83
 Twitter - #Gold: 0.55
Amazon results for different feature sets on gold standard (F-score):
 Punctuation: 0.28
 Patterns: 0.77
 Patterns + punctuation: 0.81
 Enriched patterns: 0.40
 Enriched punctuation: 0.77
 All (SASI): 0.83
slide from Mari Ostendorf
GMW11 Study
 Data: 2700 tweets, equal amounts of positive,
negative and sarcastic (no neutral)
 Annotation by hashtags: sarcasm/sarcastic,
happy/joy/lucky, sadness/angry/frustrated
 Features:
 Unigrams, LIWC classes (grouped), WordNet affect
 Interjections and punctuation, Emoticons & ToUser
 Classifier: SVM & logistic regression
slide from Mari Ostendorf
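A sketch in the spirit of this setup, using only unigram features and a linear SVM via scikit-learn. The toy tweets and labels are invented for illustration, and the actual study used a much richer feature set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "I love waiting in line for hours",        # sarcastic (hashtag stripped)
    "so happy to see you all today",           # positive
    "this traffic is making me furious",       # negative
    "great, another Monday morning meeting",   # sarcastic
]
labels = ["sarcastic", "positive", "negative", "sarcastic"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["wonderful, my flight is delayed again"]))
```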
Results
 Automatic system accuracy:
 3-way S-P-N: 57%, 2-way S-NS: 65%
 Equal difficulty in separating sarcastic from positive and
negative
 Human S-P-N labeling: 270 tweet subset, K=0.48
 Human “accuracy”: 43% unanimous, 63% avg
 New human annotators, S-NS labeling: K = 0.59
 Human “accuracy”: 59% unanimous, 67% avg
 Automatic: 68%
 Accuracies & agreement go up for subset with
emoticons
 Conclusion: Humans are not so good at this task either…
slide from Mari Ostendorf
Prosody and sarcasm
 Bryant and Fox Tree 2005
 take utterances from talk shows
 low-pass filter the speech to remove the lexical content, leaving only prosody
 subjects could tell “dripping” sarcastic from non-
sarcastic
 Rockwell (2000)
 sarcasm in talk shows correlated with
 lower pitch
 slower tempo
 Cheang and Pell (2008)
 acted speech: sarcasm, humour, sincerity, neutrality
 lower pitch, decreased pitch variation
 spectral differences: disgust? (facial sneer) or blank face
Yeah, right. (Tepperman et al., 2006)
 131 instances of “yeah right” in Switchboard & Fisher,
23% annotated as sarcastic
 Annotation:
 In isolation: very low agreement between human
listeners (k=0.16)
 “Prosody alone is not sufficient to discern whether a speaker is
being sarcastic.”
 In context, still weak agreement (k=.31)
 Gold standard based on discussion
 Observation: laughter is much more frequent around
sarcastic versions
slide from Mari Ostendorf
Sarcasm Detector
 Features:
 Prosody: relative pitch, duration & energy for each word
 Spectral: class-dependent HMM acoustic model score
 Context: laughter, gender, pause, question/answer dialog act, location in
utterance
 Classifier: decision tree (WEKA)
 Implicit feature selection in tree training
slide from Mari Ostendorf
Results
 Laughter is the most important contextual feature
 Energy seems a little more important than pitch
slide from Mari Ostendorf
Whatever! (Benus, Gravano &
Hirschberg, 2007)
FILLER:
I don’t wanna waste my time buying a prom dress or
whatever.
NEUTRAL:
A: Hey Ritchie, you want these over here?
B: Yeah, whatever, just put them down.
NEGATIVE EVALUATION:
So she ordered all this stuff and two days ago she
changed her mind. I was like, whatever.
Whatever! (Benus, Gravano &
Hirschberg, 2007)
 Acted speech: subjects read transcripts of natural
conversation
Methodology
 Record 12 subjects reading 5 conversations
 Transcribe, hand-align,
 Label
 syllable boundaries
 /t/ closure
 ToBI pitch accents and boundary tones
 max, min F0, pitch range
“Whatever” production results
 As “whatever” becomes more negative, first
syllable is more likely to
 have pitch accent (sharply rising L+H*)
 be longer
 have expanded pitch range
 Similar results in perception study
 have pitch accent
 be longer
 but have flat pitch (boredom)?
Alzheimer's
 Garrard et al. 2005
 Lancashire and Hirst 2009
The Nun Study
 Linguistic Ability in Early Life and the Neuropathology of Alzheimer’s
Disease and Cerebrovascular Disease: Findings from the Nun Study
 D.A. SNOWDON, L.H. GREINER, AND W.R. MARKESBERY
 The Nun Study: a longitudinal study of aging and Alzheimer's disease
 Cognitive and physical function assessed annually
 All participants agreed to brain donation at death
 At the first exam, given between 1991 and 1993, the 678
participants were 75 to 102 years old.
This study:
 subset of 74 participants
 for whom we had handwritten autobiographies from early life,
 all of whom had died.
The data
 In September 1930
 the leader of the School Sisters of Notre Dame religious
congregation requested that each sister write
 “a short sketch of her own life. This account should
not contain more than two to three hundred words
and should be written on a single sheet of paper ...
include the place of birth, parentage, interesting and
edifying events of one's childhood, schools attended,
influences that led to the convent, religious life, and
its outstanding events.”
 Handwritten autobiographies were found in two participating
convents, in Baltimore and Milwaukee
The linguistic analysis
 Grammatical complexity
 Developmental Level metric (Cheung/Kemper)
 sentences classified from 0 (simple one-clause sentences) to 7
(complex sentences with multiple embedding and subordination)
 Idea density:
 average number of ideas expressed per 10 words; ideas are elementary
propositions, typically expressed by a verb, adjective, adverb, or prepositional
phrase. Complex propositions that stated or inferred causal,
temporal, or other relationships between ideas were also counted.
 Prior studies suggest:
 idea density is associated with educational level,
vocabulary, and general knowledge
 grammatical complexity is associated with working
memory, performance on speeded tasks, and writing skill.
Idea density
 “I was born in Eau Claire, Wis., on May 24, 1913 and
was baptized in St. James Church.”
(1) I was born,
(2) born in Eau Claire, Wis.,
(3) born on May 24, 1913,
(4) I was baptized,
(5) was baptized in church
(6) was baptized in St. James Church,
(7) I was born...and was baptized.
 There are 18 words or utterances in that sentence.
 The idea density for that sentence was 3.9 (7/18 * 10
= 3.9 ideas per 10 words).
Results
 Participants with neuropathologically confirmed Alzheimer's
disease had lower idea density scores than those without
 Correlations between idea density scores and mean
neurofibrillary tangle counts
 −0.59 for the frontal lobe,
 −0.48 for the temporal lobe,
 −0.49 for the parietal lobe
Explanations?
 Earlier analyses found the same results in a college-
educated subset of the population who were teachers,
suggesting that education was not the key factor
 They suggest:
 Low linguistic ability in early life may reflect suboptimal
neurological and cognitive development which might
increase susceptibility to the development of
Alzheimer’s disease pathology in late life
Garrard et al. 2005
 British writer Iris Murdoch
 last novel published in 1995
 diagnosed with Alzheimer's disease in 1997
 Compared three novels
 Under the Net (her first novel)
 The Sea, The Sea (written in her prime)
 Jackson's Dilemma (her final novel)
 All her books written in longhand with little editing
Type-to-token ratio in the 3 novels
[figure omitted]
Syntactic Complexity
[figure: mean proportions of usages of the 10 most frequently occurring words in each book that appear twice within a series of short intervals, ranging from consecutive positions in the text to a separation of three intervening words. Garrard P et al. Brain 2005;128:250-260]
Parts of speech
[figure: comparative distributions of values of (A) frequency and (B) word length in the three books. Garrard P et al. Brain 2005;128:250-260]
 From Under the Net, 1954
 "So you may imagine how unhappy it makes me to have to
cool my heels at Newhaven, waiting for the trains to run again,
and with the smell of France still fresh in my nostrils. On this
occasion, too, the bottles of cognac, which I always smuggle,
had been taken from me by the Customs, so that when closing
time came I was utterly abandoned to the torments of a
morbid self-scrutiny.”
 From Jackson's Dilemma, 1995
 "His beautiful mother had died of cancer when he was 10. He
had seen her die. When he heard his father's sobs he knew.
When he was 18, his younger brother was drowned. He had no
other siblings. He loved his mother and his brother
passionately. He had not got on with his father. His father, who
was rich and played at being an architect, wanted Edward to
be an architect too. Edward did not want to be an architect."
Lancashire and Hirst
Vocabulary Changes in Agatha Christie’s Mysteries as
an Indication of Dementia: A Case Study
Ian Lancashire and Graeme Hirst, 2009
 Examined all of Agatha Christie’s novels
 Features:
 Nicholas, M., Obler, L. K., Albert, M. L., Helm-Estabrooks, N. (1985). Empty speech in Alzheimer’s
disease and fluent aphasia. Journal of Speech and Hearing Research, 28: 405–10.
 Number of unique word types
 Number of different repeated n-grams up to 5
 Number of occurrences of “thing”, “anything”, and
“something”
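A rough sketch of these lexical measures with deliberately naive tokenization (the original study was far more careful about editions, sampling, and normalization):

```python
import re
from collections import Counter

def lexical_features(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = {"word_types": len(set(tokens))}
    for n in range(2, 6):                       # repeated n-grams, n = 2..5
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        feats[f"repeated_{n}grams"] = sum(1 for c in grams.values() if c > 1)
    feats["indefinite_words"] = sum(
        tokens.count(w) for w in ("thing", "anything", "something"))
    return feats

print(lexical_features("It was a thing, a strange thing, and something about "
                       "the strange thing bothered her."))
```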
Results
Distributional Semantics
Distributional methods for word
similarity
 Firth (1957): “You shall know a word by the
company it keeps!”
 Zellig Harris (1954): “If we consider oculist and eye-doctor we find
that, as our corpus of utterances grows, these two occur in almost the
same environments. In contrast, there are many sentence
environments in which oculist occurs but lawyer does not...
 It is a question of the relative frequency of such environments, and of
what we will obtain if we ask an informant to substitute any word he
wishes for oculist (not asking what words have the same meaning).
 These and similar tests all measure the probability of particular
environments occurring with particular elements... If A and B have
almost identical environments we say that they are synonyms.”
Distributional methods for word
similarity
 Nida example:
 A bottle of tezgüino is on the table
 Everybody likes tezgüino
 Tezgüino makes you drunk
 We make tezgüino out of corn.
 Intuition:
 just from these contexts a human could guess the meaning
of tezgüino
 So we should look at the surrounding contexts, see what
other words have similar context.
Context vector
 Consider a target word w
 Suppose we had one binary feature f_i for each of the N
words v_i in the lexicon
 f_i means “word v_i occurs in the neighborhood of w”
 w = (f_1, f_2, f_3, …, f_N)
 If w = tezgüino, v_1 = bottle, v_2 = drunk, v_3 = matrix:
 w = (1, 1, 0, …)
Intuition
 Define each word by a sparse vector of these features
 Apply a vector distance metric
 Say that two words are similar if their vectors are
similar
Distributional similarity
 So we just need to specify 3 things:
1. How the co-occurrence terms are defined
2. How terms are weighted (frequency? logs? mutual information?)
3. What vector distance metric should we use? (cosine? Euclidean distance?)
Defining co-occurrence vectors
 We could use a window around the target word
 Bag-of-words: count which words occur in the window
 We generally remove stopwords
 But the vectors are still very sparse
 So instead of using ALL the words in the neighborhood,
how about just the words occurring in particular
syntactic relations?
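A minimal sketch of window-based co-occurrence vectors over the tezgüino sentences, assuming a +/-2 word window and a toy stop-word list:

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "of", "is", "on", "the", "you", "we", "out", "makes"}
corpus = [
    "a bottle of tezguino is on the table",
    "everybody likes tezguino",
    "tezguino makes you drunk",
    "we make tezguino out of corn",
]

def cooccurrence_vectors(sentences, window=2):
    vectors = defaultdict(Counter)
    for sent in sentences:
        tokens = [w for w in sent.split() if w not in STOPWORDS]
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[w][tokens[j]] += 1      # count neighbours in window
    return vectors

print(cooccurrence_vectors(corpus)["tezguino"])
```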
Defining co-occurrence vectors
 Zellig Harris (1968)
 The meaning of entities, and the meaning of grammatical relations
among them, is related to the restriction of combinations of these
entities relative to other entities
 Idea: two words are similar if they have similar parse
contexts. Consider duty and responsibility:
 They share a similar set of parse contexts:
Slide adapted from Chris Callison-Burch
Co-occurrence vectors based on
dependencies
 For the word “cell”: vector of NxR features
 R is the number of dependency relations
2. Weighting the counts
(“Measures of association with context”)
 We have been using the frequency of some feature as
its weight or value.
 But we could use any function of this frequency
 One possibility: tf-idf
 Another one: conditional probability
 Example feature: f = (r, w’) = (obj-of, attack)
 P(f|w) = count(f, w) / count(w)
 assoc_prob(w, f) = P(f|w)
Intuition: why not frequency
 “drink it” is more common than “drink wine”
 But “wine” is a better “drinkable” thing than “it”
 Idea:
 We need to control for chance (expected frequency)
 We do this by normalizing by the expected frequency we would get
assuming independence
Weighting: Mutual Information
 Mutual information between 2 random variables X and Y:
 I(X; Y) = Σ_x Σ_y P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]
 Pointwise mutual information: a measure of how often two
events x and y occur together, compared with what we would
expect if they were independent:
 PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
Weighting: Mutual Information
 Pointwise mutual information: a measure of how often two events x
and y occur together, compared with what we would expect if they were
independent:
 PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
 PMI between a target word w and a feature f:
 assoc_PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
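A small worked sketch of PMI weighting over (word, feature) counts; the count table is invented to echo the "objects of drink" example on the next slide:

```python
import math
from collections import Counter

counts = Counter({
    ("wine", "obj-of-drink"): 3, ("it", "obj-of-drink"): 8,
    ("wine", "obj-of-pour"): 2, ("it", "obj-of-see"): 20,
})

total = sum(counts.values())
p_w, p_f = Counter(), Counter()
for (w, f), c in counts.items():
    p_w[w] += c / total
    p_f[f] += c / total

def pmi(w, f):
    p_wf = counts[(w, f)] / total
    return math.log2(p_wf / (p_w[w] * p_f[f])) if p_wf > 0 else float("-inf")

# "it" is the more frequent object of drink, but "wine" gets the higher PMI
print(round(pmi("wine", "obj-of-drink"), 2), round(pmi("it", "obj-of-drink"), 2))
```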
Mutual information intuition
 Objects of the verb drink
3. Defining similarity between
vectors
Summary of similarity measures
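As a concrete instance of one measure mentioned earlier, here is cosine similarity, sim(v, w) = (v · w) / (|v| |w|), over simple count dictionaries with made-up values:

```python
import math

def cosine(v, w):
    dot = sum(v[k] * w.get(k, 0) for k in v)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0

apricot   = {"sweet": 2, "fruit": 3, "tree": 1}
pineapple = {"sweet": 3, "fruit": 2, "juice": 1}
digital   = {"computer": 4, "data": 2}
print(cosine(apricot, pineapple), cosine(apricot, digital))
```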
Evaluating similarity
 Intrinsic Evaluation:
 Correlation coefficient
 Between algorithm scores
 And word similarity ratings from humans
 Extrinsic (task-based, end-to-end) Evaluation:
 Malapropism (spelling error) detection
 WSD
 Essay grading
 Taking TOEFL multiple-choice vocabulary tests
An example of detected plagiarism
Resources
 Peter Turney and Patrick Pantel. 2010. From
Frequency to Meaning: Vector Space Models of
Semantics. Journal of Artificial Intelligence Research
37: 141-188.
 Distributional Semantics and Compositionality
(DiSCo’2011) Workshop at ACL HLT 2011