Identifying Sarcasm in Twitter: A Closer Look

Roberto Gonzalez
Smaranda Muresan
Nina Wacholder
Aim of the study
• To construct a corpus of sarcastic utterances
that their authors have explicitly labeled as
sarcastic (with #sarcasm or #sarcastic).
• To exemplify the difficulty in distinguishing
sarcastic sentences from negative/positive
sentences.
Data
• The data for the study is divided into three sets of
900 tweets each: sarcastic, positive and
negative.
• Each data set is collected from Twitter using the
appropriate hashtags.
– Sarcasm: #sarcasm, #sarcastic
– Positive: #happy, #joy, #lucky
– Negative: #sadness, #frustrated, #angry
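The hashtag-based labeling above can be sketched as follows. This is only a minimal illustration: the hashtag sets come from the slide, but the function name `label_tweet`, the case-insensitive matching, and the rule of skipping tweets that match no class or more than one class are assumptions, not something the slides describe.

```python
import re

# Hashtag sets as listed on the slide; everything else is illustrative.
CLASS_TAGS = {
    "sarcastic": {"#sarcasm", "#sarcastic"},
    "positive": {"#happy", "#joy", "#lucky"},
    "negative": {"#sadness", "#frustrated", "#angry"},
}

HASHTAG_RE = re.compile(r"#\w+")


def label_tweet(text):
    """Return the class whose hashtags appear in the tweet, or None.

    Assumption: tweets matching zero classes or several classes are skipped.
    """
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    matches = [label for label, wanted in CLASS_TAGS.items() if tags & wanted]
    return matches[0] if len(matches) == 1 else None
```

For example, `label_tweet("Great, another Monday #sarcasm")` yields `"sarcastic"`, while a tweet carrying both `#happy` and `#angry` is left unlabeled under the assumed rule.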
Data Preprocessing
• Tweets with #sarcasm or #sarcastic in
the middle of the tweet were removed.
• The remaining tweets were manually checked to see whether the tags were
part of the content of the tweet.
– E.g.: “I really love #sarcasm”
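The positional filter could look roughly like the sketch below. It assumes that a tweet is kept only when a sarcasm hashtag sits in the trailing run of hashtags and nowhere earlier; the slides only say that tweets with the tag "in the middle" were removed, so the exact rule, the punctuation stripping, and the name `tag_is_trailing` are assumptions.

```python
SARCASM_TAGS = {"#sarcasm", "#sarcastic"}


def tag_is_trailing(text):
    """True if a sarcasm hashtag appears only in the trailing run of hashtags."""
    # Lowercase and strip simple trailing punctuation from each token.
    tokens = [tok.lower().rstrip(".,!?") for tok in text.split()]

    # Walk back from the end over the final run of hashtag tokens.
    end = len(tokens)
    while end > 0 and tokens[end - 1].startswith("#"):
        end -= 1
    body, trailing = tokens[:end], tokens[end:]

    in_trailing = any(tok in SARCASM_TAGS for tok in trailing)
    in_body = any(tok in SARCASM_TAGS for tok in body)
    return in_trailing and not in_body
```

Note that the slide's own example, “I really love #sarcasm”, passes this automatic check because the tag is at the end. A positional rule cannot tell that the tag is part of the sentence, which is why the manual check is still needed.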
Lexical features
• Unigrams
• Dictionary-based
– Pennebaker et al. (LIWC)
  • Linguistic Processes (adverbs, pronouns)
  • Psychological Processes (positive, negative emotion)
  • Personal Concerns (work, achievement)
  • Spoken Categories (assent, non-fluencies)
– WordNet Affect
– List of interjections and punctuation
Pragmatic Features
• Positive emoticons
– smileys
• Negative emoticons
– Frowning faces
• ToUser
– @user
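The three pragmatic features above can be extracted with simple pattern matching. The sketch below is illustrative only: the slides do not give the emoticon lists used in the study, so the regular expressions and the name `pragmatic_features` are assumptions that cover just a few common smileys and frowning faces.

```python
import re

# Illustrative patterns; the study's actual emoticon lists are not in the slides.
POSITIVE_RE = re.compile(r"[:;=]-?[)D]")   # e.g. :)  :-)  ;)  =D
NEGATIVE_RE = re.compile(r":'?-?\(")       # e.g. :(  :-(  :'(
TO_USER_RE = re.compile(r"(?<!\w)@\w+")    # @user, but not the "@" inside an e-mail address


def pragmatic_features(text):
    """Return binary pragmatic features for one tweet."""
    return {
        "positive_emoticon": bool(POSITIVE_RE.search(text)),
        "negative_emoticon": bool(NEGATIVE_RE.search(text)),
        "to_user": bool(TO_USER_RE.search(text)),
    }
```

Each feature is binary here; counting occurrences instead would be an equally plausible design, and the slides do not say which was used.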
Comparisons and χ² rankings
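A chi-squared (χ²) ranking of features, as named in this slide title, could be computed along the following lines. The slides do not show how the statistic was calculated, so this sketch assumes the standard Pearson χ² between a binary "feature present" indicator and the class label; the function names and the alphabetical tie-breaking are illustrative.

```python
from collections import Counter


def chi_squared(feature_present, labels):
    """Pearson chi-squared statistic for one binary feature vs. class labels."""
    n = len(labels)
    observed = Counter(zip(feature_present, labels))
    feature_totals = Counter(feature_present)
    label_totals = Counter(labels)

    stat = 0.0
    for value, value_total in feature_totals.items():
        for label, label_total in label_totals.items():
            # Expected count under independence of feature and class.
            expected = value_total * label_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def rank_features(docs, labels):
    """Rank unigrams by chi-squared, highest first (ties broken alphabetically).

    docs is a list of token lists, aligned with labels.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    scores = {
        tok: chi_squared([tok in doc for doc in docs], labels) for tok in vocab
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A feature that occurs in every tweet of one class and in none of the other gets the maximum score, while a feature spread evenly across classes scores close to zero.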
Classification
• Logistic Regression and Support Vector
Machine with SMO (sequential minimal
optimization)
• Features used:
– Unigrams
– Dictionary features presence (LIWC+_P)
– Dictionary features frequency (LIWC+_F)
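The difference between the presence and frequency variants of the dictionary features can be illustrated with a small sketch. LIWC is a licensed resource, so the lexicon below is a hypothetical toy stand-in, and defining frequency as the category count divided by the number of tokens is an assumption rather than the definition used in the study.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for LIWC-style categories.
TOY_LEXICON = {
    "posemo": {"love", "great", "happy"},
    "negemo": {"hate", "awful", "sad"},
    "assent": {"yeah", "ok", "yes"},
}


def dictionary_features(tokens, lexicon=TOY_LEXICON):
    """Return (presence, frequency) feature dicts for one tokenized tweet."""
    counts = Counter()
    for tok in tokens:
        for category, words in lexicon.items():
            if tok in words:
                counts[category] += 1

    total = len(tokens) or 1  # avoid division by zero on empty input
    presence = {cat: int(counts[cat] > 0) for cat in lexicon}
    frequency = {cat: counts[cat] / total for cat in lexicon}
    return presence, frequency
```

For the tokens of "yeah i love love mondays", the presence features mark `posemo` and `assent` as 1, while the frequency features distinguish the doubled "love" (2 of 5 tokens) from the single "yeah" (1 of 5).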
Classification Results
Comparison against human
performance
• Three judges were asked to classify tweets as sarcastic,
positive or negative (90 tweets per category).
• S-N-P (sarcastic vs. negative vs. positive): 50% agreement (κ = 0.4788)
• S-NS (sarcastic vs. non-sarcastic): 71.67% agreement (κ = 0.5861)
• Emoticon-based S-NS: 89% agreement (κ = 0.74)
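Agreement figures of this kind can be computed as in the sketch below. The slides report a kappa value without naming the statistic; with three judges, Fleiss' kappa is a common choice and is assumed here. Reading "agreement" as the share of tweets on which all judges chose the same label is likewise an assumption, and the function names are illustrative.

```python
from collections import Counter


def full_agreement_rate(ratings):
    """Fraction of items on which all judges chose the same label."""
    return sum(len(set(item)) == 1 for item in ratings) / len(ratings)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item label lists.

    Every item must be rated by the same number of judges (at least two).
    """
    n_items = len(ratings)
    n_judges = len(ratings[0])
    category_totals = Counter()
    item_agreement_sum = 0.0

    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing judge pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_judges
        item_agreement_sum += agreeing / (n_judges * (n_judges - 1))

    observed = item_agreement_sum / n_items
    total_ratings = n_items * n_judges
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    return (observed - expected) / (1 - expected)
```

Here `ratings` would hold one list per tweet, for instance `["S", "S", "N"]` when two judges chose sarcastic and one chose negative.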
Human Comparison results