Identifying Sarcasm in Twitter: A Closer Look
Roberto Gonzalez, Smaranda Muresan, Nina Wacholder

Aim of the study
• To construct a corpus of sarcastic utterances that have been explicitly labeled as such by their authors (#sarcasm, #sarcastic).
• To illustrate the difficulty of distinguishing sarcastic tweets from positive and negative ones.

Data
• The data is divided into three sets of 900 tweets each: sarcastic, positive, and negative.
• Each set is collected from Twitter using the corresponding hashtags:
  – Sarcasm: #sarcasm, #sarcastic
  – Positive: #happy, #joy, #lucky
  – Negative: #sadness, #frustrated, #angry

Data Preprocessing
• Tweets with #sarcasm or #sarcastic in the middle of the tweet are removed (see the filtering sketch after the slides).
• The remaining tweets are checked manually to see whether the tag is part of the tweet's content.
  – E.g.: "I really love #sarcasm"

Lexical features
• Unigrams
• Dictionary-based
  – LIWC (Pennebaker et al.)
    • Linguistic processes (adverbs, pronouns)
    • Psychological processes (positive, negative emotion)
    • Personal concerns (work, achievement)
    • Spoken categories (assent, non-fluencies)
  – WordNet-Affect
  – A list of interjections and punctuation marks

Pragmatic Features
• Positive emoticons – smileys
• Negative emoticons – frowning faces
• ToUser – a reply to another user (@user); see the feature-extraction sketch after the slides.

Comparisons and χ² rankings
• Feature use is compared across the three sets and features are ranked by χ² score (see the ranking sketch after the slides).

Classification
• Logistic regression and a support vector machine trained with SMO (sequential minimal optimization); see the classifier sketch after the slides.
• Features used:
  – Unigrams
  – Dictionary features, presence (LIWC+_P)
  – Dictionary features, frequency (LIWC+_F)

Classification Results

Comparison against human performance
• Three judges were asked to classify tweets as sarcastic, positive, or negative (90 tweets per category).
• S-N-P (sarcastic vs. negative vs. positive): 50% agreement (κ = 0.4788)
• S-NS (sarcastic vs. non-sarcastic): 71.67% agreement (κ = 0.5861)
• Emoticon-based S-NS: 89% agreement (κ = 0.74); see the agreement sketch after the slides.

Human Comparison results
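A minimal sketch of the automatic filtering step from the Data Preprocessing slide, assuming the tweets are available as plain strings; the keep_tweet helper and the example tweets are hypothetical, not taken from the paper.

SARCASM_TAGS = ("#sarcasm", "#sarcastic")

def keep_tweet(text: str) -> bool:
    """Keep a tweet only if no sarcasm hashtag occurs mid-tweet."""
    tokens = text.strip().split()
    # Trailing sarcasm hashtags are treated as labels rather than content.
    while tokens and tokens[-1].lower() in SARCASM_TAGS:
        tokens.pop()
    # Any sarcasm hashtag left over sits in the middle of the tweet.
    return not any(tok.lower() in SARCASM_TAGS for tok in tokens)

print(keep_tweet("Making pancakes at 3am was a great idea #sarcasm"))  # True
print(keep_tweet("Yes, #sarcasm is my native language, clearly"))      # False
# Passes the automatic filter but is caught by the manual content check:
print(keep_tweet("I really love #sarcasm"))                            # True

Tweets where the tag survives only at the end, like the last example, are exactly the cases the manual check on the slide is meant to catch.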
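The pragmatic features on the Pragmatic Features slide can be read as three binary indicators. A sketch under the assumption that emoticons are matched against small hand-picked lists; the lists and the example tweet below are illustrative, not the ones used in the study.

import re

POSITIVE_EMOTICONS = (":)", ":-)", ":D", ";)")   # illustrative subset only
NEGATIVE_EMOTICONS = (":(", ":-(", ":'(")        # illustrative subset only

def pragmatic_features(text: str) -> dict:
    """Binary pragmatic features: positive/negative emoticons and @user replies."""
    return {
        "pos_emoticon": any(e in text for e in POSITIVE_EMOTICONS),
        "neg_emoticon": any(e in text for e in NEGATIVE_EMOTICONS),
        "to_user": bool(re.search(r"(^|\s)@\w+", text)),
    }

print(pragmatic_features("@anna sure, that went really well :("))
# {'pos_emoticon': False, 'neg_emoticon': True, 'to_user': True}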
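For the "Comparisons and χ² rankings" slide, a minimal sketch of ranking unigram features by χ² score with scikit-learn; the four-tweet corpus and its labels are invented for illustration and are not the study's data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

tweets = [
    "oh great another meeting",           # sarcastic
    "yeah right best day ever",           # sarcastic
    "had a wonderful day at the beach",   # positive
    "so happy with the new job",          # positive
]
labels = [1, 1, 0, 0]  # 1 = sarcastic, 0 = non-sarcastic

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)
scores, _ = chi2(X, labels)

# Rank unigrams by their chi-squared score, highest first.
ranked = sorted(zip(vectorizer.get_feature_names_out(), scores),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:5])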
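A hedged sketch of the classification setup: logistic regression and an SVM over unigram counts. The paper's SVM is trained with SMO (as in WEKA); scikit-learn's SVC, which uses libsvm, is used here as a stand-in, and the tiny corpus is again invented.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

tweets = ["oh great another meeting", "yeah right best day ever",
          "had a wonderful day at the beach", "so happy with the new job"]
labels = [1, 1, 0, 0]  # 1 = sarcastic, 0 = non-sarcastic

for clf in (LogisticRegression(max_iter=1000), SVC(kernel="linear")):
    # Unigram counts feed directly into the classifier, as on the slide.
    pipeline = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipeline, tweets, labels, cv=2)
    print(type(clf).__name__, scores.mean())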
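For the "Comparison against human performance" slide, a sketch of computing inter-annotator agreement for three judges with Fleiss' kappa from statsmodels; the slide does not state which kappa variant was reported, and the judgements below are invented.

from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = tweets, columns = judges; 0 = sarcastic, 1 = positive, 2 = negative.
judgements = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
    [2, 2, 2],
]
table, _ = aggregate_raters(judgements)   # counts per category per tweet
print(round(fleiss_kappa(table), 4))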