TJTSD66: Advanced Topics in Social Media (Social Media Mining)
Natural Language Processing
Dr. WANG, Shuaiqiang @ CS & IS, JYU
Email: shuaiqiang.wang@jyu.fi
Homepage: http://users.jyu.fi/~swang/
Social Media Mining - Natural Language Processing (66 slides)

Part I: What is Natural Language Processing?

NLP Example: Question Answering

NLP Example: Information Extraction

NLP Example: Sentiment Analysis
• Attributes: zoom, affordability, size and weight, flash, ease of use
• Size and weight:
– nice and compact to carry!
– since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
– the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera

NLP Example: Machine Translation

Language Technology

Part II: Preprocessing Techniques in NLP

Normalization
• Information Retrieval: indexed text and query terms must have the same form:
– am, is, are → be
– U.S.A. → USA
– car, cars, car's, cars' → car
– automate(s), automatic, automation → automat
• But beware: US vs. us!

Sentence Segmentation
• "!" and "?" are relatively unambiguous.
• The period "." is quite ambiguous:
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Build a binary classifier for "."

Sentence Segmentation: Decision Tree

Word Tokenization
• English:
– "San Francisco", "New York" as a single token
• Chinese and Japanese:
– No spaces between words
– 莎拉波娃现在居住在美国东南部的佛罗里达
– 莎拉波娃/现在/居住/在/美国/东南部/的/佛罗里达
– Also called Word Segmentation in Chinese (中文分词)

Part III: Language Modeling

Probabilistic Language Models
• Assign a probability to a sentence:
– Machine Translation:
• P(high winds tonight) > P(large winds tonight)
– Spell Correction:
• "The office is about fifteen minuets from my house"
• P(about fifteen minutes from) > P(about fifteen minuets from)
– Speech Recognition:
• P(I saw a van) >> P(eyes awe of an)
– Summarization, question answering, etc.

Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
– P(W) = P(w1, w2, w3, w4, w5, …, wn)
• Related task: probability of an upcoming word:
– P(w5 | w1, w2, w3, w4)
• A model that computes either P(W) or P(wn | w1, w2, …, wn−1) is called a language model.

How to Compute P(W)

Example
• P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
• Can we estimate P(the | its water is so transparent that) as Count(its water is so transparent that the) / Count(its water is so transparent that)?
• No! We'll never see enough data for estimation.
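The chain-rule decomposition above, and the data-sparsity problem with long histories, can be sketched in a few lines of Python. The one-sentence corpus here is invented for illustration; real language models are estimated from millions or billions of tokens.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "its water is so transparent and the water is so clear".split()

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

# Short histories are estimable: P(water | its) = Count(its water) / Count(its)
p_water_given_its = bigrams[("its", "water")] / unigrams[("its",)]
print(p_water_given_its)  # 1.0 in this toy corpus

# But the 7-word history from the slide never occurs here, so
# Count(its water is so transparent that the) = 0 -- and dividing
# by the zero count of the history is undefined, not just unreliable.
sevengrams = ngram_counts(corpus, 7)
print(sevengrams[tuple("its water is so transparent that the".split())])  # 0
```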
Markov Assumption
• Simplifying assumption:
– P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
– P(the | its water is so transparent that) ≈ P(the | transparent that)

N-gram Models

Smoothing
• Intuition:
– A word with NO occurrence in the corpus does NOT have ZERO probability!
– Maybe the corpus is just too small.
• Smoothing:
– Steal probability mass from seen events to generalize better.

Add-one (Laplace) Smoothing

Berkeley Restaurant Corpus

Laplace-Smoothed Bigrams

Reconstituted Counts

Compare with Raw Counts

Disadvantage
• Add-1 estimation is a blunt instrument.
• So add-1 isn't used for N-grams:
– We'll see better methods.
• But add-1 is used to smooth other NLP models:
– For text classification
– In domains where the number of zeros isn't so huge
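Concretely, the Laplace-smoothed bigram estimate (and its add-k generalization) can be sketched as follows. The toy corpus stands in for the Berkeley Restaurant Corpus and is invented for illustration.

```python
from collections import Counter

# Toy corpus standing in for the Berkeley Restaurant Corpus.
corpus = "i want chinese food i want english food i want to eat".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(set(corpus))  # vocabulary size (7 word types here)

def p_add_k(w_prev, w, k=1.0):
    """Add-k smoothed bigram: (c(w_prev, w) + k) / (c(w_prev) + k * V).
    k = 1 gives add-one (Laplace) smoothing."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

# The MLE would give P(chinese | want) = 1/3; add-1 shrinks it...
print(p_add_k("want", "chinese"))  # (1 + 1) / (3 + 7) = 0.2
# ...and an unseen bigram such as (want, eat) gets nonzero mass.
print(p_add_k("want", "eat"))      # (0 + 1) / (3 + 7) = 0.1
```

This also makes the "blunt instrument" complaint visible: a single observed bigram and a never-observed one end up with probabilities of the same order of magnitude.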
Good-Turing Smoothing, Add-1 Smoothing, Add-k Smoothing
• Add-k is more general than add-1; with m = kV:
– P_Add-k(wi | wi−1) = (c(wi−1, wi) + m × (1/V)) / (c(wi−1) + m)

Unigram Prior Smoothing
• Let kV = m:
– P_Add-k(wi | wi−1) = (c(wi−1, wi) + m × (1/V)) / (c(wi−1) + m)
– 1/V is the probability of each word under the uniform distribution.
• The word distribution P(wi) can be estimated with unigram prior knowledge, so the smoothing formula can be rewritten as:
– P_UP(wi | wi−1) = (c(wi−1, wi) + m × P(wi)) / (c(wi−1) + m)

Jelinek-Mercer Smoothing
• Let λ = m / (c(wi−1) + m), so that 1 − λ = c(wi−1) / (c(wi−1) + m)
• Then:
– P_UP(wi | wi−1) = (1 − λ) × c(wi−1, wi) / c(wi−1) + λ × P(wi) = (1 − λ) × P(wi | wi−1) + λ × P(wi)
• Bigram interpolated model:
– P_interp(wi | wi−1) = (1 − λ) × P(wi | wi−1) + λ × P(wi)

Part IV: Sentiment Analysis

Positive or Negative in Movie Reviews?
• Unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• This is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.

Google Product Search

Twitter Sentiment vs. Gallup Poll of Consumer Confidence

Twitter Sentiment
• Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2:1, 1-8. doi:10.1016/j.jocs.2010.12.007

Other Names of Sentiment Analysis
• Opinion extraction
• Opinion mining
• Sentiment mining
• Subjectivity analysis

Why Sentiment Analysis?
• Movie: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment

Sentiment Analysis
• Simplest task:
– Is the attitude of this text positive or negative?
• More complex:
– Rank the attitude of this text from 1 to 5
• Advanced:
– Detect the target, source, or complex attitude types

Sentiment Classification for Movie Reviews
• Polarity detection:
– Is an IMDB movie review positive or negative?
• Data: Polarity Data 2.0:
– http://www.cs.cornell.edu/people/pabo/movie-review-data
• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 79-86.
• Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271-278.

IMDB Data in the Pang and Lee Database
• when _star wars_ came out some twenty years ago, the image of traveling throughout the stars has become a commonplace image. […] when han solo goes light speed, the stars change to bright lines, going towards the viewer in lines that converge at an invisible point. cool. _october sky_ offers a much simpler image--that of a single white dot, traveling horizontally across the night sky. […]
• "snake eyes" is the most aggravating kind of movie: the kind that shows so much potential then becomes unbelievably disappointing. it's not just because this is a brian depalma film, and since he's a great director and one who's films are always greeted with at least some fanfare. and it's not even because this was a film starring nicolas cage and since he gives a brauvara performance, this film is hardly worth his talents.

Baseline Algorithm
• Tokenization
• Feature Extraction
• Classification using different classifiers:
– Naïve Bayes
– MaxEnt
– SVM

Sentiment Tokenization Issues
• Deal with HTML and XML markup
• Twitter mark-up (names, hash tags)
• Capitalization (preserve for words in all caps)
• Phone numbers, dates
• Emoticons
• Useful code:
– Christopher Potts sentiment tokenizer
– Brendan O'Connor twitter tokenizer

Extracting Features
• How to handle negation:
– I didn't like this movie! vs.
– I really like this movie!
• Which words to use?
– Only adjectives
– All words
• All words turns out to work better, at least on this data.

Negations
• Add NOT_ to every word between a negation and the following punctuation:
– didn't like this movie , but I
– didn't NOT_like NOT_this NOT_movie but I
• Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting Market Sentiment from Stock Message Boards. In APFA.
• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 79-86.

Boolean Feature for Multinomial Naïve Bayes
• Intuition:
– For sentiment (and probably for other text classification domains),
– word occurrence may matter more than word frequency:
• The occurrence of the word fantastic tells us a lot.
• The fact that it occurs 5 times may not tell us much more.
– Boolean Multinomial Naïve Bayes:
• Clips all the word counts in each document at 1
– Other possibility: log(freq(w))

Boolean Multinomial Naïve Bayes
• From the training corpus, extract the Vocabulary
• Calculate the P(cj) terms
• Calculate the P(wk | cj) terms
• Make predictions:
– c_NB = argmax over cj in C of P(cj) × ∏ over i in positions of P(wi | cj)

Other Issues in Classification
• MaxEnt and SVM tend to do better than Naïve Bayes.
• Subtlety makes reviews hard to classify:
– Perfume review in Perfumes: the Guide:
• "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut."

Thwarted Expectations and Ordering Effects
• "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up."
• "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishburne is not so good either, I was surprised."

Sentiment Lexicons
• The General Inquirer, 1966
– Home page: http://www.wjh.harvard.edu/~inquirer
– List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
– Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
– Categories:
• Positive (1915 words) and Negative (2291 words)
• Strong vs. Weak, Active vs. Passive, Overstated vs. Understated
• Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.
– Free for research use

Sentiment Lexicons
• LIWC (Linguistic Inquiry and Word Count), 2007
– Home page: http://www.liwc.net/
– 2300 words, >70 classes
– Affective Processes:
• negative emotion (bad, weird, hate, problem, tough)
• positive emotion (love, nice, sweet)
– Cognitive Processes:
• Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
– Pronouns, Negation (no, never), Quantifiers (few, many)
– $30 or $90 fee

Sentiment Lexicons
• MPQA Subjectivity Cues Lexicon (EMNLP'03, '05)
– Home page: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
– 6885 words from 8221 lemmas:
• 2718 positive and 4912 negative
– Each word annotated for intensity (strong, weak)
– GNU GPL

Sentiment Lexicons
• Bing Liu Opinion Lexicon (KDD-2004)
– Bing Liu's Page on Opinion Mining
– http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
– 6786 words: 2006 positive and 4783 negative

Sentiment Lexicons
• SentiWordNet (LREC-2010)
– Home page: http://sentiwordnet.isti.cnr.it/
– All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness

Disagreements between Polarity Lexicons
• Christopher Potts, Sentiment Tutorial, 2011

Learning Sentiment Lexicons
• Semi-supervised learning of lexicons:
– Use a small amount of information:
• A few labeled examples
• A few hand-built patterns
– To bootstrap a lexicon

Intuition for Identifying Word Polarity (ACL'97)
• Adjectives conjoined by "and" have the same polarity:
– fair and legitimate, corrupt and brutal
– *fair and brutal, *corrupt and legitimate
• Adjectives conjoined by "but" do not:
– fair but brutal

Step 1
• Label a seed set of 1336 adjectives (all occurring >20 times in a 21-million-word WSJ corpus):
– 657 positive: adequate central clever famous intelligent remarkable reputed sensitive slender thriving…
– 679 negative: contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…

Step 2
• Expand the seed set to conjoined adjectives

Step 3
• A supervised classifier assigns a "polarity similarity" to each word pair, resulting in a graph

Step 4
• Clustering partitions the graph into two sets

Output Polarity Lexicon
• Positive:
– bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…
• Negative:
– ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…

Using WordNet to Learn Polarity
• WordNet: an online thesaurus (covered in a later lecture)
• Create positive ("good") and negative ("terrible") seed words
• Find synonyms and antonyms:
– Positive set: add synonyms of positive words ("well") and antonyms of negative words
– Negative set: add synonyms of negative words ("awful") and antonyms of positive words ("evil")
• Repeat, following chains of synonyms
• Filter

Turney Algorithm (ACL'02)
1. Extract a phrasal lexicon from reviews
2. Learn the polarity of each phrase
3. Rate a review by the average polarity of its phrases

Summary on Learning Lexicons
• Advantages:
– Can be domain-specific
– Can be more robust (more words)
• Intuition:
– Start with a seed set of words ('good', 'poor')
– Find other words that have similar polarity:
• Using "and" and "but"
• Using words that occur nearby in the same document
• Using WordNet synonyms and antonyms

Any Question?