Social Media Mining: An Introduction

TJTSD66: Advanced Topics in Social Media
(Social Media Mining)
Natural Language Processing
Dr. WANG, Shuaiqiang @ CS & IS, JYU
Email: shuaiqiang.wang@jyu.fi
Homepage: http://users.jyu.fi/~swang/
Part I: What is Natural Language Processing?
Social Media Mining
Natural Language Processing
Slide 2 of 66
2
NLP Example: Question Answering
NLP Example: Information Extraction
NLP Example: Sentiment Analysis
Attributes:
• zoom
• affordability
• size and weight
• flash
• ease of use

• size and weight
  – nice and compact to carry!
  – since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
  – the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera
NLP Example: Machine Translation
Language Technology
Part II: Preprocessing Techniques in NLP
Normalization
• Information Retrieval: indexed text & query terms must have same form.
• am, is, are → be
• U.S.A. → USA
• car, cars, car’s, cars’ → car
• automate(s), automatic, automation → automat
• US vs. us!!!
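A minimal sketch of such normalization; the rules below (strip periods and possessives, case-fold, crudely drop a plural "s") are toy stand-ins for a real stemmer or lemmatizer, and the case-folding line illustrates exactly the US vs. us problem:

```python
def normalize(token):
    """Toy normalizer: strip periods and possessives, case-fold,
    and crudely drop a plural 's'. A real system would use a
    stemmer/lemmatizer (e.g. Porter stemming) instead."""
    t = token.replace(".", "")                   # U.S.A. -> USA
    t = t.replace("'s", "").replace("s'", "s")   # car's -> car, cars' -> cars
    t = t.lower()                                # case-folding loses US vs. us!
    if t.endswith("s") and len(t) > 3:
        t = t[:-1]                               # cars -> car (very crude)
    return t

print(normalize("U.S.A."), normalize("car's"), normalize("cars'"))  # usa car car
```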
Sentence Segmentation
• “!” and “?” are relatively unambiguous
• Period “.” is quite ambiguous
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Build a binary classifier for “.”
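A rule-based sketch of such a classifier; the abbreviation list and the three features below are illustrative, and a real system would learn a decision tree or classifier over features like these rather than hand-code them:

```python
import re

ABBREVS = {"dr", "mr", "mrs", "inc", "etc"}   # toy abbreviation list

def is_sentence_boundary(text, i):
    """Decide whether the period at text[i] ends a sentence, using
    decision-tree-style features: preceding abbreviation? digit after
    it (a number like 4.3)? capitalized next word?"""
    word = re.findall(r"[\w.]*$", text[:i])[0].rstrip(".").lower()
    if word in ABBREVS:
        return False                          # "Dr." -> not a boundary
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False                          # "4.3" -> not a boundary
    nxt = text[i + 1:].lstrip()
    return not nxt or nxt[0].isupper()        # capitalized next word -> boundary

s = "Dr. Smith paid 4.3 dollars. He left."
print([i for i, ch in enumerate(s) if ch == "." and is_sentence_boundary(s, i)])
# [26, 35]
```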
Sentence Segmentation: Decision Tree
Word Tokenization
• English
  – “San Francisco”, “New York” as a token
• Chinese and Japanese
  – No spaces between words
  – 莎拉波娃现在居住在美国东南部的佛罗里达
  – 莎拉波娃/现在/居住/在/美国/东南部/的/佛罗里达
  – Also called Word Segmentation in Chinese (中文分词)
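A common baseline for Chinese word segmentation is greedy maximum matching against a dictionary: repeatedly take the longest dictionary word that prefixes the remaining text. The tiny dictionary below covers only this slide's example sentence:

```python
# Toy dictionary containing just the words of the example sentence.
DICT = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}
MAXLEN = max(len(w) for w in DICT)

def segment(text):
    """Greedy maximum-match segmentation, falling back to single characters."""
    out = []
    while text:
        for k in range(min(MAXLEN, len(text)), 0, -1):
            if text[:k] in DICT or k == 1:
                out.append(text[:k])
                text = text[k:]
                break
    return out

print("/".join(segment("莎拉波娃现在居住在美国东南部的佛罗里达")))
# 莎拉波娃/现在/居住/在/美国/东南部/的/佛罗里达
```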
Part III: Language Modeling
Probabilistic Language Models
• Assign a Probability to a sentence
– Machine Translation:
• P(high winds tonight) > P(large winds tonight)
– Spell Correction
• The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets
from)
– Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
– Summarization, question-answering, etc.
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or
sequence of words:
– P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
– P(w5|w1,w2,w3,w4)
• A model that computes either of these, P(W) or P(wn|w1,w2…wn-1), is called a language model
How to Compute P(W)
Example
• P(“its water is so transparent”) = P(its) ×
P(water|its) × P(is|its water) × P(so|its water is)
× P(transparent|its water is so)
• P(the|its water is so transparent that) =
Count(its water is so transparent that the) /
Count(its water is so transparent that) ???
• No! We’ll never see enough data for estimation.
Markov Assumption
• Simplifying assumption:
– P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
– P(the | its water is so transparent that) ≈ P(the |
transparent that)
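Under the bigram version of this assumption, the model only needs counts of adjacent word pairs. A minimal maximum-likelihood sketch over a toy corpus (the sentence markers and corpus are illustrative):

```python
from collections import Counter

# Maximum-likelihood bigram estimates from a toy corpus, under the
# Markov assumption P(wi | w1..wi-1) ≈ P(wi | wi-1).
sent = "<s> its water is so transparent that the water is clear </s>".split()
unigrams = Counter(sent)
bigrams = Counter(zip(sent, sent[1:]))

def p(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("is", "water"))  # count("water is") / count("water") = 2/2 = 1.0
print(p("so", "is"))     # 1/2 = 0.5
```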
Markov Assumption
N-gram Models
Smoothing
• Intuition
– A word with NO occurrence in the corpus does NOT have probability ZERO!
– Maybe the corpus is just too small
• Smoothing
– Steal probability mass to generalize better
Add-one (Laplace) Smoothing
Berkeley Restaurant Corpus
Laplace-Smoothed Bigrams
Reconstituted Counts
Compare with Raw Counts
Disadvantage
• Add-1 estimation is a blunt instrument
• So add‐1 isn’t used for N-grams:
– We’ll see better methods
• But add‐1 is used to smooth other NLP models
– For text classification
– In domains where the number of zeros isn’t so huge.
Good Turing Smoothing
• Add-1 Smoothing
• Add-k Smoothing: more general

  P_Add-k(wi | wi−1) = ( c(wi−1, wi) + m·(1/V) ) / ( c(wi−1) + m )
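The add-k formula can be sketched directly over a toy corpus (the corpus and the value of m are illustrative; note that m = V recovers add-one smoothing):

```python
from collections import Counter

def make_addk_model(tokens, m=1.0):
    """Add-k smoothed bigram per the slide:
    P(wi | wi-1) = (c(wi-1, wi) + m * (1/V)) / (c(wi-1) + m).
    Setting m = V recovers add-one (Laplace) smoothing."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)
    def p(w, prev):
        return (bigrams[(prev, w)] + m / V) / (unigrams[prev] + m)
    return p

p = make_addk_model("the cat sat on the mat".split(), m=1.0)
print(p("cat", "the"))  # (1 + 1/5) / (2 + 1) = 0.4
print(p("dog", "the"))  # (0 + 1/5) / (2 + 1) ≈ 0.067 -- unseen but nonzero
```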
Unigram Prior Smoothing
• Let kV = m:

  P_Add-k(wi | wi−1) = ( c(wi−1, wi) + m·(1/V) ) / ( c(wi−1) + m )

• 1/V is the probability of each word under the uniform distribution
• The word distribution P(wi) can instead be estimated with unigram prior knowledge, so the smoothing formula can be rewritten:

  P_UP(wi | wi−1) = ( c(wi−1, wi) + m·P(wi) ) / ( c(wi−1) + m )
Jelinek-Mercer Smoothing
• Let

  λ = m / ( c(wi−1) + m ),   1 − λ = c(wi−1) / ( c(wi−1) + m )

• Then

  p_UP(wi | wi−1) = (1 − λ) · c(wi−1, wi) / c(wi−1) + λ · p(wi)
                  = (1 − λ) · p(wi | wi−1) + λ · p(wi)

• Bigram interpolated model:

  p_interp(wi | wi−1) = (1 − λ) · p(wi | wi−1) + λ · p(wi)
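The interpolated bigram model can be sketched with a fixed λ (here a hand-picked value; in practice λ is tuned on held-out data):

```python
from collections import Counter

def make_interp_model(tokens, lam=0.2):
    """Jelinek-Mercer bigram interpolation:
    p_interp(wi | wi-1) = (1 - lam) * p_ML(wi | wi-1) + lam * p(wi)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    def p(w, prev):
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        return (1 - lam) * p_bi + lam * unigrams[w] / N
    return p

p = make_interp_model("the cat sat on the mat".split(), lam=0.2)
print(p("cat", "the"))  # 0.8 * (1/2) + 0.2 * (1/6) ≈ 0.433
print(p("sat", "the"))  # unseen bigram backs off to 0.2 * (1/6) ≈ 0.033
```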
Part IV: Sentiment Analysis
Positive or Negative in Movie Reviews?
• Unbelievably disappointing
• Full of zany characters and richly applied satire,
and some great plot twists
• This is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the
boxing scenes.
Google Product Search
Twitter Sentiment vs. Gallup Poll of Consumer Confidence
Twitter Sentiment
• Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2:1, 1-8. doi:10.1016/j.jocs.2010.12.007
Other Names of Sentiment Analysis
• Opinion extraction
• Opinion mining
• Sentiment mining
• Subjectivity analysis
Why sentiment analysis?
• Movie: is this review positive or negative?
• Products: what do people think about the
new iPhone?
• Public sentiment: how is consumer
confidence? Is despair increasing?
• Politics: what do people think about this
candidate or issue?
• Prediction: predict election outcomes or
market trends from sentiment
Sentiment Analysis
• Simplest task:
– Is the attitude of this text positive or negative?
• More complex:
– Rank the attitude of this text from 1 to 5
• Advanced:
– Detect the target, source, or complex attitude types
Sentiment Classification for Movie Reviews
• Polarity detection:
– Is an IMDB movie review positive or negative?
• Data: Polarity Data 2.0:
– http://www.cs.cornell.edu/people/pabo/moviereview-data
• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
2002. Thumbs up? Sentiment Classification using
Machine Learning Techniques. EMNLP, 79-86.
• Bo Pang and Lillian Lee. 2004. A Sentimental
Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts. ACL, 271-278.
IMDB data in the Pang and Lee database

• Positive review (excerpts):
  – when _star wars_ came out some twenty years ago, the image of traveling throughout the stars has become a commonplace image. […]
  – when han solo goes light speed, the stars change to bright lines, going towards the viewer in lines that converge at an invisible point. cool.
  – _october sky_ offers a much simpler image--that of a single white dot, traveling horizontally across the night sky. […]

• Negative review (excerpts):
  – "snake eyes" is the most aggravating kind of movie: the kind that shows so much potential then becomes unbelievably disappointing.
  – it's not just because this is a brian depalma film, and since he's a great director and one who’s films are always greeted with at least some fanfare.
  – and it's not even because this was a film starring nicolas cage and since he gives a brauvara performance, this film is hardly worth his talents.
Baseline Algorithm
• Tokenization
• Feature Extraction
• Classification using different classifiers
– Naïve Bayes
– MaxEnt
– SVM
Sentiment Tokenization Issues
• Deal with HTML and XML markup
• Twitter mark-up (names, hash tags)
• Capitalization (preserve for words in all caps)
• Phone numbers, dates
• Emoticons
• Useful code:
  – Christopher Potts sentiment tokenizer
  – Brendan O’Connor twitter tokenizer
Extracting Features
• How to handle negation
– I didn’t like this movie!
vs
– I really like this movie!
• Which words to use?
– Only adjectives
– All words
• All words turns out to work better, at least on this data
Negations
• Add NOT_ to every word between negation and
following punctuation:
– didn’t like this movie , but I
– didn’t NOT_like NOT_this NOT_movie but I
• Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon:
Extracting Market Sentiment from Stock Message
Boards. In APFA.
• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
2002. Thumbs up? Sentiment Classification using
Machine Learning Techniques. EMNLP, 79-86.
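A sketch of this NOT_ prepending scheme; the negation word list and punctuation set below are simplified assumptions (here punctuation is kept in the output rather than dropped):

```python
import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't"}

def add_not(tokens):
    """Prefix NOT_ to every token between a negation word and the
    next punctuation mark, as in Das & Chen (2001) / Pang et al. (2002)."""
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negating = False          # punctuation ends the negation scope
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True
    return out

print(" ".join(add_not("didn't like this movie , but I".split())))
# didn't NOT_like NOT_this NOT_movie , but I
```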
Boolean feature for Multinomial Naïve Bayes
• Intuition:
– For sentiment (and probably for other text
classification domains)
– Word occurrence may matter more than word
frequency
• The occurrence of the word fantastic tells us a lot
• The fact that it occurs 5 times may not tell us much more.
– Boolean Multinomial Naïve Bayes
• Clips all the word counts in each document at 1
– Other possibility: log(freq(w))
Boolean Multinomial Naïve Bayes
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
• Calculate P(wk|cj) terms
• Make predictions
  c_NB = argmax_{cj ∈ C}  P(cj) · ∏_{i ∈ positions} P(wi | cj)
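A sketch of Boolean Multinomial Naïve Bayes with Laplace-smoothed estimates; the only difference from ordinary multinomial NB is converting each document to a set, which clips its word counts at 1 (the toy training documents are illustrative):

```python
from collections import Counter, defaultdict
from math import log

def train_boolean_mnb(docs):
    """Boolean Multinomial Naive Bayes: like multinomial NB, but each
    word is counted at most once per document (counts clipped at 1)."""
    vocab = {w for words, _ in docs for w in words}
    classes = {c for _, c in docs}
    prior, counts, totals = {}, defaultdict(Counter), Counter()
    for c in classes:
        class_docs = [set(words) for words, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        for d in class_docs:
            counts[c].update(d)          # set(...) is the Boolean clipping
        totals[c] = sum(counts[c].values())
    def predict(words):
        def score(c):                    # log P(c) + sum log P(w|c), Laplace-smoothed
            s = log(prior[c])
            for w in words:
                if w in vocab:
                    s += log((counts[c][w] + 1) / (totals[c] + len(vocab)))
            return s
        return max(classes, key=score)
    return predict

train = [
    ("fantastic great plot".split(), "pos"),
    ("great great acting".split(), "pos"),
    ("boring pathetic plot".split(), "neg"),
]
predict = train_boolean_mnb(train)
print(predict("great fantastic".split()))  # pos
```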
Other issues in Classification
• MaxEnt and SVM tend to do better than Naïve
Bayes
• Subtlety makes reviews hard to classify
– Perfume review in Perfumes: the Guide
• “If you are reading this because it is your darling fragrance,
please wear it at home exclusively, and tape the windows
shut.”
Thwarted Expectations and Ordering Effects
• “This film should be brilliant. It sounds like a
great plot, the actors are first grade, and the
supporting cast is good as well, and Stallone is
attempting to deliver a good performance.
However, it can’t hold up.”
• Well as usual Keanu Reeves is nothing special,
but surprisingly, the very talented Laurence
Fishburne is not so good either, I was
surprised.
Sentiment Lexicons
• The General Inquirer, 1966.
– Home page: http://www.wjh.harvard.edu/~inquirer
– List of Categories:
http://www.wjh.harvard.edu/~inquirer/homecat.htm
– Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
– Categories:
• Positive (1915 words) and Negative (2291 words)
• Strong vs. Weak, Active vs. Passive, Overstated vs.
Understated
• Pleasure, Pain, Virtue, Vice, Motivation, Cognitive
Orientation, etc.
– Free for Research Use
Sentiment Lexicons
• LIWC (Linguistic Inquiry and Word Count),
2007
– Home page: http://www.liwc.net/
– 2300 words, >70 classes
– Affective Processes
• negative emotion (bad, weird, hate, problem, tough)
• positive emotion (love, nice, sweet)
– Cognitive Processes
• Tentative (maybe, perhaps, guess), Inhibition (block,
constraint)
– Pronouns, Negation (no, never), Quantifiers (few,
many)
– $30 or $90 fee
Sentiment Lexicons
• MPQA Subjectivity Cues Lexicon (EMNLP’03,
05)
– Home page: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
• 6885 words from 8221 lemmas
• 2718 positive and 4912 negative
– Each word annotated for intensity (strong, weak)
– GNU GPL
Sentiment Lexicons
• Bing Liu Opinion Lexicon (KDD‐2004)
– Bing Liu's Page on Opinion Mining
– http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
– 6786 words: 2006 positive and 4783 negative
Sentiment Lexicons
• SentiWordNet (LREC‐2010)
– Home page: http://sentiwordnet.isti.cnr.it/
– All WordNet synsets automatically annotated for
degrees of positivity, negativity, and
neutrality/objectiveness
Disagreements between Polarity Lexicons
Christopher Potts, Sentiment Tutorial, 2011
Learning Sentiment Lexicons
• Semi-supervised learning of lexicons
– Use a small amount of information
– A few labeled examples
– A few hand-built patterns
– To bootstrap a lexicon
Intuition for identifying word polarity (ACL’97)
• Adjectives conjoined by “and” have same
polarity
– Fair and legitimate, corrupt and brutal
– *fair and brutal, *corrupt and legitimate
• Adjectives conjoined by “but” do not
– fair but brutal
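This intuition can be operationalized by mining “X and Y” pairs from text; a minimal sketch (the adjective set and sentence are illustrative, and real systems parse conjunctions rather than regex-match them):

```python
import re

# Sketch: adjectives conjoined by "and" tend to share polarity, so
# "X and Y" pairs link words of the same orientation (Hatzivassiloglou
# & McKeown 1997). This just regex-matches conjunctions.
def and_pairs(text, adjectives):
    pairs = []
    for a, b in re.findall(r"(\w+) and (\w+)", text.lower()):
        if a in adjectives and b in adjectives:
            pairs.append((a, b))
    return pairs

adjs = {"fair", "legitimate", "corrupt", "brutal"}
text = "The ruling was fair and legitimate, unlike the corrupt and brutal regime."
print(and_pairs(text, adjs))  # [('fair', 'legitimate'), ('corrupt', 'brutal')]
```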
Step 1
• Label seed set of 1336 adjectives (all occurring >20 times in the 21-million-word WSJ corpus)
– 657 positive
adequate central clever famous intelligent remarkable reputed
sensitive slender thriving…
– 679 negative
contagious drunken ignorant lanky listless primitive strident
troublesome unresolved unsuspecting…
Step 2
• Expand seed set to conjoined adjectives
Step 3
• Supervised classifier assigns “polarity similarity”
to each word pair, resulting in graph:
Step 4
• Clustering for partitioning the graph into two
Output Polarity Lexicon
• Positive
– bold decisive disturbing generous good honest
important large mature patient peaceful positive
proud sound stimulating straightforward strange
talented vigorous witty…
• Negative
– ambiguous cautious cynical evasive harmful
hypocritical inefficient insecure irrational
irresponsible minor outspoken pleasant reckless risky
selfish tedious unsupported vulnerable wasteful…
Using WordNet to learn polarity
• WordNet: an online thesaurus (covered in a later lecture)
• Create positive (“good”) and negative seed words (“terrible”)
• Find synonyms and antonyms
  – Positive set: add synonyms of positive words (“well”) and antonyms of negative words
  – Negative set: add synonyms of negative words (“awful”) and antonyms of positive words (“evil”)
• Repeat, following chains of synonyms
• Filter
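A sketch of this bootstrap loop; the SYNONYMS/ANTONYMS dictionaries are toy stand-ins for real WordNet lookups (e.g. via NLTK's wordnet interface), and the filtering step is omitted:

```python
# Toy stand-ins for WordNet synonym/antonym lookups; real code would
# query a thesaurus API instead of these hand-made dictionaries.
SYNONYMS = {
    "good": {"well", "fine"}, "well": {"good"},
    "terrible": {"awful", "dreadful"}, "awful": {"terrible", "horrid"},
}
ANTONYMS = {"good": {"bad", "evil"}, "terrible": {"great"}}

def bootstrap(seeds, depth=2):
    """Expand a seed set by repeatedly adding synonyms of its members."""
    lexicon = set(seeds)
    for _ in range(depth):
        lexicon |= {s for w in lexicon for s in SYNONYMS.get(w, set())}
    return lexicon

positive = bootstrap({"good"}) | ANTONYMS["terrible"]
negative = bootstrap({"terrible"}) | ANTONYMS["good"]
print(sorted(positive))  # ['fine', 'good', 'great', 'well']
print(sorted(negative))  # ['awful', 'bad', 'dreadful', 'evil', 'horrid', 'terrible']
```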
Turney Algorithm (ACL’02)
1. Extract a phrasal lexicon from reviews
2. Learn polarity of each phrase
3. Rate a review by the average polarity of its
phrases
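Step 3 can be sketched as follows; the phrase polarity values here are hypothetical placeholders for the scores Turney learns in step 2 from search-engine co-occurrence (PMI with “excellent” vs. “poor”):

```python
# Hypothetical phrase polarities standing in for Turney's learned
# PMI-based scores (positive = associated with "excellent").
PHRASE_POLARITY = {
    "great plot": 2.3,
    "first grade": 1.1,
    "unbelievably disappointing": -3.4,
    "very light": -0.5,
}

def rate_review(phrases):
    """Rate a review by the average polarity of its known phrases."""
    scores = [PHRASE_POLARITY[p] for p in phrases if p in PHRASE_POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

print(rate_review(["great plot", "unbelievably disappointing", "very light"]))
# ≈ -0.533, i.e. a mildly negative review
```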
Summary on Learning Lexicons
• Advantages:
– Can be domain‐specific
– Can be more robust (more words)
• Intuition
– Start with a seed set of words (‘good’, ‘poor’)
– Find other words that have similar polarity:
• Using “and” and “but”
• Using words that occur nearby in the same document
• Using WordNet synonyms and antonyms
Any Questions?