Determining, Creating, and Encoding Semantic, Domain

Automatic Domain Adaptive
Sentiment Analysis Phase 1
Justin Martineau

Introduction






Outline
Problem Definition
Thesis Statement
Motivation
Background and Related Work
 Challenges
 Approaches
Research Plan
 Approach
 Evaluation
 Timeline
Conclusion
1. Intro - 2. Related Work - 3. Research Plan - 4. Conclusion
Problem Definition


Sentiment Analysis is the automatic detection and measurement of sentiment in text segments by machines.
3 Sub Tasks
 Objective vs. Subjective
 Topic Detection
 Positive vs. Negative
Commonly applied to web data
Very Domain Dependent
Sentiment Analysis Example
Thesis Statement
This dissertation will develop and evaluate techniques to discover and encode
domain-specific, domain-independent, and semantic knowledge to improve both
single- and multiple-domain sentiment analysis on textual data under
low-labeled-data conditions.
Motivation: Private Sector

Market Research
 Surveys
 Focus Groups
 Feature Analysis
 Customer targeting (free samples, etc.)
Consumer Sentiment Search
 Compare pros and cons
 Overall opinion of products/services
Motivation: Public Sector

Political
 Alternative Polling
 Determine popular support for legislation
 Choose campaign issues
National Security
 Detect individuals at risk for radicalization
 Determine local sentiment about US policy
 Determine local values and sentimental icons
 Portray actions positively using local flavor
Public Health
 Detect potential suicide victims
 Detect mentally unstable people
Challenges








Text Representation
Unedited Text
Sentiment Drift
Negation
Sarcasm
Sentiment Target Identification
Granularity
Domain Dependence
Domain Dependence 1
Domain Dependent Sentiment

The same sentence can mean two very different things in different domains
 Ex: “Read the book.” Good for books, bad for movies.
 Ex: “Jolting, heart pounding, You’re in for one hell of a bumpy ride!” Good for movies and books, bad for cars.
Sentimental word associations change with domain
 Fuzzy cameras are bad, but fuzzy teddy bears are good.
 Big trucks are good, but big iPods are bad.
 Bad is bad, but bad villains are good.
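The word-association examples above can be sketched as a lookup that checks the domain before falling back to a generic polarity. This is only an illustration: the table layout, the domain names, the numeric polarities, and the function name `polarity` are all hypothetical and are not part of the proposed system.

```python
# Hypothetical domain-specific lexicon keyed by (domain, word).
# The +1 / -1 values mirror the examples on the slide.
DOMAIN_POLARITY = {
    ("cameras", "fuzzy"): -1,
    ("teddy bears", "fuzzy"): +1,
    ("trucks", "big"): +1,
    ("iPods", "big"): -1,
    ("villains", "bad"): +1,
}

# Hypothetical domain-independent fallback ("bad is bad").
GENERIC_POLARITY = {"bad": -1}


def polarity(word, domain):
    """Prefer the domain-specific polarity, then the generic one, else 0."""
    if (domain, word) in DOMAIN_POLARITY:
        return DOMAIN_POLARITY[(domain, word)]
    return GENERIC_POLARITY.get(word, 0)


if __name__ == "__main__":
    for word, domain in [("fuzzy", "cameras"), ("fuzzy", "teddy bears"),
                         ("bad", "villains"), ("bad", "cars")]:
        print(word, "in", domain, "->", polarity(word, domain))
```

The lookup order reflects the idea that a domain-specific association, when one is known, should take precedence over the generic one.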
Domain Dependence 2
Endless Possibilities
Domain Dependence 3
Organization and Granularity
Theory of the Three Signals

Authors communicate messages using three types of signals
 Domain-Specific Signals
 Domain-Independent Signals
 Semantic Signals
More specific signals are generally more powerful than more generic signals
Domain-Specific Signals



Dependent on problem and domain
Considered more useful by readers
Tells what is good or bad about topic
Domain knowledge determines sentiment orientation
Very strong in context, but weak or misleading out of context
Can cause overgeneralization error when overvalued
New domain-specific signal words are ignored in CDT
Examples:
 Fuzzy teddy bears
 Sharp pictures
 Sharp knives
 Smooth rides
 New ideas
 Fast servers
 Fast cars
 Slow roasted burgers
 Slow motion
 Small cameras
 Big cars
Proposed Approach


Sentiment Search is more than just a classification problem
Detecting and Using the three signals
 Dynamic Domain Adapting Classifiers
 Generic Feature Detection using unlabeled data
 Semantic Feature Spaces
Dynamic Domain Adapting Classifiers
A (preferably domain-independent) model is built before query time on a set of labeled data, using computationally intense algorithms
Users interact at a query box level
Query results define the domain of interest
Domain-specific adaptations are calculated
 compares how the domain of interest differs from known cases
 uses semantic knowledge about word senses and relations
 must be a fast algorithm: users are waiting
Domain-specific adaptations are woven into the domain-independent model
 resulting model is temporary
 used to classify documents as positive, negative, or objective
Sentimental search results are processed for significant components and presented for human consumption
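A minimal sketch of the "weave, then classify" steps, under strong simplifying assumptions that do not come from the slides: the model is a plain word-weight dictionary, weaving adds the domain-specific adjustments to the base weights, and a document whose summed score stays inside a small margin is treated as objective. The names `weave`, `classify`, and `margin` are hypothetical, and how the adaptations themselves are calculated is left out.

```python
def weave(base_weights, adaptations):
    """Return a temporary model: base weights plus domain-specific adjustments."""
    model = dict(base_weights)  # copy, so the prebuilt base model is untouched
    for word, delta in adaptations.items():
        model[word] = model.get(word, 0.0) + delta
    return model


def classify(tokens, model, margin=0.25):
    """Label a document from its summed word weights (assumed decision rule)."""
    score = sum(model.get(token, 0.0) for token in tokens)
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "objective"


if __name__ == "__main__":
    # Hypothetical weights for illustration only.
    base = {"great": 1.0, "bad": -1.0, "fuzzy": -0.5}
    teddy_bear_adaptations = {"fuzzy": 1.5}  # would come from the query results
    temporary_model = weave(base, teddy_bear_adaptations)
    print(classify(["fuzzy", "bear"], base))
    print(classify(["fuzzy", "bear"], temporary_model))
```

Copying the base weights keeps the expensive prebuilt model reusable across queries, which matches the point that the adapted model is temporary.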
Overview
[Figure: system overview diagram. Labels as far as recoverable (grouping approximate): Query; Lucene Index; Query Results (Define a new Domain); Labeled data from known domain; Semantic Knowledge; Dynamic Component; Domain Analysis + Adapter; General Model; Context Specific Model; Sentiment Classifier; Sentimental Search Results; Business Intelligence. Key: User Level, Source Data, Knowledge, Labeled Data, Algorithms, Search Results.]
Subjective Context Scoring

Multiply:
 PMI(Word, Context)
 IDF
 Co-occurrence with known generic sentiment seed words times their bias (from movie reviews)
Seeds:
 bad, worst, stupid, ridiculous, terrible, poorly
 great, best, perfect, wonderful, excellent, effective
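The product above might be computed roughly as follows. This is a sketch under several assumptions that the slides do not state: documents are sets of tokens, probabilities are estimated from document frequencies, the context documents are a subset of the corpus, and each seed's bias is a placeholder of +1 or -1 (the slides take the biases from movie reviews but give no values). The function names are hypothetical.

```python
import math

NEGATIVE_SEEDS = {"bad", "worst", "stupid", "ridiculous", "terrible", "poorly"}
POSITIVE_SEEDS = {"great", "best", "perfect", "wonderful", "excellent", "effective"}

# Placeholder biases (assumption): +1 for positive seeds, -1 for negative seeds.
SEED_BIAS = {seed: 1.0 for seed in POSITIVE_SEEDS}
SEED_BIAS.update({seed: -1.0 for seed in NEGATIVE_SEEDS})


def doc_freq(word, docs):
    """Number of documents (sets of tokens) that contain the word."""
    return sum(1 for doc in docs if word in doc)


def subjective_context_score(word, context_docs, corpus_docs):
    """PMI(word, context) * IDF * bias-weighted seed co-occurrence."""
    n = len(corpus_docs)
    df_corpus = doc_freq(word, corpus_docs)
    df_context = doc_freq(word, context_docs)
    if df_corpus == 0 or df_context == 0:
        return 0.0
    # PMI(word, context) = log( P(word | context) / P(word) )
    pmi = math.log((df_context / len(context_docs)) / (df_corpus / n))
    idf = math.log(n / df_corpus)
    # Context documents in which the word appears with each seed, times the bias.
    seed_term = sum(
        bias * sum(1 for doc in context_docs if word in doc and seed in doc)
        for seed, bias in SEED_BIAS.items()
    )
    return pmi * idf * seed_term


if __name__ == "__main__":
    # Toy corpus for illustration only; the first three documents are the context.
    corpus = [
        {"ipod", "battery", "great"},
        {"ipod", "battery", "best"},
        {"ipod", "screen", "bad"},
        {"pizza", "great"},
        {"pizza", "cheese"},
        {"car", "engine"},
    ]
    context = corpus[:3]
    for word in ("battery", "screen", "pizza"):
        print(word, subjective_context_score(word, context, corpus))
```

With signed biases the score carries a direction as well as a magnitude, so a word that co-occurs mostly with negative seeds ends up with a negative score.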
Rocchio Baseline

Rocchio - Query Expansion algorithm for search
 Similar goals to ours: find more relevant words
 Does not account for sentiment
The new query is a weighted sum of
 Matching document vectors
 Query vector
 Non-matching document vectors (negative value)
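The weighted sum can be sketched with the textbook Rocchio update, which averages each document set into a centroid. The weights `alpha`, `beta`, and `gamma` and their defaults below are conventional textbook choices, not values taken from the slides, and the dense-list vector representation is an assumption made to keep the example small.

```python
def centroid(vectors, dim):
    """Component-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, matching, non_matching, alpha=1.0, beta=0.75, gamma=0.15):
    """Expanded query = alpha*query + beta*mean(matching) - gamma*mean(non_matching)."""
    dim = len(query)
    positive = centroid(matching, dim)
    negative = centroid(non_matching, dim)
    return [
        alpha * q + beta * p - gamma * n
        for q, p, n in zip(query, positive, negative)
    ]


if __name__ == "__main__":
    # Two-term vocabulary, toy vectors for illustration only.
    new_query = rocchio(
        query=[1.0, 0.0],
        matching=[[1.0, 1.0], [1.0, 1.0]],
        non_matching=[[0.0, 1.0]],
        alpha=1.0, beta=0.5, gamma=0.5,
    )
    print(new_query)
```

Terms that gain weight in the expanded query are the "more relevant words"; nothing in the update looks at sentiment, which is the limitation noted above.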
Papa John’s According to TFIDF
Papa John’s According to Subjective Context
George Bush According to TFIDF
George Bush According to Subjective Context
iPod according to Rocchio
iPod according to TFIDF
Sentimental Context

Components:
 PMI(Word, Context)
 TF
 IDF
 Log(actual co-occurrence of Word, Seed, Context / probability by chance)
Values:
 Abnormality to other docs
 Popular words in context
 Rare words in the corpus
 Words that occur with sentiment words in the query documents
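One possible reading of the last component, written out as a formula. The symbols w (word), s (seed word), and c (context, meaning the query documents) are notation introduced here, and conditioning every probability on c is an assumption about what "co-occurrence of Word, Seed, Context" means.

```latex
\log \frac{P(w, s \mid c)}{P(w \mid c)\, P(s \mid c)}
```

Under this reading the value is positive when w and s appear together in the query documents more often than chance would predict, zero when they are independent, and negative when they avoid each other.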
iPod according to Sentimental Context
iPod Nike according to Sentimental Context
iPod+Nike According to Apple
iPod Audio according to Sentimental Context
iPod Shuffle According to Sentimental Context
iPod Warranty According to Sentimental Context
iPod Battery according to Sentimental Context
iPod nano battery According to Sentimental Context
Google Hits (Battery Related):












iPod battery good ~ 13.5 Mill
iPod battery bad ~ 900 K
iPod nano battery good ~ 3 Mill
iPod nano battery bad ~ 785 K
iPod shuffle battery good ~ 1.6 Mill
iPod shuffle battery bad ~ 230 K
iPod shuffle battery price good ~ 2.6 Mill (not a typo)
iPod shuffle battery price bad ~ 230 K
iPod battery price good ~ 13.5 Mill
iPod battery price bad ~ 850 K
iPod nano battery price good ~ 3 Mill
iPod nano battery price bad ~ 785 K
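One way to read these counts is as a good-to-bad ratio per query. The sketch below uses the approximate hit counts from the first six rows above; treating the ratio as a rough sentiment indicator is an interpretation added here, and the names `HITS` and `good_bad_ratio` are hypothetical.

```python
# Approximate hit counts from the list above: query -> (good, bad).
HITS = {
    "iPod battery": (13_500_000, 900_000),
    "iPod nano battery": (3_000_000, 785_000),
    "iPod shuffle battery": (1_600_000, 230_000),
}


def good_bad_ratio(good, bad):
    """How many 'good' hits there are for every 'bad' hit."""
    return good / bad


if __name__ == "__main__":
    for query, (good, bad) in HITS.items():
        print(f"{query}: {good_bad_ratio(good, bad):.1f} good hits per bad hit")
```

Raw hit counts depend heavily on how common each query is, so a ratio is easier to compare across queries than the counts themselves.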
Summary



Interesting problem with many potential applications
Domain dependence is the core challenge
The keys to success are:
 Vast quantities of unlabeled data
 Semantic knowledge from freely available sources
 Semantics must guide and influence but not overrule the statistics
Questions?
BACKUP SLIDES
PMI - Pointwise Mutual Information
a.k.a. Specific Mutual Information
Do 2 variables occur more often with each other than chance?
$$\mathrm{PMI}(X, Y) = \log \frac{P(X \,\&\, Y)}{P(X)\,P(Y)}$$
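A small worked example of the PMI definition, estimating each probability from counts. The function name `pmi`, the count-based estimates, the natural logarithm, and the choice to return negative infinity when X and Y never co-occur are all assumptions made for illustration.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(X, Y) = log( P(X & Y) / (P(X) * P(Y)) ), estimated from counts."""
    if count_xy == 0:
        return float("-inf")  # X and Y never occur together
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))


if __name__ == "__main__":
    # Toy counts over 100 documents: X in 20, Y in 10.
    print(pmi(8, 20, 10, 100))  # together in 8 docs: more often than chance
    print(pmi(2, 20, 10, 100))  # together in 2 docs: about what chance predicts
    print(pmi(1, 20, 10, 100))  # together in 1 doc: less often than chance
```

A positive value answers the question on the slide with "yes", a value near zero means the two occur together about as often as chance predicts, and a negative value means less often.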