Opinion Detection by Transfer Learning
11-742 Information Retrieval Lab
Grace Hui Yang
Advised by Prof. Yiming Yang
Outline
• Introduction
• The Problem
• Transfer Learning by Constructing an
Informative Prior
• Datasets
• Evaluation Method
• Experimental Results
• Conclusion
Introduction
• TREC 2006 Blog Track
– Opinion Detection Task
<num> Number: 851
<title> "March of the Penguins"
<desc> Description:
Provide opinion of the film documentary "March of the Penguins".
<narr> Narrative:
Relevant documents should include opinions concerning the film
documentary "March of the Penguins". Articles or comments about
penguins outside the context of this film documentary are not
relevant.
Opinion Detection Literature Review
• Researchers in the natural language processing (NLP)
community
– Turney (2002): groups online words by whether their pointwise mutual
information (PMI) is closer to "excellent" or to "poor"
– Riloff & Wiebe (2003): use a high-precision classifier to collect
high-quality opinions and non-opinions, extract syntactic patterns
from them, and repeat the process to bootstrap
– Pang et al. (2002): treat opinion and sentiment detection as a
text classification problem
• Naive Bayes, Maximum Entropy, SVM + unigram presence (82.9% accuracy)
– Pang & Lee (2005): use minimum cuts to cluster sentences based on
their subjectivity and sentiment orientation
• Researchers from the data mining community
– Morinaga et al. (2002): use word polarity and syntactic pattern
matching rules to extract opinions, and PCA to create a
correspondence between product names and keywords
Existing System
• Query Expansion
• Document Retrieval
• Binary Text Classification by Bayesian
Logistic Regression
No Available Training Data
• Transfer Learning
– Transfer knowledge across similar tasks in
different domains
– Generalize knowledge from limited training
data
– Discover underlying general structures across
domains
Transfer Learning Literature Review
• Baxter (1997) and Thrun (1996): both used
hierarchical Bayesian learning
• Lawrence and Platt (2004), Yu et al. (2005): also
use hierarchical Bayesian models to learn the hyperparameters of Gaussian processes
• Ando and Zhang (2005): proposed a framework
for Gaussian logistic regression for text
classification
• Raina et al. (2006): continued this approach and
built informative priors for Gaussian logistic
regression
Transfer Learning
• The approach presented in this project is
inspired by the work of Raina, Ng &
Koller (2006) on text classification
• Transfer common knowledge (word
dependence) across similar tasks by
constructing an informative prior in a
Bayesian logistic regression framework
Logistic Regression Framework
• Logistic regression models the class probability with a
sigmoid: P(y = 1 | x, θ) = 1 / (1 + exp(−θᵀx))
• To avoid overfitting, a multivariate
Gaussian prior is placed on θ
• θ is learned by maximum a posteriori (MAP) estimation
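As a minimal sketch of MAP estimation under a zero-mean Gaussian prior θ ~ N(0, Σ), the code below maximizes the log-likelihood minus the quadratic penalty ½ θᵀΣ⁻¹θ by plain gradient ascent. The function name `fit_map_logreg`, the learning rate, and the iteration count are illustrative assumptions, not the system described in these slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map_logreg(X, y, Sigma, lr=0.1, n_iter=2000):
    """MAP estimate of theta for logistic regression with prior N(0, Sigma).

    X: (n, d) feature matrix, y: (n,) labels in {0, 1},
    Sigma: (d, d) positive-definite prior covariance.
    Hypothetical helper: fixed-step gradient ascent, no convergence check.
    """
    n, d = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        # Gradient of the log-posterior: likelihood term minus prior term
        grad = X.T @ (y - p) - Sigma_inv @ theta
        theta += lr * grad / n
    return theta
```

With a diagonal Σ = σ²I this reduces to ordinary L2-regularized logistic regression; a non-diagonal Σ lets coefficients of related words be pulled in the same direction.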
Non-diagonal Covariance
• Zero-mean, equal variance Prior
– Cannot capture relationship among words
• Zero-mean, non-diagonal covariance Prior
– Model word dependency in covariance
matrix’s off-diagonal entries
Pairwise Covariance
• Covariance definition: Cov(θi, θj) = E[(θi − E[θi])(θj − E[θj])]
• Given zero mean, Cov(θi, θj) = E[θi θj]
Get Covariance by MCMC
• Markov Chain Monte Carlo (MCMC)
• Sample V (V=4) small vocabularies of
size S (S=5), each containing the two words wi
and wj corresponding to θi and θj.
• From each vocabulary, sample T (T=4)
training sets of size Z (Z=3), and train an
ordinary logistic regression model on each
labeled training set
Get Covariance by MCMC
• Subtract a bootstrap estimate of the
covariance that is due to random variation
in the training set
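The sampling procedure can be sketched roughly as follows. This is an illustration under assumptions: the function names are hypothetical, the "ordinary" logistic regression is trained by a fixed number of gradient steps, and the bootstrap correction is taken to be the covariance of the two coefficients across training sets within each vocabulary, which is one plausible reading of the slide rather than a confirmed formula.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, n_iter=200):
    """Unregularized logistic regression by a fixed number of gradient steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += lr * (X.T @ (y - p)) / len(y)
    return theta

def estimate_pair_covariance(X, y, i, j, V=4, S=5, T=4, Z=3, seed=0):
    """Estimate Cov(theta_i, theta_j) for word columns i and j of X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    others = [k for k in range(d) if k not in (i, j)]
    thetas = np.zeros((V, T, 2))
    for v in range(V):
        # Small vocabulary: the two words plus S - 2 randomly chosen others
        vocab = [i, j] + list(rng.choice(others, size=S - 2, replace=False))
        for t in range(T):
            # Small training set of Z labeled examples
            rows = rng.choice(n, size=Z, replace=False)
            theta = train_logreg(X[np.ix_(rows, vocab)], y[rows])
            thetas[v, t] = theta[:2]
    # Zero-mean assumption: the covariance is E[theta_i * theta_j]
    raw = np.mean(thetas[..., 0] * thetas[..., 1])
    # Assumed correction: covariance across training sets within a vocabulary
    centered = thetas - thetas.mean(axis=1, keepdims=True)
    noise = np.mean(centered[..., 0] * centered[..., 1])
    return raw - noise
```

Calling `estimate_pair_covariance(X, y, i, j)` for a word pair returns a single scalar; repeating it over many pairs gives the raw pairwise estimates that the later slides turn into a valid covariance matrix.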
Learning a Covariance Matrix
• Learning a single covariance for pairs of
regression coefficients is NOT all we need
• Two Challenges:
(1) Valid Covariance Matrix
– A valid covariance matrix must be positive
semi-definite (PSD)
– That is, a Hermitian matrix (square, self-adjoint) with
nonnegative eigenvalues
– Project the matrix onto the PSD cone
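A small sketch of the projection step: for a real symmetric matrix, the nearest PSD matrix in Frobenius norm is obtained by eigendecomposition and clipping negative eigenvalues to zero. The helper name `project_to_psd` is an assumption for illustration.

```python
import numpy as np

def project_to_psd(A):
    """Project a square real matrix onto the PSD cone (Frobenius norm)."""
    # Symmetrize first so that eigh applies
    A_sym = (A + A.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(A_sym)
    # Drop the negative part of the spectrum
    eigvals = np.clip(eigvals, 0.0, None)
    # Rebuild Q diag(w) Q^T
    return (eigvecs * eigvals) @ eigvecs.T
```

For example, the matrix [[1, 2], [2, 1]] has eigenvalues 3 and −1; projecting it keeps only the eigenvalue-3 component.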
Learning a Covariance Matrix
(2) The number of pairwise calculations grows
quadratically with vocabulary
size
– Represent the word dependence as a linear
combination of underlying features
– Learn the coefficients by least squared error
Learning a Covariance Matrix by Joint Minimization
• λ is the trade-off coefficient between the
two objectives.
– As λ → 0, we only care about the PSD cone
– As λ → 1, we only care about the word pair
relationship
– We set λ = 0.6
Solve the Joint Minimization
• The problem is convex, so the procedure converges to the global minimum
• Fix Σ, minimize over ψ
– Use a quadratic program (QP) solver
• Fix ψ, minimize over Σ
– A special case of semi-definite programming (SDP)
– Eigendecomposition, keeping only the nonnegative
eigenvalues
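The alternating scheme might look like the sketch below. It assumes an objective of the form λ Σ (e_ij − ψᵀf_ij)² + (1 − λ) Σ (Σ_ij − ψᵀf_ij)² with Σ constrained to be PSD, where e_ij are the sampled pairwise covariances and f_ij are pair features; the objective is not shown in the slide text, so this form is an assumption. The sketch also swaps in unconstrained least squares (`numpy.linalg.lstsq`) for the QP solver, and all names are hypothetical.

```python
import numpy as np

def project_to_psd(M):
    """Symmetrize, then clip negative eigenvalues to zero."""
    M = (M + M.T) / 2.0
    w, Q = np.linalg.eigh(M)
    return (Q * np.clip(w, 0.0, None)) @ Q.T

def learn_covariance(E, F, lam=0.6, n_iter=50):
    """Alternate between psi (least squares) and Sigma (PSD projection).

    E: (d, d) pairwise covariance estimates, assumed available for all pairs.
    F: (d, d, k) feature vector f_ij for every word pair.
    """
    d, _, k = F.shape
    F_flat = F.reshape(d * d, k)
    Sigma = project_to_psd(E)
    for _ in range(n_iter):
        # Fix Sigma: both squared terms combine into one least-squares
        # problem whose target is the lam-weighted average of E and Sigma
        target = lam * E + (1.0 - lam) * Sigma
        psi, *_ = np.linalg.lstsq(F_flat, target.reshape(-1), rcond=None)
        # Fix psi: the closest PSD matrix to the predicted covariances
        Sigma = project_to_psd((F_flat @ psi).reshape(d, d))
    return Sigma, psi
```

Because the last step of every iteration is a projection, the returned Σ is always PSD, whatever the quality of the feature fit.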
Feature Design
• Model word dependency
– WordNet synsets
– and?
• People do not always use the same general
syntactic patterns to express opinion
– "blah blah is good",
– "awesome blah blah!"
Target-Opinion Word Pair
• Different opinion targets are described with different
customary expressions
– A person is knowledgeable
– A computer processor is fast
– A computer processor is knowledgeable (ill)
– A person is fast (ill)
– A computer processor is running like a horse
(word polarity test fails)
Target-Opinion Word Pair
• From each positive example in the training
corpus, extract
– subject and object (excluding pronouns)
• “Melvin, pig”
– subject and BE-predicate
• “lens, clear”, “base, heavy”
– modifier and subject
• “good, coffee” , “interesting, movie”
Word Synonym
• Bridge the vocabulary gap between training and
testing
– "This movie is good" in the training corpus
– "The film is really good" in the testing corpus
Feature Vector
• Log-cooccurrence
• Target-Opinion
• Synonym
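To make the three feature types concrete, here is a hedged sketch of a feature vector for one word pair. The exact definitions are not given in the slides, so this assumes log(1 + sentence co-occurrence count) for the first feature and simple 0/1 indicators for the other two; the function name and the input structures (token sets, a pair set, a synonym dictionary) are hypothetical.

```python
import math

def pair_features(w1, w2, sentences, target_opinion_pairs, synonyms):
    """Feature vector [log-cooccurrence, target-opinion, synonym] for (w1, w2).

    sentences: list of sets of tokens
    target_opinion_pairs: set of (word, word) tuples extracted from training data
    synonyms: dict mapping a word to a set of its synonyms (e.g. from WordNet)
    """
    # Count sentences in which both words appear
    cooc = sum(1 for s in sentences if w1 in s and w2 in s)
    is_pair = (w1, w2) in target_opinion_pairs or (w2, w1) in target_opinion_pairs
    is_syn = w2 in synonyms.get(w1, ()) or w1 in synonyms.get(w2, ())
    return [math.log(1 + cooc), 1.0 if is_pair else 0.0, 1.0 if is_syn else 0.0]
```

A pair such as ("film", "movie") would then score on the synonym feature even if the two words never co-occur in the training corpus.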
Datasets
• Training Corpus
– Movie reviews [Pang & Lee from Cornell]
• 10,000 sentences (5,000 opinions, 5,000 non-opinions)
– Product reviews [Hu & Liu from UIC]
• 4,000+ sentences (2,034 opinions, 2,173 non-opinions)
• Digital camera, cell phone, DVD player, Jukebox,
…
Datasets
• Test Corpus
– TREC 2006 Blog corpus
– 3,201,002 articles (TREC reports 3,215,171)
– December 2005 to February 2006
– Technorati, Bloglines, Blogpulse …
• For each topic, 5,000 passages are retrieved
– Using Lemur as the search engine
– 132,399 passages in total
– 2,648 passages per topic on average
– Each passage has 1-10 sentences (fewer than 100 words)
Evaluation Method
• Precision at 11-point recall levels
• Mean average precision (MAP)
• Answers are provided by TREC qrels
– Document ids of documents containing an opinion
• Note that our system is developed for opinion
detection at the sentence level
– A passage's score is the average score of all the sentences in the
retrieved passage
– Extract unique document ids to compare with TREC
qrels
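The evaluation steps can be sketched as below: average sentence scores into a passage score, reduce the ranked passages to unique document ids, then compute average precision and 11-point interpolated precision against the relevant set. Ranking a document by its best-scoring passage is an assumption made for this sketch, and the function names are hypothetical.

```python
def rank_documents(passages):
    """passages: list of (doc_id, [sentence scores]). Returns unique doc ids."""
    scored = [(sum(s) / len(s), doc_id) for doc_id, s in passages if s]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    seen, ranking = set(), []
    for _, doc_id in scored:
        # Keep only the first (best-scoring) occurrence of each document
        if doc_id not in seen:
            seen.add(doc_id)
            ranking.append(doc_id)
    return ranking

def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision at any recall >= the level
    return [max((p for r, p in points if r >= lv), default=0.0) for lv in levels]
```

MAP is then the mean of `average_precision` over all topics, with `relevant` taken from the qrels for each topic.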
Experimental Results
• Effects of Using Non-diagonal Prior Covariance
– Baseline: Using movie reviews to train the Gaussian
logistic regression model with prior ~N(0, σ²)
– Feature Selection: Using common word features in
movie reviews and product reviews to train the
Gaussian logistic regression model with prior ~N(0, σ²)
– Informative Prior: Using movie reviews to calculate
the prior covariance, then training the Gaussian logistic regression model
with the informative prior ~N(0, Σ)
– Result: 32% improvement
Experimental Results
• Effects of Feature Design
– Baseline: Using movie reviews to train the
Gaussian logistic regression model with prior ~N(0, σ²),
bi-gram model
– Transfer Learning Using Synonyms: Using
informative prior ~N(0, Σ)
– Transfer Learning Using Target-Opinion
pairs: informative prior ~N(0, Σ)
– Transfer Learning Using Both: informative
prior ~N(0, Σ)
– Result: the Target-Opinion pair is a good feature
Experimental Results
• Effects of External Dataset Selection
– Negative effect of transfer learning
Why Does the Negative Effect Occur?
• Movie reviews cover more general topics
• Product reviews share only 23% of the topics
Conclusion
• Applied transfer learning to opinion
detection
• Transfer learning by informative prior
improves on brute-force transfer learning by 32%
• Discovered a good feature for opinion
detection
– Target-Opinion pair
• External datasets must be chosen
carefully
Thank You!