Exploiting Associations between Word
Clusters and Document Classes for
Cross-domain Text Categorization
Fuzhen Zhuang, Ping Luo, Hui Xiong,
Qing He, Yuhong Xiong, Zhongzhi Shi
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• Conclusions
Fuzhen Zhuang et al., SDM 2010
Introduction
• Many traditional learning techniques work well only under the assumption that
the training and test data follow the same distribution. When that assumption is
violated, they fail!
Enterprise News Classification: including the classes "Product Announcement",
"Business Scandal", "Acquisition", … …
Training (labeled): HP news
"Product announcement: HP's just-released LaserJet Pro P1100 printer and the
LaserJet Pro M1130 and M1210 multifunction printers, price … performance ..."
Test (unlabeled): Lenovo news
"Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300
desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off
Lenovo IdeaPad U450p laptop using. ...their performance"
The training and test documents come from different companies, so the classifier
is trained and applied under different distributions.
Motivation (1)
• Example Analysis:
HP news: "Product announcement: HP's just-released LaserJet Pro P1100 printer
and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ..."
Lenovo news: "Announcement for Lenovo ThinkPad ThinkCentre – price $150 off
Lenovo K300 desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price
$200 off Lenovo IdeaPad U450p laptop using. ...their performance"
Related document class: Product announcement
Word concept "Product": LaserJet, printer, announcement, price (HP); ThinkPad,
ThinkCentre, announcement, price (Lenovo)
Motivation (2)
• Example Analysis: The words expressing the same word concept are
domain-dependent
HP: LaserJet, printer, price, performance, etc.
Lenovo: ThinkPad, ThinkCentre, price, performance, etc.
Both word groups express the word concept "Product", and both indicate the
document class "Product announcement"
• Word concepts are domain-dependent
• The association between word concepts and document classes is
domain-independent
• Can we model this observation for classification?
• We study how to model it for cross-domain classification
Motivation (3)
• Example Analysis:
HP news: "Product announcement: HP's just-released LaserJet Pro P1100 printer
and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ..."
Lenovo news: "Announcement for Lenovo ThinkPad ThinkCentre – price $150 off
Lenovo K300 desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price
$200 off Lenovo IdeaPad U450p laptop using. ...their performance"
The two domains share some common words: announcement, price, performance …
Related document class: Product announcement
Word concept "Product": LaserJet, printer, announcement, price … (HP);
ThinkPad, ThinkCentre, announcement, price … (Lenovo)
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• Conclusions
Preliminary Knowledge
• Basic formula of matrix tri-factorization:
X ≈ F S Gᵀ, obtained by minimizing ‖X − F S Gᵀ‖² subject to F, S, G ≥ 0
where the input X is the word-document co-occurrence matrix,
F denotes the word-concept information, which may vary across domains,
S is the association between word concepts and document classes, which may
remain stable across domains, and
G denotes the document classification information
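As an illustration of this basic formula (not the paper's supervised algorithm), the tri-factorization can be fitted with standard NMF-style multiplicative updates. A minimal NumPy sketch, with illustrative matrix sizes and iteration count:

```python
import numpy as np

def tri_factorize(X, k, c, iters=200, eps=1e-9, seed=0):
    """Approximate X (words x docs) as F @ S @ G.T with nonnegative factors.

    F: words x concepts, S: concepts x classes, G: docs x classes.
    Plain multiplicative updates for min ||X - F S G^T||_F^2.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k))
    S = rng.random((k, c))
    G = rng.random((n, c))
    for _ in range(iters):
        # each factor is scaled by the ratio of the negative and positive
        # parts of its gradient, which preserves nonnegativity
        F *= (X @ G @ S.T) / (F @ S @ (G.T @ G) @ S.T + eps)
        S *= (F.T @ X @ G) / ((F.T @ F) @ S @ (G.T @ G) + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ (F.T @ F) @ S + eps)
    return F, S, G
```

In practice these updates drive the reconstruction error down while keeping all three factors nonnegative, so rows of G can be read as soft class memberships of documents.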
Problem Formulation (1)
• Input: source domain Xs, target domain Xt
• Matrix tri-factorization based classification
framework
• Two-step Optimization Framework (MTrick0)
• Joint Optimization Framework (MTrick)
Problem Formulation (2)
• Sketch map of two-step optimization
First step: factorize the source domain, Xs ≈ Fs Ss Gsᵀ
Second step: factorize the target domain with the transferred Ss, Xt ≈ Ft Ss Gtᵀ
Problem Formulation (3)
• The optimization problem in the source domain (First step)
Our goal: to obtain Fs, Gs and Ss; G0 is used as the supervision information
for this optimization
• The optimization problem in the target domain (Second step)
Our goal: to obtain Ft and Gt; Ss is the solution obtained from the source
domain
Problem Formulation (4)
• Sketch map of joint optimization
Source domain: Xs ≈ Fs S Gsᵀ; target domain: Xt ≈ Ft S Gtᵀ
Knowledge transfer: the association S is shared between the two factorizations
Problem Formulation (5)
• The joint optimization problem over the source and target domains:
min over Fs, Ft, Gs, Gt, S ≥ 0 of
‖Xs − Fs S Gsᵀ‖² + α‖Xt − Ft S Gtᵀ‖² + β‖Gs − G0‖²
The association S is shared as the bridge to transfer knowledge; G0 is the
supervision information from the source-domain labels; α and β are trade-off
parameters
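Written out this way, the joint objective is straightforward to evaluate. A small NumPy helper, assuming the Frobenius-norm form sketched on this slide (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def joint_objective(Xs, Xt, G0, Fs, Ft, Gs, Gt, S, alpha=1.0, beta=1.0):
    """Joint tri-factorization objective with a shared association S.

    Source reconstruction + alpha * target reconstruction
    + beta * supervision term pulling Gs toward the label matrix G0.
    """
    src = np.linalg.norm(Xs - Fs @ S @ Gs.T) ** 2   # source-domain fit
    tgt = np.linalg.norm(Xt - Ft @ S @ Gt.T) ** 2   # target-domain fit
    sup = np.linalg.norm(Gs - G0) ** 2              # label supervision
    return src + alpha * tgt + beta * sup
```

The objective is zero exactly when both domains factorize perfectly and the source class assignments Gs match the supervision G0.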
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• Conclusions
Solution for Optimization
• An alternately iterative algorithm is developed, with NMF-style multiplicative
update formulas applied to Fs, Ft, Gs, Gt and S in turn
This is the solution for the joint optimization problem
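To make the alternating scheme concrete, here is an illustrative NumPy sketch. It is a reconstruction, not the paper's verbatim update formulas: it assumes the joint objective ‖Xs − Fs S Gsᵀ‖² + α‖Xt − Ft S Gtᵀ‖² + β‖Gs − G0‖² and derives multiplicative updates for it in the standard NMF fashion; all names and default parameter values are made up for illustration:

```python
import numpy as np

def mtrick_sketch(Xs, Xt, G0, k, alpha=1.0, beta=1.0, iters=200, eps=1e-9, seed=0):
    """Illustrative alternating multiplicative updates for the joint problem.

    The shared S carries the concept/class association across domains;
    G0 holds the source labels (rows are class-indicator-like vectors).
    """
    rng = np.random.default_rng(seed)
    (m_s, n_s), (m_t, n_t), c = Xs.shape, Xt.shape, G0.shape[1]
    Fs = rng.random((m_s, k)); Ft = rng.random((m_t, k))
    Gs = rng.random((n_s, c)); Gt = rng.random((n_t, c))
    S = rng.random((k, c))
    for _ in range(iters):
        # alpha cancels in the Ft and Gt updates, since they appear
        # only in the target-domain term
        Fs *= (Xs @ Gs @ S.T) / (Fs @ S @ (Gs.T @ Gs) @ S.T + eps)
        Ft *= (Xt @ Gt @ S.T) / (Ft @ S @ (Gt.T @ Gt) @ S.T + eps)
        # the beta term pulls Gs toward the supervision G0
        Gs *= (Xs.T @ Fs @ S + beta * G0) / (
            Gs @ S.T @ (Fs.T @ Fs) @ S + beta * Gs + eps)
        Gt *= (Xt.T @ Ft @ S) / (Gt @ S.T @ (Ft.T @ Ft) @ S + eps)
        # S pools evidence from both domains: the knowledge-transfer bridge
        S *= (Fs.T @ Xs @ Gs + alpha * Ft.T @ Xt @ Gt) / (
            (Fs.T @ Fs) @ S @ (Gs.T @ Gs)
            + alpha * (Ft.T @ Ft) @ S @ (Gt.T @ Gt) + eps)
    return Fs, Ft, Gs, Gt, S
```

After fitting, target documents can be classified by taking the argmax over the rows of Gt.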
Analysis of Algorithm Convergence
• Following the methodology of convergence analysis in [Lee et al., NIPS'01]
and [Ding et al., KDD'06], the following theorem holds.
Theorem (Convergence): After each round of the iterative update formulas, the
objective function of the joint optimization decreases monotonically, and thus
the algorithm converges.
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• Conclusions
Experimental Preparation (1)
• Construct Classification Tasks (rec vs. sci)
rec: rec.autos, rec.motorcycles, rec.baseball, rec.hockey
sci: sci.crypt, sci.electronics, sci.space, sci.med
rec and sci denote the positive and negative classes, respectively
For the source domain: one rec subcategory + one sci subcategory, e.g.
rec.autos + sci.med (4 × 4 cases)
For the target domain: one of the remaining subcategories from each, e.g.
rec.motorcycles + sci.space (3 × 3 cases)
144 = (4 × 4) × (3 × 3) tasks can be constructed from this data set
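The task count can be checked by direct enumeration (subcategory names abbreviated as on the slide):

```python
from itertools import product

rec = ["rec.autos", "rec.motorcycles", "rec.baseball", "rec.hockey"]
sci = ["sci.crypt", "sci.electronics", "sci.space", "sci.med"]

# A task pairs a (rec, sci) source with a (rec, sci) target that reuses
# neither source subcategory: 4 x 4 source choices times 3 x 3 target choices.
tasks = [((rs, ss), (rt, st))
         for rs, ss in product(rec, sci)
         for rt, st in product(rec, sci)
         if rt != rs and st != ss]
print(len(tasks))  # 144
```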
Experimental Preparation (2)
• Data Sets
– 20 Newsgroups (three top categories are selected)
rec: rec.autos, rec.motorcycles, rec.baseball, rec.hockey
sci: sci.crypt, sci.electronics, sci.med, sci.space
talk: talk.guns, talk.mideast, talk.misc, talk.religion
– Two data sets for binary classification: rec vs. sci and sci vs. talk
rec vs. sci: 144 tasks
sci vs. talk: 144 tasks
– Reuters-21578 (the problems constructed in [Gao et al., KDD'08])
Experimental Preparation (3)
• Compared Algorithms
– Supervised Learning:
Logistic Regression (LG) [David et al., 00]
Support Vector Machine (SVM) [Joachims, ICML'99]
– Semi-supervised Learning:
TSVM [Joachims, ICML'99]
– Cross-domain Learning:
CoCC [Dai et al., KDD'07]
LWE [Gao et al., KDD'08]
• Our Methods
MTrick0 (two-step optimization framework)
MTrick (joint optimization framework)
• Measure: classification accuracy
Experimental Results (1)
• Comparisons among MTrick, MTrick0, CoCC, TSVM, SVM and LG on the data set
rec vs. sci
MTrick performs well even when the accuracy of LG is lower than 65%
Experimental Results (2)
• Comparisons among MTrick, MTrick0, CoCC, TSVM, SVM and LG on the data set
sci vs. talk
Similar to rec vs. sci, MTrick also achieves the best results on this data set
Experimental Results (3)
• The performance comparison of MTrick, LWE, CoCC, TSVM,
SVM and LG on Reuters-21578
MTrick also performs very well on this data set
Experimental Results Summary
• The systematic experiments show that MTrick outperforms all the compared
algorithms
• In particular, MTrick performs very well even when the accuracy of LG is low
(< 65%), which indicates that MTrick still works when the transfer learning
task is very difficult
• We can also see that the joint optimization is better than the two-step
optimization
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• Conclusions
Related Work (1)
• Cross-domain Learning
Solves the distribution mismatch problem between the training and testing data.
– Instance weighting based approaches
Boosting based learning by Dai et al. [ICML'07]
Instance weighting framework for NLP tasks by Jiang et al. [ACL'07]
– Feature selection based approaches
Two-phase feature selection framework by Jiang et al. [CIKM'07]
Dimensionality reduction approach by Pan et al. [AAAI'08], which focuses on
finding the latent feature space that serves as the bridge between the source
and target domains
Co-Clustering based Classification method by Dai et al. [KDD'07]
Related Work (2)
• Nonnegative Matrix Factorization (NMF)
Weighted nonnegative matrix factorization (WNMF) by Guillamet et al. [PRL'03]
Incorporating word space knowledge for document clustering by Li et al. [SIGIR'08]
Orthogonally constrained NMF by Ding et al. [KDD'06]
Cross-domain collaborative filtering by Li et al. [IJCAI'09]
Transferring label information by sharing word clusters, proposed by Li et al.
[SIGIR'09]; however, the word clusters are not exactly the same across domains
due to the distribution difference
Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• Conclusions
Conclusions
• Propose a nonnegative matrix factorization based classification framework
(MTrick), which explicitly considers
‒ the domain-dependent word concepts
‒ the domain-independent association between word concepts and document classes
• Develop an alternately iterative algorithm to solve the optimization problem,
and theoretically analyze its convergence
• Experiments on real-world text data sets show the effectiveness of the
proposed approach
Acknowledgement
Thank you!
Q. & A.