Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization
Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi (SDM 2010)

Outline
• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Work
• Conclusions

Introduction
• Many traditional learning techniques work well only under the assumption that training and test data follow the same distribution; they fail when the distributions differ.
• Enterprise news classification example, with classes such as "Product Announcement", "Business Scandal", "Acquisition", ...
– Training (labeled), HP news: "Product announcement: HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price ... performance ..."
– Test (unlabeled), Lenovo news: "Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300 desktop using coupon code ... $200 off Lenovo IdeaPad U450p laptop ... their performance ..."
• The two collections come from different companies, so a classifier trained on HP news faces a different word distribution on Lenovo news.

Motivation (1)
• Example analysis: both documents belong to the class "Product announcement", yet they express the product word concept with different words:
– HP news: LaserJet, printer, announcement, price, ...
– Lenovo news: ThinkPad, ThinkCentre, announcement, price, ...

Motivation (2)
• The words expressing the same word concept are domain-dependent:
– HP: LaserJet, printer, price, performance, etc.
– Lenovo: ThinkPad, ThinkCentre, price, performance, etc.
• However, the association between word concepts (e.g., the product concept) and document classes (e.g., "Product announcement") is domain-independent.
• Can we model this observation for classification? We study how to exploit it for cross-domain classification:
– domain-dependent word concepts;
– domain-independent associations between word concepts and document classes.

Motivation (3)
• The two domains also share some common words: announcement, price, performance, ...
• Both example documents map to the class "Product announcement" through the product word concept, even though that concept is expressed by LaserJet, printer, ... in HP news and by ThinkPad, ThinkCentre, ... in Lenovo news.

Preliminary Knowledge
• Basic formula of matrix tri-factorization: X ≈ F S G^T, where
– the input X is the word-document co-occurrence matrix;
– F denotes the word-concept information, which may vary across domains;
– S is the association between word concepts and document classes, which may remain stable across domains;
– G denotes the document classification information.
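To make the roles of the three factors concrete, here is a minimal NumPy sketch of the tri-factorization shapes. The toy sizes (6 words, 4 documents, 2 concepts, 2 classes) and the argmax read-outs are illustrative assumptions, not steps prescribed by the slides.

```python
import numpy as np

# Toy illustration of X ≈ F · S · G^T (shapes only; values are random).
# Assumed sizes: m = 6 words, n = 4 documents, k = 2 concepts, c = 2 classes.
rng = np.random.default_rng(0)
m, n, k, c = 6, 4, 2, 2
F = rng.random((m, k))   # word-to-concept memberships (domain-dependent)
S = rng.random((k, c))   # concept-to-class association (domain-independent)
G = rng.random((n, c))   # document-to-class memberships
X = F @ S @ G.T          # reconstructed word-document matrix, m x n

# Reading the factors: the dominant concept of each word, and the
# dominant class of each document.
word_concepts = F.argmax(axis=1)
doc_classes = G.argmax(axis=1)
print(X.shape, word_concepts, doc_classes)
```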
Problem Formulation (1)
• Input: source domain Xs (labeled), target domain Xt (unlabeled).
• A matrix tri-factorization based classification framework, in two variants:
– Two-step optimization framework (MTrick0)
– Joint optimization framework (MTrick)

Problem Formulation (2)
• Sketch of the two-step optimization: in the first step, Xs is factorized into Fs, Ss and Gs on the source domain; in the second step, Ss is carried over to the target domain, where Xt is factorized into Ft and Gt.

Problem Formulation (3)
• The optimization problem in the source domain (first step), whose goal is to obtain Fs, Gs and Ss; G0 encodes the labels of the source documents and is used as the supervision information:

  min_{Fs, Ss, Gs ≥ 0}  ||Xs - Fs Ss Gs^T||^2 + α ||Gs - G0||^2

• The optimization problem in the target domain (second step), whose goal is to obtain Ft and Gt; Ss is fixed to the solution obtained from the source domain:

  min_{Ft, Gt ≥ 0}  ||Xt - Ft Ss Gt^T||^2

Problem Formulation (4)
• Sketch of the joint optimization: the source factorization (Fs, Gs) and the target factorization (Ft, Gt) are learned simultaneously while sharing a single association matrix S, which is the channel for knowledge transfer from Xs to Xt.

Problem Formulation (5)
• The joint optimization problem over the source and target domains, where the shared association S is the bridge that transfers knowledge and G0 is the supervision information:

  min_{Fs, Gs, Ft, Gt, S ≥ 0}  ||Xs - Fs S Gs^T||^2 + α ||Gs - G0||^2 + β ||Xt - Ft S Gt^T||^2

Solution for Optimization
• An alternately iterative algorithm is developed: Fs, Gs, Ft, Gt and S are updated in turn by multiplicative update formulas until convergence, which yields the solution of the joint optimization problem (a sketch of updates in this style follows the convergence theorem below).

Analysis of Algorithm Convergence
• Following the convergence-analysis methodology of [Lee et al., NIPS'01] and [Ding et al., KDD'06], the following theorem holds.
• Theorem (Convergence): after each round of the iterative update formulas, the objective function of the joint optimization is non-increasing; hence it converges monotonically.
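As a concrete illustration of the alternating scheme, here is a minimal NumPy sketch of multiplicative updates for the joint objective above. It follows the standard Lee-Seung/Ding-style positive/negative split of the gradient; the trade-off parameters α and β, the random initialization, and every update formula here are assumptions in that generic style, not the paper's exact derivation (which may add normalizations and informed initialization).

```python
import numpy as np

def mtrick_joint_sketch(Xs, Xt, G0, k, alpha=1.0, beta=1.0, n_iter=100,
                        eps=1e-12, seed=0):
    """Sketch of alternating multiplicative updates for
    J = ||Xs - Fs S Gs^T||^2 + alpha ||Gs - G0||^2 + beta ||Xt - Ft S Gt^T||^2.
    Xs, Xt are nonnegative word-document matrices; G0 holds source labels.
    Hypothetical generic updates, not the paper's exact rules."""
    rng = np.random.default_rng(seed)
    c = G0.shape[1]                       # number of document classes
    Fs = rng.random((Xs.shape[0], k)); Ft = rng.random((Xt.shape[0], k))
    Gs = rng.random((Xs.shape[1], c)); Gt = rng.random((Xt.shape[1], c))
    S = rng.random((k, c))
    for _ in range(n_iter):
        # Fs, Ft: plain NMF-style updates with the other factors fixed.
        Ms = S @ Gs.T
        Fs *= (Xs @ Ms.T) / (Fs @ Ms @ Ms.T + eps)
        Mt = S @ Gt.T
        Ft *= (Xt @ Mt.T) / (Ft @ Mt @ Mt.T + eps)
        # Gs: the alpha term pulls Gs toward the supervision G0.
        Bs = Fs @ S
        Gs *= (Xs.T @ Bs + alpha * G0) / (Gs @ (Bs.T @ Bs) + alpha * Gs + eps)
        Bt = Ft @ S
        Gt *= (Xt.T @ Bt) / (Gt @ (Bt.T @ Bt) + eps)
        # S is shared: both domains contribute, which realizes the transfer.
        S *= (Fs.T @ Xs @ Gs + beta * (Ft.T @ Xt @ Gt)) / \
             (Fs.T @ Fs @ S @ (Gs.T @ Gs)
              + beta * (Ft.T @ Ft @ S @ (Gt.T @ Gt)) + eps)
    return Fs, Gs, Ft, Gt, S
```

Target labels would then be read off as the row-wise argmax of Gt, matching the interpretation of G as document classification information.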
Experimental Preparation (1)
• Construction of classification tasks from 20 Newsgroups, with rec and sci as the positive and negative classes, respectively:
– rec: rec.autos, rec.motorcycles, rec.baseball, rec.hockey
– sci: sci.crypt, sci.electronics, sci.space, sci.med
• A task pairs one rec subcategory with one sci subcategory for the source domain (e.g., rec.autos + sci.med) and a different rec subcategory with a different sci subcategory for the target domain (e.g., rec.motorcycles + sci.space).
• There are 4 × 4 choices for the source pair and 3 × 3 remaining choices for the target pair, i.e., P(4,2) · P(4,2) = 12 · 12 = 144 tasks for rec vs. sci (enumerated in the sketch below).
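The construction is easy to check in code. This sketch assumes, as the slide's example suggests, that the target subcategories must differ from the source ones within each class:

```python
from itertools import permutations

rec = ["rec.autos", "rec.motorcycles", "rec.baseball", "rec.hockey"]
sci = ["sci.crypt", "sci.electronics", "sci.space", "sci.med"]

tasks = []
for r_src, r_tgt in permutations(rec, 2):      # ordered distinct rec pairs
    for s_src, s_tgt in permutations(sci, 2):  # ordered distinct sci pairs
        source = (r_src, s_src)   # positive + negative class, source domain
        target = (r_tgt, s_tgt)   # disjoint subcategories, target domain
        tasks.append((source, target))

print(len(tasks))  # P(4,2) * P(4,2) = 12 * 12 = 144
```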
Experimental Preparation (2)
• Data sets:
– 20 Newsgroups (three top categories selected):
  rec: rec.autos, rec.motorcycles, rec.baseball, rec.hockey
  sci: sci.crypt, sci.electronics, sci.med, sci.space
  talk: talk.guns, talk.mideast, talk.misc, talk.religion
– Two data sets for binary classification: rec vs. sci (144 tasks) and sci vs. talk (144 tasks).
– Reuters-21578: the problems constructed in [Gao et al., KDD'08].

Experimental Preparation (3)
• Compared algorithms:
– Supervised learning: Logistic Regression (LG) [David et al., 00]; Support Vector Machine (SVM) [Joachims, ICML'99]
– Semi-supervised learning: TSVM [Joachims, ICML'99]
– Cross-domain learning: CoCC [Dai et al., KDD'07]; LWE [Gao et al., KDD'08]
• Our methods: MTrick0 (two-step optimization framework) and MTrick (joint optimization framework).
• Measure: classification accuracy.

Experimental Results (1)
• Comparison among MTrick, MTrick0, CoCC, TSVM, SVM and LG on rec vs. sci: MTrick performs well even on tasks where the accuracy of LG is lower than 65%.

Experimental Results (2)
• Comparison among MTrick, MTrick0, CoCC, TSVM, SVM and LG on sci vs. talk: as on rec vs. sci, MTrick achieves the best results on this data set.

Experimental Results (3)
• Performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578: MTrick also performs very well on this data set.

Experimental Results Summary
• The systematic experiments show that MTrick outperforms all the compared algorithms.
• In particular, MTrick performs very well when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the transfer-learning problem is hard.
• The joint optimization consistently outperforms the two-step optimization.

Related Work (1)
• Cross-domain learning: solving the distribution mismatch between training and test data.
– Instance weighting based approaches: boosting-based learning by Dai et al. [ICML'07]; an instance weighting framework for NLP tasks by Jiang et al. [ACL'07].
– Feature selection based approaches: a two-phase feature selection framework by Jiang et al. [CIKM'07]; a dimensionality reduction approach by Pan et al. [AAAI'08], which finds a latent feature space serving as bridge knowledge between the source and target domains; the Co-Clustering based Classification method by Dai et al. [KDD'07].

Related Work (2)
• Nonnegative Matrix Factorization (NMF):
– Weighted nonnegative matrix factorization (WNMF) by Guillamet et al. [PRL'03].
– Incorporating word-space knowledge for document clustering by Li et al. [SIGIR'08].
– Orthogonality-constrained NMF by Ding et al. [KDD'06].
– Cross-domain collaborative filtering by Li et al. [IJCAI'09].
– Transferring label information by sharing word clusters, proposed by Li et al. [SIGIR'09]. However, word clusters are not exactly the same across domains due to the distribution difference.

Conclusions
• We propose a nonnegative matrix factorization based classification framework (MTrick), which explicitly considers:
– the domain-dependent word concepts;
– the domain-independent association between word concepts and document classes.
• We develop an alternately iterative algorithm to solve the optimization problem, and theoretically analyze its convergence.
• Experiments on real-world text data sets show the effectiveness of the proposed approach.

Acknowledgement
Thank you! Q & A