Latent Association Analysis of Document Pairs
Gengxin Miao, University of California, Santa Barbara
Presented at the IBM T.J. Watson Research Center, Hawthorne, NY, December 2, 2011

Networked Texts
- Texts flow on expert networks (e.g., a "DB2 logon" problem ticket routed among experts)
- Semantically associated texts arise in many settings: symptoms, diseases, and treatments in healthcare; queries and web pages that belong to the same search task; interconnected text streams
[Figure: examples of networked texts: an expert network routing tickets, a symptom/disease/treatment graph, and queries linked to web pages]

Semantically Associated Documents + Applications
- Software system maintenance: root cause finding, problem prediction
- Machine translation
- Question answering
- Healthcare assistance

Huge Datasets
- These document collections are too large for a human to learn the associations by hand

Modeling Options
- How to map a source document set to a target document set:
  - Word-level mapping
  - Topic-level mapping
  - Document-level mapping

Word-Level Mapping (UAI'09)
- Learns a dictionary between the two document sets
- Applies to machine translation
- Word mappings are typically noisy

Topic-Level Mapping (EMNLP'09)
- Assumes that associated documents share the same topic proportions
- Works well for translation between languages
[Figure: topic simplex of the source document set and topic simplex of the target document set]

Document-Level Mapping (our work)
- One-to-many and many-to-one mappings are broken down into one-to-one document pairs
- Two documents are associated through their association factor

Latent Association Analysis – Framework
- Generative process:
  - Draw an association factor for each document pair
  - Draw topic proportions for both the source and the target document
  - Draw the words in each document

Latent Association Analysis – An Instantiation
- Canonical Correlation Analysis (CCA) captures the semantic association within document pairs
- Correlated Topic Model (CTM) captures the document and word co-occurrence

The Generative Process
A pair of documents arises from the following process (a code sketch is given below):
- Draw an L-dimensional association factor
- For the source and target documents, draw the topic proportions
- For each word in each document, draw a topic and then a word

Problems
- Inference: given a model M and a document pair, how do we determine the association factor, topic proportions, and topic assignments that best describe the pair?
- Model fitting: given a set of document pairs, how do we calculate the parameters of M that best describe the entire document-pair set?
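To make the generative process concrete, here is a minimal Python sketch of how one document pair could be sampled under the CCA+CTM instantiation described above. The dimensions, the parameter names (W_src, W_tgt, beta_src, beta_tgt), and the Gaussian association factor mapped to logistic-normal topic proportions are illustrative assumptions, not the exact parameterization from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the values used in the talk).
L = 5                    # dimension of the latent association factor
K = 20                   # number of topics per document set
V = 1000                 # vocabulary size
N_src, N_tgt = 80, 60    # words per source / target document

# Hypothetical model parameters.
W_src = rng.normal(size=(K, L))                # maps association factor -> source topic space
W_tgt = rng.normal(size=(K, L))                # maps association factor -> target topic space
beta_src = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions (source)
beta_tgt = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions (target)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_pair():
    # 1. Draw an L-dimensional association factor shared by the pair.
    y = rng.normal(size=L)
    # 2. Draw topic proportions for the source and the target document
    #    (logistic-normal, in the spirit of the Correlated Topic Model).
    theta_src = softmax(W_src @ y + rng.normal(scale=0.1, size=K))
    theta_tgt = softmax(W_tgt @ y + rng.normal(scale=0.1, size=K))
    # 3. For each word, draw a topic assignment and then a word.
    def draw_words(theta, beta, n):
        z = rng.choice(K, size=n, p=theta)          # per-word topic assignments
        return [rng.choice(V, p=beta[k]) for k in z]
    return draw_words(theta_src, beta_src, N_src), draw_words(theta_tgt, beta_tgt, N_tgt)

src_words, tgt_words = generate_pair()
```

Inference, discussed next, runs this process in reverse: given an observed document pair, it recovers the association factor, topic proportions, and topic assignments that best explain the pair.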
Inference
- Objective: given a model and a document pair, calculate the topic assignments and the topic proportions
- The posterior distribution is intractable to compute: the topic assignments and the topic proportions are correlated when conditioned on the observations

Variational Inference
- Decouple the parameters using a variational distribution Q
- Fit the variational parameters to approximate the true posterior distribution

Variational Parameters

Model Fitting

LAA Ranking Methods
- Direct ranking: a ranking function over candidate document pairs; word frequency can distort the probability
- Latent ranking

Two-Step Ranking
- Separate topic models: the source document and the target document each have their own topic proportions
- Topic-level mapping: Canonical Correlation Analysis captures the association between the topic proportions
- Rank the target documents

Experiments
Datasets
- IT-Change: changes made to an IT environment and the consequent problems; 24,317 document pairs; 20,000 used for training, the rest for testing
- IT-Solution: IT problems and their solutions; 19,696 document pairs; 15,000 used for training, the rest for testing
Evaluation
- Randomly select 100 document pairs from the testing set
- For each source document, rank the 100 target documents
- Use the rank of the correct target document as the accuracy measure

Accuracy Analysis

Example

Summary
- The LAA framework can model two document sets associated by a bipartite graph
- One-to-many and many-to-one mappings between documents are taken into consideration
- We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications
- The LAA latent-ranking algorithm ranks the correct target document higher than other state-of-the-art algorithms

Acknowledgment
- Prof. Louise E. Moser
- Prof. Xifeng Yan
- Dr. Shu Tao
- Dr. Ziyu Guan
- Dr. Nikos Anerousis

Q & A? Thanks!

Unigram Model
p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

Mixture of Unigrams
p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)

Probabilistic Latent Semantic Indexing
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)

LDA and CTM
[Figure: topic simplexes illustrating LDA and CTM]
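To make the backup-slide formulas concrete, here is a small Python sketch that evaluates a toy document under the unigram model, the mixture of unigrams, and pLSI, following the three equations above. The vocabulary size, topic count, and randomly drawn distributions are purely illustrative assumptions, not quantities from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

V, Z = 50, 4                          # toy vocabulary size and number of topics/classes
doc = rng.integers(0, V, size=12)     # a toy document as a sequence of word ids

# Hypothetical distributions, drawn at random purely for illustration.
p_w = rng.dirichlet(np.ones(V))             # unigram word distribution p(w)
p_z = rng.dirichlet(np.ones(Z))             # class prior p(z)
p_w_given_z = rng.dirichlet(np.ones(V), Z)  # p(w | z), shape (Z, V)
p_z_given_d = rng.dirichlet(np.ones(Z))     # pLSI mixing weights p(z | d) for this document

# Unigram model: p(w) = prod_n p(w_n)
lik_unigram = np.prod(p_w[doc])

# Mixture of unigrams: p(w) = sum_z p(z) * prod_n p(w_n | z)
# (a single topic is chosen for the whole document)
lik_mixture = np.sum(p_z * np.prod(p_w_given_z[:, doc], axis=1))

# pLSI: each word gets its own topic; ignoring the constant p(d) factor,
# p(w | d) = prod_n sum_z p(w_n | z) * p(z | d)
lik_plsi = np.prod(p_w_given_z[:, doc].T @ p_z_given_d)

print(lik_unigram, lik_mixture, lik_plsi)
```

The key contrast the backup slides highlight carries over to the sketch: the mixture of unigrams commits to one topic per document, while pLSI (and, with full priors, LDA and CTM) mixes topics at the word level.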