Latent Association Analysis
of Document Pairs
Gengxin Miao
University of California, Santa Barbara
Presented at the
IBM T.J. Watson Research Center
Hawthorne, NY
December 2, 2011
Networked Texts
Texts flow on expert networks
Semantically associated texts
Interconnected text streams
[Figure: example text networks. Labels: a "DB2 logon" text passing among experts A to H at steps t1, t2, t3; symptoms, diseases, treatments; users, queries belonging to the same search task, web pages.]
Gengxin Miao
UC Santa Barbara
2
Semantically Associated Documents
Applications
• Software system maintenance
• Root cause finding
• Problem prediction
• Machine translation
• Question answering
• Healthcare assistance
Huge Datasets
Beyond a human learner's capability
Modeling Options
Source Document Set
Target Document Set
• Word-level mapping
• Topic-level mapping
• Document-level mapping
Word-Level Mapping (UAI’09)
• Learns a dictionary between the two document sets
• Applies to machine translation
• Word mappings are typically noisy
Topic-Level Mapping (EMNLP’09)
[Figure: topic simplex of the source document set; topic simplex of the target document set]
• Assumes the associated documents share the same topic proportions
• Works well for translations between languages
Document-Level Mapping (our work)
• One-to-many or many-to-one mappings are broken down into one-to-one document pairs
• Two documents are associated by their association factor
Latent Association Analysis – Framework
Generative Models
Ranking Algorithms
Experiment
• Generative process
    • Draw an association factor for each document pair
    • Draw topic proportions for both the source and the target document
    • Draw the words in each document
Latent Association Analysis – An Instantiation
• Canonical Correlation Analysis (CCA)
    • Captures the semantic association in document pairs
• Correlated Topic Model (CTM)
    • Captures the document and word co-occurrence
The Generative Process
• A pair of documents arises from the following process
    • Draw an L-dimensional association factor
    • For each of the source and target documents, draw the topic proportions
    • For each word in the documents, draw a topic and then a word
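The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact model from the talk: the distributions are assumptions chosen to match a CCA-plus-CTM instantiation (a standard normal association factor, a linear-Gaussian map to topic logits, a softmax to obtain topic proportions), and every name (`draw_document`, `draw_pair`, `random_side`, `T`, `mu`, `noise_std`, `beta`) is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def draw_document(rng, y, T, mu, noise_std, beta, n_words):
    """Draw one document given the shared association factor y.

    Assumed form: topic logits x ~ N(T y + mu, noise_std^2 I),
    topic proportions theta = softmax(x) (logistic-normal, as in CTM).
    """
    K, V = beta.shape
    x = T @ y + mu + noise_std * rng.standard_normal(K)
    theta = softmax(x)
    z = rng.choice(K, size=n_words, p=theta)              # topic per word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # word per topic
    return theta, z, w

def draw_pair(rng, L, src, tgt, n_src, n_tgt):
    """Draw a (source, target) pair that shares one association factor."""
    y = rng.standard_normal(L)  # assumed prior: y ~ N(0, I_L)
    return y, draw_document(rng, y, *src, n_src), draw_document(rng, y, *tgt, n_tgt)

def random_side(rng, K, L, V):
    """Random illustrative parameters (T, mu, noise_std, beta) for one side."""
    return (rng.standard_normal((K, L)), np.zeros(K), 0.5,
            rng.dirichlet(np.ones(V), size=K))

rng = np.random.default_rng(0)
L, K, V = 2, 4, 30
src = random_side(rng, K, L, V)
tgt = random_side(rng, K, L, V)
y, (theta_s, z_s, w_s), (theta_t, z_t, w_t) = draw_pair(rng, L, src, tgt, 20, 15)
```

The only link between the two documents is `y`: each side has its own topics and vocabulary distribution, which is what lets the source and target sets differ in language or style.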
Problems
• Inference
    • Given a model M and a document pair
    • How do we determine the association factor, topic proportions, and topic assignments that best describe the document pair?
• Model fitting
    • Given a set of document pairs
    • How do we calculate the parameters of M that best describe the entire set of document pairs?
Inference
• Objective function
    • Given a model and a document pair
    • Calculate the topic assignments and the topic proportions
• Posterior distribution is intractable to compute
    • The topic assignments and the topic proportions are correlated when conditioned on observations
Variational Inference
• Decouple the parameters using a variational distribution Q
• Fit the variational parameters to approximate the true posterior distribution
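For reference, the standard identity behind this step can be written out. The notation here is an assumption, not taken from the slides: $w$ stands for the observed words of a document pair, $h$ for all hidden variables (association factor, topic proportions, topic assignments), and $\nu$ for the variational parameters of $Q$.

```latex
\log p(w \mid M)
  = \underbrace{\mathbb{E}_{Q}\!\left[\log p(w, h \mid M)\right]
      - \mathbb{E}_{Q}\!\left[\log Q(h \mid \nu)\right]}_{\mathcal{L}(\nu)}
  + \mathrm{KL}\!\left(Q(h \mid \nu) \,\|\, p(h \mid w, M)\right)
```

Because the left-hand side does not depend on $\nu$ and the KL term is non-negative, maximizing the lower bound $\mathcal{L}(\nu)$ is equivalent to minimizing the KL divergence between $Q$ and the true posterior. Decoupling means choosing a $Q$ that factorizes over the hidden variables so that the expectations in $\mathcal{L}(\nu)$ become tractable.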
Variational Parameters
Model Fitting
LAA Ranking Methods
• Direct Ranking
    • Ranking function for a candidate document pair
    • Word frequency can distort the probability
• Latent Ranking
Two-Step Ranking
• Separate Topic Models
    • Source document has its own topic proportion
    • Target document has its own topic proportion
• Topic-Level Mapping
    • Canonical Correlation Analysis captures the association between the topic proportions
• Rank Target Documents
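A rough sketch of the two-step pipeline, under several assumptions: the topic models are skipped and replaced by synthetic real-valued topic features (stand-ins for the topic proportions that two separately fitted topic models would produce), CCA is implemented directly with NumPy, and targets are ranked by distance to the expected target projection in the canonical space. The ranking rule and all names (`fit_cca`, `rank_targets`) are illustrative choices, not necessarily those used in the talk.

```python
import numpy as np

def fit_cca(X, Y, n_comp, reg=1e-6):
    """Fit CCA between row-aligned matrices X (n x p) and Y (n x q)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both sides with Cholesky factors, then take the SVD of the
    # whitened cross-covariance; singular values are canonical correlations.
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :n_comp])
    Wy = np.linalg.solve(Ly.T, Vt[:n_comp].T)
    return mx, my, Wx, Wy, s[:n_comp]

def rank_targets(x_src, X_tgt, model):
    """Return candidate target indices, best match first."""
    mx, my, Wx, Wy, corr = model
    pred = ((x_src - mx) @ Wx) * corr   # expected target projection
    proj = (X_tgt - my) @ Wy
    return np.argsort(np.linalg.norm(proj - pred, axis=1))

# Synthetic topic features driven by a shared latent factor (assumption).
rng = np.random.default_rng(0)
n, L, Ks, Kt = 600, 3, 5, 4
y = rng.standard_normal((n, L))
Xs = y @ rng.standard_normal((L, Ks)) + 0.05 * rng.standard_normal((n, Ks))
Xt = y @ rng.standard_normal((L, Kt)) + 0.05 * rng.standard_normal((n, Kt))

model = fit_cca(Xs[:500], Xt[:500], n_comp=L)   # step 2: topic-level mapping
ranks = []
for i in range(500, 600):                       # step 3: rank 100 held-out targets
    order = rank_targets(Xs[i], Xt[500:], model)
    ranks.append(int(np.where(order == i - 500)[0][0]) + 1)
mean_rank = float(np.mean(ranks))
```

Because the topic models and the mapping are fitted in separate stages, the topics on each side are learned without knowledge of the association, which is the main contrast with a jointly fitted model.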
Experiments
• Datasets
    • IT-Change: changes made to an IT environment and the consequent problems
        • 24,317 document pairs
        • 20,000 used for training, the remaining 4,317 for testing
    • IT-Solution: IT problems and their solutions
        • 19,696 document pairs
        • 15,000 used for training, the remaining 4,696 for testing
• Evaluation
    • Randomly select 100 document pairs from the testing dataset
    • For each source document, rank the 100 target documents
    • Use the rank of the correct target document as the accuracy measure
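The evaluation protocol above could be sketched as follows. The scoring function is a placeholder: `toy_score` and the numeric "documents" are hypothetical stand-ins for whatever ranking method is being evaluated, and `evaluate` and `rank_of_correct` are illustrative names.

```python
import numpy as np

def rank_of_correct(scores, correct):
    """1-based rank of the correct candidate; higher score is better."""
    return 1 + int(np.sum(scores > scores[correct]))

def evaluate(score_fn, sources, targets, n_sample=100, seed=0):
    """Sample n_sample test pairs and rank every sampled target per source."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(sources), size=n_sample, replace=False)
    cand = targets[idx]   # the candidate target documents
    ranks = [rank_of_correct(score_fn(sources[i], cand), pos)
             for pos, i in enumerate(idx)]
    return np.array(ranks)

def toy_score(src, cand):
    # Placeholder scorer: closer numbers score higher.
    return -np.abs(cand - src)

sources = np.arange(500, dtype=float)
targets = sources + 0.1   # target i is paired with source i
ranks = evaluate(toy_score, sources, targets)
```

A rank of 1 means the correct target was placed first; a method that ranks at random would average about 50 over 100 candidates.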
Accuracy Analysis
Example
Summary
• The LAA framework can model two document sets associated by a bipartite graph
• One-to-many and many-to-one mappings of documents are taken into consideration
• We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications
• The LAA-latent ranking algorithm ranks the correct target document higher than the other state-of-the-art algorithms compared
Acknowledgment
• Prof. Louise E. Moser
• Prof. Xifeng Yan
• Dr. Shu Tao
• Dr. Ziyu Guan
• Dr. Nikos Anerousis
Q & A?
Thanks!
Unigram Model
\[ p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n) \]
Mixture of Unigrams
\[ p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z) \]
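As a small worked example, the unigram and mixture-of-unigrams likelihoods can be evaluated on a three-word document. The vocabulary, probabilities, and function names below are made up for illustration; the computation is done in log space to avoid underflow on longer documents.

```python
import numpy as np

def log_unigram(doc, p_w):
    """log p(w) = sum_n log p(w_n)."""
    return float(np.sum(np.log(p_w[doc])))

def log_mixture_of_unigrams(doc, p_z, p_w_given_z):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z), via log-sum-exp."""
    per_topic = np.log(p_z) + np.log(p_w_given_z[:, doc]).sum(axis=1)
    m = per_topic.max()
    return float(m + np.log(np.exp(per_topic - m).sum()))

doc = np.array([0, 0, 1])                 # word ids of a 3-word document
p_w = np.array([0.5, 0.5])                # unigram word distribution
p_z = np.array([0.5, 0.5])                # topic prior
p_w_given_z = np.array([[0.9, 0.1],       # topic 0 favors word 0
                        [0.1, 0.9]])      # topic 1 favors word 1

lik_unigram = np.exp(log_unigram(doc, p_w))                        # 0.5^3 = 0.125
lik_mixture = np.exp(log_mixture_of_unigrams(doc, p_z, p_w_given_z))
# 0.5 * (0.9 * 0.9 * 0.1) + 0.5 * (0.1 * 0.1 * 0.9) = 0.045
```

The mixture draws a single topic for the whole document, so all words share one `z`; the later models relax exactly this restriction.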
Probabilistic Latent Semantic Indexing
\[ p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d) \]
LDA and CTM
[Figure: two panels, each labeled topic 1, topic 2, topic 3]