Next Question Prediction
Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk
Joint work with
Chin-Yew Lin, Irwin King, Young-In Song and Yunbo Cao
This work was done at Microsoft Research Asia.
Outline
- Introduction
- Our Approach
- Experiments
- Conclusions
- Future Work
- Q&A
Introduction
- Community-based Question Answering (CQA) services:
  - Users ask questions and answer other people's questions.
  - E.g., Yahoo! Answers, Baidu Zhidao.
  - Why good? They return explicit, self-contained answers.
- Question Search:
  - Why? Time lag.
  - How? Finding semantically equivalent questions.
  - Limitation: it cannot guide users to explore resources that may interest them in a CQA service.
Introduction
- Our thought:
  - A user posts a question in CQA because he/she is interested in a particular topic, but his/her current question may capture only one aspect of that topic.
Introduction
- Our thought:
  - A CQA service is not an ideal source for learning semantic relations among questions: there are no explicit relations between questions in CQA services.
- Online forums:
  - Highly interactive discussions of questions.
  - Questions posted by the same thread starter are semantically related.
  - Forums contain a much larger number of questions than CQA services.
Introduction
- Example forum thread with discussion topic: 'travel in orange beach'.
Introduction
- Next Question Prediction:
  - Definition:
    - Predict a 'next' question that is semantically or pragmatically related to the queried question.
  - Benefits:
    - Assists the user in exploring different aspects of a topic tailored to his/her information needs.
    - Allows the system to populate the question database beforehand so as to make it more complete.
    - Provides a relevance feedback mechanism for guided search.
Introduction
- The system framework of next question prediction and a user case scenario.
Our Approach
- Question Detection:
  - Problem:
    - Simple rules, such as '?' and 5W1H words, are not adequate for detecting questions.
    - Questions can be expressed as imperative sentences.
    - Question marks are sometimes omitted.
  - Classification-based approach:
    - Mine labeled sequential patterns (LSPs).
    - Each discovered LSP forms a binary feature as input to a classifier.
Our Approach
- Question Detection:
  - Mining labeled sequential patterns (LSPs):
    - An LSP has the form S -> l, where S is a sequence of patterns and l is a class label.
    - A sequence s2 = <b1, ..., bn> contains a sequence s1 = <a1, ..., am> if there exist integers i1, ..., im such that 1 <= i1 < ... < im <= n and aj = b_ij for j = 1, ..., m.
    - An LSP p1 is contained by p2 if the sequence of p1 is contained by the sequence of p2 and they have the same class label.
    - sup(p): the support of p, i.e., the percentage of tuples in the sequence database that contain p.
    - conf(p): sup(p) / sup(p.S).
  - Classifier:
    - Each discovered LSP forms a binary feature as input to a classification model.
    - Ripper: a rule-based classification algorithm.
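The support and confidence definitions above can be sketched in a few lines. The sequence database, tokens, and pattern below are toy illustrations, not data from the paper:

```python
# Toy sketch of LSP support/confidence (illustrative data only).

def contains(seq, pattern):
    """True if `pattern` is a (possibly non-contiguous) subsequence of `seq`."""
    it = iter(seq)
    return all(token in it for token in pattern)  # membership consumes the iterator

def support(database, pattern, label=None):
    """Fraction of (sequence, label) tuples that contain `pattern`
    (and carry `label`, when one is given)."""
    hits = sum(
        1 for seq, lab in database
        if contains(seq, pattern) and (label is None or lab == label)
    )
    return hits / len(database)

# Hypothetical sequence database of tokenized sentences with class labels.
database = [
    (["anybody", "know", "how", "to", "do", "this"], "question"),
    (["does", "anybody", "know", "the", "answer"], "question"),
    (["i", "know", "the", "answer"], "statement"),
]

pattern = ["anybody", "know"]                     # the sequence S of an LSP S -> l
sup_p = support(database, pattern, "question")    # sup(p)
confidence = sup_p / support(database, pattern)   # conf(p) = sup(p) / sup(p.S)
```

Here the pattern "anybody ... know" occurs only in question-labeled sequences, so its confidence for the label 'question' is 1.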
Our Approach
- Translation Model-based Approach:
  - Translation model with Jelinek-Mercer smoothing:
    - q is the queried question, D is a candidate question, t is a query term in q, w is a term that appears in the collection, θC is the background collection, λ is a smoothing parameter, and pmle() is the maximum likelihood estimator.
  - Why do we employ a translation model?
    - It can bridge the lexical chasm between the queried question and a candidate question.
    - The behavior of the translation model is determined by how the word translation probabilities are obtained.
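The smoothed retrieval equation itself did not survive extraction. A reconstruction consistent with the symbols defined above, following the standard translation-based language model with Jelinek-Mercer smoothing, is:

```latex
p(q \mid D) = \prod_{t \in q} p(t \mid D)

p(t \mid D) = (1 - \lambda) \sum_{w \in D} p(t \mid w)\, p_{mle}(w \mid D)
            + \lambda\, p_{mle}(t \mid \theta_C)
```

Here p(t|w) is the word translation probability, and λ mixes the document model with the background collection model θC.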
Our Approach
- Learning word translation probabilities:
  - Questions are the focus of forum discussions.
  - Questions posted by the thread starter are very likely to explore different aspects of a discussion topic.
    - Words in these questions are semantically related.
    - Utilizing the co-occurrence properties of words that appear in semantically related questions may help find semantically related words.
  - We propose an Association Rule Mining (ARM) based algorithm to learn word translation probabilities.
Our Approach
- Learning word translation probabilities:
  - ARM-based algorithm:
    - Employ ARM to mine length-2 association rules w -> t.
    - Use the confidence value of rule w -> t as the word translation probability p(t|w) in the translation model.
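A minimal sketch of the rule-confidence computation, assuming each transaction is the word set of one thread starter's question (the example transactions are hypothetical):

```python
from collections import Counter
from itertools import permutations

def arm_translation_probs(transactions):
    """Confidence of length-2 association rules w -> t over word-set
    transactions: conf(w -> t) = count({w, t}) / count({w}),
    used as the word translation probability p(t|w)."""
    item_count, pair_count = Counter(), Counter()
    for words in transactions:
        ws = set(words)
        item_count.update(ws)
        pair_count.update(permutations(ws, 2))  # ordered (w, t) pairs
    return {(w, t): pair_count[w, t] / item_count[w] for (w, t) in pair_count}

# Hypothetical word sets, one per thread starter's question in a thread.
transactions = [["hotel", "beach"], ["hotel", "restaurant"]]
probs = arm_translation_probs(transactions)
# probs[("hotel", "beach")] plays the role of p("beach" | "hotel")
```

Note the rules are directional: "hotel" co-occurs with "beach" in one of its two transactions, while "beach" co-occurs with "hotel" in its only transaction, so the two confidences differ.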
Our Approach
- TransLDA: Translation Model with LDA Ensemble
  - Translation model limitation:
    - Employing only unigram-level knowledge, the trained model may not be robust against noise.
  - Topic modeling:
    - A fine-grained framework.
    - Enhances document representation; has been applied to ad-hoc retrieval before.
  - Target:
    - Next questions should explore aspects of the discussion topic different from the queried question.
    - Topic modeling view: candidate questions should share similar topic distributions with the queried question.
Our Approach
- TransLDA: Translation Model with LDA Ensemble
  - Latent Dirichlet Allocation (LDA):
    - Possesses fully generative semantics.
  - Motivation:
    - Both unigram-level knowledge and latent factors should be considered when measuring semantic relations between questions.
  - TransLDA:
    - Incorporates LDA into the translation model.
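The slides do not preserve the TransLDA formula. One plausible form of the ensemble, under the assumption that LDA evidence is linearly interpolated with the unigram translation probability (μ and p_lda are our notation for this sketch, not necessarily the paper's), is:

```latex
p_{lda}(t \mid D) = \sum_{z} p(t \mid z)\, p(z \mid D)

p(t \mid D) = (1 - \lambda) \Big[ (1 - \mu) \sum_{w \in D} p(t \mid w)\, p_{mle}(w \mid D)
            + \mu\, p_{lda}(t \mid D) \Big] + \lambda\, p_{mle}(t \mid \theta_C)
```

Here z ranges over the LDA topics, so the latent-factor term rewards candidates whose topic distributions resemble the queried question's, while the unigram translation term is retained.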
Experiments
- We ask ourselves three research questions:
  - RQ 1: How do the proposed methods compare with other approaches on labeled questions in the next question prediction problem?
  - RQ 2: How do the proposed methods compare with other approaches on the topic matching task under the next question prediction scenario?
  - RQ 3: How do the proposed methods compare with other approaches on the similarity of joint topic distributions with the ground truth?
Experiments
- Data Set:
  - Crawled from TripAdvisor, a popular travel forum.
  - Development Set: DEV SET
  - Unlabeled Testing Set: TST SET
  - Labeled Testing Set: TST LABEL SET
  - Training Set: TRAIN SET
Experiments
- Data Set statistics:
  - TRAIN SET: 1,976,522 questions extracted from 971,859 threads.
Experiments
- Data Analysis: thread starters' activities in TRAIN SET.
  - Over 40% of thread starters reply, with on average 1.9 posts, to the threads they initiated, matching our expectation that forum discussions are quite interactive.
  - Over 68.8% of thread starters ask questions in their posts, on average 2 questions each. This supports the motivation that questions are a focus of forum discussions and that forum data is an ideal source for training the proposed model.
Experiments
- Data Analysis: thread starters' activities in TRAIN SET (figure).
Experiments
- Data Preprocessing:
  - Stemming: Porter Stemmer.
  - Stop word list: the one used by the SMART system.
  - Document representation: unigrams (each word is a unit).
- Metrics:
  - Precision at rank n, Precision at rank R, Mean Average Precision, Mean Reciprocal Rank, Bpref, Kullback-Leibler divergence, Top-k Recall.
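As a quick reference for one of the ranking metrics listed above, Mean Reciprocal Rank averages the inverse rank of the first relevant result per query. The runs below are made-up toy data:

```python
def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result; 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical runs: two queries with their ranked results and relevant sets.
runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
mrr = mean_reciprocal_rank(runs)  # (1/2 + 1/1) / 2 = 0.75
```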
Experiments
- Comparison methods:
  - Cosine Similarity (COS)
  - Query Likelihood Language Model (QL)
- Our methods:
  - Trans-ns1, Trans-ns2, Trans-ns3
  - TransLDA-ns1, TransLDA-ns2, TransLDA-ns3
  - ns is the neighborhood-size parameter.
Experiments
- Parameter Tuning:
  - Data Set: DEV SET.
  - Metric: MRR.
  - Method: exhaustive search.
  - Tuned parameters:
    - Dirichlet smoothing parameter in QL.
    - Translation model parameter.
    - TransLDA parameter.
    - Number of topics in LDA.
Experiments
- Experiment on Labeled Questions:
  - TST LABEL SET: 200 queried questions, 31,018 questions in the repository.
Experiments
- Experiment on Topic Matching:
  - TST SET: 10,000 queried questions, 31,018 questions in the question repository.
  - Given a query q, denote the first next question of q in the actual thread as q', and consider q' to be relevant to q.
  - The function topic(X) returns the most probable topic that X belongs to.
  - For each method returning a ranked list of results {D1, D2, ..., Dn}, we consider Di relevant to q when topic(Di) = topic(q').
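Under this criterion, precision at rank n reduces to counting top-n results whose most probable topic matches that of the actual next question. A minimal sketch with hypothetical topic assignments:

```python
def precision_at_n(ranked, q_next, topic_of, n):
    """Fraction of the top-n results whose most probable topic matches
    that of the actual next question q_next."""
    top = ranked[:n]
    return sum(1 for d in top if topic_of[d] == topic_of[q_next]) / n

# Hypothetical most-probable-topic assignments, standing in for topic(X).
topic_of = {"q_next": 3, "d1": 3, "d2": 7, "d3": 3}
p_at_3 = precision_at_n(["d1", "d2", "d3"], "q_next", topic_of, 3)  # 2/3
```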
Experiments
- Experiment on Topics' Joint Probability Distribution:
  - TST SET: 10,000 queried questions, 31,018 questions in the question repository.
  - For each q, we consider its first next question q' in the thread as its relevant result. We aggregate the counts of topic transitions in the actual threads as ground truth and apply maximum likelihood estimation to calculate the topics' joint probability (a 100 x 100 matrix).
  - For each method, we consider the first result as its predicted next question and calculate the topics' joint probability distribution in the same way.
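The aggregation and comparison steps above can be sketched as follows, with a small k and made-up topic-transition pairs instead of the 100-topic model; the epsilon smoothing in the KL computation is an implementation convenience of this sketch:

```python
import math

def joint_topic_distribution(pairs, k):
    """MLE of the k x k joint probability of topic transitions (a, b),
    where a is the topic of q and b the topic of its next question."""
    counts = [[0.0] * k for _ in range(k)]
    for a, b in pairs:
        counts[a][b] += 1.0
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two joint distributions given as k x k matrices."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for prow, qrow in zip(p, q)
        for pi, qi in zip(prow, qrow)
        if pi > 0.0
    )

# Hypothetical (topic of q, topic of next question) pairs.
truth = joint_topic_distribution([(0, 1), (0, 1), (1, 0), (1, 1)], k=2)
predicted = joint_topic_distribution([(0, 1), (1, 0), (1, 0), (1, 1)], k=2)
divergence = kl_divergence(truth, predicted)  # smaller means closer to ground truth
```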
Conclusions
- Defined the next question prediction problem.
- Proposed an ARM-based algorithm to learn word translation probabilities and adapted it into the translation model.
- Proposed TransLDA, which incorporates LDA into the translation model.
Future Work
- Conduct experiments on CQA services.
- Investigate how to measure and diversify the predicted questions.
- Query suggestion for long queries would also benefit from next question prediction.
Your Question, Our Passion!

Thanks!