Next Question Prediction
Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk
Joint work with Chin-Yew Lin, Irwin King, Young-In Song and Yunbo Cao
This work was done at Microsoft Research Asia.

Outline
- Introduction
- Our Approach
- Experiments
- Conclusions
- Future Work
- Q&A

Introduction
- Community-based Question Answering (CQA) service:
  - Ask questions, and answer other people's questions.
  - E.g. Yahoo! Answers, Baidu Zhidao.
  - Why good? It returns explicit, self-contained answers.
- Question Search:
  - Why? Time lag.
  - How? Finding semantically equivalent questions.
  - Limitation: it cannot guide users to explore resources that may interest them in a CQA service.

Introduction
- Our thought: a user posts a question in CQA because he/she is interested in a particular topic, but the current question may capture only one aspect of that topic.

Introduction
- Our thought: a CQA service is not an ideal source for learning the semantic relations among questions.
  - There are no explicit relations between questions in CQA services.
- Online forums:
  - Highly interactive discussions of questions.
  - Questions posted by the thread starter are semantically related.
  - Forums contain a much larger number of questions than CQA services.

Introduction
- Discussion topic: 'travel in Orange Beach'

Introduction
- Next Question Prediction:
  - Definition: predict a 'next' question that is semantically or pragmatically related to the queried question.
  - Benefits:
    - Assisting the user in exploring different aspects of a topic, tailored to his/her information needs.
    - Allowing the system to populate the question database beforehand so as to make it more complete.
    - Providing a relevance feedback mechanism for guided search.

Introduction
- The system framework of next question prediction and a user case scenario.

Our Approach
- Question Detection:
  - Problem: simple rules, such as looking for question marks and 5W1H words, are not adequate for detecting questions.
    - Questions can be expressed as imperative sentences.
    - Question marks are often omitted.
  - Classification-based approach:
    - Mining labeled sequential patterns.
    - Each discovered LSP forms a binary feature as the input for a classifier.

Our Approach
- Question Detection:
  - Mining labeled sequential patterns (LSP):
    - LSP: $S \rightarrow l$, where $S$ is a sequence of patterns and $l$ is a class label.
    - A sequence $s_2 = \langle b_1, \dots, b_n \rangle$ contains a sequence $s_1 = \langle a_1, \dots, a_m \rangle$ if there exist integers $i_1, \dots, i_m$ such that $1 \le i_1 < \dots < i_m \le n$ and $a_j = b_{i_j}$ for all $j = 1, \dots, m$.
    - An LSP $p_1$ is contained by $p_2$ if the sequence of $p_1$ is contained by the sequence of $p_2$ and they have the same class label.
    - $sup(p)$: the support of $p$, i.e. the percentage of tuples in a sequence database that contain $p$.
    - $conf(p) = sup(p) / sup(p.S)$
  - Classifier:
    - Each discovered LSP forms a binary feature as the input for a classification model.
    - Ripper: a rule-based classification algorithm.

Our Approach
- Translation Model-based Approach:
  - Translation model with Jelinek-Mercer smoothing.
  - Notation: $q$ is the queried question, $D$ is a candidate question, $t$ is a query term in $q$, $w$ is a term that appears in the collection, $\theta_C$ is the background collection, $\lambda$ is a smoothing parameter, and $p_{mle}(\cdot)$ is the maximum likelihood estimator.
  - Why do we employ a translation model? It can bridge the lexical chasm between the queried question and the candidate question.
  - The behavior of the translation model is determined by how the word translation probabilities are obtained.

Our Approach
- Learning word translation probabilities:
  - Questions are the focus of forum discussions.
  - Questions posted by the thread starter are very likely to explore different aspects of a discussion topic.
  - Words in these questions are semantically related.
  - Utilizing the co-occurrence properties of words that appear in semantically related questions may help find semantically related words.
  - We propose an Association Rule Mining (ARM) based algorithm to learn word translation probabilities.

Our Approach
- Learning word translation probabilities:
  - ARM-based algorithm:
    - Employ ARM to mine length-2 association rules of the form $w \rightarrow t$.
    - Use the confidence value of rule $w \rightarrow t$ as the word translation probability $p(t|w)$ in the translation model.

Our Approach
- TransLDA: Translation Model with LDA Ensemble
  - Translation model:
    - Limitation: relying only on unigram-level knowledge can leave the trained model less robust to noise.
  - Topic modeling:
    - A fine-grained framework.
    - Enhances document representation; it has been applied to ad-hoc retrieval before.
  - Target: next questions should explore aspects of the discussion topic different from those of the queried question.
  - Topic modeling view: the candidate questions should share similar topic distributions with the queried question.

Our Approach
- TransLDA: Translation Model with LDA Ensemble
  - Latent Dirichlet Allocation (LDA):
    - Motivation:
      - It possesses fully generative semantics.
      - Both unigram-level knowledge and latent factors should be considered in measuring semantic relations between questions.
  - TransLDA: incorporates LDA into the translation model.

Experiments
- We ask ourselves three research questions:
  - RQ 1: How do the proposed methods compare with other approaches on labeled questions in the next question prediction problem?
  - RQ 2: How do the proposed methods compare with other approaches on the topic matching task under the next question prediction scenario?
  - RQ 3: How do the proposed methods compare with other approaches in terms of the similarity of their joint topic distributions to the ground truth?
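
As a rough illustration of the ARM-based step described under Our Approach, the sketch below estimates p(t|w) as the confidence of the length-2 rule w -> t and plugs it into a translation language model with Jelinek-Mercer smoothing. The scoring form, the choice to treat each thread starter's question words as one transaction, the self-translation value of 1.0, the toy threads, and all function names are assumptions made for illustration; the slides do not give the exact formula or implementation.

```python
from collections import Counter
from itertools import permutations
import math


def rule_confidences(transactions):
    """Confidence of each length-2 rule w -> t: count(w and t) / count(w).

    Assumption: one transaction = the words of the questions asked by
    one thread starter.
    """
    word_count = Counter()
    pair_count = Counter()
    for words in transactions:
        unique = sorted(set(words))
        word_count.update(unique)
        # Ordered pairs (w, t) with w != t that co-occur in this transaction.
        pair_count.update(permutations(unique, 2))
    return {(w, t): c / word_count[w] for (w, t), c in pair_count.items()}


def translation_score(query, candidate, conf, collection_freq, collection_len, lam=0.2):
    """Assumed scoring form (unnormalized, used only for ranking):

    log P(q|D) = sum over t in q of
        log((1 - lam) * sum_w p(t|w) * p_mle(w|D) + lam * p_mle(t|C))
    """
    cand_count = Counter(candidate)
    log_score = 0.0
    for t in query:
        translated = sum(
            # Assumption: a word translates to itself with probability 1.
            (1.0 if w == t else conf.get((w, t), 0.0)) * n / len(candidate)
            for w, n in cand_count.items()
        )
        background = collection_freq.get(t, 0) / collection_len
        p = (1 - lam) * translated + lam * background
        log_score += math.log(p) if p > 0 else float("-inf")
    return log_score


# Toy data (hypothetical): words from three thread starters' questions.
threads = [
    ["hotel", "beach", "restaurant"],
    ["hotel", "beach"],
    ["hotel", "airport"],
]
conf = rule_confidences(threads)
collection_freq = Counter(w for words in threads for w in words)
collection_len = sum(collection_freq.values())

score_hotel = translation_score(["beach"], ["hotel"], conf, collection_freq, collection_len)
score_airport = translation_score(["beach"], ["airport"], conf, collection_freq, collection_len)
print(score_hotel, score_airport)
```

In this toy setting a candidate containing "hotel" scores higher for the query "beach" than one containing "airport", because "hotel" and "beach" co-occur in the same transactions even though the two questions share no words.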

Experiments
- Data Set:
  - Crawled from TripAdvisor, a popular travel forum.
  - Training Set: TRAIN SET
  - Development Set: DEV SET
  - Unlabeled Testing Set: TST SET
  - Labeled Testing Set: TST LABEL SET

Experiments
- Data Set:
  - Statistics: TRAIN SET contains 1,976,522 questions extracted from 971,859 threads.

Experiments
- Data Analysis: analyzing thread starters' activities in TRAIN SET.
  - Over 40% of thread starters reply to the threads they started, with 1.9 replies on average, which supports our expectation that forum discussions are quite interactive.
  - Over 68.8% of thread starters ask questions in their posts, asking 2 questions on average.
  - This supports the motivation that questions are a focus of forum discussions and that forum data is an ideal source for training the proposed model.

Experiments
- Data Preprocessing:
  - Stemming: Porter Stemmer.
  - Stop word list: the one used by the SMART system.
  - Document representation: unigram (each word is a unit).
- Metrics: Precision at rank n, Precision at rank R, Mean Average Precision, Mean Reciprocal Rank, Bpref, Kullback-Leibler divergence, Top-k Recall.

Experiments
- Comparison methods:
  - Cosine Similarity (COS)
  - Query Likelihood Language Model (QL)
- Our methods:
  - Trans-ns1, Trans-ns2, Trans-ns3
  - TransLDA-ns1, TransLDA-ns2, TransLDA-ns3
  - ns is the neighborhood size parameter.

Experiments
- Parameter Tuning:
  - Data Set: DEV SET.
  - Metric: MRR.
  - Method: exhaustive search.
  - Tuned parameters:
    - Dirichlet smoothing parameter in QL.
    - Translation model parameter.
    - TransLDA parameter.
    - Number of topics in LDA.
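
The tuning procedure above can be sketched as an exhaustive search that keeps the parameter setting with the highest MRR on a development set. The `rank_fn` callback, the parameter grid, and the toy ranker are hypothetical; the slides do not specify the grid or the ranking interface.

```python
from itertools import product


def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for query, ranked in rankings.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


def exhaustive_search(rank_fn, queries, relevant, grid):
    """Evaluate every parameter combination in `grid`; return the best by MRR."""
    names = sorted(grid)
    best_params, best_mrr = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        rankings = {q: rank_fn(q, **params) for q in queries}
        mrr = mean_reciprocal_rank(rankings, relevant)
        if mrr > best_mrr:  # ties keep the earliest setting tried
            best_params, best_mrr = params, mrr
    return best_params, best_mrr


def toy_rank(query, lam):
    """Hypothetical ranker: the relevant item comes first only when lam >= 0.5."""
    docs = ["a", "b", "c"]
    return docs if lam >= 0.5 else list(reversed(docs))


relevant = {"q1": {"a"}, "q2": {"a"}}
best, mrr = exhaustive_search(toy_rank, ["q1", "q2"], relevant, {"lam": [0.1, 0.5, 0.9]})
print(best, mrr)
```

Exhaustive search is practical here only because a handful of parameters are tuned; the cost grows multiplicatively with each parameter added to the grid.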

Experiments
- Experiment on Labeled Questions:
  - TST LABEL SET: 200 queried questions, 31,018 questions in the question repository.

Experiments
- Experiment on Topic Matching:
  - TST SET: 10,000 queried questions, 31,018 questions in the question repository.
  - Given a query $q$, we denote the first next question of $q$ in the actual thread as $q'$ and consider $q'$ to be relevant to $q$.
  - The function $topic(X)$ returns the most probable topic that $X$ belongs to.
  - Each method returns a ranked list of results $\{D_1, D_2, \dots, D_n\}$; we consider $D_i$ to be relevant to $q$ if $topic(D_i) = topic(q')$.

Experiments
- Experiment on Topics' Joint Probability Distribution:
  - TST SET: 10,000 queried questions, 31,018 questions in the question repository.
  - For each $q$, we consider its first next question $q'$ in the thread as its relevant result.
  - We aggregate the counts of topic transitions in the actual threads as ground truth, and apply maximum likelihood estimation to calculate the topics' joint probability (a 100 × 100 matrix).
  - For each method, we take the first result as its predicted next question and calculate the topics' joint probability distribution in the same way.

Conclusions
- Defining the next question prediction problem.
- Proposing an ARM-based algorithm to learn the word translation probabilities, and incorporating them into the translation model.
- Proposing TransLDA, which incorporates LDA into the translation model.

Future Work
- Conducting experiments on CQA services.
- Investigating how to measure the diversity of the predicted questions and how to diversify them.
- Query suggestion for long queries would also benefit from next question prediction.

Q&A
- Your Question, Our passion!

Thanks!
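
The comparison of joint topic distributions described in the Experiments slides can be sketched as follows: topic transitions are counted, normalized by maximum likelihood, and compared with KL divergence. The direction of the divergence, the `eps` smoothing constant, the two-topic toy data, and the function names are assumptions; the slides list Kullback-Leibler divergence as a metric but do not state these details.

```python
from collections import Counter
import math


def joint_topic_distribution(pairs, num_topics):
    """MLE of P(topic(q), topic(next question)) from (topic_q, topic_next) pairs."""
    if not pairs:
        raise ValueError("need at least one topic transition")
    counts = Counter(pairs)
    total = len(pairs)
    return [
        [counts[(i, j)] / total for j in range(num_topics)]
        for i in range(num_topics)
    ]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) summed over matrix cells; eps keeps the ratio finite when q is 0."""
    kl = 0.0
    for row_p, row_q in zip(p, q):
        for a, b in zip(row_p, row_q):
            if a > 0:
                kl += a * math.log(a / (b + eps))
    return kl


# Toy transitions (hypothetical), two topics instead of the 100 used in the slides.
truth_pairs = [(0, 1), (0, 1), (1, 0), (1, 1)]
predicted_pairs = [(0, 1), (0, 0), (1, 0), (1, 1)]

truth = joint_topic_distribution(truth_pairs, num_topics=2)
predicted = joint_topic_distribution(predicted_pairs, num_topics=2)
print(kl_divergence(truth, predicted))
```

A lower divergence means the method's predicted topic transitions are closer to those observed in the actual threads.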