Answer Generating Methods for Community Question and Answering Portals
{Tao Haoxiong, Hao Yu, Zhu Xiaoyan} @Tsinghua University
Department of Computer Science, Tsinghua University

Outline
• Introduction and Related Work
• List-type Question
  – Answer generating method
  – Method result and analysis
• Solution-type Question
  – Visible list
  – Select the best list
  – Experiment and analysis
• Conclusion
• Future Work

Introduction
• Online community question answering (cQA) portals, such as Soso Wenwen and Baidu Zhidao, have become a popular way to acquire information.
• But they have some limitations:
  – Answers are not available in real time.
  – The quality of many answers is not high.

Related Work
• To overcome the lack of real-time answers, cQA portals provide a search service.
  – Users need to click links to see the complete answers.
  – Finding useful information takes a long time.

Related Work
• To return high-quality answers:
  – Predict the quality of cQA answers.
    • User profile features, text features, etc.
  – Use multi-document summarization to summarize answers.
    • More comprehensive but less readable.
  – To improve answer quality, almost all well-performing systems introduce a question taxonomy.

Related Work
• The question taxonomy proposed by Fan Bu contains 6 question types:

  Type         Proportion
  List         23.8%
  Solution     19.7%
  Reason       18.1%
  Navigation   14.8%
  Fact         14.4%
  Definition    7.5%

• Examples:
  – List-type: List Nobel Prize winners in the 1990s?
  – Solution-type: How to make pizza?

Research Framework
• Propose answer generating methods for both List-type and Solution-type questions.

List-type Question
• Each answer is a single phrase or a list of phrases.

Answer Generating Method
• Two characteristics of answers:
  – The “Best Answer” often does not contain all answer points.
  – Answer points that are high-quality or relevant to the question often appear in more than one answer.
• Propose a method based on clustering of answer points.
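
The clustering idea above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the delimiter-based splitting, the `difflib` string similarity, the greedy single-pass clustering, the default threshold of 0.8, and ranking by the number of supporting answers are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import re
from difflib import SequenceMatcher


def split_points(answer):
    """Split one answer into candidate answer points (delimiters are assumed)."""
    parts = re.split(r"[,;\n]+", answer)
    return [p.strip() for p in parts if p.strip()]


def similar(a, b, threshold):
    """Return True if two answer points are similar enough to share a cluster."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_answer_points(answers, threshold=0.8):
    """Greedily cluster answer points and rank clusters by supporting answers.

    Each cluster keeps its first-seen point as representative and the set of
    answer indices in which a matching point occurred.
    """
    clusters = []
    for idx, answer in enumerate(answers):
        for point in split_points(answer):
            for cluster in clusters:
                if similar(point, cluster["rep"], threshold):
                    cluster["answers"].add(idx)
                    break
            else:
                # No existing cluster matched: start a new one.
                clusters.append({"rep": point, "answers": {idx}})
    # Points supported by more answers rank higher (assumed ranking criterion).
    ranked = sorted(clusters, key=lambda c: len(c["answers"]), reverse=True)
    return [(c["rep"], len(c["answers"])) for c in ranked]


if __name__ == "__main__":
    toy_answers = ["apple, banana", "Banana; cherry", "banana, cherry, durian"]
    for point, support in cluster_answer_points(toy_answers):
        print(point, support)
```

Because the output is a ranked list, truncating it after the top few entries is one simple way to control the answer length.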

Answer Generating Method (continued)
[Two slides with no text; figures not preserved.]

Example of the Method Result
[Figure not preserved.]

Method Result and Analysis
• The result contains more answer points than the “Best Answer”.
• Outputs are ranked, so the answer length is easy to control.
• Further research is needed:
  – Split answers into answer points.
  – Choose the clustering threshold.

Solution-type Question
• Visible List
  – Chose 1179 solved Solution-type questions from Baidu Zhidao; 30% of them have answers containing visible lists.
  – The average length of the “Best Answer” is above 1400 words, while the average length of a visible list is about 600 words.
  – 55% of the questions with visible lists have more than one visible list. We propose a method to select the best list.

Select the Best List
• Features:
  – FirstList
    • If the list is the first list in the answer, this feature value is 1; otherwise it is 0.
  – GuideSimilarity
    • Cosine similarity between the guide words and the question title.
      – Guide words: "List 4: Three clever ways to treat chronic pharyngitis"
      – Question title: "Question: How to treat chronic pharyngitis?"
  – ContentSimilarity
    • Cosine similarity between the list content and the question.

Select the Best List
• Features:
  – VPRatio
    • Proportion of words in the list content that are verbs or prepositions.
  – SummaryScore
    • If the summarized answer contains N sentences and a visible list contains k of those N sentences, the list gets a summary score of k/N.
• Method:
  – Each feature takes a value in [0, 1]; a learning-to-rank model is used to learn the weight of each feature.

Experiment and Analysis
• Dataset:
  – Chose 1179 questions from Baidu Zhidao; 358 (30%) of them have visible lists.
  – 196 (55%) of those 358 questions have more than one visible list.
  – Manually assigned scores for the 196 questions with more than one visible list:
    • 1: high quality; 0: low quality.
• Two evaluations:
  – Evaluate the method of selecting the best list.
  – Evaluate the quality of the visible list as the answer.

Result of Selected Visible Lists
[Results table or figure not preserved.]
* Random selection: 51.7%

Evaluate Visible List as Answer
• Manually compare the quality of the “Best Answer” and the visible list for each question:
  – Mainly focus on relevance to the question, completeness, and whether redundant information is included.
• The average length of a visible list is 600 words, while the average length of the “Best Answer” is more than 1400 words.

Conclusion
• Relying on similar questions and their answers from cQA portals, we propose answer generating methods for List-type and Solution-type questions:
  – List-type questions: based on the clustering of answer points.
  – Solution-type questions: based on visible lists.

Future Work
• List-type questions:
  – Do further research to split answers into answer points more robustly.
• Solution-type questions:
  – Introduce more semantic features to improve the semantic relevance between the selected list and the question.
• Other types of questions:
  – Do further research to generate high-quality answers.

Thanks
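
Supplementary sketch: the list-selection features described for Solution-type questions (cosine similarity, VPRatio, SummaryScore) and a weighted combination of them could look roughly like the Python below. This is an illustration under stated assumptions, not the authors' system: tokenization and part-of-speech tagging are assumed to be done beforehand, the tag names `'v'` and `'p'` are hypothetical, and the weights passed to `score` stand in for whatever a learning-to-rank model would actually learn.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summary_score(list_sentences, summary_sentences):
    """k/N: fraction of the N summary sentences that the list contains."""
    if not summary_sentences:
        return 0.0
    k = sum(1 for s in summary_sentences if s in list_sentences)
    return k / len(summary_sentences)


def vp_ratio(tagged_tokens):
    """Share of tokens tagged verb ('v') or preposition ('p'); tags are assumed."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, tag in tagged_tokens if tag in ("v", "p"))
    return hits / len(tagged_tokens)


def score(features, weights):
    """Linear combination of feature values; weights are placeholders here."""
    return sum(weights[name] * value for name, value in features.items())


def select_best_list(candidates, weights):
    """Return the candidate list whose weighted feature score is highest."""
    return max(candidates, key=lambda c: score(c["features"], weights))


if __name__ == "__main__":
    # Hypothetical weights and feature values, for illustration only.
    weights = {"FirstList": 0.2, "GuideSimilarity": 0.5}
    candidates = [
        {"id": "list-1", "features": {"FirstList": 1.0, "GuideSimilarity": 0.1}},
        {"id": "list-2", "features": {"FirstList": 0.0, "GuideSimilarity": 0.9}},
    ]
    print(select_best_list(candidates, weights)["id"])
```

Each feature function returns a value in [0, 1], matching the feature range stated in the slides, so the learned weights can be compared directly across features.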