Answer Generating Methods for
Community Question Answering Portals
{Tao Haoxiong, Hao Yu, Zhu Xiaoyan}
@Tsinghua University
Department of Computer Science, Tsinghua University
Outline
• Introduction and Related Work
• List-type Question
– Answer generating method
– Method result and analysis
• Solution-type Question
– Visible list
– Select the best list
– Experiment and analysis
• Conclusion
• Future Work
Introduction
• Online community question answering (cQA) portals,
such as Soso Wenwen and Baidu Zhidao, have become
a popular way to acquire information.
• But they have some limitations:
– Users cannot get answers in real time.
– The quality of many answers is low.
Related Work
• To overcome the lack of real-time answers, cQA portals
provide a search service.
– Users need to click through links to see the full answers.
– Finding useful information takes a long time.
Related Work
• To return high-quality answers
– Predict the quality of cQA answers.
• User profile features, text features, etc.
– Use multi-document summarization to summarize answers.
• More comprehensive but less readable.
– To improve answer quality, almost all well-performing systems
introduce a question taxonomy.
Related Work
• The question taxonomy proposed by Fan Bu contains
6 question types:
Type         Proportion
List         23.8%
Solution     19.7%
Reason       18.1%
Navigation   14.8%
Fact         14.4%
Definition   7.5%
• Examples:
– List-type: List Nobel prize winners in the 1990s.
– Solution-type: How to make pizza?
Research Framework
• Propose answer generating methods for both List-type and Solution-type questions.
List-type Question
• Each answer will be a single phrase or a list of phrases.
Answer Generating Method
• Two characteristics of answers:
– The “Best Answer” often doesn’t contain all answer points.
– Answer points that are high-quality or relevant to the
question often appear in more than one answer.
• Propose a method based on clustering of answer points.
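The clustering idea can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the slides do not specify how answers are split, which similarity measure is used, or how clusters are ranked. The splitter (`split_answer_points`), the bag-of-words cosine similarity, the greedy single-pass clustering, and the default `threshold=0.5` are all assumptions made for the example.

```python
import re
from collections import Counter
from math import sqrt

def split_answer_points(answer):
    """Hypothetical splitter: one answer point per comma, semicolon, or newline."""
    parts = re.split(r"[,;\n]+", answer)
    return [p.strip().lower() for p in parts if p.strip()]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_answer_points(answers, threshold=0.5):
    """Greedy single-pass clustering (an assumed strategy).

    Returns (representative text, number of supporting answers) pairs,
    ranked by how many distinct answers mention the point.
    """
    clusters = []  # each cluster: representative text, word vector, supporting answer ids
    for idx, answer in enumerate(answers):
        for point in split_answer_points(answer):
            vec = Counter(point.split())
            best = max(clusters, key=lambda c: cosine(vec, c["vec"]), default=None)
            if best is not None and cosine(vec, best["vec"]) >= threshold:
                best["answers"].add(idx)
            else:
                clusters.append({"text": point, "vec": vec, "answers": {idx}})
    ranked = sorted(clusters, key=lambda c: len(c["answers"]), reverse=True)
    return [(c["text"], len(c["answers"])) for c in ranked]

# Toy answers to a made-up List-type question.
ranked_points = cluster_answer_points(["orange, kiwi", "kiwi; lemon", "Kiwi"])
```

Because the output is ranked by support, truncating `ranked_points` to the top few entries is one simple way to control the answer length.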
Example of the Method Result
Method Result and Analysis
• The result contains more answer points than the “Best
Answer”.
• Outputs are ranked, so the answer length is easy to control.
• Further research is needed on:
– How to split an answer into answer points.
– How to choose the clustering threshold.
Solution-type Question
• Visible List
– We chose 1179 solved Solution-type questions from Baidu
Zhidao; for 30% of them, the answers contain visible lists.
– The average length of the “Best Answer” is above 1400 words,
while the average length of a visible list is about 600 words.
– 55% of the questions with visible lists have more than one
visible list. We propose a method to select the best list.
Select the Best List
• Features:
– FirstList
• 1 if the list is the first list in the answer, otherwise 0.
– GuideSimilarity
• Cosine similarity between the guide words and the question title.
– Guide words: List 4: Three methods to skillfully treat chronic pharyngitis
– Question title: Question: How is chronic pharyngitis treated?
– ContentSimilarity
• Cosine similarity between the list content and the question.
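A minimal Python sketch of the cosine similarity behind GuideSimilarity and ContentSimilarity. The slides do not state how text is tokenized or weighted; this example assumes whitespace tokenization and raw term counts, and it uses English stand-ins for the guide words and question title. Chinese text would first need a word segmenter, which is not shown here.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts (an assumed weighting)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# English stand-ins for the guide words and the question title.
guide_words = "three methods to treat chronic pharyngitis"
question_title = "how to treat chronic pharyngitis"
guide_similarity = cosine_similarity(guide_words, question_title)
```

With non-negative term counts the value always falls in [0, 1], which matches the feature range the method expects.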
Select the Best List
• Features:
– VPRatio
• Proportion of words in the list content that are verbs or prepositions.
– SummaryScore
• If the summarized answer contains N sentences and a visible list
contains k of those N sentences, the list has a summary score
of k/N.
• Method:
– Each feature takes a value in [0, 1]; we use a learning-to-rank
model to learn the weight of each feature.
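The remaining features and the final selection step might look like the Python sketch below. It rests on several assumptions: part-of-speech tags come from an external tagger and follow the Universal POS tag names (`VERB`, `ADP`), sentences are compared by exact match, and the lists are scored by a weighted sum. The slides name only a learning-to-rank model, so the `WEIGHTS` values here are placeholders for illustration, not learned values.

```python
FEATURES = ["FirstList", "GuideSimilarity", "ContentSimilarity", "VPRatio", "SummaryScore"]

# Placeholder weights for illustration only; a learning-to-rank model would learn them.
WEIGHTS = {name: 0.2 for name in FEATURES}

def summary_score(list_sentences, summary_sentences):
    """k/N: share of the N summary sentences that also appear in the visible list."""
    if not summary_sentences:
        return 0.0
    in_list = set(list_sentences)
    k = sum(1 for s in summary_sentences if s in in_list)
    return k / len(summary_sentences)

def vp_ratio(pos_tags):
    """Share of words tagged as verb or preposition (tags assumed to come from a POS tagger)."""
    if not pos_tags:
        return 0.0
    return sum(1 for tag in pos_tags if tag in ("VERB", "ADP")) / len(pos_tags)

def select_best_list(candidates, weights=WEIGHTS):
    """Return the candidate with the highest weighted feature sum."""
    return max(candidates, key=lambda c: sum(weights[f] * c[f] for f in FEATURES))

# Two hypothetical visible lists from one answer, with made-up feature values.
candidates = [
    {"name": "list 1", "FirstList": 1, "GuideSimilarity": 0.2,
     "ContentSimilarity": 0.3, "VPRatio": 0.1, "SummaryScore": 0.25},
    {"name": "list 2", "FirstList": 0, "GuideSimilarity": 0.9,
     "ContentSimilarity": 0.8, "VPRatio": 0.3, "SummaryScore": 0.75},
]
best = select_best_list(candidates)
```

Keeping every feature in [0, 1] makes the weights directly comparable, which is presumably why the method normalizes the features before learning them.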
Experiment and Analysis
• Dataset:
– 1179 questions were chosen from Baidu Zhidao; 358 (30%) of them
have visible lists.
– 196 (55%) of these have more than one visible list.
– Scores were manually labeled for the 196 questions with more
than one visible list:
• 1: high quality; 0: low quality.
• Two evaluations:
– Evaluate the method of selecting the best list.
– Evaluate the quality of the visible list as the answer.
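One way the first evaluation could be computed is sketched below in Python. The slides do not define the metric, so this assumes it is the fraction of questions for which the selected list carries the label 1, compared against the expected result of picking a list uniformly at random. The data and the `first_list` selector are made up for the example.

```python
def selection_accuracy(questions, select):
    """Fraction of questions whose selected list is labeled 1 (high quality)."""
    hits = sum(select(q["lists"])["label"] for q in questions)
    return hits / len(questions)

def expected_random_accuracy(questions):
    """Expected accuracy when one list per question is picked uniformly at random."""
    per_question = [
        sum(lst["label"] for lst in q["lists"]) / len(q["lists"]) for q in questions
    ]
    return sum(per_question) / len(per_question)

# Hypothetical labeled data: each question has several visible lists.
questions = [
    {"lists": [{"id": "a1", "label": 1}, {"id": "a2", "label": 0}]},
    {"lists": [{"id": "b1", "label": 0}, {"id": "b2", "label": 1}, {"id": "b3", "label": 1}]},
]

def first_list(lists):
    """Trivial selector that always picks the first list."""
    return lists[0]

accuracy = selection_accuracy(questions, first_list)
baseline = expected_random_accuracy(questions)
```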
Result of Selected Visible-lists
* Random selection: 51.7%
Evaluate Visible List as Answer
• Manually compare the quality of the “Best Answer” and
the visible list for each question:
– Mainly focus on relevance to the question, completeness,
and whether redundant information is present.
• The average length of a visible list is about 600 words, while the
average length of the “Best Answer” is more than 1400 words.
Conclusion
• Relying on similar questions and their answers
from cQA portals, we propose answer generating
methods for List-type and Solution-type questions:
– List-type questions: based on the clustering of answer points.
– Solution-type questions: based on visible lists.
Future Work
• List-type questions:
– Do further research to split the answer into answer points
more robustly.
• Solution-type questions:
– Introduce more semantic features to improve the semantic
relevance between the selected list and the question.
• Other types of questions:
– Do further research to generate high-quality answers.
Thanks