Multiple Outcome Supervised Latent Dirichlet Jose San Pedro and Alexandros Karatzoglou

Late-Breaking Developments in the Field of Artificial Intelligence
Papers Presented at the Twenty-Seventh AAAI Conference on Artificial Intelligence
Multiple Outcome Supervised Latent Dirichlet
Allocation for Expert Discovery in Online Forums
Jose San Pedro and Alexandros Karatzoglou
Telefonica Research
Barcelona, Spain 08019
[email protected], [email protected]
on the history of previously answered questions. A similarity metric is established, and experts are detected by comparing their affinity with new questions in the system. Methods
based on the Query Likelihood Language Model (Li and King
2010), TF-IDF (Riahi et al. 2012), pLSA (Xu, Ji, and Wang
2012), and probabilistic topic models (Liu, Liu, and Yang
2010; Riahi et al. 2012; Ni et al. 2012) have been proposed
using this general framework. Another text-based approach
poses expert detection as a classification problem in which
users are split into two classes depending on their ability
to answer a specific question (Zhou, Lyu, and King 2012).
In our approach, we build a prediction model for users
based on the latent topics of the questions they replied to and
the answers they provided. We use a supervised learning paradigm, where
topic assignments and prediction parameters are learned
concurrently by means of Bayesian inference. With this
method we can leverage both textual features
and answer quality metrics.
Abstract
This paper presents a supervised Bayesian approach
to model expertise in online forums with application
to question routing. The proposed method extends the
well-known sLDA model to the multi-task case, accounting
for a supervised stage with multiple outputs
per document corresponding to the users of the system.
A study of the characteristics of real-world data revealed
a number of challenges in the practical application of
this model, relevant to the research community.
Online forums continue to be an active hub for information
exchange on the Web. They tend to focus on specific areas and attract communities of individuals interested in, and
sometimes knowledgeable about, the topic. An important share
of the discussions in forums follows the scheme of Community Question Answering (CQA): users formulate questions
to leverage the expertise of other knowledgeable users participating in the forum. While potential responders have the
knowledge and motivation to reply to these questions, it is often just by chance that they come across them
and can therefore submit a reply.
In this paper, we study question routing
and present a novel approach to actively push unanswered
questions to potential responders, aiming to increase their
visibility and their chance of getting a satisfactory response. Our
approach uses a Bayesian inference framework that extends Latent Dirichlet Allocation to account for authorship of questions and answers as well as community ratings.
A body of literature exists around expertise modeling in
the CQA context. Link analysis approaches exploit the relationship between users in the community to infer their level
of expertise. In this category we find methods based on variations of PageRank and HITS, classic in the IR literature to
assess the authority of websites, adapted to the CQA setting (Jurczyk and Agichtein 2007; Zhang, Ackerman, and
Adamic 2007). A related approach considers pairwise expertise comparisons between users to establish a global rank
of experts (Liu, Song, and Lin 2011). None of these methods makes use of textual features or quality metrics of the
answers provided by users.
Textual features can be used to build user profiles based
Multiple Outcome Supervised Latent Dirichlet
Allocation
We pose expert detection for question routing as a regression problem. Given an input document $d$, i.e. a question
in the forum, we want to predict a score $Y_{d,a}$ that represents the ability of user $a$ to provide a relevant response.
Specifically, we build on supervised Latent Dirichlet Allocation (sLDA; Blei and McAuliffe 2010), which predicts
a response variable based on the topics of the given
document. sLDA is restricted to a single response value per
document; we extend the model to account for multiple outcomes. Each of these outcomes represents the
expertise level of one of the users in the forum.
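To make the regression framing concrete, the following is a minimal sketch of how predicted scores could be turned into a routing decision. It assumes a plain linear predictor (as in standard sLDA), which is only one possible choice; the function name `rank_users`, the array layout, and all numbers are hypothetical and for illustration only.

```python
import numpy as np

def rank_users(topic_proportions, eta, top_n=3):
    """Rank users by predicted score for a single question.

    topic_proportions: shape (K,), empirical topic frequencies of the question.
    eta: shape (A, K), one (hypothetical) coefficient vector per user.
    Returns the indices of the top_n users (best first) and all scores.
    """
    # Assumption: linear predictor, one score per user.
    scores = eta @ topic_proportions
    order = np.argsort(-scores, kind="stable")  # descending order
    return order[:top_n], scores

# Toy example: 3 topics, 4 users (made-up values).
z_bar = np.array([0.7, 0.2, 0.1])
eta = np.array([
    [0.1, 0.1, 0.1],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0],
])
top, scores = rank_users(z_bar, eta, top_n=2)
```

In this toy case the question is dominated by the first topic, so the user with the largest coefficient on that topic is ranked first.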
Our model, depicted in Fig. 1, can be expressed as a generative process that generates questions (and optionally associated answers) and assigns them scores. As previously mentioned, scores are multidimensional, with dimensionality equal to
the number of users in the community. In contrast to sLDA,
our model does not assume normality in the response variables $Y_{d,a}$. We will present additional evidence supporting
this decision at a later point in the paper, and we will also
specify the distributions that we consider for modeling responses. Let us assume for the moment that responses come
from a statistical distribution parametrized by $\eta_a$. With this
Copyright © 2013, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
rier in the United Kingdom. Customers use this forum to
ask about non-severe issues for which intervention from the
company is not necessary. To encourage participation, GiffGaff uses a reward system in which users who provide
useful responses earn monetary prizes. The utility of users’
responses is peer-reviewed by the rest of the community: any
user can mark responses as especially useful or relevant by
assigning “kudos” to these messages. We use kudos to build
the observed response values $Y_{d,a}$ to train our model, as they
provide a strong signal of response quality.
[Figure 1 graphical model. Nodes: α, θ_d, Z_{d,n}, W_{d,n}, φ_k, β, Y_{d,a}, η_a. Plates: n ∈ N_d, d ∈ D, k ∈ K, a ∈ A.]
We use a data sample that covers forum messages posted
between January 2010 and December 2011, two years' worth
of data. This sample includes 78,633 questions, a mean of
9.92 replies per question, 46,288 users, and 73,333 kudos
assigned. A preliminary study of the characteristics of this
dataset reveals a clear deviation from normality in the distribution of response values. To illustrate, of
the 78,633 questions in the dataset, users reply to 18.39 on
average, with a median of just 1 answer. Low participation
levels are typical of this kind of community, where most
of the answers are generated by a minority of contributors.
Figure 1: Latent Dirichlet allocation as directed factor graph.
elements, we use statistical inference to find the parameter
values that best explain the data. We follow the approach
of Boyd-Graber and Resnik (Graber and Resnik 2010), based
on stochastic EM. The E-step is performed using collapsed
Gibbs sampling to infer topic assignments for the terms in
the documents. Because there is a dependency between topic
assignments and observed responses, the inferred topic distribution favours topic assignments that minimize the difference between the predicted and the observed response.
We can sample new topic assignments from the following
conditional distribution:
$$
p(Z_{d,n} = k \mid Z_{-(d,n)}, w_{d,n}, Y_d, \alpha, \beta, \eta) \propto \left(n_{d,k}^{-(d,n)} + \alpha_k\right) \times \frac{n_{w_{d,n},k}^{-(d,n)} + \beta_{w_{d,n}}}{N_k^{-(d,n)} + W} \prod_{a=1}^{A} p(Y_{d,a} \mid \eta_a, Z_{-(d,n)}, Z_{d,n} = k)
$$
The sparsity of this particular dataset renders the application of linear regression modeling methods infeasible. Nevertheless, we experimented with Generalized Linear Models using several approaches (Ordinary Least Squares, Weighted
Least Squares, Multitask Learning and Poisson Regression).
In all cases, the model performed significantly better
than the TF-IDF baseline, but was clearly outperformed by a
second baseline based on popularity ranking, in which users
are ranked based exclusively on their number of kudos, independently of the question content.
Our next step will consider the use of zero-inflated
models for the regression stage. In particular, Zero-Inflated
Negative Binomial (ZINB) regression can be used to model
count variables with an excessive number of zeros, which matches
the sparsity characteristics of our collection. This
method assumes that the excess of zeros is generated by a
separate process that is modeled independently, normally by
means of logistic regression.
One of the most interesting features of ZINB is the ability
to use different features for the two modeling substages: the
excess of zeros and the actual counts. In our problem setting,
the excess of zeros can be seen as reflecting the likelihood of a user
coming across a question. We hypothesize that the inclusion
of social features (e.g. connectivity degree with the asker), temporal features of question posting (e.g. time of the day, day of
the week) as well as user-based participation metrics (e.g.
fraction of questions answered) will add rich a priori knowledge about the community, which we expect to significantly
boost the model accuracy.
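As an illustration of the two substages, here is a minimal sketch of a ZINB probability mass function. It assumes the common mean/dispersion parametrization of the negative binomial (variance `mu + alpha * mu**2`) and a logistic link for the zero-inflation probability; the names `nb_logpmf`, `zinb_pmf`, and `zero_prob`, and all feature values, are hypothetical and not taken from the paper.

```python
import math

def nb_logpmf(y, mu, alpha):
    """Log pmf of a negative binomial with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # shape parameter
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def zero_prob(features, gamma):
    """Logistic model for the probability of a structural (excess) zero."""
    s = sum(f * g for f, g in zip(features, gamma))
    return 1.0 / (1.0 + math.exp(-s))

def zinb_pmf(y, pi, mu, alpha):
    """Mixture: with probability pi emit a structural zero, otherwise draw a NB count."""
    nb = math.exp(nb_logpmf(y, mu, alpha))
    if y == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb

# Toy usage: the zero stage and the count stage can use different inputs.
pi = zero_prob([1.0, 0.5], [0.2, 0.4])   # e.g. hypothetical social/temporal features
p_zero = zinb_pmf(0, 0.6, 2.0, 0.5)      # fixed pi = 0.6 for a checkable value
total = sum(zinb_pmf(y, 0.6, 2.0, 0.5) for y in range(500))
```

The separation between `zero_prob` and `nb_logpmf` mirrors the idea that whether a user sees a question at all, and how well they answer it, can be driven by different features.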
In the previous equation, $n_{w,k}$ refers to the number of
times word $w$ has been assigned to topic $k$, and $n_{d,k}$ refers to
the frequency of topic $k$ in document $d$. We use the standard
notation $-(d,n)$ to refer to counts that exclude the term being
sampled, as is typical in Gibbs sampling.
Finally, $N_k$ refers to the number of terms assigned to topic $k$
across all documents, and $W$ refers to the vocabulary size.
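The sampling step built from these counts can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `topic_conditional` and the array layout are hypothetical, the response likelihood is passed in as a precomputed vector because the text leaves its distribution open, and the denominator uses $N_k + W$ exactly as the notation above describes.

```python
import numpy as np

def topic_conditional(w, n_dk, n_wk, N_k, alpha, beta, log_resp_lik):
    """Normalized p(Z_{d,n} = k | ...) for one token.

    All counts are assumed to already exclude the token being sampled.
    w:            word id of the token
    n_dk:         shape (K,), topic counts in document d
    n_wk:         shape (W, K), word-topic counts
    N_k:          shape (K,), number of tokens assigned to each topic
    alpha:        shape (K,), beta: shape (W,)
    log_resp_lik: shape (K,), sum over users a of log p(Y_{d,a} | eta_a, Z)
                  if the token were assigned to topic k (assumed given)
    """
    W = n_wk.shape[0]
    log_p = (np.log(n_dk + alpha)
             + np.log(n_wk[w] + beta[w])
             - np.log(N_k + W)
             + log_resp_lik)
    log_p -= log_p.max()          # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum()

# Toy state: K = 2 topics, W = 3 words (made-up counts).
n_dk = np.array([2.0, 2.0])
n_wk = np.array([[1.0, 1.0], [0.0, 3.0], [2.0, 0.0]])
N_k = np.array([3.0, 4.0])
alpha = np.full(2, 0.5)
beta = np.ones(3)

p_flat = topic_conditional(0, n_dk, n_wk, N_k, alpha, beta, np.zeros(2))
p_resp = topic_conditional(0, n_dk, n_wk, N_k, alpha, beta, np.array([0.0, 2.0]))
new_topic = np.random.default_rng(0).choice(2, p=p_resp)
```

With a flat response term the text-only factors decide the assignment; once the response term favours the second topic, the probability mass shifts towards it, which is the coupling between topics and responses described earlier.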
After sampling topics for each document, we proceed to
find new regression parameters that maximize the likelihood
of the response variables conditioned on the current state of
topic assignments. In the following section, we detail the
different strategies for parameter optimization that we tried
and the challenges found in this stage.
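As a rough sketch of this parameter update, the snippet below fits one coefficient vector per user by ordinary least squares on the empirical topic proportions, which is the maximum-likelihood solution only under a Gaussian response assumption. OLS is just one of several strategies; the function name `fit_eta_ols` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_eta_ols(Z_bar, Y):
    """Least-squares update of the regression parameters.

    Z_bar: shape (D, K), topic proportions per document from the current
           topic assignments.
    Y:     shape (D, A), observed responses, one column per user.
    Returns eta with shape (A, K).
    """
    coef, *_ = np.linalg.lstsq(Z_bar, Y, rcond=None)  # coef has shape (K, A)
    return coef.T

# Synthetic check: recover known coefficients from noise-free responses.
rng = np.random.default_rng(0)
Z_bar = rng.dirichlet(np.ones(3), size=50)      # 50 documents, 3 topics
eta_true = np.array([[1.0, 0.0, 2.0],
                     [0.5, 3.0, 0.0]])          # 2 users
Y = Z_bar @ eta_true.T
eta_hat = fit_eta_ols(Z_bar, Y)
```

In a stochastic EM loop, a step like this would alternate with the Gibbs sweep over topic assignments until the parameters stabilize.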
The proposed approach provides a flexible and powerful framework to model multiple independent outcomes in an
sLDA context, with the ability to incorporate not just textual
and quality features, but also a priori knowledge about users.
Although conceived for expert detection in CQA contexts,
the model has other potential applications, such
as targeted advertising.
Preliminary Tests and Challenges Ahead
The study presented in this paper considers data from the
“Help & Support” forum of GiffGaff, a mobile phone car-
References
Blei, D. M., and McAuliffe, J. D. 2010. Supervised Topic
Models. ArXiv preprint arXiv:1003.0783.
Graber, J. B., and Resnik, P. 2010. Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet
allocation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP
’10, 45–55. Stroudsburg, PA, USA: Association for Computational Linguistics.
Jurczyk, P., and Agichtein, E. 2007. Discovering authorities in question answer communities by using link analysis.
In Proceedings of the sixteenth ACM conference on information and knowledge management, CIKM
’07, 919–922. New York, NY, USA: ACM.
Li, B., and King, I. 2010. Routing questions to appropriate
answerers in community question answering services. In
Proceedings of the 19th ACM international conference on
Information and knowledge management, CIKM ’10, 1585–
1588. New York, NY, USA: ACM.
Liu, M.; Liu, Y.; and Yang, Q. 2010. Predicting best answerers for new questions in community question answering. In
Proceedings of the 11th international conference on Web-age information management, WAIM’10, 127–138. Berlin,
Heidelberg: Springer-Verlag.
Liu, J.; Song, Y. I.; and Lin, C. Y. 2011. Competition-based
user expertise score estimation. In Proceedings of the 34th
international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR ’11, 425–434.
New York, NY, USA: ACM.
Ni, X.; Lu, Y.; Quan, X.; Wenyin, L.; and Hua, B. 2012. User
interest modeling and its application for question recommendation in user-interactive question answering systems.
Information Processing & Management 48(2):218–233.
Riahi, F.; Zolaktaf, Z.; Shafiei, M.; and Milios, E. 2012.
Finding expert users in community question answering. In
Proceedings of the 21st international conference companion
on World Wide Web, WWW ’12 Companion, 791–798. New
York, NY, USA: ACM.
Xu, F.; Ji, Z.; and Wang, B. 2012. Dual role model for question recommendation in community question answering. In
Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval,
SIGIR ’12, 771–780. New York, NY, USA: ACM.
Zhang, J.; Ackerman, M. S.; and Adamic, L. 2007. Expertise networks in online communities: structure and algorithms. In Proceedings of the 16th international conference
on World Wide Web, WWW ’07, 221–230. New York, NY,
USA: ACM.
Zhou, T. C.; Lyu, M. R.; and King, I. 2012. A classification-based approach to question routing in community question
answering. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion, 783–790. New York, NY, USA: ACM.