Recommendation Algorithms for Optimizing Hit Rate, User Satisfaction and Website Revenue

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)
Xin Wang, Yunhui Guo, Congfu Xu∗
Institute of Artificial Intelligence, College of Computer Science
Zhejiang University
cswangxinm@gmail.com, {gyhui, xucongfu}@zju.edu.cn
∗ Corresponding author
Abstract
We usually use hit rate to measure the performance of item recommendation algorithms. In addition to hit rate, we consider two additional important factors which are ignored by most previous works. First, we consider whether users are satisfied with the recommended items. It is possible that a user has bought an item but dislikes it; hence a high hit rate may not reflect high customer satisfaction. Second, we consider whether the website retailers are satisfied with the recommendation results. If a customer is interested in two products and wants to buy one of them, it may be better to suggest the item which can bring more profit. Therefore, a good recommendation algorithm should not only consider improving hit rate but also consider optimizing user satisfaction and website revenue. In this paper, we propose two algorithms for the above purposes and design two modified hit-rate-based metrics to measure them. Experimental results on 10 real-world datasets show that our methods can not only achieve better hit rate, but also improve user satisfaction and website revenue compared with state-of-the-art models.

1 Introduction

Recommender systems have been widely applied to online shopping websites for better user experience. Traditional recommendation algorithms try to leverage item aspects and historical user behaviors to exploit potential user preferences. Broadly, a customer's behavior can be divided into two types: explicit feedback and implicit feedback. The former, such as rating scores, reveals the preference degree of buyers, and the latter, such as browsing and clicking information, records valuable customer actions. Based on them, most well-known methods utilize the collaborative filtering (CF) technique, which learns the correlations of users and items beneath the feedback, in order to recognize potential user-item associations [Koren et al., 2009]. To measure recommendation performance, researchers have also considered designing good metrics. For rating prediction, we usually evaluate the difference between predicted rating values and actual rating values by computing root mean square error (RMSE), mean square error (MSE) or mean absolute error (MAE). For item recommendation, we consider whether the most relevant items are recommended with the highest priority by measuring a ranked item list for each user in terms of precision, recall, normalized discounted cumulative gain (NDCG), area under the curve (AUC), expected reciprocal rank (ERR), etc. [Weimer et al., 2007; Rendle et al., 2009; Shi et al., 2013; Pan and Chen, 2013].

Despite those algorithms and metrics, it is still challenging to comprehensively measure the performance of a particular recommendation algorithm. The rating prediction metrics (e.g., RMSE, MAE, MSE) or item recommendation metrics computed on whole item lists (e.g., NDCG, AUC, ERR) may not directly reflect system performance in applications, where only the top-K items are provided. On the other hand, the hit rate based metrics (e.g., precision and recall) cannot convey preference degree information. Hence, in this paper, we consider two other important factors for designing and evaluating recommendation algorithms.

User Satisfaction. We define user satisfaction as how satisfied a user is with an item (s)he has already bought. Models with higher hit rate performance may recommend more items that users tend to buy, but they do not guarantee that the proposed items suit the customers. An extreme example is that many users are not pleased with the bought items that were recommended by the system. In this case, although the models achieve high hit rate, the outcome is even worse than not recommending the items to these users, because they are disappointed with the items and may even distrust the recommender system in the future.

Website Revenue. Another goal of recommender systems is to help website retailers earn more revenue (which we narrowly define as retailer satisfaction for convenience). Unfortunately, it is ignored by almost all related algorithms. Improving hit rate can indeed help improve sales, but the effect is limited and indirect. Suppose a user intends to buy one of two selected items: recommending either of them will yield the same hit rate, but it is better to choose the item with higher profit, since that helps the website earn more money without reducing the user's satisfaction.

To illustrate the impact of these factors, we show a simple example in Table 1. It contains three similar items that the
Table 1: A customer's wish list.

                       item1   item2   item3
    satisfaction of u  3       2       5
    revenue ($)        5       10      9
    alg1 (algorithm1)  hit     -       -
    alg2 (algorithm2)  -       hit     -
    alg3 (algorithm3)  -       -       hit
customer u is interested in. The satisfaction denotes the rating value that u will give to each item if s(he) buys it, and
the revenue is related to item price. The following three lines
show the performance of three recommendation algorithms
(denoted as alg1, alg2 and alg3), and the hit means that u will
buy the item if it is recommended. From the table, we immediately see that alg1, alg2 and alg3 achieve the same performance in terms of hit rate (i.e., the same precision, recall and F1 score), because each of them hits exactly one item. However,
those methods have different performance when we consider
the other two factors. If u buys the item recommended by
alg1, he will give 3 stars to the item, and the retailer will get
5 dollars. If u buys the item recommended by alg2 instead,
the user satisfaction value will drop by 1, but the retailer will get double the revenue. For the item recommended by alg3, the user gives the highest score, and the website retailer can get nearly optimal revenue. Therefore, we think that
alg3 is the best algorithm because it satisfies both users and
the retailer, and alg2 may be better than alg1 because it helps to double the revenue at the cost of losing only 1 level of
user satisfaction.
To conclude, a good recommendation algorithm should not
only improve hit rate, but also consider optimizing user satisfaction and website revenue, or at least achieve a balance
among them. As a response, in this paper, we propose two
novel algorithms to optimize them simultaneously. To test
the performance of our methods in terms of user and retailer
satisfaction, we design two simple but convincing metrics.
The experimental results on 10 real-world datasets show that
our models provide better overall performance than existing
methods.
[Figure 1: Statistics on real-world datasets. Panel (a) plots, for each rating level (1-5), the proportion of rated items priced above the category average, for 13 Amazon categories (foods, arts, automotive, cell&accessories, clothing&accessories, health, industrial&scientific, jewelry, musical instruments, office products, patio, baby, beauty). Panel (b) plots rating values against normalized item price for an example dataset.]
2 Background

In this section, we review several related studies and perform some analysis on real-world datasets.

The most popular techniques for collaborative filtering are matrix factorization (MF) models [Koren et al., 2009], which are effective in fitting the rating matrix with low-rank approximations. MF uses latent factors U ∈ R^{D×N} to represent N users and V ∈ R^{D×M} to represent M items, and it employs the rank-D matrix U^T V ∈ R^{N×M} for rating prediction. The latent factors can be trained by adopting Singular Value Decomposition (SVD) [Koren, 2008]. For example, Salakhutdinov and Mnih [2008b; 2008a] propose a probabilistic matrix factorization (PMF) model, in which the factors and ratings are assumed to be sampled from Gaussian distributions, and the latent factors are learned by maximum likelihood estimation, which is equivalent to minimizing

    \sum_{u,i \in T_o} (U_u^T V_i - s_{ui})^2 + \lambda_1 \sum_u ||U_u||^2 + \lambda_2 \sum_i ||V_i||^2,

where s_{ui} is the rating score that u gives to i, T_o denotes the set of observed (user, item) pairs, and ||U||^2 and ||V||^2 are regularization terms for avoiding over-fitting.
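To make this objective concrete, here is a minimal NumPy sketch of one SGD pass for such a PMF-style model; the factor shapes follow the paper's notation (D×N and D×M), while the toy data, learning rate and regularization weight are illustrative assumptions.

```python
import numpy as np

def pmf_sgd_epoch(U, V, observed, lr=0.005, lam=0.02):
    """One SGD pass over observed (user, item, rating) triples.

    Minimizes sum over observed pairs of (U_u^T V_i - s_ui)^2
    plus lam * (||U_u||^2 + ||V_i||^2).
    """
    for u, i, s_ui in observed:
        err = U[:, u] @ V[:, i] - s_ui            # prediction error for this pair
        grad_u = 2 * err * V[:, i] + 2 * lam * U[:, u]
        grad_v = 2 * err * U[:, u] + 2 * lam * V[:, i]
        U[:, u] -= lr * grad_u
        V[:, i] -= lr * grad_v
    return U, V

# toy usage with random data
rng = np.random.default_rng(0)
D, N, M = 10, 100, 200
U = 0.1 * rng.standard_normal((D, N))             # user factors, one column per user
V = 0.1 * rng.standard_normal((D, M))             # item factors, one column per item
observed = [(rng.integers(N), rng.integers(M), rng.integers(1, 6)) for _ in range(1000)]
for _ in range(20):
    U, V = pmf_sgd_epoch(U, V, observed)
```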
Many MF models were originally proposed for rating prediction, hence they are weak at ranking tasks. To improve hit rate performance on implicit feedback, one-class collaborative filtering (OCCF) [Pan et al., 2008] and implicit matrix factorization (iMF) [Hu et al., 2008; Lin et al., 2014] were proposed based on the assumption of pointwise preferences on items. Both of them treat users' feedback as an indication of positive and negative preferences with different levels (c_{ui}) and try to minimize

    \sum_{u,i \in T_o} c_{ui} (1 - U_u^T V_i)^2 + \sum_{u,j \in T_{uo}} c_{uj} (0 - U_u^T V_j)^2

over the (user, item) pairs, where T_{uo} denotes the unobserved pairs. In addition to these absolute preference assumptions, many pairwise algorithms such as BPR and its extensions have been proposed [Rendle et al., 2009; 2010; Krohn-Grimberghe et al., 2012; Pan and Chen, 2013]. They directly optimize over observed and unobserved (user, item) pairs with relative preference assumptions. Although those algorithms are designed for implicit feedback, they can also be applied to ranking items with explicit information; the drawback is that the user satisfaction information (i.e., rating values) cannot be efficiently leveraged.

Nowadays, more and more methods try to utilize explicit information for item recommendation. For example, Weimer et al. [2007] propose a maximum margin matrix factorization (MMMF) method, CofiRank, to optimize ranking in terms of NDCG score. Similar to CofiRank, xCLiMF, proposed by Shi et al. [2013], builds a recommendation model by generalizing the collaborative less-is-more filtering assumption and optimizes expected reciprocal rank (ERR). ListRank [Shi et al., 2010] combines a list-wise learning-to-rank algorithm with MF; it optimizes a loss function that represents the cross-entropy of the output rating lists and training lists. These algorithms perform well when the data is dense, but they may be unsuitable for sparse datasets, because few explicit ratings can be used to compare and learn users' preferences.

Beyond product ratings, directly optimizing website revenue is another interesting topic. Although a few studies have considered how to set prices with the help of recommender systems [Bergemann and Ozmen, 2006] or how to recommend discounted items to users [Kamishima and Akaho, 2011], they do not consider optimizing the website revenue directly. A good recommender system should not only provide potential items that users like, but also let the website earn as much money as possible. We study many real-world datasets and find that it is feasible to optimize both user satisfaction and website revenue in applications. We briefly discuss this here by making a qualitative analysis of the relationship between user satisfaction and website revenue. Figure 1(a) shows our statistics on 13 Amazon datasets. The horizontal axis is the user's satisfaction and the vertical axis represents the proportion of items which are above average price. Figure 1(b) records the rating values and related normalized item prices of an example dataset. It is obvious that user satisfaction and website sales have some positive correlation and can be optimized simultaneously. Overall, we can draw the following conclusions: (1) For products in the same category, item price has some impact on user satisfaction. Items with higher price may have good quality and hence tend to receive higher rating values. (2) Involving price information can not only benefit the website retailer, but also help to predict user satisfaction, because we can learn the acceptable price range of each user and avoid recommending items that users do not accept.
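The statistic behind Figure 1(a) is easy to reproduce. The sketch below is only one way to compute it; the file name and column names (rating, price) are assumptions about how a preprocessed Amazon category might be stored, not the authors' actual pipeline.

```python
import pandas as pd

# one row per review, with the star rating and the price of the reviewed item
df = pd.read_csv("reviews_with_prices.csv")      # hypothetical preprocessed file
avg_price = df["price"].mean()

# for each rating level 1-5: fraction of reviews whose item is priced above the
# category average, i.e. the quantity on the vertical axis of Figure 1(a)
above_avg = (df.assign(above=df["price"] > avg_price)
               .groupby("rating")["above"]
               .mean())
print(above_avg)  # an increasing trend indicates the positive rating-price correlation
```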
3 Our Models and Metrics

In this section, we propose two different solutions to optimize Hit rate, user Satisfaction and website Revenue (HSR) simultaneously. We denote the probability (or the tendency) of a user u buying an item i by P(b_{ui}). We use s_{ui} to indicate the rating score that u gives to i and use P(s_{ui}) to represent how much u prefers i after (s)he has already bought it. Specifically, we assume that the greater the rating score u gives to i in the dataset, the more likely it is that u likes i. For the online retailer, we denote by r_{ui} the revenue it gets when u buys i. Therefore, P(r_{ui}) can be regarded as retailer satisfaction. In this paper, the revenue information is considered as the only factor influencing retailer satisfaction and is determined by hit rate and item price. The more sales the retailer achieves, the higher the value of P(r_{ui}) will be. With the above assumptions, our goal is to maximize P(b_{ui}), P(s_{ui}) and P(r_{ui}).

3.1 Algorithm 1: HSR-1

Since b_{ui} has two states, i.e., b_{ui} ∈ {buy, not buy}, we use a Bernoulli distribution to model it for HSR-1. The probability of b_{ui} is defined as follows:

    P(b_{ui} | δ_{ui}) = σ(u, i)^{δ_{ui}} (1 − σ(u, i))^{1 − δ_{ui}},    (1)

where δ_{ui} is an indicator parameter: δ_{ui} = 1 if u buys i in the training set, otherwise δ_{ui} = 0. Here σ(u, i) is the sigmoid function, with f_b(u, i) denoting the user-item association,

    σ(u, i) = 1 / (1 + e^{−f_b(u,i) + c}),    (2)

where c is a constant.

Our next task is to model user satisfaction and retailer satisfaction given b_{ui}. We note that some items, namely (1) items that users like but do not intend to buy, and (2) items with high price but irrelevant to users, should not be over-optimized due to the risk of reducing the hit rate. Hence, s_{ui} and r_{ui} are considered only when δ_{ui} = 1. In other words, we optimize s_{ui} and r_{ui} only when the (user, item) pair is observed. Since the correlation of s_{ui} with r_{ui} is clearly positive based on our analysis in the previous section, we assume they are sampled from a bivariate normal distribution, that is,

    P(s_{ui}, r_{ui} | δ_{ui}) = { K_1 exp[ −K_2 ( Δ_s^2/σ_s^2 + Δ_r^2/σ_r^2 − 2ρΔ_sΔ_r/(σ_sσ_r) ) ] }^{δ_{ui}},    (3)

where K_1 = 1/(2πσ_sσ_r \sqrt{1 − ρ^2}), K_2 = 1/(2(1 − ρ^2)), Δ_s = (s_{ui} − f_s(u, i)) and Δ_r = (r_{ui} − f_r(u, i)). Here σ denotes the standard deviation of each factor and ρ is the correlation between s_{ui} and r_{ui}. f_s(u, i) and f_r(u, i) can be regarded as satisfaction functions for u and i.

Because s_{ui} and r_{ui} have different ranges, in applications we standardize both of them to [0, L]. For example,

    r̃_{ui} = L (r_{ui} − r_{Min}) / (r_{Max} − r_{Min}),    (4)

where r_{Max} and r_{Min} are the supremum and infimum of the item price set. We then formulate the maximum posterior to derive our optimization criterion:

    OPT(HSR-1) ∝ ln \prod_{u,i \in T} P(b_{ui} | δ_{ui}) P(s̃_{ui}, r̃_{ui} | δ_{ui}),    (5)

where some model parameters are omitted for the sake of brevity. We assume b_{ui}, s_{ui} and r_{ui} are influenced by U_u and V_i, and a straightforward assumption is that f_b(u, i) = f_s(u, i) = f_r(u, i) = U_u^T V_i.

We add regularization terms ||U||^2 and ||V||^2 to avoid over-fitting, and conduct a stochastic gradient descent (SGD) algorithm to update U and V. Different from PMF, we update the latent factors with both observed and unobserved (user, item) pairs. Specifically, in each iteration, we sample two pairs from each class and update the latent factors in turn. In order to make a prediction, we compute the expected value U_u^T V_i for each (u, i) pair after finishing the update process. For u, the item with higher U_u^T V_i will rank higher and will be recommended with higher probability.
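The following sketch is our reading of the HSR-1 update implied by Eqs. (1)-(5) with f_b = f_s = f_r = U_u^T V_i; the gradient is taken on the log of Eq. (5) for a single pair. The hyper-parameter defaults (σ_s, σ_r, the regularization weight) and the handling of unobserved pairs are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sigmoid(x, c=0.0):
    return 1.0 / (1.0 + np.exp(-x + c))                 # Eq. (2)

def hsr1_step(U, V, u, i, delta, s_tilde=None, r_tilde=None,
              lr=0.01, sig_s=1.0, sig_r=1.0, rho=0.1, lam=0.01):
    """One gradient-ascent step on the log of Eq. (5) for a single (u, i) pair.

    delta = 1 for an observed pair (then s_tilde, r_tilde are the rating and
    revenue rescaled to [0, L], cf. Eqs. (3)-(4)); delta = 0 for an unobserved pair.
    """
    uu, vi = U[:, u].copy(), V[:, i].copy()
    x = uu @ vi                                          # f_b = f_s = f_r = U_u^T V_i
    if delta == 1:
        k2 = 1.0 / (2.0 * (1.0 - rho ** 2))
        ds, dr = s_tilde - x, r_tilde - x
        # d/dx [ log sigma(x) - k2*(ds^2/sig_s^2 + dr^2/sig_r^2 - 2*rho*ds*dr/(sig_s*sig_r)) ]
        g = (1.0 - sigmoid(x)) + 2.0 * k2 * (ds / sig_s**2 + dr / sig_r**2
                                             - rho * (ds + dr) / (sig_s * sig_r))
    else:
        g = -sigmoid(x)                                  # d/dx log(1 - sigma(x))
    U[:, u] += lr * (g * vi - lam * uu)                  # ascent step with L2 regularization
    V[:, i] += lr * (g * uu - lam * vi)
    return U, V
```

Each training iteration would call this step for pairs sampled from both the observed and the unobserved class, which is why the per-iteration cost stays at O(D).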
3.2 Algorithm 2: HSR-2

Because the Gaussian assumption in HSR-1 may not be an optimal choice for modeling discrete values (i.e., s_{ui}), we use a binomial distribution to model user satisfaction, inspired by [Wu, 2009; Kim and Choi, 2014]. We further integrate b_{ui} into s_{ui} for consistency. The probability of s_{ui} can then be written as:

    P(s_{ui}) = B(s_{ui} | L, σ(u, i)) = \binom{L}{s_{ui}} σ(u, i)^{s_{ui}} (1 − σ(u, i))^{L − s_{ui}},    (6)

where B denotes the binomial distribution and L is the maximal rating score in the dataset.

Intuitively, the binomial distribution is always unimodal, which is more reasonable than Gaussian models. HSR-2 has another advantage over HSR-1: U_u^T V_i will not be limited by s_{ui} and r_{ui} when optimizing σ(u, i).

We also use a binomial distribution to model r_{ui}. Note that r_{ui} need not be an integer, because the first term in Eq. (6) is eliminated in the optimization process. Similar to HSR-1, we also assume user satisfaction and retailer satisfaction are correlated but conditionally independent given U_u and V_i. Combining the distributions of s_{ui} and r_{ui}, and formulating the maximum posterior, we obtain the following optimization criterion:

    OPT(HSR-2) ∝ ln \prod_{u,i \in T} P(s̃_{ui}) P(r̃_{ui} | δ_{ui}).    (7)

We also conduct an SGD algorithm to update U and V for HSR-2, and the learning process is similar to HSR-1. In each iteration, the time complexity of our methods is O(D), so the execution time is comparable to existing pairwise item recommendation methods.
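For the binomial terms of HSR-2, the per-pair log-likelihood and its gradient have a particularly simple form, which also makes concrete why U_u^T V_i is not bounded by s_{ui} during optimization. The snippet below is a sketch of that observation under the same illustrative assumptions as before (L = 5, as in the experiments).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binomial_loglik(s, x, L=5):
    """log B(s | L, sigmoid(x)) from Eq. (6), dropping the constant binomial coefficient."""
    p = sigmoid(x)
    return s * np.log(p) + (L - s) * np.log(1.0 - p)

def binomial_grad(s, x, L=5):
    """d/dx of the log-likelihood above; it collapses to s - L * sigmoid(x)."""
    return s - L * sigmoid(x)

# The same two functions can be reused for the rescaled revenue r_tilde: because the
# binomial coefficient is dropped, the first argument does not have to be an integer.
```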
3.3 Discussion and Explanation

Both HSR-1 and HSR-2 rank items based on U_u^T V_i. This is because if U_u^T V_i is greater, P(b_{ui}), P(s_{ui}) and P(r_{ui}) will also be greater, and hence u will be more likely to choose and prefer i, while the website retailer will get relatively higher revenue. This can be explained with Figure 2, where the horizontal axis denotes user satisfaction, the vertical axis represents website revenue, and the contours express the quantity P(s_{ui})P(r_{ui}) that HSR tries to optimize. The black points represent the items that u wants to buy and the gray points are irrelevant items.

[Figure 2: Model comparison. Both panels plot user satisfaction (horizontal axis) against retailer satisfaction (vertical axis) with contours of P(s_{ui})P(r_{ui}); panel A corresponds to CofiRank/ListRank/xCLiMF and panel B to HSR-1/HSR-2.]

We now compare HSR with other related models and give an interpretation of their assumptions. For implicit feedback, algorithms such as iMF and BPR aim to find the most relevant items but ignore the horizontal and vertical directions. Therefore, they do not consider whether u likes the recommended items or not. The recommendation methods based on explicit feedback, e.g., CofiRank, ListRank and xCLiMF, search for potential items along direction A, in which the items that u likes most are supposed to be recommended first. However, because they do not try to avoid the irrelevant items when training the models, some points with high U_u^T V_i may be irrelevant to u in the test set.

For HSR, we search for items along direction B, which accounts for both user satisfaction and retailer satisfaction. We also try to avoid the irrelevant items in the training set through the aforementioned Bernoulli or binomial assumptions, so cheap but relevant items still get many opportunities to be recommended. From Eq. (5) and Eq. (7) we find that by involving item price information, HSR-1 and HSR-2 can not only globally optimize the website revenue, but also learn the acceptable price range of each customer, which is beneficial for learning user satisfaction. For these reasons, HSR is more comprehensive than existing approaches.

We note that for some datasets, item price and user satisfaction may not be obviously correlated. In this case, we still try to balance all the factors, since doing so can improve overall performance. In practice, we can also adjust the weights of the hit rate part, the user satisfaction part and the revenue part of Eq. (5) and Eq. (7) for different purposes.

3.4 The Metrics

To measure user satisfaction and website revenue, we design two modified hit rate based metrics, satisfaction recall (S-R) and price recall (P-R), as follows:

    S-R = M(s) R(b),    (8)
    P-R = M(r) R(b),    (9)

where R(b) is the recall metric on the test set. It can be represented as:

    R(b) = (1/N) \sum_u |L_u^{tr} ∩ L_u^{te}| / |L_u^{te}|,    (10)

where N is the number of users, L_u^{tr} denotes the recommended item set for u, and L_u^{te} is the observed item set. M(s) and M(r) are the mean rating score and the mean price of hit items in the test set. For example,

    M(r) = \sum_{u,i} r_{ui} δ(i ∈ L_u^{tr} ∩ L_u^{te}) / \sum_u |L_u^{tr} ∩ L_u^{te}|.    (11)

From Eq. (8), we observe that if an algorithm achieves a high S-R score, it must guarantee both the recall and the mean rating score of hit items. The same idea applies to P-R, which balances both the recall and the mean price of hit items. Hence, our metrics describe overall system performance.
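A direct way to compute these two metrics from a finished run is sketched below; the dictionary-based data layout is an assumption made for the example, not a prescribed interface.

```python
def sr_pr(recommended, test_items, ratings, prices):
    """Satisfaction recall (Eq. 8) and price recall (Eq. 9) at a fixed list size.

    recommended: dict user -> top-K recommended item list (L_u^tr)
    test_items:  dict user -> set of observed test items (L_u^te)
    ratings, prices: dict (user, item) -> s_ui / r_ui for test pairs
    """
    hit_pairs, per_user_recall = [], []
    for u, rec in recommended.items():
        te = test_items.get(u, set())
        hits = [i for i in rec if i in te]
        hit_pairs.extend((u, i) for i in hits)
        per_user_recall.append(len(hits) / len(te) if te else 0.0)
    recall = sum(per_user_recall) / len(per_user_recall)               # R(b), Eq. (10)
    if not hit_pairs:
        return 0.0, 0.0
    m_s = sum(ratings[(u, i)] for u, i in hit_pairs) / len(hit_pairs)  # M(s)
    m_r = sum(prices[(u, i)] for u, i in hit_pairs) / len(hit_pairs)   # M(r), Eq. (11)
    return m_s * recall, m_r * recall                                  # S-R, P-R
```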
4 Experiments

4.1 Datasets

To fairly evaluate the performance of our models, we study 10 Amazon datasets provided by [McAuley and Leskovec, 2013]. Each dataset contains rich review information such as rating scores, item price, user comments and review time. We only consider user ratings and item price. Related statistics are summarized in Table 3. We preprocess the data by dropping the items without rating or price information, and subdivide the data into training, validation and test sets, where 80% is used for training and the remaining 20% is evenly split between validation and test sets.

Table 3: Basic statistics of the data.

    dataset                  #users   #items   avg rating   avg price ($)   sparsity
    automotive               116898   40256    4.1641       56.0249         0.0035%
    beauty                   144771   19159    4.1712       23.9697         0.0077%
    clothing&accessories     14849    2278     3.9649       23.0374         0.0571%
    industrial&scientific    26381    17155    4.3450       42.1607         0.0192%
    musical instruments      52665    9532     4.2403       59.8015         0.0132%
    shoes                    4762     1275     4.2541       42.0972         0.1966%
    software                 33362    3823     3.3448       54.5471         0.0334%
    tools&home               235731   36784    4.0681       59.3773         0.0037%
    toys&games               215373   30835    4.1607       46.1495         0.0045%
    video games              174014   12257    3.9857       45.7538         0.0157%
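The preprocessing and split described above can be reproduced along the following lines; shuffling the whole rating list (rather than splitting per user) is our assumption, while the 80/10/10 fractions come from the paper.

```python
import numpy as np

def split_ratings(triples, seed=0):
    """80% train, 10% validation, 10% test over (user, item, rating, price) tuples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(triples))
    n_train = int(0.8 * len(triples))
    n_val = int(0.1 * len(triples))
    train = [triples[j] for j in idx[:n_train]]
    val = [triples[j] for j in idx[n_train:n_train + n_val]]
    test = [triples[j] for j in idx[n_train + n_val:]]
    return train, val, test
```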
4.2 Baselines and Parameter Settings

We compare our models with 4 state-of-the-art ranking methods: (1) CofiRank, (2) ListRank, (3) iMF and (4) BPR. CofiRank [Weimer et al., 2007] and ListRank [Shi et al., 2010] are listwise methods for item ranking with explicit feedback (e.g., ratings); both of them directly optimize a list of items based on rating values. iMF [Hu et al., 2008; Lin et al., 2014] is a pointwise method that updates observed and unobserved pairs individually. BPR [Rendle et al., 2009] adopts a pairwise ranking method with the assumption that the observed (user, item) pairs are more positive than the unobserved (user, item) pairs, and optimizes this comparison in each iteration. For iMF and BPR, we take a pre-processing step, keeping the ratings as observed positive feedback and treating the other (user, item) pairs as unobserved feedback. We compare against iMF and BPR because they perform better on very sparse datasets than the other baseline methods.

We set the learning rate to 0.001 for iMF, ListRank and HSR-2, and to 0.01 for CofiRank, BPR and HSR-1. We fix the latent dimension D = 10 and choose the regularization coefficient from λ ∈ {0.001, 0.01, 0.1, 1}. The correlation ρ is set to 0.1 and L = 5. We adopt precision, recall and our proposed metrics S-R and P-R to test the performance. The length of the recommendation lists is fixed at K = 5 for all metrics.
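For reference, precision and recall at a fixed K (K = 5 here) can be computed from the ranked lists as follows; this is a generic sketch rather than the authors' evaluation code.

```python
def precision_recall_at_k(recommended, test_items, k=5):
    """Average precision@k and recall@k over users with non-empty test sets.

    recommended: dict user -> ranked list of item ids
    test_items:  dict user -> set of held-out test item ids
    """
    precisions, recalls = [], []
    for u, items in test_items.items():
        if not items:
            continue
        top_k = recommended.get(u, [])[:k]
        n_hit = len(set(top_k) & items)
        precisions.append(n_hit / k)
        recalls.append(n_hit / len(items))
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n
```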
4.3 Analysis of Hit Rate

We now show the hit rate performance of our methods in comparison with the baseline models. The precision and recall results are shown in Table 2. For a comprehensive comparison, we also list the improvement of the F1 score in the last row. From the table, we have the following observations:

Table 2: Precision and recall performance of compared models on 10 real-world datasets (K = 5).

    Precision (K=5)  automotive  beauty   clothing&acc.  industrial&sci.  musical instr.  shoes    software  tools&home  toys&games  video games
    CofiRank         0.0009      0.0021   0.0039         0.0006           0.0003          0.0018   0.0081    0.0005      0.0005      0.0042
    ListRank         0.0001      0.0023   0.0072         0.0009           0.0013          0.0022   0.0016    0.0009      0.0006      0.0023
    iMF              0.0038      0.0102   0.0248         0.0102           0.0050          0.0091   0.0613    0.0022      0.0011      0.0023
    BPR              0.0057      0.0301   0.0449         0.0172           0.0062          0.0080   0.0762    0.0040      0.0015      0.0078
    HSR-1            0.0072      0.0337   0.0470         0.0193           0.0073          0.0099   0.0832    0.0028      0.0012      0.0084
    HSR-2            0.0064      0.0323   0.0484         0.0239           0.0073          0.0110   0.0804    0.0050      0.0017      0.0087
    improve          26.32%      11.96%   7.80%          38.95%           17.74%          20.88%   9.19%     25.00%      13.33%      11.54%

    Recall (K=5)     automotive  beauty   clothing&acc.  industrial&sci.  musical instr.  shoes    software  tools&home  toys&games  video games
    CofiRank         0.0041      0.0102   0.0188         0.0023           0.0017          0.0045   0.0272    0.0022      0.0022      0.0152
    ListRank         0.0003      0.0108   0.0357         0.0012           0.0064          0.0036   0.0075    0.0040      0.0020      0.0012
    iMF              0.0155      0.0439   0.1215         0.0176           0.0218          0.0426   0.2341    0.0101      0.0053      0.0088
    BPR              0.0220      0.1327   0.2112         0.0641           0.0274          0.0256   0.2890    0.0188      0.0063      0.0329
    HSR-1            0.0286      0.1501   0.2199         0.0683           0.0322          0.0436   0.3124    0.0133      0.0060      0.0401
    HSR-2            0.0246      0.1446   0.2271         0.0814           0.0325          0.0514   0.3044    0.0231      0.0082      0.0410
    improve          30.00%      13.11%   7.53%          26.99%           18.61%          20.66%   8.10%     22.87%      30.16%      24.62%

    F1 improve       27.06%      12.17%   7.75%          36.24%           17.90%          20.84%   8.96%     24.62%      16.22%      13.83%

(1) Because our test sets are very sparse, the precision results are small compared with recall when K = 10. We also observe that the models which consider unobserved items (i.e., iMF, BPR, HSR-1 and HSR-2) perform much better than the models which only use rating information (i.e., CofiRank and ListRank). In the extreme case where each user has only one observed item in the training set, CofiRank and ListRank provide next to no benefit for exploiting the relevant items, and that is the reason for their poor performance in our experiments.

(2) On average, BPR is better than iMF due to its pairwise assumption. Although iMF considers the unobserved pairs, it is based on the absolute assumption that s_{u,i} is 1 when u buys i, and 0 otherwise. On the contrary, BPR, HSR-1 and HSR-2 treat feedback as relative preferences rather than absolute ones.

(3) Both HSR-1 and HSR-2 are better than the other baseline methods in terms of both precision and recall. They improve precision by an average of more than 18% and recall by an average of more than 20% compared with the best baseline models. This is mainly because HSR-1 and HSR-2 not only consider all possible (user, item) pairs, but also adopt explicit feedback information.

4.4 Analysis of User Satisfaction

We then analyse the performance of the compared models on the S-R metric and list the improvement of our models on the mean rating of hit items (M(s)). The experimental results are shown in Table 4, from which we can make the following observations and conclusions:

(1) Some models with lower precision or recall values can achieve higher S-R scores. For example, on the video games dataset, HSR-1 has lower recall than HSR-2 but achieves a better S-R due to a higher M(s).

(2) Both HSR-1 and HSR-2 are better than the other baseline models on S-R, which demonstrates the benefit of
combining b_{ui} and s_{ui}. We also find that M(s) is not improved on some datasets. This is mainly because HSR tries to balance hit rate and user satisfaction instead of simply optimizing the latter. Because HSR has much better recall than the other models, the influence of M(s) is not obvious here.

Table 4: S-R and P-R performance of compared models on 10 real-world datasets (K = 5).

    S-R (K=5)        automotive  beauty   clothing&acc.  industrial&sci.  musical instr.  shoes    software  tools&home  toys&games  video games
    CofiRank         0.0174      0.0455   0.0891         0.0106           0.0071          0.1274   0.0148    0.0092      0.0094      0.0706
    ListRank         0.0012      0.0436   0.1493         0.0053           0.0265          0.0356   0.0123    0.0172      0.0086      0.0055
    iMF              0.0672      0.1861   0.5026         0.0724           0.0990          1.0332   0.1632    0.0444      0.0228      0.0400
    BPR              0.0967      0.5644   0.8861         0.2634           0.1220          1.2592   0.0886    0.0805      0.0271      0.1466
    HSR-1            0.1249      0.6414   0.9167         0.3021           0.1415          1.3422   0.1744    0.0546      0.0268      0.1848
    HSR-2            0.1093      0.6171   0.9644         0.3443           0.1460          1.2993   0.2011    0.0992      0.0359      0.1799
    improve          29.15%      13.64%   8.84%          30.70%           19.66%          6.59%    23.21%    23.24%      32.11%      26.05%
    M(s) improve     1.09%       0.53%    1.24%          7.33%            -1.14%          -2.65%   4.54%     -1.83%      3.22%       1.33%

    P-R (K=5)        automotive  beauty   clothing&acc.  industrial&sci.  musical instr.  shoes    software  tools&home  toys&games  video games
    CofiRank         0.0303      0.0499   0.1507         0.0108           0.0176          0.3384   0.1180    0.0302      0.0346      0.1896
    ListRank         0.0053      0.0594   0.2434         0.0126           0.0615          0.0944   0.0824    0.0478      0.0238      0.0133
    iMF              0.1764      0.2392   0.7853         0.2481           0.1545          2.9148   1.2105    0.1478      0.0734      0.0989
    BPR              0.3458      0.8074   1.5440         0.8670           0.2786          3.7270   0.4779    0.3143      0.0717      0.3401
    HSR-1            0.6205      1.0306   1.6604         1.1300           0.3436          4.4324   1.3224    0.2676      0.0909      0.5586
    HSR-2            0.4901      0.9902   1.7265         1.7384           0.3343          4.3663   1.5946    0.3550      0.1339      0.6510
    improve          79.42%      27.65%   11.82%         100.51%          23.31%          18.93%   31.72%    12.95%      82.40%      91.40%
    M(r) improve     38.02%      12.92%   4.02%          51.16%           4.97%           11.25%   9.26%     19.72%      17.59%      41.19%

4.5 Analysis of Revenue

We compare our models with the other baselines on the P-R metric, and list the improvement of HSR on the mean price of hit items (M(r)). The results are summarized in the lower portion of Table 4, and we find that:

(1) There is plenty of room to optimize website revenue. Some algorithms with poor recall performance may achieve a better P-R metric and M(r). For example, on the toys&games dataset, iMF has a nearly 19% lower recall score than BPR, but it performs better than BPR on the P-R metric, because it increases the average revenue considerably at the cost of a certain decrease in recall.

(2) Our models achieve much better performance than all the other models on both the P-R metric and M(r). For many datasets, the improvement is over 50%, and for some datasets, such as industrial&scientific and video games, the performance can even double that of the best baseline models, which shows the effectiveness of HSR in integrating the price factor into the recommendation models. Because total revenue and profit are almost always correlated, our methods can with high confidence help the online retailers earn much more money.

4.6 Analysis of Stability

Finally, to study recommendation stability, we vary the recommendation size K over the values {1, 3, 5, 7, 9}; some results are shown in Figure 3. As can be seen, when K increases, the precision scores show a general downward trend, while the other metrics tend to increase. HSR-1 and HSR-2 are consistently better than the other models. In particular, as K increases, the advantage of our models on P-R becomes more and more obvious.

[Figure 3: Performance comparison with different recommendation list sizes K ∈ {1, 3, 5, 7, 9}. The four panels report precision, recall, S-R and P-R curves for iMF, BPR, HSR-1 and HSR-2 on the shoes, beauty, clothing and industrial&scientific datasets.]

5 Conclusion and Future Work

In this paper, we consider three important factors, (1) hit rate, (2) user satisfaction, and (3) website revenue, to measure the performance of recommender systems. We then propose two algorithms, called HSR-1 and HSR-2, to optimize them. Finally, two simple but convincing metrics are designed to evaluate the compared models. The experimental results show
that our models can achieve better hit rate and user satisfaction results and also significantly improve website revenue. In the future, we intend to exploit more kinds of user behavior and to design mechanisms for recommendation diversification. To better measure overall recommendation performance, we will also try to design more effective metrics.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (NSFC) No. 61272303 and the Natural Science Foundation of Guangdong Province No. 2014A030310268. We also thank Dr. Weike Pan for helpful discussions.

References

[Bergemann and Ozmen, 2006] Dirk Bergemann and Deran Ozmen. Optimal pricing with recommender systems. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 43-51. ACM, 2006.

[Hu et al., 2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th International Conference on Data Mining, ICDM'08, pages 263-272. IEEE, 2008.

[Kamishima and Akaho, 2011] Toshihiro Kamishima and Shotaro Akaho. Personalized pricing recommender system: Multi-stage epsilon-greedy approach. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec'11, pages 57-64. ACM, 2011.

[Kim and Choi, 2014] Yong-Deok Kim and Seungjin Choi. Bayesian binomial mixture model for collaborative prediction with non-random missing data. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys'14, pages 201-208. ACM, 2014.

[Koren et al., 2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.

[Koren, 2008] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'08, pages 426-434. ACM, 2008.

[Krohn-Grimberghe et al., 2012] Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme. Multi-relational matrix factorization using Bayesian personalized ranking for social network data. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, WSDM'12, pages 173-182. ACM, 2012.

[Lin et al., 2014] Christopher H. Lin, Ece Kamar, and Eric Horvitz. Signals in the silence: Models of implicit feedback in a recommendation system for crowdsourcing. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, AAAI'14, pages 22-28, 2014.

[McAuley and Leskovec, 2013] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys'13, pages 165-172. ACM, 2013.

[Pan and Chen, 2013] Weike Pan and Li Chen. GBPR: Group preference based Bayesian personalized ranking for one-class collaborative filtering. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI'13, pages 2691-2697. AAAI Press, 2013.

[Pan et al., 2008] Rong Pan, Yunhong Zhou, Bin Cao, Nathan Nan Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In Proceedings of the 8th International Conference on Data Mining, ICDM'08, pages 502-511. IEEE, 2008.

[Rendle et al., 2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 452-461. AUAI Press, 2009.

[Rendle et al., 2010] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW'10, pages 811-820. ACM, 2010.

[Salakhutdinov and Mnih, 2008a] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, ICML'08, pages 880-887. ACM, 2008.

[Salakhutdinov and Mnih, 2008b] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.

[Shi et al., 2010] Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems, RecSys'10, pages 269-272. ACM, 2010.

[Shi et al., 2013] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, and Alan Hanjalic. xCLiMF: optimizing expected reciprocal rank for data with multiple levels of relevance. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys'13, pages 431-434. ACM, 2013.

[Weimer et al., 2007] Markus Weimer, Alexandros Karatzoglou, Quoc Viet Le, and Alex Smola. Maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems, 2007.

[Wu, 2009] Jinlong Wu. Binomial matrix factorization for discrete collaborative filtering. In Proceedings of the 9th International Conference on Data Mining, ICDM'09, pages 1046-1051. IEEE, 2009.