Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Recommendation Algorithms for Optimizing Hit Rate, User Satisfaction and Website Revenue

Xin Wang, Yunhui Guo, Congfu Xu∗
Institute of Artificial Intelligence, College of Computer Science, Zhejiang University
cswangxinm@gmail.com, {gyhui, xucongfu}@zju.edu.cn
∗Corresponding author

Abstract

Hit rate is the usual measure of the performance of item recommendation algorithms. In addition to hit rate, we consider two important factors that are ignored by most previous work. First, we consider whether users are satisfied with the recommended items: a user may have bought an item and yet dislike it, so a high hit rate does not necessarily reflect high customer satisfaction. Second, we consider whether the website retailer is satisfied with the recommendation results: if a customer is interested in two products and wants to buy one of them, it may be better to suggest the item that brings more profit.
Therefore, a good recommendation algorithm should consider not only improving hit rate but also optimizing user satisfaction and website revenue. In this paper, we propose two algorithms for these purposes and design two modified hit-rate-based metrics to measure them. Experimental results on 10 real-world datasets show that our methods not only achieve better hit rate, but also improve user satisfaction and website revenue compared with state-of-the-art models.

1 Introduction

Recommender systems have been widely applied to online shopping websites for better user experience. Traditional recommendation algorithms leverage item aspects and historical user behaviors to exploit potential user preferences. Broadly, a customer's behavior can be divided into two types: explicit feedback and implicit feedback. The former, such as rating scores, reveals the preference degree of buyers; the latter, such as browsing and clicking information, records valuable customer actions. Based on them, most well-known methods utilize the collaborative filtering (CF) technique, which learns the correlations of users and items beneath the feedback in order to recognize potential user-item associations [Koren et al., 2009].

To measure recommendation performance, researchers have also considered designing good metrics. For rating prediction, we usually evaluate the difference between predicted and actual rating values by computing the root mean square error (RMSE), mean square error (MSE) or mean absolute error (MAE). For item recommendation, we consider whether the most relevant items are recommended with the highest priority by measuring a ranked item list for each user in terms of precision, recall, normalized discounted cumulative gain (NDCG), area under the curve (AUC), expected reciprocal rank (ERR), etc. [Weimer et al., 2007; Rendle et al., 2009; Shi et al., 2013; Pan and Chen, 2013].

Despite those algorithms and metrics, it is still challenging to comprehensively measure the performance of a particular recommendation algorithm. The rating prediction metrics (e.g., RMSE, MAE, MSE) and the item recommendation metrics computed over whole item lists (e.g., NDCG, AUC, ERR) may not directly reflect system performance in applications where only the top-K items are shown. On the other hand, the hit-rate-based metrics (e.g., precision and recall) cannot convey preference degree information. Hence, in this paper, we consider two other important factors for designing and evaluating recommendation algorithms.

User Satisfaction. We define user satisfaction as how satisfied a user is with an item (s)he has already bought. Models with a higher hit rate may recommend more items that users tend to buy, but they do not guarantee that the recommended items suit the customers. In the extreme case, many users are displeased with the items they bought on the system's recommendation. Although such a model achieves a high hit rate, the outcome is worse than recommending nothing, because the disappointed users may come to distrust the recommender system.

Website Revenue. Another goal of recommender systems is to help website retailers earn more revenue (which we narrowly define as retailer satisfaction for convenience). Unfortunately, this goal is ignored by almost all related algorithms. Improving hit rate can indeed help improve sales, but the effect is limited and indirect. Suppose a user intends to buy one of two short-listed items: recommending either of them yields the same hit rate, but it is better to choose the item with the higher profit, since that helps the website earn more money without reducing the user's satisfaction.

To illustrate the impact of these factors, we show a simple example in Table 1. It contains three similar items that a customer u is interested in. The satisfaction row gives the rating value that u would assign to each item after buying it, and the revenue row is related to item price. The last three rows show the behavior of three recommendation algorithms (denoted alg1, alg2 and alg3), where "hit" means that u will buy the item if it is recommended.

Table 1: A customer's wish list.

                     item1   item2   item3
satisfaction of u    3       2       5
revenue ($)          5       10      9
alg1 (algorithm1)    hit     -       -
alg2 (algorithm2)    -       hit     -
alg3 (algorithm3)    -       -       hit

From the table, we immediately see that alg1, alg2 and alg3 achieve identical hit rate performance (i.e., the same precision, recall and F1 score), because each hits exactly one item. However, the methods differ once we consider the other two factors. If u buys the item recommended by alg1, he gives it 3 stars and the retailer earns 5 dollars. If u instead buys the item recommended by alg2, user satisfaction drops by 1 point, but the retailer earns double the revenue. With the item recommended by alg3, the user gives the highest score and the retailer earns close to the optimal revenue. We therefore regard alg3 as the best algorithm, because it satisfies both the user and the retailer, and alg2 may be better than alg1 because it doubles the revenue at the cost of only one level of user satisfaction.

To conclude, a good recommendation algorithm should not only improve hit rate, but also optimize user satisfaction and website revenue, or at least achieve a balance among them. In response, we propose two novel algorithms that optimize all three simultaneously. To test the performance of our methods in terms of user and retailer satisfaction, we design two simple but convincing metrics. Experimental results on 10 real-world datasets show that our models provide better overall performance than existing methods.
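To make the arithmetic of the Table 1 example concrete, here is a minimal Python sketch (ours, not the paper's) that recomputes it; the satisfaction- and price-weighted scores anticipate the S-R and P-R metrics defined in Section 3.4, and all variable names are illustrative:

```python
# Toy recomputation of the Table 1 example: one user, three candidate items.
satisfaction = {"item1": 3, "item2": 2, "item3": 5}   # stars u would give
revenue = {"item1": 5, "item2": 10, "item3": 9}       # dollars earned per sale
recommended = {"alg1": "item1", "alg2": "item2", "alg3": "item3"}

for alg, item in recommended.items():
    hit_rate = 1.0  # every algorithm recommends exactly one item u would buy
    # Satisfaction- and revenue-weighted hit rates (cf. S-R and P-R, Sec. 3.4):
    print(f"{alg}: hit rate={hit_rate}, "
          f"satisfaction-weighted={satisfaction[item] * hit_rate}, "
          f"revenue-weighted={revenue[item] * hit_rate}")
# alg1: 3 and 5; alg2: 2 and 10; alg3: 5 and 9 -> alg3 wins on the combination.
```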
2 Background

In this section, we review several related studies and perform some analysis on real-world datasets.

The most popular techniques for collaborative filtering are matrix factorization (MF) models [Koren et al., 2009], which are effective in fitting the rating matrix with low-rank approximations. MF uses latent factors $U \in \mathbb{R}^{D \times N}$ to represent $N$ users and $V \in \mathbb{R}^{D \times M}$ to represent $M$ items, and it employs the rank-$D$ matrix $U^T V \in \mathbb{R}^{N \times M}$ for rating prediction. The latent factors can be trained with SVD-style models [Koren, 2008]. For example, Salakhutdinov and Mnih [2008b; 2008a] propose a probabilistic matrix factorization (PMF) model, in which the factors and ratings are assumed to be sampled from Gaussian distributions and the latent factors are learned by maximum likelihood estimation, which is equivalent to minimizing

$$\sum_{(u,i) \in T_o} (U_u^T V_i - s_{ui})^2 + \lambda_1 \sum_u \|U_u\|^2 + \lambda_2 \sum_i \|V_i\|^2,$$

where $s_{ui}$ is the rating score that $u$ gives to $i$, $T_o$ denotes the set of observed (user, item) pairs, and $\|U\|^2$ and $\|V\|^2$ are regularization terms for avoiding over-fitting.

Many MF models are originally proposed for rating prediction and are therefore weak at ranking tasks. To improve hit rate performance on implicit feedback, one-class collaborative filtering (OCCF) [Pan et al., 2008] and implicit matrix factorization (iMF) [Hu et al., 2008; Lin et al., 2014] were proposed based on the assumption of pointwise preferences on items. Both treat users' feedback as an indication of positive and negative preferences with different confidence levels $c_{ui}$ and try to minimize

$$\sum_{(u,i) \in T_o} c_{ui} (1 - U_u^T V_i)^2 + \sum_{(u,j) \in T_{uo}} c_{uj} (0 - U_u^T V_j)^2$$

over the observed pairs $T_o$ and the unobserved pairs $T_{uo}$. In addition to these absolute preference assumptions, many pairwise algorithms such as BPR and its extensions have been proposed [Rendle et al., 2009; 2010; Krohn-Grimberghe et al., 2012; Pan and Chen, 2013]. They directly optimize over observed and unobserved (user, item) pairs with relative preference assumptions. Although those algorithms target implicit feedback, they can also be applied to ranking items with explicit information; the drawback is that the user satisfaction information (i.e., the rating values) cannot be efficiently leveraged.
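As a concrete reference point for the pointwise objectives above, the following minimal sketch runs one SGD pass over the PMF squared loss; it is our illustration rather than any author's code, and the hyperparameter values are placeholders. iMF would replace the rating target with a 0/1 preference weighted by the confidence $c_{ui}$.

```python
import numpy as np

def pmf_sgd_epoch(U, V, observed, lr=0.01, lam=0.01):
    """One SGD pass over the PMF loss
    sum_{(u,i) in T_o} (U_u^T V_i - s_ui)^2 + lam * (||U||^2 + ||V||^2).
    U: D x N user factors; V: D x M item factors;
    observed: iterable of (u, i, s_ui) rating triples."""
    for u, i, s in observed:
        err = U[:, u] @ V[:, i] - s                      # prediction error
        grad_u = 2 * err * V[:, i] + 2 * lam * U[:, u]
        grad_v = 2 * err * U[:, u] + 2 * lam * V[:, i]
        U[:, u] -= lr * grad_u
        V[:, i] -= lr * grad_v
    return U, V

# Usage: D = 10 factors, 4 users, 6 items, three observed ratings.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((10, 4))
V = 0.1 * rng.standard_normal((10, 6))
U, V = pmf_sgd_epoch(U, V, [(0, 2, 4.0), (1, 5, 3.0), (3, 0, 5.0)])
```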
Nowadays, more and more methods utilize explicit information for item recommendation. For example, Weimer et al. [2007] propose a maximum margin matrix factorization (MMMF) method, CofiRank, which optimizes ranking in terms of the NDCG score. Similar to CofiRank, xCLiMF, proposed by Shi et al. [2013], builds a recommendation model by generalizing the collaborative less-is-more filtering assumption and optimizes the expected reciprocal rank (ERR). ListRank [Shi et al., 2010] combines a list-wise learning-to-rank algorithm with MF; it optimizes a loss function representing the cross-entropy between the output rating lists and the training lists. These algorithms perform well when the data is dense, but they may be unsuitable for sparse datasets, because few explicit ratings are available for comparing and learning users' preferences.

Beyond product ratings, directly optimizing website revenue is another interesting topic. Although a few studies have considered how to set prices with the help of recommender systems [Bergemann and Ozmen, 2006] or how to recommend discounted items to users [Kamishima and Akaho, 2011], they do not optimize website revenue directly. A good recommender system should not only provide potential items that users like, but also let the website earn as much money as possible.

We study many real-world datasets and find that it is feasible to optimize user satisfaction and website revenue jointly in applications. We briefly discuss this here through a qualitative analysis of the relationship between the two. Figure 1(a) shows our statistics on 13 Amazon datasets: the horizontal axis is user satisfaction (rating) and the vertical axis is the proportion of items priced above the category average. Figure 1(b) records the rating values and the corresponding normalized item prices of an example dataset.

[Figure 1: Statistics on real-world datasets. (a) Proportion of above-average-price items at each rating level (1-5) for 13 Amazon categories: foods, arts, automotive, cell&accessories, clothing&accessories, health, industrial&scientific, jewelry, musical instruments, office products, patio, baby and beauty. (b) Rating values versus normalized item price on an example dataset.]

It is apparent that user satisfaction and website sales have some positive correlation and can be optimized simultaneously. Overall, we can draw the following conclusions: (1) For products in the same category, item price has some impact on user satisfaction; items with higher prices may have better quality and hence tend to receive higher rating values. (2) Involving price information not only benefits the website retailer, but also helps predict user satisfaction, because we can learn the acceptable price range of each user and avoid recommending items the user will not accept.

3 Our Models and Metrics

In this section, we propose two different solutions to optimize Hit rate, user Satisfaction and website Revenue (HSR) simultaneously. We denote the probability (or tendency) of a user $u$ buying an item $i$ by $P(b_{ui})$. We use $s_{ui}$ to denote the rating score that $u$ gives to $i$, and $P(s_{ui})$ represents how much $u$ prefers $i$ after (s)he has already bought it; specifically, we assume that the greater the rating score $u$ gives to $i$ in the dataset, the more $u$ likes $i$. For the online retailer, we denote by $r_{ui}$ the revenue it earns when $u$ buys $i$, so $P(r_{ui})$ can be regarded as retailer satisfaction. In this paper, revenue is considered the only factor influencing retailer satisfaction, and it is determined by hit rate and item price: the more sales the retailer achieves, the higher $P(r_{ui})$ will be. With the above assumptions, our goal is to maximize $P(b_{ui})$, $P(s_{ui})$ and $P(r_{ui})$ jointly.

3.1 Algorithm 1: HSR-1

Since $b_{ui}$ has two states, i.e., $b_{ui} \in \{\text{buy}, \text{not buy}\}$, we use a Bernoulli distribution to model it in HSR-1. The probability of $b_{ui}$ is defined as

$$P(b_{ui} \mid \delta_{ui}) = \sigma(u,i)^{\delta_{ui}} (1 - \sigma(u,i))^{1 - \delta_{ui}}, \qquad (1)$$

where $\delta_{ui}$ is an indicator: $\delta_{ui} = 1$ if $u$ buys $i$ in the training set, and $\delta_{ui} = 0$ otherwise. Here $\sigma(u,i)$ is the sigmoid function, with $f_b(u,i)$ denoting the user-item association:

$$\sigma(u,i) = \frac{1}{1 + e^{-f_b(u,i) + c}}, \qquad (2)$$

where $c$ is a constant.

Our next task is to model user satisfaction and retailer satisfaction given $b_{ui}$. We note that some items, namely (1) items that users like but do not intend to buy, and (2) items with a high price but irrelevant to a user, should not be over-optimized, due to the risk of reducing the hit rate. Hence, $s_{ui}$ and $r_{ui}$ are considered only when $\delta_{ui} = 1$; in other words, we optimize $s_{ui}$ and $r_{ui}$ only for observed (user, item) pairs. Since the correlation of $s_{ui}$ with $r_{ui}$ is clearly positive according to our analysis in the previous section, we assume they are sampled from a bivariate normal distribution, that is,

$$P(s_{ui}, r_{ui} \mid \delta_{ui}) = \left[ K_1 \exp\left( -K_2 \left( \frac{\Delta_s^2}{\sigma_s^2} + \frac{\Delta_r^2}{\sigma_r^2} - \frac{2\rho \Delta_s \Delta_r}{\sigma_s \sigma_r} \right) \right) \right]^{\delta_{ui}}, \qquad (3)$$

where $K_1 = 1/(2\pi \sigma_s \sigma_r \sqrt{1-\rho^2})$, $K_2 = 1/(2(1-\rho^2))$, $\Delta_s = s_{ui} - f_s(u,i)$ and $\Delta_r = r_{ui} - f_r(u,i)$. Here $\sigma_s$ and $\sigma_r$ denote the standard deviations of the two factors and $\rho$ is the correlation between $s_{ui}$ and $r_{ui}$; $f_s(u,i)$ and $f_r(u,i)$ can be viewed as satisfaction functions for $u$ and $i$. Because $s_{ui}$ and $r_{ui}$ have different ranges, in applications we standardize both of them to $[0, L]$; for example,

$$\tilde{r}_{ui} = \frac{r_{ui} - r_{Min}}{r_{Max} - r_{Min}} L, \qquad (4)$$

where $r_{Max}$ and $r_{Min}$ are the supremum and infimum of the item price set.

We then formulate the maximum posterior to derive our optimization criterion:

$$OPT(\text{HSR-1}) \propto \ln \prod_{(u,i) \in T} P(b_{ui} \mid \delta_{ui}) \, P(\tilde{s}_{ui}, \tilde{r}_{ui} \mid \delta_{ui}), \qquad (5)$$

where some model parameters are omitted for brevity. We assume $b_{ui}$, $s_{ui}$ and $r_{ui}$ are all influenced by $U_u$ and $V_i$, and a straightforward choice is $f_b(u,i) = f_s(u,i) = f_r(u,i) = U_u^T V_i$. We add the regularization terms $\|U\|^2$ and $\|V\|^2$ to avoid over-fitting, and we run stochastic gradient descent (SGD) to update $U$ and $V$. Unlike PMF, we update the latent factors with both observed and unobserved (user, item) pairs: in each iteration, we sample two pairs, one from each class, and update the latent factors in turn. To make a prediction, we compute the expected value $U_u^T V_i$ for each $(u, i)$ pair after the update process finishes; for $u$, items with a higher $U_u^T V_i$ rank higher and are recommended with higher probability.
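To make the learning procedure tangible, here is a hedged sketch of a single HSR-1 SGD step under the choice $f_b = f_s = f_r = U_u^T V_i$ stated above. The gradient is our hand derivation from Eqs. (1)-(5), and all hyperparameter values are placeholders rather than the authors' settings:

```python
import numpy as np

def hsr1_sgd_step(U, V, u, i, delta, s_tilde, r_tilde, lr=0.01, lam=0.01,
                  rho=0.1, sig_s=1.0, sig_r=1.0, c=0.0):
    """One stochastic ascent step on Eq. (5) for a single (u, i) pair.
    delta = 1 (observed): push sigma(x) toward 1 and pull x = U_u^T V_i
    toward the standardized rating s_tilde and revenue r_tilde (Eqs. 3-4).
    delta = 0 (unobserved): only push sigma(x) toward 0."""
    x = U[:, u] @ V[:, i]
    sigma = 1.0 / (1.0 + np.exp(-x + c))                 # Eq. (2)
    if delta == 1:
        k2 = 1.0 / (2.0 * (1.0 - rho ** 2))
        ds, dr = s_tilde - x, r_tilde - x
        # d/dx of [ln sigma(x) + ln bivariate normal density], Eqs. (2)-(3):
        g = (1.0 - sigma) + 2.0 * k2 * (ds / sig_s ** 2 + dr / sig_r ** 2
                                        - rho * (ds + dr) / (sig_s * sig_r))
    else:
        g = -sigma                                       # d/dx of ln(1 - sigma(x))
    grad_u = g * V[:, i] - lam * U[:, u]                 # ascent direction + L2 term
    grad_v = g * U[:, u] - lam * V[:, i]
    U[:, u] += lr * grad_u
    V[:, i] += lr * grad_v
    return U, V
```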
3.2 Algorithm 2: HSR-2

Because the Gaussian assumption in HSR-1 may not be optimal for modeling discrete values (i.e., $s_{ui}$), we instead use a binomial distribution to model user satisfaction, inspired by [Wu, 2009; Kim and Choi, 2014]. We further integrate $b_{ui}$ into $s_{ui}$ for consistency. The probability of $s_{ui}$ can then be written as

$$P(s_{ui}) = B(s_{ui} \mid L, \sigma(u,i)) = \binom{L}{s_{ui}} \sigma(u,i)^{s_{ui}} (1 - \sigma(u,i))^{L - s_{ui}}, \qquad (6)$$

where $B$ denotes the binomial distribution and $L$ is the maximal rating score in the dataset. Intuitively, a binomial distribution is always unimodal, which is more reasonable than the Gaussian model. HSR-2 has another advantage over HSR-1: $U_u^T V_i$ is not limited by $s_{ui}$ and $r_{ui}$ when optimizing $\sigma(u,i)$.

We also use a binomial distribution to model $r_{ui}$. Note that $r_{ui}$ need not be an integer, because the binomial coefficient in Eq. (6) is eliminated in the optimization process. Similar to HSR-1, we assume user satisfaction and retailer satisfaction are correlated but conditionally independent given $U_u$ and $V_i$. Combining the distributions of $s_{ui}$ and $r_{ui}$ and formulating the maximum posterior, we obtain the following optimization criterion:

$$OPT(\text{HSR-2}) \propto \ln \prod_{(u,i) \in T} P(\tilde{s}_{ui}) \, P(\tilde{r}_{ui} \mid \delta_{ui}). \qquad (7)$$

We also run SGD to update $U$ and $V$ for HSR-2, and the learning process is similar to that of HSR-1. In each iteration, the time complexity of our methods is $O(D)$, so the execution time is comparable to existing pairwise item recommendation methods.
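A matching sketch for HSR-2: for a binomial likelihood $B(t \mid L, \sigma(x))$, the derivative of the log-likelihood with respect to $x = U_u^T V_i$ collapses to $t - L\sigma(x)$. Treating unobserved pairs as zero-valued targets is our assumption about how $b_{ui}$ is integrated into $s_{ui}$; the rest follows Eqs. (6)-(7):

```python
import numpy as np

def hsr2_sgd_step(U, V, u, i, delta, s_tilde, r_tilde, lr=0.01, lam=0.01, L=5):
    """One stochastic ascent step on Eq. (7). For B(t | L, sigma(x)) the
    log-likelihood gradient w.r.t. x is t - L * sigma(x); the binomial
    coefficient drops out, so targets need not be integers."""
    x = U[:, u] @ V[:, i]
    sigma = 1.0 / (1.0 + np.exp(-x))
    if delta == 1:
        g = (s_tilde - L * sigma) + (r_tilde - L * sigma)  # rating + revenue terms
    else:
        g = -L * sigma        # unobserved pair: zero-valued target (assumption)
    grad_u = g * V[:, i] - lam * U[:, u]
    grad_v = g * U[:, u] - lam * V[:, i]
    U[:, u] += lr * grad_u
    V[:, i] += lr * grad_v
    return U, V
```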
3.3 Discussion and Explanation

Both HSR-1 and HSR-2 rank items based on $U_u^T V_i$: if $U_u^T V_i$ is greater, then $P(b_{ui})$, $P(s_{ui})$ and $P(r_{ui})$ are all greater, so $u$ is more likely to choose and prefer $i$, while the website retailer earns relatively higher revenue. This can be illustrated with Figure 2, where the horizontal axis denotes user satisfaction, the vertical axis represents website revenue, and the contours express the quantity $P(s_{ui})P(r_{ui})$ that HSR tries to optimize. The black points represent the items that $u$ wants to buy and the gray points are irrelevant items.

[Figure 2: Model comparison. Axes: user satisfaction (horizontal) and retailer satisfaction (vertical). Left: CofiRank/ListRank/xCLiMF search along direction A (user satisfaction only). Right: HSR-1/HSR-2 search along direction B, which increases user satisfaction and retailer satisfaction together.]

We now compare HSR with other related models and give an interpretation of their assumptions. For implicit feedback, algorithms such as iMF and BPR aim to find the most relevant items but ignore both the horizontal and the vertical direction; hence they do not consider whether $u$ likes the recommended items or not. The recommendation methods based on explicit feedback, e.g., CofiRank, ListRank and xCLiMF, search for potential items along direction A, so the items that $u$ likes most are supposed to be recommended first. However, because they do not try to avoid irrelevant items during training, some points with a high $U_u^T V_i$ may be irrelevant to $u$ in the test set. HSR searches for items along direction B, which accounts for both user satisfaction and retailer satisfaction; it also avoids irrelevant items in the training set through the aforementioned Bernoulli or binomial assumptions, so cheap but relevant items still receive many recommendation opportunities.

From Eq. (5) and Eq. (7) we find that, by involving item price information, HSR-1 and HSR-2 not only globally optimize the website revenue but also learn the acceptable price range of each customer, which in turn helps in learning user satisfaction. For these reasons, HSR is more comprehensive than existing approaches. We note that for some datasets the item price and user satisfaction may not be clearly correlated. In this case, we still try to balance all the factors, since doing so can improve overall performance. In practice, we can also adjust the weights of the hit rate, user satisfaction and revenue parts of Eq. (5) and Eq. (7) for different purposes.

3.4 The Metrics

To measure user satisfaction and website revenue, we design two modified hit-rate-based metrics, satisfaction recall (S-R) and price recall (P-R), as follows:

$$\text{S-R} = M(s) R(b), \qquad (8)$$

$$\text{P-R} = M(r) R(b), \qquad (9)$$

where $R(b)$ is the recall metric on the test set:

$$R(b) = \frac{1}{N} \sum_u \frac{|L_u^{tr} \cap L_u^{te}|}{|L_u^{te}|}, \qquad (10)$$

where $N$ is the number of users, $L_u^{tr}$ denotes the recommended item set for $u$, and $L_u^{te}$ is the observed item set. $M(s)$ and $M(r)$ are the mean rating score and the mean price of the hit items in the test set; for example,

$$M(r) = \frac{\sum_{u,i} r_{ui} \, \delta(i \in L_u^{tr} \cap L_u^{te})}{\sum_u |L_u^{tr} \cap L_u^{te}|}. \qquad (11)$$

From Eq. (8) we observe that if an algorithm achieves a high S-R score, it must guarantee both the recall and the mean rating score of the hit items. The same idea applies to P-R, which balances the recall and the mean price of the hit items. Hence, our metrics describe overall system performance.
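Both metrics are easy to compute from a top-K recommendation list and a held-out test set. A minimal sketch following Eqs. (8)-(11); the dictionary-based data structures are illustrative only:

```python
def s_r_and_p_r(recommended, test_items, ratings, prices):
    """recommended: {user: set of top-K recommended items} (L^tr_u),
    test_items:  {user: set of observed test items} (L^te_u),
    ratings, prices: {(user, item): value} on the test set.
    Returns (S-R, P-R) as in Eqs. (8)-(11)."""
    recall_sum, hits = 0.0, []
    for u, rec in recommended.items():
        hit = rec & test_items[u]
        recall_sum += len(hit) / len(test_items[u])      # per-user recall
        hits.extend((u, i) for i in hit)
    r_b = recall_sum / len(recommended)                  # Eq. (10)
    if not hits:
        return 0.0, 0.0
    m_s = sum(ratings[h] for h in hits) / len(hits)      # mean rating of hit items
    m_r = sum(prices[h] for h in hits) / len(hits)       # Eq. (11)
    return m_s * r_b, m_r * r_b                          # Eqs. (8) and (9)

# One user, a K=2 list hitting one of two test items:
print(s_r_and_p_r({"u": {"a", "b"}}, {"u": {"a", "c"}},
                  {("u", "a"): 5.0, ("u", "c"): 3.0},
                  {("u", "a"): 9.0, ("u", "c"): 4.0}))   # -> (2.5, 4.5)
```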
4 Experiments

4.1 Datasets

To fairly evaluate the performance of our models, we study 10 Amazon datasets provided by [McAuley and Leskovec, 2013]. Each dataset contains rich review information such as rating scores, item prices, user comments and review times; we only use the user ratings and item prices. Related statistics are summarized in Table 3.

Table 3: Basic statistics of the data.

dataset                  #users    #items   avg rating   avg price ($)   sparsity
automotive               116898    40256    4.1641       56.0249         0.0035%
beauty                   144771    19159    4.1712       23.9697         0.0077%
clothing&accessories     14849     2278     3.9649       23.0374         0.0571%
industrial&scientific    26381     17155    4.3450       42.1607         0.0192%
musical instruments      52665     9532     4.2403       59.8015         0.0132%
shoes                    4762      1275     4.2541       42.0972         0.1966%
software                 33362     3823     3.3448       54.5471         0.0334%
tools&home               235731    36784    4.0681       59.3773         0.0037%
toys&games               215373    30835    4.1607       46.1495         0.0045%
video games              174014    12257    3.9857       45.7538         0.0157%

We preprocess the data by dropping the items without rating or price information, and we subdivide each dataset into training, validation and test sets, where 80% is used for training and the remaining 20% is evenly split between validation and test sets.
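A sketch of the preprocessing just described: drop reviews whose item lacks a rating or a price, then split 80/10/10 into training, validation and test sets. The paper does not specify the split mechanism, so the uniform random split below is an assumption:

```python
import random

def preprocess_and_split(reviews, seed=42):
    """reviews: list of dicts with keys 'user', 'item', 'rating', 'price'.
    Keeps rows that have both a rating and a price, then splits 80/10/10."""
    rows = [r for r in reviews
            if r.get("rating") is not None and r.get("price") is not None]
    random.Random(seed).shuffle(rows)        # assumed: uniform random split
    n = len(rows)
    train = rows[:int(0.8 * n)]
    valid = rows[int(0.8 * n):int(0.9 * n)]
    test = rows[int(0.9 * n):]
    return train, valid, test
```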
4.2 Baselines and Parameter Settings

We compare our models with four state-of-the-art ranking methods: (1) CofiRank, (2) ListRank, (3) iMF and (4) BPR. CofiRank [Weimer et al., 2007] and ListRank [Shi et al., 2010] are listwise methods for item ranking with explicit feedback (e.g., ratings); both directly optimize a list of items based on rating values. iMF [Hu et al., 2008; Lin et al., 2014] is a pointwise method that updates observed and unobserved pairs individually. BPR [Rendle et al., 2009] adopts a pairwise ranking approach under the assumption that observed (user, item) pairs are more positive than unobserved ones, and it optimizes this comparison in each iteration. For iMF and BPR, we take a preprocessing step that keeps the ratings as observed positive feedback and treats the other (user, item) pairs as unobserved feedback. We include iMF and BPR because they perform better on very sparse datasets than the other baselines.

We set the learning rate to 0.001 for iMF, ListRank and HSR-2, and to 0.01 for CofiRank, BPR and HSR-1. We fix the latent dimension D = 10 and choose the regularization coefficient from λ ∈ {0.001, 0.01, 0.1, 1}. The correlation ρ is set to 0.1 and L = 5. We adopt precision, recall and our proposed metrics S-R and P-R to test performance, and the length of the recommendation lists is fixed at K = 5 for all metrics.

4.3 Analysis of Hit Rate

We first compare the hit rate performance of our methods with the baseline models. The precision and recall results are shown in Table 2, where the best result on each dataset is always obtained by HSR-1 or HSR-2. For a comprehensive comparison, we also list the improvement in F1 score; "improve" denotes the relative improvement of the better HSR model over the best baseline.

Table 2: Precision and recall of the compared models on 10 real-world datasets (K = 5).

Precision (K=5)
dataset                  CofiRank   ListRank   iMF      BPR      HSR-1    HSR-2    improve
automotive               0.0009     0.0001     0.0038   0.0057   0.0072   0.0064   26.32%
beauty                   0.0021     0.0023     0.0102   0.0301   0.0337   0.0323   11.96%
clothing&accessories     0.0039     0.0072     0.0248   0.0449   0.0470   0.0484   7.80%
industrial&scientific    0.0006     0.0009     0.0102   0.0172   0.0193   0.0239   38.95%
musical instruments      0.0003     0.0013     0.0050   0.0062   0.0073   0.0073   17.74%
shoes                    0.0081     0.0016     0.0613   0.0762   0.0832   0.0804   9.19%
software                 0.0018     0.0022     0.0091   0.0080   0.0099   0.0110   20.88%
tools&home               0.0005     0.0009     0.0022   0.0040   0.0028   0.0050   25.00%
toys&games               0.0005     0.0006     0.0011   0.0015   0.0012   0.0017   13.33%
video games              0.0042     0.0023     0.0023   0.0078   0.0084   0.0087   11.54%

Recall (K=5)
dataset                  CofiRank   ListRank   iMF      BPR      HSR-1    HSR-2    improve   F1 improve
automotive               0.0041     0.0003     0.0155   0.0220   0.0286   0.0246   30.00%    27.06%
beauty                   0.0102     0.0108     0.0439   0.1327   0.1501   0.1446   13.11%    12.17%
clothing&accessories     0.0188     0.0357     0.1215   0.2112   0.2199   0.2271   7.53%     7.75%
industrial&scientific    0.0023     0.0012     0.0176   0.0641   0.0683   0.0814   26.99%    36.24%
musical instruments      0.0017     0.0064     0.0218   0.0274   0.0322   0.0325   18.61%    17.90%
shoes                    0.0272     0.0075     0.2341   0.2890   0.3124   0.3044   8.10%     8.96%
software                 0.0045     0.0036     0.0426   0.0256   0.0436   0.0514   20.66%    20.84%
tools&home               0.0022     0.0040     0.0101   0.0188   0.0133   0.0231   22.87%    24.62%
toys&games               0.0022     0.0020     0.0053   0.0063   0.0060   0.0082   30.16%    16.22%
video games              0.0152     0.0012     0.0088   0.0329   0.0401   0.0410   24.62%    13.83%

From the table, we have the following observations:

(1) Because our test sets are very sparse, the precision results are small compared with the recall results. We also observe that the models considering unobserved items (i.e., iMF, BPR, HSR-1 and HSR-2) perform much better than the models that only use rating information (i.e., CofiRank and ListRank). In the extreme case where a user has only one observed item in the training set, CofiRank and ListRank provide next to no benefit for exploiting relevant items, which explains their poor performance in our experiments.

(2) On average, BPR is better than iMF thanks to its pairwise assumption. Although iMF considers the unobserved pairs, it relies on the absolute assumption that $s_{ui}$ is 1 when u buys i and 0 otherwise. In contrast, BPR, HSR-1 and HSR-2 treat feedback as relative preferences rather than absolute ones.

(3) Both HSR-1 and HSR-2 outperform all baseline methods in terms of both precision and recall. They improve precision by an average of more than 18% and recall by an average of more than 20% compared with the best baseline models. This is mainly because HSR-1 and HSR-2 not only consider all possible (user, item) pairs but also exploit the explicit feedback information.
4.4 Analysis of User Satisfaction

We then analyze the performance of the compared models on the S-R metric and list the improvement of our models on the mean rating of hit items, M(s). The results are shown in the upper portion of Table 4; "improve" again denotes the relative improvement of the better HSR model over the best baseline.

Table 4: S-R and P-R performance of the compared models on 10 real-world datasets (K = 5).

S-R (K=5)
dataset                  CofiRank   ListRank   iMF      BPR      HSR-1    HSR-2    improve   M(s) improve
automotive               0.0174     0.0012     0.0672   0.0967   0.1249   0.1093   29.15%    1.09%
beauty                   0.0455     0.0436     0.1861   0.5644   0.6414   0.6171   13.64%    0.53%
clothing&accessories     0.0891     0.1493     0.5026   0.8861   0.9167   0.9644   8.84%     1.24%
industrial&scientific    0.0106     0.0053     0.0724   0.2634   0.3021   0.3443   30.70%    7.33%
musical instruments      0.0071     0.0265     0.0990   0.1220   0.1415   0.1460   19.66%    -1.14%
shoes                    0.1274     0.0356     1.0332   1.2592   1.3422   1.2993   6.59%     -2.65%
software                 0.0148     0.0123     0.1632   0.0886   0.1744   0.2011   23.21%    4.54%
tools&home               0.0092     0.0172     0.0444   0.0805   0.0546   0.0992   23.24%    -1.83%
toys&games               0.0094     0.0086     0.0228   0.0271   0.0268   0.0359   32.11%    3.22%
video games              0.0706     0.0055     0.0400   0.1466   0.1848   0.1799   26.05%    1.33%

P-R (K=5)
dataset                  CofiRank   ListRank   iMF      BPR      HSR-1    HSR-2    improve   M(r) improve
automotive               0.0303     0.0053     0.1764   0.3458   0.6205   0.4901   79.42%    38.02%
beauty                   0.0499     0.0594     0.2392   0.8074   1.0306   0.9902   27.65%    12.92%
clothing&accessories     0.1507     0.2434     0.7853   1.5440   1.6604   1.7265   11.82%    4.02%
industrial&scientific    0.0108     0.0126     0.2481   0.8670   1.1300   1.7384   100.51%   51.16%
musical instruments      0.0176     0.0615     0.1545   0.2786   0.3436   0.3343   23.31%    4.97%
shoes                    0.3384     0.0944     2.9148   3.7270   4.4324   4.3663   18.93%    11.25%
software                 0.1180     0.0824     1.2105   0.4779   1.3224   1.5946   31.72%    9.26%
tools&home               0.0302     0.0478     0.1478   0.3143   0.2676   0.3550   12.95%    19.72%
toys&games               0.0346     0.0238     0.0734   0.0717   0.0909   0.1339   82.40%    17.59%
video games              0.1896     0.0133     0.0989   0.3401   0.5586   0.6510   91.40%    41.19%

From the table, we have the following observations and conclusions:

(1) Some models with lower precision or recall values can achieve higher S-R scores. For example, on the video games dataset, HSR-1 has lower recall than HSR-2 but achieves a better S-R due to a higher M(s).

(2) Both HSR-1 and HSR-2 beat the other baseline models on S-R, which demonstrates the benefit of combining $b_{ui}$ and $s_{ui}$. We also find that M(s) is not improved on some datasets. This is mainly because HSR balances hit rate and user satisfaction instead of simply optimizing the latter; since HSR has much better recall than the other models, the influence of M(s) is less visible here.

4.5 Analysis of Revenue

We next compare our models with the baselines on the P-R metric and list the improvement of HSR on the mean price of hit items, M(r). The results are summarized in the lower portion of Table 4, and we find that:

(1) There is plenty of room to optimize the website revenue. Some algorithms with poor recall performance may achieve a better P-R and M(r). For example, on the toys&games dataset, iMF has a recall score nearly 19% lower than BPR's, yet it performs better on the P-R metric, because it raises the average revenue substantially at the cost of a certain drop in recall.

(2) Our models perform much better than all other models on both P-R and M(r). For many datasets the improvement is over 50%, and for some, such as industrial&scientific and video games, the performance even doubles that of the best baseline, which shows the effectiveness of integrating the price factor into the recommendation models. Because total revenue and profit are almost always correlated, our methods can be expected to help online retailers earn considerably more money.

4.6 Analysis of Stability

Finally, to study recommendation stability, we vary the recommendation list size K over {1, 3, 5, 7, 9}; some results are shown in Figure 3. As K increases, the precision scores show a general downward trend, while the other metrics tend to increase.

[Figure 3: Performance comparison with different recommendation list sizes: precision, recall, S-R and P-R curves of iMF, BPR, HSR-1 and HSR-2 for K in {1, 3, 5, 7, 9} on the shoes, beauty, clothing and industrial datasets.]
HSR-1 and HSR-2 are consistently better than the other models; in particular, as K increases, the advantage of our models on P-R becomes more and more obvious.

5 Conclusion and Future Work

In this paper, we consider three important factors for measuring the performance of recommender systems: (1) hit rate, (2) user satisfaction and (3) website revenue. We then propose two algorithms, called HSR-1 and HSR-2, to optimize them jointly. Finally, two simple but convincing metrics are designed to evaluate the compared models. The experimental results show that our models achieve better hit rate and user satisfaction, and also significantly improve website revenue. In the future, we intend to exploit more kinds of user behavior and to design mechanisms for recommendation diversification. To better measure overall recommendation performance, we will also try to design more effective metrics.
Acknowledgments

This research is supported by the National Natural Science Foundation of China (NSFC) No. 61272303 and the Natural Science Foundation of Guangdong Province No. 2014A030310268. We also thank Dr. Weike Pan for helpful discussions.

References

[Bergemann and Ozmen, 2006] Dirk Bergemann and Deran Ozmen. Optimal pricing with recommender systems. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 43-51. ACM, 2006.

[Hu et al., 2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th International Conference on Data Mining, ICDM'08, pages 263-272. IEEE, 2008.

[Kamishima and Akaho, 2011] Toshihiro Kamishima and Shotaro Akaho. Personalized pricing recommender system: multi-stage epsilon-greedy approach. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec'11, pages 57-64. ACM, 2011.

[Kim and Choi, 2014] Yong-Deok Kim and Seungjin Choi. Bayesian binomial mixture model for collaborative prediction with non-random missing data. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys'14, pages 201-208. ACM, 2014.

[Koren, 2008] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'08, pages 426-434. ACM, 2008.

[Koren et al., 2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.

[Krohn-Grimberghe et al., 2012] Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme. Multi-relational matrix factorization using Bayesian personalized ranking for social network data. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, WSDM'12, pages 173-182. ACM, 2012.

[Lin et al., 2014] Christopher H. Lin, Ece Kamar, and Eric Horvitz. Signals in the silence: models of implicit feedback in a recommendation system for crowdsourcing. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, AAAI'14, pages 22-28, 2014.

[McAuley and Leskovec, 2013] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys'13, pages 165-172. ACM, 2013.

[Pan and Chen, 2013] Weike Pan and Li Chen. GBPR: group preference based Bayesian personalized ranking for one-class collaborative filtering. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI'13, pages 2691-2697. AAAI Press, 2013.

[Pan et al., 2008] Rong Pan, Yunhong Zhou, Bin Cao, Nathan Nan Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In Proceedings of the 8th International Conference on Data Mining, ICDM'08, pages 502-511. IEEE, 2008.

[Rendle et al., 2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 452-461. AUAI Press, 2009.

[Rendle et al., 2010] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW'10, pages 811-820. ACM, 2010.

[Salakhutdinov and Mnih, 2008a] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, ICML'08, pages 880-887. ACM, 2008.

[Salakhutdinov and Mnih, 2008b] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20 of NIPS'08, 2008.

[Shi et al., 2010] Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems, RecSys'10, pages 269-272. ACM, 2010.

[Shi et al., 2013] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, and Alan Hanjalic. xCLiMF: optimizing expected reciprocal rank for data with multiple levels of relevance. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys'13, pages 431-434. ACM, 2013.

[Weimer et al., 2007] Markus Weimer, Alexandros Karatzoglou, Quoc Viet Le, and Alex Smola. Maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems, 2007.

[Wu, 2009] Jinlong Wu. Binomial matrix factorization for discrete collaborative filtering. In Proceedings of the 9th International Conference on Data Mining, ICDM'09, pages 1046-1051. IEEE, 2009.