Supervised Machine Learning Methods for Item Recommendation

A Thesis submitted for the degree of Doctor of Natural Science (Dr. rer. nat.)
by Zeno Gantner

Department of Computer Science
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim

February 2012

Preface

Recommender systems are personalized information systems that learn individual preferences from interacting with users. Recommender systems use machine learning techniques to compute suggestions for the users. Supervised machine learning relies on optimizing for a suitable objective function. Suitability means here that the function actually reflects what users and operators consider to be a good system performance.

Most of the academic literature on recommendation is about rating prediction. For two reasons, this is not the most practically relevant prediction task in the area of recommender systems: First, the important question is not how much a user will express to like a given item (by the rating), but rather which items a user will like. Second, obtaining explicit preference information like ratings requires additional actions from the side of the user, which always comes at a cost. Implicit feedback in the form of purchases, viewing times, clicks, etc., on the other hand, is abundant anyway. Very often, this implicit feedback is only present in the form of positive expressions of preference. In this work, we primarily consider item recommendation from positive-only feedback.

A particular problem is the suggestion of new items, that is, items that have no interaction data associated with them yet. This is an example of a cold-start scenario in recommender systems. Collaborative models like matrix factorization rely on interaction data to make predictions. We augment a matrix factorization model for item recommendation with a mechanism to estimate the latent factors of new items from their attributes (e.g. descriptive keywords). In particular, we demonstrate that optimizing the latent factor estimation with regard to the overall loss of the item recommendation task is superior to optimizing it with regard to the prediction error on the latent factors. The idea of estimating latent factors from attributes can be extended to other tasks (new users, rating prediction) and prediction models, yielding a general framework to deal with cold-start scenarios.

Next, we adapt the Bayesian Personalized Ranking (BPR) framework, which is state of the art in item recommendation, to a setting where more popular items are more frequently encountered when making predictions. By generalizing even more, we get Weighted Bayesian Personalized Ranking, an extension of BPR that allows importance weights to be placed on specific users and items.

All method contributions are supported by experiments using large-scale real-life datasets from various application areas like movie recommendation and music recommendation.

Finally, this thesis presents an efficient and scalable free software package, MyMediaLite, that implements, among other things, all the presented methods (plus related work) and evaluation protocols. Besides offering the existing models and evaluation protocols as a library and via command-line tools and web services, MyMediaLite allows the easy development of new models and learning methods.

Acknowledgements

This thesis would not have been possible without the support and influence of many people.
First and foremost, I thank my advisor, Lars Schmidt-Thieme, for providing a stimulating environment to pursue the research presented here, and for his support throughout these years. I would also like to thank all members of the Information Systems and Machine Learning Lab (ISMLL) at the University of Hildesheim, in particular my friends and collaborators Steffen Rendle, Christoph Freudenthaler, Lucas Drumond, Tomáš Horváth, Leandro Balby Marinho, Christine Preisach, and Karen Tso-Sutter. Carsten Witzke, Franziska Leithold, and Christina Lichtenthaler in their roles as student assistants relieved me of several time-consuming duties and thus helped me to focus on the crucial aspects of my research. All contributors to MyMediaLite, which is an important part of the thesis, helped to shape it towards what it is today. Ruth Janning, Christoph Freudenthaler, Tomáš Horváth and Thorsten Zitterell provided useful feedback on preliminary drafts of this thesis. Wesley Dopkins proof-read the thesis and made sure it contains mostly proper English. Finally, I would like to thank Anna for her understanding, encouragement, and support in the last years.

To my family.

Contents

1 Introduction
  1.1 Motivation
  1.2 Overview
  1.3 Contributions
  1.4 Publications
  1.5 Preliminaries

2 Recommender Systems: Tasks and Methods
  2.1 Supervised Machine Learning
  2.2 Tasks
    2.2.1 Rating Prediction
    2.2.2 Item Recommendation
    2.2.3 Further Tasks and Variants
  2.3 Methods
    2.3.1 Baselines
    2.3.2 Neighborhood-Based Models
    2.3.3 Attribute-Based Methods
    2.3.4 Hybrid Methods and Ensembles
    2.3.5 Stochastic Gradient Learning
    2.3.6 Matrix Factorization
    2.3.7 Bayesian Personalized Ranking
    2.3.8 Context-Aware Recommendation
  2.4 Evaluation Criteria
    2.4.1 Predictive Accuracy
    2.4.2 Runtime Performance
  2.5 Datasets
    2.5.1 MovieLens
    2.5.2 Netflix
    2.5.3 KDD Cup 2011
    2.5.4 Yahoo! Music Ratings

3 Cold-Start Recommendation
  3.1 Problem Statement
  3.2 Attribute-to-Feature Mappings
    3.2.1 General Framework
    3.2.2 Item Mappings
  3.3 Experiments
    3.3.1 Datasets
    3.3.2 Compared Methods
    3.3.3 Experiment 1: Method Comparison
    3.3.4 Experiment 2: Large Attribute Sets
    3.3.5 Run-Time Comparison
    3.3.6 Reproducibility
    3.3.7 Discussion
  3.4 Related Work
  3.5 Summary and Outlook

4 Bayesian Personalized Ranking Revisited
  4.1 Relation to Other Approaches
  4.2 Weighted BPR
    4.2.1 Generic Weighted Bayesian Personalized Ranking
    4.2.2 Example 1: Non-Uniform Negative Item Weighting
    4.2.3 Example 2: Uniform User Weights
  4.3 Sampling Strategies
    4.3.1 Sampling According to Entity Weights
    4.3.2 Matrix Factorization Optimized for WBPR
  4.4 Summary and Outlook

5 Recommending Songs
  5.1 Problem Statement
    5.1.1 Evaluation Criterion
  5.2 Methods
    5.2.1 Optimizing for the Competition Objective
    5.2.2 Matrix Factorization Optimized for Weighted BPR
    5.2.3 Ensembles
    5.2.4 Incorporating Rating Information
    5.2.5 Contrasts
  5.3 Experiments
    5.3.1 Datasets
    5.3.2 Rating Prediction
    5.3.3 Track 2 Results
    5.3.4 Final Submission
  5.4 Related Work
  5.5 Summary and Outlook

6 The MyMediaLite Library
  6.1 Motivation: Free Software for Research
  6.2 Feature Overview
    6.2.1 Recommendation Tasks
    6.2.2 Command-Line Tools
    6.2.3 Data Sources
    6.2.4 Evaluation
    6.2.5 Incremental Updates
    6.2.6 Parallel Processing
    6.2.7 Serialization
    6.2.8 Documentation
    6.2.9 Diversification and Ensembles
  6.3 Development Practices
  6.4 Existing Software
    6.4.1 Recommender System Libraries
    6.4.2 Implementations of Single Methods
    6.4.3 Non-Free Publicly Available Software
  6.5 System Comparison
  6.6 Experiments
    6.6.1 General Performance
    6.6.2 Parallel Stochastic Gradient Descent
  6.7 Impact
  6.8 Summary and Outlook

7 Conclusion
  7.1 Summary
  7.2 Future Directions

A MyMediaLite Reference
  A.1 Installation
    A.1.1 Prerequisites
    A.1.2 Packages
    A.1.3 Instructions
  A.2 Command-Line Tools
    A.2.1 Rating Prediction
    A.2.2 Item Recommendation
  A.3 Library Structure
    A.3.1 Conventions
    A.3.2 Interfaces
  A.4 Data Structures
    A.4.1 Basic Data Types
    A.4.2 Entity Mappings
    A.4.3 Rating Data
    A.4.4 Positive-Only Feedback
    A.4.5 Attributes and Relations
  A.5 Recommenders
    A.5.1 Rating Prediction
    A.5.2 Item Recommendation
    A.5.3 Group Recommendation
    A.5.4 Ensembles
  A.6 Using MyMediaLite Recommenders
    A.6.1 General Remarks
    A.6.2 C#
    A.6.3 F#
    A.6.4 Python
    A.6.5 Ruby
  A.7 Implementing MyMediaLite Recommenders

B URLs

List of Figures

3.1 Attribute-to-feature mappings
3.2 Cold-start recommendation: prec@5 results
3.3 Cold-start recommendation: prec@10 results
3.4 Cold-start recommendation: AUC results
3.5 High-dimensional attribute sets: prec@5 results
3.6 High-dimensional attribute sets: prec@10 results
3.7 High-dimensional attribute sets: AUC results
3.8 Cold-start recommendation: test time per user
5.1 Task of KDD Cup 2011, track 2
5.2 The "liked" contrast
5.3 The "rated" contrast
6.1 The MyMediaLite movie demo program
6.2 Runtime of BiasedMatrixFactorization
6.3 Runtime of BiasedMatrixFactorization and MultiCoreMatrixFactorization
6.4 Runtime and memory usage of MultiCoreMatrixFactorization

List of Tables

1.1 Examples for recommender system applications
2.1 Attribute example: movie genres
2.2 Context-aware item recommendation: different scenarios
2.3 Evaluation datasets
3.1 Cosine similarities between movies
3.2 Item attribute sets
5.1 Characteristics of the validation and competition splits
5.2 Rating prediction accuracy for different matrix factorization models
5.3 Validation set and KDD Cup 2011 leaderboard error percentages for different models
5.4 Candidate components of the score ensemble
5.5 Rating prediction models used in the final KDD Cup submission
6.1 Comparison of free/open source recommender system frameworks
6.2 Memory usage for rating prediction with BiasedMatrixFactorization
A.1 Rating predictors in MyMediaLite
A.2 Rating predictor hyperparameters
A.3 Item recommenders in MyMediaLite
A.4 Item recommender hyperparameters
A.5 Group recommenders in MyMediaLite
B.1 Websites of different software
B.2 Academic/experimental systems with recommender functionality and other recommender system resources
B.3 Websites
B.4 Websites of recommender system software

List of Algorithms

1 Learning with stochastic gradient descent
2 Learning a basic matrix factorization for rating prediction with stochastic gradient descent
3 Learning a matrix factorization with biases and a logistic transformation for rating prediction with stochastic gradient descent
4 LearnBPR: Optimizing BPR using stochastic gradient ascent
5 Learning algorithm for the linear attribute-to-feature mapping
6 LearnWBPR-Neg: Optimizing WBPR with non-uniform weights for the negative items using stochastic gradient ascent
7 LearnWBPR: Optimizing WBPR using stochastic gradient ascent
8 Optimizing a matrix factorization model for WBPR with non-uniform weights for the negative items using stochastic gradient ascent
9 Sampling procedure for the KDD Cup validation split
10 Parallel stochastic gradient descent for matrix factorization
Chapter 1

Introduction

In this chapter, we motivate why research on recommender systems is relevant, and outline the structure and contributions of this thesis.

1.1 Motivation

Never in history has a higher number of different books, movies, music recordings, news items, software programs, and other media content, as well as physical products, been readily available to Internet users. While this generally may be a good thing, the potential consumer faces a dilemma: there is too much choice. All offered items can never be examined, not to mention consumed, in a lifetime. Besides search and user interface technologies that support the user in actively looking for items in which they know they are interested, personalization technologies that suggest possible items of interest are one answer to the ever-increasing supply of information and products.

Recommender systems [Goldberg et al., 1992, Resnick and Varian, 1997, Jannach et al., 2010, Kantor et al., 2011] are information systems that learn user preferences from past user actions (ratings, votes, ranked lists, mouse clicks, page views, product purchases, etc.) and suggest items (pages on the web, news articles, jokes, movies, products of any kind, music albums, individual songs, etc.) according to those user preferences.

While rating prediction ("How much will a user like/rate a given item?") has gained more attention in the recommender systems literature in the past, the task of item recommendation ("Which items will a user like/buy?") [Deshpande and Karypis, 2004, Hu et al., 2008] is actually more relevant for practical recommender system applications: after all, the task of a recommender system is not to know how a person would rate something on an arbitrary scale, but rather which items a person is interested in and will really like.

Table 1.1 gives some examples of what can be suggested by recommender systems, along with some sample companies, products, or websites providing such systems.(1)

Application domain | Example systems or literature references
Books | GoodReads, Amazon [Linden et al., 2003], LibraryThing
Movies | Netflix, MovieLens, Moviepilot, Ringo [Shardanand, 1994]
TV, IPTV | TiVo [Ali and van Stam, 2004], Bambini et al. [2011], Xin and Steck [2011]
Video clips | YouTube [Davidson et al., 2010, Zhou et al., 2010]
Playlists/songs | Pandora, last.fm, Ringo [Shardanand, 1994]
General products | Amazon [Linden et al., 2003]
Fashion products | Net-A-Porter, Zalando, Asos, Amazon, Otto
E-learning | KDD Cup 2010, Nguyen et al. [2011]
Teaching | Aditya Parameswaran [2011]
Donations for non-profits | Donation Dashboard [Nathanson et al., 2009]
Energy-saving measures | Knijnenburg et al. [2011a]
Friends/contacts | Google Mail, Twitter, Facebook, LinkedIn, YouTube
Dating partners | Pizzato et al. [2010]
Questions | Yahoo! Answers [Dror et al., 2011c]
Restaurants | Mui et al. [2001]
Shops | Takeuchi and Sugimoto [2006]
Games/apps/software | AppRecommender
Web pages | Delicious, Balabanovic [1997]
Messages/conversations | Sriram et al. [2010]
News articles | Google News, Yahoo! News, Findory
Usenet news | Tapestry [Goldberg et al., 1992], GroupLens [Resnick et al., 1994], Lang [1995]
Research papers | Bollacker et al. [2000], und Gordon McCalla [2004], Ekstrand et al. [2010], Wang and Blei [2011]
Folksonomy tags | Bibsonomy [Jäschke et al., 2007]
Jokes | Jester [Goldberg et al., 2001]
Events | Songkick, Chen [2005]
Tourism | Jannach et al. [2009]
Points of interest | Nokia Maps, Google Maps

Table 1.1: Examples for recommender system applications.

(1) The URLs corresponding to those systems can be found in Appendix B.

As indicated by the questions just mentioned, typical tasks in recommender systems are prediction tasks. For efficiency reasons, the answers to those questions are computed by prediction models, which have to be learned/trained
based on past interaction data and possibly additional data like user and item attributes. This places recommender system methods in the realm of supervised machine learning. This thesis is about supervised machine learning techniques for recommender systems.

1.2 Overview

Besides the introduction and the conclusion, this thesis contains five chapters.

• Chapter 2 (Recommender Systems: Tasks and Methods): We define prediction tasks in the field of recommender systems, and discuss existing learning approaches for those tasks, as well as how to measure and compare the effectiveness of different methods.

• Chapter 3 (Cold-Start Recommendation): In this chapter, we develop a framework for dealing with cold-start problems in recommender systems, namely how to make accurate recommendations when we do not have sufficient interaction information for using normal collaborative filtering methods. The framework can adapt all methods that represent the entities (for example users and items) as vectors of real numbers to cold-start scenarios, in particular latent factor models, which are the state-of-the-art approach for recommender systems. We present a case study in which we employ the framework to enhance an item recommendation method, matrix factorization for Bayesian Personalized Ranking (BPR-MF), with item attributes.

• Chapter 4 (Bayesian Personalized Ranking Revisited): We look at different aspects of Bayesian Personalized Ranking: the relationship between ranking and classification, and weighting entities of differing importance. We relate the approach to more general approaches in the machine learning literature, and extend it to weighted BPR (WBPR). We suggest a learning algorithm for the new WBPR optimization criterion based on adapting the sampling probabilities.

• Chapter 5 (Recommending Songs): Here we show the effectiveness of the enhancements from the previous chapter in a music recommendation scenario, using the large-scale dataset from the KDD Cup 2011 competition. In particular, we use WBPR to learn scalable and accurate matrix factorization models.

• Chapter 6 (The MyMediaLite Library): Building upon the work of others is an important part of every scientific endeavour. In this chapter we describe MyMediaLite, a fast, scalable, multi-purpose library of recommender system algorithms: the free software package that was used to implement all methods presented in this thesis.

The appendix contains material that may be of interest to the reader, but was left out of the main part of the thesis: a concise reference manual of the MyMediaLite recommender system software, and a list of URLs of websites mentioned in the text.

1.3 Contributions

This thesis contains contributions in the area of item recommendation:

1. A formal definition of item recommendation that is general enough to cover most known prediction tasks in the area of recommender systems, and examples of how to frame specific problems in terms of the formal definition.
2. A generic framework for dealing with cold-start problems in factorization models, leading to an extension for matrix factorization based on Bayesian Personalized Ranking (BPR) that takes item attributes into account.

3. Weighted Bayesian Personalized Ranking (WBPR), a generalization of BPR that allows importance weights to be attached to specific users and items, including a generic and efficient learning algorithm.

4. An ensemble method for recommending songs, based on WBPR-MF and rating prediction matrix factorization, that was one of the top entries in the KDD Cup 2011.

5. The description of MyMediaLite, a free software package of item recommendation and rating prediction methods, which contains reference implementations of many different recommendation algorithms, and the widest choice of evaluation protocols of all available free software/open source recommender system packages, among plenty of other features useful for researchers and practitioners. The availability of the software also makes all experiments presented here easily reproducible for others.

This work also contains, to the best of our knowledge, the most complete and extensive survey of free software recommender system implementations in the literature so far.

1.4 Publications

Most of the work presented in this thesis has already been published in the form of papers at peer-reviewed international workshops and conferences:

• Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, Lars Schmidt-Thieme (2010): Learning Attribute-to-Feature Mappings for Cold-Start Recommendations, in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia.
  The content of this paper is mostly covered in chapter 3, and in the sections on item recommendation in chapter 2.

• Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Lars Schmidt-Thieme (2011): Bayesian Personalized Ranking for Non-Uniformly Sampled Items, KDD Cup Workshop 2011, San Diego, USA.
  Theoretical insights first published at the KDD Cup 2011 workshop form the core of chapter 4, whereas the application scenario of KDD Cup 2011 is used for the case study in chapter 5. An improved version of the workshop paper is currently under review for a special issue on the KDD Cup 2011 in the Journal of Machine Learning Research (JMLR).

• Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme (2011): MyMediaLite: A Free Recommender System Library, in Proceedings of the 5th ACM International Conference on Recommender Systems (RecSys 2011), Chicago, USA.
  Chapter 6 builds on this publication, while adding a lot of new content.

During the time of my doctoral studies, I co-authored further publications that are not covered in this thesis, although there are of course relations and influences between their contents and my thesis work.

• Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, Chris Newell (2011): Explaining the User Experience of Recommender Systems, to appear in User Modeling and User-Adapted Interaction (UMUAI).

• Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, Lars Schmidt-Thieme (2011): Fast Context-aware Recommendations with Factorization Machines, in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), Beijing, China.
• Zeno Gantner, Steffen Rendle, Lars Schmidt-Thieme (2010): Factorization Models for Context-/Time-Aware Movie Recommendations, in Challenge on Context-aware Movie Recommendation (CAMRa2010), ACM, Barcelona, Spain.

• Zeno Gantner, Christoph Freudenthaler, Steffen Rendle, Lars Schmidt-Thieme (2009): Optimal Ranking for Video Recommendation, in User Centric Media: First International Conference, UCMedia 2009, Revised Selected Papers, Springer.

• Zeno Gantner, Lars Schmidt-Thieme (2009): Automatic Content-based Categorization of Wikipedia Articles, in The People's Web Meets NLP: Collaboratively Constructed Semantic Resources. Workshop at the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL 2009).

• Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme (2009): BPR: Bayesian Personalized Ranking from Implicit Feedback, in Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009).

1.5 Preliminaries

Throughout this thesis, scalar variables are set in the default math font, e.g. a, b, c, while matrices (upper case) and vectors (lower case) are in bold face, e.g. A, B, x, y. We will use the letter p for assignments to {0, 1}, or the probability of an assignment to 1, and s ∈ ℝ for arbitrary scores. p_{u,i} states whether item i was rated (highly) by user u. \hat{p}_{u,i}(\Theta), usually simplified to \hat{p}_{u,i}, is the decision (estimation) of a model Θ for the true assignment p_{u,i}. Output scores \hat{s}_{u,i}(\Theta) = \hat{s}_{u,i} refer to arbitrary numerical predictions of recommendation models Θ, where higher scores correspond to higher positions in the ranking. Such estimated rankings can then be used to make decisions \hat{p}_{u,i}.

Chapter 2

Recommender Systems: Tasks and Methods

In this chapter, we give an overview of the field of recommender systems from a supervised machine learning perspective. After introducing machine learning, we discuss the most prominent prediction tasks for recommender systems, in particular a generic yet formal definition of recommendation. We then proceed to describe supervised machine learning methods for accomplishing these tasks, and different ways of measuring the properties of such methods, concentrating on simulated off-line experiments, as well as the datasets we use throughout this thesis to perform such experiments.

2.1 Supervised Machine Learning

Tom Mitchell defines the term machine learning as follows (bold face/italics in the original, [Mitchell, 1997, p. 2]):

  "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Mitchell goes on to give examples of machine learning problems, such as learning to recognize spoken words, learning to drive an autonomous vehicle, learning to classify new astronomical structures, and learning to play backgammon [Mitchell, 1997, p. 3]. The performance of a system that uses machine learning methods is measured with a loss function that penalizes prediction errors [Hastie et al., 2009, p. 18].

Supervised machine learning [Hastie et al., 2009] tasks share the property that they learn models for the prediction/estimation of target variables (dependent variables, labels) from input variables (independent variables, predictor variables).
Learning examples contain both types of variables, whereas we do not know the target variables of the instances we need to predict. Depending on the type of target variable, we distinguish several different types of supervised learning tasks:

• the task of classification is to predict nominal target variables, and
• the task of regression is to predict real-valued target variables.

Beyond classification and regression, there are other supervised learning tasks, for example density estimation [Hastie et al., 2009], ranking [Trotman, 2005], and so on. Supervised learning is in contrast to unsupervised learning, where examples are not labeled. An example of an unsupervised learning task is clustering.

2.2 Tasks

In the introduction we described what recommender systems are. Let us now focus on the underlying learning problems. In a classical recommender system, there are two types of entities: users (e.g. customers) and items (e.g. movies, books, songs). We use U = {1, ..., |U|} and I = {1, ..., |I|} to denote the sets of all user and all item IDs, respectively. For simplicity, we will not differentiate between the integer ID representing an entity and the entity itself.

We have different kinds of information about the entities:

1. Information pertaining to one entity, content information, e.g. user attributes like age, gender, hobbies, or item attributes like the price of a product, words in the title or description of a movie, editorial ratings.

2. Information that is linked to a user-item pair, collaborative information, e.g. the rating "4 stars" on a scale from one to five given to a movie by a specific user, the information that a user has purchased an item in an online shop or viewed a video in an IPTV system, or a tag in a collaborative tagging system.

There are several types of collaborative information. One important distinction is between explicit (e.g. ratings, up- and downvotes) and implicit expressions of user preferences (e.g. clicks, purchases). Depending on the type of system, implicit information may be positive-only, i.e. there may be no recorded negative preference observations.

We can represent interaction information by a partial function s : U × I → S. Correspondingly, we represent content information about users as a function a^U : U → A^U and about the items by a function a^I : I → A^I. What S, A^U, and A^I look like exactly will depend on the concrete task we wish to describe. For convenience, we often represent these functions as matrices. Let A^U ∈ ℝ^{|U|×m} be the matrix of user attributes, where a^U_{u,l} is 1 if and only if user u has attribute l, i.e. l ∈ a^U(u), and let A^I ∈ ℝ^{|I|×n} be the matrix of item attributes, where a^I_{i,l} is 1 if and only if l ∈ a^I(i). There are m different user attributes, and n item attributes.

Example 1. Suppose we have the movies "The Usual Suspects", "American Beauty", "The Godfather", and "Road Trip" in our recommender system. Each of those items is assigned to one or several of the genres Crime, Thriller, Comedy, Drama, and Action (Table 2.1).

ID | Movie              | Genres
1  | The Usual Suspects | Crime, Thriller
2  | American Beauty    | Comedy, Drama
3  | The Godfather      | Crime, Action
4  | Road Trip          | Comedy

Table 2.1: Attribute example: movie genres.

If we assign consecutive IDs to the movies and genres, we can create the following item attribute matrix from the contents of Table 2.1:

  \mathbf{A}^I = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix},

where the rows refer to the different movies, and the columns refer to the different genres.
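The following short sketch, which is not part of the original text and merely restates Example 1 in Python/NumPy (all names are illustrative), shows how such a binary item attribute matrix can be assembled from genre assignments:

    import numpy as np

    # Genres in column order: Crime, Thriller, Comedy, Drama, Action
    genres = ["Crime", "Thriller", "Comedy", "Drama", "Action"]
    movie_genres = {
        1: ["Crime", "Thriller"],   # The Usual Suspects
        2: ["Comedy", "Drama"],     # American Beauty
        3: ["Crime", "Action"],     # The Godfather
        4: ["Comedy"],              # Road Trip
    }

    # Build the binary item attribute matrix A^I (rows: movies, columns: genres)
    A_I = np.zeros((len(movie_genres), len(genres)), dtype=int)
    for item_id, gs in movie_genres.items():
        for g in gs:
            A_I[item_id - 1, genres.index(g)] = 1

    print(A_I)
    # [[1 1 0 0 0]
    #  [0 0 1 1 0]
    #  [1 0 0 0 1]
    #  [0 0 1 0 0]]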
We will use this data in the following examples.

2.2.1 Rating Prediction

Ratings are a popular kind of explicit feedback. Users assess how much they like a given item (e.g. a movie or a news article) on a predefined scale, e.g. 1 to 5, where 5 could mean that the user likes the item very much, whereas 1 means the user strongly dislikes the item. Rating prediction algorithms estimate unknown ratings from a given set of known ratings and possibly additional data such as user or item attributes. The predicted ratings can then indicate to users how much they will like an item, or the system can suggest items with high predicted ratings.

Definition 1. Given (incomplete) rating information r : U × I → R, where R is the set of rating values, the task of rating prediction is to come up with a function r̂ : U × I → R that estimates how a given user will rate a given item. The quality of a single prediction is measured by a loss function ℓ : R × R → ℝ.

In the rest of this thesis, we will use r_{u,i} := r(u, i) for actual ratings, which may or may not be known to the learner, and \hat{r}_{u,i} := \hat{r}(u, i) for rating estimations by a recommendation method. In the example mentioned before, we have

  R := {1, 2, 3, 4, 5},   (2.1)

or

  R := [1, 5].   (2.2)

We use r_{min} and r_{max} to refer to the lowest and highest possible rating, respectively. In the example we have r_{min} = 1 and r_{max} = 5. One can view the second set, [1, 5], as a relaxed variant of {1, 2, 3, 4, 5}. While the observed ratings may be limited to integer values, we usually allow real-valued predictions. Of course, other rating scales are possible, for example thumbs up/down as in Pandora Internet radio, which we could represent by {−1, 1}.

A learning problem for this prediction task would then be to learn a model that can perform this task as accurately as possible, as measured by the loss, for the intended users and items. Of course, more information like user and item attributes can be taken into account for learning. Also, the times and dates of the rating events could be used as additional context information, leading to interaction/rating information of the form r : U × I × D → R and a function r̂ to be learned with the same signature, where D is the set of dates/times.

If the prediction target is interpreted as a real number (equation 2.2), rating prediction can be seen as a regression task in the usual terminology of supervised machine learning. If R contains several levels that can be put in an order indicating the user preference (equation 2.1), it can also be seen as ordinal regression [Weimer et al., 2008, Koren and Sill, 2011, Menon and Elkan, 2010].
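To make Definition 1 concrete, here is a minimal sketch with made-up toy data (not from the thesis): the partial rating function r is stored as a dictionary, the predictor r̂ is the trivial global average, and the loss ℓ is the squared error:

    import numpy as np

    # Known ratings as a partial function r: (user, item) -> rating
    # (toy data, 1-based IDs as in the text)
    ratings = {(1, 2): 4.0, (2, 3): 5.0, (3, 2): 3.0, (3, 3): 4.0, (3, 4): 2.0}

    mu = np.mean(list(ratings.values()))   # global average rating

    def r_hat(u, i):
        # A trivial rating predictor: always predict the global average
        return mu

    def loss(r, r_pred):
        # Squared error, a common choice for the loss l: R x R -> R
        return (r - r_pred) ** 2

    # Quality of a single (hypothetical) held-out prediction
    print(loss(5.0, r_hat(1, 3)))   # (5.0 - 3.6)^2 = 1.96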
2.2.2 Item Recommendation

We first give a generic definition of (item) recommendation, and then use this definition to define more concrete recommendation tasks. In the subsequent section, we will discuss further variants of item recommendation.

A Generic Definition of Recommendation

Definition 2. Given collaborative information s : C × I → S, and (optionally) additional information, for example about users a^U : U → A^U or items a^I : I → A^I, the task of item recommendation is to suggest a set S of items i ∈ I in a given context c ∈ C.

The context comprises the circumstances, the particular situation, in which we want to suggest items. What the context involves differs from scenario to scenario. If the context involves specific users, we have personalized recommendations. Even though we concentrate on personalization here, recommendations do not necessarily have to be personalized. For example, websites like Amazon display items similar to the one currently viewed by the user. These similar items may be the same for every user. Then we would have C = I. A very simple non-personalized recommendation task would have a singleton context set: C = {1}. In the basic personalized scenario, the set of contexts is exactly the set of users: C = U. We will later see more complex kinds of context.

Definition 3. The quality of a recommended set in a given context c is measured by a loss function ℓ : 2^I × 2^I → ℝ.

Alternatively, we can also define the quality of single recommended items, or of rankings of items:

Definition 4. The quality of a recommended item in a given context c is measured by a loss function ℓ : I × I → ℝ.

Definition 5. The quality of an item ranking in a given context c, here represented as a vector of score assignments to all items, is measured by a loss function ℓ : ℝ^{|I|} × ℝ^{|I|} → ℝ.

The information given corresponds to the experience in Mitchell's definition of machine learning, suggesting an item (or a set of items) is a task, and the loss function is the performance measure. We believe this is a broad enough definition to capture most, if not all, recommendation tasks currently discussed in the literature. Below, we will express some specific tasks in terms of this definition.

In the vocabulary of supervised machine learning, item recommendation can be viewed either as a classification task (there are correct items and wrong items; binary classification, though more classes are imaginable as well; definition 4) or as a ranking task (definition 5): can the recommender rank the candidate items such that the order resembles the actual user preferences? The task can also be viewed as a structured prediction task [Taskar et al., 2005], where the prediction target is a set instead of a single label like a number, or as a multi-label classification task [McCallum, 1999], where every item is interpreted as a label.

Item Recommendation by Rating Prediction

In rating-based recommender systems, the known interactions are the ratings: S = R. It is common to suggest those items to the user that have the best rating predictions (see definition 1), possibly weighted by the confidence in the prediction. While this approach certainly makes sense, it is not clear whether it leads to optimal recommendations [Marlin and Zemel, 2009, Cremonesi et al., 2010].
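As a small illustration of this strategy (not a method from the thesis, just the ranking step, with hypothetical predicted ratings):

    # Hypothetical predicted ratings (user, item) -> r_hat from some trained model
    r_hat = {(1, 1): 3.2, (1, 3): 4.5, (1, 4): 2.8}

    def recommend_by_rating(u, candidates, n=2):
        # Rank the candidate items by predicted rating, best first
        ranked = sorted(candidates, key=lambda i: r_hat.get((u, i), 0.0),
                        reverse=True)
        return ranked[:n]

    print(recommend_by_rating(1, [1, 3, 4]))   # [3, 1]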
Item Recommendation from Implicit Feedback

Rating prediction has been popularized in the research community by systems (and corresponding publicly available datasets) like MovieLens [Herlocker et al., 1999] and Jester [Goldberg et al., 2001], and later by the Netflix Prize [Koren et al., 2009]. Nevertheless, most real-world recommender systems (e.g. in e-commerce) do not rely on ratings, because users are hard to persuade to give explicit feedback, and other kinds of feedback, so-called implicit feedback [Oard and Kim, 1998] such as user actions like selecting/buying an item, page views/clicks, listening/watching times, etc., are often recorded by the system anyway. Buying a product or watching a video is also a positive expression of preferences. Note that not buying or watching an item from a large collection does not necessarily mean a user dislikes the item. Other kinds of measurements, like watching and listening times or percentages for videos and songs, have no direct interpretation as positive or negative feedback. Item recommendation from implicit feedback is the task of determining the items that a user will perform a certain action on from such past events (and possibly additional data).

Item Recommendation from Positive-Only Feedback

Often, implicit feedback contains only positive signals [Oard and Kim, 1998, Pan et al., 2008, Gunawardana and Meek, 2008, Rendle et al., 2009, Gunawardana and Meek, 2009]: A user has bought a product, which implies they have a positive preference for this product. We do not know, however, which items the user does not like from this kind of feedback. Distinguishing between items a user likes and items a user does not like from such data is called one-class classification in the language of supervised learning [Moya and Hush, 1996]. Learning models for such a task is less obvious than for binary or multi-class classification.

Note that positive-only feedback is not necessarily implicit. For example, websites like Facebook allow users to give thumbs up ("like") to all kinds of items. This is an explicit statement of preference, but again we only observe positive feedback here. Another example is the "favorite" feature on Twitter: users can mark posts as their favorites. Again, this is an explicit statement.

For convenience, we use a binary matrix S ∈ {0,1}^{|U|×|I|} to represent positive-only feedback, where s_{u,i} is 1 if and only if user u has given positive feedback about item i. We use I_u^+ := {i ∈ I : s_{u,i} = 1} to refer to the items for which user u has provided feedback and I_u^- := {i ∈ I : s_{u,i} = 0} to refer to the items for which that user has not provided feedback.

Example 2. Suppose we have the users Alice, Ben, and Christine. None of them has watched "The Usual Suspects"; Christine has watched all three other movies, while Alice and Ben have only watched "American Beauty" and "The Godfather", respectively. If we assign IDs to all entities in order of their appearance, we have

  \mathbf{S} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}.

Here, the rows refer to users, while the columns refer to items (movies). From S, we can deduce the sets I_1^+ = {2} and I_1^- = {1, 3, 4} for Alice (user 1). Note that we only see positive feedback here: We cannot deduce that Alice and Ben do not like the other movies only because they have not watched them.

Whether the data is one-class data depends on the way the data is logged. Continuing with the Facebook "thumbs up" scenario, if we knew which items had been displayed to the user, we could construct negative examples from the items that were not "liked". However, in practice, such detailed logging does not happen, either because the logging mechanism was not designed with recommendation in mind, or because it is simply too costly in terms of resources.

Distinguishing between explicit and implicit feedback is important when thinking about the overall system, about the user motivations to give feedback, and other cognitive aspects of recommender systems. For constructing a learning algorithm, it does not matter so much.

In this thesis, we often deal with the task of item recommendation from positive-only feedback. For the sake of brevity, we will call this task item recommendation in the following, even though there are of course many other item recommendation tasks besides item recommendation from positive-only feedback.
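A minimal illustration, using the toy data of Example 2 and not part of the original text, of how I_u^+ and I_u^- follow from S:

    import numpy as np

    # Positive-only feedback matrix S from Example 2
    # (rows: Alice, Ben, Christine; columns: the four movies)
    S = np.array([
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 1],
    ])

    def pos_neg_items(S, u):
        # Return (I_u^+, I_u^-) for user u, using 1-based IDs as in the text
        row = S[u - 1]
        pos = {i + 1 for i in range(S.shape[1]) if row[i] == 1}
        neg = {i + 1 for i in range(S.shape[1]) if row[i] == 0}
        return pos, neg

    print(pos_neg_items(S, 1))   # ({2}, {1, 3, 4}) for Alice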
2.2.3 Further Tasks and Variants

Next, we provide details on several variants and special cases of recommendation tasks, and show how they fit into our formal framework. Most of these variants can be combined. For example, recommending playlists for a certain occasion, where we allow songs that a user has played before, can be expressed as context-aware sequential recommendation with repeated events.

Context-Aware Recommendation

We have already introduced the term context in the generic definition of the recommendation task. We speak of context-aware recommendation if the context consists of more than just the user. Adomavicius and Tuzhilin [2011] mention time, location, and the company of other people as examples for context.(1) They distinguish between three different kinds of contextual information:

1. explicit context: stated by/asked from the user.
2. implicit context: logged by the system, for example time stamps.
3. inferred context: for example guessing which member of a household is using the TV.

(1) In their definition, the user is not part of the context.

Both the task of rating prediction and of item recommendation can be extended to context-aware variants. As mentioned at the end of section 2.2.1, interaction data can be augmented by time information. On the one hand, time information can be used for distinguishing between temporal trends and underlying, more durable user preferences, and thus improve results for prediction tasks without context. On the other hand, it can also be part of the recommendation context [Koren, 2009]. Such time-aware recommendation is a case of context-aware recommendation, with C = U × D. Suggestion of folksonomy tags (tag recommendation, Jäschke et al. [2007]) is also a special case of context-aware recommendation. Table 2.2 gives an overview of several context-aware recommendation tasks.

Filtered Recommendation

In filtered recommendation, the list of candidate items is restricted (filtered) by the context. One example is recommending items from the particular part of a product taxonomy that a user is currently browsing. An important difference to general context-aware recommendation is that the candidate items are necessarily and explicitly restricted by the context. This leads to interesting properties that can be exploited to improve the learning process of a recommender [Yang et al., 2011].

User Recommendation

An important application of recommender system algorithms is the recommendation of users to other users in social networks like LinkedIn, Xing, Twitter, or Facebook: users are treated as items. Information from the social network graph can be used both for candidate generation and for making decisions on what to suggest. Another example for user recommendation is match making [Miller, 2003], like finding potential dating partners on dating websites, or finding scientists who could co-operate.

Scenario | Context | Feedback | Target | Rep.

Netflix Prize | user, day | rating | rating | no
KDD Cup 2011, track 1 | user, day, time | rating | rating | no

CAMRa 2011 Task 1 | household | rating | movie | no
Music | user, location, mood, group | listen, like, ban, listening time | song, playlist | yes
Folksonomy tags | user, resource | tagging | tag | yes
Restaurants | user, location, mood | rating, visit | restaurant | maybe
Books | user, basket | view, purchase | product | no
Groceries | user, basket | view, purchase | product | yes
CAMRa 2010 Task 1 | user, week | rating | movie | no
CAMRa 2010 Task 2 | user, mood | rating | movie | no

Global recommendation | - | e.g. view, purchase | product | maybe
Similar products | product | view, purchase | product | maybe
Item recommendation from pos. feedback | user | positive feedback | item | maybe
KDD Cup 2011, track 2 | user | rating (song, artist, album, genre) | song | no

Table 2.2: Context-aware item recommendation: different scenarios. The first group contains rating prediction tasks, the second one item recommendation tasks. The third group contains examples for tasks that are not considered to be context-aware. "Rep." means repeated events (considering users). Adapted from Gantner et al. [2010c].
Repeated Events

One thing to be aware of when building recommender systems is whether it is possible and useful to recommend items that a user has already accessed. If this is not the case, the solution is to simply remove the accessed items from the candidate list. An example for this scenario is buying books: If a user has already bought a book, it is not very likely they will buy the same book again. The situation is different for products that are meant to be consumed, like groceries. Then things can become more complicated: accessed items will, depending on the learning method, of course be quite likely to show up in a recommendation list. On the other hand, a recommender system should still allow the user to explore the item space, so one should make sure that items unknown to the user have a chance to make it onto the recommendation list. In the marketing literature, this is known as repeat buying [Ehrenberg, 1988]. Work on repeated events and recommender systems can be found in Geyer-Schulz and Hahsler [2002], Cho et al. [2005], Rendle et al. [2010] and Kawase et al. [2011].

Note that there are several different notions of repeated events: the repetition can relate to the user, or to other parts of the context, or to the full context. For example, in the folksonomy tag scenario, it makes sense to recommend tags a user has already used, but it does not make sense to recommend tags a user has already used in exactly the same context, i.e. for the current resource.

Sequential Recommendation

Sometimes, the ideal suggestion depends on what users have seen, heard, or bought before. Take for example a customer who returns to an online shop after having made a purchase some time ago [Rendle et al., 2010], or the task of generating music playlists [Alghoniemy and Tewfik, 2001, Logan, 2002, Pampalk et al., 2005, Pauws et al., 2006].

Sequential recommendation could be formalized in different ways. One option would be to extend the general definition of (set) recommendation, and add an order to the suggested set. Another option would be to model sequential recommendation as several separate recommendation problems with the previous items as context. We suggest using the second option, because it does not require modifying our original generic definition of the recommendation task. A context set for sequential recommendation with the Markov property [Markov, 1906] is C = U × I, where the item in the context is the preceding item. Definitions that take more than one item, or even sets of items, into account are also possible, for example C = U × 2^I.

2.3 Methods

In this section, we describe typical approaches to tackle the rating prediction and item recommendation tasks. Item recommendation is the main topic of this thesis, so the methods described here represent the current state of the art for this task.
While we do not focus on rating prediction, this task is very prominent in the recommender systems literature. Towards the end, we give pointers to more complex methods for some of the more specific recommendation tasks mentioned before.

2.3.1 Baselines

Simple methods, like global, user, or item averages for rating prediction, are usually not employed as the main method in a recommender system to make realistic suggestions. However, they are still useful as experimental baselines, and as components of more sophisticated methods. For example, we could compare the implementation of a new method with an existing baseline method. We would expect the new method to perform much better than the baseline; should this not be the case, we would suspect that we have made mistakes in the design or implementation of the new method.

Rating Prediction

Averages: Besides giving random answers, the simplest kind of rating prediction is to predict the same rating for each rating event. An obvious choice here is to use the global average (mean) μ. Another possibility is to predict the user average, or the item average. Note that the global average and the user average contain no information at all to let us rank the items for a given user.

User and item bias: One can also combine user and item biases [Koren, 2008]:

  \hat{r}^{uib}_{u,i} = \mu + b^U_u + b^I_i,   (2.3)

where b^U and b^I are the solution of the following optimization problem:

  \min_{b^U, b^I} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \mu - b^U_u - b^I_i)^2 + \lambda^U \|b^U\|^2 + \lambda^I \|b^I\|^2,   (2.4)

where R ⊆ U × I × R are the known ratings, and λ^U and λ^I are regularization constants. The sum represents the least squares error, while the two terms starting with λ^U and λ^I, respectively, are regularization terms that control the parameter sizes to avoid overfitting. The optimization problem could be solved for example by a gradient descent algorithm, or by an alternating least squares method, as suggested by Koren [2008].

Item Recommendation

In contrast to rating prediction, where we predict scalar values, in the item recommendation task we predict sets of items. In practice, many methods for item recommendation compute a score for each user-item combination, and then, for each user, rank the items according to their scores. In the following, we will present item recommendation methods as scoring functions, even though some of them may be expressed and implemented more elegantly (and efficiently) as functions emitting sets. The score function allows us to present the methods in a unified way.

Random: The simplest baseline is to assign random scores, resulting in randomly ordered recommendation sets for each user.

Most popular items: A very simple method that returns the same items to every user (except the items already accessed by the user) is to take the globally most popular items. What "most popular" means can differ depending on the kind of feedback data available. In the case of positive-only feedback, we rank the items according to the number of feedback events, i.e. the number of users who have given feedback for item i:

  \hat{s}^{mp}_{u,i} = |\{(u', i, s_{u',i}) \in s\}|.   (2.5)
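As an illustration, the following minimal sketch (toy data; not the thesis's implementation) computes most-popular scores in the sense of equation (2.5) and excludes already-accessed items, as described above:

    from collections import Counter

    # Positive-only feedback as (user, item) pairs (toy data from Example 2)
    feedback = [(1, 2), (2, 3), (3, 2), (3, 3), (3, 4)]

    # Most-popular baseline: score each item by its number of feedback events
    popularity = Counter(i for _, i in feedback)

    def recommend_most_popular(user, n=2):
        # Rank items by popularity, excluding items the user already accessed
        seen = {i for u, i in feedback if u == user}
        ranked = [i for i, _ in popularity.most_common() if i not in seen]
        return ranked[:n]

    print(recommend_most_popular(1))   # [3, 4] for Alice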
2.3.2 Neighborhood-Based Models

Neighborhood-based models are a classical family of methods for recommendation tasks. The general idea underlying these methods is to use similar examples from the past to make predictions [Hastie et al., 2009]. Which examples are similar is determined by a so-called similarity measure over the involved entities.

User-based k-nearest Neighbors (kNN)

Similar users are likely to like the same things, so a straightforward idea is to use the past activities of similar users to make suggestions.

User-based kNN for rating prediction: For rating data, the estimated Pearson correlation of two users is defined as follows [Koren and Bell, 2011]:

  \widehat{sim}^{Pearson}_{u,v} = \frac{\sum_{i \in I(u,v)} (r_{u,i} - \hat{b}_{u,i})(r_{v,i} - \hat{b}_{v,i})}{\sqrt{\sum_{i \in I(u,v)} (r_{u,i} - \hat{b}_{u,i})^2 \cdot \sum_{i \in I(u,v)} (r_{v,i} - \hat{b}_{v,i})^2}},   (2.6)

where I(u,v) = I(u) ∩ I(v) are the items that were rated by both users, and \hat{b}_{u,i}, \hat{b}_{v,i} are baseline rating estimations, for example the one defined in equation 2.3. To avoid putting too much weight on correlations computed from just a few examples, a shrinkage factor is applied [Koren and Bell, 2011]:

  \widehat{sim}'_{u,v} = \frac{|I(u,v)| - 1}{|I(u,v)| - 1 + \lambda} \cdot \widehat{sim}_{u,v},   (2.7)

where λ is the shrinkage constant, and \widehat{sim}_{u,v} is the initial estimate of the similarity of users u and v. Using any (estimated or actual) similarity measure sim, we can predict ratings:

  \hat{r}^{UserKNN}_{u,i} = \hat{b}_{u,i} + \frac{\sum_{v \in N^k(i;u)} sim_{u,v} (r_{v,i} - \hat{b}_{v,i})}{\sum_{v \in N^k(i;u)} sim_{u,v}},   (2.8)

where N^k(i;u) represents the k nearest (most similar according to the given similarity measure) neighbors of user u that have rated item i.

User-based kNN for item recommendation from positive-only feedback: Similarly, we can use a kNN model for item recommendation from positive-only feedback. One widely-used similarity measure for binary data is the cosine similarity [Manning et al., 2008]:

  sim^{cosine}_{u,v} = \frac{|I(u) \cap I(v)|}{\sqrt{|I(u)| \cdot |I(v)|}}.   (2.9)

A related measure is the Jaccard index:

  sim^{Jaccard}_{u,v} = \frac{|I(u) \cap I(v)|}{|I(u) \cup I(v)|}.   (2.10)

To compute the score for a given user-item combination, we count the number of neighbors of the user that have accessed the item:

  \hat{s}^{UserKNN}_{u,i} = |N^k(i;u)|,   (2.11)

where N^k(i;u) in this case represents the k nearest neighbors of user u that have interacted with item i. To get more accurate predictions, we can sum up the similarities instead of just counting the neighbors that have accessed the item:

  \hat{s}^{UserKNN'}_{u,i} = \sum_{v \in N^k(i;u)} sim_{u,v}.   (2.12)

Demographic data: The similarity does not necessarily have to be computed from the interaction data. Instead, one could define similarities over the user attributes, for example the age, occupation, or location.

Item-based kNN

User-based kNN uses similar users to predict ratings and item scores for a given user. Conversely, item-based kNN [Linden et al., 2003] uses similar items to predict ratings and user scores for a given item. Similarity measures like Pearson correlation, cosine similarity, or the Jaccard index can be computed for items using the formulae presented above, just by exchanging users and items.

Content data: As with user-based kNN, we can also define the similarity over the item attributes, for example the keywords describing the individual items [Billsus et al., 2000].

Extensions to Basic kNN

Depending on the type of feedback and attribute information, many other kinds of similarity measures, and combinations thereof [Tso-Sutter et al., 2008], are thinkable. Several advanced versions of kNN methods are known, for instance computing adaptive weights instead of using the similarity values as weights [Koren and Bell, 2011], or using approximate similarities like MinHash [Das et al., 2007].
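A small sketch (toy data from Example 2; illustrative only, not MyMediaLite code) of user-based kNN scoring for positive-only feedback, combining the cosine similarity of equation (2.9) with the summed-similarity score of equation (2.12):

    import numpy as np

    # Positive-only feedback matrix from Example 2 (users x items)
    S = np.array([
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 1],
    ], dtype=float)

    def cosine_sim(S):
        # |I(u) n I(v)| / sqrt(|I(u)| * |I(v)|) for all user pairs (eq. 2.9)
        overlap = S @ S.T
        counts = S.sum(axis=1)
        return overlap / np.sqrt(np.outer(counts, counts))

    def knn_score(S, sim, u, i, k=2):
        # Sum the similarities of the (at most) k most similar users
        # who have accessed item i (eq. 2.12); indices are 0-based here
        neighbors = [v for v in np.argsort(-sim[u]) if v != u and S[v, i] == 1]
        return sim[u, neighbors[:k]].sum()

    sim = cosine_sim(S)
    print(knn_score(S, sim, u=0, i=2))   # score of item 3 (index 2) for Alice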
2.3.3 Attribute-Based Methods

We have already seen kNN methods based on demographics and item attributes. A straightforward way to compute recommendations based on user or item attributes is to train standard supervised machine learning models like linear regression, decision trees, support vector machines, or Naive Bayes [Hastie et al., 2009], one for each user or item, using the attributes of the other entity type as predictor variables, and the interaction data as the targets. Pazzani and Billsus [2007] and Lops et al. [2011] provide overviews of the state of the art in content-based recommendation methods.

2.3.4 Hybrid Methods and Ensembles

Of course, it is not necessary to rely only on either attribute-based or collaborative methods. There are many different ways of creating hybrid methods [Balabanovic and Shoham, 1997, Good et al., 1999]: either complete models that use different kinds of signals and data (for example Wang and Blei [2011]), or ensembles that combine the output of several different models [Koren, 2009, Töscher et al., 2010].

2.3.5 Stochastic Gradient Learning

Most recommender system models can be trained efficiently using stochastic gradient descent² [Gardner, 1984, LeCun et al., 1998, Takács et al., 2008, Bell et al., 2008, Töscher et al., 2008]. The algorithm's widespread use and general usefulness for large-scale data justify that we spend some time discussing its general working. Algorithm 1 illustrates the generic stochastic gradient descent procedure: after initializing the parameters of the model (often to zero, one, or small random values, depending on the kind of model), we repeatedly draw single examples, compute a local approximation of the gradient based on the example, and update the affected parameters. The time spent optimizing is usually measured in epochs. We define one epoch as a complete pass over all training examples. Note that some scenarios do not require even a complete pass over the data [Hazan et al., 2011, Clarkson et al., 2010], whereas sometimes many epochs are needed to converge.

Algorithm 1: Learning with stochastic gradient descent.
  Data: dataset X, α, λ
  Result: Θ̂
  1 initialize Θ̂
  2 repeat
  3   draw example x from X
  4   Θ̂ ← Θ̂ − α ∇ℓ_x(Θ̂)
  5 until convergence
Θ̂ are the model parameters, α is the learning rate (step size), and ∇ℓ_x(Θ̂) is the local gradient of the (possibly regularized) loss function with respect to the model parameters.

² We use descent or ascent, depending on whether we minimize or maximize our optimization objective.

For training more complex models with different kinds of parameters, it is often useful to have different step sizes and regularization constants [Koren, 2009]. If we use several such constants for training in this thesis, we will indicate it explicitly. Of course, there are many other learning methods for recommender systems besides stochastic gradient descent, like conjugate gradient [Rennie and Srebro, 2005, Pan et al., 2008], expectation-maximization (EM) [Yu et al., 2009], Markov Chain Monte Carlo [Salakhutdinov and Mnih, 2008a], or alternating least squares (ALS) [Hu et al., 2008, Pilászy and Tikk, 2009, Pilászy et al., 2010, Rendle et al., 2011]. The aspect that makes SGD so interesting is that it can be used to train a wide variety of different models, even with large-scale datasets, while being conceptually simple and easy to implement.
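As a concrete illustration, the generic procedure of Algorithm 1 translates almost line by line into code; a minimal sketch, assuming the caller supplies a per-example gradient function (the list-of-floats parameter layout is our own simplification):

```python
import random

def sgd(examples, theta, local_gradient, alpha=0.01, epochs=10):
    """Generic SGD loop in the spirit of Algorithm 1.
    `local_gradient(theta, x)` returns the (possibly regularized) gradient
    of the loss at a single example x; `theta` is a list of parameters."""
    for _ in range(epochs):                  # one epoch = one pass over the data
        random.shuffle(examples)
        for x in examples:
            grad = local_gradient(theta, x)
            for p in range(len(theta)):
                theta[p] -= alpha * grad[p]  # descent step (line 4)
    return theta
```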
2.3.6 Matrix Factorization

The basic idea of matrix factorization for supervised learning is to represent a partially observed data matrix as the product of two smaller matrices. The rows of one of these two matrices represent the row elements of the original data matrix as $k$-dimensional vectors, and the rows of the other matrix represent the column elements of the original data as $k$-dimensional vectors. This allows the reconstruction of missing matrix elements by computing the scalar product of the corresponding rows. We introduce matrix factorization with a model for rating prediction [Rendle and Schmidt-Thieme, 2008], and will switch to item recommendation later on, in section 2.3.7.

Basic Model

The basic matrix factorization model for rating prediction is

R = W H^\top + E, \qquad (2.13)

where $R \in \mathbb{R}^{|U| \times |I|}$ is the (only partially observed) rating matrix, $W \in \mathbb{R}^{|U| \times k}$ is the user matrix, $H \in \mathbb{R}^{|I| \times k}$ is the item matrix, and $E$ contains the prediction error of the factorization. We can compute a single rating estimate with the following formula:

\hat{r}^{\mathrm{mf}}_{u,i} = \langle w_u, h_i \rangle. \qquad (2.14)

Viewing rating prediction as a regression problem with square loss, this leads to the following optimization problem:

\min_{W,H} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \hat{r}^{\mathrm{mf}}_{u,i})^2 + \lambda (\|W\|^2 + \|H\|^2), \qquad (2.15)

where $\lambda$ is a regularization constant. Algorithm 2 describes an SGD procedure to optimize equation (2.15).

Algorithm 2: Learning a basic matrix factorization for rating prediction with stochastic gradient descent. W, H are the model parameters, α is the learning rate (step size), λ is the regularization constant.
  Data: dataset R, α, λ
  Result: W, H
  1 initialize W, H to small random values
  2 repeat
  3   draw example (u, i, r_{u,i}) from R
  4   e ← r_{u,i} − r̂^{mf}_{u,i}
  5   w_u^new ← w_u + α(e · h_i − λ w_u)
  6   h_i^new ← h_i + α(e · w_u − λ h_i)
  7   w_u ← w_u^new
  8   h_i ← h_i^new
  9 until convergence
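A compact Python sketch of Algorithm 2 for the basic model of equations (2.14) and (2.15); hyperparameter values and names are placeholders:

```python
import random
import numpy as np

def train_mf(ratings, n_users, n_items, k=10, alpha=0.01, lam=0.05, epochs=30):
    """Algorithm 2: SGD for the basic MF model of equations (2.14)/(2.15).
    `ratings` is a list of (user_index, item_index, rating) triples."""
    rng = np.random.default_rng(42)
    W = 0.1 * rng.standard_normal((n_users, k))   # user factors
    H = 0.1 * rng.standard_normal((n_items, k))   # item factors
    for _ in range(epochs):
        random.shuffle(ratings)
        for u, i, r in ratings:
            e = r - W[u] @ H[i]                   # error w.r.t. eq. (2.14)
            w_new = W[u] + alpha * (e * H[i] - lam * W[u])
            h_new = H[i] + alpha * (e * W[u] - lam * H[i])
            W[u], H[i] = w_new, h_new             # simultaneous update
    return W, H
```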
Modeling Global, User, and Item Biases

To improve the predictive accuracy, it may be wise to model only the difference between each rating and the global mean, and to explicitly take user and item biases into account, leading to the following prediction formula:

\hat{r}^{\mathrm{bmf}}_{u,i} = \mu + b^U_u + b^I_i + \langle w_u, h_i \rangle, \qquad (2.16)

and to the optimization problem

\min_{b^U, b^I, W, H} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \hat{r}^{\mathrm{bmf}}_{u,i})^2 + \lambda^b (\|b^U\|^2 + \|b^I\|^2) + \lambda (\|W\|^2 + \|H\|^2), \qquad (2.17)

which again can be optimized using stochastic gradient descent.

Using a Sigmoid Function

To make the model and the corresponding learning algorithm less prone to numerical difficulties due to different rating scales ($[0,1]$ vs. $[-1000,1000]$), we can also apply a sigmoid function to the sum from equation (2.16), in order to make sure the result is inside the interval of valid ratings:

\hat{r}^{\mathrm{blmf}}_{u,i} = r_{\min} + g(\mu + b^U_u + b^I_i + \langle w_u, h_i \rangle)(r_{\max} - r_{\min}), \qquad (2.18)

again leading to an optimization problem similar to the one in equation (2.17), which can be (approximately) solved by Algorithm 3. $g$ denotes the logistic function:

g(x) = \frac{1}{1 + e^{-x}}. \qquad (2.19)

Algorithm 3: Learning a matrix factorization with biases and a logistic transformation for rating prediction with stochastic gradient descent. W, H are the user and item factors, b^U, b^I are the user and item bias vectors, α is the learning rate (step size), λ is the regularization constant.
  Data: dataset R, α, λ
  Result: b^U, b^I, W, H
   1  b^U ← 0
   2  b^I ← 0
   3  initialize W, H to small random values
   4  repeat
   5    draw example (u, i, r_{u,i}) from R
   6    e ← r_{u,i} − r̂^{blmf}_{u,i}
   7    x ← g(µ + b^U_u + b^I_i + ⟨w_u, h_i⟩)
   8    y ← e · x · (1 − x) · (r_max − r_min)
   9    b^U_u ← b^U_u + α(y − λ b^U_u)
  10    b^I_i ← b^I_i + α(y − λ b^I_i)
  11    w_u^new ← w_u + α(y · h_i − λ w_u)
  12    h_i^new ← h_i + α(y · w_u − λ h_i)
  13    w_u ← w_u^new
  14    h_i ← h_i^new
  15  until convergence

More Complex Models

The matrix factorization models in this section are presented as an introduction to factorization models; there are more complex and powerful methods in the literature, which use additional data, like SVD++ [Koren, 2008], timeSVD and timeSVD++ [Koren, 2010], time-aware Bayesian probabilistic tensor factorization [Xiong et al., 2010], SVDFeature [Chen et al., 2011b], Factorization Machines [Rendle, 2010b], and the time- and taxonomy-aware model by Dror et al. [2011a], or which optimize for different loss functions, like CofiRank [Weimer et al., 2008].

2.3.7 Bayesian Personalized Ranking

Bayesian Personalized Ranking (BPR) is a framework for optimizing different kinds of models based on training data containing implicit feedback or other kinds of implicit and explicit (partial) ranking information. It has been successfully applied to k-nearest-neighbor (kNN), matrix factorization, and different tensor factorization models for the tasks of item recommendation [Rendle et al., 2009] and personalized tag prediction [Rendle and Schmidt-Thieme, 2010]. Rendle [2010a] refers to the context-aware generalization of BPR, which goes beyond mere personalization, as Bayesian Context-Aware Ranking (BCR). Because we concentrate on personalization in this thesis, we stick with the term BPR; all extensions to BPR presented in the next chapter can also be made to BCR.

BPR's key ideas are to consider entity pairs instead of single entities in its loss function, which allows the interpretation of positive-only data as partial ranking data, and to learn the model parameters using a generic algorithm based on stochastic gradient descent.

For convenience, we use $I_u^+$ for positive items and $I_u^-$ for negative ones, similarly to the notation used by Rendle et al. [2009]. Depending on the context, $I_u^+$ and $I_u^-$ may refer to the positive and negative items in the training or test set. What determines whether an item is positive or negative may differ (see section 5.2.5). For estimating whether a user prefers one item over another, we optimize for the BPR criterion:³

\mathrm{BPR}(D_S, \Theta) = \sum_{(u,i,j) \in D_S} \ln g(\hat{s}_{u,i,j}(\Theta)) - \lambda \|\Theta\|^2, \qquad (2.20)

where $\hat{s}_{u,i,j}(\Theta) := \hat{s}_{u,i}(\Theta) - \hat{s}_{u,j}(\Theta)$ and $D_S = \{(u,i,j) \mid i \in I_u^+ \wedge j \in I_u^-\}$. $\Theta$ represents the parameters of the model and $\lambda$ is a regularization constant.

³ In the original paper [Rendle et al., 2009] this is called BPR-Opt; for brevity, we call it just BPR.

Matrix Factorization Based on BPR

Matrix factorization based on BPR (BPR-MF) approximates the event matrix $S$ by the product of two low-rank matrices $W \in \mathbb{R}^{|U| \times k}$ and $H \in \mathbb{R}^{|I| \times k}$. For a specific user $u$ and item $i$, the score estimate is

\hat{s}_{u,i} = \sum_{f=1}^{k} w_{uf} h_{if} = \langle w_u, h_i \rangle. \qquad (2.21)

Each row $w_u$ in $W$ can be seen as a feature vector describing a user $u$; each row $h_i$ of $H$ describes an item $i$. For learning the MF model, we use the LearnBPR algorithm (Algorithm 4), which is a variant of stochastic gradient ascent that samples from $D_S$. To apply LearnBPR to MF, only the gradient of $\hat{s}_{u,i,j}$ with respect to every model parameter has to be derived.

Algorithm 4: LearnBPR: Optimizing BPR using stochastic gradient ascent. α is the learning rate (step size), λ is the regularization constant.
  Data: D^train, α, λ
  Result: Θ̂
  1 initialize Θ̂
  2 repeat
  3   draw (u, i) from D^train
  4   draw j uniformly from I_u^-
  5   Θ̂ ← Θ̂ + α ( e^{−ŝ_{u,i,j}} / (1 + e^{−ŝ_{u,i,j}}) · ∂ŝ_{u,i,j}/∂Θ̂ − λ Θ̂ )
  6 until convergence
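For the MF model, the needed gradients are $\partial \hat{s}_{u,i,j}/\partial w_u = h_i - h_j$, $\partial \hat{s}_{u,i,j}/\partial h_i = w_u$, and $\partial \hat{s}_{u,i,j}/\partial h_j = -w_u$. A minimal sketch of the resulting learner (uniform negative sampling, illustrative names; positive feedback is assumed to be a dict mapping each user to a set of items):

```python
import numpy as np

def learn_bpr_mf(pos, n_users, n_items, k=32, alpha=0.05, lam=0.002, steps=10**6):
    """LearnBPR (Algorithm 4) instantiated for the MF model of eq. (2.21)."""
    rng = np.random.default_rng(1)
    W = 0.1 * rng.standard_normal((n_users, k))
    H = 0.1 * rng.standard_normal((n_items, k))
    pairs = [(u, i) for u, items in pos.items() for i in items]
    for _ in range(steps):
        u, i = pairs[rng.integers(len(pairs))]
        j = rng.integers(n_items)               # draw a uniform negative item
        while j in pos[u]:
            j = rng.integers(n_items)
        w_u = W[u].copy()                       # keep old values for the gradients
        c = 1.0 / (1.0 + np.exp(w_u @ (H[i] - H[j])))   # e^{-s}/(1+e^{-s})
        W[u] += alpha * (c * (H[i] - H[j]) - lam * W[u])
        H[i] += alpha * (c * w_u - lam * H[i])
        H[j] += alpha * (-c * w_u - lam * H[j])
    return W, H
```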
Other Item Recommendation Methods: Besides BPR-MF and kNN, there exist of course other methods for item recommendation, most prominently weighted regularized matrix factorization (WR-MF, Hu et al. [2008], Pan et al. [2008]).

2.3.8 Context-Aware Recommendation

Adomavicius and Tuzhilin [2011] distinguish three different approaches for context-aware recommendations:

1. contextual pre-filtering, where the set of candidate items is selected based on the context and then fed into a non-contextual recommender;
2. contextual post-filtering, where the items suggested by a non-contextual recommender (possibly including the score outputs) are post-processed according to the current context;
3. contextual modeling, where context is directly modeled.

In recent years, there have been several publications on contextual modeling approaches, for example Rendle's thesis [Rendle, 2010a] and several follow-up publications [Rendle, 2010b, Rendle et al., 2011], as well as Multiverse recommendation [Karatzoglou et al., 2010], and the works by Baltrunas [Baltrunas, 2011, Baltrunas et al., 2011]. Work on the special case of tag recommendation was kicked off by Jäschke et al. [2007], who suggest several methods inspired by PageRank [Brin and Page, 1998]. Other methods that have been presented are pairwise interaction tensor factorization (PITF, Rendle and Schmidt-Thieme [2010]), relational classification [Marinho et al., 2009], and content-based approaches [Lipczak, 2009]. Several time-aware methods have already been mentioned in section 2.3.6.

2.4 Evaluation Criteria

First of all, there are different things we can evaluate in recommender systems research. We can evaluate complete systems, and we can also evaluate certain components of a recommender system. In this thesis, we are mostly concerned with evaluating the method components, meaning the prediction model and the corresponding learning algorithm. Shani and Gunawardana [2011] discuss 14 recommender system properties, some of which relate to the method components:

1. user preference: the item ranking;
2. prediction accuracy: error measures for rating prediction, information retrieval metrics for item recommendation;
3. coverage: the part of the entities out of the overall set for which the system/method is able to make useful predictions;
4. confidence: whether the system is able to report the confidence it has in the suggestions it makes;
5. trust: whether the users trust the system's recommendations;
6. novelty: whether the system is able to suggest items that the user previously did not know about;
7. serendipity: how surprising recommendations are to a user;
8. diversity: how diverse the recommended item sets are;
9. utility: how useful the recommendations are for the user and for the operator of the recommender system;
10. risk: how likely it is that users will be disappointed by the recommendations;
11. robustness: how hard it is for attackers to modify the system's recommendations;
12. privacy: whether user preferences can be exposed by the system;
13. adaptivity: how well a system/method responds to new feedback;
14. scalability: how well a system/method can cope with growing numbers of users, items, and feedback.
While it is important to test recommender system algorithms in the field [Marrow et al., 2010], experiments with real users are expensive and time-consuming to execute. Offline experiments are inexpensive, and repeating them is only limited by computing resources, which nowadays are cheap; in the literature on recommender system methods [Karypis, 2001], and more generally in the machine learning/data mining literature, they are the primary means for comparing the performance of learning algorithms [Hastie et al., 2009].

2.4.1 Predictive Accuracy

Rating Prediction

Typical evaluation measures for rating prediction are the root mean square error (RMSE) and the mean absolute error (MAE) [Shani and Gunawardana, 2011]:

e_{\mathrm{RMSE}}(R, \hat{r}) = \sqrt{\frac{1}{|R|} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \hat{r}_{u,i})^2} \qquad (2.22)

e_{\mathrm{MAE}}(R, \hat{r}) = \frac{1}{|R|} \sum_{(u,i,r_{u,i}) \in R} |r_{u,i} - \hat{r}_{u,i}|. \qquad (2.23)

Item Recommendation

The accuracy of item recommendation methods can be measured in several ways. Ranking measures take the ordering of the items by the system into account, while set measures rely on the information of whether or not an item is in the set of recommendations.

Ranking Measures: If the recommendation method returns a ranked list of items for a user, we can compare this list with held-out preference information of the same user. One such measure is the area under the ROC curve (AUC). Intuitively, the AUC is the probability that, when we draw two items at random, their predicted pairwise ranking is correct [Bickel, 2006]. For recommendation from positive-only feedback, the per-user AUC on the test set can be defined as follows:

\mathrm{AUC}(u) = \frac{1}{|I_u^+| \, |I_u^-|} \sum_{i \in I_u^+} \sum_{j \in I_u^-} \delta(\hat{s}_{u,i} > \hat{s}_{u,j}), \qquad (2.24)

where $\delta$ is defined as

\delta(x) := \begin{cases} 1, & \text{condition } x \text{ holds} \\ 0, & \text{else.} \end{cases} \qquad (2.25)

Note that the loss $1 - \mathrm{AUC}$ has a nice property [Balcan et al., 2008]: it is greater for mistakes at the beginning and the end of an ordering, which satisfies the intuition that an unwanted item placed at the top of a recommendation list should have a higher associated loss than one placed in the middle. The average AUC over all relevant users is

\mathrm{AUC} = \frac{1}{|U^{\mathrm{test}}|} \sum_{u \in U^{\mathrm{test}}} \mathrm{AUC}(u), \qquad (2.26)

where $U^{\mathrm{test}} = \{u \mid (u,i) \in D^{\mathrm{test}}\}$ is the set of users that are taken into account in the evaluation.

Set Measures: If we look at the top $n$ items [Karypis, 2001] of a ranked list, we can use those items to compute several measures that are commonly used in the area of information retrieval [Manning et al., 2008]. Precision measures the ratio of correctly predicted items in the result set, whereas recall measures the ratio of items in the test set that are present in the result set. More formally,

\mathrm{prec}(u, S) = \frac{|I_u^{+\,\mathrm{test}} \cap S|}{|S|} \qquad (2.27)

\mathrm{recall}(u, S) = \frac{|I_u^{+\,\mathrm{test}} \cap S|}{|I_u^{+\,\mathrm{test}}|}, \qquad (2.28)

where $S$ is the set of recommended items (see definition 2), and $I_u^{+\,\mathrm{test}} = \{i \mid (u,i) \in D^{\mathrm{test}}\}$ is the (held-out) set of items the user has provided positive feedback for. Because we look at these measures for fixed recommendation set sizes $n$, we refer to them as precision at $n$ (prec@n) and recall at $n$ (recall@n). Note that with a fixed $n$, changes in both prec@n and recall@n only depend on the number of hits $|I_u^{+\,\mathrm{test}} \cap S|$. This means it is sufficient to look at only one of these measures to compare different methods. As with AUC, we usually average these information retrieval measures over all relevant users.
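The measures above are straightforward to compute from predicted scores; a minimal sketch for a single user (illustrative names; `scores` maps items to predicted scores):

```python
def auc_user(scores, pos, neg):
    """Per-user AUC, equation (2.24)."""
    hits = sum(1 for i in pos for j in neg if scores[i] > scores[j])
    return hits / (len(pos) * len(neg))

def prec_at_n(ranked, test_pos, n=5):
    """prec@n, equation (2.27): hits among the top-n recommended items."""
    return len(set(ranked[:n]) & set(test_pos)) / n
```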
2.4.2 Runtime Performance

For judging the runtime performance of a learning or prediction algorithm, we may look at the computation time and memory consumption. Evaluating those aspects can be done from a theoretical perspective, looking at the computational complexity [Cormen et al., 2001], or from a more practical perspective, by measuring the runtime and memory consumption of the method for different input sizes.

2.5 Datasets

Machine learning methods are usually tested in so-called offline experiments, which means that we measure the method's predictive accuracy (or possibly other properties) on user preference data that was collected on a real system with real users. Other kinds of experiments are possible for recommender system methods, in particular when investigating human aspects of recommender systems, like the psychology of decision-making, user satisfaction, usability, or other questions of human-computer interaction: laboratory experiments with a small or medium-sized group of participants [Knijnenburg et al., 2011b], and live experiments on an existing system, with a potentially arbitrary number of participants. Note that the number of employed methods in live experiments, and even more so in lab experiments, is always limited, but one can use the initially mentioned offline experiments to get a pre-selection of methods to compare in conditions which are closer (or identical) to the intended application scenario.

Hints about user preferences are usually observed in the form of user actions in an interactive system, where users can for example view and rate videos or movies, or can browse an online shop and purchase products. A dataset typically contains logs of one or more such action types for many users over a given time span. Data collection works independently of whether the system has a recommendation feature or not. In this section we describe a couple of such datasets, which we will use for evaluation throughout this thesis. Table 2.3 contains a comparison of several quantitative properties of the datasets.

Table 2.3: Evaluation datasets used in this thesis.

  Dataset                                | Events      | Users     | Items   | Sparsity
  MovieLens 100k                         | 100,000     | 943       | 1,682   | 0.9369533
  MovieLens 1M                           | 1,000,209   | 6,040     | 3,706   | 0.9553164
  Netflix                                | 100,480,507 | 480,189   | 17,770  | 0.9882244
  Netflix (test set)                     | 1,408,789   | 463,122   | 16,897  |
  KDD Cup 2011 track 1                   | 252,800,275 | 1,000,990 | 624,961 | 0.9995959
  KDD Cup 2011 track 1 (validation set)  | 4,003,960   | 1,000,990 | 258,694 |
  KDD Cup 2011 track 2                   | 61,944,406  | 249,012   | 296,111 | 0.9991599
  Yahoo! Music Ratings                   | 699,640,226 | 1,823,179 | 136,736 | 0.9971935
  Yahoo! Music Ratings (test set)        | 18,231,790  | 1,823,179 | 136,735 |

Larger datasets tend to be sparser than smaller ones.
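The Sparsity column appears to follow directly from the other three columns as $1 - \text{events} / (\text{users} \cdot \text{items})$, i.e. the fraction of unobserved user-item cells; a quick check against the MovieLens 100k row:

```python
def sparsity(events, users, items):
    """Fraction of unobserved cells in the user-item matrix (our inference
    of how the Sparsity column of Table 2.3 is computed)."""
    return 1.0 - events / (users * items)

print(round(sparsity(100_000, 943, 1_682), 7))  # -> 0.9369533, as in Table 2.3
```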
2.5.1 MovieLens

MovieLens 100k: The MovieLens 100k dataset [Herlocker et al., 1999] was collected on the MovieLens website between September 19, 1997 and April 22, 1998. It contains 100,000 ratings on a scale from 1 to 5, as well as some demographic information on its 943 users, and genre information about the items (movies) it contains.

MovieLens 1M: MovieLens 1M is a set of approximately 1,000,000 ratings collected from 2000 onwards, from the same website. There is also an extended version of this dataset with 10,000,000 ratings, which will not be used for the evaluations in this thesis.

2.5.2 Netflix

For the Netflix Prize, the first public large-scale rating dataset was released. It contains about 100,000,000 ratings on a scale of 1 to 5, by almost half a million users on 17,770 movies.

2.5.3 KDD Cup 2011

The datasets used for KDD Cup 2011 are described in Dror et al. [2011b]. They were taken from the Yahoo! Music website, and were collected during the years 1999 to 2010.

Track 1: The dataset for track 1, which was about rating prediction, is the largest set in terms of the number of items. It is also the sparsest of the datasets we compare here.

Track 2: The dataset for track 2, which was about item recommendation, is slightly smaller than the one for track 1, at about one fourth of its size.

2.5.4 Yahoo! Music Ratings

Like the KDD Cup 2011 data, the Yahoo! Music Ratings dataset was collected from the Yahoo! Music website. The collection period is the time between 2002 and 2006. It is even larger, but not more sparse, than the KDD Cup 2011 track 1 dataset, containing overall more than 617 million ratings. To the best of our knowledge, it is currently the largest available recommender system dataset.

Chapter 3
Cold-Start Recommendation

Matrix and tensor factorization are well-suited methods for solving problems in the field of recommender systems, like rating prediction for a given user and item (section 2.2.1), or recommending a set of items to a given user (section 2.2.2). Because predictions from factorization models rely on computing simple dot products of latent feature vectors representing users, items, and possibly other entities in the application domain, they usually have good runtime performance. Training with regard to suitable optimization objectives usually leads to good predictive accuracy.

The downside of standard factorization methods is that feature vectors are only available for entities observed in the training data, for example users who bought at least one book, or books bought by at least one user. Thus, for entirely new users and items, such methods are not capable of computing meaningful recommendations. Even many hybrid systems that rely on both collaborative and content information cannot provide useful predictions for entirely new entities, i.e. those that have no interaction information associated with them. In real-world recommender systems, such cold-start problems¹ are often solved by switching to a different, purely content-based method when encountering entirely new entities; other options are to present just the most popular items to new users, and to randomly present new items to the users in order to gather collaborative information about those new entities.

The approach we present here is a modular one, with well-defined interfaces between its components. At the core of our framework is a standard factorization model that only works for entities with collaborative training data. This factorization model is optimized for the given recommendation task. The additional components are mapping functions that compute adequate latent feature representations for new entities from their attribute representations. For example, in the classical recommendation task of movie rating prediction [Shardanand, 1994, Koren, 2009], this approach would handle new users and new items by first computing the latent feature vectors for the unknown entities from attributes like the user's age or location and a movie's genres or main cast, and then by using those estimated latent feature vectors to compute the rating from the underlying matrix factorization (MF) model.

¹ In this chapter, we use the terms cold start, new item, and new user in the narrower sense; see section 3.1 for the definition.
The training of such a combined model consists of learning the underlying standard model from the collaborative data, and then learning the mapping functions from the pairs of latent feature vectors and attribute vectors belonging to entities that are present in the collaborative data. Note that this mapping approach is applicable to a variety of prediction tasks, underlying factorization models, and families of mapping functions. In the following, we describe the use of this framework for the task of item recommendation from positive-only feedback, using a matrix factorization model optimized for Bayesian Personalized Ranking (BPR, see section 2.3.7 and Rendle et al. [2009]), and demonstrate its usefulness for the new-item recommendation task with a set of experiments.

The main contributions of this chapter are:

1. a general, simple, and straightforward method to make factorization models attribute-aware by plugging learnable mapping functions onto them, and,
2. based on that method, an extension of matrix factorization optimized for Bayesian Personalized Ranking (BPR-MF) that can deal with the cold-start problem, yielding accurate and fast attribute-aware item recommendation methods based on different families of mapping functions.
3. We also show empirically that it is worth training the mapping function for optimal model performance with respect to application-specific losses, instead of just trying to map the latent features as accurately as possible.

3.1 Problem Statement

In a wider sense, cold-start scenarios are those situations where we want to compute predictions for users or items that have little collaborative information [Cremonesi and Turrin, 2009, Pilászy and Tikk, 2009]; in the narrow sense, cold-start scenarios are exactly those scenarios in which there is no collaborative information at all for the given users or items [Gunawardana and Meek, 2008, Park and Chu, 2009, Gunawardana and Meek, 2009]. In this chapter, we use the term in the latter sense. First, let us repeat the movie example from the preceding chapter:

Example 3: Suppose we have the users Alice, Ben, and Christine. None of them has watched The Usual Suspects; Christine has watched all three other movies, while Alice and Ben each have only watched American Beauty and The Godfather, respectively. In our example, The Usual Suspects would be a new (cold-start) item.

3.2 Attribute-to-Feature Mappings

In this section, we describe the framework we have sketched in the introduction, and use it for the task of item recommendation from positive-only feedback.

[Figure 3.1: Attribute-to-feature mappings; see section 3.2.1 for a description.]

3.2.1 General Framework

In factorization models, every entity (e.g. users, items, tags) is represented by a latent feature vector $f \in \mathbb{R}^k$. In the matrix factorization models presented in the preceding chapter, the rows of the matrices $W$ and $H$ are such latent feature vectors. Usually, for example in the matrix factorization models just mentioned, the latent features of an entity can only be set to meaningful values during training if the entity occurs in the (collaborative) training data. If this is not the case, one way to still make use of the factorization model for new entities is to estimate their latent features from the existing content data: to map from the attribute space to the latent feature space.
The recommender system could then use the factorization model to compute scores for all kinds of entities; latent feature vectors for new entities would be computed from the content attributes, and further on used as if they were normally trained latent features. The mapping functions could theoretically take any form, although for practical purposes we will limit them to families of functions that allow the learning of useful mapping functions.

The training of a factorization model with a mapping extension consists of the following steps:

1. training the factorization model using the data $S$, and then
2. learning the mapping functions from the latent features of the entities in the training data and their content attributes.

Figure 3.1 illustrates the framework for a domain involving users and items: the rectangles on the left-hand side represent the factor matrices, the ones on the right-hand side the attribute matrices. Attributes are assumed to be known for all entities, while factors are initially only available for those entities that occur in the training data; entities without collaborative data have no factors. The unknown entity factors are estimated using the corresponding mapping function. The mapping functions are learned from the factor and attribute values of the entities with complete information. Note that this framework can be extended to application domains with additional entity types besides users and items. While we focus on strict cold-start problems here, we could also easily deal with scenarios involving entities with just a few observations, for example by using an adaptive ensemble of the underlying model and a model employing estimated latent features.

To exemplify how attribute-to-feature mappings can be used for item recommendation from positive-only data, we use BPR-MF, a matrix factorization model based on the Bayesian Personalized Ranking (BPR) framework (see section 2.3.7). Bear in mind that the general framework presented here can be applied to other matrix factorization models, as well as to any other model where the entities of the application domain are represented by latent feature vectors, like Tucker decomposition [Tucker, 1966] or PARAFAC [Harshman, 1970]. In the examples and experiments, we focus on new items; new users (or other kinds of entities) can be handled analogously.

Example 4: Training a hypothetical factorization model with $k = 2$ yields two matrices consisting of the user and item factor vectors, respectively:

W = \begin{pmatrix} 0.2 & 1.2 \\ 1.3 & 0.3 \\ 0.9 & 1.1 \end{pmatrix}, \qquad H = \begin{pmatrix} ? & ? \\ 0.9 & 1.0 \\ 1.1 & 0.2 \\ 0.1 & 1.2 \end{pmatrix}.

Every row in $W$ corresponds to one user: row 1 represents Alice, row 2 Ben, and row 3 Christine. In $H$, each row corresponds to exactly one movie. Suppose that The Usual Suspects has not yet been added to the content catalog, so row 1 does not contain any meaningful values. We can compute item recommendations for Alice by ranking her previously unseen movies according to their predicted scores:

\hat{s}_{1,3} = \langle w_1, h_3 \rangle = 0.2 \cdot 1.1 + 1.2 \cdot 0.2 = 0.46
\hat{s}_{1,4} = \langle w_1, h_4 \rangle = 0.2 \cdot 0.1 + 1.2 \cdot 1.2 = 1.46.

Because the score for Road Trip is 1.46, the system would place it higher on the result list than The Godfather, which only has a score of 0.46. If we want to make a prediction for The Usual Suspects, we need to estimate its factors from its attributes.
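Example 4 can be replayed in a few lines; the NaN row makes the cold-start problem directly visible (the numbers are those of the example above):

```python
import numpy as np

W = np.array([[0.2, 1.2],    # Alice
              [1.3, 0.3],    # Ben
              [0.9, 1.1]])   # Christine
H = np.array([[np.nan, np.nan],  # The Usual Suspects: no trained factors
              [0.9, 1.0],        # American Beauty
              [1.1, 0.2],        # The Godfather
              [0.1, 1.2]])       # Road Trip

print(W[0] @ H[2])  # ~0.46 -> Alice / The Godfather
print(W[0] @ H[3])  # ~1.46 -> Alice / Road Trip
print(W[0] @ H[0])  # nan   -> the cold-start problem in one line
```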
3.2.2 Item Mappings

In this section, we show how to design attribute-to-feature mappings for items; user-attribute-to-feature mappings can be designed accordingly. The general form of score estimation by mapping from item attributes to item factors is

\hat{s}_{u,i} := \sum_{f=1}^{k} w_{uf} \, \phi_f(a^I_i) = \langle w_u, \phi(a^I_i) \rangle, \qquad (3.1)

where $\phi_f : \mathbb{R}^n \to \mathbb{R}$ denotes the function that maps the item attributes to the factor with index $f$, and $\phi : \mathbb{R}^n \to \mathbb{R}^k$ denotes the vector-valued function that maps the item attributes to all item factors.

K-Nearest-Neighbor Mapping

One approach to map the attribute space to the factor space is to use weighted $k$-nearest-neighbor (kNN) regression [Hastie et al., 2009] for each factor. We determine the $k$ nearest neighbors $N_k$ as the most similar items according to the cosine similarity (see section 2.3.2) of the attribute vectors. Each factor is then estimated by

\phi_f(a^I_i) := \frac{\sum_{j \in N_k(i)} \mathrm{sim}(a^I_i, a^I_j) \, h_{jf}}{\sum_{j \in N_k(i)} \mathrm{sim}(a^I_i, a^I_j)}. \qquad (3.2)

Note that for other kinds of attribute data (e.g. strings, real numbers), other similarity metrics could be employed.

Table 3.1: Cosine similarities between movies.

  Movie              | US  | AB  | TG  | RT
  The Usual Suspects | 1   | 0   | 0.5 | 0
  American Beauty    | 0   | 1   | 0   | 0.5
  The Godfather      | 0.5 | 0   | 1   | 0
  Road Trip          | 0   | 0.5 | 0   | 1

Example 5: The cosine similarities of the different movies are given in Table 3.1. The factors of The Usual Suspects, estimated by 1-NN, would be

\hat{h}_1 := \phi(a^I_1) = \begin{pmatrix} 0.5 \cdot h_{3,1} / 0.5 \\ 0.5 \cdot h_{3,2} / 0.5 \end{pmatrix} = \begin{pmatrix} 1.1 \\ 0.2 \end{pmatrix}.

With this estimation, we can compute a score for the new item:

\hat{s}_{1,1} = \langle w_1, \hat{h}_1 \rangle = 0.2 \cdot 1.1 + 1.2 \cdot 0.2 = 0.46.

This result means that we would still recommend Road Trip to Alice.
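A minimal sketch of the kNN mapping of equation (3.2), assuming a dense attribute matrix for the known items (all names are illustrative):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_map(a_new, A_known, H_known, k=1):
    """Equation (3.2): estimate the factor vector of a new item as the
    similarity-weighted average of the factors of its k most
    attribute-similar known items."""
    sims = np.array([cosine(a_new, a) for a in A_known])
    nn = np.argsort(sims)[-k:]                 # indices of the k nearest items
    return (sims[nn] @ H_known[nn]) / sims[nn].sum()
```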
Linear Mapping

For score estimation with a linear mapping to the item factors, we plug linear functions into equation (3.1):

\phi_f(a^I_i) = \sum_{l=1}^{n} m_{fl} \, a^I_{il} = \langle m_f, a^I_i \rangle. \qquad (3.3)

Each item factor is expressed by a weighted sum of the item attributes.

Example 6: Suppose we have trained a linear mapping model with the following weights:

M = \begin{pmatrix} 0.7 & 0.0 & 0.1 & 1.0 & 0.7 \\ 0.1 & 0.0 & 0.8 & 1.1 & 0.0 \end{pmatrix}.

The rows in matrix $M$ correspond to the different latent features, while the columns denote the influence of each attribute on the latent features. Then the latent feature estimates are

\hat{h}_1 = \begin{pmatrix} 1 \cdot 0.7 + 1 \cdot 0.0 + 0 \cdot 0.1 + 0 \cdot 1.0 + 0 \cdot 0.7 \\ 1 \cdot 0.1 + 1 \cdot 0.0 + 0 \cdot 0.8 + 0 \cdot 1.1 + 0 \cdot 0.0 \end{pmatrix} = \begin{pmatrix} 0.7 \\ 0.1 \end{pmatrix},

and the score for Alice and The Usual Suspects is

\hat{s}_{1,1} = \langle w_1, \hat{h}_1 \rangle = 0.2 \cdot 0.7 + 1.2 \cdot 0.1 = 0.26.

There are different ways of learning the linear mapping functions. We present two options: simple least-squares optimization on the latent factors, and the more complex BPR optimization, which optimizes the mapping for the overall predictive accuracy of the resulting item recommendation model. As we will show in the empirical evaluation, this additional complexity is well worth the effort.

Optimizing for Least Squares Error on the Latent Factors: One way to learn suitable parameters for the linear mapping functions is to optimize the model for the (regularized) squared error on the latent features, i.e. straightforward ridge regression [Hastie et al., 2009]. Because the number of input variables (attributes) can be in the tens of thousands, we use stochastic gradient descent for training. This simple approach did not yield optimal results (see section 3.3.3), so we investigated another mapping method, which is explained next.

Optimizing for BPR Performance of the Complete Model: Optimizing the parameters of the linear mapping functions, $\Theta = M \in \mathbb{R}^{k \times n}$, for the BPR criterion (of the overall prediction model in equation 3.1) is a more suitable approach, because it fits the parameters leading to optimal model performance, rather than just accurately approximating the latent feature values. As stated above, when optimizing for BPR, we are interested in the difference between two item scores for the same user:

\hat{s}_{u,i} - \hat{s}_{u,j} = \sum_{f=1}^{k} w_{uf} \sum_{l=1}^{n} m_{fl} \, a^I_{il} - \sum_{f=1}^{k} w_{uf} \sum_{l=1}^{n} m_{fl} \, a^I_{jl}. \qquad (3.4)

Note that introducing a bias term $m_{f0}$ (via an artificial attribute that is always set to 1) does not make sense for item mappings, because the bias part would be exactly the same for both sums. The difference can be simplified to

\hat{s}_{u,i} - \hat{s}_{u,j} = \sum_{f=1}^{k} \sum_{l=1}^{n} w_{uf} \, m_{fl} \, (a^I_{il} - a^I_{jl}).

For training with LearnBPR (see Algorithm 4), we need the partial derivative with respect to $m_{fl}$ for $f \in \{1, \ldots, k\}$, $l \in \{1, \ldots, n\}$:

\frac{\partial}{\partial m_{fl}} (\hat{s}_{u,i} - \hat{s}_{u,j}) = w_{uf} \, (a^I_{il} - a^I_{jl}). \qquad (3.5)

The resulting learning algorithm for the linear mapping model optimized for BPR is shown in Algorithm 5. Note that we only need to update the mapping weights for those attributes in which the two items drawn from $D_S$ differ.

Algorithm 5: Learning algorithm for the linear attribute-to-feature mapping. The score difference ŝ_{u,i} − ŝ_{u,j} is defined in equation (3.4).
  Data: D_S, W, H, A^I
  Result: M
  1 initialize M
  2 repeat
  3   draw (u, i, j) from D_S
  4   ŝ_{u,i,j} ← ŝ_{u,i} − ŝ_{u,j}
  5   for 1 ≤ f ≤ k do
  6     m_f ← m_f + α ( e^{−ŝ_{u,i,j}} / (1 + e^{−ŝ_{u,i,j}}) · w_{uf} (a^I_i − a^I_j) − λ m_f )
  7   end
  8 until convergence
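A minimal sketch of Algorithm 5, updating the whole matrix $M$ per step for brevity (the note about touching only the attributes in which the two items differ is an optimization this sketch skips; all names are illustrative):

```python
import numpy as np

def learn_map_bpr(DS, W, A, k, n, alpha=0.01, lam=0.001, steps=500_000):
    """Algorithm 5: fit the linear mapping M for the BPR criterion.
    DS: list of (u, i, j) triples; A: binary item-attribute matrix;
    W: trained user factors of the underlying BPR-MF model."""
    rng = np.random.default_rng(7)
    M = 0.1 * rng.standard_normal((k, n))
    for _ in range(steps):
        u, i, j = DS[rng.integers(len(DS))]
        d = A[i] - A[j]                    # attribute difference, cf. eq. (3.5)
        s_uij = W[u] @ (M @ d)             # score difference, eq. (3.4)
        c = 1.0 / (1.0 + np.exp(s_uij))    # e^{-s}/(1+e^{-s})
        M += alpha * (c * np.outer(W[u], d) - lam * M)
    return M
```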
Run-Time Overhead

Generally, the runtime overhead of adding mapping functions to an existing factorization model is low. For each new entity, the factors need to be estimated once, and can then be stored either in the pre-existing factor matrices or in special data structures. After that, the computation of a prediction takes the same time as with just the underlying model. Note that factorization models themselves are among the fastest state-of-the-art methods. The experimental part of this chapter contains a comparison in section 3.3.5 that shows the method's advantage over classical content-based filtering.

3.3 Experiments

We performed experiments to confirm that our approach is able to produce useful new-item cold-start recommendations. We compare the two mapping methods described in section 3.2 to other approaches capable of solving the new-item cold-start problem (section 3.3.3). We also investigated how the number of attributes affects the prediction accuracy (section 3.3.4).

3.3.1 Datasets

For the experiments, we use the MovieLens 1M dataset, which is described in section 2.5.1 of this thesis. MovieLens 1M is a commonly used rating dataset [Gunawardana and Meek, 2009, Park and Chu, 2009]. Like Gunawardana and Meek [2009], we do not use the rating values, but just binary rating events, assuming that users tend to rate movies they have watched. To evaluate the performance of recommender algorithms in the presence of new items, we randomly split the items in the dataset into 5 groups of roughly equal size, and assign all corresponding rating events accordingly, to perform 5-fold cross-validation. Note that the results of such a protocol are comparable to the results provided by Gunawardana and Meek [2009], but actually more robust, because Gunawardana and Meek [2009] only perform one experiment with 500 randomly chosen test users, whereas we perform 5 experiments on all available users.

As attributes, we use the genre information included with the MovieLens dataset, and additional information from the Internet Movie Database (IMDb)². Table 3.2 gives an overview of the attribute sets. All attributes used in the evaluation are nominal (set-valued); their representation is binary, i.e. every possible attribute is represented by one number that is 1 if the item has the attribute, and 0 if not. "# Attributes" refers to the number of attributes in the set, and "Sparsity" refers to the relative number of zero values in the movies' attribute representations, the matrix $A^I$. Note that the methods described here would also work for real-valued attributes. The credits attribute set contains actors, directors, producers, writers, cinematographers, etc. involved with the movies; it is a superset of the other two IMDb attribute sets.

Table 3.2: Item attribute sets.

  Name      | Source    | # Attributes | Sparsity
  genres    | MovieLens | 18           | 90.83 %
  directors | IMDb      | 479          | 99.59 %
  actors    | IMDb      | 16,149       | 99.91 %
  credits   | IMDb      | 17,739       | 99.91 %

² downloaded April 16, 2010

3.3.2 Compared Methods

We report results for prec@5, prec@10, and AUC (see section 2.4.1). We compared three mapping methods with two baseline methods. For the mapping methods, we computed BPR-MF models (see section 2.3.7) with $k = 32$ factors, using hyperparameters that yielded satisfactory results in non-cold-start evaluations.³ We also performed the experiments with different numbers of factors ($k \in \{32, 56, 80, 120, 160\}$), and got similar results.

³ α = 0.01, λ^U = 0.02125, λ^I = λ^J = 0.00355, 265 iterations

map-knn: The kNN-based mapping method described in section 3.2.2; we determined suitable values for $k$ using 4-fold cross-validation on the training data.⁴

⁴ For this and the other methods, we picked the hyperparameter (combinations) with the best prec@5 performance.

map-lin: A linear mapping method that uses ridge regression to estimate the latent features from the attributes, described in section 3.2.2. We determined suitable values for the hyperparameters (learning rate, regularization constant) using 4-fold cross-validation on the training data.

map-bpr: The linear mapping method optimized for BPR, described in section 3.2.2. Again, we determined suitable values for the hyperparameters (learning rate, regularization constant) using 4-fold cross-validation on the training data. For training, we performed 2.5 · NNZ stochastic updates to the mapping weights, where NNZ is the number of non-zero entries in the feedback matrix $S$.

cbf-knn: We used the cosine similarity (see section 2.3.2) between the items' binary attribute vectors as the similarity measure. We set $k = \infty$, so scores for user $u$ and item $i$ are computed by summing up the similarities of the item $i$ with the items previously seen by user $u$:

\hat{s}_{u,i} = \sum_{j \in I_u^+} \mathrm{sim}(i, j). \qquad (3.6)

Note that this is content-based filtering using kNN (see section 2.3.2), not attribute-to-factor mapping via kNN as described in section 3.2.2.

random: To put the other methods in perspective, we also included the results for predicting a random set of items. We do not compare against just recommending the most popular items, because in our evaluation protocol there are only previously unseen items in the test set, and thus there is no popularity information about any of the candidate items.
ubm: In the first experiment, we cite experimental results by Gunawardana and Meek [2009], who used a comparable evaluation protocol to evaluate Unified Boltzmann Machines.

3.3.3 Experiment 1: Method Comparison

The comparison of the aforementioned methods on the attribute sets genres, directors, and a combination of the two sets can be seen in Figures 3.2 to 3.4. Gunawardana and Meek [2009] used a similar evaluation protocol in their cold-start experiments: the same dataset, also an 80-20 split, but only evaluations for 500 randomly selected users, instead of all users. For genres, they report about 25% prec@5 for their primary method (Unified Boltzmann Machines). As is shown in Figure 3.2, the results for map-bpr also fall into this region, while map-knn and the two baseline methods perform considerably lower. For directors, map-bpr, map-knn, and cbf-knn are roughly on par. The comparison of map-lin and map-bpr shows that it is really worth training the mapping function for overall recommendation performance instead of for least squares error on the latent features. Regarding the AUC metric (Figure 3.4), the results are similar.

[Figure 3.2: Cold-start experiment: prec@5 results.]

Note that for cbf-knn, the results deteriorate when the two attribute sets are combined, while the two mapping methods, and in particular map-bpr, profit from the additional data. We think that cbf-knn's suboptimal results could be fixed by computing separate item similarities for the different attribute sets and then combining them, but we doubt that this would be a stronger method than map-bpr.

3.3.4 Experiment 2: Large Attribute Sets

Next, we investigated the methods' performance on larger attribute sets (several thousand attributes). We notice (see Figures 3.5 and 3.7) that for large attribute sets the baseline method cbf-knn performs better than the mapping methods. Gunawardana and Meek [2009] observed similar behavior for their models, Unified Boltzmann Machines and Tied Boltzmann Machines [Gunawardana and Meek, 2008]: using only the genre data led to better results than using actor data (there: about 8,000 attributes) or the combined genres+actors data. Again, the combination of attribute sets leads to a deterioration of the prediction quality for cbf-knn, while the mapping methods do not suffer from more data.

3.3.5 Run-Time Comparison

Figure 3.8 shows the test times per user for the different methods. The number of factors per entity is again 32 for map-bpr and map-knn. One can clearly see that the mapping methods profit from the underlying fast matrix factorization model, while the kNN-based content-based filtering cbf-knn takes several times longer to compute the predictions.

3.3.6 Reproducibility

All presented methods are available as part of the MyMediaLite software, which is described in chapter 6.

[Figure 3.3: Cold-start experiment: prec@10 results.]

3.3.7 Discussion

The experiments have shown that for the new-item recommendation task, BPR-MF in combination with an attribute-to-feature mapping function yields accuracies comparable to state-of-the-art methods like Unified Boltzmann Machines (section 3.3.3).
On large attribute sets, the performance could still be improved to surpass content-based filtering with cosine similarity; however, this is a problem that other methods in the literature also suffer from (section 3.3.4). One reason for this could be that cosine similarity works particularly well for high-dimensional sparse data, and that linear models like map-bpr and simple models like map-knn (without much adaptation to the data) are not powerful enough to make use of large, sparse attribute sets. A remedy may be using a non-linear learned mapping function, e.g. based on multi-layer neural networks, or support vector regression [Smola and Schölkopf, 2004]. Additionally, the mapping approaches have the advantage of being much faster (section 3.3.5) than content-based filtering using kNN.

[Figure 3.4: Cold-start experiment: AUC results.]

3.4 Related Work

The approaches to solving the cold-start problem can be roughly divided into two groups:

1. attribute-based methods use either item (content or meta-data) or user (demographic) attributes to make up for the lack of interaction data in cold-start predictions;
2. active learning does not solve the cold-start problem immediately, but engages the user to gather the necessary feedback; the task here is to gather the most informative feedback, i.e. feedback that can be used to make the best recommendations.

Of course, those two approaches can also be combined.

Pazzani and Billsus [2007] and Lops et al. [2011] give overviews of content-based methods that can be used for new-item scenarios (see also section 2.3.3). Most content-based methods work fine for the new-item problem. However, they usually see learning the preferences of different users as separate and isolated tasks, which means they do not exploit the similarities between those tasks, as is done in collaborative filtering and in other multi-task learning scenarios [Caruana, 1997]. Earlier active learning approaches were presented by Boutilier et al. [2003] and Rashid et al. [2008].

Next, we discuss several factorization-based approaches to the cold-start problem. Note that none of the works mentioned below covers exactly our scenario, cold-start item recommendation from positive-only feedback. While several of those models would allow adaptation to such a scenario, this has, to the best of our knowledge, not been done yet. One of the MF variants described in Koren et al. [2009] takes attributes into account for the rating prediction task; however, it is assumed that for every entity there is also collaborative information available, which makes the model unsuitable for cold-start scenarios in the narrower sense.

Pilászy and Tikk [2009] propose an MF model for rating prediction that maps attributes to the factor space using a linear transformation, based on a method proposed by Paterek [2007]. The method (NSVD1) can handle either user or item attributes; predictions are computed from item attributes by

\hat{s}_{u,i} = \Big\langle w_u, \sum_{l=1}^{n} m_l \, a^I_{il} \Big\rangle, \qquad (3.7)

where $m_l \in \mathbb{R}^k$ is the weight vector associated with attribute $l$.

[Figure 3.5: High-dimensional attribute sets: prec@5 results.]
This rating prediction method is similar to a special case of the framework presented here, but there are several differences, concerning the concrete application as well as the model:

• NSVD1 is for rating prediction, while the models designed in this chapter deal with item recommendation (see section 2.2.2).
• Pilászy and Tikk's learning algorithm estimates all parameters at once, while we use a two-stage learning scheme (see section 3.2.1).
• If NSVD1 uses user and item attributes at the same time, then there are no free latent features in the model: the rating is estimated entirely from the entities' attributes. Our model only uses the entity attributes if no collaborative information is known about the given entity. Pilászy and Tikk learn the factors of one entity type (e.g. the users) simultaneously with the mapping to the factors of the other entity type (e.g. the items), which only exist implicitly via the mapping; the model is not based on a completely trained standard MF model that is augmented by attribute-to-factor mappings, as in our framework.

In Pilászy and Tikk [2009] there is also a generalization of NSVD1 that takes both user and item attributes into account, and which has free latent features. Because of the free latent features, this generalization is not capable of generating cold-start recommendations; it could, however, be enabled to do so using our framework.

fLDA [Agarwal and Chen, 2010] uses content data for rating prediction. It combines one-way and two-way user-item interactions and jointly learns the parameters for those interactions. The authors assume a bag-of-words-like [Manning et al., 2008] structure for the content attributes of items, such that latent feature extraction based on LDA [Blei et al., 2003] is possible. Thus, the fLDA approach is restricted to bag-of-words features, whereas our approach can deal with any type of attributes (nominal, ordinal, metric); it is not applicable to new-user scenarios.

[Figure 3.6: High-dimensional attribute sets: prec@10 results.]

The same authors also proposed Regression-based Latent Factor Models (RLFM) [Agarwal and Chen, 2009], a similar hybrid collaborative filtering method for rating prediction, which also works in cold-start scenarios. It was extended to Generalized Matrix Factorization (GFM) [Zhang et al., 2011]. According to the authors, by assuming Bernoulli-distributed observations, these models would also be suitable for item recommendation with positive and negative feedback; nevertheless, the suitability of the approach for that task, or even for item recommendation from positive-only feedback, has not been shown empirically.

Pairwise Preference Regression [Park and Chu, 2009] is a regression model for rating prediction optimized for a personalized pairwise loss function. The two-way aspect model [Schein et al., 2002] is a variant of the aspect model [Hofmann and Puzicha, 1999] for the item recommendation and the rating prediction task. Filterbots [Sarwar et al., 1998] are a heuristic method to augment collaborative filtering systems with content data. Unified Boltzmann Machines [Gunawardana and Meek, 2009] are probabilistic models that learn from collaborative and content information by combining Untied Boltzmann Machines, which capture correlations between items, with Tied Boltzmann Machines [Gunawardana and Meek, 2008], which take content information into account.
Menon and Elkan [2010] suggest a latent feature log-linear model, which is a generalization of matrix factorization with the same loss function as logistic regression. Again, this model would be suitable for item recommendation with positive and negative feedback; making it work for positive-only feedback would require modifications to the method. It should be noted that the attribute-to-factor framework presented here could also be applied to that model.

[Figure 3.7: High-dimensional attribute sets: AUC results.]

3.5 Summary and Outlook

We presented a general and straightforward framework to make factorization models attribute-aware. The framework is applicable to both user and item attributes, and can deal with nominal/binary and real-valued attributes. We demonstrated the usefulness of the method with an extension of matrix factorization optimized for Bayesian Personalized Ranking (BPR) that is capable of making item recommendations for new items. The experimental evaluation on two different types of mappings (kNN and linear mappings optimized for BPR) showed that the method produces accurate predictions on par with state-of-the-art methods, and that it carries little run-time overhead. We also showed empirically that it is worth training the mapping function for optimal model performance with respect to application-specific losses, instead of just trying to map the latent features as accurately as possible. An appealing property of our framework is its simplicity and modularity: because its components are only loosely coupled, it can be used to enhance existing factorization models to support new-user and new-item cold-start scenarios.

In the future, we can extend this work in several directions, among others with experiments on user attributes and real-valued (instead of binary) attributes. We also want to see whether the method produces similarly good results for other applications like rating or tag recommendation. As stated before, we will investigate how to improve mapping and prediction accuracy for large attribute sets by employing non-linear learned mapping functions like multi-layer neural networks or support vector regression. Last but not least, the mapping framework should be modified to allow a smooth transition between the two extremes: the cold-start scenario, where attributes are the only way of computing predictions, and normal operation, where latent factors learned from the interactions alone usually provide better results than attributes.

[Figure 3.8: Cold-start recommendation: test time per user, in milliseconds.]

Chapter 4
Bayesian Personalized Ranking Revisited

Bayesian Personalized Ranking (BPR, introduced in section 2.3.7) is a per-user ranking approach that optimizes a smooth approximation of the area under the ROC curve (AUC, section 2.4.1). BPR has seen successful applications in item recommendation from implicit positive-only feedback [Rendle et al., 2009] (even though the framework is not limited to positive-only data) and tag recommendation [Rendle and Schmidt-Thieme, 2010].

This chapter looks at some aspects of Bayesian Personalized Ranking. First of all, it relates the BPR framework to other findings in the machine learning literature, which should give the reader a clearer picture of the method (section 4.1).
We then extend the BPR criterion by introducing weights, leading to the weighted BPR (WBPR) criterion, in order to make it suitable for applications that do not treat all users and items the same (section 4.2). Learning WBPR models can be achieved by adapting the sampling strategies accordingly.

The name Bayesian Personalized Ranking: Note that Bayesian in Bayesian Personalized Ranking is not used as it is commonly understood in the literature. Here, the term merely refers to the derivation of the optimization criterion as the MAP estimator given the interaction data and a prior distribution on the parameters.

4.1 Relation to Other Approaches

The original BPR paper derives the optimization criterion as an approximation of AUC optimization. The approximation is done by applying the logistic function to the difference of the item scores: $g(\hat{s}_{u,i} - \hat{s}_{u,j})$. Note that if the scores come from a matrix factorization model (section 2.3.7), and if we consider the item factors to be fixed, we have a kind of pairwise classification model using logistic regression that predicts whether item $i$ is preferred over item $j$ by user $u$.

Reductions are a common concept in computer science. Solving a problem by reducing it to another one means expressing the original problem in terms of the other problem [Cormen et al., 2001]. Balcan et al. [2008] reduce ranking to pairwise binary classification. They show that a 0/1 classification regret (the difference between the actual loss and the best possible loss) of $r$ implies an AUC regret of at most $2r$. A similar approach is taken by BPR, only that it is extended to a model that does different classification/ranking tasks at once, one per user. Thus, we have an example of multi-task learning [Caruana, 1997] here. Collaborative filtering as multi-task learning has been discussed in several works, for example by Yu and Tresp [2005] and Abernethy et al. [2009].

4.2 Weighted BPR

Models optimized for BPR are suitable when the items to be ranked are sampled uniformly from the set of all items. Yet, this is not always the case, for example when the items to be ranked are sampled according to their general popularity, as in track 2 of the KDD Cup 2011 (see chapter 5). To deal with such scenarios, we extend the BPR criterion to a probabilistic ranking criterion that assumes the candidate items (those items that should be ranked by the model) to be sampled from a given distribution. Using this new, more general optimization criterion, we derive an extension of the generic BPR learning algorithm (which is a variant of stochastic gradient ascent) that samples its training examples according to the probability distribution used for the candidate sampling, and thus optimizes the model for the new criterion.

Motivation: Non-Uniformly Distributed Candidate Items: Assume the candidate items are not sampled with uniform probabilities. Because the negative items in the training data are all weighted identically (and not according to the way the candidate items are sampled), optimizing a predictive model for BPR will not lead to optimal prediction quality. This issue can be solved by assigning adequate weights to the components of the optimization criterion.
4.2.1 Generic Weighted Bayesian Personalized Ranking

Taking into account non-uniform sampling probabilities of the negative items, as well as user weights that are not proportional to the number of feedback events provided by the users, a useful optimization criterion is

    WBPR(D_S, Θ) = Σ_{(u,i,j) ∈ D_S} w_u w_i w_j ln g(ŝ_{u,i,j}(Θ)) − λ ‖Θ‖²    (4.1)

where ŝ_{u,i,j}(Θ) := ŝ_{u,i}(Θ) − ŝ_{u,j}(Θ) and D_S = {(u, i, j) | i ∈ I_u⁺ ∧ j ∈ I_u⁻}. Θ represents the parameters of the model and λ is a regularization constant. w_u is a weight that balances the contribution of each user to the criterion; w_i and w_j are the weights that determine the contributions of the positive and the negative item, respectively.

Note that WBPR is not limited to the task of the KDD Cup 2011: the weights w_u, w_i, w_j can be adapted to other scenarios and sampling probabilities. Of course, the original BPR criterion is an instance of WBPR, where all weights are set to 1 (or any other constant greater than 0). We now discuss two usage scenarios of the new criterion.

4.2.2 Example 1: Non-Uniform Negative Item Weighting

As in the task of track 2 of the KDD Cup 2011, the weights for negative items can be set proportional to their global popularity in the training data:

    w_j = Σ_{u ∈ U} δ(j ∈ I_u⁺).    (4.2)

4.2.3 Example 2: Uniform User Weights

The original BPR criterion gives higher weight to users with more feedback. This can be justified by the notion that parameters of users and items about which we have more feedback should be more influenced by the training data than the parameters of users and items we have less information about. However, it also means that the training set performance of a user with twice as many feedback events as another user is twice as important as that other user's performance. To equalize this, we can create an optimization criterion that maximizes the (approximated) AUC with all users weighted equally, by weighting each user inversely to the number of their positive feedback events:

    w_u = ( Σ_{i ∈ I} δ(i ∈ I_u⁺) )⁻¹.    (4.3)

Note that such a WBPR model could still be improved by making the regularization dependent on the number of feedback events: with only one regularization constant for all user parameters, it is very likely to either over- or underfit most users.

4.3 Sampling Strategies

How and what one samples can make a big difference in stochastic gradient descent/ascent learning algorithms. Implementation details of this kind, which are not covered in this thesis, can have a significant impact on the runtime of the algorithm. Furthermore, the weights in the optimization objective and the sampling procedures can be seen as two sides of the same coin: to give an entity more weight, one can either derive larger learning steps from the objective, or one can sample the entity with a higher probability while keeping the step size the same as for all entities.

4.3.1 Sampling According to Entity Weights

When training models for a weighted optimization target (like WBPR) using stochastic gradient methods, there are basically two options to take the weights into account: either by applying update steps proportional to the weights, or by sampling examples proportionally to their weights. Because the weights can differ drastically between examples, it is not prudent to use the former option, as it can lead to unrealistic parameter values when applying large update steps. This leaves us with the latter option, which means we have to adapt our sampling procedures.
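Both example weighting schemes above can be computed directly from the training data. The following is a minimal numpy sketch (function and variable names are ours, not part of the thesis), assuming a binary user-item feedback matrix:

    import numpy as np

    # Minimal sketch computing the example weights from a binary user-item
    # matrix `feedback` (1 = positive feedback observed).
    def wbpr_weights(feedback):
        # eq. (4.2): negative-item weight = global popularity of the item
        w_item = feedback.sum(axis=0).astype(float)
        # eq. (4.3): user weight = inverse number of positive feedback events,
        # so that all users contribute equally to the criterion
        pos_counts = feedback.sum(axis=1).astype(float)
        w_user = np.where(pos_counts > 0, 1.0 / pos_counts, 0.0)
        return w_user, w_item

Normalizing these weights into sampling probabilities then yields the distributions used by the learning algorithms below.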
Chen et al. [2011b] point out that this approach has been used in cost-sensitive learning [Ting, 1998, Sheng and Ling, 2007]. Indeed, weighted BPR can be seen as a variant of cost-sensitive learning: entities with higher weights have higher costs associated with them.

Data: D_train, α, λ
Result: Θ̂
1 initialize Θ̂
2 repeat
3   for (u, i) ∈ D_train do
4     draw j from I_u⁻ proportionally to w_j
5     Θ̂ ← Θ̂ + α · ( e^(−ŝ_{u,i,j}) / (1 + e^(−ŝ_{u,i,j})) · ∂ŝ_{u,i,j}/∂Θ̂ − λ·Θ̂ )
6   end
7 until convergence

Algorithm 6: LearnWBPR-Neg: optimizing WBPR with non-uniform weights for the negative items using stochastic gradient ascent. The difference to LearnBPR is the sampling in line 4.

Data: D_train, α, λ
Result: Θ̂
1 initialize Θ̂
2 repeat
3   draw u from U proportionally to w_u
4   draw i from I_u⁺ proportionally to w_i
5   draw j from I_u⁻ proportionally to w_j
6   Θ̂ ← Θ̂ + α · ( e^(−ŝ_{u,i,j}) / (1 + e^(−ŝ_{u,i,j})) · ∂ŝ_{u,i,j}/∂Θ̂ − λ·Θ̂ )
7 until convergence

Algorithm 7: LearnWBPR: optimizing WBPR using stochastic gradient ascent. α is the learning rate (step size).

To train a model according to the modified optimization criterion, we adapted the original learning algorithm (Algorithm 4): instead of sampling negative items uniformly, we sample them according to their overall popularity w_j (line 4 in Algorithm 6).

4.3.2 Matrix Factorization Optimized for WBPR

In the BPR framework, the pairwise prediction ŝ_{u,i,j} is usually expressed as the difference of two single predictions:

    ŝ_{u,i,j} := ŝ_{u,i} − ŝ_{u,j}.    (4.4)

We use the BPR framework and its adapted sampling extension to learn matrix factorization models with item biases:

    ŝ_{u,i} := b_i + ⟨w_u, h_i⟩,    (4.5)

where b_i ∈ ℝ is a bias value for item i, w_u ∈ ℝ^k is the latent factor vector representing the preferences of user u, and h_i ∈ ℝ^k is the latent factor vector representing item i. The optimization problem is then

    max_{W,H,b} Σ_{(u,i,j) ∈ D_S} w_u w_i w_j ln g(b_i − b_j + ⟨w_u, h_i − h_j⟩) − λ_U ‖W‖² − λ_I ‖H‖² − λ_b ‖b‖².    (4.6)

The training algorithm LearnWBPR-MF-Neg (Algorithm 8) (approximately) optimizes this problem using stochastic gradient ascent; it is an instance of the generic LearnWBPR algorithm (Algorithm 7). The parameter updates make use of the partial derivatives of the local error with respect to the current parameter. The entries of the factor matrices must be initialized to non-zero values, because otherwise all gradients and regularization updates for them would be zero, and thus no learning would take place; the item bias vector b does not have this problem. Note that the λ constants in the learning algorithm are not exactly equivalent to their counterparts in the optimization criterion. We also use two different regularization constants λ_I and λ_J, which lead to different regularization updates for positive and negative items.

Data: D_train, α, λ_U, λ_I, λ_J, λ_b
Result: W, H, b
 1 set entries of W and H to small random values
 2 b ← 0
 3 repeat
 4   draw (u, i) from D_train
 5   draw j from I_u⁻ proportionally to w_j
 6   ŝ_{u,i,j} ← b_i − b_j + ⟨w_u, h_i − h_j⟩
 7   x ← e^(−ŝ_{u,i,j}) / (1 + e^(−ŝ_{u,i,j}))
 8   b_i ← b_i + α (x − λ_b b_i)
 9   b_j ← b_j + α (−x − λ_b b_j)
10   w_u ← w_u + α (x · (h_i − h_j) − λ_U w_u)
11   h_i ← h_i + α (x · w_u − λ_I h_i)
12   h_j ← h_j + α (x · (−w_u) − λ_J h_j)
13 until convergence

Algorithm 8: Optimizing a matrix factorization model for WBPR with non-uniform weights for the negative items using stochastic gradient ascent.
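For illustration, here is a compact Python sketch of the update loop of Algorithm 8, under simplifying assumptions (the positive pairs are given as a list, and the check that a sampled j is not in I_u⁺ is omitted); it is not the MyMediaLite implementation:

    import numpy as np

    # Hypothetical sketch of Algorithm 8; all names are ours.
    # pos_pairs: list of (u, i) with i in I_u^+; w_neg: per-item weights w_j.
    def learn_wbpr_mf_neg(pos_pairs, w_neg, n_users, n_items, k=32,
                          alpha=0.05, lam_u=0.0025, lam_i=0.0025,
                          lam_j=0.00025, lam_b=0.0001, n_epochs=30, seed=0):
        rng = np.random.default_rng(seed)
        W = 0.1 * rng.standard_normal((n_users, k))  # non-zero initialization
        H = 0.1 * rng.standard_normal((n_items, k))
        b = np.zeros(n_items)                        # item biases may start at 0
        p = w_neg / w_neg.sum()                      # negative sampling distribution
        for _ in range(n_epochs):
            for u, i in pos_pairs:
                j = rng.choice(n_items, p=p)         # resample if j in I_u^+ in a full version
                s = b[i] - b[j] + W[u] @ (H[i] - H[j])
                x = 1.0 / (1.0 + np.exp(s))          # e^{-s} / (1 + e^{-s})
                b[i] += alpha * (x - lam_b * b[i])
                b[j] += alpha * (-x - lam_b * b[j])
                wu = W[u].copy()
                W[u] += alpha * (x * (H[i] - H[j]) - lam_u * W[u])
                H[i] += alpha * (x * wu - lam_i * H[i])
                H[j] += alpha * (-x * wu - lam_j * H[j])
        return W, H, b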
4.4 Summary and Outlook

In this chapter, we revisited several aspects of the Bayesian Personalized Ranking (BPR) framework. We linked it to a remarkable result in the general machine learning literature, and extended the optimization criterion to the weighted BPR (WBPR) criterion, for which we provide a generic learning algorithm. We described one instance of this learning algorithm that trains matrix factorization models for the case of non-uniformly weighted negative items. The WBPR framework can be applied to other scenarios, for example to providing personalized rankings of news articles that have been assigned weights/priorities by an editorial team. Another possible use of weighted BPR are learning scenarios with case weights, which could come up, for example, if we perform soft clustering [Ruspini, 1969] of users to densify the training data, and then use all available training examples with the cluster weights as case weights for the given user cluster.

An interesting aspect is the regularization of WBPR models. As mentioned when discussing equation 4.3, weighting the contribution of each user to the optimization criterion equally could lead to over- and underfitting. Using several regularization constants [Koren, 2009, Dror et al., 2011a], or even regularization functions that depend on the number of feedback events, in combination with an efficient hyperparameter search procedure, may lead to more accurate models.

In the following chapter 5, which is about music recommendation, we provide a use case for non-uniform item sampling with WBPR, including a range of experiments.

Chapter 5

Recommending Songs

In this chapter, we describe how some of the methods laid out in the preceding chapter can be applied to a real-world problem: music recommendation. In particular, we describe the approach we used for track 2 of the KDD Cup 2011, where the task was to recommend songs to individual users.

The KDD Cup 2011 consisted of two separate tracks. Track 1 was about rating prediction (see section 2.2.1). The task of track 2 was to predict which 3 out of 6 candidate songs were positively rated (higher than a certain threshold) instead of not rated at all by a user. The candidate items were not sampled uniformly, but according to their general popularity, i.e. the number of users who gave a positive rating to them. We use the weighted BPR (WBPR) optimization criterion described in section 4.2, which takes the non-uniform sampling of negative test items into account, together with the modified version of the generic BPR learning algorithm, which maximizes the new criterion by adapting the sampling process (section 4.3.1). We use the learning algorithm to train ranking matrix factorization models as components of an ensemble. Additionally, we combine the ranking predictions with rating prediction models to also take into account the rating data in the provided dataset. With an ensemble of such combined models, we achieved an error rate of 4.49%, which means that our method picked a wrong (unrated) instead of a preferred song less than 1 in 20 times. We ranked 8th (out of more than 1,850 teams [Dror et al., 2011b]) in track 2 of the KDD Cup 2011, without exploiting the additional taxonomic information available in the dataset.

5.1 Problem Statement

The task of track 2 of the 2011 KDD Cup was to predict which 3 songs¹ out of 6 candidates a user will like, i.e. rate with a score of 80 or higher on a scale from 0 to 100, for a set of users, given the past ratings of a superset of the users.
Additionally, an item taxonomy expressing relations between songs, albums, artists, and genres was provided by the contest organizers [Dror et al., 2011b]. We did not use this additional data in our approach.

¹ A song is called a "track" in the competition description. We use the term song to avoid confusion with tracks 1 and 2 of the KDD Cup.

Figure 5.1: Task of KDD Cup 2011, track 2: distinguish between songs a user liked (rating ≥ 80) and songs the user has not rated. Items that have been rated below 80 by the user are not present in the test dataset.

Figure 5.2: The "liked" contrast: we say that a user likes an item if they rated it with 80 or higher.

The 3 candidate songs that have not been rated highly by the user have not been rated at all by the user. They were not sampled uniformly, but according to how often they are rated highly in the overall dataset. To put it briefly, the task was to distinguish items (in this case songs) that were likely to be rated with a score of 80 or higher by the user from items that were generally popular, but not rated by the user (Figure 5.1). This is similar to the task of distinguishing the highly rated items from generally popular ones, which we call the "liked" contrast (Figure 5.2).

Generally, the training and testing sets in track 2 of the KDD Cup 2011 have the following structure:

    D_train ⊂ U × I × [0, 100]    (5.1)
    D_test ⊂ U × I × {0, 1}    (5.2)

with ∀(u, i, p_{u,i}) ∈ D_test : ¬∃(u, i, r_{u,i}) ∈ D_train. The training set D_train contains ratings, and the testing set D_test contains binary variables that represent whether a user has rated an item with a score of at least 80 or not.

5.1.1 Evaluation Criterion

The evaluation criterion is the error rate, which is simply the relative number of wrong predictions:

    e = 1 − (1 / |D_test|) Σ_{(u,i,p_{u,i}) ∈ D_test} δ(p_{u,i} = p̂_{u,i}),    (5.3)

where δ(x = y) is 1 if the condition (in this case x = y) holds and 0 otherwise, and p̂_{u,i} is the prediction of whether item i is rated 80 or higher by user u. For a single user, the error rate is

    e_u = 1 − (1 / (|I_u^{+,test}| + |I_u^{−,test}|)) Σ_{(u,i,p_{u,i}) ∈ D_u^test} δ(p_{u,i} = p̂_{u,i}).    (5.4)

For the KDD Cup 2011, we have the additional constraints that for every highly rated item of each user there is an item that has not been rated in the evaluation set D_test, and that exactly half of the candidate items must be given a prediction of p̂_{u,i} = 1. We call this the 1-vs.-1 evaluation scheme.

5.2 Methods

The obvious approach for track 2 of the KDD Cup 2011 is to assign scores to the 6 candidate items of each user, and then to pick the 3 highest-scoring candidates. This is similar to classical top-N item recommendation. The decision function is

    p̂_{u,i} = 1 if |{j | (u, j) ∈ D_test ∧ ŝ_{u,i} > ŝ_{u,j}}| ≥ 3, and 0 otherwise.    (5.5)
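Equation 5.5 amounts to picking the user's top 3 of the 6 candidates (assuming no ties). A minimal sketch, with illustrative names:

    # Decision rule of eq. (5.5): predict 1 for the 3 highest-scoring
    # of a user's 6 candidate items, 0 for the others.
    def decide(scores):
        """scores: dict mapping candidate item -> predicted score s_{u,i}."""
        top3 = set(sorted(scores, key=scores.get, reverse=True)[:3])
        return {i: int(i in top3) for i in scores}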
5.2.1 Optimizing for the Competition Objective

The area under the ROC curve (AUC, see section 2.4.1) is a ranking measure that can also be computed for the KDD Cup scenario.

Lemma 1. In the 1-vs.-1 evaluation scheme, the per-user accuracy 1 − e_u grows strictly monotonically with the per-user area under the ROC curve (AUC), and vice versa.

Proof. Items are ordered according to their scores ŝ_{u,i}. Let n_tp and n_tn be the numbers of true positives and true negatives, respectively. Given I_u⁺ and I_u⁻, we have AUC(u) = (n_tp · n_tn) / (|I_u⁺| · |I_u⁻|) < 1 and 1 − e_u = (n_tp + n_tn) / (|I_u⁺| + |I_u⁻|) < 1. If the scores change such that p̂′_{u,i} ≠ p̂_{u,i} for exactly two items that have been wrongly classified before, then AUC′(u) = ((n_tp + 1) · (n_tn + 1)) / (|I_u⁺| · |I_u⁻|) > AUC(u) and 1 − e′_u = (n_tp + 1 + n_tn + 1) / (|I_u⁺| + |I_u⁻|) > 1 − e_u.

This means that maximizing the user-wise AUC on the training data (while preventing overfitting) is a viable strategy for learning models that perform well under the 1-vs.-1 evaluation scheme.

5.2.2 Matrix Factorization Optimized for Weighted BPR

Matrix factorization models are suitable prediction models for recommender systems, and are known to work well for item recommendation when trained using the BPR framework, which optimizes the user-wise AUC. Thus, we used matrix factorization for the KDD Cup (section 2.3.7). In particular, we used the weighted BPR (WBPR, section 4.2) approach to account for the specific evaluation criterion of the KDD Cup 2011.

5.2.3 Ensembles

To get more accurate predictions, we trained models for different numbers of factors k and with different regularization settings. We combined the results of the different models, and of the same models at different training stages, into ensembles. We used two different combination schemes, score averaging and vote averaging.

Score averaging: If models have similar output ranges, for example the same model at different training stages, we can achieve more accurate predictions by averaging the scores predicted by the models:

    ŝ^{score-ens}_{u,i} = Σ_m ŝ^{(m)}_{u,i}.    (5.6)

Vote averaging: If we do not know whether the scales of the scores are comparable, we can still average the voting decisions of the different models:

    ŝ^{vote-ens}_{u,i} = Σ_m p̂^{(m)}_{u,i}.    (5.7)

Other possible combination schemes would be ranking ensembles [Rendle and Schmidt-Thieme, 2009], and of course weighted variants of all schemes discussed here.

Greedy forward selection of models: Because selecting the optimal set of models for an ensemble is not feasible if the number of models is high, we perform a greedy forward search to find a good set of ensemble components. This search procedure tries all candidate components, sorted by their validation set accuracy, and adds a candidate to the ensemble if it improves the current mix. When searching a large number (> 2,000) of models, we ignored candidates above a given error threshold.

Data: D_train, D_test, n
Result: D_train-val, D_test-val
 1 D_train-val ← D_train
 2 U_test ← {u | (u, i, p_{u,i}) ∈ D_test}
 3 forall u ∈ U_test do
 4   I⁺ ← {n random items from I_u^{+(t2)}}
 5   D_test-val ← D_test-val ∪ {u} × I⁺ × {1}
 6   I⁻ ← {n items from I_u^{−(t2)}, sampled proportionally to popularity}
 7   D_test-val ← D_test-val ∪ {u} × I⁻ × {0}
 8   forall i ∈ I⁺ ∪ I⁻ do
 9     D_train-val ← D_train-val − {(u, i, r_{u,i})}
10   end
11 end

Algorithm 9: Sampling procedure for the validation split.

5.2.4 Incorporating Rating Information

Except for the rating threshold of 80, the methods presented so far do not take into account the actual rating values. We suggest two different schemes of combining probabilities of whether an item has been rated by a user with rating predictions produced by a matrix factorization model that incorporates user and item biases [Koren et al., 2009, Rendle and Schmidt-Thieme, 2008]:

    min_{W,H,b^U,b^I} Σ_{(u,i,r_{u,i}) ∈ D_train} ( r_min + g(μ + b^U_u + b^I_i + ⟨w_u, h_i⟩) · (r_max − r_min) − r_{u,i} )²
        + λ_b (‖b^U‖² + ‖b^I‖²) + λ^U ‖W‖² + λ^I ‖H‖²,    (5.8)

where μ is the global rating average, and [r_min, r_max] is the rating range.
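The prediction part of eq. 5.8 squashes a biased matrix factorization score into the rating range. A small sketch of the predictor, with hypothetical parameter names:

    import numpy as np

    # Sketch of the prediction function inside eq. (5.8); names are ours.
    def predict_rating(mu, b_user, b_item, W, H, u, i, r_min=0.0, r_max=100.0):
        z = mu + b_user[u] + b_item[i] + W[u] @ H[i]
        g = 1.0 / (1.0 + np.exp(-z))          # logistic function g
        return r_min + g * (r_max - r_min)    # squashed into the rating range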
The model is trained using stochastic gradient descent with the bold-driver heuristic, which dynamically adapts the learning rate. Using this heuristic for learning matrix factorizations was first suggested by Gemulla et al. [2011].

First, we describe how we compute probabilities from the prediction scores of models that were trained to decide whether an item has been rated or not (Figure 5.3). After that, we describe how such probabilities can be combined with rating predictions.

Estimating probabilities:

    p̂^{rated}_{u,i} = Σ_{k=1}^{5} Σ_{l=k+1}^{5} Σ_{m=l+1}^{5} g(ŝ^{rated}_{u,i,j_k}) g(ŝ^{rated}_{u,i,j_l}) g(ŝ^{rated}_{u,i,j_m}),    (5.9)

where ŝ^{rated}_{u,i,j_1}, ..., ŝ^{rated}_{u,i,j_5} refer to the score estimates against the other 5 candidates. Note that the models for ŝ^{rated} are trained using all ratings as input, not just those of 80 or higher. The intuition behind this way of probability estimation is as follows: g(ŝ^{rated}_{u,i,j_k}) ∈ (0, 1) can be interpreted, similar to the case of logistic regression (e.g. [Bishop, 2006]), as the probability that item i is ranked higher (more likely to be rated) than item j_k by user u. We know that exactly 3 of the 6 candidates are rated by the user, which means we need to estimate how probable it is that a given item is ranked higher than 3 of the other candidates. Equation 5.9 sums up the probabilities for the different cases where this holds.

Scheme 1: Multiplication with the rating prediction. The first scheme takes a "rated" probability and multiplies it with a rating prediction from a model trained on the rating data:

    ŝ^{one}_{u,i} = p̂^{rated}_{u,i} · r̂_{u,i},    (5.10)

where r̂_{u,i} is the predicted rating.

Scheme 2: Multiplication with the rating probability. The second scheme takes a "rated" probability and multiplies it with the probability that the item, if rated, gets a rating of 80 or more by the user:

    ŝ^{two}_{u,i} = p̂^{rated}_{u,i} · p̂^{≥80}_{u,i},    (5.11)

where p̂^{≥80}_{u,i} is the estimated probability of r_{u,i} ≥ 80. We estimate p̂^{≥80}_{u,i} using several different rating prediction models:

    p̂^{≥80}_{u,i} = Σ_k δ(r̂^{(k)}_{u,i} ≥ 80).    (5.12)

5.2.5 Contrasts

Depending on the exact contrast we wish to learn, there are different conditions for what is in the set of positive (I_u⁺) and negative (I_u⁻) items for each user.

Track 2 contrast: The contrast to be learned for the KDD Cup 2011 ignores all ratings below a score of 80. Such ratings are not used for sampling the negative candidate items; only items that are not rated by the user are potential candidates (Figure 5.1):

    I_u^{+(t2)} := {i | ∃ r_{u,i} ≥ 80 : (u, i, r_{u,i}) ∈ D_train}    (5.13)
    I_u^{−(t2)} := I \ {i | ∃ r_{u,i} : (u, i, r_{u,i}) ∈ D_train}    (5.14)

Note that all items i with r_{u,i} < 80 belong to neither of the two sets.

Liked contrast: The "liked" contrast differentiates between what users have rated highly (80 or more) and what they have not rated or rated with a score below 80 (Figure 5.2):

    I_u^{+(liked)} := {i | ∃ r_{u,i} ≥ 80 : (u, i, r_{u,i}) ∈ D_train}    (5.15)
    I_u^{−(liked)} := I \ I_u^{+(liked)}    (5.16)

As can easily be seen from the definition of I_u^{−(liked)}, the split between positive and negative items is exhaustive for each user.

Rated contrast: Finally, the "rated" contrast differentiates between what users have rated and what they have not rated (Figure 5.3):

    I_u^{+(rated)} := {i | ∃ r_{u,i} : (u, i, r_{u,i}) ∈ D_train}    (5.17)
    I_u^{−(rated)} := I \ I_u^{+(rated)}    (5.18)

Again, this split is exhaustive for each user.

5.3 Experiments

5.3.1 Datasets

We created a validation split from the training set (see section 2.5.3) so that we could estimate the accuracy of different models, and use those estimates to drive the composition of the ensemble models.
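A sketch of the per-user sampling of Algorithm 9 in Python (hypothetical data structures: pos[u] holds the user's highly rated items, popularity[i] counts how often item i is rated highly overall):

    import numpy as np

    # Hypothetical sketch of Algorithm 9: build a validation split that mimics
    # the track 2 test sets; all names and data structures are ours.
    def validation_split(pos, all_items, popularity, n=3, seed=0):
        rng = np.random.default_rng(seed)
        test = {}
        for u, pos_items in pos.items():
            pos_sample = rng.choice(list(pos_items), size=n, replace=False)
            candidates = np.array([i for i in all_items if i not in pos_items])
            p = popularity[candidates] / popularity[candidates].sum()
            neg_sample = rng.choice(candidates, size=n, replace=False, p=p)
            test[u] = ([int(i) for i in pos_sample], [int(i) for i in neg_sample])
            # the sampled positive items would also be removed from the
            # training ratings, as in lines 8-10 of Algorithm 9
        return test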
The procedure to create the split, based on the task description of track 2², is given in Algorithm 9. In the case of the KDD Cup data, the number of positive items per user in the test set is n = 3. Table 5.1 shows the characteristics of the different splits.

² http://kddcup.yahoo.com/datasets.php

Figure 5.3: The "rated" contrast: the question is not how a user has rated an item, but whether.

                   Ratings
                   validation split   competition split
# users            249,012            249,012
# items            296,111            296,111
# ratings          61,640,890         61,944,406
sparsity           0.999164           0.9991599
# test users       101,172            101,172
# test items       128,114            118,363

                   Ratings ≥ 80
                   validation split   competition split
# users            248,502            248,529
# items            289,234            289,303
# ratings          22,395,798         22,699,314
sparsity           0.9996884          0.9996843
# test users       101,172            101,172
# test items       128,114            118,363

Table 5.1: Characteristics of the validation and competition splits when considering all ratings (Figure 5.3) and only the ratings of 80 or more (Figure 5.2), respectively.

5.3.2 Rating Prediction

Table 5.2 contains the rating prediction accuracy in terms of root mean square error (RMSE, see section 2.4.1) and mean absolute error (MAE) on the validation split for different hyperparameter combinations.

Model  Hyperparameters                                                    RMSE   MAE
MF     k = 40, λ_U = 2.3, λ_I = 1.4, λ_b = 0.009, α = 0.00002, i = 30     25.37  16.88
MF     k = 60, λ_U = 3.9, λ_I = 1.7, λ_b = 0.00005, α = 0.00005, i = 55   25.35  16.67

Table 5.2: Rating prediction accuracy on the validation split for different matrix factorization models (eq. 5.8).

5.3.3 Track 2 Results

We trained all models on both splits. Some results on the validation split and from the leaderboard (the Test1 set) are shown in Table 5.3.

Model                     Hyperparameters                                                        Validation  Leaderboard
most popular              —                                                                      29.8027     42.8546
most rated                —                                                                      29.0802     42.8810
WR-MF [Hu et al., 2008]   k = 60, λ = .0001, c_pos = 320, i = 30                                 8.90        13.7587
WBPR-MF (liked contrast)  k = 20, λ_U = λ_I = λ_J = 0.005, i = 86                                6.275       8.7482
WBPR-MF (liked contrast)  k = 240, λ_U = .01, λ_I = .005, λ_J = .0005, λ_b = .0000001, i = 222   6.089       6.0449
WBPR-MF (liked contrast)  k = 320, λ_U = .01, λ_I = .0025, λ_J = .0005, λ_b = .0000001, i = 322  5.4103      5.8944
WBPR-MF (rated contrast)  k = 320, λ_U = .0075, λ_I = .005, λ_J = .00025, λ_b = .000015, i = 53  5.5948      6.0819
ensemble                  55 different models of WBPR-MF with k = 400                            3.80178     5.2996
final submission          see section 5.3.4                                                      —           4.4929

Table 5.3: Validation set and KDD Cup 2011 leaderboard error percentages for different models. i refers to the number of iterations used to train the model. See the method section for details about these methods.

5.3.4 Final Submission

For our final submission (see Table 5.3), we used the second rating integration scheme (eq. 5.11). To estimate p̂^{rated}_{u,i}, we created a score ensemble (section 5.2.3) from the candidate models described in Table 5.4, with a candidate error threshold of 5.2%; models with a higher validation error were not considered for the ensemble. We estimated the probabilities for a high rating p̂^{≥80}_{u,i} according to eq. 5.12, from the models listed in Table 5.5.

k    λ_U    λ_I                       λ_J                          λ_b                  α     i               #
480  0.005  {0.0015, 0.0025, 0.0035}  {0.00015, 0.00025, 0.00035}  {0.000015, 0.00002}  0.04  {10, ..., 200}  3,420

Table 5.4: Candidate components of the score ensemble used for estimating p̂^{rated}_{u,i} (section 5.2.3). The last column shows the number of different models resulting from combining the hyperparameter values in that row.
k   λ_U              λ_I               λ_b                          α        i                                              #
40  {1.9, 2.0, 2.2}  {0.8, 1.0, 1.2}   {0.000075, 0.0001, 0.0075}   0.00002  {8, ..., 11, 20, 24, 30, 31, 33, 38, ..., 41}  351
40  {2.1, 2.3}       {1.1, 1.4}        {0.006, 0.0075, 0.009}       0.00002  {8, ..., 11, 20, 24, 30, 31, 33, 38, ..., 41}  156
60  {3, 3.5}         {1.1, 1.25, 1.5}  {0.0000075, 0.00005}         0.00005  {30, 50, 70, 89, ..., 93}                      84
60  {3.4, 3.9}       {1.2, 1.5, 1.7}   {0.00005}                    0.00005  {30, 50, 70, 89, ..., 93}                      48

Table 5.5: Rating prediction models used for estimating p̂^{≥80}_{u,i} (eq. 5.12) in the final KDD Cup submission. The last column shows the number of different models resulting from combining the hyperparameter values in that row.

5.4 Related Work

The work most closely related to the methods presented and evaluated in this chapter is obviously found in the publications of the KDD Cup 2011 workshop [Dror et al., 2011b], which we briefly summarize here.

McKenzie et al. [2011], the winners of the challenge, combined dozens of different approaches. Among other things, they employed different factorization and linear methods optimized for ranking criteria like BPR and for element-wise criteria, as well as kNN-based methods and several techniques to take the taxonomy information into account. Additionally, classifiers like SVMs and a neural network were used. The methods were combined linearly by a bagging [Breiman, 1996] method using random coordinate descent [Li and Lin, 2007], and non-linearly by AdaBoost [Freund and Schapire, 1997], LogitBoost [Friedman et al., 2000], and Random Forests [Breiman, 2001]. It is worth noting that they also combined rating predictions with the output of item recommendation methods, similar to the approach we describe in section 5.2.4.

Lai et al. [2011] use an ensemble of factorization models, content-based models, and neighborhood models, plus specific post-processing rules to fine-tune their predictions. Other participants [Xie et al., 2011, Balakrishnan et al., 2011, Kong et al., 2011] also used engineered features, which were fed into different models like SVMs, logistic regression, generalized linear models, neural networks, Random Forests, and gradient boosted decision trees [Friedman, 2002].

Jahrer and Töscher [2011] suggest ranking methods based on the direct optimization of the error rate using stochastic gradient descent. Models they use are matrix factorization (see section 2.3.6), asymmetric factor models (Paterek [2007], both for users and items), asymmetric factor models with a flipped taxonomy, user- and item-based kNN using the Pearson similarity (see section 2.3.2), item-based kNN with matrix factorization features, and restricted Boltzmann machines [Salakhutdinov et al., 2007]. Blending of the different models was done with a neural network [Töscher et al., 2010].

Mnih [2011] suggests a BPR factorization model with shared factors between tracks, albums, and artists, but disregarding genres. As we do with WBPR, he uses a popularity-based distribution for sampling negative items in order to optimize for the task-specific measure. In addition to latent factors, the author came up with manually engineered features that he integrated into his models.

5.5 Summary and Outlook

We described how the optimization criterion WBPR can be applied to music recommendation, as in track 2 of the KDD Cup 2011. In addition to ensembles of different WBPR matrix factorization models, we enhanced the predictions by integrating additional rating information.
The experiments presented in this chapter, and our ranking on the KDD Cup leaderboard, achieved even though we did not make use of the additional taxonomy information, suggest that our methods are suitable for such recommendation tasks. We should also point out that in a real-world application, we would of course make use of the taxonomy information about the songs, as well as of content features of the songs [Celma, 2010].

While the winning team [McKenzie et al., 2011, Chen et al., 2011a] used taxonomy information, it is remarkable how much can be achieved without taking it into account. As shown in the Netflix Prize [Koren, 2009, Koren et al., 2009, Koren, 2010, Takács et al., 2008, 2009, Töscher et al., 2008], and again in the KDD Cup 2011, automatic learning algorithms are able to extract predictive features from interaction data, without even looking at the content.

There are several aspects worth further investigation. First of all, we reduce a classification problem (optimization for the error rate) to a ranking problem, which we again solve using a reduction to pairwise classification. While in general item recommendation scenarios ranking is the problem we want to solve, it would still be interesting to see whether improvements are possible by directly training a classifier. We have not used the item taxonomy, so a next step will be to make use of this additional information, as well as to try other ways of integrating the rating information (see section 5.2.4). A fully Bayesian treatment of the WBPR framework, e.g. by estimating parameter distributions [Freudenthaler et al., 2011], could yield models that have fewer hyperparameters, while having accuracies comparable to ensembles of the current models. For the competition, we performed all training on the "liked" (Figure 5.2) and "rated" (Figure 5.3) contrasts, but not on the proper contrast (Figure 5.1) that was used for evaluation in the KDD Cup. We could investigate whether there are significant benefits to learning the correct contrast.

Chapter 6

The MyMediaLite Library

In this chapter, we describe MyMediaLite, a fast and scalable multi-purpose library of recommender system algorithms, aimed both at researchers and practitioners. MyMediaLite implements all algorithms discussed in this thesis, plus several other methods from the literature.

MyMediaLite addresses two common scenarios in collaborative filtering: rating prediction (e.g. on a scale of 1 to 5 stars) and item recommendation from positive-only feedback (e.g. from clicks, likes, or purchase actions). The library offers state-of-the-art algorithms for those two tasks. Programs that expose most of the library's functionality, plus a GUI demo, are included in the package. Efficient data structures and a common API are used by the implemented algorithms, and may be used to implement further algorithms. The API also contains methods for real-time updates and for loading/storing already trained recommender models. MyMediaLite is free/open source software, distributed under the terms of the GNU General Public License (GPL)¹. Its methods have been used in four different industrial field trials of the MyMedia project², including one trial involving over 50,000 households [Marrow et al., 2010]. In the following, we describe MyMediaLite's features, and compare it to existing free/open source recommender system software.
6.1 Motivation: Free Software for Research

In general machine learning and data mining, as well as in specific sub-domains like computer vision and text mining/natural language processing, there exist free/open source collections of common algorithms and evaluation protocols that are in broad use. Examples of such packages are Weka [Hall et al., 2009], R [Ihaka and Gentleman, 1996], scikit-learn [Pedregosa et al., 2011], Shogun [Sonnenburg et al., 2010], and RapidMiner for general machine learning, OpenCV [Bradski and Kaehler, 2008] for vision, and GATE [Cunningham et al., 2011] for text mining. The recommender systems community, both researchers and technology users, could of course also profit from the availability of one or more such software packages.

¹ http://www.gnu.org/copyleft/gpl.html
² http://www.mymediaproject.org

Free/open source implementations of recommender system algorithms are desirable for three reasons:

1. They relieve researchers from implementing existing methods for their experiments, either for comparing them against newly designed methods, or as recommendation methods in other kinds of studies, e.g. in user interface research;

2. they can play a crucial role in the practical adoption of newly developed techniques, either by providing software that can be directly adapted and deployed, or at least by giving example implementations;

3. and finally, they can be used for (self-)teaching future recommender systems researchers and practitioners.

Additionally, well-designed software frameworks can make the implementation and evaluation of new algorithms much easier. Ekstrand et al. [2011] argue that publicly available algorithm implementations should be the standard in recommender system research.

MyMediaLite³ is software developed with all of these aspects in mind. It targets both academic and industrial users, who may use the existing algorithms in the library, or use the framework for the implementation and evaluation of new algorithms.

³ Here and in the appendix we describe MyMediaLite 3.01, released in May 2012, unless stated otherwise.

6.2 Feature Overview

MyMediaLite addresses two common scenarios in collaborative filtering: rating prediction (e.g. on a scale of 1 to 5 stars) and item recommendation from positive-only feedback (e.g. from clicks or purchase actions). It offers state-of-the-art algorithms for those two tasks, plus incremental updates (where feasible), serialization of computed models, and a rich choice of evaluation protocols.

MyMediaLite is implemented in C#, and runs on the .NET platform. With the free .NET implementation Mono, it can be used on all common operating systems. Using the library is not limited to C#, though: it can easily be called from other languages like C++ (by embedding the Mono runtime into the native code), Java (via IKVM), F#, Ruby, and Python; code examples are included with the software.

6.2.1 Recommendation Tasks

We give a brief overview of the recommendation methods available for each task. Details on how to use these recommenders can be found in appendix A.5.

Rating Prediction: MyMediaLite contains different variants of k-nearest-neighbor (kNN) models [Linden et al., 2003], simple baseline methods (averages, biases, time-dependent biases [Koren, 2009], Slope-One [Lemire and Maclachlan, 2005], co-clustering [George and Merugu, 2005]; see section 2.3.1), and modern matrix factorization
methods (see sections 2.3.6 and 6.2.6) [Rendle and Schmidt-Thieme, 2008, Rennie and Srebro, 2005, Koren et al., 2009, Koren, 2008] for the task of rating prediction (see section 2.2.1).

Item Recommendation from Positive-Only Feedback: The library contains kNN models for this task (see section 2.2.2), as well as simple baselines (random/most popular item; see section 2.3.1) and advanced matrix factorization methods like WR-MF [Hu et al., 2008], BPR-MF [Rendle et al., 2009], and WBPR-MF (section 4.3.2). Additionally, it contains an implementation of the mapping approach presented in chapter 3.

Group Recommendation: Recommendations for user groups instead of individual users can be provided by aggregating the predicted scores of the user-item combinations according to different schemes [Baltrunas et al., 2010]:

• minimum: use the lowest score for the group decision,
• maximum: use the highest score,
• average: use the mean score,
• weighted average: use the average score, weighted by the number of ratings of each user in the training data, and
• pairwise wins: pick the item that is ranked above the other candidate items most frequently.

6.2.2 Command-Line Tools

For each of the two main recommendation tasks (group recommendation is included in the item recommendation tool), MyMediaLite comes with a command-line program that allows users to train and evaluate all available recommenders on data provided in text files, without having to write a single line of code. Newly developed recommenders are automatically detected, and do not have to be added to the programs manually. Most of the other library features described here are exposed to the user by the command-line programs; if not, this is explicitly mentioned. Detailed usage information for the command-line tools can be found in appendix A.2.

6.2.3 Data Sources

Besides collaborative data, i.e. the ratings in the case of rating prediction and the positive-only user feedback in the case of item recommendation, recommenders in MyMediaLite may also access other kinds of data: user attributes (like geographic location, age, profession) or item attributes (categories, keywords, etc.), and relations between users (e.g. the social network) or between items (taxonomies, TV series), respectively. Algorithms in the library that make use of such data are, for example, attribute-based kNN methods [Billsus et al., 2000], a linear model optimized for BPR [Gantner et al., 2010a], and SocialMF, a matrix factorization model that takes the social network of the users into account [Jamali and Ester, 2010]. MyMediaLite contains routines to read such data from SQL databases (not supported by the command-line programs) and from simple text files.

6.2.4 Evaluation

MyMediaLite contains routines for computing evaluation measures [Herlocker et al., 2004] like root mean square error (RMSE) and mean absolute error (MAE) for the rating prediction task; for item recommendation, it supports area under the ROC curve (AUC), precision at n (prec@n), mean average precision (MAP), mean reciprocal rank (MRR) [Voorhees, 2000], and normalized discounted cumulative gain (NDCG).
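As a point of reference, the two ranking measures most used in this thesis can be computed as follows; this is a plain Python sketch of the standard definitions, not MyMediaLite's code:

    # AUC of one user's ranked list (best first); ranked_relevance holds 0/1
    # flags. Counts the fraction of (relevant, irrelevant) pairs ordered
    # correctly.
    def auc(ranked_relevance):
        n_pos = sum(ranked_relevance)
        n_neg = len(ranked_relevance) - n_pos
        if n_pos == 0 or n_neg == 0:
            return 0.5                 # undefined; 0.5 by convention here
        correct, negs_below = 0, n_neg
        for rel in ranked_relevance:
            if rel:
                correct += negs_below  # irrelevant items still ranked below
            else:
                negs_below -= 1
        return correct / (n_pos * n_neg)

    def prec_at_n(ranked_relevance, n):
        """prec@n: share of relevant items among the top n positions."""
        return sum(ranked_relevance[:n]) / n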
Besides giving the user the possibility of creating arbitrary train-test splits and feeding them to MyMediaLite, the library implements several protocols for splitting the data provided by the user:

1. simple splits: use n% of the data for testing,
2. k-fold cross-validation,
3. chronological splits: use the last n% of the data for testing, or split at a given point in time in the dataset,
4. per-user chronological splits: use the last n% or the last n events of each user for testing.

Each of those methods has advantages and disadvantages. Simple splits and (per-user) chronological splits are fast to evaluate, because only one model (per evaluated method/hyperparameter combination) has to be computed and evaluated on a part of the dataset. k-fold cross-validation uses all data for testing, and thus generally yields more robust results, which is one of the reasons why it is a standard technique for model comparison in machine learning and applied statistics. On the other hand, k models have to be trained per method/hyperparameter combination, making cross-validation computationally more expensive. Chronological splits are the most realistic kind of evaluation split, because in real-life systems one can only use past data to predict future events. Per-user chronological splits are a bit less realistic, because not all users necessarily have their latest events in the same period of time. On the other hand, this was the way the split was generated for the Netflix Prize, making it a quite popular evaluation protocol that library users may want to use when trying to replicate results reported in the literature.

There is also support for hyperparameter selection using grid search and the Nelder-Mead method [Piotte and Chabbert, 2009].

6.2.5 Incremental Updates

Academic experiments on recommender system algorithms are usually conducted off-line, by training a prediction model and then evaluating it. Yet real-world recommender systems constantly receive new user feedback that should be incorporated into the prediction model immediately. MyMediaLite offers an API for immediate updates to already trained prediction models. Besides being supported in a special online evaluation mode by the two command-line tools, the use of incremental updates is demonstrated in the GUI demo, which asks the user for movie ratings and then immediately displays personalized movie recommendations (see Figure 6.1).

Figure 6.1: The MyMediaLite movie demo program.

6.2.6 Parallel Processing

Several parts of the library can easily be parallelized, to potentially make use of the several processors or cores present in a system. Whenever there are sequential code fragments that are independent of each other in terms of data accesses, those fragments can be parallelized. This is the case for cross-validation procedures (usually a small number of computationally intensive tasks), and for the prediction of personalized item scores in the evaluation of item recommendation methods (a large number of computationally cheap tasks). Consequently, both kinds of computations have been parallelized in MyMediaLite. Other candidates for parallelization would be user- or item-wise counting statistics in simple baseline methods (see section 2.3.1), or rating prediction evaluations. However, as such tasks are quite fast even for huge datasets, we have not parallelized them so far.

Parallel Stochastic Gradient Descent for Matrix Factorization: While some parts of the library, as described above, are parallelized, most algorithms have not yet been parallelized. One exception is the block-free parallel SGD for matrix factorization, which uses the same idea as Jellyfish [Recht and Re, 2011] and as Gemulla et al.'s KDD paper on distributed matrix factorization [Gemulla et al., 2011], which was published a bit earlier than Jellyfish.
Idea: Generally, the exact sequence of training examples in stochastic gradient descent does not matter too much; after all, the examples are drawn randomly from the training dataset. This is practical if we want to parallelize the training procedure. One obstacle that prevents a simple and straightforward parallelization of the algorithm is that the parameter updates for different training examples cannot be guaranteed to be independent. Both cited papers suggest overcoming this by dividing both the users and the items into roughly equal-sized parts, n_b of each. Updates on user factors are independent if the users are in two different parts, and updates on item factors are likewise independent if the items are in two different (item) parts. This means that the updates for two given ratings (that is, user-item combinations) are independent if and only if both the users and the items are in different parts. Consequently, if we consider the user and item parts to form a square matrix of blocks, then all blocks belonging to the same (wrapped) diagonal are independent of each other: all updates for the ratings in one block on the diagonal are independent of all updates for the ratings in the other blocks on that diagonal. This leads us to a block-free SGD algorithm for matrix factorization. To perform a complete pass over the training data, we process all diagonals (called sub-epochs by Gemulla et al.) of the square matrix in random sequence. When processing one diagonal, each block on the diagonal can be processed in parallel with the other blocks. Algorithm 10 contains the pseudocode of the procedure.

Algorithm description: The random partitioning of the users and items and the assignment of the ratings to blocks are performed in lines 1 to 6 of Algorithm 10, followed by the shuffling of the ratings within the blocks in lines 7 to 9. For every epoch, a different sub-epoch sequence is generated (line 13). Within a sub-epoch, the different blocks are processed in parallel (line 15).

Data: R, α, λ, n_it, n_b
Result: b^U, b^I, W, H
   // preparation: shuffle and partition the data
 1 p_U ← random permutation of 1 ... |U|
 2 p_I ← random permutation of 1 ... |I|
 3 for j ∈ {1 ... |R|} do
 4   (r_{u,i}, u, i) ← r_j
 5   b_{p_U(u) mod n_b, p_I(i) mod n_b} ← b_{p_U(u) mod n_b, p_I(i) mod n_b} ∪ {r_j}
 6 end
 7 for a ∈ {1 ... n_b} do
 8   for b ∈ {1 ... n_b} do shuffle b_{a,b}
 9 end
   // actual learning
10 b^U ← 0; b^I ← 0
11 initialize W, H to small random values
12 for it ∈ {1, ..., n_it} do
13   s ← (1, ..., n_b); shuffle s    // random sub-epoch sequence
14   for i ∈ s do    // sub-epoch
15     for j ∈ {1, ..., n_b} do in parallel
16       for k ∈ b_{j, (i+j) mod n_b} do
17         (u, i′, r_{u,i′}) ← r_k
18         update b^U_u, b^I_{i′}, w_u, h_{i′}
19       end
20     end
21   end
22 end

Algorithm 10: Parallel stochastic gradient descent for matrix factorization. R is the rating dataset, W and H are the model parameters, α is the learning rate (step size), λ is the regularization constant, n_it is the number of passes over the training data, and n_b is the number of blocks that can be processed at the same time. We assume that iteration over sets is in the order of the indexes, unless otherwise noted.
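To illustrate the schedule of Algorithm 10, here is a hypothetical Python sketch of one training epoch; sgd_update stands for the per-rating update of lines 17-18. (In CPython, true multi-core speed-ups would additionally require GIL-free workers, e.g. processes; the sketch only shows the block-diagonal scheduling.)

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    # blocks[a][b] holds the ratings (u, i, r) whose user is in part a and
    # whose item is in part b; all names are ours, not MyMediaLite's API.
    def one_epoch(blocks, n_b, sgd_update, rng=np.random.default_rng()):
        def process(block):
            for u, i, r in block:
                sgd_update(u, i, r)
        with ThreadPoolExecutor(max_workers=n_b) as pool:
            for d in rng.permutation(n_b):  # random sub-epoch sequence
                # the blocks (a, (d + a) mod n_b) form one wrapped diagonal:
                # their user parts and item parts are pairwise disjoint,
                # so they can be processed in parallel without locking
                diagonal = [blocks[a][(d + a) % n_b] for a in range(n_b)]
                list(pool.map(process, diagonal))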
6.2.7 Serialization

Another feature that is required in practice is the storing and loading of trained recommenders, which allows, for example, training recommenders on a different machine than the one that produces the recommendations. All recommenders in MyMediaLite support this feature.

6.2.8 Documentation

The library is accompanied by broad documentation. Besides the complete API documentation, there are example programs in Python, Ruby, and C#, as well as how-tos on typical tasks like embedding recommenders into a program, implementing new recommenders, or using the command-line tools.

6.2.9 Diversification and Ensembles

Besides the features described so far, MyMediaLite also supports attribute-based diversification of recommendation lists [Ziegler et al., 2005, Knijnenburg et al., 2011b] and ensembles of recommenders [Rendle and Schmidt-Thieme, 2009].

6.3 Development Practices

In the development of the library, we employ best practices for (not only) free/open source projects, like keeping the code in a public (and distributed) version control repository, having regular releases (roughly one per month), maintaining a collection of unit tests, and performing static analysis⁴ of the compiled code.

⁴ We use Gendarme to find problems in MyMediaLite.

6.4 Existing Software

In this section, we describe existing (mostly free) recommender system software. All mentioned features refer to the state in November 2011, unless otherwise mentioned. A more up-to-date listing of free recommender system software can be found on the MyMediaLite website.⁵ To our knowledge, this is the most complete overview of free recommender system software so far; other surveys can be found in Ekstrand et al. [2011], Angermann [2010], and Gantner et al. [2011b]. For a cleaner presentation, the homepage URLs of the discussed software packages are compiled in Table B.4 in appendix B.

⁵ http://ismll.de/mymedialite/links.html

6.4.1 Recommender System Libraries

This subsection covers libraries that contain several different recommendation algorithms, and potentially additional functionality to support the research, evaluation, development, and deployment of recommender systems.

GraphLab: GraphLab [Low et al., 2010, Wu et al., 2011] is a novel framework for parallel/distributed computing on graphs. It contains a library of several recommender algorithms. Implemented algorithms are probabilistic matrix/tensor factorization using Markov Chain Monte Carlo (MCMC) [Salakhutdinov and Mnih, 2008b,a, Xiong et al., 2010], alternating least squares (ALS) for matrix factorization [Zhou et al., 2008], stochastic gradient descent (SGD) [Koren et al., 2009, Takács et al., 2009], non-negative matrix factorization (NMF) [Lee and Seung, 2001], and SVD++ [Koren, 2008] for rating prediction. It also contains one algorithm, weighted ALS [Hu et al., 2008, Pan et al., 2008], that is suitable for implicit/positive-only feedback.

Apache Mahout: Apache Mahout [Owen et al., 2011] is a collection of mostly distributed (via Hadoop) implementations of machine learning and data mining algorithms. One section of the library is dedicated to collaborative filtering algorithms; the majority of its recommendation algorithms, taken from its predecessor Taste, is not distributed; an item-based kNN model [Linden et al., 2003] and Slope-One [Lemire and Maclachlan, 2005] are available as distributed implementations.

LensKit: LensKit is a recommender system algorithm library aiming at research and educational use [Ekstrand et al., 2011]. Currently, it contains matrix factorization, probabilistic latent semantic indexing (pLSI, Hofmann [2004]), Slope-One, and several kNN-based models.

recommenderlab: recommenderlab is a package for the R statistical language/environment. It contains association rules [Changchien and Lu, 2001], user-based and item-based kNN, and the most-popular baseline.
EasyRec: EasyRec is a recommender system web service that can be integrated into websites; however, it does not contain any advanced personalized algorithms; it is more a framework for connecting a recommender service with an application. In the future, its developers plan to make the software compatible with Mahout, so that it can use the methods contained there.

RecLab: RecLab [Vengroff, 2011] is a framework for performing live evaluations in online shopping systems; it contains an API to be implemented by the shop system, and another one for providing recommendations. It is used as the code infrastructure for the ongoing "RecLab Prize on Overstock.com" challenge.

Waffles: Waffles [Gashler, 2011] is a collection of general machine learning algorithms. One group of algorithms in Waffles are rating prediction methods: PCA [Hastie et al., 2009], matrix factorization, user-based kNN, item averages, user clustering, and bagging (for ensembles).

jCOLIBRI: jCOLIBRI [Díaz-Agudo et al., 2007] is a case-based reasoning (CBR) tool that can also be used for generating recommendations based on kNN. It supports about 30 different similarity measures.

MyCBR: MyCBR [Stahl and Roth-Berghofer, 2008] is another case-based reasoning (CBR) tool. It can also use jCOLIBRI, while offering additional features.

COFI: COFI [Lemire et al., 2005] is a Java-based collaborative filtering library for rating prediction. It contains simple baseline algorithms, kNN-based methods, Slope-One, the Eigentaste algorithm [Goldberg et al., 2001] (including a variant developed by the authors [Lemire, 2005]), and a variety of linear rating predictors.

Crab: Crab is a recommender system framework written in Python that currently supports kNN and Slope-One, and contains an early-stage matrix factorization implementation for rating prediction.

Duine: Duine [van Setten, 2005] was developed at Telin (now Novay). Its focus lies on kNN-based methods, both collaborative and content-based. According to its authors, it does not scale well to large datasets.

Taste.NET: Taste.NET is a port of the Mahout predecessor Taste (version 1.6) to C#. The choice of methods contained in Taste.NET is not as wide as in current Mahout versions. It is currently not actively developed; the last modification happened in January 2010.

6.4.2 Implementations of Single Methods

In this subsection, we describe implementations of single recommender system algorithms that are not part of a larger library or framework.

PyRSVD: PyRSVD is a Python implementation of SGD-trained matrix factorization for rating prediction.

CoRank: CoRank is the implementation of an ordinal ranking method for rating data of the same name [Weimer et al., 2008].

SVDFeature: SVDFeature [Chen et al., 2011b] implements a special case of factorization machines [Rendle, 2010b], which allows the flexible integration of information beyond the user-item interactions.

Jellyfish: Jellyfish [Recht and Re, 2011] is a large-scale parallel matrix factorization technique for rating prediction that is similar in spirit to the algorithm devised by Gemulla et al. [2011], which is implemented in MyMediaLite (see section 6.2.6).

Likelike: Likelike is an implementation of locality-sensitive hashing (LSH) [Das et al., 2007] on top of Hadoop.

Vowpal Wabbit: Vowpal Wabbit, a large-scale online learning system, while concentrating on linear models, also supports matrix factorization.
OpenSlopeOne: OpenSlopeOne is an implementation of Slope-One using the PHP language and the MySQL database.

Vogoo: Vogoo is an implementation of Slope-One. Its homepage is abandoned, and the last release was in early 2008.

Wooix: Wooix is an implementation of SVD++ [Koren, 2008] in Python. According to its author, it is not very fast.

Ruby on Rails Components: Recommendable and ActsAsRecommendable are two plug-ins for Ruby on Rails that allow the integration of recommendation features into websites.

6.4.3 Non-Free Publicly Available Software

In this section, we describe several publicly available (in terms of source code) recommender system software packages. While these are not free software, and thus do not allow commercial use, for example, the availability of the source code still allows researchers to inspect the implementations, and to use them for educational/research purposes.

LibFM: LibFM implements factorization machines [Rendle, 2010b], which are a generalization of matrix and tensor factorization (handling arbitrary interactions between modes) and of polynomial SVMs/logistic regression (with factorized features). Factorization machines can be used to mimic/implement several recommender system methods, and are particularly well-suited for context-aware recommendation [Rendle, 2010a].

Probabilistic Matrix/Tensor Factorization: The authors of several papers about probabilistic matrix and tensor factorization [Salakhutdinov and Mnih, 2008b,a, Xiong et al., 2010] provide a Matlab implementation of their methods.

Latent Log-Linear Models: The implementation of another (generalized) matrix factorization method, the latent feature log-linear model (LFL) [Menon and Elkan, 2010], is also available as Matlab code.

MultiLens: MultiLens [Miller, 2003] was released in 2004, and its homepage states that it will at some point be released as free software. However, currently only a compiled package is available for download. It contains rating prediction and item recommendation algorithms, most of them based on kNN, as well as rule-based algorithms.

6.5 System Comparison

We collected information on some of the recommender system algorithm libraries described above, and compare their features to those of MyMediaLite in Table 6.1. We concentrated on free software/open source⁶, and additionally included SUGGEST [Karypis, 2001], which is not free software, but was a fairly early publicly available package. Besides information on the latest available version, we compare the following features:

1. license: the license under which the software is distributed
2. language: the programming language the software is written in; note that all libraries run on the major computing platforms
3. actively developed: whether the software is under active development, i.e. whether there was development activity within the last six months
4. scalable: whether the software scales, i.e. whether it is capable of running at least one non-trivial recommender algorithm on the full Netflix data on a modern computer
5. distributed: Can the computation of at least one non-trivial recommender model be run on several computers at once?
6. matrix factorization: Does the package contain modern matrix factorization techniques?
7. kNN methods: Does the library contain k-nearest-neighbor methods?
8. rating prediction: Are there algorithms/evaluation routines for rating prediction?
9. positive-only feedback: Are there algorithms/evaluation routines for recommendation from positive-only feedback?
10. multi-core support: Can at least one non-trivial recommender algorithm be run on several cores at once? (Evaluation routines like cross-validation can run in parallel on several cores.)
11. time-aware: Are there algorithms that take the time of the events into account for training and (possibly) prediction?
12. group recommendation: Are there methods for providing recommendations to a group of users?
13. incremental updates: Are there recommender models that allow the dynamic incorporation of new feedback without having to re-train the complete model?
14. hyperparameter tuning: Are there routines for tuning the hyperparameters of recommendation methods?

⁶ See http://www.gnu.org/philosophy/free-sw.html and http://www.opensource.org/docs/osd.

Library      Version    Date        License     Language
Duine        4.0.0-RC1  2009-02-17  LGPL 3      Java
GraphLab     v1_134     2011-07-29  Apache 2.0  C++
LensKit      0.8.1      2011-10-10  LGPL 2      Java
Mahout       0.5        2011-05-27  Apache 2.0  Java
SUGGEST      1.0        2000-11-08  non-free    C
MyMediaLite  1.02       2011-08-03  GPL 3       C#

Table 6.1: Comparison of some free/open source recommender system frameworks.

6.6 Experiments

We performed several rating prediction experiments to showcase MyMediaLite's runtime performance and scalability. We first report on general rating prediction experiments on three different datasets, and then on a particular set of experiments to find out the speed-up gained by using the parallel algorithm described above on multiple cores instead of the sequential version. Note that we ran those experiments on a cluster that processes multiple jobs at the same time. This means the evaluations were not conducted in isolation, and their results should only be interpreted as a rough measurement of the system's runtime performance.

6.6.1 General Performance

We ran the BiasedMatrixFactorization recommender on the Netflix, MovieLens-100K, and MovieLens-1M datasets (see section 2.5). For Netflix, we used the probe dataset for validation; on the MovieLens datasets we performed 5-fold cross-validation. We measured the time needed for one pass over the training data (which should give a general idea of the performance of the employed data structures, not just of the particular recommender), the time needed for predicting the validation set, and the memory usage reported by the program.⁷ Table 6.2 shows the memory usage by dataset and evaluation protocol.

⁷ The evaluation scripts with all relevant parameter settings are available from http://ismll.de/mymedialite/examples/recsys2011.html [Gantner et al., 2011b].

Dataset                                k = 5    k = 120
MovieLens-100K (external split)        5 MB     6 MB
MovieLens-100K (5-fold CV)             10 MB    10 MB
MovieLens-1M (external split)          15 MB    20 MB
MovieLens-1M (5-fold CV)               56 MB    56 MB
Netflix (external split)               1271 MB  1490 MB
Yahoo! Music Ratings (external split)  8883 MB  9743 MB

Table 6.2: Memory usage for rating prediction with BiasedMatrixFactorization (MyMediaLite 1.04), as reported by the Mono runtime. Note that the actual memory usage may be lower, because garbage collection is not always enforced, as well as higher, because of the overhead of running the programs in a virtual machine.
Memory requirements are modest for small and medium-sized datasets, and even the Netflix dataset can be processed on a fairly modern computer without any problem. Figures 6.2 and 6.3 show the time for one pass over the training data for different model sizes on the Netflix dataset. Regarding prediction times, MyMediaLite is capable of making between 1,400,000 (k = 120) and 7,000,000 (k = 5) predictions per second.

We have not performed a systematic comparison with other libraries yet. Angermann's master thesis [Angermann, 2010] reports on experiments involving several recommender system frameworks.

Figure 6.2: Runtime of BiasedMatrixFactorization (MyMediaLite 1.04) on MovieLens-100K and MovieLens-1M. Plotted is the average time needed for one epoch (in seconds), depending on the number of latent factors per user and item.

Figure 6.3: Runtime of BiasedMatrixFactorization (MyMediaLite 1.04) and MultiCoreMatrixFactorization (nb = 128) on Netflix. Plotted is the average time needed for one epoch (in seconds), depending on the number of latent factors per user and item.

Figure 6.4: Runtime and memory usage of MultiCoreMatrixFactorization on Netflix. Plotted is the average time needed for one epoch (left) and the memory usage (right), depending on the number of blocks. Note that the y-axes are logarithmic.

6.6.2 Parallel Stochastic Gradient Descent

To measure the speed-up gained by using multiple cores for matrix factorization, as well as the memory overhead of our implementation, we conducted further experiments on the Netflix dataset. In the first stage of the experiment, we ran the algorithm with k = 120 factors and different values for nb. For comparison, we ran the sequential version of the same algorithm with otherwise identical settings. In the second stage, we set nb = 128, which turned out to be a practical value, and varied k.

Results: The results of the first stage can be seen in Figure 6.4. The sequential algorithm took 567.33 seconds on average for one pass over the training data, and consumed 1758 MB of memory. A good choice for nb on 8 cores seems to be 64 (where one pass took 144.95 seconds) or above. This means we have a speed-up of roughly 4 on 8 cores. The results of the second stage have been included in Figure 6.3.

Discussion: Ideally, we should have a speed-up of close to 8; the difference between ideal and reality can be explained by the overhead caused by the additional shuffling, and by low-level effects like thread switching, increased cache misses, and other factors. On the other hand, most modern computers have several cores, so it is beneficial to make use of them even if the speed-up is not perfect.
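To make the scheduling idea behind such block-wise parallel SGD concrete, the following C# sketch shows its core. It is an illustration under simplifying assumptions (dense factor arrays, ratings already grouped into user/item blocks, hypothetical names like ratings and SgdStep), not MyMediaLite's actual MultiCoreMatrixFactorization code:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class BlockSgdSketch
{
    // ratings[u, i] holds the (user, item, rating) triples whose user falls
    // into user group u and whose item falls into item group i (assumed to
    // have been built beforehand, e.g. by hashing IDs into nb groups).
    static List<Tuple<int, int, float>>[,] ratings;
    static float[][] userFactors, itemFactors;
    const int nb = 128;                           // number of user/item groups
    const float learnRate = 0.01f, reg = 0.015f;

    static void Epoch()
    {
        for (int round = 0; round < nb; round++)
            // In one round, the nb blocks (g, (g + round) % nb) share no
            // users and no items, so they can run in parallel without locks.
            Parallel.For(0, nb, g =>
            {
                foreach (var r in ratings[g, (g + round) % nb])
                    SgdStep(r.Item1, r.Item2, r.Item3);
            });
    }

    static void SgdStep(int user, int item, float rating)
    {
        float[] pu = userFactors[user], qi = itemFactors[item];
        float err = rating - Dot(pu, qi);
        for (int f = 0; f < pu.Length; f++)
        {
            float puf = pu[f];
            pu[f] += learnRate * (err * qi[f] - reg * puf);
            qi[f] += learnRate * (err * puf - reg * qi[f]);
        }
    }

    static float Dot(float[] a, float[] b)
    {
        float s = 0;
        for (int f = 0; f < a.Length; f++) s += a[f] * b[f];
        return s;
    }
}

The additional shuffling mentioned above corresponds to (re-)assigning ratings to the nb x nb blocks; this bookkeeping, not the gradient computation itself, is one source of the gap between the ideal and the measured speed-up.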
6.7 Impact

MyMediaLite is based on parts of the framework [Marrow et al., 2009] that has been used in four different industrial field trials of the MyMedia project, including one involving 50,000 households [Marrow et al., 2010]. Application areas in the field trials were IPTV, web-based video and audio, and online shopping. Additionally, the MyMedia framework was used to perform user-centric experiments [Bollen et al., 2010].

Towards the end of the project, we stripped off the framework's more heavyweight components, and released it as free software in September 2010. Since then, it has been downloaded more than 7,300 times8, received numerous improvements, and has been successfully used in several research activities: for example in studies on student performance prediction [Nguyen et al., 2011] and item recommendation in social networks [Du et al., 2011, Krohn-Grimberghe et al., 2012], in the KDD Cup 2011 for music recommendation [Balakrishnan et al., 2011], in an information retrieval evaluation project [Bellogín et al., 2011], and as a baseline for context-aware recommendation (winner of two tracks of the CAMRa 2010 challenge [Gantner et al., 2010c]).

The European project e-LICO has ported MyMediaLite to Java, in order to provide the basis of the recommender extension for the RapidMiner data analysis software. This could help to grow MyMediaLite's user base in two ways: First, Java is more widely used by both researchers and application developers than .NET, and second, RapidMiner allows calling the library from an easy-to-use graphical user interface.

8 excluding search engine spider downloads, July 2012

6.8 Summary and Outlook

MyMediaLite is a versatile library of recommender system algorithms for rating prediction and for item recommendation from positive-only feedback. We believe MyMediaLite is currently one of the most complete free/open source recommender system frameworks in terms of recommendation tasks, implemented methods, efficiency, features, flexibility, and documentation (see section 6.4).

We will continue MyMediaLite's development in several directions. Porting the library to Java or C++ is worth considering, given the popularity of these programming languages. Besides the RapidMiner port (see above), another partial Java port of the library is already available for download. A further useful extension would be a web service interface for the easy integration of MyMediaLite, e.g. into online shop software; we will consider existing APIs for this feature. While the three supported recommendation tasks (rating prediction, item recommendation, and recommendation for groups) cover many use cases, we also plan to add more recommendation tasks and types of input, e.g. item recommendation from other kinds of implicit feedback like viewing times or click counts, or tag recommendation (see section 2.2.3).

Chapter 7

Conclusion

This chapter concludes this thesis. After summarizing its contents, we discuss possible directions of future research.

7.1 Summary

Item recommendation is an important prediction task in the application area of recommender systems. We have presented a generic formal definition of item recommendation, and have expressed several more specific tasks in terms of this definition.
We suggested a framework that allows us to solve cold-start scenarios in the strict sense of the term for prediction methods that represent entities as vectors of real numbers, in particular factorization models, which are the state-of-the-art approach for collaborative filtering tasks. The framework relies on learning mapping functions from the (new) entity attributes to the latent factors. Experiments on the new-item problem with a matrix factorization method for item recommendation from positive-only feedback (BPR-MF) showed the suitability of the approach, and that optimizing the mapping functions for the actual optimization objective is worthwhile.

The Bayesian Personalized Ranking (BPR) criterion is a training objective and a generic learning algorithm for personalized ranking, which can be employed for item recommendation. We extended BPR to the more general weighted BPR (WBPR) criterion, which lets us individually specify the contribution of each entity to the global optimization target. WBPR can be used to tackle interesting large-scale item recommendation problems in which the candidate items to be scored and ranked are not drawn uniformly from the set of all available items, but according to other criteria like popularity. Such a scenario was the challenge posed by track 2 of the KDD Cup 2011: distinguishing between songs a user will like, and songs that are merely generally popular. We used matrix factorization models trained for an appropriate WBPR variant (WBPR-MF), augmented with information from rating-based models, to achieve an error rate of less than 5% on that task, which means that the prediction model's decision was wrong on less than 1 in 20 songs.

Having publicly available high-quality implementations of state-of-the-art algorithms is important for the progress of a field, as it enables researchers to realistically compare their new developments to the state of the art. We implemented all methods presented in this thesis, and some more, as part of the MyMediaLite software package, a reference collection of recommender system algorithms, accompanied by a rich infrastructure for the development, evaluation, testing, and deployment of new and existing methods. We described the state of the art of open source/free recommender software, and compared MyMediaLite to other existing programs. The experiments showed that the implementation scales well even when processing very large datasets.

In summary, the main contributions of this thesis to the state of the art of machine learning techniques for recommendation are a method for solving hard cold-start problems for arbitrary latent factor models, and a new flexible and generic optimization criterion and learning algorithm for item recommendation. The new methods are provided as part of the MyMediaLite package, to allow other researchers to reproduce the presented experiments, and to build new progress upon these and further methods.

7.2 Future Directions

Future research directions have been discussed throughout this thesis at the end of the respective chapters.
The MyMediaLite software, presented in chapter 6, can be enhanced to cover more application areas, for example tag recommendation, sequential recommendation, or general context-aware recommendation (see section 2.2.3), and even areas that are not about recommendation, but where the same or similar models and algorithms are useful, like student performance prediction [Nguyen et al., 2011] or link prediction [Sarukkai, 2000, Menon and Elkan, 2010, 2011]. Ideally, MyMediaLite could evolve into a package that supports arbitrary learning problems in complex domains with graph/network structure.

As presented in chapter 3, the framework for enabling factorization models to deal with new-user and new-item problems is both simple and modular. We implemented the example of matrix factorization optimized for the BPR criterion (BPR-MF) in the software. This work can be extended to the other factorization models present in MyMediaLite: By having a generic API for mapping from user and item attributes to latent factors, all factorization models could be made ready for cold-start scenarios. With a generic implementation, we could investigate how well more advanced mapping functions like multi-layer neural networks or support-vector regression work for tasks like rating prediction or tag prediction.

The weighted BPR (WBPR) criterion, introduced and discussed in section 4.2, can be applied to scenarios other than music recommendation (chapter 5), for example for providing personalized rankings of news articles that have been assigned weights/priorities by an editorial team. A further possible use of weighted BPR are learning scenarios with case weights.

Seeing recommendation as a supervised learning and prediction problem, which is the main theme of this thesis, has proven beneficial over and over, both in practice and in academic research. While it is a practical abstraction, it is not the last or only solution to modeling recommendation and personalization. After all, a recommender is often part of a larger interactive system, and is not as static as other scenarios where supervised learning techniques are used, like hand-written digit recognition for ZIP codes [LeCun et al., 1989]. Approaches that view recommendation as an interactive, dynamic process, and that use the abstraction of reinforcement learning [Sutton and Barto, 1998], for example Shani et al. [2006] and Li et al. [2010, 2011], seem to be a promising research direction.

Appendix A

MyMediaLite Reference

See chapter 6 for a general introduction to the software. This appendix is meant to be a reference manual. Part of its contents are also available on MyMediaLite's homepage.

MyMediaLite is a software package containing different recommender system algorithms, plus tools that support developers and end users in making efficient use of the software. The following sections give an introduction on how to use and extend MyMediaLite and its components, both from a developer's and from an end-user's perspective.

A.1 Installation

A.1.1 Prerequisites

For running MyMediaLite, you need at least Mono 2.8.x or another recent .NET runtime; Mono 2.10.x is highly recommended. Perl and the package File::Slurp are required for the download and data processing scripts, but not for running or building MyMediaLite. For building MyMediaLite from its sources, you either need an integrated development environment (IDE) like MonoDevelop or Visual Studio, or the make utility.
For building the API documentation, Doxygen 1.6.3 or later is needed.

A.1.2 Packages

MyMediaLite can be installed from source, or from a binary package. The download page1 offers three packages: a binary package, a source package, and a documentation package. The source code can also be obtained directly from MyMediaLite's repositories on GitHub and Gitorious.

1 http://ismll.de/mymedialite/download.html; see the second appendix for more URLs

A.1.3 Instructions

If you have the binary package, just copy its contents to wherever you want and run the programs from there. To build MyMediaLite from source on Unix-like systems, run make all. Set the PREFIX variable in the Makefile, then run make install. On Windows, compile the software using Visual Studio or MonoDevelop.

A.2 Command-Line Tools

MyMediaLite is mainly a library, meant to be used by other applications. Nevertheless, there are two command-line tools that offer much of MyMediaLite's functionality. They allow users to work with MyMediaLite without having to integrate the library into an application or having to develop their own programs.

A.2.1 Rating Prediction

The general usage of the rating prediction program is as follows:

rating_prediction --training-file=TRAINING_FILE --test-file=TEST_FILE --recommender=METHOD [OPTIONS]

METHOD is the recommender to use, which will be trained using the contents of TRAINING_FILE. The recommender will then predict the data in TEST_FILE, and the program will display the RMSE (root mean square error, see section 2.4.1) and MAE (mean absolute error) of the predictions. If you call rating_prediction without arguments, it will provide a list of recommenders to choose from, plus their arguments and further options:

MyMediaLite rating prediction 2.99

usage: rating_prediction --training-file=FILE --recommender=METHOD [OPTIONS]

recommenders (plus options and their defaults):
 - GlobalAverage
   supports --online-evaluation
 ...
 - SVDPlusPlus num_factors=10 regularization=0.015 bias_reg=0.33 learn_rate=0.001 bias_learn_rate=0.7 num_iter=30 init_mean=0 init_stddev=0.1
   supports --find-iter=N, --online-evaluation

method ARGUMENTS have the form name=value

general OPTIONS:
 --recommender=METHOD           set recommender method (default: BiasedMatrixFactorization)
 --recommender-options=OPTIONS  use OPTIONS as recommender options
 --help                         display this usage information and exit
 --version                      display version information and exit
 --random-seed=N                initialize the random number generator with N
 --rating-type=float|byte       store ratings internally as floats (default) or bytes
 --no-id-mapping                do not map user and item IDs to internal IDs, keep original IDs

files:
 --training-file=FILE           read training data from FILE
 --test-file=FILE               read test data from FILE
 --file-format=movielens_1m|kddcup_2011|ignore_first_line|default
 --data-dir=DIR                 load all files from DIR
 --user-attributes=FILE         file containing user attribute information
 --item-attributes=FILE         file containing item attribute information
 --user-relations=FILE          file containing user relation information
 --item-relations=FILE          file containing item relation information
 --save-model=FILE              save computed model to FILE
 --load-model=FILE              load model from FILE

prediction options:
 --prediction-file=FILE         write the rating predictions to FILE
 --prediction-line=FORMAT       format of the prediction line; {0}, {1}, {2} refer to user ID, item ID, and predicted rating; default is {0}\t{1}\t{2}
 --prediction-header=LINE       print LINE to the first line of the prediction file

evaluation options:
 --cross-validation=K           perform k-fold cross-validation on the training data
 --show-fold-results            show results for individual folds in cross-validation
 --test-ratio=NUM               use a ratio of NUM of the training data for evaluation (simple split)
 --chronological-split=NUM|DATETIME
                                use the last ratio of NUM of the training data ratings for evaluation, or use the ratings from DATETIME on for evaluation (requires time information in the training data)
 --online-evaluation            perform online evaluation (use every tested rating for incremental training)
 --search-hp                    search for good hyperparameter values
 --compute-fit                  display fit on training data

options for finding the right number of iterations (iterative methods):
 --find-iter=N                  give out statistics every N iterations
 --max-iter=N                   perform at most N iterations
 --measure=RMSE|MAE|NMAE|CBD    evaluation measure to use for the abort conditions below (default RMSE)
 --epsilon=NUM                  abort iterations if the evaluation measure is more than best result plus NUM
 --cutoff=NUM                   abort if the evaluation measure is above NUM

One can download the MovieLens 100k ratings dataset (see section 2.5) and unzip it to go through the following examples. In the MyMediaLite directory, this can be done by entering

make download-movielens

The file formats supported by MyMediaLite are described in section A.4. To try out a simple baseline method on the data, one just enters

rating_prediction --training-file=u1.base --test-file=u1.test --recommender=UserAverage

which should give a result like

UserAverage training_time 00:00:00.000098 RMSE 1.063 MAE 0.8502 testing_time 00:00:00.032326

To use a more advanced recommender, enter

rating_prediction --training-file=u1.base --test-file=u1.test --recommender=BiasedMatrixFactorization

which yields a better result than the user average:

BiasedMatrixFactorization num_factors=10 regularization=0.015 learn_rate=0.01 num_iter=30 init_mean=0 init_stdev=0.1 training_time 00:00:03.3575780 RMSE 0.96108 MAE 0.75124 testing_time 00:00:00.0159740

The key-value pairs after the method name represent arguments to the recommender that may be modified to get even better results. For instance, we could use more latent factors per user and item, which leads to a more complex (and hopefully more accurate) model:

rating_prediction --training-file=u1.base --test-file=u1.test --recommender=BiasedMatrixFactorization --recommender-options="num_factors=20"
...
RMSE 0.98029 MAE 0.76558

A.2.2 Item Recommendation

The item recommendation program behaves similarly to the rating prediction program, so we concentrate on the differences here.
The basic usage is:

item_recommendation --training-file=TRAINING_FILE --test-file=TEST_FILE --recommender=METHOD [OPTIONS]

Again, if you call item_recommendation without arguments, it will provide a list of recommenders to choose from, plus their arguments and further options:

MyMediaLite item recommendation from positive-only feedback 2.99

usage: item_recommendation --training-file=FILE --recommender=METHOD [OPTIONS]

methods (plus arguments and their defaults):
 - BPRMF num_factors=10 bias_reg=0 reg_u=0.0025 reg_i=0.0025 reg_j=0.00025 num_iter=30 learn_rate=0.05 uniform_user_sampling=True with_replacement=False bold_driver=False fast_sampling_memory_limit=1024 update_j=True init_mean=0 init_stddev=0.1
   supports --find-iter=N, --online-evaluation
 ...
 - MostPopular
   supports --online-evaluation

method ARGUMENTS have the form name=value

general OPTIONS:
 --recommender=METHOD           use METHOD for recommendations
 --group-recommender=METHOD     use METHOD to combine the predictions for several users
 --recommender-options=OPTIONS  use OPTIONS as recommender options
 --help                         display this usage information and exit
 --version                      display version information and exit
 --random-seed=N                initialize random number generator with N

files:
 --training-file=FILE           read training data from FILE
 --test-file=FILE               read test data from FILE
 --file-format=ignore_first_line|default
 --no-id-mapping                do not map user and item IDs to internal IDs, keep the original IDs
 --data-dir=DIR                 load all files from DIR
 --user-attributes=FILE         file with user attribute information
 --item-attributes=FILE         file with item attribute information
 --user-relations=FILE          file with user relation information
 --item-relations=FILE          file with item relation information
 --user-groups=FILE             file with group-to-user mappings
 --save-model=FILE              save computed model to FILE
 --load-model=FILE              load model from FILE

data interpretation:
 --user-prediction              transpose the user-item matrix and perform user prediction instead of item prediction
 --rating-threshold=NUM         (for rating datasets) interpret rating >= NUM as positive feedback

choosing the items for evaluation/prediction (mutually exclusive):
 --candidate-items=FILE         use items in FILE (one per line) as candidate items
 --overlap-items                use only items that are both in the training and the test set as candidate items
 --in-training-items            use only items in the training set as candidate items
 --in-test-items                use only items in the test set as candidate items
 --all-items                    use all known items as candidate items

choosing the users for evaluation/prediction:
 --test-users=FILE              predict items for the users specified in FILE (one user per line)

prediction options:
 --prediction-file=FILE         write ranked predictions to FILE, one user per line
 --predict-items-number=N       predict N items per user (needs --prediction-file)

evaluation options:
 --cross-validation=K           perform k-fold cross-validation on the training data
 --show-fold-results            show results for individual folds in cross-validation
 --test-ratio=NUM               evaluate by splitting off a NUM part of the feedback
 --num-test-users=N             evaluate on only N randomly picked users (to save time)
 --online-evaluation            perform online evaluation (use every tested user-item combination for incremental training)
 --filtered-evaluation          perform evaluation filtered by item attribute (requires --item-attributes=FILE)
 --repeat-evaluation            items accessed by a user before may be in the recommendations (and are not ignored in the evaluation)
 --compute-fit                  display fit on training data

finding the right number of iterations (iterative methods):
 --find-iter=N                  give out statistics every N iterations
 --max-iter=N                   perform at most N iterations
 --measure=MEASURE              the evaluation measure to use for the abort conditions below (default is AUC)
 --epsilon=NUM                  abort iterations if MEASURE is less than best result plus NUM
 --cutoff=NUM                   abort if MEASURE is below NUM

Instead of RMSE and MAE, the evaluation measures are now prec@N (precision at N), AUC (area under the ROC curve), MAP (mean average precision), and NDCG (normalized discounted cumulative gain). Let us start again with some baseline methods, Random and MostPopular:

item_recommendation --training-file=u1.base --test-file=u1.test --recommender=Random

random training_time 00:00:00.0001040 AUC 0.4992 prec@5 0.0279 prec@10 0.0290 MAP 0.0012 NDCG 0.3721 num_users 459 num_items 1650 testing_time 00:00:02.7115540

item_recommendation --training-file=u1.base --test-file=u1.test --recommender=MostPopular

MostPopular training_time 00:00:00.0015710 AUC 0.8543 prec@5 0.322 prec@10 0.3046 MAP 0.0219 NDCG 0.5704 num_users 459 num_items 1650 testing_time 00:00:02.3813790

User-based collaborative filtering leads to output like the following:

item_recommendation --training-file=u1.base --test-file=u1.test --recommender=UserKNN

UserKNN k=80 training_time 00:00:05.6057200 AUC 0.9168 prec@5 0.5251 prec@10 0.4678 MAP 0.0648 NDCG 0.6879 num_users 459 num_items 1650 testing_time 00:00:08.8362840

Note that item recommendation evaluation usually takes longer than rating prediction evaluation, because for each user, scores for every candidate item (possibly all items) have to be computed. You can restrict the number of predictions to be made using the options --test-users=FILE and --candidate-items=FILE to save time. The item recommendation program supports the same options for iteratively trained recommenders like BPRMF and WRMF, for example --find-iter=N.

A.3 Library Structure

MyMediaLite's library source code is structured into several namespaces:

• MyMediaLite: generic recommender definitions like the IRecommender interface (see below)
• MyMediaLite.Correlation: correlations and similarity measures, used by the kNN recommenders
• MyMediaLite.Data: data structures for storing interaction and attribute data
• MyMediaLite.DataType: basic data types like vectors and matrices
• MyMediaLite.Diversification: methods for diversifying recommendation lists
• MyMediaLite.Ensemble: ensemble methods for combining the output of several recommenders
• MyMediaLite.Eval: evaluation code
• MyMediaLite.GroupRecommendation: recommenders for making recommendations to groups
• MyMediaLite.HyperParameter: hyperparameter search methods
• MyMediaLite.IO: input/output procedures
• MyMediaLite.ItemRecommendation: item recommendation from positive-only feedback
• MyMediaLite.RatingPrediction: rating predictors
• MyMediaLite.Taxonomy: data types to represent taxonomic information about entities, for example which entities exist
• MyMediaLite.Util: miscellaneous utility code

The main library is contained in src/MyMediaLite. Some experimental parts are in src/MyMediaLiteExperimental. Unit tests are in src/Tests. In the following, we describe the most important interfaces and classes of MyMediaLite. A complete description of MyMediaLite's API can be found on the homepage and in the documentation package.

A.3.1 Conventions

There are several conventions followed in the MyMediaLite library.
Users and items are referred to by int (32-bit integer) values called user_id (or, shorter, just u) and item_id (or i). Usually, the user ID comes immediately before the item ID in a method call. For example, IRecommender's Predict() method has the signature float Predict(int user_id, int item_id).

Interface names start, as is usual in C#, with an upper-case I. Often there are standard classes implementing such an interface, which have the same name except for the leading I. For example, there is the (non-abstract) standard implementation Ratings for the IRatings interface, and the abstract RatingPredictor class, which contains code shared by many non-abstract recommenders that inherit from it and thus implement the IRatingPredictor interface.

A.3.2 Interfaces

In this section, we describe some general interfaces. More specialized interfaces are described in the following sections.

Recommenders

IRecommender is the most general recommender interface. Its method float Predict(int user_id, int item_id) returns a score for a given user-item combination; the higher the score, the more the recommender expects the user to like the given item. bool CanPredict(int user_id, int item_id) can be used to check whether the recommender is able to provide a meaningful score for the given user-item combination. The void Train() method performs the recommender training. void SaveModel(string filename) stores the resulting model to a file, and void LoadModel(string filename) can be used to restore a trained model from a file. Finally, string ToString() returns a string representation of the recommender, containing the class name and the names and values of all hyperparameters.

Iterative Models

IIterativeModel is an interface for recommenders which learn by performing several passes over the training examples. The interface has a property NumIter for the number of passes, and the methods void Iterate() and float ComputeObjective(). Iterate() performs one learning iteration over the training examples, and ComputeObjective() returns the current value of the training objective.

Similarity Providers

Some recommenders have a notion of similarity between users or items. To make use of such similarities, the IUserSimilarityProvider and IItemSimilarityProvider interfaces provide two methods each: one for getting the similarity of two given entities, float GetUserSimilarity(int user_id1, int user_id2) and float GetItemSimilarity(int item_id1, int item_id2), and one for getting the entities that are most similar to a given entity, IList<int> GetMostSimilarUsers(int user_id, uint n = 10) and IList<int> GetMostSimilarItems(int item_id, uint n = 10).
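Condensed into code, the interfaces just described look roughly as follows. This is a sketch derived from the descriptions above (member order, and the exact type of NumIter, are illustrative); the authoritative definitions are in the API documentation:

// Rough shape of the core interfaces described in this section.
public interface IRecommender
{
	float Predict(int user_id, int item_id);    // score for a user-item combination
	bool CanPredict(int user_id, int item_id);  // is a meaningful score possible?
	void Train();                               // learn the model
	void SaveModel(string filename);            // persist a trained model
	void LoadModel(string filename);            // restore a trained model
}

public interface IIterativeModel
{
	uint NumIter { get; set; }  // number of passes over the training data
	void Iterate();             // perform one learning pass
	float ComputeObjective();   // current value of the training objective
}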
A.4 Data Structures

A.4.1 Basic Data Types

Vectors and Matrices

One kind of basic data type used in MyMediaLite are vectors and matrices. Vectors are just instances of the .NET generic type IList<T>, where T is usually float or double, and have no specific interface in MyMediaLite. Matrices are represented by the interface IMatrix<T> and its dense standard implementation Matrix<T>. There are also more specific implementations for sparse matrices, (skew-)symmetric matrices, and combinations thereof. A particular case are Boolean matrices. They are represented by the interface IBooleanMatrix, a specialization of IMatrix<bool>, which again has several specialized implementations. Methods for reading and writing vectors and matrices can be found in IO.MatrixExtensions and the related classes of the IO namespace.

List Proxies and Combined Lists

When dealing with large datasets, we do not want to unnecessarily replicate data in memory. Ideally, each dataset is loaded into memory once. All derived datasets, for example the k different splits in k-fold cross-validation, are represented by references to the original dataset. We want the same for derived datasets that are combinations of several original datasets. To support the implementation of such scenarios, there is ListProxy<T>, whose constructor takes an IList<T> and a list of indexes. The created object is an IList<T> whose i-th element is the element in the original list at the position specified by the i-th index. In a similar manner, the constructor of CombinedList<T> concatenates two lists.

Pairs

Pair<T, U> is a tuple data type that can be used, for example, for representing user-item combinations, or combinations of item IDs and rating values if the user ID is known.
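A short usage sketch of these types; the constructor signatures follow the description above, and the example data and variable names are made up:

using System.Collections.Generic;
using MyMediaLite.Data;

// Two index-disjoint "views" on one rating list, e.g. for a train/test
// split, without copying the underlying data.
var values = new List<float> { 5f, 3f, 4f, 1f, 2f };
var train_view = new ListProxy<float>(values, new int[] { 0, 2, 4 });
var test_view  = new ListProxy<float>(values, new int[] { 1, 3 });
// train_view[1] == 4f: the element at position 2 of the original list.
// A CombinedList<float>(train_view, test_view) would again behave like
// a single IList<float> covering both views.

Because the proxies only store indexes, memory usage grows with the number of references, not with the size of the referenced data.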
A.4.2 Entity Mappings

MyMediaLite represents user, item, and attribute IDs internally as integers, starting from zero. In imported data, IDs are often arbitrary integers with large gaps in between, or even arbitrary strings. The interface IEntityMapping serves the purpose of mapping external string IDs to internal integer IDs. EntityMapping is the standard implementation, while IdentityMapping merely transforms the string to the integer it represents, without consuming any memory. This can save memory when handling large datasets, and can be useful for debugging purposes. Giving the --no-id-mapping option to the command-line tools (see section A.2) enables IdentityMapping.

A.4.3 Rating Data

Rating data is represented by the IRatings interface, which inherits from IDataSet. It is usually read from text files or a database. Rating data files have at least three columns: the user ID, the item ID, and the rating value. Valid user and item IDs are strings, and the rating value is a single-precision (32-bit) floating point number. Date/time information or numerical timestamps will be used if needed, and ignored otherwise.

Reading: Reading in rating data is implemented in IO.RatingData, IO.StaticRatingData, and IO.MovieLensRatingData. RatingData.Read() returns a data structure that can be updated/modified, while the result of StaticRatingData.Read() is a read-only structure which cannot be updated. If you also want to read in time information, use the static class IO.TimedRatingData. MyMediaLite also supports the KDD Cup 2011 file format; the corresponding classes are in the IO.KDDCup2011 namespace.

Writing: RatingPrediction.Extensions contains an extension method WritePredictions() that lets you write the predictions to a target, either a file or a TextWriter, for example:

recommender.WritePredictions(ratings, user_mapping, item_mapping, target);

This will use tabs as column separators. If you want other separators, provide a line format string to the method:

recommender.WritePredictions(ratings, user_mapping, item_mapping, target, "{0}|{1}|{2}");

Examples

Tab-separated columns (.tsv):

5951	50	5
5951	223	5
5951	260	5
5951	293	5
5951	356	4
5951	364	3
5951	457	3

Space-separated columns with string IDs and non-integer ratings:

u5951 i50 5.0
u5951 i223 5.0
u5951 i260 5.0
u5951 i293 4.5
u5951 i356 4.0
u5951 i364 3.5
u5951 i457 3.0

Comma-separated columns (.csv) with timestamps:

5951,50,5,978300760
5951,223,5,978302109
5951,260,5,978301968
5951,293,5,978300275
5951,356,4,978824291
5951,364,3,978302268
5951,457,3,978300719

Rating data with dates:

5951	50	5	2005-12-04
5951	223	5	2005-12-04
5951	260	5	2005-12-04
5951	293	5	2005-11-27
Reading: Writing the le format described is trivial, thus there is no code in MyMediaLite for that particular task. Writing: Example: tab-separated columns (.tsv): 5951 5951 5951 5951 5951 5951 5951 50 223 260 293 356 364 457 A.4.5 Attributes and Relations Recommenders that use user or item attributes or user or item relations implement at least one of the interfaces IUserAttributeAwareRecommender, IItemAttributeAwareRecommender, IUserRelationAwareRecommender, or IItemRelationAwareRecommender. Both the item recommendation tool and the rating prediction tool support attribute and relations les via the options --user-attributes=FILE, --item-attributes=FILE, --user-relations=FILE, and --item-relations=FILE. An attribute is an arbitrary property that a user or item can have. Relation les, in contrast, describe binary relations over one type of entity, for example over users or over items. An example for a user relation is the edges in the social network. An example for an item relation would be the relation "A is a sequel to B" for movies. Please note that relations are not automatically A.5. RECOMMENDERS 113 symmetric. This means that "1,2" does not imply "2,1". If you want to have a symmetric relation, make sure that both lines are contained in the le. Binary attribute les have exactly two columns: the entity (user or item) ID and the attribute ID. Relation les also have exactly two columns. Each column contains an entity ID. One line in a le means there exists a relation between the rst and the second entity mentioned in that line. The classes for reading in attribute and relation les are IO.AttributeData and IO.RelationData, respectively. Reading: Writing these le formats is trivial, so there is no specic class for that in MyMediaLite. Writing: Examples Again, the column separators are the same as the ones for the rating data: spaces, tabs, or commas. Attribute le with tab-separated columns (.tsv): 51 51 51 51 51 51 51 5 22 26 29 35 36 45 This means that the entity 51 (may be a user or item) has the attributes 5, 22, 26, . . . If this example is complete, it also means entity 51 does not have the attributes 0, 1, 2, 3, 4, 6, . . . Relation le with comma-separated columns (.csv): 51,5 51,22 51,26 51,29 51,35 51,36 51,45 This means that the entity 51 (may be a user or item) is in relation with the entities 5, 22, 26, . . . of the same type If this example is complete, it also means entity 51 does not have a relation with entities 0, 1, 2, 3, 4, 6, . . . A.5 Recommenders We distinguish three dierent recommendation tasks, and thus have three different kinds of recommenders: Rating predictors, which predict explicit ratings, 114 APPENDIX A. MYMEDIALITE REFERENCE item recommenders, which predict scores based on positive-only feedback, and group recommenders, which combine the predictions for individual users to get good group decisions. Additionally, there are ensembles, which combine the output of several recommenders for the same user-item combination in order to achieve more accurate predictions. A.5.1 Rating Prediction Rating predictors implement the interface IRatingPredictor. More specialized interfaces are ITimeAwareRatingPredictor, for recommenders that can use time information for training and prediction, IIncrementalRatingPredictor, for predictors that can learn incrementally as new feedback comes in, and IFoldInRatingPredictor, for recommenders that can make predictions for anonymous users based on their interactions, without adding their data to the model. 
Table A.1 lists all rating predictors in MyMediaLite. The column Class contains the class name; inc specifies whether the recommender is capable of incremental updates; ur means it uses user relation data, ua user attributes, ia item attributes, and t time data. Literature contains references to publications relevant to the recommendation method. The hyperparameters of the different recommenders, together with their default values and a brief explanation, are shown in Table A.2.

Class                              inc  ur  ua  ia  t   Literature
GlobalAverage                      X
UserAverage                        X
ItemAverage                        X
Random                             X
Constant                           X
UserItemBaseline                   X                    Koren [2008]
TimeAwareBaseline                                   X   Koren [2009]
TimeAwareBaselineWithFrequencies                    X   Koren [2009]
CoClustering                       X                    George and Merugu [2005]
SlopeOne                           X                    Lemire and Maclachlan [2005]
BiPolarSlopeOne                    X                    Lemire and Maclachlan [2005]
UserAttributeKNN                   X       X            Koren [2008]
UserKNNCosine                      X                    Koren [2008]
UserKNNPearson                     X                    Koren [2008]
ItemAttributeKNN                   X           X        Koren [2008]
ItemKNNCosine                      X                    Koren [2008]
ItemKNNPearson                     X                    Koren [2008]
MatrixFactorization                X                    Rendle and Schmidt-Thieme [2008]
FactorWiseMatrixFactorization                           Bell et al. [2007]
BiasedMatrixFactorization          X                    Rendle and Schmidt-Thieme [2008], Menon and Elkan [2010], Gemulla et al. [2011]
SVDPlusPlus                        X                    Koren [2008]
SocialMF                           X   X                Jamali and Ester [2010]

Table A.1: Rating predictors in MyMediaLite 2.99. Some of the methods are described in detail in section 2.3.
Class                     Hyperparameter                      Default    Description
Constant                  constant_rating                     1          rating value
UserItemBaseline          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
                          num_iter                            10         number of iterations
TimeAwareBaseline         num_iter                            30         number of iterations
                          bin_size                            70         bin size in days
                          beta                                0.4
                          user_bias_learn_rate                0.003      user bias step size
                          item_bias_learn_rate                0.002      item bias step size
                          alpha_learn_rate                    1E-05      learn rate for the alpha parameters
                          item_bias_by_time_bin_learn_rate    5E-06
                          user_bias_by_day_learn_rate         0.0025
                          user_scaling_learn_rate             0.008
                          user_scaling_by_day_learn_rate      0.002
                          reg_u                               0.03
                          reg_i                               0.03
                          reg_alpha                           50
                          reg_item_bias_by_time_bin           0.1
                          reg_user_bias_by_day                0.005
                          reg_user_scaling                    0.01
                          reg_user_scaling_by_day             0.005
TimeAwareBaseline-        num_iter                            40         number of iterations
WithFrequencies           bin_size                            70         bin size in days
                          beta                                0.4
                          user_bias_learn_rate                0.00267
                          item_bias_learn_rate                0.000488
                          alpha_learn_rate                    3.11E-06
                          item_bias_by_time_bin_learn_rate    1.15E-06
                          user_bias_by_day_learn_rate         0.000257
                          user_scaling_learn_rate             0.00564
                          user_scaling_by_day_learn_rate      0.00103
                          reg_u                               0.0255
                          reg_i                               0.0255
                          reg_alpha                           3.95
                          reg_item_bias_by_time_bin           0.0929
                          reg_user_bias_by_day                0.00231
                          reg_user_scaling                    0.0476
                          reg_user_scaling_by_day             0.019
                          frequency_log_base                  6.76
                          item_bias_at_frequency_learn_rate   0.00236
                          reg_item_bias_at_frequency          1.1E-08
CoClustering              num_user_clusters                   3          number of user clusters
                          num_item_clusters                   3          number of item clusters
                          num_iter                            30         number of iterations
UserAttributeKNN          k                                   inf        number of neighbors
                          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
UserKNNCosine             k                                   inf        number of neighbors
                          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
UserKNNPearson            k                                   inf        number of neighbors
                          shrinkage                           10         shrinkage factor for similarities
                          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
ItemAttributeKNN          k                                   inf        number of neighbors
                          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
ItemKNNCosine             k                                   inf        number of neighbors
                          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
ItemKNNPearson            k                                   inf        number of neighbors
                          shrinkage                           10         shrinkage factor for similarities
                          reg_u                               10         user bias regularization
                          reg_i                               5          item bias regularization
MatrixFactorization       num_factors                         10         number of factors per user and item
                          regularization                      0.015      regularization constant
                          learn_rate                          0.01       step size for learning
                          num_iter                            30         number of iterations
                          init_mean                           0          mean of the normal distribution used to initialize the factors
                          init_stdev                          0.1        standard deviation of the normal distribution used to initialize the factors
FactorWiseMatrix-         num_factors                         10         number of factors per user and item
Factorization             shrinkage                           25         regularization constant
                          sensibility                         1E-05      convergence sensibility
                          init_mean                           0          mean of the normal distribution used to initialize the factors
                          init_stdev                          0.1        standard deviation of the normal distribution used to initialize the factors
                          num_iter                            10         number of iterations
BiasedMatrix-             num_factors                         10         number of factors per user and item
Factorization             bias_reg                            0.01       bias regularization modifier
                          reg_u                               0.015      user factor regularization
                          reg_i                               0.015      item factor regularization
                          learn_rate                          0.01       step size for learning
                          bias_learn_rate                     1          step size modifier for the biases
                          num_iter                            30         number of iterations
                          bold_driver                         False      learning rate adaptation heuristic
                          init_mean                           0          mean of the normal distribution used to initialize the factors
                          init_stdev                          0.1        standard deviation of the normal distribution used to initialize the factors
                          loss                                RMSE       the loss to optimize for: RMSE, MAE, or LogisticLoss
                          max_threads                         100        maximum number of parallel threads
SocialMF                  num_factors                         10         number of factors per user and item
                          regularization                      0.015      regularization constant
                          social_regularization               1          strength of the social regularization
                          learn_rate                          0.01       step size for learning
                          num_iter                            30         number of iterations
                          init_mean                           0          mean of the normal distribution used to initialize the factors
                          init_stdev                          0.1        standard deviation of the normal distribution used to initialize the factors

Table A.2: Rating predictor hyperparameters in MyMediaLite 2.99.

A.5.2 Item Recommendation

Item recommenders implement the interface IItemRecommender.
IIncrementalItemRecommender is the interface for methods that can learn incrementally as new feedback comes in. Table A.3 lists all item recommenders in MyMediaLite, and their hyperparameters are shown in Table A.4.

Class                inc  ua  ia  Literature
Zero
Random
MostPopular          X
UserAttributeKNN          X      Desrosiers and Karypis [2011]
UserKNN              X           Desrosiers and Karypis [2011]
WeightedUserKNN      X           Desrosiers and Karypis [2011]
ItemAttributeKNN              X  Desrosiers and Karypis [2011]
ItemKNN              X           Desrosiers and Karypis [2011]
WeightedItemKNN      X           Desrosiers and Karypis [2011]
ItemAttributeSVM              X  Hsu et al. [2003]
BPRLinear                     X  Gantner et al. [2010a]
WRMF                             Hu et al. [2008], Pan et al. [2008]
BPRMF                            Gantner et al. [2010a]
SoftMarginRankingMF              Rendle [2010a]
WeightedBPRMF                    Gantner et al. [2011a]

Table A.3: Item recommenders in MyMediaLite 2.99.

Class                Hyperparameter                Default   Description
UserAttributeKNN     k                             80        number of neighbors
UserKNN              k                             80        number of neighbors
WeightedUserKNN      k                             80        number of neighbors
ItemAttributeKNN     k                             80        number of neighbors
ItemKNN              k                             80        number of neighbors
WeightedItemKNN      k                             80        number of neighbors
ItemAttributeSVM     c                             1         C hyperparameter for the SVM
                     gamma                         0.002     gamma parameter for the RBF kernel
BPRLinear            reg                           0.015     regularization constant
                     num_iter                      10        number of iterations
                     learn_rate                    0.05      step size for learning
                     fast_sampling_memory_limit    1024      MB to be used for fast sampling data
                     init_mean                     0         mean of the normal distribution used to initialize the factors
                     init_stdev                    0.1       standard deviation of the normal distribution used to initialize the factors
WRMF                 num_factors                   10        number of factors per user and item
                     regularization                0.015     regularization constant
                     c_pos                         1         the weight put on positive observations
                     num_iter                      15        number of iterations
                     init_mean                     0         mean of the normal distribution used to initialize the factors
                     init_stdev                    0.1       standard deviation of the normal distribution used to initialize the factors
BPRMF                num_factors                   10        number of factors per user and item
                     bias_reg                      0         item bias regularization
                     reg_u                         0.0025    user factor regularization
                     reg_i                         0.0025    positive item factor regularization
                     reg_j                         0.00025   negative item factor regularization
                     num_iter                      30        number of iterations
                     learn_rate                    0.05      step size for learning
                     uniform_user_sampling         True      sample users uniformly
                     with_replacement              False     sample examples with replacement
                     bold_driver                   False     learning rate adaptation heuristic
                     fast_sampling_memory_limit    1024      MB to be used for fast sampling data
                     update_j                      True      perform updates on negative item factors
                     init_mean                     0         mean of the normal distribution used to initialize the factors
                     init_stdev                    0.1       standard deviation of the normal distribution used to initialize the factors
SoftMarginRankingMF  num_factors                   10        number of factors per user and item
                     bias_reg                      0         item bias regularization
                     reg_u                         0.0025    user factor regularization
                     reg_i                         0.0025    positive item factor regularization
                     reg_j                         0.00025   negative item factor regularization
                     num_iter                      30        number of iterations
                     learn_rate                    0.05      step size for learning
                     bold_driver                   False     learning rate adaptation heuristic
                     fast_sampling_memory_limit    1024      MB to be used for fast sampling data
                     init_mean                     0         mean of the normal distribution used to initialize the factors
                     init_stdev                    0.1       standard deviation of the normal distribution used to initialize the factors
WeightedBPRMF        num_factors                   10        number of factors per user and item
                     bias_reg                      0         item bias regularization
                     reg_u                         0.0025    user factor regularization
                     reg_i                         0.0025    positive item factor regularization
                     reg_j                         0.00025   negative item factor regularization
                     num_iter                      30        number of iterations
                     learn_rate                    0.05      step size for learning
                     bold_driver                   False     learning rate adaptation heuristic
                     init_mean                     0         mean of the normal distribution used to initialize the factors
                     init_stdev                    0.1       standard deviation of the normal distribution used to initialize the factors

Table A.4: Item recommender hyperparameters in MyMediaLite 2.99.

Item Recommendation Files

Files containing item recommendations contain the recommended items for one user per line. An entry line contains the user ID, followed by a tab character, followed by the top N recommended items with their scores. Example:

0	[9:3.5,7:3.4,3:3.1]

This means that we recommend the items 9, 7, and 3 to user 0, and that their respective scores are 3.5, 3.4, and 3.1.

Command-line tools: The item recommendation tool supports this file format. Use --prediction-file=FILE to specify the file name and --predict-items-number=N to specify the number of items to recommend to each user.

There is currently no class for reading this kind of file in MyMediaLite. Given the information here, it should be easy to implement, though.
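For illustration, here is a minimal sketch of what such a reader could look like; it is an assumption based on the format description above, not part of the library, and the method and variable names are made up:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Sketch: parse lines like "0\t[9:3.5,7:3.4,3:3.1]" into a dictionary
// mapping each user ID to a list of (item ID, score) tuples.
static Dictionary<int, List<Tuple<int, float>>> ReadItemRecommendations(string filename)
{
	var recommendations = new Dictionary<int, List<Tuple<int, float>>>();
	foreach (string line in File.ReadLines(filename))
	{
		string[] columns = line.Split('\t');      // user ID, then the item list
		int user_id = int.Parse(columns[0]);
		var items = new List<Tuple<int, float>>();
		foreach (string entry in columns[1].Trim('[', ']').Split(','))
		{
			string[] pair = entry.Split(':');     // e.g. "9:3.5"
			items.Add(Tuple.Create(
				int.Parse(pair[0]),
				float.Parse(pair[1], CultureInfo.InvariantCulture)));
		}
		recommendations[user_id] = items;
	}
	return recommendations;
}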
Programming: Writing item recommendations to a stream or file can be performed using the extension methods defined in ItemRecommendation.Extensions. All one needs to do is to import the MyMediaLite.ItemRecommendation namespace with the using statement. Then writing out predictions is simple:

using MyMediaLite.ItemRecommendation;
...
recommender.WritePredictions(training_data, candidate_items, predict_items_number, prediction_file);

If you only want predictions for specific users, provide a list of user IDs to the method:

recommender.WritePredictions(training_data, candidate_items, predict_items_number, prediction_file, user_list);

If you mapped the user and item IDs to internal IDs, supply the mapping data as arguments to the method, so that the internal IDs can be turned into their original counterparts again:

recommender.WritePredictions(training_data, candidate_items, predict_items_number, prediction_file, user_list, user_mapping, item_mapping);

A.5.3 Group Recommendation

Group recommenders implement the interface IGroupRecommender. MyMediaLite's group recommenders are listed in Table A.5. See section 6.2.1 for more information on group recommendation. If you want to use group recommendation methods from the command line, you can use the item recommendation tool's --user-groups=FILE and --group-recommender=METHOD options.

Class
Average
Maximum
Minimum
PairwiseWins
WeightedAverage

Table A.5: Group recommenders in MyMediaLite 2.99.

A.5.4 Ensembles

Ensembles, that is, combinations of several recommenders, inherit from the Ensemble.Ensemble class. Currently, the only inheriting class is WeightedEnsemble, a recommender that scores user-item combinations as a weighted sum of the scores emitted by the individual recommenders.

A.6 Using MyMediaLite Recommenders

This section describes how to use the MyMediaLite recommenders from a programmer's perspective.

A.6.1 General Remarks

Setting up a recommender so that it can make predictions requires three steps:

1. creating the recommender by calling its constructor
2. assigning the training data
3. calling the Train() method

The training data may consist of interaction data, attribute data, and relation data. The property for the interaction data is called Ratings for rating predictors and Feedback for item recommenders. For attribute and relation data, the corresponding interfaces define the properties UserAttributes, ItemAttributes, UserRelation, and ItemRelation. After these steps, the recommender can be queried using the Predict() method.

In the following, we give examples of how to use recommenders from different programming languages. All examples used here can also be found in the examples/ directory of the MyMediaLite source code.
A.6.2 C#

using System;
using MyMediaLite.Data;
using MyMediaLite.Eval;
using MyMediaLite.IO;
using MyMediaLite.RatingPrediction;

public class RatingPrediction
{
	public static void Main(string[] args)
	{
		// load the data
		var training_data = RatingData.Read(args[0]);
		var test_data = RatingData.Read(args[1]);

		// set up the recommender
		var recommender = new UserItemBaseline();
		recommender.Ratings = training_data;
		recommender.Train();

		// measure the accuracy on the test dataset
		var results = recommender.Evaluate(test_data);
		Console.WriteLine("RMSE={0} MAE={1}", results["RMSE"], results["MAE"]);
		Console.WriteLine(results);

		// make a prediction for a certain user and item
		Console.WriteLine(recommender.Predict(1, 1));
	}
}

Cross-validation takes just two lines of code:

var bmf = new BiasedMatrixFactorization { Ratings = training_data };
Console.WriteLine(bmf.DoCrossValidation());

using System;
using MyMediaLite.Data;
using MyMediaLite.Eval;
using MyMediaLite.IO;
using MyMediaLite.ItemRecommendation;

public class ItemPrediction
{
	public static void Main(string[] args)
	{
		// load the data
		var training_data = ItemData.Read(args[0]);
		var test_data = ItemData.Read(args[1]);

		// set up the recommender
		var recommender = new MostPopular();
		recommender.Feedback = training_data;
		recommender.Train();

		// measure the accuracy on the test dataset
		var results = recommender.Evaluate(test_data, training_data);
		foreach (var key in results.Keys)
			Console.WriteLine("{0}={1}", key, results[key]);
		Console.WriteLine(results);

		// make a score prediction for a certain user and item
		Console.WriteLine(recommender.Predict(1, 1));
	}
}

A.6.3 F#

open System
open MyMediaLite.IO
open MyMediaLite.RatingPrediction
open MyMediaLite.Eval

(* load the data *)
let train_data = RatingData.Read "u1.base"
let test_data = RatingData.Read "u1.test"

(* set up the recommender *)
let recommender = new UserItemBaseline(Ratings=train_data)
recommender.Train()

(* measure the accuracy on the test dataset *)
let result = recommender.Evaluate(test_data)
Console.WriteLine(result)

(* make a prediction for a certain user and item *)
let prediction = recommender.Predict(1, 1)
Console.WriteLine(prediction)

open System
open MyMediaLite.IO
open MyMediaLite.ItemRecommendation
open MyMediaLite.Eval

(* load the data *)
let train_data = ItemData.Read "u1.base"
let test_data = ItemData.Read "u1.test"

(* set up the recommender *)
let recommender = new UserKNN(K=20u, Feedback=train_data)
recommender.Train()

(* measure the accuracy on the test dataset *)
let result = recommender.Evaluate(test_data, train_data)
Console.WriteLine(result)

(* make a prediction for a certain user and item *)
let prediction = recommender.Predict(1, 1)
Console.WriteLine(prediction)

A.6.4 Python

A rating prediction example in Python:

#!/usr/bin/env ipy

import clr
clr.AddReference("MyMediaLite.dll")
from MyMediaLite import *

# load the data
train_data = IO.RatingData.Read("u1.base")
test_data = IO.RatingData.Read("u1.test")

# set up the recommender
recommender = RatingPrediction.UserItemBaseline() # don't forget ()
recommender.Ratings = train_data
recommender.Train()

# measure the accuracy on the test dataset
print Eval.Ratings.Evaluate(recommender, test_data)

# make a prediction for a certain user and item
print recommender.Predict(1, 1)

An item recommendation example in Python:

#!/usr/bin/env ipy

import clr
clr.AddReference("MyMediaLite.dll")
from MyMediaLite import *

# load the data
train_data = IO.ItemData.Read("u1.base")
test_data = IO.ItemData.Read("u1.test")

# set up the recommender
recommender = ItemRecommendation.UserKNN() # don't forget ()
recommender.K = 20
recommender.Feedback = train_data
recommender.Train()

# measure the accuracy on the test dataset
print Eval.Items.Evaluate(recommender, test_data, train_data)

# make a prediction for a certain user and item
print recommender.Predict(1, 1)

A.6.5 Ruby

#!/usr/bin/env ir

require 'MyMediaLite'

# load the data
train_data = MyMediaLite::IO::RatingData.Read("u1.base")
test_data = MyMediaLite::IO::RatingData.Read("u1.test")

# set up the recommender
recommender = MyMediaLite::RatingPrediction::UserItemBaseline.new()
recommender.Ratings = train_data
recommender.Train()

# measure the accuracy on the test dataset
eval_results = MyMediaLite::Eval::Ratings::Evaluate(recommender, test_data)
eval_results.each do |entry|
	puts "#{entry}"
end

# make a prediction for a certain user and item
puts recommender.Predict(1, 1)

#!/usr/bin/env ir

require 'MyMediaLite'

using_clr_extensions MyMediaLite

# load the data
train_data = MyMediaLite::IO::ItemData.Read("u1.base")
test_data = MyMediaLite::IO::ItemData.Read("u1.test")

# set up the recommender
recommender = MyMediaLite::ItemRecommendation::MostPopular.new()
recommender.Feedback = train_data
recommender.Train()

# measure the accuracy on the test dataset
eval_results = MyMediaLite::Eval::Items.Evaluate(recommender, test_data, train_data)
eval_results.each do |entry|
	puts "#{entry}"
end

# make a prediction for a certain user and item
puts recommender.Predict(1, 1)

A.7 Implementing MyMediaLite Recommenders

All infrastructure for implementing new recommenders is already in place, which means the developer can concentrate on programming the details of the algorithm, without having to worry about storing the interaction data and so on. Furthermore, there is no need to manually integrate a new recommender into the command-line tools, as these tools automatically find all available recommenders using reflection. For implementing the basic functionality of a recommender, the necessary steps are:

• derive the new class from RatingPredictor or ItemRecommender
• define the model data structures
• define the hyperparameters as object properties
• write the Train() method
• write the Predict() method

To get the full functionality, LoadModel(), SaveModel(), and ToString() have to be defined; a minimal example is sketched below. Plenty of examples of implemented recommenders can be found in the MyMediaLite source code. Simple recommenders to start with are SlopeOne, MostPopular, and UserItemBaseline.
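As an illustration of these steps, here is a deliberately trivial rating predictor that always predicts the global rating average. It is a sketch following the conventions described in this appendix, not a class from the library; which base-class members are abstract or virtual, and the exact type returned by the Ratings indexer, are assumptions made for the example:

using MyMediaLite.RatingPrediction;

// Minimal custom recommender: predicts the average of all training ratings.
// Deriving from RatingPredictor provides the Ratings property and the
// IRatingPredictor plumbing; we only fill in the model and the two methods.
public class MyGlobalAverage : RatingPredictor
{
	// the "model": a single number estimated in Train()
	float global_average;

	public override void Train()
	{
		double sum = 0;
		for (int index = 0; index < Ratings.Count; index++)
			sum += Ratings[index];
		global_average = (float) (sum / Ratings.Count);
	}

	public override float Predict(int user_id, int item_id)
	{
		// the same score for every user-item combination
		return global_average;
	}

	public override string ToString()
	{
		return "MyGlobalAverage";
	}
}

Because the command-line tools discover recommenders via reflection, a class like this would, once compiled into the library, appear in the tools' recommender lists without any further integration work.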
Appendix B

URLs

Software         URL
Gendarme         http://www.mono-project.com/Gendarme
Mono             http://www.mono-project.com
MonoDevelop      http://monodevelop.com
Embedding Mono   http://www.mono-project.com/Embedding_Mono
IKVM             http://www.ikvm.net
git              http://git-scm.com
Doxygen          http://www.doxygen.org
Perl             http://www.perl.org
GATE             http://gate.ac.uk
OpenCV           http://opencv.willowgarage.com
R                http://www.r-project.org
RapidMiner       http://rapid-i.com
scikit-learn     http://scikit-learn.org
Shogun           http://www.shogun-toolbox.org
Weka             http://www.cs.waikato.ac.nz/ml/weka/

Table B.1: Development tools and other software mentioned in this thesis.

Website                          URL
AppRecommender                   https://github.com/tassia/AppRecommender
Bibsonomy                        http://www.bibsonomy.org
Donation Dashboard               http://dd.berkeley.edu
Jester                           http://eigentaste.berkeley.edu
MovieLens                        http://movielens.umn.edu
Internet Movie Database (IMDb)   http://www.imdb.com/interfaces
KDD Cup 2010                     https://pslcdatashop.web.cmu.edu/KDDCup
KDD Cup 2011                     http://kddcup.yahoo.com
Netflix Prize                    http://www.netflixprize.com
MovieLens datasets               http://www.grouplens.org/node/73
Yahoo! Music Ratings             http://webscope.sandbox.yahoo.com
RecLab Prize on Overstock.com    http://overstockreclabprize.com
RecSysWiki                       http://recsyswiki.com

Table B.2: Academic/experimental systems with recommender functionality and other recommender system resources mentioned in this thesis.

Website              URL
Amazon               http://amazon.com
Asos                 http://www.asos.de
Delicious            http://delicious.com
Facebook             http://www.facebook.com
Findory              http://www.findory.com
GoodReads            http://www.goodreads.com
GMail                http://mail.google.com
Google               http://www.google.com
Google Maps          http://maps.google.com
Google News          http://news.google.com
last.fm              http://last.fm
LibraryThing         http://www.librarything.com
LinkedIn             http://www.linkedin.com
MoviePilot           http://moviepilot.com
MoviePilot Germany   http://moviepilot.de
Net-A-Porter         http://www.net-a-porter.com
Netflix              http://netflix.com
Nokia Maps           http://maps.nokia.com
Otto                 http://www.otto.de
Pandora              http://pandora.com
Songkick             http://songkick.com
TiVo                 http://tivo.com
Twitter              http://twitter.com
Xing                 http://xing.com
Yahoo!               http://www.yahoo.com
Yahoo! Answers       http://answers.yahoo.com
Yahoo! News          http://news.yahoo.com
Zalando              http://www.zalando.de

Table B.3: Websites mentioned in this thesis.
Software                       URL
MyMediaLite                    http://ismll.de/mymedialite
. . . for Java                 https://github.com/zenogantner/MyMediaLiteJava
GitHub repository              https://github.com/zenogantner/MyMediaLite
Gitorious repository           http://gitorious.org/mymedialite
Evaluation using MyMediaLite   http://ir.ii.uam.es/evaluation/rs/
MyMediaLite for RapidMiner     http://elico.rapid-i.com/recommender-extension.html
Duine                          http://duineframework.org
GraphLab                       http://graphlab.org
LensKit                        http://lenskit.grouplens.org
Mahout                         http://mahout.apache.org
RecLab                         http://code.richrelevance.com
recommenderlab                 http://cran.r-project.org/web/packages/recommenderlab
Waffles                        http://waffles.sourceforge.net
jCOLIBRI                       http://gaia.fdi.ucm.es/research/colibri/jcolibri
MyCBR                          http://mycbr-project.net
PyRSVD                         http://code.google.com/p/pyrsvd/
CoRank                         http://cofirank.org/
SVDFeature                     http://apex.sjtu.edu.cn/apex_wiki/svdfeature
Jellyfish                      http://research.cs.wisc.edu/hazy/victor/download/
Likelike                       http://code.google.com/p/likelike/
OpenSlopeOne                   http://code.google.com/p/openslopeone/
Vogoo                          http://sourceforge.net/projects/vogoo/
Vowpal Wabbit                  https://github.com/JohnLangford/vowpal_wabbit/
Wooflix                        http://code.gustavonarea.net/wooflix.tar.gz
Recommendable                  https://github.com/davidcelis/recommendable
ActsAsRecommendable            https://github.com/maccman/acts_as_recommendable
SUGGEST                        http://glaros.dtc.umn.edu/gkhome/suggest/overview
LibFM                          http://cms.uni-konstanz.de/informatik/rendle/software/libfm/
PMF/PTF                        http://www.mit.edu/~rsalakhu/BPMF.html
BPTF                           http://www.cs.cmu.edu/~lxiong/bptf/bptf.html
MultiLens                      http://knuth.luther.edu/~bmiller/dynahome.php?page=multilens

Table B.4: Websites of recommender system software mentioned in this thesis.