Supervised Machine Learning Methods
for Item Recommendation
A Thesis submitted for the degree of
Doctor of Natural Science (Dr. rer. nat.)
by
Zeno Gantner
Department of Computer Science
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim
February 2012
Preface
Recommender systems are personalized information systems that learn individual preferences from interacting with users. Recommender systems use machine learning techniques to compute suggestions for the users. Supervised machine learning relies on optimizing for a suitable objective function. Suitability means here that the function actually reflects what users and operators consider to be good system performance.
Most of the academic literature on recommendation is about rating prediction. For two reasons, this is not the most practically relevant prediction task in the area of recommender systems: First, the important question is not how much a user will say they like a given item (the rating), but rather which items a user will like. Second, obtaining explicit preference information like ratings requires additional actions from the side of the user, which always come at a cost. Implicit feedback in the form of purchases, viewing times, clicks, etc., on the other hand, is abundant anyway. Very often, this implicit feedback is only present in the form of positive expressions of preference. In this work, we primarily consider item recommendation from positive-only feedback.
A particular problem is the suggestion of new items, i.e. items that have no
interaction data associated with them yet. This is an example of a cold-start scenario in recommender systems. Collaborative models like matrix factorization
rely on interaction data to make predictions. We augment a matrix factorization model for item recommendation with a mechanism to estimate the latent
factors of new items from their attributes (e.g. descriptive keywords). In particular, we demonstrate that optimizing the latent factor estimation with regard
to the overall loss of the item recommendation task is superior to optimizing it
with regard to the prediction error on the latent factors. The idea of estimating
latent factors from attributes can be extended to other tasks (new users, rating
prediction) and prediction models, yielding a general framework to deal with
cold-start scenarios.
Next, we adapt the Bayesian Personalized Ranking (BPR) framework, which
is state of the art in item recommendation, to a setting where more popular items
are more frequently encountered when making predictions. By generalizing even
more, we get Weighted Bayesian Personalized Ranking, an extension of BPR
that allows importance weights to be placed on specific users and items.
All method contributions are supported by experiments using large-scale
real-life datasets from various application areas like movie recommendation and
music recommendation.
Finally, this thesis presents an efficient and scalable free software package, MyMediaLite, that implements, among other things, all the presented methods (plus related work) and evaluation protocols. Besides offering the existing models and evaluation protocols as a library and via command-line tools and web services, MyMediaLite allows the easy development of new models and learning methods.
Acknowledgements
This thesis would not have been possible without the support and influence of many people. First and foremost, I thank my advisor, Lars Schmidt-Thieme, for providing a stimulating environment to pursue the research presented here, and for his support throughout these years. I would also like to thank all members of the Information Systems and Machine Learning Lab (ISMLL) at the University of Hildesheim, in particular my friends and collaborators Steffen Rendle, Christoph Freudenthaler, Lucas Drumond, Tomáš Horváth, Leandro Balby Marinho, Christine Preisach, and Karen Tso-Sutter. Carsten Witzke, Franziska Leithold, and Christina Lichtenthaler in their roles as student assistants relieved me of several time-consuming duties and thus helped me to focus on the crucial aspects of my research. All contributors to MyMediaLite, which is an important part of the thesis, helped to shape it towards what it is today. Ruth Janning, Christoph Freudenthaler, Tomáš Horváth and Thorsten Zitterell provided useful feedback on preliminary drafts of this thesis. Wesley Dopkins proof-read the thesis and made sure it contains mostly proper English. Finally, I would like to thank Anna for her understanding, encouragement, and support in the last years.
To my family.
Contents

1 Introduction
  1.1 Motivation
  1.2 Overview
  1.3 Contributions
  1.4 Publications
  1.5 Preliminaries

2 Recommender Systems: Tasks and Methods
  2.1 Supervised Machine Learning
  2.2 Tasks
    2.2.1 Rating Prediction
    2.2.2 Item Recommendation
    2.2.3 Further Tasks and Variants
  2.3 Methods
    2.3.1 Baselines
    2.3.2 Neighborhood-Based Models
    2.3.3 Attribute-Based Methods
    2.3.4 Hybrid Methods and Ensembles
    2.3.5 Stochastic Gradient Learning
    2.3.6 Matrix Factorization
    2.3.7 Bayesian Personalized Ranking
    2.3.8 Context-Aware Recommendation
  2.4 Evaluation Criteria
    2.4.1 Predictive Accuracy
    2.4.2 Runtime Performance
  2.5 Datasets
    2.5.1 MovieLens
    2.5.2 Netflix
    2.5.3 KDD Cup 2011
    2.5.4 Yahoo! Music Ratings

3 Cold-Start Recommendation
  3.1 Problem Statement
  3.2 Attribute-to-Feature Mappings
    3.2.1 General Framework
    3.2.2 Item Mappings
  3.3 Experiments
    3.3.1 Datasets
    3.3.2 Compared Methods
    3.3.3 Experiment 1: Method Comparison
    3.3.4 Experiment 2: Large Attribute Sets
    3.3.5 Run-Time Comparison
    3.3.6 Reproducibility
    3.3.7 Discussion
  3.4 Related Work
  3.5 Summary and Outlook

4 Bayesian Personalized Ranking Revisited
  4.1 Relation to Other Approaches
  4.2 Weighted BPR
    4.2.1 Generic Weighted Bayesian Personalized Ranking
    4.2.2 Example 1: Non-Uniform Negative Item Weighting
    4.2.3 Example 2: Uniform User Weights
  4.3 Sampling Strategies
    4.3.1 Sampling According to Entity Weights
    4.3.2 Matrix Factorization Optimized for WBPR
  4.4 Summary and Outlook

5 Recommending Songs
  5.1 Problem Statement
    5.1.1 Evaluation Criterion
  5.2 Methods
    5.2.1 Optimizing for the Competition Objective
    5.2.2 Matrix Factorization Optimized for Weighted BPR
    5.2.3 Ensembles
    5.2.4 Incorporating Rating Information
    5.2.5 Contrasts
  5.3 Experiments
    5.3.1 Datasets
    5.3.2 Rating Prediction
    5.3.3 Track 2 Results
    5.3.4 Final Submission
  5.4 Related Work
  5.5 Summary and Outlook

6 The MyMediaLite Library
  6.1 Motivation: Free Software for Research
  6.2 Feature Overview
    6.2.1 Recommendation Tasks
    6.2.2 Command-Line Tools
    6.2.3 Data Sources
    6.2.4 Evaluation
    6.2.5 Incremental Updates
    6.2.6 Parallel Processing
    6.2.7 Serialization
    6.2.8 Documentation
    6.2.9 Diversification and Ensembles
  6.3 Development Practices
  6.4 Existing Software
    6.4.1 Recommender System Libraries
    6.4.2 Implementations of Single Methods
    6.4.3 Non-Free Publicly Available Software
  6.5 System Comparison
  6.6 Experiments
    6.6.1 General Performance
    6.6.2 Parallel Stochastic Gradient Descent
  6.7 Impact
  6.8 Summary and Outlook

7 Conclusion
  7.1 Summary
  7.2 Future Directions

A MyMediaLite Reference
  A.1 Installation
    A.1.1 Prerequisites
    A.1.2 Packages
    A.1.3 Instructions
  A.2 Command-Line Tools
    A.2.1 Rating Prediction
    A.2.2 Item Recommendation
  A.3 Library Structure
    A.3.1 Conventions
    A.3.2 Interfaces
  A.4 Data Structures
    A.4.1 Basic Data Types
    A.4.2 Entity Mappings
    A.4.3 Rating Data
    A.4.4 Positive-Only Feedback
    A.4.5 Attributes and Relations
  A.5 Recommenders
    A.5.1 Rating Prediction
    A.5.2 Item Recommendation
    A.5.3 Group Recommendation
    A.5.4 Ensembles
  A.6 Using MyMediaLite Recommenders
    A.6.1 General Remarks
    A.6.2 C#
    A.6.3 F#
    A.6.4 Python
    A.6.5 Ruby
  A.7 Implementing MyMediaLite Recommenders

B URLs
List of Figures

3.1 Attribute-to-feature mappings
3.2 Cold-start recommendation: prec@5 results
3.3 Cold-start recommendation: prec@10 results
3.4 Cold-start recommendation: AUC results
3.5 High-dimensional attribute sets: prec@5 results
3.6 High-dimensional attribute sets: prec@10 results
3.7 High-dimensional attribute sets: AUC results
3.8 Cold-start recommendation: test time per user
5.1 Task of KDD Cup 2011, track 2
5.2 The "liked" contrast
5.3 The "rated" contrast
6.1 The MyMediaLite movie demo program
6.2 Runtime of BiasedMatrixFactorization
6.3 Runtime of BiasedMatrixFactorization and MultiCoreMatrixFactorization
6.4 Runtime and memory usage of MulticoreMatrixFactorization
List of Tables

1.1 Examples for recommender system applications
2.1 Attribute example: movie genres
2.2 Context-aware item recommendation: different scenarios
2.3 Evaluation datasets
3.1 Cosine similarities between movies
3.2 Item attribute sets
5.1 Characteristics of the validation and competition splits
5.2 Rating prediction accuracy for different matrix factorization models
5.3 Validation set and KDD Cup 2011 leaderboard error percentages for different models
5.4 Candidate components of the score ensemble
5.5 Rating prediction models used in the final KDD Cup submission
6.1 Comparison of free/open source recommender system frameworks
6.2 Memory usage for rating prediction with BiasedMatrixFactorization
A.1 Rating predictors in MyMediaLite
A.2 Rating predictor hyperparameters
A.3 Item recommenders in MyMediaLite
A.4 Item recommender hyperparameters
A.5 Group recommenders in MyMediaLite
B.1 Websites of different software
B.2 Academic/experimental systems with recommender functionality and other recommender system resources
B.3 Websites
B.4 Websites of recommender system software
List of Algorithms

1 Learning with stochastic gradient descent
2 Learning a basic matrix factorization for rating prediction with stochastic gradient descent
3 Learning a matrix factorization with biases and a logistic transformation for rating prediction with stochastic gradient descent
4 LearnBPR: Optimizing BPR using stochastic gradient ascent
5 Learning algorithm for the linear attribute-to-feature mapping
6 LearnWBPR-Neg: Optimizing WBPR with non-uniform weights for the negative items using stochastic gradient ascent
7 LearnWBPR: Optimizing WBPR using stochastic gradient ascent
8 Optimizing a matrix factorization model for WBPR with non-uniform weights for the negative items using stochastic gradient ascent
9 Sampling procedure for the KDD Cup validation split
10 Parallel stochastic gradient descent for matrix factorization
Chapter 1
Introduction
In this chapter, we motivate why research on recommender systems is relevant,
and outline the structure and contributions of this thesis.
1.1 Motivation
Never in history has a greater number of different books, movies, music recordings, news items, software programs, and other media content, as well as physical products, been readily available to Internet users. While this generally may be a good thing, the potential consumer faces a dilemma: there is too much choice. All offered items can never be examined, not to mention consumed, in a lifetime.
Besides search and user interface technologies that support the user in actively looking for items in which they know they are interested, personalization technologies that suggest possible items of interest are one answer to the ever-increasing supply of information and products.
Recommender systems [Goldberg et al., 1992, Resnick and Varian, 1997,
Jannach et al., 2010, Kantor et al., 2011] are information systems that learn
user preferences from past user actions (ratings, votes, ranked lists, mouse clicks,
page views, product purchases, etc.) and suggest items (pages on the web, news
articles, jokes, movies, products of any kind, music albums, individual songs,
etc.) according to those user preferences.
While rating prediction ("How much will a user like/rate a given item?") has gained more attention in the recommender systems literature in the past, the task of item recommendation ("Which items will a user like/buy?") [Deshpande and Karypis, 2004, Hu et al., 2008] is actually more relevant for practical recommender system applications: after all, the task of a recommender system is not to know how a person would rate something on an arbitrary scale, but rather which items a person is interested in and will really like.
Table 1.1 gives some examples of what can be suggested by recommender systems, along with some sample companies, products, or websites providing such systems.¹

¹ The URLs corresponding to those systems can be found in Appendix B.
    Application domain          Example systems or literature references
    --------------------------  ------------------------------------------------------------
    Books                       GoodReads, Amazon [Linden et al., 2003], LibraryThing
    Movies                      Netflix, MovieLens, Moviepilot, Ringo [Shardanand, 1994]
    TV, IPTV                    TiVo [Ali and van Stam, 2004], Bambini et al. [2011],
                                Xin and Steck [2011]
    Video clips                 YouTube [Davidson et al., 2010, Zhou et al., 2010]
    Playlists/songs             Pandora, last.fm, Ringo [Shardanand, 1994]
    General products            Amazon [Linden et al., 2003]
    Fashion products            Net-A-Porter, Zalando, Asos, Amazon, Otto
    E-learning                  KDD Cup 2010, Nguyen et al. [2011]
    Teaching                    Aditya Parameswaran [2011]
    Donations for non-profits   Donation Dashboard [Nathanson et al., 2009]
    Energy-saving measures      Knijnenburg et al. [2011a]
    Friends/contacts            Google Mail, Twitter, Facebook, LinkedIn, YouTube
    Dating partners             Pizzato et al. [2010]
    Questions                   Yahoo! Answers [Dror et al., 2011c]
    Restaurants                 Mui et al. [2001]
    Shops                       Takeuchi and Sugimoto [2006]
    Games/apps/software         AppRecommender
    Web pages                   Delicious, Balabanovic [1997]
    Messages/conversations      Sriram et al. [2010]
    News articles               Google News, Yahoo! News, Findory
    Usenet news                 Tapestry [Goldberg et al., 1992], GroupLens [Resnick
                                et al., 1994], Lang [1995]
    Research papers             Bollacker et al. [2000], und Gordon McCalla [2004],
                                Ekstrand et al. [2010], Wang and Blei [2011]
    Folksonomy tags             Bibsonomy [Jäschke et al., 2007]
    Jokes                       Jester [Goldberg et al., 2001]
    Events                      Songkick, Chen [2005]
    Tourism                     Jannach et al. [2009]
    Points of interest          Nokia Maps, Google Maps

    Table 1.1: Examples for recommender system applications.
As indicated by the questions just mentioned, typical tasks in recommender systems are prediction tasks. For efficiency reasons, the answers to those questions are computed by prediction models, which have to be learned/trained based on past interaction data and possibly additional data like user and item attributes.
This places recommender system methods in the realm of supervised machine learning. This thesis is about supervised machine learning techniques for recommender systems.
1.2 Overview
Besides the introduction and the conclusion, this thesis contains five chapters.
• Chapter 2: Recommender Systems: Tasks and Methods
We define prediction tasks in the field of recommender systems, and discuss existing learning approaches for those tasks, as well as how to measure and compare the effectiveness of different methods.
• Chapter 3: Cold-Start Recommendation
In this chapter, we develop a framework for dealing with cold-start problems in recommender systems, namely how to make accurate recommendations when we do not have sufficient interaction information for using "normal" collaborative filtering methods. The framework can adapt all methods that represent the entities (for example users and items) as vectors of real numbers to cold-start scenarios, in particular latent factor models, which are the state-of-the-art approach for recommender systems. We present a case study in which we employ the framework to enhance an item recommendation method, matrix factorization for Bayesian Personalized Ranking (BPR-MF), with item attributes.
• Chapter 4: Bayesian Personalized Ranking Revisited
We look at different aspects of Bayesian Personalized Ranking: the relationship between ranking and classification, and weighting entities of differing importance. We relate the approach to more general approaches in the machine learning literature, and extend it to weighted BPR (WBPR). We suggest a learning algorithm for the new WBPR optimization criterion based on adapting the sampling probabilities.
• Chapter 5: Recommending Songs
Here we show the effectiveness of the enhancements from the previous chapter in a music recommendation scenario, using the large-scale dataset from the KDD Cup 2011 competition. In particular, we use WBPR to learn scalable and accurate matrix factorization models.
• Chapter 6: The MyMediaLite Library
Building upon the work of others is an important part of every scientific endeavour. We describe MyMediaLite, a fast and scalable, multi-purpose library of recommender system algorithms: the free software package that was used to implement all methods presented in this thesis.

The appendix contains material that may be of interest to the reader, but was left out of the main part of the thesis: a concise reference manual of the MyMediaLite recommender system software, and a list of URLs of websites mentioned in the text.
1.3 Contributions
This thesis contains contributions in the area of item recommendation:
1. A formal definition of item recommendation that is general enough to cover most known prediction tasks in the area of recommender systems, and examples of how to frame specific problems in terms of the formal definition.
2. A generic framework for dealing with cold-start problems in factorization models, leading to an extension for matrix factorization based on Bayesian Personalized Ranking (BPR) that takes item attributes into account.
3. Weighted Bayesian Personalized Ranking (WBPR), a generalization of BPR that allows importance weights to be attached to specific users and items, including a generic and efficient learning algorithm.
4. An ensemble method for recommending songs based on WBPR-MF and rating prediction matrix factorization that was one of the top entries in the KDD Cup 2011.
5. The description of MyMediaLite, a free software package of item recommendation and rating prediction methods, which contains reference implementations of many different recommendation algorithms, and the widest choice of evaluation protocols of all available free software/open source recommender system packages, among plenty of other features useful for researchers and practitioners. The availability of the software also makes all experiments presented here easily reproducible for others.
This work also contains, to the best of our knowledge, the most complete and extensive survey of free software recommender system implementations in the literature so far.
1.4 Publications
Most of the work presented in this thesis has already been published in the form of papers at peer-reviewed international workshops and conferences:
• Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, Lars Schmidt-Thieme (2010): Learning Attribute-to-Feature Mappings for Cold-Start Recommendations, in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia.
The content of this paper is mostly covered in chapter 3, and in the sections on item recommendation in chapter 2.
• Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Lars Schmidt-Thieme (2011): Bayesian Personalized Ranking for Non-Uniformly Sampled Items, KDD Cup Workshop 2011, San Diego, USA.
Theoretical insights first published at the KDD Cup 2011 workshop form the core of chapter 4, whereas the application scenario of KDD Cup 2011
is used for the case study in chapter 5. An improved version of the workshop paper is currently under review for a special issue on the KDD Cup 2011 in the Journal of Machine Learning Research (JMLR).
• Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme (2011): MyMediaLite: A Free Recommender System Library, in Proceedings of the 5th ACM International Conference on Recommender Systems (RecSys 2011), Chicago, USA.
Chapter 6 builds on this publication, while adding a lot of new content.
During the time of my doctoral studies, I co-authored further publications that are not covered in this thesis, although there are of course relations and influences between their contents and my thesis work.
• Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, Chris Newell (2011): Explaining the user experience of recommender systems, to appear in User Modeling and User-Adapted Interaction (UMUAI).
• Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, Lars Schmidt-Thieme (2011): Fast Context-aware Recommendations with Factorization Machines, in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), Beijing, China.
• Zeno Gantner, Steffen Rendle, Lars Schmidt-Thieme (2010): Factorization Models for Context-/Time-Aware Movie Recommendations, in Challenge on Context-aware Movie Recommendation (CAMRa2010), ACM, Barcelona, Spain.
• Zeno Gantner, Christoph Freudenthaler, Steffen Rendle, Lars Schmidt-Thieme (2009): Optimal Ranking for Video Recommendation, in User Centric Media: First International Conference, UCMedia 2009, Revised Selected Papers, Springer.
• Zeno Gantner, Lars Schmidt-Thieme (2009): Automatic Content-based Categorization of Wikipedia Articles, in The People's Web Meets NLP: Collaboratively Constructed Semantic Resources. Workshop at the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL 2009).
• Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme (2009): BPR: Bayesian Personalized Ranking from Implicit Feedback, in Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009).
1.5 Preliminaries
Throughout this thesis, scalar variables are set in the default math font, e.g.
a, b, c, while matrices (upper case) and vectors (lower case) are in bold face, e.g.
A, B, x, y.
We will use the letter p for assignments to {0, 1}, or the probability of an assignment to 1, and s ∈ R for arbitrary scores. p_{u,i} states whether item i was rated (highly) by user u. p̂_{u,i}(Θ), usually simplified to p̂_{u,i}, is the decision (estimation) of a model Θ for the true assignment p_{u,i}. Output scores ŝ_{u,i}(Θ) = ŝ_{u,i} refer to arbitrary numerical predictions of recommendation models Θ, where higher scores refer to higher positions in the ranking. Such estimated rankings can then be used to make decisions p̂_{u,i}.
Chapter 2
Recommender Systems: Tasks and Methods
In this chapter, we give an overview of the field of recommender systems from a supervised machine learning perspective. After introducing machine learning, we discuss the most prominent prediction tasks for recommender systems, in particular a generic yet formal definition of recommendation. We then proceed to describe supervised machine learning methods for accomplishing these tasks, and different ways of measuring the properties of such methods, concentrating on simulated off-line experiments, as well as the datasets we use throughout this thesis to perform such experiments.
2.1 Supervised Machine Learning
Tom Mitchell defines the term machine learning as follows (bold face/italics in the original, [Mitchell, 1997, p. 2]):

    A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Mitchell goes on to give examples of machine learning problems, such as learning to recognize spoken words, learning to drive an autonomous vehicle, learning to classify new astronomical structures, and learning to play backgammon [Mitchell, 1997, p. 3].
The performance of a system that uses machine learning methods is measured with a loss function that penalizes prediction errors [Hastie et al., 2009, p. 18]. Supervised machine learning [Hastie et al., 2009] tasks share the property that they learn models for the prediction/estimation of target variables (dependent variables, labels) from input variables (independent variables, predictor variables). Learning examples contain both types of variables, whereas we do not know the target variables of the instances we need to predict. Depending on the type of target variable, we distinguish several different tasks of supervised learning problems:
• the task of classification is to predict nominal target variables, and
• the task of regression is to predict real-valued target variables.
Beyond classification and regression, there are other supervised learning tasks, for example density estimation [Hastie et al., 2009], ranking [Trotman, 2005], and so on. Supervised learning is in contrast to unsupervised learning, where examples are not labeled. An example of an unsupervised learning task is clustering.
2.2 Tasks
In the introduction we described what recommender systems are. Let us now focus on the underlying learning problems.
In a classical recommender system, there are two types of entities, users (e.g. customers) and items (e.g. movies, books, songs). We use U = {1, . . . , |U|} and I = {1, . . . , |I|} to denote the sets of all user and all item IDs, respectively. For simplicity, we will not differentiate between the integer ID representing an entity and the entity itself.
We have different kinds of information about the entities:
1. Information pertaining to one entity, content information, e.g. user attributes like age, gender, or hobbies, or item attributes like the price of a product, words in the title or description of a movie, or editorial ratings.
2. Information that is linked to a user-item pair, collaborative information, e.g. the rating "4 stars" on a scale from one to five given to a movie by a specific user, the information that a user has purchased an item in an online shop or viewed a video in an IPTV system, or a tag in a collaborative tagging system.
There are several types of collaborative information. One important distinction is between explicit (e.g. ratings, up- and downvotes) and implicit expressions of user preferences (e.g. clicks, purchases). Depending on the type of system, implicit information may be positive-only, i.e. there may be no recorded negative preference observations.
We can represent interaction information by a partial function s : U × I → S. Correspondingly, we represent content information about users by a function a^U : U → A^U and about the items by a function a^I : I → A^I. What S, A^U, and A^I look like exactly will depend on the concrete task we wish to describe.
For convenience, we often represent functions as matrices. Let A^U ∈ R^{|U|×m} be the matrix of user attributes, where a^U_{ul} is 1 if and only if user u has attribute l, i.e. l ∈ a^U(u), and let A^I ∈ R^{|I|×n} be the matrix of item attributes, where a^I_{il} is 1 if and only if l ∈ a^I(i). There are m different user attributes, and n item attributes.
Example 1. Suppose we have the movies "The Usual Suspects", "American Beauty", "The Godfather", and "Road Trip" in our recommender system. Each of those items is assigned to one or several of the genres Crime, Thriller, Comedy, Drama, and Action. If we assign consecutive IDs to the movies and genres, we can create the following item attribute matrix from the contents of Table 2.1:

    ID  Movie               Genres
    1   The Usual Suspects  Crime, Thriller
    2   American Beauty     Comedy, Drama
    3   The Godfather       Crime, Action
    4   Road Trip           Comedy

    Table 2.1: Attribute example: movie genres.

\[
\mathbf{A}^I =
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix},
\]

where the rows refer to the different movies, and the columns refer to the different genres. We will use this data in the following examples.
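For illustration, a minimal Python sketch (not part of the thesis) builds this matrix from the genre assignments of Table 2.1; the variable names are illustrative only:

    # Build the item attribute matrix A^I of Example 1 from Table 2.1.
    GENRES = ["Crime", "Thriller", "Comedy", "Drama", "Action"]
    MOVIE_GENRES = {
        1: ["Crime", "Thriller"],  # The Usual Suspects
        2: ["Comedy", "Drama"],    # American Beauty
        3: ["Crime", "Action"],    # The Godfather
        4: ["Comedy"],             # Road Trip
    }
    A_I = [[1 if g in genres else 0 for g in GENRES]
           for _, genres in sorted(MOVIE_GENRES.items())]
    for row in A_I:
        print(row)  # rows: movies, columns: genres, as in the matrix above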
2.2.1 Rating Prediction
Ratings are a popular kind of explicit feedback. Users assess how much they like a given item (e.g. a movie or a news article) on a predefined scale, e.g. 1 to 5, where 5 could mean that the user likes the item very much, whereas 1 means the user strongly dislikes the item. Rating prediction algorithms estimate unknown ratings from a given set of known ratings and possibly additional data such as user or item attributes. The predicted ratings can then indicate to users how much they will like an item, or the system can suggest items with high predicted ratings.

Definition 1. Given (incomplete) rating information r : U × I → R, where R is the set of rating values, the task of rating prediction is to come up with a function r̂ : U × I → R that estimates how a given user will rate a given item. The quality of a single prediction is measured by a loss function ℓ : R × R → R.

In the rest of this thesis, we will use r_{u,i} := r(u, i) for actual ratings, which may or may not be known to the learner, and r̂_{u,i} := r̂(u, i) for rating estimations by a recommendation method.
In the example mentioned before, we have

\[
R := \{1, 2, 3, 4, 5\}, \tag{2.1}
\]

or

\[
R := [1, 5]. \tag{2.2}
\]
We use r_min and r_max to refer to the lowest and highest possible rating, respectively. In the example we have r_min = 1 and r_max = 5. One can view the second set, [1, 5], as a relaxed variant of {1, 2, 3, 4, 5}. While the observed ratings may be limited to integer values, we usually allow real-valued predictions.
Of course, other rating scales are possible, for example "thumbs up/down" as in Pandora Internet radio, which we could represent by {−1, 1}.
A learning problem for this prediction task would then be to learn a model that can perform this task as accurately as possible, as measured by the loss, for the intended users and items.
Of course, more information, like user and item attributes, can be taken into account for learning. Also, the times and dates of the rating events could be used as additional context information, leading to interaction/rating information of the form r : U × I × D → R and a function r̂ to be learned with the same signature, where D is the set of dates/times.
If the prediction target is interpreted as a real number (equation 2.2), rating prediction can be seen as a regression task in the usual terminology of supervised machine learning. If R contains several levels that can be put in an order indicating the user preference (equation 2.1), it can also be seen as ordinal regression [Weimer et al., 2008, Koren and Sill, 2011, Menon and Elkan, 2010].
2.2.2 Item Recommendation
We first give a generic definition of (item) recommendation, and then use this definition to define more concrete recommendation tasks. In the subsequent section, we will discuss further variants of item recommendation.
A Generic Definition of Recommendation
Definition 2. Given collaborative information s : C × I → S, and (optionally) additional information, for example about users, a^U : U → A^U, or items, a^I : I → A^I, the task of item recommendation is to suggest a set S of items i ∈ I in a given context c ∈ C.
The context is the circumstances, the particular situation, in which we want to suggest items. What the context involves differs from scenario to scenario. If the context involves specific users, we have personalized recommendations. Even though we concentrate on personalization here, recommendations do not necessarily have to be personalized. For example, websites like Amazon display items similar to the one currently viewed by the user. These similar items may be the same for every user. Then we would have C = I. A very simple non-personalized recommendation task would have a singleton context set: C = {1}. In the basic personalized scenario, the set of contexts is exactly the set of users: C = U. We will later see more complex kinds of context.
Definition 3. The quality of a recommended set in a given context c is measured by a loss function ℓ : 2^I × 2^I → R.

Alternatively, we can also define the quality of single recommended items, or of rankings of items:

Definition 4. The quality of a recommended item in a given context c is measured by a loss function ℓ : I × I → R.

Definition 5. The quality of an item ranking in a given context c (here represented as a vector of score assignments to all items) is measured by a loss function ℓ : R^{|I|} × R^{|I|} → R.
The information given corresponds to the experience in Mitchell's definition of machine learning, suggesting an item (or a set of items) is a task, and the loss function is the performance measure.
We believe this is a broad enough definition to capture most, if not all, recommendation tasks currently discussed in the literature. Below, we will express some specific tasks in terms of this definition.
In the vocabulary of supervised machine learning, item recommendation can be viewed either as a classification task (there are "correct" items and "wrong" items; binary classification, though more classes are imaginable as well; definition 4) or as a ranking task (definition 5): can the recommender rank the candidate items such that the order resembles the actual user preferences?
The task can also be viewed as a structured prediction task [Taskar et al., 2005] (instead of a single label like a number, the prediction target is a set), or as a multi-label classification task [McCallum, 1999] (every item is interpreted as a label).
Item Recommendation by Rating Prediction
In rating-based recommender systems, the known interactions are the ratings: S = R. It is common to suggest those items to the user that have the best rating predictions (see definition 1), possibly weighted by the confidence in the prediction. While this approach certainly makes sense, it is not clear whether it leads to optimal recommendations [Marlin and Zemel, 2009, Cremonesi et al., 2010].
Item Recommendation from Implicit Feedback
Rating prediction has been popularized in the research community by systems (and corresponding publicly available datasets) like MovieLens [Herlocker et al., 1999] and Jester [Goldberg et al., 2001], and later by the Netflix Prize [Koren et al., 2009]. Nevertheless, most real-world recommender systems (e.g. in e-commerce) do not rely on ratings, because users are hard to persuade to give explicit feedback, and other kinds of feedback, namely implicit feedback [Oard and Kim, 1998] such as user actions like selecting/buying an item, page views/clicks, listening/watching times, etc., are often recorded by the system anyway. Buying a product or watching a video is also a positive expression of preference. Note that not buying or watching an item from a large collection does not necessarily mean a user dislikes the item. Other kinds of measurements, like watching and listening times or percentages for videos and songs, have no direct interpretation as positive or negative feedback. Item recommendation from implicit feedback is the task of determining the items that a user will perform a certain action on from such past events (and possibly additional data).
Item Recommendation from Positive-Only Feedback
Often, implicit feedback contains only positive signals [Oard and Kim, 1998, Pan et al., 2008, Gunawardana and Meek, 2008, Rendle et al., 2009, Gunawardana and Meek, 2009]: A user has bought a product, which implies they have a positive preference for this product. We do not know, however, which items the user does not like from this kind of feedback. Distinguishing between items a
user likes and items a user does not like from such data is called one-class classification in the language of supervised learning [Moya and Hush, 1996]. Learning models for such a task is less obvious than for binary or multi-class classification.
Note that positive-only feedback is not necessarily implicit. For example, websites like Facebook allow users to give a "thumbs up" ("like") to all kinds of items. This is an explicit statement of preference, but again we only observe positive feedback here. Another example is the "favorite" feature on Twitter: users can mark posts as their favorites. Again, this is an explicit statement.
For convenience, we use a binary matrix S ∈ {0, 1}^{|U|×|I|} to represent positive-only feedback, where s_{u,i} is 1 if and only if user u has given positive feedback about item i. We use I_u^+ := {i ∈ I : s_{u,i} = 1} to refer to the items for which user u has provided feedback and I_u^- := {i ∈ I : s_{u,i} = 0} to refer to the items for which that user has not provided feedback.
Example 2. Suppose we have the users Alice, Ben, and Christine. None of them has watched "The Usual Suspects"; Christine has watched all three other movies, while Alice and Ben have each only watched "American Beauty" and "The Godfather", respectively. If we assign IDs to all entities in order of their appearance, we have

\[
\mathbf{S} =
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 1 & 1
\end{pmatrix}.
\]

Here, the rows refer to users, while the columns refer to items (movies). From S, we can deduce the sets I_1^+ = {2} and I_1^- = {1, 3, 4} for Alice (user 1). Note that we only see positive feedback here: We cannot deduce that Alice and Ben do not like the other movies only because they have not watched them.
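The sets I_u^+ and I_u^- are straightforward to derive from S. A minimal Python sketch (illustrative only, not thesis code) for the data of Example 2:

    # Derive I_u^+ and I_u^- from the positive-only feedback matrix S (Example 2).
    S = [[0, 1, 0, 0],   # Alice
         [0, 0, 1, 0],   # Ben
         [0, 1, 1, 1]]   # Christine

    def pos_neg_items(S, u):
        """Items with/without positive feedback from user u (IDs are 1-based)."""
        row = S[u - 1]
        I_pos = {i + 1 for i, s in enumerate(row) if s == 1}
        I_neg = {i + 1 for i, s in enumerate(row) if s == 0}
        return I_pos, I_neg

    print(pos_neg_items(S, 1))  # Alice: ({2}, {1, 3, 4})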
Whether the data is one-class data depends on the way the data is logged. Continuing with the Facebook "thumbs up" scenario, if we knew which items had been displayed to the user, we could construct negative examples from the items that were not "liked". However, in practice such detailed logging does not happen, either because the logging mechanism was not designed with recommendation in mind, or because it is simply too costly in terms of resources.
Distinguishing between explicit and implicit feedback is important when thinking about the overall system, about the user motivations to give feedback, and about other cognitive aspects of recommender systems. For constructing a learning algorithm, it does not matter so much.
In this thesis, we often deal with the task of item recommendation from positive-only feedback. For the sake of brevity, we will call this task item recommendation in the following, even though there are of course many other item recommendation tasks besides item recommendation from positive-only feedback.
2.2.3 Further Tasks and Variants
Next, we provide details on several variants and special cases of recommendation tasks, and show how they fit into our formal framework. Most of these variants can be combined. For example, recommending playlists for a certain occasion, where we allow songs that a user has played before, can be expressed as context-aware sequential recommendation with repeated events.
Context-Aware Recommendation
We have already introduced the term context in the generic definition of the recommendation task. We speak of context-aware recommendation if the context consists of more than just the user.
Adomavicius and Tuzhilin [2011] mention time, location, and the company of other people as examples of context.¹ They distinguish between three different kinds of contextual information:
1. explicit context, stated by/asked from the user;
2. implicit context, logged by the system, for example time stamps;
3. inferred context, for example guessing which member of a household is using the TV.
Both the task of rating prediction and that of item recommendation can be extended to context-aware variants.
As mentioned at the end of section 2.2.1, interaction data can be augmented by time information. On one hand, time information can be used for distinguishing between temporal trends and underlying, more durable user preferences, and thus improve results for prediction tasks without context. On the other hand, it can also be part of the recommendation context [Koren, 2009]. Such time-aware recommendation is a case of context-aware recommendation, with C = U × D.
Suggestion of folksonomy tags (tag recommendation, Jäschke et al. [2007]) is also a special case of context-aware recommendation. Table 2.2 gives an overview of several context-aware recommendation tasks.
Filtered Recommendation
In filtered recommendation, the list of candidate items is restricted (filtered) by the context. One example is recommending items from the particular part of a product taxonomy that a user is currently browsing. An important difference to general context-aware recommendation is that the candidate items are necessarily and explicitly restricted by the context. This leads to interesting properties that can be exploited to improve the learning process of a recommender [Yang et al., 2011].
User Recommendation
An important application of recommender system algorithms is the recommendation of users to other users in social networks like LinkedIn, Xing, Twitter, or Facebook: users are treated as items. Information from the social network graph can be used both for candidate generation and for making decisions on what to suggest.
Another example of user recommendation is match making [Miller, 2003], like finding potential dating partners on dating websites, or finding scientists who could co-operate.

¹ In their definition, the user is not part of the context.
    Scenario                                 Context                        Feedback                            Target                       Rep.
    ---------------------------------------------------------------------------------------------------------------------------------------------
    Netflix Prize                            user, day                      rating                              movie                        no
    KDD Cup 2011, track 1                    user, day, time                rating                              song, artist, album, genre   no
    CAMRa 2010 Task 1                        user, week                     rating                              movie                        no
    CAMRa 2010 Task 2                        user, mood                     rating                              movie                        no
    CAMRa 2011 Task 1                        household                      rating                              movie                        no

    KDD Cup 2011, track 2                    user                           rating                              song                         no
    Music                                    user, location, mood, group    listen, like, ban, listening time   song, playlist               yes
    Folksonomy tags                          user, resource                 tagging                             tag                          yes
    Restaurants                              user, location, mood           rating, visit                       restaurant                   maybe
    Books                                    user, basket                   view, purchase                      product                      no
    Groceries                                user, basket                   view, purchase                      product                      yes

    Global recommendation                                                   e.g. view, purchase                 product                      maybe
    Similar products                         product                        view, purchase                      product                      maybe
    Item recommendation from pos. feedback   user                           positive feedback                   item                         maybe

    Table 2.2: Context-aware item recommendation: different scenarios. The first group contains rating prediction tasks, the second one
    item recommendation tasks. The third group contains examples for tasks that are not considered to be context-aware. "Rep." means
    repeated events (considering users). Adapted from Gantner et al. [2010c].
Repeated Events
One thing to be aware of when building recommender systems is whether it is possible and useful to recommend items that a user has already accessed. If this is not the case, the solution is to simply remove the accessed items from the candidate list. An example of this scenario is buying books: If a user has already bought a book, it is not very likely they will buy the same book again. The situation is different for buying products that are meant to be consumed, like groceries. Then things can become more complicated: accessed items will, depending on the learning method, be quite likely to show up in a recommendation list. On the other hand, a recommender system should still allow the user to explore the item space, so one should make sure that items unknown to the user have the chance to make it into the recommendation list. In the marketing literature, this is known as repeat buying [Ehrenberg, 1988]. Work on repeated events and recommender systems can be found in Geyer-Schulz and Hahsler [2002], Cho et al. [2005], Rendle et al. [2010] and Kawase et al. [2011].
Note that there are several different notions of repeated events: the repetition can relate to the user, or to other parts of the context, or to the full context. For example, in the folksonomy tag scenario, it makes sense to recommend tags a user has already used, but it does not make sense to recommend tags a user has already used in exactly the same context, i.e. for the current resource.
Sequential Recommendation
Sometimes, the ideal suggestion depends on what users have seen, heard, or bought before. Take for example a customer who returns to an online shop after having made a purchase some time ago [Rendle et al., 2010], or the task of generating music playlists [Alghoniemy and Tewfik, 2001, Logan, 2002, Pampalk et al., 2005, Pauws et al., 2006].
Sequential recommendation could be formalized in different ways. One option would be to extend the general definition of (set) recommendation, and add an order to the suggested set. Another option would be to model sequential recommendation as several separate recommendation problems with the previous items as context. We suggest using the second option, because it does not require modifying our original generic definition of the recommendation task. A context set for sequential recommendation with the Markov property [Markov, 1906] is C = U × I, where the item in the context is the preceding item. Definitions that take more than one item, or even sets of items, into account are also possible, for example C = U × 2^I.
2.3 Methods
In this section, we describe typical approaches to tackle the rating prediction and item recommendation tasks. Item recommendation is the main topic of this thesis, so the methods presented here represent the current state of the art with respect to this task. While we do not focus on rating prediction, this task is very prominent in the recommender systems literature. Towards the end, we give pointers to more complex methods for some of the more specific recommendation tasks mentioned before.
2.3.1 Baselines
Simple methods, like global, user, or item averages for rating predictions, are usually not employed as the method in recommender systems to make realistic suggestions. However, they are still useful as experimental baselines, and as components of more sophisticated methods. For example, we could compare the implementation of a new method with an existing baseline method. We would expect the new method to perform much better than the baseline; should this not be the case, we would suspect mistakes in the design or implementation of the new method.
Rating Prediction
Averages: Besides giving random answers, the simplest kind of rating prediction is to predict the same rating for each rating event. An obvious choice here is to use the global average (mean) µ. Another possibility is to predict the user average, or the item average. Note that the global or user average contains no information at all to let us rank the items for a given user.
User and item bias: One can also combine user and item biases [Koren, 2008]:

\[
\hat{r}^{\mathrm{uib}}_{u,i} = \mu + b^U_u + b^I_i, \tag{2.3}
\]
where b^U and b^I are the solution of the following optimization problem:

\[
\min_{\mathbf{b}^U\!,\,\mathbf{b}^I} \sum_{(u,i,r_{u,i}) \in R} \left( r_{u,i} - \mu - b^U_u - b^I_i \right)^2 + \lambda^U \lVert \mathbf{b}^U \rVert^2 + \lambda^I \lVert \mathbf{b}^I \rVert^2, \tag{2.4}
\]

where R ⊆ U × I × R are the known ratings, and λ^U and λ^I are regularization constants. The sum represents the least squares error, while the two terms starting with λ^U and λ^I, respectively, are regularization terms that control the parameter sizes to avoid overfitting. The optimization problem could be solved, for example, by a gradient descent algorithm, or by an alternating least squares method, as suggested by Koren [2008].
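As an illustration of how equation 2.4 can be optimized, here is a minimal stochastic gradient descent sketch in Python. It is not the MyMediaLite implementation; the learning rate and regularization values are arbitrary example choices.

    import random

    def learn_biases(ratings, n_users, n_items,
                     alpha=0.01, lambda_u=0.1, lambda_i=0.1, n_iter=20):
        """Learn the user and item biases of equations 2.3/2.4 with SGD."""
        mu = sum(r for _, _, r in ratings) / len(ratings)  # global average
        b_u = [0.0] * n_users
        b_i = [0.0] * n_items
        for _ in range(n_iter):
            random.shuffle(ratings)
            for u, i, r in ratings:
                err = r - (mu + b_u[u] + b_i[i])             # prediction error
                b_u[u] += alpha * (err - lambda_u * b_u[u])  # user bias update
                b_i[i] += alpha * (err - lambda_i * b_i[i])  # item bias update
        return mu, b_u, b_i

    # usage: ratings as (user, item, rating) triples with 0-based IDs
    mu, b_u, b_i = learn_biases([(0, 1, 4.0), (1, 2, 3.0), (2, 1, 5.0)],
                                n_users=3, n_items=4)
    print(mu + b_u[0] + b_i[2])  # predicted rating of user 0 for item 2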
Item Recommendation
In contrast to rating prediction, where we predict scalar values, in the item recommendation task we predict sets of items. In practice, many methods for item recommendation compute a score for each user-item combination, and then, for each user, rank the items according to their scores. In the following, we will present item recommendation methods as scoring functions, even though some of them may be expressed and implemented more elegantly (and efficiently) as functions emitting sets. The score function allows us to present the methods in a unified way.
Random: The simplest baseline is to assign random scores, resulting in randomly ordered recommendation sets for each user.
Most popular items: A very simple method that returns the same items to every user (except the items already accessed by the user) is to take the globally most popular items. What "most popular" means can differ depending on the kind of feedback data available. In the case of positive-only feedback, we rank the items according to the number of feedback events:

\[
\hat{s}^{\mathrm{mp}}_{u,i} = \left| \{ (u, i, s_{u,i}) \in s \} \right|. \tag{2.5}
\]
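Equation 2.5 amounts to simple counting. A minimal Python sketch (illustrative only):

    from collections import Counter

    def most_popular_scores(feedback):
        """Score each item by its number of positive feedback events (eq. 2.5)."""
        return Counter(item for _, item in feedback)

    feedback = [(1, 2), (2, 3), (3, 2), (3, 3), (3, 4)]  # (user, item) events of Example 2
    print(most_popular_scores(feedback).most_common())   # items ranked by popularity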
2.3.2 Neighborhood-Based Models
A classical family of methods for recommendation tasks are neighborhood-based
models. The general idea underlying these methods is to use similar examples
from the past to make predictions [Hastie et al., 2009]. Which examples are
similar is determined by a so-called similarity measure over the involved entities.
User-based k-nearest Neighbors (kNN)
Similar users are likely to like the same things, so a straightforward idea is to use the past activities of similar users to make suggestions.
User-based kNN for rating prediction: For rating data, the estimated Pearson correlation of two users is defined as follows [Koren and Bell, 2011]:

\[
\widehat{\mathrm{sim}}^{\mathrm{Pearson}}_{u,v} = \frac{\sum_{i \in I(u,v)} (r_{u,i} - \hat{b}_{u,i})(r_{v,i} - \hat{b}_{v,i})}{\sqrt{\sum_{i \in I(u,v)} (r_{u,i} - \hat{b}_{u,i})^2 \cdot \sum_{i \in I(u,v)} (r_{v,i} - \hat{b}_{v,i})^2}}, \tag{2.6}
\]

where I(u, v) = I(u) ∩ I(v) are the items that were rated by both users, and b̂_{u,i}, b̂_{v,i} are baseline rating estimations, for example the one defined in equation 2.3.
To avoid putting too much weight on correlations computed from just a few examples, a shrinkage factor is applied [Koren and Bell, 2011]:

\[
\widehat{\mathrm{sim}}'_{u,v} = \frac{|I(u, v)| - 1}{|I(u, v)| - 1 + \lambda}\, \widehat{\mathrm{sim}}_{u,v}, \tag{2.7}
\]

where λ is the shrinkage constant, and \(\widehat{\mathrm{sim}}_{u,v}\) is the initial estimate of the similarity of users u and v.
Using any (estimated or actual) similarity measure sim, we can predict ratings:

\[
\hat{r}^{\mathrm{UserKNN}}_{u,i} = \hat{b}_{u,i} + \frac{\sum_{v \in N^k(i;u)} \mathrm{sim}_{u,v}\, (r_{v,i} - \hat{b}_{v,i})}{\sum_{v \in N^k(i;u)} \mathrm{sim}_{u,v}}, \tag{2.8}
\]

where N^k(i; u) represents the k nearest (most similar according to the given similarity measure) neighbors of user u that have rated item i.
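A compact Python sketch of equations 2.6-2.8 may help. It assumes precomputed rating residuals r_{u,i} − b̂_{u,i} per user and a baseline predictor; it is an illustration, not the reference implementation, and the defaults for k and λ are example values.

    import math

    def shrunk_pearson(res_u, res_v, lmbda=25.0):
        """Pearson correlation (eq. 2.6) with shrinkage (eq. 2.7).
        res_u, res_v: dicts mapping item -> residual r_{u,i} - b_{u,i}."""
        common = res_u.keys() & res_v.keys()
        if len(common) < 2:
            return 0.0
        num = sum(res_u[i] * res_v[i] for i in common)
        den = math.sqrt(sum(res_u[i] ** 2 for i in common) *
                        sum(res_v[i] ** 2 for i in common))
        sim = num / den if den > 0 else 0.0
        return (len(common) - 1) / (len(common) - 1 + lmbda) * sim

    def knn_rating(u, i, residuals, baseline, k=40):
        """User-based kNN rating prediction (eq. 2.8)."""
        candidates = [(shrunk_pearson(residuals[u], residuals[v]), v)
                      for v in residuals if v != u and i in residuals[v]]
        top = sorted(candidates, reverse=True)[:k]  # k most similar raters of item i
        sim_sum = sum(s for s, _ in top)
        if sim_sum == 0.0:
            return baseline(u, i)
        return baseline(u, i) + sum(s * residuals[v][i] for s, v in top) / sim_sum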
User-based kNN for item recommendation from positive-only feedback: Similarly, we can use a kNN model for item recommendation from positive-only feedback.
One widely-used similarity measure for binary data is the cosine similarity
[Manning et al., 2008]:
$\text{sim}^{\text{cosine}}_{u,v} = \frac{|I(u) \cap I(v)|}{\sqrt{|I(u)| \cdot |I(v)|}}.$   (2.9)
A related measure is the Jaccard index:

$\text{sim}^{\text{Jaccard}}_{u,v} = \frac{|I(u) \cap I(v)|}{|I(u) \cup I(v)|}.$   (2.10)
To compute the score for a given user-item combination, we count the number of neighbors of the user that have accessed the item:
$\hat{s}^{\text{UserKNN}}_{u,i} = |N^k(i; u)|,$   (2.11)
(2.11)
where N k (i; u) in this case represents the k nearest neighbors of user u that
have interacted with item i.
To get more accurate predictions, we can sum up the similarities instead of
just counting the neighbors that have accessed the item:
$\hat{s}^{\text{UserKNN}'}_{u,i} = \sum_{v \in N^k(i;u)} \text{sim}_{u,v}.$   (2.12)
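A minimal sketch of the scoring function in equation 2.12, here with the Jaccard index of equation 2.10 as the similarity measure (names are our own):

```python
def jaccard(items_u, items_v):
    """Equation 2.10: Jaccard index of two users' item sets."""
    union = len(items_u | items_v)
    return len(items_u & items_v) / union if union else 0.0

def user_knn_score(u, i, user_items, k=80):
    """Equation 2.12: sum the similarities of the k nearest neighbors
    of user u that have interacted with item i.
    user_items: dict user -> set of items."""
    sims = sorted((jaccard(user_items[u], user_items[v])
                   for v in user_items if v != u and i in user_items[v]),
                  reverse=True)
    return sum(sims[:k])
```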
Demographic data: The similarity does not necessarily have to be computed from the interaction data. Instead, one could define similarities over the user attributes, for example the age, occupation, or location.
Item-based kNN
User-based kNN uses similar users to predict ratings and item scores for a given
user. Conversely, item-based kNN [Linden et al., 2003] uses similar items to
predict ratings and user scores for a given item. Similarity measures like Pearson
correlation, cosine similarity, or the Jaccard index can be computed for items
using the formulae presented above just by exchanging users and items.
Content data: As with user-based kNN, we can also dene the similarity over
the item attributes, for example the keywords describing the individual items
[Billsus et al., 2000].
Extensions to Basic kNN
Depending on the type of feedback and attribute information, many other kinds
of similarity measures, and combinations thereof [Tso-Sutter et al., 2008], are conceivable. Several advanced versions of kNN methods are known, for instance
computing adaptive weights instead of using the similarity values as weights
[Koren and Bell, 2011], or using approximate similarities like MinHash [Das
et al., 2007].
2.3.3 Attribute-Based Methods
We have already seen kNN methods based on demographics and item attributes.
A straightforward way to make recommendations based on user or item attributes is to train standard supervised machine learning models like linear regression, decision trees, support vector machines, or Naive Bayes [Hastie et al., 2009], one for each user or item, using the attributes of the other entity type as predictor variables and the interaction data as the targets. Pazzani and Billsus [2007]
and Lops et al. [2011] provide overviews of the state of the art in content-based
recommendation methods.
2.3.4 Hybrid Methods and Ensembles
Of course, it is not necessary to rely on either attribute-based or collaborative methods alone. There are many different ways of creating hybrid methods [Balabanovic and Shoham, 1997, Good et al., 1999]: either complete models that use different kinds of signals and data (for example Wang and Blei [2011]), or ensembles that combine the output of several different models [Koren, 2009, Töscher et al., 2010].
2.3.5 Stochastic Gradient Learning
Most recommender system models can be trained efficiently using stochastic gradient descent² [Gardner, 1984, LeCun et al., 1998, Takács et al., 2008, Bell et al., 2008, Töscher et al., 2008]. The algorithm's widespread use and general usefulness for large-scale data justify that we spend some time discussing its general working.
Algorithm 1 illustrates the generic stochastic gradient descent procedure: After initializing the parameters of the model (often to zero, one, or small random values, depending on the kind of model), we repeatedly draw single examples, compute a local approximation of the gradient based on the example, and update the affected parameters. The time spent optimizing is usually measured in epochs. We define one epoch as a complete pass over all training examples. Note that some scenarios do not even require a complete pass over the data [Hazan et al., 2011, Clarkson et al., 2010], whereas sometimes many epochs are needed to converge.
Data: dataset X, α, λ
Result: Θ̂
1  initialize Θ̂
2  repeat
3      draw example x from X
4      Θ̂ ← Θ̂ − α ∇ℓ_x(Θ̂)
5  until convergence

Algorithm 1: Learning with stochastic gradient descent. Θ̂ are the model parameters, α is the learning rate (step size), and ∇ℓ_x(Θ̂) is the local gradient of the (possibly regularized) loss function with respect to the model parameters.
² We say descent or ascent, depending on whether we minimize or maximize our optimization objective.
For training more complex models with different kinds of parameters, it is often useful to have different step sizes and regularization constants [Koren, 2009]. If we use several such constants for training in this thesis, we will indicate it explicitly.
Of course, there are many other learning methods for recommender systems besides stochastic gradient descent, like conjugate gradient [Rennie and Srebro, 2005, Pan et al., 2008], expectation-maximization (EM) [Yu et al., 2009], Markov Chain Monte Carlo [Salakhutdinov and Mnih, 2008a], or alternating least squares (ALS) [Hu et al., 2008, Pilászy and Tikk, 2009, Pilászy et al., 2010, Rendle et al., 2011].
The aspect that makes SGD so interesting is that it can be used to train a wide variety of different models, even on large-scale datasets, while being conceptually simple and easy to implement.
2.3.6 Matrix Factorization
The basic idea of matrix factorization for supervised learning is to represent a partially observed data matrix as the product of two smaller matrices. The rows of one of these two matrices represent the row elements of the original data matrix as k-dimensional vectors, and the rows of the other matrix represent the column elements of the original data as k-dimensional vectors. This allows the reconstruction of missing matrix elements by computing the scalar product of the corresponding rows.
We introduce matrix factorization with a model for rating prediction [Rendle
and Schmidt-Thieme, 2008], and will switch to item recommendation later on
in section 2.3.7.
Basic Model
The basic matrix factorization model for rating prediction is
$R = W H^\top + E,$   (2.13)

where $R \in \mathbb{R}^{|U| \times |I|}$ is the (only partially observed) rating matrix, $W \in \mathbb{R}^{|U| \times k}$ is the user matrix, $H \in \mathbb{R}^{|I| \times k}$ is the item matrix, and E contains the prediction error of the factorization.
We can compute a single rating estimate with the following formula:

$\hat{r}^{\text{mf}}_{u,i} = \langle w_u, h_i \rangle.$   (2.14)
Viewing rating prediction as a regression problem with square loss, this leads to the following optimization problem:

$\min_{W,H} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \hat{r}^{\text{mf}}_{u,i})^2 + \lambda (\|W\|^2 + \|H\|^2),$   (2.15)

where $\lambda$ is a regularization constant. Algorithm 2 describes an SGD procedure to optimize equation 2.15.
Data: dataset R, α, λ
Result: W, H
1  initialize W, H to small random values
2  repeat
3      draw example (u, i, r_{u,i}) from R
4      e ← r_{u,i} − r̂^mf_{u,i}
5      w_u^new ← w_u + α(e · h_i − λ · w_u)
6      h_i^new ← h_i + α(e · w_u − λ · h_i)
7      w_u ← w_u^new
8      h_i ← h_i^new
9  until convergence

Algorithm 2: Learning a basic matrix factorization for rating prediction with stochastic gradient descent. W, H are the model parameters, α is the learning rate (step size), λ is the regularization constant.
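A compact NumPy sketch of Algorithm 2 (names and hyperparameter values are illustrative, not the settings used in the experiments):

```python
import random
import numpy as np

def train_mf(ratings, n_users, n_items, k=32, alpha=0.01, lam=0.05, epochs=30):
    """Algorithm 2: basic matrix factorization for rating prediction,
    trained with stochastic gradient descent.
    ratings: list of (u, i, r) triples with integer ids."""
    rng = np.random.default_rng(42)
    W = rng.normal(0.0, 0.1, (n_users, k))   # user factors
    H = rng.normal(0.0, 0.1, (n_items, k))   # item factors
    for _ in range(epochs):
        random.shuffle(ratings)
        for u, i, r in ratings:
            e = r - W[u] @ H[i]                           # prediction error
            w_new = W[u] + alpha * (e * H[i] - lam * W[u])
            H[i] += alpha * (e * W[u] - lam * H[i])
            W[u] = w_new
    return W, H
```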
Modeling Global, User, and Item Biases
To improve the predictive accuracy, it may be wise to only model the difference between each rating and the global mean, and to explicitly take user and item biases into account, leading to the following prediction formula:

$\hat{r}^{\text{bmf}}_{u,i} = \mu + b^U_u + b^I_i + \langle w_u, h_i \rangle,$   (2.16)

leading to the optimization problem

$\min_{b^U, b^I, W, H} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \hat{r}^{\text{bmf}}_{u,i})^2 + \lambda_b (\|b^U\|^2 + \|b^I\|^2) + \lambda (\|W\|^2 + \|H\|^2),$   (2.17)
which again can be optimized using stochastic gradient descent.
Using a Sigmoid Function
To make the model and the corresponding learning algorithm less prone to numerical difficulties due to different rating scales ([0, 1] vs. [−1000, 1000]), we can also apply a sigmoid function to the sum from equation 2.16, in order to make sure the result is inside the interval of valid ratings:

$\hat{r}^{\text{blmf}}_{u,i} = r_{\min} + g(\mu + b^U_u + b^I_i + \langle w_u, h_i \rangle)(r_{\max} - r_{\min}),$   (2.18)

again leading to an optimization problem similar to the one in equation 2.17, which can be (approximately) solved by Algorithm 3. g denotes the logistic function:

$g(x) = \frac{1}{1 + e^{-x}}.$   (2.19)
Data: dataset R, α, λ
Result: b^U, b^I, W, H
1   b^U ← 0
2   b^I ← 0
3   initialize W, H to small random values
4   repeat
5       draw example (u, i, r_{u,i}) from R
6       e ← r_{u,i} − r̂^blmf_{u,i}
7       x ← g(µ + b^U_u + b^I_i + ⟨w_u, h_i⟩)
8       y ← e · x · (1 − x) · (r_max − r_min)
9       b^U_u ← b^U_u + α(y − λ b^U_u)
10      b^I_i ← b^I_i + α(y − λ b^I_i)
11      w_u^new ← w_u + α(y · h_i − λ w_u)
12      h_i^new ← h_i + α(y · w_u − λ h_i)
13      w_u ← w_u^new
14      h_i ← h_i^new
15  until convergence

Algorithm 3: Learning a matrix factorization with biases and a logistic transformation for rating prediction with stochastic gradient descent. W, H are the user and item factors, b^U, b^I are the user and item bias vectors, α is the learning rate (step size), λ is the regularization constant.

More Complex Models
The matrix factorization models in this section are presented as an introduction to factorization models; there are more complex and powerful methods in the literature, which use additional data, like SVD++ [Koren, 2008], timeSVD and timeSVD++ [Koren, 2010], time-aware Bayesian probabilistic tensor factorization [Xiong et al., 2010], SVDFeature [Chen et al., 2011b], Factorization Machines [Rendle, 2010b], and the time- and taxonomy-aware model by Dror et al. [2011a], or which optimize different loss functions, like CofiRank [Weimer et al., 2008].
2.3.7 Bayesian Personalized Ranking
Bayesian Personalized Ranking (BPR) is a framework for optimizing different kinds of models based on training data containing implicit feedback or other kinds of implicit and explicit (partial) ranking information. It has been successfully applied to k-nearest-neighbor (kNN), matrix factorization, and different tensor factorization models for the tasks of item recommendation [Rendle et al., 2009] and personalized tag prediction [Rendle and Schmidt-Thieme, 2010].
Rendle [2010a] refers to the context-aware generalization of BPR, which goes
beyond mere personalization, as Bayesian Context-Aware Ranking (BCR). Because we concentrate on personalization in this thesis, we stick with the term
BPR; all extensions to BPR presented in the next chapter can also be made to
BCR.
BPR's key ideas are to consider entity pairs instead of single entities in
its loss function, which allows the interpretation of positive-only data as partial
ranking data, and to learn the model parameters using a generic algorithm based
on stochastic gradient descent.
For convenience, we use $I^+_u$ for positive items and $I^-_u$ for negative ones, similarly to the notation used by Rendle et al. [2009]. Depending on the context, $I^+_u$ and $I^-_u$ may refer to the positive and negative items in the training or test
set. What determines whether an item is positive or negative may differ (see section 5.2.5).
For estimating whether a user prefers one item over another, we optimize for the BPR criterion³:

$\text{BPR}(D_S, \Theta) = \sum_{(u,i,j) \in D_S} \ln g(\hat{s}_{u,i,j}(\Theta)) - \lambda \|\Theta\|^2,$   (2.20)

where $\hat{s}_{u,i,j}(\Theta) := \hat{s}_{u,i}(\Theta) - \hat{s}_{u,j}(\Theta)$ and $D_S = \{(u,i,j) \mid i \in I^+_u \wedge j \in I^-_u\}$. $\Theta$ represents the parameters of the model and $\lambda$ is a regularization constant.
Matrix Factorization Based on BPR
Matrix factorization based on BPR (BPR-MF) approximates the event matrix S by the product of two low-rank matrices $W \in \mathbb{R}^{|U| \times k}$ and $H \in \mathbb{R}^{|I| \times k}$. For a specific user u and item i, the score estimate is

$\hat{s}_{u,i} = \sum_{f=1}^{k} w_{uf} h_{if} = \langle w_u, h_i \rangle.$   (2.21)
Each row $w_u$ in W can be seen as a feature vector describing a user u; each row $h_i$ of H describes an item i.
For learning the MF model, we use the LearnBPR algorithm (Algorithm 4), which is a variant of stochastic gradient ascent that samples from $D_S$. To apply LearnBPR to MF, only the gradient of $\hat{s}_{u,i,j}$ with respect to every model parameter has to be derived.
Data: D^train, α, λ
Result: Θ̂
1  initialize Θ̂
2  repeat
3      draw (u, i) from D^train
4      draw j uniformly from I^-_u
5      Θ̂ ← Θ̂ + α ( (e^{−ŝ_{u,i,j}} / (1 + e^{−ŝ_{u,i,j}})) · ∂ŝ_{u,i,j}/∂Θ̂ − λ · Θ̂ )
6  until convergence

Algorithm 4: LearnBPR: Optimizing BPR using stochastic gradient ascent. α is the learning rate (step size), λ is the regularization constant.
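For BPR-MF, the required gradients of $\hat{s}_{u,i,j}$ are $\partial/\partial w_u = h_i - h_j$, $\partial/\partial h_i = w_u$, and $\partial/\partial h_j = -w_u$. A NumPy sketch of LearnBPR applied to BPR-MF, under assumed hyperparameters and our own naming (this is not the MyMediaLite implementation), could look like this:

```python
import numpy as np

def learn_bpr_mf(user_pos, n_users, n_items, k=32, alpha=0.05, lam=0.002,
                 n_updates=1_000_000):
    """LearnBPR (Algorithm 4) applied to the BPR-MF model.
    user_pos: dict user id -> set of positive item ids."""
    rng = np.random.default_rng(7)
    W = rng.normal(0.0, 0.1, (n_users, k))
    H = rng.normal(0.0, 0.1, (n_items, k))
    users = [u for u in user_pos if user_pos[u]]
    pos_lists = {u: list(items) for u, items in user_pos.items()}
    for _ in range(n_updates):
        u = users[rng.integers(len(users))]
        i = pos_lists[u][rng.integers(len(pos_lists[u]))]  # positive item
        j = int(rng.integers(n_items))                     # sample a negative
        while j in user_pos[u]:
            j = int(rng.integers(n_items))
        s_uij = W[u] @ (H[i] - H[j])
        sig = 1.0 / (1.0 + np.exp(s_uij))   # = e^{-s} / (1 + e^{-s})
        w_u = W[u].copy()
        # gradients: d s_uij/d w_u = h_i - h_j; d/d h_i = w_u; d/d h_j = -w_u
        W[u] += alpha * (sig * (H[i] - H[j]) - lam * W[u])
        H[i] += alpha * (sig * w_u - lam * H[i])
        H[j] += alpha * (-sig * w_u - lam * H[j])
    return W, H
```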
Other Item Recommendation Methods: Besides BPR-MF and kNN, there exist of course other methods for item recommendation, most prominently weighted regularized matrix factorization (WR-MF; Hu et al. [2008], Pan et al. [2008]).
³ In the original paper [Rendle et al., 2009], this is called BPR-Opt; for brevity, we call it just BPR.
2.3.8 Context-Aware Recommendation
Adomavicius and Tuzhilin [2011] distinguish three different approaches for context-aware recommendations:
1. contextual pre-filtering, where the set of candidate items is selected based on the context and then fed into a non-contextual recommender.
2. contextual post-filtering, where the items suggested by a non-contextual recommender (possibly including the score outputs) are post-processed according to the current context.
3. contextual modeling, where context is directly modeled.
In recent years, there have been several publications on contextual modeling approaches, for example Rendle's thesis [Rendle, 2010a] and several follow-up publications [Rendle, 2010b, Rendle et al., 2011], as well as Multiverse recommendation [Karatzoglou et al., 2010] and the works by Baltrunas [Baltrunas, 2011, Baltrunas et al., 2011].
Work on the special case of tag recommendation was kicked off by Jäschke et al. [2007], who suggest several methods inspired by PageRank [Brin and Page, 1998]. Other methods that have been presented are pairwise interaction tensor factorization (PITF, Rendle and Schmidt-Thieme [2010]), relational classification [Marinho et al., 2009], and content-based approaches [Lipczak, 2009].
Several time-aware methods have already been mentioned in section 2.3.6.
2.4 Evaluation Criteria
First of all, there are different things we can evaluate in recommender systems research. We can evaluate complete systems, and we can also evaluate certain components of a recommender system. In this thesis, we are mostly concerned with evaluating the method components, meaning the prediction model and the corresponding learning algorithm.
Shani and Gunawardana [2011] discuss 14 recommendation system properties, some of which relate to the method components:
1. user preference: the item ranking,
2. prediction accuracy: error measures for rating prediction, information retrieval metrics for item recommendation,
3. coverage: the part of the entities out of the overall set for which the system/method is able to make useful predictions,
4. confidence: whether the system is able to report the confidence it has in the suggestions it makes,
5. trust: whether the users trust the system's recommendations,
6. novelty: whether the system is able to suggest items that the user previously did not know about,
7. serendipity: how surprising recommendations are to a user,
8. diversity: how diverse the recommended item sets are,
9. utility: how useful the recommendations are for the user and for the operator of the recommender system,
10. risk: how likely it is that users will be disappointed by the recommendations,
11. robustness: how hard it is for attackers to modify the system's recommendations,
12. privacy: whether user preferences can be exposed by the system,
13. adaptivity: how well a system/method responds to new feedback,
14. scalability: how well a system/method can cope with growing numbers of users, items, and feedback.
While it is important to test recommender system algorithms in the field [Marrow et al., 2010], experiments with real users are expensive and time-consuming to execute. Offline experiments are inexpensive, and repeating them is only limited by computing resources, which nowadays are cheap; in the literature on recommender system methods [Karypis, 2001], and more generally in the machine learning/data mining literature, they are the primary means for comparing the performance of learning algorithms [Hastie et al., 2009].
2.4.1 Predictive Accuracy
Rating Prediction
Typical evaluation measures for rating prediction are root mean square error (RMSE) and mean absolute error (MAE) [Shani and Gunawardana, 2011]:
$e_{\text{RMSE}}(R, \hat{r}) = \sqrt{\frac{1}{|R|} \sum_{(u,i,r_{u,i}) \in R} (r_{u,i} - \hat{r}_{u,i})^2}$   (2.22)

$e_{\text{MAE}}(R, \hat{r}) = \frac{1}{|R|} \sum_{(u,i,r_{u,i}) \in R} |r_{u,i} - \hat{r}_{u,i}|.$   (2.23)
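Both measures are simple to compute; a minimal sketch (names are our own):

```python
import math

def rmse(pairs):
    """Equation 2.22.  pairs: list of (rating, prediction) tuples."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

def mae(pairs):
    """Equation 2.23."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)
```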
Item Recommendation
The accuracy of item recommendation methods can be measured in several
ways. Ranking measures take the ordering of items by the system into account,
while set measures rely on the information of whether or not an item is in the
set of recommendations.
Ranking Measures: If the recommendation method returns a ranked list of items for a user, we can compare this list with held-out preference information of the same user.
One such measure is the area under the ROC curve (AUC). Intuitively, the AUC is the probability that, when we draw two items at random, their predicted pairwise ranking is correct [Bickel, 2006].
For recommendation from positive-only feedback, the per-user AUC on the test set can be defined as follows:

$\text{AUC}(u) = \frac{1}{|I^+_u| |I^-_u|} \sum_{i \in I^+_u} \sum_{j \in I^-_u} \delta(\hat{s}_{u,i} > \hat{s}_{u,j}),$   (2.24)

where $\delta$ is defined as

$\delta(x) := \begin{cases} 1, & \text{if condition } x \text{ holds} \\ 0, & \text{else.} \end{cases}$   (2.25)
Note that the loss 1 − AUC has a nice property [Balcan et al., 2008]: It is greater for mistakes at the beginning and the end of an ordering, which satisfies the intuition that an unwanted item placed at the top of a recommendation list should have a higher associated loss than when placed in the middle.
The average AUC over all relevant users is
$\text{AUC} = \frac{1}{|U^{\text{test}}|} \sum_{u \in U^{\text{test}}} \text{AUC}(u),$   (2.26)
where U test = {u|(u, i) ∈ Dtest } is the set of users that are taken into account
in the evaluation.
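A direct, if naive (it enumerates all item pairs per user), sketch of equations 2.24 and 2.26:

```python
def auc_per_user(scores, pos, neg):
    """Equation 2.24: fraction of correctly ordered item pairs.
    scores: dict item -> predicted score; pos, neg: sets of item ids."""
    hits = sum(1 for i in pos for j in neg if scores[i] > scores[j])
    return hits / (len(pos) * len(neg))

def mean_auc(per_user):
    """Equation 2.26.  per_user: list of (scores, pos, neg) triples,
    one per test user."""
    return sum(auc_per_user(s, p, n) for s, p, n in per_user) / len(per_user)
```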
Set Measures: If we look at the top n items [Karypis, 2001] of a ranked list,
we can use those items to compute several measures that are commonly used in
the area of information retrieval [Manning et al., 2008].
Precision measures the ratio of correctly predicted items in the set, whereas
recall measures the ratio of items in the test set that were present in the result
set. More formally,
prec(u, S) =
recall(u, S) =
|Iu+
test
∩ S|
(2.27)
|S|
|Iu+
test
∩ S|
test
|Iu+ |
,
(2.28)
test
where S is the set of recommended items (see denition 2), and Iu+
=
{i|(u, i) ∈ Dtest } is the (held-out) set of items the user has provided positive
feedback for.
Because we look at these measures for fixed recommendation set sizes n, we refer to them as precision at n (prec@n) and recall at n (recall@n).
Note that with a fixed n, changes in both prec@n and recall@n only depend on the number of hits $|I^{+,\text{test}}_u \cap S|$. This means it is sufficient to look at only one of these measures to compare different methods.
As with AUC, we usually average these information retrieval measures over
all relevant users.
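A minimal sketch of both measures on the top-n prefix of a ranked list (names are our own):

```python
def prec_at_n(ranked, held_out, n=5):
    """Equation 2.27 on the top-n prefix of a ranked item list."""
    top = ranked[:n]
    return len(set(top) & held_out) / len(top)

def recall_at_n(ranked, held_out, n=5):
    """Equation 2.28 on the top-n prefix of a ranked item list."""
    return len(set(ranked[:n]) & held_out) / len(held_out)
```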
Dataset                 Events        Users      Items     Sparsity
MovieLens 100k          100,000       943        1,682     0.9369533
MovieLens 1M            1,000,209     6,040      3,706     0.9553164
Netflix                 100,480,507   480,189    17,770    0.9882244
  (test set)            1,408,789     463,122    16,897
KDD Cup 2011 track 1    252,800,275   1,000,990  624,961   0.9995959
  (validation set)      4,003,960     1,000,990  258,694
KDD Cup 2011 track 2    61,944,406    249,012    296,111   0.9991599
Yahoo! Music Ratings    699,640,226   1,823,179  136,736   0.9971935
  (test set)            18,231,790    1,823,179  136,735

Table 2.3: Evaluation datasets used in this thesis.
2.4.2 Runtime Performance
For judging the runtime performance of a learning or prediction algorithm, we may look at the computation time and memory consumption. Evaluating those aspects can be done from a theoretical perspective, looking at the computational complexity [Cormen et al., 2001], or from a more practical perspective, by measuring the runtime and memory consumption of the method for different input sizes.
2.5 Datasets
Machine learning methods are usually tested in so-called offline experiments, which means that we measure the method's predictive accuracy (or possibly other properties) on user preference data that was collected on a real system with real users.
Other kinds of experiments are possible for recommender system methods, in particular when investigating human aspects of recommender systems, like the psychology of decision-making, user satisfaction, usability, or other questions of human-computer interaction: laboratory experiments with a small or medium-sized group of participants [Knijnenburg et al., 2011b], and live experiments on an existing system, with a potentially arbitrary number of participants. Note that the number of employed methods in live experiments (and even more so in lab experiments) is always limited, but one can use the initially mentioned offline experiments to get a pre-selection of methods to compare in conditions which are closer (or identical) to the intended application scenario.
Hints about user preferences are usually observed in the form of user actions
in an interactive system, where users can for example view and rate videos
or movies, or can browse an online shop and purchase products. A dataset
typically contains logs of one or more such action types for many users over a
given time span. Data collection works independently of whether the system
has a recommendation feature or not.
In this section we describe a couple of such datasets, which we will use for
evaluation throughout this thesis. Table 2.3 contains a comparison of several
quantitative properties of the datasets. Larger datasets tend to be sparser than
smaller ones.
2.5.1 MovieLens
MovieLens 100k
The MovieLens 100k [Herlocker et al., 1999] dataset was collected on the MovieLens web site between September 19th, 1997 and April 22nd, 1998. It contains 100,000 ratings on a scale from 1 to 5, as well as some demographic information on its 943 users, and genre information about the items (movies) it contains.
MovieLens 1M
MovieLens 1M is a set of approximately 1,000,000 ratings collected from 2000
onwards, from the same website. There is also an extended version of this
dataset with 10,000,000 ratings, which will not be used for the evaluations in
this thesis.
2.5.2 Netflix
For the Netflix Prize, the first public large-scale rating dataset was released. It contains about 100,000,000 ratings on a scale of 1 to 5, by almost half a million users on 17,770 movies.
2.5.3 KDD Cup 2011
The datasets used for KDD Cup 2011 are described in Dror et al. [2011b]. They were taken from the Yahoo! Music website and collected during the years 1999 to 2010.
Track 1
The dataset for track 1, which was about rating prediction, is the largest set in terms of the number of items. It is also the sparsest of the datasets we compare here.
Track 2
The dataset for track 2, which was about item recommendation, was slightly smaller than the one for track 1, about one fourth of its size.
2.5.4 Yahoo! Music Ratings
Like the KDD Cup 2011 data, the Yahoo! Music Ratings dataset was collected
from the Yahoo! Music website. The collection period is the time between 2002
and 2006. It is even larger, but not more sparse, than the KDD Cup 2011 track
1 dataset, containing overall more than 617 million ratings. To the best of our
knowledge, it is currently the largest available recommender system dataset.
Chapter 3
Cold-Start Recommendation
Matrix and tensor factorization are well-suited methods for solving problems in the field of recommender systems, like rating prediction for a given user and item (section 2.2.1), or recommending a set of items to a given user (section 2.2.2). Because predictions from factorization models rely on computing simple dot products of latent feature vectors representing users, items, and possibly other entities in the application domain, they usually have good runtime performance. Training with regard to suitable optimization objectives usually leads to good predictive accuracy.
The downside of standard factorization methods is that feature vectors are only available for entities observed in the training data, for example users who bought at least one book, or books bought by at least one user. Thus, for entirely new users and items, such methods are not capable of computing meaningful recommendations. Even many hybrid systems that rely on both collaborative and content information cannot provide useful predictions for entirely new entities, i.e. those that have no interaction information associated with them.
In real-world recommender systems, such cold-start problems¹ are often solved by switching to a different, purely content-based method when encountering entirely new entities; other options are to present just the most popular items to new users, and to randomly present new items to the users in order to gather collaborative information about those new entities.
The approach we present here is a modular one, with well-defined interfaces between its components. At the core of our framework is a standard factorization model that only works for entities with collaborative training data. This factorization model is optimized for the given recommendation task. The additional components are mapping functions that compute adequate latent feature representations for new entities from their attribute representations.
For example, in the classical recommendation task of movie rating prediction [Shardanand, 1994, Koren, 2009], this approach would handle new users and new items by first computing the latent feature vectors for the unknown entities from attributes like the user's age or location and a movie's genres or main cast, and then by using those estimated latent feature vectors to compute the rating from the underlying matrix factorization (MF) model.
¹ In this chapter, we use the terms cold start, new item, and new user in the narrower sense; see section 3.1 for the definition.
The training of such a combined model consists of learning the underlying
standard model from the collaborative data, and then learning the mapping
functions from the pairs of latent feature vectors and attribute vectors belonging
to entities that are present in the collaborative data.
Note that this mapping approach is applicable to a variety of prediction tasks, underlying factorization models, and families of mapping functions. In the following, we describe the use of this framework for the task of item recommendation from positive-only feedback, using a matrix factorization model optimized for Bayesian Personalized Ranking (BPR, see chapter 2.3.7 and Rendle et al. [2009]), and demonstrate its usefulness for the new-item recommendation task with a set of experiments.
The main contributions of this chapter are
1. a general, simple and straightforward method to make factorization models attribute-aware by plugging learnable mapping functions onto them, and,
2. based on that method, an extension of matrix factorization optimized for Bayesian Personalized Ranking (BPR-MF) that can deal with the cold-start problem, yielding accurate and fast attribute-aware item recommendation methods based on different families of mapping functions.
3. We also show empirically that it is worth training the mapping function for optimal model performance with respect to application-specific losses, instead of just trying to map the latent features as accurately as possible.
3.1 Problem Statement
In a wider sense, cold-start scenarios are those situations where we want to compute predictions for users or items that have little collaborative information [Cremonesi and Turrin, 2009, Pilászy and Tikk, 2009]; in the narrow sense, cold-start scenarios are exactly those scenarios in which there is no collaborative information at all for the given users or items [Gunawardana and Meek, 2008, Park and Chu, 2009, Gunawardana and Meek, 2009]. In this chapter, we use the term in the latter sense.
First, let us repeat the movie example from the preceding chapter: Suppose
we have the users Alice, Ben, and Christine. None of them has watched The
Usual Suspects; Christine has watched all three other movies, while Alice and Ben have only watched American Beauty and The Godfather, respectively.
Example 3: In our example, The Usual Suspects would be a new (cold-start) item.

3.2 Attribute-to-Feature Mappings
In this section, we describe the framework we have sketched in the introduction,
and use it for the task of item recommendation from positive-only feedback.
[Figure 3.1: Attribute-to-feature mappings; see section 3.2.1 for a description.]
3.2.1 General Framework
In factorization models, every entity (e.g. users, items, tags) is represented by a latent feature vector $f \in \mathbb{R}^k$. In the matrix factorization models presented in the preceding chapter, the rows of the matrices W and H are such latent feature vectors. Usually, for example in the matrix factorization models just mentioned, the latent features of an entity can only be set to meaningful values during training if the entity occurs in the (collaborative) training data.
If this is not the case, one way to still make use of the factorization model for
new entities is to estimate their latent features from the existing content data:
to map from the attribute space to the latent feature space. The recommender
system could then use the factorization model to compute scores for all kinds
of entities; latent feature vectors for new entities would be computed from the
content attributes and further on used as if they were normally trained latent
features.
The mapping functions could theoretically take any form, although for practical purposes we will limit them to families of functions that allow the learning
of useful mapping functions.
The training of a factorization model with a mapping extension consists of
the following steps:
1. training the factorization model using the data S, and then
2. learning the mapping functions from the latent features of the entities in
the training data and their content attributes.
Figure 3.1 illustrates the framework for a domain involving users and items: The
rectangles on the left-hand side represent the factor matrices, the ones on the
right-hand side the attribute matrices. Attributes are assumed to be known for
all entities (vertical hatching), while factors are initially only known for those
entities that occur in the training data (vertical hatching). Entities without
collaborative data have no factors (blank). The unknown entity factors are estimated using the corresponding mapping function. The mapping functions are
learned from the factor and attribute values of the entities with complete information (thin arrows). Note that this framework can be extended to application
domains with additional entity types besides users and items.
While we focus on strict cold-start problems here, we could also easily deal
with scenarios involving just a few users, for example by using an adaptive
ensemble of the underlying model and a model employing estimated latent features.
To exemplify how attribute-to-feature mappings can be used for item recommendation from positive-only data, we use BPR-MF, a matrix factorization
model based on the Bayesian Personalized Ranking (BPR) framework (see chapter 2.3.7).
Bear in mind that the general framework presented here can be applied to
other matrix factorization models, as well as to any other model where the
entities of the application domain are represented by latent feature vectors, like
Tucker decomposition [Tucker, 1966] or PARAFAC [Harshman, 1970]. In the
examples and experiments, we focus on new items; new users (or other kinds of
entities) can be handled analogously.
Example 4: Training a hypothetical factorization model with k = 2 yields two matrices consisting of the user and item factor vectors, respectively:

    W = ( 0.2  1.2          H = (  ?    ?
          1.3  0.3                 0.9  1.0
          0.9  1.1 ),              1.1  0.2
                                   0.1  1.2 ).
Every row in W corresponds to one user, which means that row 1 represents Alice, row 2 Ben, and row 3 Christine. In H, each row corresponds to exactly one movie. Suppose that The Usual Suspects has not yet been added to the content catalog, so row 1 does not contain any meaningful values. We can compute item recommendations for Alice by ranking her previously unseen movies according to their predicted scores:

ŝ_{1,3} = ⟨w_1, h_3⟩ = 0.2 · 1.1 + 1.2 · 0.2 = 0.46
ŝ_{1,4} = ⟨w_1, h_4⟩ = 0.2 · 0.1 + 1.2 · 1.2 = 1.46.

Because the score for Road Trip is 1.46, the system would rank it higher in the result list than The Godfather, which only has a score of 0.46. If we want to make a prediction for The Usual Suspects, we need to estimate its factors from its attributes.
3.2.2 Item Mappings
In this section, we show how to design attribute-to-feature mappings for items;
user-attribute-to-feature mappings can be designed accordingly.
Movie                 US    AB    TG    RT
The Usual Suspects    1     0     0.5   0
American Beauty       0     1     0     0.5
The Godfather         0.5   0     1     0
Road Trip             0     0.5   0     1

Table 3.1: Cosine similarities between movies.
The general form of score estimation by mapping from item attributes to item factors is

$\hat{s}_{u,i} := \sum_{f=1}^{k} w_{uf} \, \phi_f(a^I_i) = \langle w_u, \phi(a^I_i) \rangle,$   (3.1)

where $\phi_f : \mathbb{R}^n \to \mathbb{R}$ denotes the function that maps the item attributes to the factor with index f, and $\phi : \mathbb{R}^n \to \mathbb{R}^k$ denotes the vector-valued function that maps the item attributes to all item factors.
K-Nearest-Neighbor Mapping
One approach to map the attribute space to the factor space is to use weighted k-nearest-neighbor (kNN) regression [Hastie et al., 2009] for each factor.
We determine the k nearest neighbors $N_k(i)$ as the most similar items according to the cosine similarity (see section 2.3.2) of the attribute vectors. Each factor is then estimated by

$\phi_f(a^I_i) := \frac{\sum_{j \in N_k(i)} \text{sim}(a^I_i, a^I_j) \, h_{jf}}{\sum_{j \in N_k(i)} \text{sim}(a^I_i, a^I_j)}.$   (3.2)
Note that for other kinds of attribute data (e.g. strings, real numbers), other
similarity metrics could be employed.
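A small NumPy sketch of the mapping in equation 3.2 for a single new item; the function names and the fallback for all-zero similarities are our own choices:

```python
import numpy as np

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm > 0 else 0.0

def knn_map(a_new, A_train, H_train, k=10):
    """Equation 3.2: estimate the factor vector of a new item as the
    similarity-weighted average of the factors of its k most similar
    training items.
    a_new: attribute vector; A_train: attributes of the training items
    (one row per item); H_train: their learned factor matrix."""
    sims = np.array([cosine(a_new, a) for a in A_train])
    nn = np.argsort(-sims)[:k]        # indices of the k nearest neighbors
    w = sims[nn]
    if w.sum() == 0.0:                # no overlap at all: fall back to mean
        return H_train[nn].mean(axis=0)
    return (w @ H_train[nn]) / w.sum()
```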
Example 5: The cosine similarities of the different items are given in Table 3.1. The factors of The Usual Suspects, estimated by 1-NN, would be

ĥ_1 := φ(a^I_1) = ( (0.5 · h_{3,1}) / 0.5 , (0.5 · h_{3,2}) / 0.5 ) = (1.1, 0.2).
With this estimation, we can compute a score for the new item:

ŝ_{1,1} = ⟨w_1, ĥ_1⟩ = 0.2 · 1.1 + 1.2 · 0.2 = 0.46.

This result means that we would still recommend Road Trip to Alice.
Linear Mapping
For score estimation with a linear mapping to the item factors, we plug linear functions into equation (3.1):

$\phi_f(a^I_i) = \sum_{l=1}^{n} m_{fl} \, a^I_{il} = \langle m_f, a^I_i \rangle.$   (3.3)

Each item factor is expressed by a weighted sum of the item attributes.
Example 6: Suppose we have trained a linear mapping model with the following weights:

    M = ( 0.7  0.0  0.1  1.0  0.7
          0.1  0.0  0.8  1.1  0.0 ).

The rows in matrix M correspond to the different latent features, while the columns denote the influence of each attribute on the latent features. Then the latent feature estimates are

ĥ_1 = ( 1·0.7 + 1·0.0 + 0·0.1 + 0·1.0 + 0·0.7 ,  1·0.1 + 1·0.0 + 0·0.8 + 0·1.1 + 0·0.0 ) = ( 0.7, 0.1 ),

and the score for Alice and The Usual Suspects is

ŝ_{1,1} = ⟨w_1, ĥ_1⟩ = 0.2 · 0.7 + 1.2 · 0.1 = 0.26.
There are different ways of learning the linear mapping functions. We present two options: simple least-squares optimization on the latent factors, and the more complex BPR optimization, which optimizes the mapping for the overall predictive accuracy of the resulting item recommendation model. As we will show in the empirical evaluation, this additional complexity is well worth the effort.

Optimizing for Least Squares Error on the Latent Factors: One way to learn suitable parameters for the linear mapping functions is optimizing the model for the (regularized) squared error on the latent features, i.e. straightforward ridge regression [Hastie et al., 2009]. Because the number of input variables (attributes) can be in the tens of thousands, we use stochastic gradient descent for training. This simple approach did not yield optimal results (see section 3.3.3), so we investigated another mapping method, which is explained next.
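For illustration, the following sketch solves this ridge regression in closed form, which is feasible for moderate attribute counts; the thesis itself uses stochastic gradient descent because the number of attributes can be very large. Names are our own:

```python
import numpy as np

def fit_linear_map(A, H, lam=1.0):
    """Learn M for equation 3.3 by ridge regression from the item
    attributes A (n_items x n_attrs) to the trained item factors H
    (n_items x k).  Returns M of shape (k, n_attrs) so that a new
    item's factors are estimated as M @ a."""
    n_attrs = A.shape[1]
    # closed-form ridge solution: (A^T A + lam*I)^{-1} A^T H, transposed
    return np.linalg.solve(A.T @ A + lam * np.eye(n_attrs), A.T @ H).T

def map_factors(M, a_new):
    """Equation 3.3: estimated factor vector of a new item."""
    return M @ a_new
```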
Optimizing for BPR Performance of the Complete Model: To optimize the parameters of the linear mapping functions, $\Theta = M \in \mathbb{R}^{k \times n}$, for the BPR criterion (of the overall prediction model in equation 3.1) is a more suitable approach, because it fits the parameters leading to optimal model performance, rather than just accurately approximating the latent feature values.
As stated above, when optimizing for BPR, we are interested in the difference between two item scores for the same user:

$\hat{s}_{ui} - \hat{s}_{uj} = \sum_{f=1}^{k} w_{uf} \sum_{l=1}^{n} m_{fl} \, a^I_{il} - \sum_{f=1}^{k} w_{uf} \sum_{l=1}^{n} m_{fl} \, a^I_{jl}.$   (3.4)

Note that introducing a bias term $m_{0f}$ (via an artificial attribute that is always set to 1) does not make sense for item mappings, because the bias part would be exactly the same for both sums.
This can be simplified to

$\hat{s}_{ui} - \hat{s}_{uj} = \sum_{f=1}^{k} \sum_{l=1}^{n} w_{uf} m_{fl} \, a^I_{il} - \sum_{f=1}^{k} \sum_{l=1}^{n} w_{uf} m_{fl} \, a^I_{jl} = \sum_{f=1}^{k} \sum_{l=1}^{n} w_{uf} m_{fl} \, (a^I_{il} - a^I_{jl}).$
For training with LearnBPR (see Algorithm 4), we need the partial derivative with respect to $m_{fl}$ for $f \in \{1, \ldots, k\}$, $l \in \{1, \ldots, n\}$:

$\frac{\partial}{\partial m_{fl}} (\hat{s}_{ui} - \hat{s}_{uj}) = w_{uf} \, (a^I_{il} - a^I_{jl}).$   (3.5)

The resulting learning algorithm for the linear mapping model optimized for BPR is shown in Algorithm 5. Note that we only need to update the mapping weights for those attributes where the two items drawn from $D_S$ differ.
Data: D_S, W, H, A^I
Result: M
1  initialize M
2  repeat
3      draw (u, i, j) from D_S
4      x̂_{uij} ← ŝ_{ui} − ŝ_{uj}
5      for 1 ≤ f ≤ k do
6          m_f ← m_f + α ( (e^{−x̂_{uij}} / (1 + e^{−x̂_{uij}})) · w_{uf} (a^I_i − a^I_j) − λ m_f )
7      end
8  until convergence

Algorithm 5: Learning algorithm for the linear attribute-to-feature mapping. The score difference ŝ_{ui} − ŝ_{uj} is defined in equation 3.4.
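A single update step of Algorithm 5 might look as follows in NumPy (an illustration with placeholder hyperparameters; the loop over the factors f is vectorized with an outer product):

```python
import numpy as np

def map_bpr_update(M, W, A, u, i, j, alpha=0.01, lam=0.001):
    """One stochastic update of Algorithm 5 for the triple (u, i, j);
    updates M in place.
    M: mapping weights (k x n); W: user factor matrix;
    A: binary item attribute matrix (one row per item)."""
    a_diff = A[i] - A[j]               # only differing attributes contribute
    x_uij = W[u] @ (M @ a_diff)        # score difference, equation 3.4
    sig = 1.0 / (1.0 + np.exp(x_uij))  # = e^{-x} / (1 + e^{-x})
    # gradient of x_uij w.r.t. m_{fl} is w_{uf} * (a_il - a_jl), eq. 3.5
    M += alpha * (sig * np.outer(W[u], a_diff) - lam * M)
```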
Run-Time Overhead
Generally, the runtime overhead of adding mapping functions to an existing factorization model is low. For each new entity, the factors need to be estimated once, and can either be stored in the pre-existing factor matrices or in special data structures. After that, the computation of a prediction takes the same time as with just the underlying model. Note that factorization models themselves are among the fastest state-of-the-art methods. The experimental part of this chapter contains a comparison in section 3.3.5 that shows the method's advantage over classical content-based filtering.
3.3 Experiments
We performed experiments to confirm that our approach is able to produce useful new-item cold-start recommendations. We compare the two mapping methods described in section 3.2 to other approaches capable of solving the new-item cold-start problem (section 3.3.3). We also investigated how the number of attributes affects the prediction accuracy (section 3.3.4).
3.3.1 Datasets
For the experiments, we use the MovieLens 1M dataset, which is described in section 2.5.1 of this thesis. MovieLens 1M is a commonly-used rating dataset [Gunawardana and Meek, 2009, Park and Chu, 2009].

Name        Source      # Attributes   Sparsity
genres      MovieLens   18             90.83 %
directors   IMDb        479            99.59 %
actors      IMDb        16,149         99.91 %
credits     IMDb        17,739         99.91 %

Table 3.2: Item attribute sets.

Like Gunawardana and
Meek [2009], we do not use the rating values, but just binary rating events, assuming that users tend to rate movies they have watched.
To evaluate the performance of recommender algorithms in the presence of new items, we randomly split the items in the dataset into 5 groups of roughly equal size, and assign all corresponding rating events, to perform 5-fold cross-validation. Note that the results of such a protocol are comparable to the results provided by Gunawardana and Meek [2009], but actually more robust, because Gunawardana and Meek [2009] only perform one experiment with 500 randomly chosen test users, whereas we perform 5 experiments on all available users.
As attributes, we use the genre information included with the MovieLens dataset, and additional information from the Internet Movie Database (IMDb)². Table 3.2 gives an overview of the attribute sets. All attributes used in the evaluation are nominal (set-valued); their representation is binary, i.e. every possible attribute is represented by one number that is 1 if the item has the attribute, and 0 if not. # Attributes refers to the number of attributes in the set, and Sparsity refers to the relative number of zero values in the movies' attribute representations, the matrix A^I. Note that the methods described here would also work for real-valued attributes. The credits attribute set contains actors, directors, producers, writers, cinematographers, etc. involved with the movies; it is a superset of the other two IMDb attribute sets.
3.3.2 Compared Methods
We report results for prec@5, prec@10, and AUC (see section 2.4.1).
We compared three mapping methods with two baseline methods. For the mapping methods, we computed BPR-MF models (see section 2.3.7) with k = 32 factors, using hyperparameters that yielded satisfactory results in non-cold-start evaluations.³
We also performed the experiments with different numbers of factors (k ∈ {32, 56, 80, 120, 160}), and got similar results.
map-knn: The kNN-based mapping method described in section 3.2.2; we determined suitable values for k using 4-fold cross-validation on the training data.⁴
map-lin: A linear mapping method that uses ridge regression to estimate the latent features from the attributes, described in section 3.2.2. We determined suitable values for the hyperparameters (learning rate, regularization constant) using 4-fold cross-validation on the training data.

² downloaded April 16, 2010
³ α = 0.01, λ^U = 0.02125, λ^I = λ^J = 0.00355, 265 iterations
⁴ For this and the other methods, we picked the hyperparameter (combinations) with the best prec@5 performance.
map-bpr: The linear mapping method optimized for BPR, described in section 3.2.2. Again, we determined suitable values for the hyperparameters (learning rate, regularization constant) using 4-fold cross-validation on the training data. For training, we performed NNZ · 2.5 stochastic updates to the mapping weights, where NNZ is the number of non-zero entries in the feedback matrix S.
cbf-knn: We used the cosine similarity (see section 2.3.2) between the items' binary attribute vectors as the similarity measure. We set k = ∞, so scores for user u and item i are computed by summing up the similarities of the item i with the items previously seen by user u:

$\hat{s}_{u,i} = \sum_{j \in I^+_u} \text{sim}(i, j).$   (3.6)
Note that this is content-based filtering using kNN (see section 2.3.2), not attribute-to-factor mapping via kNN as mentioned in section 3.2.2.
random: To put the other methods in perspective, we also included the results
for predicting a random set of items.
We do not compare against just recommending the most popular items,
because in our evaluation protocol there are only previously unseen items in
the test set, thus there is no popularity information about any of the candidate
items.
ubm: In the first experiment, we cite experimental results by Gunawardana and Meek [2009], who used a comparable evaluation protocol to evaluate Unified Boltzmann Machines.
3.3.3 Experiment 1: Method Comparison
The comparison of the aforementioned methods on the attribute sets genres, directors, and a combination of the two sets can be seen in Figures 3.2 to 3.4. Gunawardana and Meek [2009] used a similar evaluation protocol in their cold-start experiments: the same dataset, also an 80-20 split, but only evaluations for 500 randomly selected users, instead of all users. For genres, they report about 25% prec@5 for their primary method (Unified Boltzmann Machines). As is shown in Figure 3.2, the results for map-bpr also fall into this region, while map-knn and the two baseline methods perform considerably lower. For directors, map-bpr, map-knn, and cbf-knn are roughly on par. The comparison of map-lin and map-bpr shows that it is really worth training the mapping function for overall recommendation performance, instead of for least squares error on the latent features. Regarding the AUC metric (Figure 3.4), the results are similar.
Note that for cbf-knn, the results deteriorate when the two attribute sets are combined, while the two mapping methods, and in particular map-bpr, profit from the additional data. We think that cbf-knn's suboptimal results could be
54
COLD-START RECOMMENDATION
0.4
CHAPTER 3.
0.2
cbf−knn
random
0.0
0.1
prec@5
0.3
map−knn
map−lin
map−bpr
genres
directors
genres+
directors
Figure 3.2: Cold-start experiment: prec@5 results.
xed by computing separate item similarities for the dierent attribute sets and
then combining them, but we doubt that this would be a stronger method than
map-bpr.
3.3.4 Experiment 2: Large Attribute Sets
Next, we investigated the methods' performance on larger attribute sets (several thousand attributes). We notice (see Figures 3.5 and 3.7) that for large attribute sets the baseline method cbf-knn performs better than the mapping methods. Gunawardana and Meek [2009] observed similar behavior for their models, Unified Boltzmann Machines and Tied Boltzmann Machines [Gunawardana and Meek, 2008]: using only the genre data led to better results than using actor data (there: about 8,000 attributes) or the combined genres+actors data.
Again, the combination of attribute sets leads to a deterioration of the prediction quality for cbf-knn, while the mapping methods do not suffer from more data.
3.3.5 Run-Time Comparison
Figure 3.8 shows the test times per user for the different methods. The number of factors per entity is again 32 for map-bpr and map-knn. One can clearly see that the mapping methods profit from the underlying fast matrix factorization model, while the kNN-based content-based filtering cbf-knn takes several times longer to compute the predictions.
3.3.6 Reproducibility
All presented methods are available as part of the MyMediaLite software, which
is described in chapter 6.
[Figure 3.3: Cold-start experiment: prec@10 results.]
3.3.7 Discussion
The experiments have shown that for the new-item recommendation task, BPR-MF in combination with an attribute-to-feature mapping function yields accuracies comparable to state-of-the-art methods like Unified Boltzmann Machines (section 3.3.3).
The performance on large attribute sets could still be improved over content-based filtering with cosine similarity; however, this is a problem that other methods in the literature also suffer from (section 3.3.4). One reason for this could be that cosine similarity works particularly well for high-dimensional sparse data, and that linear models like map-bpr and simple models like map-knn (without much adaptation to the data) are not powerful enough to make use of large, sparse attribute sets. A remedy may be using a non-linear learned mapping function, e.g. based on multi-layer neural networks, or support vector regression [Smola and Schölkopf, 2004].
Additionally, the mapping approaches have the advantage of being much faster (section 3.3.5) than content-based filtering using kNN.
3.4 Related Work
The approaches to solving the cold-start problem can be roughly divided into two groups:
1. attribute-based methods use either item (content or meta-data) or user (demographic) attributes to make up for the lack of interaction data for the cold-start predictions.
2. active learning does not solve the cold-start problem immediately, but engages the user to gather the necessary feedback; the task here is to
gather the most informative feedback, i.e. feedback that can be used to make the best recommendations.

[Figure 3.4: Cold-start experiment: AUC results.]
Of course, those two approaches could also be combined.
Pazzani and Billsus [2007] and Lops et al. [2011] give overviews of content-based methods that can be used for new-item scenarios (see also section 2.3.3). Most content-based methods work fine for the new-item problem. However, they usually see learning the preferences of different users as separate and isolated tasks, which means they do not exploit the similarities between those tasks, as is done in collaborative filtering and in other multi-task learning scenarios [Caruana, 1997].
Earlier active learning approaches were presented by Boutilier et al. [2003] and Rashid et al. [2008].
Next, we discuss several factorization-based approaches to the cold-start problem. Note that none of the works mentioned below covers exactly our scenario, cold-start item recommendation from positive-only feedback. While several of those models would allow adaptation to such a scenario, this has, to the best of our knowledge, not been done yet.
One of the MF variants described in Koren et al. [2009] takes attributes into account for the rating prediction task; however, it is assumed that for every entity there is also collaborative information available, which makes the model unsuitable for cold-start scenarios in the narrower sense.
Pilászy and Tikk [2009] propose an MF model for rating prediction that maps attributes to the factor space using a linear transformation, based on a method proposed by Paterek [2007]. The method (NSVD1) can either handle user or item attributes; predictions are computed from item attributes by

$\hat{s}_{ui} = \langle w_u, \sum_{l=1}^{n} m_l \, a^I_{il} \rangle.$   (3.7)
[Figure 3.5: High-dimensional attribute sets: prec@5 results.]
This rating prediction method is similar to a special case of the framework presented here, but there are several differences, considering the concrete application as well as the model:
• NSVD1 is for rating prediction, while the models designed in this chapter deal with item recommendation (see section 2.2.2).
• Pilászy and Tikk's learning algorithm estimates all parameters at once, while we use a two-stage learning scheme (see section 3.2.1).
• If NSVD1 uses user and item attributes at the same time, then there are no free latent features in the models: the rating is estimated entirely from the entities' attributes; our model only uses the entity attributes if no collaborative information is known about the given entity.
Pilászy and Tikk learn the factors of one entity (e.g. the users) simultaneously with the mapping to the factors of the other entity (e.g. the items), which only exist implicitly via the mapping; the model is not based on a completely trained standard MF model, which is augmented by attribute-to-factor mappings like in our framework. In Pilászy and Tikk [2009] there is also a generalization of NSVD1 that takes both user and item attributes into account, and which has free latent features. Because of the free latent features, this generalization is not capable of generating cold-start recommendations; it could, however, be enabled to do so using our framework.
fLDA [Agarwal and Chen, 2010] uses content data for rating prediction. It combines one-way and two-way user-item interactions and jointly learns the parameters for those interactions. The authors assume a bag-of-words-like [Manning et al., 2008] structure for the content attributes of items, such that latent feature extraction based on LDA [Blei et al., 2003] is possible. Thus, the fLDA approach is restricted to bag-of-words features, whereas our approach can deal with any type of attributes (nominal, ordinal, metric); it is not applicable to new-user scenarios.

[Figure 3.6: High-dimensional attribute sets: prec@10 results.]

The same authors also proposed Regression-based Latent
Factor Models (RLFM) [Agarwal and Chen, 2009], a similar hybrid collaborative filtering method for rating prediction, which also works in cold-start scenarios. It was extended to Generalized Matrix Factorization (GFM) [Zhang et al., 2011]. According to the authors, by assuming Bernoulli-distributed observations, these models would also be suitable for item recommendation with positive and negative feedback; nevertheless, the suitability of the approach for that task, or even item recommendation from positive-only feedback, has not been shown empirically.
Pairwise Preference Regression [Park and Chu, 2009] is a regression model for rating prediction optimized for a personalized pairwise loss function. The two-way aspect model [Schein et al., 2002] is a variant of the aspect model [Hofmann and Puzicha, 1999] for the item recommendation and the rating prediction task. Filterbots [Sarwar et al., 1998] are a heuristic method to augment collaborative filtering systems with content data. Unified Boltzmann Machines [Gunawardana and Meek, 2009] are probabilistic models that learn from collaborative and content information by combining Untied Boltzmann Machines, which capture correlations between items, with Tied Boltzmann Machines [Gunawardana and Meek, 2008], which take content information into account.
Menon and Elkan [2010] suggest a latent feature log-linear model, which is a generalization of matrix factorization with the same loss function as logistic regression. Again, this model would be suitable for item recommendation with positive and negative feedback; making it work for positive-only feedback would require modifications to the method. It should be noted that the attribute-to-factor framework presented here could also be applied to that model.
[Figure 3.7: High-dimensional attribute sets: AUC results.]
3.5 Summary and Outlook
We presented a general and straightforward framework to make factorization models attribute-aware. The framework is applicable to both user and item attributes, and can deal with nominal/binary and real-valued attributes. We demonstrated the usefulness of the method by an extension of matrix factorization optimized for Bayesian Personalized Ranking (BPR) that is capable of making item recommendations for new items. The experimental evaluation on two different types of mappings (kNN and linear mappings optimized for BPR) showed that the method produces accurate predictions on par with state-of-the-art methods, and that it carries little run-time overhead.
We also showed empirically that it is worth training the mapping function for optimal model performance with respect to application-specific losses, instead of just trying to map the latent features as accurately as possible.
An appealing property of our framework is its simplicity and modularity: because its components are only loosely coupled, it can be used to enhance existing
factorization models to support new-user and new-item cold-start scenarios.
In the future, we can extend this work in several directions, among others with experiments on user attributes and real-valued (instead of binary) attributes. We also want to see whether the method produces similarly good results for other applications like rating or tag recommendation. As stated before, we will investigate how to improve mapping and prediction accuracy for large attribute sets by employing non-linear learned mapping functions like multi-layer neural networks or support vector regression. Last but not least, the mapping framework should be modified to allow a smooth transition between the two extremes: the cold-start scenario, where attributes are the only way of computing predictions, and normal operation, where latent factors learned from the interactions alone usually provide better results than attributes.
[Figure 3.8: Cold-start recommendation: test time per user, in milliseconds, for map-bpr, map-knn, and cbf-knn on the attribute sets genres, directors, actors, and credits.]
Chapter 4
Bayesian Personalized Ranking Revisited
Bayesian Personalized Ranking (BPR, introduced in section 2.3.7) is a per-user ranking approach that optimizes a smooth approximation of the area under the ROC curve (AUC, section 2.4.1). BPR has seen successful applications in item recommendation from implicit positive-only feedback [Rendle et al., 2009] (even though the framework is not limited to positive-only data) and in tag recommendation [Rendle and Schmidt-Thieme, 2010].
This chapter looks at some aspects of Bayesian Personalized Ranking. First of all, it relates the BPR framework to other findings in the machine learning literature, which should give the reader a clearer picture of the method (section 4.1). We then extend the BPR criterion by introducing weights, leading to the weighted BPR (WBPR) criterion, in order to make it suitable for applications that do not treat all users and items the same (section 4.2). Learning WBPR models can be achieved by adapting the sampling strategies accordingly.
The name 'Bayesian Personalized Ranking': Note that 'Bayesian' in 'Bayesian Personalized Ranking' is not used as it is commonly understood in the literature. Here, the term merely refers to the derivation of the optimization criterion as the MAP estimator given the interaction data and a prior distribution on the parameters.
4.1 Relation to Other Approaches
The original BPR paper derives the optimization criterion as an approximation of AUC optimization. The approximation is done by applying the logistic function to the difference of the item scores: g(ŝu,i − ŝu,j). Note that if the scores come from a matrix factorization model (section 2.3.7), and if we consider the item factors to be fixed, we have a kind of pairwise classification model using logistic regression that predicts whether item i is preferred to j by user u.
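To make the pairwise view concrete, here is a minimal sketch (not from the thesis; it assumes numpy and hypothetical toy factor matrices) that computes g(ŝu,i − ŝu,j) for a matrix factorization scorer:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical toy latent factor matrices: one row per user / per item
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 8))   # user factors
H = rng.normal(scale=0.1, size=(10, 8))  # item factors

def score(u, i):
    # ŝ_{u,i} = <w_u, h_i>
    return W[u] @ H[i]

def prob_i_over_j(u, i, j):
    # g(ŝ_{u,i} - ŝ_{u,j}): probability that user u prefers item i to item j,
    # analogous to a pairwise logistic regression classifier
    return sigmoid(score(u, i) - score(u, j))

print(prob_i_over_j(0, 2, 7))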
Reductions are a common concept in computer science. Solving a problem by reducing it to another one means expressing the original problem in terms of the other problem [Cormen et al., 2001]. Balcan et al. [2008] reduce ranking
to pairwise binary classification. They show that a 0/1 classification regret (the difference between the actual loss and the best possible loss) of r implies an AUC regret of at most 2r.
A similar approach is taken by BPR, only that it is extended to a model that does different classification/ranking tasks at once, one per user. Thus we have an example of multi-task learning [Caruana, 1997] here. Collaborative filtering
as multi-task learning has been discussed in several works, for example by Yu
and Tresp [2005] and Abernethy et al. [2009].
4.2 Weighted BPR
Models optimized for BPR are suitable when the items to be ranked are sampled
uniformly from the set of all items. Yet, this is not always the case, for example
when the items to be ranked are sampled according to their general popularity,
like in track 2 of the KDD Cup 2011 (see chapter 5).
To deal with such scenarios, we extend the BPR criterion to a probabilistic
ranking criterion that assumes the candidate items (those items that should
be ranked by the model) to be sampled from a given distribution. Using this
new, more general optimization criterion, we derive an extension of the generic
BPR learning algorithm (which is a variant of stochastic gradient ascent) that
samples its training examples according to the probability distribution used for
the candidate sampling, and thus optimizes the model for the new criterion.
Motivation: Non-Uniformly Distributed Candidate Items. Assume the candidate items are not sampled with uniform probabilities. Because the negative items in the training data are all weighted identically, and not according to the way the candidate items are sampled, optimizing a predictive model for BPR will not lead to optimal prediction quality. This issue can be solved by assigning adequate weights to the components of the optimization criterion.
4.2.1 Generic Weighted Bayesian Personalized Ranking
Taking into account non-uniform sampling probabilities of the negative items, and user weights that are not proportional to the number of feedback events provided by the users, a useful optimization criterion is then

WBPR(DS, Θ) = Σ_{(u,i,j)∈DS} wu wi wj ln g(ŝu,i,j(Θ)) − λ‖Θ‖²,   (4.1)

where ŝu,i,j(Θ) := ŝu,i(Θ) − ŝu,j(Θ) and DS = {(u, i, j) | i ∈ Iu+ ∧ j ∈ Iu−}. Θ represents the parameters of the model and λ is a regularization constant. wu is a weight that balances the contribution of each user to the criterion; wi and wj are the weights that determine the contributions of the positive and the negative item, respectively.
Note that WBPR is not limited to the task of the KDD Cup 2011: The
weights wu , wi , wj can be adapted to other scenarios/sampling probabilities. Of
course, the original BPR criterion is an instance of WBPR, where all weights
are set to 1 or another constant > 0. Now we will discuss two usage scenarios
of the new criterion.
4.2.2 Example 1: Non-Uniform Negative Item Weighting
Like in the task of track 2 of KDD Cup 2011, the weights for negative items could be proportional to their global popularity in the training data:

wj = Σ_{u∈U} δ(j ∈ Iu+).   (4.2)
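As a small illustration (a sketch with hypothetical toy data, not code from the thesis), these weights are plain popularity counts over the positive feedback:

from collections import Counter

# positive feedback as (user, item) pairs; hypothetical toy data
feedback = [(0, 3), (1, 3), (1, 5), (2, 3), (2, 7)]

# eq. 4.2: w_j = number of users whose positive item set contains j
w = Counter(item for _user, item in feedback)
print(w[3], w[5], w[7])  # -> 3 1 1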
4.2.3 Example 2: Uniform User Weights
The original BPR criterion gives higher weight to users with more feedback. This can be justified by the notion that parameters of users and items about which we have more feedback should be more influenced by the training data than the parameters of users and items we have less information about. However, this leads to the circumstance that the training set performance of a user with twice as many feedback events as another user is twice as important as the other user's performance. To equalize this, we can create an optimization criterion that maximizes the (approximated) AUC with all users weighted equally:

wu = (Σ_{i∈I} δ(i ∈ Iu+))^{−1}.   (4.3)
Note that such a WBPR model could still be improved by making the regularization part dependent on the number of feedback events, as with only one regularization constant for all user parameters it is very likely to either over- or underfit most users.
4.3 Sampling Strategies
How and what one samples can make a big difference in stochastic gradient descent/ascent learning algorithms. Implementation details which are not covered in this thesis can have a significant impact on the runtime of the algorithm. Furthermore, the weights in the optimization objective and the sampling procedures can be seen as two sides of one coin: to give an entity more weight, one can either derive larger learning steps from the objective, or one can sample the entity with a higher probability, while keeping the step size the same as for all entities.
4.3.1 Sampling According to Entity Weights
When training models for a weighted optimization target (like WBPR) using
stochastic gradient methods, there are basically two options to take the weights
into account: either by applying update steps proportional to the weights, or
by sampling examples proportionally to their weights. Because the weights can differ drastically between examples, it is not prudent to use the former option, as it can lead to unrealistic parameter values when applying large update steps.
This leaves us with the latter option, which means we have to adapt our sampling
procedures. Chen et al. [2011b] point out that this approach has been used in
cost-sensitive learning [Ting, 1998, Sheng and Ling, 2007]. Actually, weighted
BPR can be seen as a variant of cost-sensitive learning: entities with higher
weights have higher costs associated with them.
Data: Dtrain, α, λ
Result: Θ̂
1  initialize Θ̂
2  repeat
3    for (u, i) ∈ Dtrain do
4      draw j from Iu− proportionally to wj
5      Θ̂ ← Θ̂ + α ( e^{−ŝu,i,j} / (1 + e^{−ŝu,i,j}) · ∂ŝu,i,j/∂Θ̂ − λ · Θ̂ )
6    end
7  until convergence
Algorithm 6: LearnWBPR-Neg: Optimizing WBPR with non-uniform weights for the negative items using stochastic gradient ascent. The difference to LearnBPR is the sampling in line 4.
Data: Dtrain, α, λ
Result: Θ̂
1  initialize Θ̂
2  repeat
3    draw u from U proportionally to wu
4    draw i from Iu+ proportionally to wi
5    draw j from Iu− proportionally to wj
6    Θ̂ ← Θ̂ + α ( e^{−ŝu,i,j} / (1 + e^{−ŝu,i,j}) · ∂ŝu,i,j/∂Θ̂ − λ · Θ̂ )
7  until convergence
Algorithm 7: LearnWBPR: Optimizing WBPR using stochastic gradient ascent. α is the learning rate (step size).
To train a model according to the modified optimization criterion, we adapted the original learning algorithm (Algorithm 4); instead of sampling negative items uniformly, we sample them according to their overall popularity wj (line 4 in Algorithm 6).
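A minimal sketch of this sampling step (assuming numpy; the item set and weights below are hypothetical): instead of drawing negative items uniformly, they are drawn with probability proportional to wj.

import numpy as np

rng = np.random.default_rng(42)

def draw_negative(neg_items, w):
    # draw one negative item j proportionally to its weight w_j;
    # uniform sampling would correspond to p=None
    p = np.array([w[j] for j in neg_items], dtype=float)
    return rng.choice(neg_items, p=p / p.sum())

neg_items = [4, 5, 6]
w = {4: 10.0, 5: 1.0, 6: 1.0}  # the popular item 4 is drawn most often
print(draw_negative(neg_items, w))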
4.3.2 Matrix Factorization Optimized for WBPR
In the BPR framework, the pairwise prediction ŝu,i,j is often expressed as the difference of two single predictions:

ŝu,i,j := ŝu,i − ŝu,j.   (4.4)
We use the BPR framework and its adapted sampling extension to learn matrix factorization models with item biases:

ŝu,i := bi + ⟨wu, hi⟩,   (4.5)

where bi ∈ ℝ is a bias value for item i, wu ∈ ℝ^k is the latent factor vector representing the preferences of user u, and hi ∈ ℝ^k is the latent factor vector representing item i.
The optimization problem is then

max_{W,H,b} Σ_{(u,i,j)∈DS} wu wi wj ln g(bi − bj + ⟨wu, hi − hj⟩) − λU‖W‖² − λI‖H‖² − λb‖b‖².   (4.6)
The training algorithm LearnWBPR-MF-Neg (Algorithm 8) (approximately) optimizes this problem using stochastic gradient ascent. It is an instance of the generic LearnWBPR algorithm (Algorithm 7). The parameter updates make use of the partial derivatives of the local error with respect to the current parameter. The matrix entries must be initialized to non-zero values, because otherwise all gradients and regularization updates for them would be zero, and thus no learning would take place. The item bias vector b does not have this problem. Note that the λ constants in the learning algorithm are not exactly equivalent to their counterparts in the optimization criterion. We also have two different regularization constants λI and λJ, which lead to different regularization updates for positive and negative items.
4.4 Summary and Outlook
In this chapter, we revisited several aspects of the Bayesian Personalized Ranking (BPR) framework. We linked it to a remarkable result in the general machine learning literature, and extended the optimization criterion to the weighted BPR (WBPR) criterion, for which we provide a generic learning algorithm. We described one instance of this learning algorithm that trains matrix factorization models for the case of non-uniformly weighted negative items.
The WBPR framework can be applied to other scenarios, for example for providing personalized rankings of news articles that have been assigned weights/priorities by an editorial team. Another possible use of weighted BPR is learning scenarios with case weights, which could come up, for example, if we perform soft clustering [Ruspini, 1969] of users to densify the training data, and then use all available training examples with the cluster weights as case weights for the given user cluster.
Data: Dtrain, α, λU, λI, λJ, λb
Result: W, H, b
1  set entries of W and H to small random values
2  b ← 0
3  repeat
4    draw (u, i) from Dtrain
5    draw j from Iu− proportionally to wj
6    ŝu,i,j ← bi − bj + ⟨wu, hi − hj⟩
7    x ← e^{−ŝu,i,j} / (1 + e^{−ŝu,i,j})
8    bi ← bi + α (x − λb bi)
9    bj ← bj + α (−x − λb bj)
10   wu ← wu + α (x · (hi − hj) − λU wu)
11   hi ← hi + α (x · wu − λI hi)
12   hj ← hj + α (x · (−wu) − λJ hj)
13 until convergence
Algorithm 8: Optimizing a matrix factorization model for WBPR with non-uniform weights for the negative items using stochastic gradient ascent.
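For illustration, here is a compact Python sketch of one update step of Algorithm 8 (lines 6 to 12), assuming numpy arrays for W, H, and b; this is a sketch following the pseudocode, not MyMediaLite's C# implementation:

import numpy as np

def wbpr_mf_update(W, H, b, u, i, j, alpha, lam_u, lam_i, lam_j, lam_b):
    # lines 6-7: pairwise score and logistic factor x = e^{-s} / (1 + e^{-s})
    s = b[i] - b[j] + W[u] @ (H[i] - H[j])
    x = 1.0 / (1.0 + np.exp(s))
    # lines 8-12: gradient ascent steps, applied in the order of the pseudocode
    b[i] += alpha * (x - lam_b * b[i])
    b[j] += alpha * (-x - lam_b * b[j])
    W[u] += alpha * (x * (H[i] - H[j]) - lam_u * W[u])
    H[i] += alpha * (x * W[u] - lam_i * H[i])
    H[j] += alpha * (x * (-W[u]) - lam_j * H[j])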
An interesting aspect is the regularization of WBPR models. As mentioned when discussing equation 4.3, weighting the contribution of each user to the optimization criterion equally could lead to over- and underfitting. Using several regularization constants [Koren, 2009, Dror et al., 2011a], or even regularization functions that depend on the number of feedback events, in combination with an efficient hyperparameter search procedure, may lead to more accurate models.
In the following chapter 5, which is about music recommendation, we provide
a use case for non-uniform item sampling for WBPR, including a range of
experiments.
Chapter 5
Recommending Songs
In this chapter, we describe how some of the methods laid out in the preceding
chapter can be applied to a real-world problem: music recommendation. In
particular, we describe the approach we used for track 2 of the KDD Cup 2011,
where the task was to recommend songs to individual users.
The KDD Cup 2011 consisted of two separate tracks. Track 1 was about
rating prediction (see section 2.2.1). The task of track 2 was to predict which 3
out of 6 candidate songs were positively rated (higher than a certain threshold)
instead of not rated at all by a user. The candidate items were not sampled
uniformly, but according to their general popularity, i.e. the number of users
who gave a positive rating to them.
We use the weighted BPR (WBPR) optimization criterion described in section 4.2 that takes the non-uniform sampling of negative test items into account, together with the modified version of the generic BPR learning algorithm, which maximizes the new criterion by adapting the sampling process (section 4.3.1).
We use the learning algorithm to train ranking matrix factorization models as
components of an ensemble. Additionally, we combine the ranking predictions
with rating prediction models to also take into account the rating data in the
provided dataset.
With an ensemble of such combined models, we achieved an error of 4.49%, which means that our method selected a wrong (random instead of preferred) song in fewer than 1 out of 20 cases. We ranked 8th (out of more than 1850 teams [Dror et al., 2011b]) in track 2 of the KDD Cup 2011, without exploiting the additional taxonomic information available in the dataset.
5.1 Problem Statement
The task of track 2 of the 2011 KDD Cup was to predict which 3 songs¹ out of 6 candidates a user will like (rate with a score of 80 or higher on a scale from 0 to 100), for a set of users, given the past ratings of a superset of the users.
Additionally, an item taxonomy expressing relations between songs, albums,
artists, and genres was provided by the contest organizers [Dror et al., 2011b].
We did not use this additional data in our approach.
¹ A song is called a 'track' in the competition description. We will use the term 'song' to avoid confusion with tracks 1 and 2 of the KDD Cup.
[Figure 5.1: Task of KDD Cup 2011, track 2: distinguish between songs a user liked (rating >= 80) and songs the user has not rated. Items that have been rated below 80 by the user are not present in the test dataset.]
[Figure 5.2: The 'liked' contrast: we say that a user likes an item if they rated it with 80 or higher; the contrast separates ratings >= 80 from ratings < 80 and unrated items.]
The 3 candidate songs that have not been rated highly by the user have not
been rated at all by the user. They were not sampled uniformly, but according
to how often they are rated highly in the overall dataset.
To put it briefly, the task was to distinguish items (in this case songs) that were likely to be rated with a score of 80 or higher by the user from items that were generally popular, but not rated by the user (Figure 5.1). This is similar to the task of distinguishing the highly rated items from generally popular ones, which we call the 'liked' contrast (Figure 5.2).
Generally, the training and testing sets in track 2 of the KDD Cup 2011 have
the following structure:
Dtrain ⊂ U × I × [0, 100]   (5.1)
Dtest ⊂ U × I × {0, 1},   (5.2)

with ∀(u, i, pu,i) ∈ Dtest : ¬∃(u, i, ru,i) ∈ Dtrain. The training set Dtrain contains ratings, and the testing set Dtest contains binary variables that represent whether a user has rated an item with a score of at least 80 or not.
5.1.1 Evaluation Criterion
The evaluation criterion is the error rate, which is just the relative number of
wrong predictions:
e = 1 − (1 / |Dtest|) Σ_{(u,i,pu,i)∈Dtest} δ(pu,i = p̂u,i),   (5.3)

where δ(x = y) is 1 if the condition (in this case: x = y) holds, and 0 otherwise, and p̂u,i is the prediction whether item i is rated 80 or higher by user u. For a single user, the error rate is

eu = 1 − (1 / (|Iu+,test| + |Iu−,test|)) Σ_{(u,i,pu,i)∈Dtest} δ(pu,i = p̂u,i).   (5.4)
For the KDD Cup 2011, we have the additional constraints that for every highly
rated item of each user, there is an item that has not been rated in the evaluation
set Dtest , and that exactly half of the candidate items must be given a prediction
of p̂u,i = 1. We call this the 1-vs.-1 evaluation scheme.
5.2 Methods
The obvious approach for track 2 of KDD Cup 2011 is to assign scores to the 6
candidate items of each user, and then to pick the 3 highest-scoring candidates.
This is similar to classical top-N item recommendation. The decision function
is

p̂u,i = 1, if |{j | (u, j) ∈ Dtest ∧ ŝu,i > ŝu,j}| ≥ 3; 0, else.   (5.5)
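A small plain-Python sketch of this decision rule (with hypothetical names): since exactly 3 of the 6 candidates must be predicted positive, the rule is equivalent to picking the top 3 by score (ignoring ties).

def predict(candidates, score):
    # candidates: the 6 candidate items of one user
    # score: function mapping an item to its predicted score ŝ_{u,i}
    top3 = set(sorted(candidates, key=score, reverse=True)[:3])
    return {i: int(i in top3) for i in candidates}

scores = {11: 0.9, 12: 0.1, 13: 0.7, 14: 0.4, 15: 0.8, 16: 0.2}
print(predict(list(scores), scores.get))  # items 11, 15, 13 get p̂ = 1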
5.2.1 Optimizing for the Competition Objective
The area under the ROC curve (AUC, see section 2.4.1) is a ranking measure
that can also be computed for the KDD Cup scenario.
Lemma 1. In the 1-vs.-1 evaluation scheme, the per-user accuracy 1 − eu grows strictly monotonically with the per-user area under the ROC curve (AUC), and vice versa.
Proof. Items are ordered according to their scores ŝu,i. Let ntp and ntn be the numbers of true positives and true negatives, respectively. Given Iu+, Iu−, we have AUC(u) = (ntp · ntn) / (|Iu+| · |Iu−|) < 1 and 1 − eu = (ntp + ntn) / (|Iu+| + |Iu−|) < 1. If the scores change such that p̂′u,i ≠ p̂u,i for exactly two items that have been wrongly classified before, then AUC′(u) = ((ntp + 1) · (ntn + 1)) / (|Iu+| · |Iu−|) > AUC(u) and 1 − e′u = (ntp + 1 + ntn + 1) / (|Iu+| + |Iu−|) > 1 − eu.
This means that maximizing the user-wise AUC on the training data (while preventing overfitting) is a viable strategy for learning models that perform well under the 1-vs.-1 evaluation scheme.
5.2.2 Matrix Factorization Optimized for Weighted BPR
Matrix factorization models are suitable prediction models for recommender
systems, and are known to work well for item recommendation when trained
using the BPR framework, which optimizes the user-wise AUC. Thus, we used
matrix factorization for the KDD Cup (section 2.3.7). In particular, we used the weighted BPR (WBPR, section 4.2) approach to account for the specific evaluation criterion of the KDD Cup 2011.
5.2.3 Ensembles
To get more accurate predictions, we trained models for different numbers of factors k and with different regularization settings. We combined the results of the different models, and of the same models at different training stages. We used two different combination schemes: score averaging and vote averaging.
Score Averaging. If models have similar output ranges, for example the same model at different training stages, we can achieve more accurate predictions by averaging the scores predicted by the models:

ŝ^{score-ens}_{u,i} = Σ_m ŝ^{(m)}_{u,i}.   (5.6)
Vote Averaging. If we do not know whether the scale of the scores is comparable, we can still average the voting decisions of the different models:

ŝ^{vote-ens}_{u,i} = Σ_m p̂^{(m)}_{u,i}.   (5.7)
Other possible combination schemes would be ranking ensembles [Rendle and
Schmidt-Thieme, 2009], and of course weighted variants of all schemes discussed
here.
Greedy Forward Selection of Models. Because selecting the optimal set of models to use in an ensemble is not feasible if the number of models is high, we perform a greedy forward search to find a good set of ensemble components (see the sketch below). This search procedure tries all candidate components, sorted by their validation set accuracy, and adds a candidate to the ensemble if it improves the current mix. When searching a large number (> 2,000) of models, we ignored candidates above a given error threshold.
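The search can be sketched in a few lines of Python (hypothetical helper names; not the exact competition code):

def greedy_forward_selection(candidates, ensemble_error):
    # candidates: model predictions, pre-sorted by validation error (best first)
    # ensemble_error: function mapping a list of members to a validation error
    ensemble, best = [], float("inf")
    for model in candidates:
        trial = ensemble_error(ensemble + [model])
        if trial < best:  # keep the candidate only if it improves the mix
            ensemble.append(model)
            best = trial
    return ensemble, best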
Data: Dtrain, Dtest, n
Result: Dtrain-val, Dtest-val
1  Dtrain-val ← Dtrain
2  Utest ← {u | (u, i, pu,i) ∈ Dtest}
3  forall the u ∈ Utest do
4    I+ ← {n random items from Iu+(t2)}
5    Dtest-val ← Dtest-val ∪̇ {u} × I+ × {1}
6    I− ← {n items from Iu−(t2) sampled prop. to popularity}
7    Dtest-val ← Dtest-val ∪̇ {u} × I− × {0}
8    forall the i ∈ I+ ∪̇ I− do
9      Dtrain-val ← Dtrain-val − {(u, i, ru,i)}
10   end
11 end
Algorithm 9: Sampling procedure for the validation split.
5.2.4 Incorporating Rating Information
Except for the rating threshold of 80, the methods presented so far do not take into account the actual rating values. We suggest two different schemes of combining probabilities of whether an item has been rated by a user with rating predictions produced by a matrix factorization model that incorporates user and item biases [Koren et al., 2009, Rendle and Schmidt-Thieme, 2008]:
min_{W,H,bU,bI} Σ_{(u,i,ru,i)∈Dtrain} (rmin + g(µ + bU_u + bI_i + ⟨wu, hi⟩) · (rmax − rmin) − ru,i)² + λb (‖bU‖² + ‖bI‖²) + λU‖W‖² + λI‖H‖²,   (5.8)

where µ is the global rating average, and [rmin, rmax] is the rating range. The model is trained using stochastic gradient descent with the bold-driver heuristic, which dynamically adapts the learning rate. Using this heuristic for learning matrix factorizations was first suggested by Gemulla et al. [2011].
First, we describe how we compute probabilities from prediction scores of
models that were trained to decide whether an item has been rated or not (Figure
5.3). After that, we will describe how such probabilities can be combined with
rating predictions.
Estimating Probabilities. The probability that item i has been rated by user u is estimated as

p̂^{rated}_{u,i} = Σ_{k=1}^{5} Σ_{l=k+1}^{5} Σ_{m=l+1}^{5} g(ŝ^{rated}_{u,i,j_k}) g(ŝ^{rated}_{u,i,j_l}) g(ŝ^{rated}_{u,i,j_m}),   (5.9)

where ŝ^{rated}_{u,i,j_1} ... ŝ^{rated}_{u,i,j_5} refer to the score estimates of the other 5 candidates. Note that the models for ŝ^{rated} are trained using all ratings as input, not just those of 80 or higher. The intuition behind this way of probability estimation is as follows: g(ŝ^{rated}_{u,i,j_k}) ∈ (0, 1) can be interpreted, similar to the case of logistic regression (e.g. [Bishop, 2006]), as the probability that item i is ranked higher (more likely to be rated) than item j_k by user u. We know that exactly 3 items are rated by the user, which means we need to estimate how probable it is that a given item is ranked higher than 3 other items. Equation 5.9 sums up the probabilities for the different cases where this holds.
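Equation 5.9 is a sum over all ways of choosing 3 of the 5 other candidates; a direct transcription as a sketch (assuming the 5 pairwise scores are given; names hypothetical):

import math
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_rated(pairwise_scores):
    # pairwise_scores: the 5 values ŝ^rated_{u,i,j_1}, ..., ŝ^rated_{u,i,j_5}
    # eq. 5.9: sum over all triples k < l < m of the other candidates
    return sum(sigmoid(a) * sigmoid(b) * sigmoid(c)
               for a, b, c in combinations(pairwise_scores, 3))

print(p_rated([1.2, 0.3, -0.5, 2.0, -1.1]))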
Scheme 1: Multiplication with Rating Prediction. The first scheme takes a 'rated' probability and multiplies it with a rating prediction from a model trained on the rating data:

ŝ^{one}_{u,i} = p̂^{rated}_{u,i} · r̂u,i,   (5.10)

where r̂u,i is the predicted rating.
Scheme 2: Multiplication with Rating Probability. The second scheme takes a 'rated' probability and multiplies it with the probability that the item, if rated, gets a rating of 80 or more by the user:

ŝ^{two}_{u,i} = p̂^{rated}_{u,i} · p̂^{≥80}_{u,i},   (5.11)
where p̂^{≥80}_{u,i} is the estimated probability of ru,i ≥ 80. We estimate p̂^{≥80}_{u,i} using several different rating prediction models:

p̂^{≥80}_{u,i} = Σ_k δ(r̂^{(k)}_{u,i} ≥ 80).   (5.12)
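A sketch of scheme 2 (eq. 5.11 combined with eq. 5.12), assuming a list of per-model rating predictions for one user-item pair; all names are hypothetical:

def s_two(p_rated, rating_predictions, threshold=80):
    # eq. 5.12: number of rating models that predict a rating >= 80
    p_high = sum(1 for r in rating_predictions if r >= threshold)
    # eq. 5.11: combine with the 'rated' probability
    return p_rated * p_high

print(s_two(0.6, [85, 78, 92, 81]))  # 3 of 4 models vote 'high' -> 1.8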
5.2.5 Contrasts
Depending on the exact contrast we wish to learn, there are different conditions for what is in the set of positive (Iu+) and negative (Iu−) items for each user.
Track 2 Contrast. The contrast to be learned for the KDD Cup 2011 ignores all ratings below a score of 80. Such ratings are not used for sampling the negative candidate items; only items that are not rated by users are potential candidates (Figure 5.1):

Iu+(t2) := {i | ∃ru,i ≥ 80 : (u, i, ru,i) ∈ Dtrain}   (5.13)
Iu−(t2) := I \ {i | ∃ru,i : (u, i, ru,i) ∈ Dtrain}   (5.14)

Note that all items i with ru,i < 80 do not belong to either of the two sets.
Liked Contrast. The 'liked' contrast differentiates between what users have rated highly (80 or more), and what they have not rated or rated with a score below 80 (Figure 5.2):

Iu+(liked) := {i | ∃ru,i ≥ 80 : (u, i, ru,i) ∈ Dtrain}   (5.15)
Iu−(liked) := I \ Iu+(liked)   (5.16)

As can easily be seen from the definition of Iu−(liked), the split between positive and negative items is exhaustive for each user.
Rated Contrast. Finally, the 'rated' contrast differentiates what users have rated vs. not rated (Figure 5.3):

Iu+(rated) := {i | ∃ru,i : (u, i, ru,i) ∈ Dtrain}   (5.17)
Iu−(rated) := I \ Iu+(rated)   (5.18)

Again, this split is exhaustive for each user.
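The three contrasts differ only in how Iu+ and Iu− are built from one user's training ratings; the following sketch (hypothetical toy data) mirrors equations 5.13 to 5.18:

def contrasts(all_items, user_ratings, threshold=80):
    # user_ratings: dict item -> rating of one user, taken from D^train
    rated = set(user_ratings)
    liked = {i for i, r in user_ratings.items() if r >= threshold}
    return {
        "t2":    (liked, all_items - rated),  # eqs. 5.13 / 5.14
        "liked": (liked, all_items - liked),  # eqs. 5.15 / 5.16
        "rated": (rated, all_items - rated),  # eqs. 5.17 / 5.18
    }

pos, neg = contrasts(set(range(6)), {1: 95, 2: 40, 3: 80})["t2"]
print(sorted(pos), sorted(neg))  # [1, 3] [0, 4, 5]  (item 2 is in neither)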
5.3 Experiments
5.3.1 Datasets
We created a validation split from the training set (see section 2.5.3) so that we could estimate the accuracy of different models, and use those estimates to drive the composition of ensemble models. The procedure to create the split, based on the task description of track 2², is described in Algorithm 9. In the case of the KDD Cup data, the number of positive items per user in the test set is n = 3. Table 5.1 shows the characteristics of the different splits.

² http://kddcup.yahoo.com/datasets.php
[Figure 5.3: The 'rated' contrast: the question is not how a user has rated an item, but if.]
                 Ratings
                 validation split   competition split
# users          249,012            249,012
# items          296,111            296,111
# ratings        61,640,890         61,944,406
sparsity         0.999164           0.9991599
# test users     101,172            101,172
# test items     128,114            118,363

                 Ratings ≥ 80
                 validation split   competition split
# users          248,502            248,529
# items          289,234            289,303
# ratings        22,395,798         22,699,314
sparsity         0.9996884          0.9996843
# test users     101,172            101,172
# test items     128,114            118,363

Table 5.1: Characteristics of the validation and competition splits when considering all ratings (Figure 5.3) and the ratings of 80 or more (Figure 5.2), respectively.
Model   Hyperparameters                                                  RMSE    MAE
MF      k = 40, λU = 2.3, λI = 1.4, λb = 0.009, α = 0.00002, i = 30      25.37   16.88
MF      k = 60, λU = 3.9, λI = 1.7, λb = 0.00005, α = 0.00005, i = 55    25.35   16.67

Table 5.2: Rating prediction accuracy on the validation split for different matrix factorization models (eq. 5.8).
5.3.2 Rating Prediction
Table 5.2 contains the rating prediction accuracy in terms of root mean square error (RMSE, see section 2.4.1) and mean absolute error (MAE) on the validation split for different hyperparameter combinations.
5.3.3 Track 2 Results
We trained all models on both splits. Some results for the validation splits, and
from the leaderboard (the Test1 set) are in Table 5.3.
5.3.4 Final Submission
For our final submission (see Table 5.3), we used the second rating integration scheme (eq. 5.11). To estimate p̂^{rated}_{u,i}, we created a score ensemble (section 5.2.3) from the candidate models described in Table 5.4, with a candidate error threshold of 5.2%; models with a higher validation error were not considered for the ensemble. We estimated the probabilities for a high rating p̂^{≥80}_{u,i} according to eq. 5.12, from the models listed in Table 5.5.
Model                      Hyperparameters                                                       Validation   Leaderboard
most popular               -                                                                     29.8027      42.8546
most rated                 -                                                                     29.0802      42.8810
WR-MF [Hu et al., 2008]    k = 60, λ = .0001, cpos = 320, i = 30                                 8.90         13.7587
WBPR-MF (liked contrast)   k = 20, λU = λI = λJ = 0.005, i = 86                                  6.275        8.7482
WBPR-MF (liked contrast)   k = 240, λU = .01, λI = .005, λJ = .0005, λb = .0000001, i = 222      6.089        6.0449
WBPR-MF (liked contrast)   k = 320, λU = .01, λI = .0025, λJ = .0005, λb = .0000001, i = 322     5.4103       5.8944
WBPR-MF (rated contrast)   k = 320, λU = .0075, λI = .005, λJ = .00025, λb = .000015, i = 53     5.5948       6.0819
ensemble                   55 different models of WBPR-MF with k = 400                           3.80178      5.2996
final submission           see section 5.3.4                                                     -            4.4929

Table 5.3: Validation set and KDD Cup 2011 leaderboard error percentages for different models. i refers to the number of iterations used to train the model. See the method section for details about these methods.

k     λU      λI                         λJ                            λb                     α      i                   #
480   0.005   {0.0015, 0.0025, 0.0035}   {0.00015, 0.00025, 0.00035}   {0.000015, 0.00002}    0.04   {10, . . . , 200}   3,420

Table 5.4: Candidate components of the score ensemble used for estimating p̂^{rated}_{u,i} (section 5.2.3). The last column shows the number of different models resulting from combining the hyperparameter values in that row.
k    λU                λI                  λb                            α         i                                                     #
40   {1.9, 2.0, 2.2}   {0.8, 1.0, 1.2}     {0.000075, 0.0001, 0.0075}    0.00002   {8, . . . , 11, 20, 24, 30, 31, 33, 38, . . . , 41}   351
40   {2.1, 2.3}        {1.1, 1.4}          {0.006, 0.0075, 0.009}        0.00002   {8, . . . , 11, 20, 24, 30, 31, 33, 38, . . . , 41}   156
60   {3, 3.5}          {1.1, 1.25, 1.5}    {0.0000075, 0.00005}          0.00005   {30, 50, 70, 89, . . . , 93}                          84
60   {3.4, 3.9}        {1.2, 1.5, 1.7}     {0.00005}                     0.00005   {30, 50, 70, 89, . . . , 93}                          48

Table 5.5: Rating prediction models used for estimating p̂^{≥80}_{u,i} (eq. 5.12) in the final KDD Cup submission. The last column shows the number of different models resulting from combining the hyperparameter values in that row.
5.4 Related Work
Related Work
The closest related work to the methods presented and evaluated in this chapter
are obviously the publications at the KDD Cup 2011 workshop [Dror et al.,
2011b], which is briey summarized here.
McKenzie et al. [2011], the winners of the challenge, combined dozens of different approaches. Among other things, they employed different factorization and linear methods optimized for ranking criteria like BPR and for element-wise criteria, as well as kNN-based methods and several techniques to take taxonomy information into account. Additionally, classifiers like SVMs and a neural network were used. The methods were combined linearly by a bagging [Breiman, 1996] method using random coordinate descent [Li and Lin, 2007], and non-linearly by AdaBoost [Freund and Schapire, 1997], LogitBoost [Friedman et al., 2000] and Random Forests [Breiman, 2001]. It is worth noting that they also combined rating predictions with the output of item recommendation methods, similar to the approach we describe in section 5.2.4.
Lai et al. [2011] use an ensemble of factorization models, content-based models, and neighborhood models, plus specific post-processing rules to fine-tune their predictions.
Other participants [Xie et al., 2011, Balakrishnan et al., 2011, Kong et al., 2011] also used engineered features, which were fed into different models like SVMs, logistic regression, generalized linear models, neural networks, Random Forests, and gradient boosted decision trees [Friedman, 2002].
Jahrer and Töscher [2011] suggest ranking methods based on the direct optimization of the error using stochastic gradient descent. Models they use are matrix factorization (see section 2.3.6), asymmetric factor models (Paterek [2007], both for users and items), asymmetric factor models with a flipped taxonomy, user- and item-based kNN using the Pearson similarity (see section 2.3.2), item-based kNN with matrix factorization features, and restricted Boltzmann machines [Salakhutdinov et al., 2007]. Blending of several models was done with a neural network [Töscher et al., 2010].
Mnih [2011] suggests a BPR factorization model with shared factors between tracks, albums, and artists, but disregarding genres. As we do with WBPR, they use a popularity-based distribution for sampling negative items in order to optimize for the task-specific measure. In addition to latent factors, the author came up with manually engineered features that he integrated into his models.
5.5 Summary and Outlook
We described how the optimization criterion WBPR can be applied to music recommendation, as in track 2 of the KDD Cup 2011. In addition to ensembles of different WBPR matrix factorization models, we enhanced the predictions by integrating additional rating information. The experiments presented in this chapter, and the ranking on the KDD Cup leaderboard (even though we did not make use of the additional taxonomy information), suggest that our methods are suitable for such recommendation tasks.
We should also point out that in a real-world application, we would of course
make use of the taxonomy information about the songs, as well as content
features of the songs [Celma, 2010].
While the winning team [McKenzie et al., 2011, Chen et al., 2011a] used taxonomy information, it is remarkable how much can be done without taking it into account. As shown in the Netflix Prize [Koren, 2009, Koren et al., 2009, Koren, 2010, Takács et al., 2008, 2009, Töscher et al., 2008], and again in the KDD Cup 2011, automatic learning algorithms are able to extract predictive features from interaction data, without even looking at the content.
There are several aspects worth further investigation.
First of all, we reduce a classification problem (optimization for the error rate) to a ranking problem, which we again solve using a reduction to pairwise classification. While in general item recommendation scenarios ranking is the problem we want to solve, it would still be interesting to see whether improvements are possible by directly training a classifier.
We have not used the item taxonomy, so a next step will be to make use of this additional information, as well as trying other ways of integrating the rating information (see section 5.2.4). A fully Bayesian treatment of the WBPR framework, i.e. by estimating parameter distributions [Freudenthaler et al., 2011], could yield models that have fewer hyperparameters, while having accuracies comparable to ensembles of the current models.
For the competition, we performed all training on the 'liked' (Figure 5.2) and 'rated' (Figure 5.3) contrasts, but not on the proper contrast (Figure 5.1) that was used for evaluation in the KDD Cup. We could investigate whether there are significant benefits when learning the correct contrast.
Chapter 6
The MyMediaLite Library
In this chapter, we describe MyMediaLite, a fast and scalable, multi-purpose library of recommender system algorithms, aimed at both researchers and practitioners. MyMediaLite implements all algorithms discussed in this thesis, plus several other methods from the literature. MyMediaLite addresses two common scenarios in collaborative filtering: rating prediction (e.g. on a scale of 1 to 5 stars) and item recommendation from positive-only feedback (e.g. from clicks, likes, or purchase actions). The library offers state-of-the-art algorithms for those two tasks. Programs that expose most of the library's functionality, plus a GUI demo, are included in the package. Efficient data structures and a common API are used by the implemented algorithms, and may be used to implement further algorithms. The API also contains methods for real-time updates and loading/storing of already trained recommender models.
MyMediaLite is free/open source software, distributed under the terms of the GNU General Public License (GPL)¹. Its methods have been used in four different industrial field trials of the MyMedia project², including one trial involving over 50,000 households [Marrow et al., 2010]. In the following, we describe MyMediaLite's features, and compare it to existing free/open source recommender system software.
6.1 Motivation: Free Software for Research
In general machine learning and data mining, as well as in specific sub-domains like computer vision and text mining/natural language processing, there exist free/open source collections of common algorithms and evaluation protocols which are in broad use. Examples for such packages are Weka [Hall et al., 2009], R [Ihaka and Gentleman, 1996], scikit-learn [Pedregosa et al., 2011], Shogun [Sonnenburg et al., 2010], and RapidMiner for general machine learning, OpenCV [Bradski and Kaehler, 2008] for vision, and GATE [Cunningham et al., 2011] for text mining. The recommender systems community, both researchers and technology users, could of course also profit from the availability of one or more such software packages.
¹ http://www.gnu.org/copyleft/gpl.html
² http://www.mymediaproject.org
Free/open source implementations of recommender system algorithms are
desirable for three reasons:
1. They relieve researchers from implementing existing methods for their
experiments, either for comparing them against newly designed methods,
or as recommendation methods in other kinds of studies, e.g. in user
interface research,
2. they can play a crucial role in the practical adoption of newly developed
techniques, either by providing software that can be directly adapted and
deployed, or by at least giving example implementations,
3. and finally, they can be used for (self-)teaching future recommender systems researchers and practitioners.
Additionally, well-designed software frameworks can make the implementation
and evaluation of new algorithms much easier. Ekstrand et al. [2011] argue
that publicly available algorithm implementations should be the standard in
recommender system research.
MyMediaLite³ aims to be software developed with all of these aspects in mind. It targets both academic and industrial users, who may use the existing algorithms in the library, or use the framework for the implementation and evaluation of new algorithms.

³ Here and in the appendix we describe MyMediaLite 3.01, released in May 2012, unless stated otherwise.
6.2 Feature Overview
MyMediaLite addresses two common scenarios in collaborative filtering: rating prediction (e.g. on a scale of 1 to 5 stars) and item recommendation from positive-only feedback (e.g. from clicks or purchase actions). It offers state-of-the-art algorithms for those two tasks, plus incremental updates (where feasible), serialization of computed models, and a rich choice of evaluation protocols.
MyMediaLite is implemented in C#, and runs on the .NET platform. With the free .NET implementation Mono, it can be used on all common operating systems. Using the library is not limited to C#, though: it can be easily called from other languages like C++ (by embedding the Mono runtime into the native code), Java (via IKVM), F#, Ruby, and Python; code examples are included with the software.
6.2.1 Recommendation Tasks
We give a brief overview of the recommendation methods available for each task.
Details on how to use these recommenders can be found in appendix A.5.
Rating Prediction
MyMediaLite contains different variants of k-nearest-neighbor (kNN) models [Linden et al., 2003], simple baseline methods (averages, biases, time-dependent biases [Koren, 2009], Slope-One [Lemire and Maclachlan, 2005], co-clustering [George and Merugu, 2005]; see section 2.3.1), and modern matrix factorization
methods (see sections 2.3.6 and 9) [Rendle and Schmidt-Thieme, 2008, Rennie and Srebro, 2005, Koren et al., 2009, Koren, 2008] for the task of rating
prediction (see section 2.2.1).
Item Recommendation from Positive-Only Feedback
The library contains kNN models for this task (see section 2.2.2), as well as
simple baselines (random/most popular item; see section 2.3.1) and advanced
matrix factorization methods like WR-MF [Hu et al., 2008], BPR-MF [Rendle et al., 2009] and WBPR-MF (section 4.3.2). Additionally, it contains an
implementation of the mapping approach presented in chapter 3.
Group Recommendation
Recommendations to user groups instead of individual users can be provided by aggregating predicted scores for user-item combinations according to different schemes [Baltrunas et al., 2010] (see the sketch after this list):
• minimum: use the lowest score for the group decision,
• maximum: use the maximum score,
• average: use the mean score,
• weighted average: use the average score, weighted by the number of ratings for each user in the training data, and
• pairwise wins: pick the item that is ranked above the other candidate items most frequently.
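The following sketch illustrates the first four aggregation schemes for one candidate item (plain Python with hypothetical names, not MyMediaLite's actual API); 'pairwise wins' is omitted because it needs the scores of all candidate items at once.

def aggregate(scores, weights, scheme):
    # scores: predicted scores of one item, one entry per group member
    # weights: e.g. the number of ratings per member ('weighted average')
    if scheme == "minimum":
        return min(scores)
    if scheme == "maximum":
        return max(scores)
    if scheme == "average":
        return sum(scores) / len(scores)
    if scheme == "weighted average":
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    raise ValueError("unknown scheme: " + scheme)

print(aggregate([3.5, 4.0, 2.0], [10, 50, 5], "weighted average"))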
6.2.2 Command-Line Tools
For each of the two main recommendation tasks (group recommendation is included in the item recommendation tool), MyMediaLite comes with a command-line program that allows users to train and evaluate all available recommenders on data provided in text files, without having to write a single line of code. Newly developed recommenders are automatically detected, and do not have to be manually added to the programs. Most of the other library features described here are exposed to the user by the command-line programs; if not, this is explicitly mentioned.
Detailed usage information for the command-line tools can be found in section A.2 of the appendix.
6.2.3 Data Sources
Besides collaborative data, i.e. the ratings in the case of rating prediction and the positive-only user feedback in the case of item recommendation, recommenders in MyMediaLite may also access other kinds of data: user attributes (like geographic location, age, profession) or item attributes (categories, keywords, etc.), and relations between users (e.g. the social network) or items (taxonomies, TV series), respectively. Algorithms in the library that make use of this data are, for example, attribute-based kNN methods [Billsus et al., 2000], a linear model
optimized for BPR [Gantner et al., 2010a], and SocialMF, a matrix factorization
model that takes the social network of the users into account [Jamali and Ester,
2010].
MyMediaLite contains routines to read such data from SQL databases (not supported by the command-line programs) and from simple text files.
6.2.4 Evaluation
MyMediaLite contains routines for computing evaluation measures [Herlocker
et al., 2004] like root mean square error (RMSE) and mean absolute error (MAE)
for the rating prediction task; for item recommendation, it supports area under
the ROC curve (AUC), precision at n (prec@n), mean average precision (MAP),
mean reciprocal rank (MRR) [Voorhees, 2000], and normalized discounted cumulative gain (NDCG).
Besides giving the user the possibility of creating arbitrary train-test splits and feeding them to MyMediaLite, the library implements several protocols for splitting the data provided by the user:
1. simple splits: use n% of the data for testing
2. k -fold cross-validation
3. chronological splits: use the last n% of the data for testing, or split at a
given time point in the dataset
4. per-user chronological splits: use the last n% or n events of each user for
testing
Each of those methods has advantages and disadvantages. Simple splits and (per-user) chronological splits are fast to evaluate, because only one model (per evaluated method/hyperparameter combination) has to be computed and evaluated on a part of the dataset. k-fold cross-validation uses all data for testing, and thus generally yields more robust results, which is one of the reasons why it is a standard technique for model comparison in machine learning and applied statistics. On the other hand, k models have to be trained per method/hyperparameter combination, making cross-validation computationally more expensive. Chronological splits are the most realistic kind of evaluation split, because in real-life systems one can only use past data to predict future events. Per-user chronological splits are a bit less realistic, because not necessarily all users have their latest events in the same period of time. On the other hand, this was the way the split was generated for the Netflix Prize, making it a quite popular evaluation protocol that library users may want to use when trying to replicate results reported in the literature.
There is also support for hyperparameter selection using grid search and the
Nelder-Mead method [Piotte and Chabbert, 2009].
6.2.5 Incremental Updates
Academic experiments on recommender system algorithms are usually conducted off-line, by training a prediction model and then evaluating it. Yet real-world recommender systems constantly get new user feedback that should be immediately incorporated into the prediction model. MyMediaLite offers an
API for immediate updates to already trained prediction models. Besides being supported in a special online evaluation mode by the two command-line tools, the use of incremental updates is demonstrated in the GUI demo, which asks the user for movie ratings and then immediately displays personalized movie recommendations (see Fig. 6.1).

[Figure 6.1: The MyMediaLite movie demo program.]
6.2.6 Parallel Processing
Several parts of the library can be easily parallelized, to potentially make use of
several processors or cores present in a system. Whenever there are sequential
code fragments that are independent of each other in terms of data accesses,
those fragments can be parallelized. This is the case for cross-validation procedures (usually a small number of computationally intensive tasks), and for
the prediction of personalized item scores in the evaluation of item recommendation methods (large number of computationally cheap tasks). Consequently,
both kinds of computations have been parallelized in MyMediaLite. Other candidates for parallelization would be user- or item-wise counting statistics in
simple baseline methods (see section 2.3.1), or rating prediction evaluations.
However, as such tasks are quite fast even for huge datasets, we have not parallelized them so far.
Parallel Stochastic Gradient Descent for Matrix Factorization
While some parts of the library as described above are parallelized, most algorithms have not yet been parallelized. One exception is the block-free parallel SGD for matrix factorization, which uses the same idea as Jellyfish [Recht and Ré, 2011] and as Gemulla et al.'s KDD paper on distributed matrix factorization, which was published a bit earlier than Jellyfish.
Idea: Generally, the sequence of training examples in stochastic gradient descent does not matter too much; after all, the examples are drawn randomly from the training dataset. This is practical if we want to parallelize the training
procedure. One obstacle that prevents us from a simple and straightforward parallelization of the algorithm is that the parameter updates for the different training examples cannot be guaranteed to be independent. Both cited papers suggest overcoming this by dividing both all users and all items into roughly nb equal-sized parts. Updates on user factors are now independent if the users are in two different parts, and updates on item factors are also independent if the items are in two different (item) parts. This means that the updates for two given ratings (that is, user-item combinations) are independent if and only if both the users and the items are in different parts. Consequently, if we consider the user and item parts to be in a square matrix, then all blocks belonging to the same diagonal are independent of each other, meaning that all updates for ratings in one block on the diagonal are independent of all updates for all ratings in other blocks on that diagonal.
This leads us to a block-free SGD algorithm for matrix factorization. To perform a complete pass over the training data, we process all diagonals (called 'sub-epochs' by Gemulla et al.) of the square matrix in random sequence. When processing one diagonal, each block in the diagonal can be processed in parallel to the other blocks.

Algorithm description: Algorithm 10 contains the pseudocode of the procedure. The random partitioning of users and items is performed in lines 2 to 7, followed by the shuffling of ratings in the different blocks in lines 8 to 12. For every epoch, a different sub-epoch sequence is generated (lines 19 and 20). In a sub-epoch, the different blocks are processed in parallel (line 23).

Data: R, α, λ, nit, nb
Result: bU, bI, W, H
1  // preparation: shuffle and partition data
2  pU ← random permutation of 1 . . . |U|
3  pI ← random permutation of 1 . . . |I|
4  for j ∈ {1 . . . |R|} do
5    (ru,i, u, i) ← rj
6    b_{pU(u) mod nb, pI(i) mod nb} ← b_{pU(u) mod nb, pI(i) mod nb} ∪ {rj}
7  end
8  for i ∈ {1 . . . nb} do
9    for j ∈ {1 . . . nb} do
10     shuffle bij
11   end
12 end
13 // actual learning
14 bU ← 0
15 bI ← 0
16 initialize W, H to small random values
17 for it ∈ {1, . . . , nit} do
18   // generate random sub-epoch sequence
19   s ← {1, . . . , nb}
20   shuffle s
21   // sub-epoch
22   for i ∈ s do
23     for j ∈ {1, . . . , nb} do parallel
24       for k ∈ b_{j,(i+j) mod nb} do
25         (u, i, ru,i) ← rk
26         update bU_u, bI_i, wu, hi
27       end
28     end
29   end
30 end
Algorithm 10: Parallel stochastic gradient descent for matrix factorization. R is the rating dataset, W, H are the model parameters, α is the learning rate (step size), λ is the regularization constant, nit is the number of passes over the training data, nb is the number of blocks that can be processed at the same time. We assume that iteration over sets is in the order of the indexes, unless otherwise noted.
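To illustrate the scheduling idea (a sketch, not MyMediaLite's C# implementation), the following Python fragment processes the blocks of each diagonal concurrently; the per-block update is left as a stub, and in CPython real speedup would require processes or a GIL-free runtime.

import random
from concurrent.futures import ThreadPoolExecutor

def train_epoch(blocks, n_b, update_block):
    # blocks[a][b]: ratings whose user part is a and whose item part is b;
    # within one diagonal d, the blocks (j, (d + j) mod n_b) touch disjoint
    # user and item parts, so they can be updated concurrently
    diagonals = list(range(n_b))
    random.shuffle(diagonals)  # random sub-epoch sequence
    with ThreadPoolExecutor(max_workers=n_b) as pool:
        for d in diagonals:    # one sub-epoch per diagonal
            futures = [pool.submit(update_block, blocks[j][(d + j) % n_b])
                       for j in range(n_b)]
            for f in futures:
                f.result()     # wait before starting the next diagonal

n_b = 3
blocks = [[[("u%d" % a, "i%d" % b)] for b in range(n_b)] for a in range(n_b)]
train_epoch(blocks, n_b, update_block=lambda ratings: None)  # stub update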
6.2.7 Serialization
Another feature that is required in practice is storing and loading trained recommenders, which allows one, e.g., to train recommenders on a different machine than the one that produces the recommendations. All recommenders in MyMediaLite support this feature.
6.2.8 Documentation
The library is accompanied by extensive documentation. Besides the complete API documentation, there are example programs in Python, Ruby, and C#, and how-tos on typical tasks like embedding recommenders into a program, implementing new recommenders, or using the command-line tools.
6.2.9 Diversification and Ensembles
Besides the features described so far, MyMediaLite also supports attribute-based diversification of recommendation lists [Ziegler et al., 2005, Knijnenburg et al., 2011b] and ensembles of recommenders [Rendle and Schmidt-Thieme, 2009].
6.3 Development Practices
In the development of the library, we employ best practices for (not only) free/open source projects, like keeping the code in a public (and distributed) version control repository, having regular releases (roughly one per month) and a collection of unit tests, and performing static analysis⁴ of the compiled code.
6.4 Existing Software
In this section, we describe existing (mostly) free recommender system software. All mentioned features refer to the state in November 2011, unless otherwise noted. A more up-to-date listing of free recommender system software may be found on the MyMediaLite website.⁵ To our knowledge, this is the most complete overview of free recommender system software so far. Other surveys can be found in Ekstrand et al. [2011], Angermann [2010], and Gantner et al. [2011b]. For a cleaner presentation, the homepage URLs of the discussed software are compiled in Table B.4 in appendix B.
6.4.1 Recommender System Libraries
This subsection covers libraries that contain several different recommendation algorithms, and potentially additional functionality to support the research, evaluation, development, and deployment of recommender systems.
GraphLab
GraphLab [Low et al., 2010, Wu et al., 2011] is a novel framework for parallel/distributed computing on graphs. It contains a library of several recommender algorithms. Implemented algorithms are probabilistic matrix/tensor factorization using Markov Chain Monte Carlo (MCMC) [Salakhutdinov and Mnih, 2008b,a, Xiong et al., 2010], alternating least squares (ALS) for MF [Zhou et al., 2008], stochastic gradient descent (SGD) [Koren et al., 2009, Takacs et al., 2009], non-negative matrix factorization (NMF) [Lee and Seung, 2001], and SVD++ [Koren, 2008] for rating prediction. It also contains one algorithm, weighted ALS [Hu et al., 2008, Pan et al., 2008], which is suitable for implicit/positive-only feedback.
Apache Mahout
Apache Mahout [Owen et al., 2011] is a collection of mostly distributed (via Hadoop) implementations of machine learning and data mining algorithms. One section of the library is dedicated to collaborative filtering algorithms; the majority of its recommendation algorithms, taken from the predecessor Taste, are not distributed; an item-based kNN model [Linden et al., 2003] and Slope-One [Lemire and Maclachlan, 2005] are available as distributed implementations.
LensKit
LensKit is a recommender system algorithm library aimed at research and educational use [Ekstrand et al., 2011]. Currently, it contains matrix factorization, probabilistic latent semantic indexing (pLSI, Hofmann [2004]), Slope-One, and several kNN-based models.
⁴ We use Gendarme to find problems in MyMediaLite.
⁵ http://ismll.de/mymedialite/links.html
R recommenderlab
recommenderlab is a package for the R statistical language/environment. It
contains association rules [Changchien and Lu, 2001], user-based and item-based
kNN, and the most-popular baseline.
EasyRec
EasyRec is a recommender system web service that can be integrated into websites; however, it does not contain any advanced personalized algorithms; it is
more a framework for connecting a recommender service with an application.
In the future, its developers plan to make the software compatible with Mahout,
so that it can use the methods contained there.
RecLab
RecLab [Vengroff, 2011] is a framework for performing live evaluations in online
shopping systems; it contains an API to be implemented by the shop system, and
another one for providing recommendations. It is used as code infrastructure
for the ongoing RecLab Prize on Overstock.com challenge.
Waffles
Waffles [Gashler, 2011] is a collection of general machine learning algorithms. One group of algorithms in Waffles are rating prediction methods: PCA [Hastie et al., 2009], matrix factorization, user-based kNN, item averages, user clustering, and bagging (for ensembles).
jCOLIBRI
jCOLIBRI [Díaz-Agudo et al., 2007] is a case-based reasoning (CBR) tool that can also be used for generating recommendations based on kNN. It supports about 30 different similarity measures.
MyCBR
MyCBR [Stahl and Roth-Berghofer, 2008] is another case-based reasoning (CBR) tool. It can also use jCOLIBRI, while offering additional features.
COFI
COFI [Lemire et al., 2005] is a Java-based collaborative filtering library for rating prediction. It contains simple baseline algorithms, kNN-based methods, Slope-One, the Eigentaste algorithm [Goldberg et al., 2001] (including a variant developed by the authors [Lemire, 2005]), and a variety of linear rating predictors.
Crab
Crab is a recommender system framework written in Python that currently supports kNN and Slope-One, and contains an early-stage matrix factorization implementation for rating prediction.
Duine
Duine [van Setten, 2005] was developed at Telin (now Novay ). Its focus lies
on kNN-based methods, both collaborative and content-based. According to its
authors, it does not scale well to large datasets.
Taste.NET
Taste.NET is a port of the Mahout predecessor Taste (version 1.6) to C#. The
choice of methods contained in Taste.NET is not as wide as in current Mahout
versions. It is currently not actively developed; the last modification happened in January 2010.
6.4.2 Implementations of Single Methods
In this subsection, we describe implementations of single recommender system
algorithms that are not part of a larger library or framework.
PyRSVD
PyRSVD is a Python implementation of SGD-trained matrix factorization for
rating prediction.
CoRank
CoRank is the implementation of an ordinal ranking method of the same name for rating data [Weimer et al., 2008].
SVDFeature
SVDFeature [Chen et al., 2011b] implements a special case of factorization machines [Rendle, 2010b], which allows flexible integration of information beyond user-item interactions.
Jellyfish
Jellyfish [Recht and Ré, 2011] is a large-scale parallel matrix factorization technique for rating prediction that is similar in spirit to the algorithm devised by Gemulla et al. [Gemulla et al., 2011], which is implemented in MyMediaLite (see section 6.2.6).
Likelike
Likelike is an implementation of locality sensitive hashing (LSH) [Das et al.,
2007] on top of Hadoop.
Vowpal Wabbit
Vowpal Wabbit, a large-scale online learning system, concentrates on linear models, but also supports matrix factorization.
OpenSlopeOne
OpenSlopeOne is an implementation of Slope-One using the PHP language and
the MySQL database.
Vogoo
Vogoo is an implementation of Slope-One. Its homepage is abandoned, and the
last release was in early 2008.
Wooflix
Wooflix is an implementation of SVD++ [Koren, 2008] in Python. According to its author, it is not very fast.
Ruby on Rails Components
Recommendable and ActsAsRecommendable are two plug-ins for Ruby on Rails
that allow the integration of recommendation features into websites.
6.4.3 Non-Free Publicly Available Software
In this section, we describe several publicly available (in terms of source code)
recommender system software packages. While these are not free software, and
thus do not allow commercial use, for example, the availability of the source
code still allows researchers to inspect the implementations, and to use them
for educational/research purposes.
LibFM
LibFM implements factorization machines [Rendle, 2010b], which are a generalization of matrix and tensor factorization (handling arbitrary interactions
between modes) and polynomial SVMs/logistic regression (with factorized features). Factorization machines can be used to mimic/implement several recommender system methods, and are particularly well-suited for context-aware
recommendation [Rendle, 2010a].
Probabilistic Matrix/Tensor Factorization
The authors of several papers about probabilistic matrix and tensor factorization [Salakhutdinov and Mnih, 2008b,a, Xiong et al., 2010] provide a Matlab
implementation of their methods.
Latent Log-Linear Models
The implementation of another (generalized) matrix factorization method called
latent feature log-linear model (LFL) [Menon and Elkan, 2010] is also available
as Matlab code.
MultiLens
MultiLens [Miller, 2003] was released in 2004, and its homepage states that it will at some point be released as free software. However, currently only a compiled package is available for download. It contains rating prediction and item recommendation algorithms, most of them based on kNN, as well as rule-based algorithms.
6.5 System Comparison
We collected information on some of the recommender system algorithm libraries described above, and compare their features to those of MyMediaLite in Table 6.1. We concentrated on free software/open source packages6, and additionally included SUGGEST [Karypis, 2001], which is not free software, but was a fairly early publicly available package. Besides information on the latest available version, we compare the following features:
1. license: the license under which the software is distributed

2. language: the programming language the software is written in; note that all libraries run on the major computing platforms

3. actively developed: whether the software is under active development, i.e. whether there was development activity within the last six months

4. scalable: whether the software scales, i.e. whether it is capable of running at least one non-trivial recommender algorithm on the full Netflix data on a modern computer

5. distributed: Can the computation of at least one non-trivial recommender model be run on several computers at once?

6. matrix factorization: Does the package contain modern matrix factorization techniques?

7. kNN methods: Does the library contain k-nearest-neighbor methods?

8. rating prediction: Are there algorithms/evaluation routines for rating prediction?

9. positive-only feedback: Are there algorithms/evaluation routines for recommendation from positive-only feedback?

10. multi-core support: Can at least one non-trivial recommender algorithm be run on several cores at once? (evaluation routines like cross-validation can run in parallel on several cores)

11. time-aware: Are there algorithms that take the time of the events into account for training and (possibly) prediction?

12. group recommendation: Are there methods for providing recommendations to a group of users?

13. incremental updates: Are there recommender models that allow the dynamic incorporation of new feedback without having to re-train the complete model?

14. hyperparameter tuning: Are there routines for tuning the hyperparameters of recommendation methods?
6 See http://www.gnu.org/philosophy/free-sw.html and http://www.opensource.org/docs/osd.
Library       Version    Date        License     Language
LensKit       0.8.1      2011-10-10  LGPL 2      Java
Mahout        0.5        2011-05-27  Apache 2.0  Java
GraphLab      v1_134     2011-07-29  Apache 2.0  C++
SUGGEST       1.0        2000-11-08  non-free    C
Duine         4.0.0-RC1  2009-02-17  LGPL 3      Java
MyMediaLite   1.02       2011-08-03  GPL 3       C#

Table 6.1: Comparison of some free/open source recommender system frameworks.
6.6.
93
EXPERIMENTS
Dataset                                  k = 5      k = 120
MovieLens-100K (external split)          5 MB       6 MB
MovieLens-100K (5-fold CV)               10 MB      10 MB
MovieLens-1M (external split)            15 MB      20 MB
MovieLens-1M (5-fold CV)                 56 MB      56 MB
Netflix (external split)                 1271 MB    1490 MB
Yahoo! Music Ratings (external split)    8883 MB    9743 MB

Table 6.2: Memory usage for rating prediction with BiasedMatrixFactorization (MyMediaLite 1.04), as reported by the Mono runtime. Note that the actual memory usage may be lower, because garbage collection is not always enforced, as well as higher, because of the overhead of running the programs in a virtual machine.
6.6 Experiments
We have performed several rating prediction experiments to showcase MyMediaLite's runtime performance and scalability. We first report on general rating prediction experiments on three different datasets, and then on a particular set of experiments to find out the speed-up gained by using the parallel algorithm described above on multiple cores instead of the sequential version.
Note that we ran those experiments on a cluster that processes multiple jobs
at the same time. This means the evaluations were not conducted in isolation,
and their results should only be interpreted as a rough measurement of the
system's runtime performance.
6.6.1 General Performance
We ran the BiasedMatrixFactorization recommender on the Netflix, MovieLens-100K, and MovieLens-1M datasets (see section 2.5). For Netflix, we used the probe dataset for validation; on the MovieLens datasets, we performed 5-fold cross-validation. We measured the time needed for one pass over the training data, the time needed for predicting the validation set, and the memory usage reported by the program. The time needed for a pass over the training data should give a general idea of the performance of the employed data structures, not just of the particular recommender.7
Results: Table 6.2 shows the memory usage by dataset and evaluation protocol. Memory requirements are modest for small and medium-sized datasets, and even the Netflix dataset can be processed on a fairly modern computer without any problem. Figures 6.2 and 6.3 show the time for one pass over the training data for different model sizes on the Netflix dataset. Regarding prediction times, MyMediaLite is capable of making between 1,400,000 (k = 120) and 7,000,000 (k = 5) predictions per second.
We have not performed a systematic comparison with other libraries yet. Angermann's master thesis [Angermann, 2010] reports on experiments involving several recommender system frameworks.

7 The evaluation scripts with all relevant parameter settings are available from http://ismll.de/mymedialite/examples/recsys2011.html [Gantner et al., 2011b].
[Plot omitted: panels "ml100k" and "ml1m"; x-axis: number of factors (0-120); y-axis: avg. iteration time (s).]

Figure 6.2: Runtime of BiasedMatrixFactorization (MyMediaLite 1.04). Plotted is the average time needed for one epoch, depending on the number of latent factors per user and item.
[Plot omitted: two "netflix" panels; x-axis: number of factors (0-120); y-axis: avg. iteration time (s).]

Figure 6.3: Runtime of BiasedMatrixFactorization (MyMediaLite 1.04) and MultiCoreMatrixFactorization (nb = 128). Plotted is the average time needed for one epoch, depending on the number of latent factors per user and item.
[Plot omitted: two "netflix" panels; x-axis: number of blocks (1-500); left y-axis: avg. iteration time (s); right y-axis: memory usage (MB).]

Figure 6.4: Runtime and memory usage of MulticoreMatrixFactorization. Plotted is the average time needed for one epoch (left) and the memory usage (right), depending on the number of blocks. Note that the y-axes are logarithmic.
6.6.2 Parallel Stochastic Gradient Descent
To measure the speed-up gained by using multiple cores for matrix factorization, as well as the memory overhead of our implementation, we conducted further experiments on the Netflix dataset. In the first stage of the experiment, we ran the algorithm with k = 120 factors and different values for nb. For comparison, we ran the sequential version of the same algorithm with otherwise identical settings. In the second stage, we set nb = 128, which turned out to be a practical value, and varied k.
Results: The results of the first stage can be seen in Figure 6.4. The sequential algorithm took 567.33 seconds on average for one pass over the training data, and consumed 1758 MB of memory. A good choice for nb on 8 cores seems to be 64 (where one pass took 144.95 seconds) or above; 567.33 s / 144.95 s ≈ 3.9, i.e. we have a speed-up of roughly 4 on 8 cores. The results of the second stage have been included in Figure 6.3.

Discussion: Ideally, we should see a speed-up of close to 8; the difference between ideal and reality can be explained by the overhead caused by the additional shuffling, and by low-level effects like thread switching, increased cache misses, and other factors. On the other hand, most modern computers have several cores, so it is beneficial to make use of them even if the speed-up is not perfect.
6.7 Impact
MyMediaLite is based on parts of the framework [Marrow et al., 2009] that has been used in four different industrial field trials of the MyMedia project, including one involving 50,000 households [Marrow et al., 2010]. Application areas in the field trials were IPTV, web-based video and audio, and online shopping. Additionally, the MyMedia framework was used to perform user-centric experiments [Bollen et al., 2010].
Towards the end of the project, we stripped off the framework's more heavyweight components, and released it as free software in September 2010. Since then, it has been downloaded more than 7,300 times8, received numerous improvements, and has been successfully used in several research activities, for example in studies on student performance prediction [Nguyen et al., 2011] and item recommendation in social networks [Du et al., 2011, Krohn-Grimberghe et al., 2012], in the KDD Cup 2011 for music recommendation [Balakrishnan et al., 2011], in an information retrieval evaluation project [Bellogín et al., 2011], and as a baseline for context-aware recommendation (winner of two tracks of the CAMRa 2010 challenge [Gantner et al., 2010c]).
The European project e-LICO has ported MyMediaLite to Java, in order to provide the basis of the recommender extension for the RapidMiner data analysis software. This could help to grow MyMediaLite's user base in two ways: First, Java is more widely used by both researchers and application developers than .NET, and second, RapidMiner allows calling the library from an easy-to-use graphical user interface.
6.8 Summary and Outlook
MyMediaLite is a versatile library of recommender system algorithms for rating prediction and item recommendation from positive-only feedback. We believe MyMediaLite is currently one of the most complete free/open source recommender system frameworks in terms of recommendation tasks, implemented methods, efficiency, features, flexibility, and documentation (see section 6.4).
We will continue MyMediaLite's development in several directions. Porting the library to Java or C++ is worth considering, given the popularity of these programming languages. Besides the RapidMiner port (see above), a partial Java port of the library is already available for download. Another useful extension would be a web service interface for the easy integration of MyMediaLite, e.g. into online shop software; we will consider existing APIs for this feature. While the three supported recommendation tasks (rating prediction, item recommendation, and recommendation for groups) cover many use cases, we also plan to add additional recommendation tasks and types of input, e.g. item recommendation from other kinds of implicit feedback like viewing times or click counts, or tag recommendation (see section 2.2.3).
8 Excluding search engine spider downloads (as of July 2012).
Chapter 7
Conclusion
This chapter concludes this thesis. After summarizing its contents, we discuss
possible directions of future research.
7.1 Summary
Item recommendation is an important prediction task in the application area of
recommender systems. We have presented a generic formal definition of item recommendation, and have expressed some more specific tasks in terms of this definition.
We suggested a framework that allows us to solve cold-start scenarios in the strict meaning of the term for prediction methods that represent entities as vectors of real numbers, in particular factorization models, which are the state-of-the-art approach for collaborative filtering tasks. The framework relies on learning mapping functions from the (new) entity attributes to the latent factors. Experiments on the new-item problem with a matrix factorization method for item recommendation from positive-only feedback (BPR-MF) showed the suitability of the approach, and that optimizing the mapping functions for the actual optimization objective is worthwhile.
The Bayesian Personalized Ranking (BPR) framework comprises a training objective and a generic learning algorithm for personalized ranking, which can be employed for item recommendation. We extended BPR to the more general weighted BPR (WBPR) criterion, which lets us individually specify the contribution of each entity to the global optimization target.
WBPR can be used to tackle interesting large-scale item recommendation
problems in which the candidate items to be scored and ranked are not drawn
uniformly from the set of all available items, but according to other criteria like
popularity. Such a scenario was the challenge posed by track 2 of the KDD Cup
2011: distinguishing between songs a user will like, and songs that are generally popular. We used matrix factorization models trained for an appropriate
WBPR variant (WBPR-MF), augmented with information from rating-based
models, to achieve an error rate of less than 5% on that task, which means that
the prediction model's decision was wrong on less than 1 in 20 songs.
Having publicly available high-quality implementations of state-of-the-art algorithms is important for the progress of a field, as it enables researchers to realistically compare their new developments to the state of the art. We implemented all methods presented in this thesis, and some more, as part of the MyMediaLite software package, which is a reference collection of recommender system algorithms, accompanied by a rich infrastructure for developing, evaluating, testing, and deploying new and existing methods. We described the state of the art of open source/free software recommender software, and compared MyMediaLite to other existing programs. The experiments showed that the implementation scales well even when processing very large datasets.

In summary, the main contributions of this thesis to the state of the art of machine learning techniques for recommendations are a method for solving hard cold-start problems for arbitrary latent factor models, and a new flexible and generic optimization criterion and learning algorithm for item recommendation. The new methods are provided as part of the MyMediaLite package, to allow other researchers to reproduce the presented experiments, and to build new progress upon these and further methods.
7.2 Future Directions
Future research directions have been discussed throughout this thesis at the end
of the respective chapters.
The MyMediaLite software, presented in chapter 6, can be enhanced to cover
more application areas, for example tag recommendation, sequential recommendation, or general context-aware recommendation (see section 2.2.3), and even
areas that are not about recommendation, but where the same or similar models
and algorithms are useful, like student performance prediction [Nguyen et al.,
2011] or link prediction [Sarukkai, 2000, Menon and Elkan, 2010, 2011]. Ideally, MyMediaLite could evolve into a package that supports arbitrary learning
problems in complex domains with graph/network structure.
As presented in chapter 3, the framework for enabling factorization models
to deal with new-user and new-item problems is both simple and modular.
We implemented the example of matrix factorization optimized for the BPR criterion (BPR-MF) in the software. This work can be extended to the other
factorization models present in MyMediaLite: By having a generic API for
mapping from user and item attributes to latent factors, all factorization models
could be made ready for cold-start scenarios. With a generic implementation,
we could investigate how well more advanced mapping functions like multi-layer
neural networks or support-vector regression work for tasks like rating prediction
or tag prediction.
The weighted BPR (WBPR) criterion, introduced and discussed in section 4.2, can be applied to scenarios other than music recommendation (chapter 5), for example for providing personalized rankings of news articles that have been assigned weights/priorities by an editorial team. A further possible use of weighted BPR is in learning scenarios with case weights.
Seeing recommendation as a supervised learning and prediction problem, which is the main theme of this thesis, has proven beneficial over and over, both in practice and in academic research. While it is a practical abstraction, it is not the last or only solution to modeling recommendation and personalization. After all, a recommender is often part of a larger interactive system, and is not as static as other scenarios where supervised learning techniques are used, like hand-written digit recognition for ZIP codes [LeCun et al., 1989]. Approaches that view recommendation as an interactive, dynamic process, and that use the abstraction of reinforcement learning [Sutton and Barto, 1998], for example Shani et al. [2006] and Li et al. [2010, 2011], seem to be a promising research direction.
Appendix A
MyMediaLite Reference
See chapter 6 for a general introduction to the software. This appendix is
meant to be a reference manual. Part of its contents are also available on
MyMediaLite's homepage.
MyMediaLite is a software package containing different recommender system algorithms, plus tools that support developers and end users in making efficient use of the software. The following sections give an introduction on how to use and extend MyMediaLite and its components, both from a developer's and from an end-user's perspective.
A.1 Installation
A.1.1 Prerequisites
For running MyMediaLite, you need at least Mono 2.8.x or another recent .NET
runtime. Mono 2.10.x is highly recommended.
Perl and the package File::Slurp are required for the download and data
processing scripts, but not necessary for running/building MyMediaLite.
For building MyMediaLite from its sources, you either need an integrated
development environment (IDE) like MonoDevelop or VisualStudio, or the make
utility. For building the API documentation, Doxygen 1.6.3 or later is needed.
A.1.2 Packages
MyMediaLite can be installed from source, or from a binary package. The download page1 offers three packages: a binary package, a source package, and a documentation package. The source code can also be obtained directly from MyMediaLite's repositories on GitHub and Gitorious.
A.1.3 Instructions
If you have the binary package, just copy its contents to wherever you want and
run the programs from there.
1 http://ismll.de/mymedialite/download.html; see the second appendix for more URLs.
To build MyMediaLite from source on Unix-like systems, run make all. To install it, set the PREFIX variable in the Makefile and then run make install. On Windows, compile the software using Visual Studio or MonoDevelop.
A.2 Command-Line Tools
MyMediaLite is mainly a library, meant to be used by other applications. There are two command-line tools that offer much of MyMediaLite's functionality. They allow users to work with MyMediaLite without having to integrate the library into an application or having to develop their own programs.
A.2.1 Rating Prediction
The general usage of the rating prediction program is as follows:
rating_prediction --training-file=TRAINING_FILE
--test-file=TEST_FILE
--recommender=METHOD [OPTIONS]
METHOD is the recommender to use, which will be trained using the contents
of TRAINING_FILE. The recommender will then predict the data in TEST_FILE,
and the program will display the RMSE (root mean square error, see section 2.4.1) and MAE (mean absolute error) of the predictions. If you call
rating_prediction without arguments, it will provide a list of recommenders
to choose from, plus their arguments and further options:
MyMediaLite rating prediction 2.99
usage: rating_prediction --training-file=FILE --recommender=METHOD [OPTIONS]
recommenders (plus options and their defaults):
- GlobalAverage
supports --online-evaluation
...
- SVDPlusPlus num_factors=10 regularization=0.015 bias_reg=0.33 learn_rate=0.001
bias_learn_rate=0.7 num_iter=30 init_mean=0 init_stddev=0.1
supports --find-iter=N, --online-evaluation
method ARGUMENTS have the form name=value

general OPTIONS:
  --recommender=METHOD           set recommender method (default: BiasedMatrixFactorization)
  --recommender-options=OPTIONS  use OPTIONS as recommender options
  --help                         display this usage information and exit
  --version                      display version information and exit
  --random-seed=N                initialize the random number generator with N
  --rating-type=float|byte       store ratings internally as floats (default) or bytes
  --no-id-mapping                do not map user and item IDs to internal IDs,
                                 keep original IDs

files:
  --training-file=FILE           read training data from FILE
  --test-file=FILE               read test data from FILE
  --file-format=movielens_1m|kddcup_2011|ignore_first_line|default
  --data-dir=DIR                 load all files from DIR
  --user-attributes=FILE         file containing user attribute information
  --item-attributes=FILE         file containing item attribute information
  --user-relations=FILE          file containing user relation information
  --item-relations=FILE          file containing item relation information
  --save-model=FILE              save computed model to FILE
  --load-model=FILE              load model from FILE

prediction options:
  --prediction-file=FILE         write the rating predictions to FILE
  --prediction-line=FORMAT       format of the prediction line; {0}, {1}, {2} refer
                                 to user ID, item ID, and predicted rating;
                                 default is {0}\\t{1}\\t{2}
  --prediction-header=LINE       print LINE to the first line of the prediction file

evaluation options:
  --cross-validation=K           perform k-fold cross-validation on the training data
  --show-fold-results            show results for individual folds in cross-validation
  --test-ratio=NUM               use a ratio of NUM of the training data for evaluation
                                 (simple split)
  --chronological-split=NUM|DATETIME
                                 use the last ratio of NUM of the training data ratings
                                 for evaluation, or use the ratings from DATETIME on for
                                 evaluation (requires time information in the training
                                 data)
  --online-evaluation            perform online evaluation (use every tested rating for
                                 incremental training)
  --search-hp                    search for good hyperparameter values
  --compute-fit                  display fit on training data

options for finding the right number of iterations (iterative methods)
  --find-iter=N                  give out statistics every N iterations
  --max-iter=N                   perform at most N iterations
  --measure=RMSE|MAE|NMAE|CBD    evaluation measure to use for the abort conditions
                                 below (default RMSE)
  --epsilon=NUM                  abort iterations if evaluation measure is more than
                                 best result plus NUM
  --cutoff=NUM                   abort if evaluation measure is above NUM
One can download the MovieLens 100k ratings dataset (see section 2.5) and unzip it to go through the following examples. In the MyMediaLite directory, this can be performed by entering

make download-movielens

The file formats supported by MyMediaLite are described in section A.4. To try out a simple baseline method on the data, one just enters
rating_prediction --training-file=u1.base --test-file=u1.test
--recommender=UserAverage
which should give a result like
UserAverage training_time 00:00:00.000098 RMSE 1.063 MAE 0.8502
testing_time 00:00:00.032326
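For quick reference, the two reported error measures (defined in section 2.4.1) can be written as follows, where $\mathcal{T}$ denotes the set of test ratings $r_{ui}$ and $\hat{r}_{ui}$ the corresponding predictions (these symbols are chosen here for illustration and may differ from the notation of section 2.4.1):

$$\text{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} (\hat{r}_{ui} - r_{ui})^2} \qquad \text{MAE} = \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} |\hat{r}_{ui} - r_{ui}|$$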
To use a more advanced recommender, enter
rating_prediction --training-file=u1.base --test-file=u1.test
--recommender=BiasedMatrixFactorization
which yields a better result than the user average:
BiasedMatrixFactorization num_factors=10 regularization=0.015
learn_rate=0.01 num_iter=30
init_mean=0 init_stdev=0.1
training_time 00:00:03.3575780 RMSE 0.96108 MAE 0.75124
testing_time 00:00:00.0159740
The key-value pairs after the method name represent arguments to the recommender that may be modified to get even better results. For instance, we could use more latent factors per user and item, which leads to a more complex (and hopefully more accurate) model:
rating_prediction --training-file=u1.base --test-file=u1.test
--recommender=BiasedMatrixFactorization
--recommender-options="num_factors=20"
...
... RMSE 0.98029 MAE 0.76558
A.2.2 Item Recommendation
The item recommendation program behaves similarly to the rating prediction program, so we concentrate on the differences here.
The basic usage is:
item_recommendation --training-file=TRAINING_FILE
--test-file=TEST_FILE
--recommender=METHOD [OPTIONS]
Again, if you call item_recommendation without arguments, it will provide
a list of recommenders to choose from, plus their arguments and further options:
MyMediaLite item recommendation from positive-only feedback 2.99
usage:
item_recommendation --training-file=FILE --recommender=METHOD [OPTIONS]
methods (plus arguments and their defaults):
- BPRMF num_factors=10 bias_reg=0 reg_u=0.0025 reg_i=0.0025 reg_j=0.00025
num_iter=30 learn_rate=0.05 uniform_user_sampling=True
with_replacement=False bold_driver=False
fast_sampling_memory_limit=1024 update_j=True init_mean=0
init_stddev=0.1
supports --find-iter=N, --online-evaluation
...
- MostPopular
supports --online-evaluation
method ARGUMENTS have the form name=value
general OPTIONS:
  --recommender=METHOD           use METHOD for recommendations
  --group-recommender=METHOD     use METHOD to combine the predictions for
                                 several users
  --recommender-options=OPTIONS  use OPTIONS as recommender options
  --help                         display this usage information and exit
  --version                      display version information and exit
  --random-seed=N                initialize random number generator with N

files:
  --training-file=FILE           read training data from FILE
  --test-file=FILE               read test data from FILE
  --file-format=ignore_first_line|default
  --no-id-mapping                do not map user and item IDs to
                                 internal IDs, keep the original IDs
  --data-dir=DIR                 load all files from DIR
  --user-attributes=FILE         file with user attribute information
  --item-attributes=FILE         file with item attribute information
  --user-relations=FILE          file with user relation information
  --item-relations=FILE          file with item relation information
  --user-groups=FILE             file with group-to-user mappings
  --save-model=FILE              save computed model to FILE
  --load-model=FILE              load model from FILE

data interpretation:
  --user-prediction              transpose the user-item matrix and perform user
                                 prediction instead of item prediction
  --rating-threshold=NUM         (for rating datasets) interpret rating >= NUM
                                 as positive feedback

choosing the items for evaluation/prediction (mutually exclusive):
  --candidate-items=FILE         use items in FILE (one per line) as candidate items
  --overlap-items                use only items that are both in the training
                                 and the test set as candidate items
  --in-training-items            use only items in the training set as candidate
                                 items
  --in-test-items                use only items in the test set as candidate items
  --all-items                    use all known items as candidate items

choosing the users for evaluation/prediction
  --test-users=FILE              predict items for users specified in FILE
                                 (one user per line)

prediction options:
  --prediction-file=FILE         write ranked predictions to FILE, one user per line
  --predict-items-number=N       predict N items per user (needs --prediction-file)

evaluation options:
  --cross-validation=K           perform k-fold cross-validation on the training data
  --show-fold-results            show results for individual folds in cross-validation
  --test-ratio=NUM               evaluate by splitting of a NUM part of the feedback
  --num-test-users=N             evaluate on only N randomly picked users (to save time)
  --online-evaluation            perform online evaluation (use every tested user-item
                                 combination for incremental training)
  --filtered-evaluation          perform evaluation filtered by item attribute
                                 (requires --item-attributes=FILE)
  --repeat-evaluation            items accessed by a user before may be in the
                                 recommendations (and are not ignored in the evaluation)
  --compute-fit                  display fit on training data

finding the right number of iterations (iterative methods)
  --find-iter=N                  give out statistics every N iterations
  --max-iter=N                   perform at most N iterations
  --measure=MEASURE              the evaluation measure to use for the abort conditions
                                 below (default is AUC)
  --epsilon=NUM                  abort iterations if MEASURE is less than best result
                                 plus NUM
  --cutoff=NUM                   abort if MEASURE is below NUM
Instead of RMSE and MAE, the evaluation measures are now prec@N (precision at N), AUC (area under the ROC curve), MAP (mean average precision), and NDCG (normalized discounted cumulative gain).
Let us start again with some baseline methods, Random and MostPopular:
item_recommendation --training-file=u1.base --test-file=u1.test
--recommender=Random
random training_time 00:00:00.0001040
AUC 0.4992 prec@5 0.0279 prec@10 0.0290 MAP 0.0012 NDCG 0.3721
num_users 459 num_items 1650 testing_time 00:00:02.7115540
item_recommendation --training-file=u1.base --test-file=u1.test
--recommender=MostPopular
MostPopular training_time 00:00:00.0015710
AUC 0.8543 prec@5 0.322 prec@10 0.3046 MAP 0.0219 NDCG 0.5704
num_users 459 num_items 1650 testing_time 00:00:02.3813790
User-based collaborative filtering leads to output like the following:
item_recommendation --training-file=u1.base --test-file=u1.test
--recommender=UserKNN
UserKNN k=80 training_time 00:00:05.6057200
AUC 0.9168 prec@5 0.5251 prec@10 0.4678 MAP 0.0648 NDCG 0.6879
num_users 459 num_items 1650 testing_time 00:00:08.8362840
Note that item recommendation evaluation usually takes longer than rating prediction evaluation, because for each user, scores for every candidate item (possibly all items) have to be computed. You can restrict the number of predictions to be made using the options --test-users=FILE and --candidate-items=FILE to save time.
The item recommendation program supports the same options for iteratively
trained recommenders like BPRMF and WRMF, for example --find-iter=N.
A.3 Library Structure
MyMediaLite's library source code is structured into several namespaces:
• MyMediaLite: generic recommender definitions like the IRecommender interface (see below)
• MyMediaLite.Correlation: correlations and similarity measures, used by
kNN recommenders
• MyMediaLite.Data: data structures for storing interaction and attribute
data
• MyMediaLite.DataType: basic data types like vectors and matrices
• MyMediaLite.Diversification: methods for diversifying recommendation lists
• MyMediaLite.Ensemble: ensemble methods for combining the output of
several recommenders
• MyMediaLite.Eval: evaluation code
• MyMediaLite.GroupRecommendation: recommenders for making recommendations to groups
• MyMediaLite.HyperParameter: hyperparameter search methods
• MyMediaLite.IO: input/output procedures
• MyMediaLite.ItemRecommendation: item recommendation from positive-only feedback
• MyMediaLite.RatingPrediction: rating predictors
• MyMediaLite.Taxonomy: data types to represent taxonomic information
of entities, for example which entities exist
• MyMediaLite.Util: miscellaneous utility code
The main library is contained in src/MyMediaLite. Some experimental
parts are in src/MyMediaLiteExperimental. Unit tests are in src/Tests.
In the following, we describe the most important interfaces and classes of MyMediaLite. A complete description of MyMediaLite's API can be found on the homepage and in the documentation package.
A.3.1 Conventions
There are several conventions followed in the MyMediaLite library. Users and items are referred to by int (32 bit integer) values called user_id (or shorter just u) and item_id (or i). Usually, the user ID comes immediately before the item ID in a method call. For example, IRecommender's Predict() method has the signature float Predict(int user_id, int item_id).

Interface names start, as is usual in C#, with an upper-case I. Often there are standard classes implementing such an interface, which have the same name except for the leading I. For example, there is the (non-abstract) standard implementation Ratings for the IRatings interface, and the abstract RatingPredictor class, which contains code shared by many non-abstract recommenders that inherit from it and thus implement the IRatingPredictor interface.
A.3.2 Interfaces
In this section, we describe some general interfaces. More specialized interfaces
are described in the following sections.
Recommenders
IRecommender is the most general recommender interface. Its method float Predict(int user_id, int item_id) returns a score for a given user-item combination; the higher the score, the more the recommender estimates that the user will like the given item. bool CanPredict(int user_id, int item_id) can be used to check whether the recommender is able to provide a meaningful score for the given user-item combination. The void Train() method performs the recommender training. void SaveModel(string filename) stores the resulting model to a file, and void LoadModel(string filename) can be used to restore a trained model from a file. Finally, string ToString() returns a string representation of the recommender, containing the class name and the names and values of all hyperparameters.
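Taken together, the interface can be sketched as follows (reconstructed from the description above; the actual declaration in the library may differ in details such as base interfaces or documentation comments):

public interface IRecommender
{
	// score for a user-item combination; the higher, the more the
	// recommender estimates the user will like the item
	float Predict(int user_id, int item_id);
	// whether a meaningful score can be computed for this combination
	bool CanPredict(int user_id, int item_id);
	// perform the recommender training
	void Train();
	// store the trained model to a file / restore it from a file
	void SaveModel(string filename);
	void LoadModel(string filename);
	// class name plus the names and values of all hyperparameters
	string ToString();
}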
Iterative Models
IIterativeModel is an interface for recommenders which learn by performing several passes over the training examples. The interface has a property
NumIter for the number of passes, and the methods void Iterate() and float
ComputeObjective(). Iterate() performs one learning iteration over the
training examples, and ComputeObjective() returns the current value of the
training objective.
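For illustration, driver code in the spirit of the --find-iter option of the command-line tools could use this interface as follows (a minimal sketch; the choice of MatrixFactorization and of the reporting interval is arbitrary):

// train a model, then monitor the training objective while iterating further
var mf = new MatrixFactorization { Ratings = training_data };
mf.Train();                    // performs the initial NumIter passes
for (int i = 0; i < 20; i++)   // 20 additional passes
{
	mf.Iterate();              // one more pass over the training examples
	if ((i + 1) % 10 == 0)
		Console.WriteLine("objective: {0}", mf.ComputeObjective());
}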
Similarity Providers
Some recommenders have a notion of similarity between users or items. To make use of such similarities, the IUserSimilarityProvider and IItemSimilarityProvider interfaces provide two methods each: one for getting the similarity of two given entities (float GetUserSimilarity(int user_id1, int user_id2) and float GetItemSimilarity(int item_id1, int item_id2)), and one for getting the entities that are most similar to a given entity (IList<int> GetMostSimilarUsers(int user_id, uint n = 10) and IList<int> GetMostSimilarItems(int item_id, uint n = 10)).
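For example, for a recommender implementing IUserSimilarityProvider (the surrounding driver code is hypothetical):

// the five users most similar to user 42, and the similarity to the closest one
IList<int> neighbors = recommender.GetMostSimilarUsers(42, 5);
float similarity = recommender.GetUserSimilarity(42, neighbors[0]);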
A.4 Data Structures
A.4.1 Basic Data Types
Vectors and Matrices
One kind of basic data type used in MyMediaLite are vectors and matrices. Vectors are just instances of the .NET generic type IList<T>, where T is usually float or double, and have no specific interface in MyMediaLite. Matrices are represented by the interface IMatrix<T> and its dense standard implementation Matrix<T>. There are also more specific implementations for sparse matrices, (skew-) symmetric matrices, and combinations thereof. A particular case are Boolean matrices: they are represented by the interface IBooleanMatrix, a specialization of IMatrix<bool>, which again has several specialized implementations. Methods for reading and writing vectors and matrices can be found in the classes IO.MatrixExtensions and IO.VectorExtensions.
List Proxies and Combined Lists
When dealing with large datasets, we do not want to unnecessarily replicate data in memory. Ideally, each dataset is loaded into memory once. All derived datasets, for example the k different splits in k-fold cross-validation, are represented by references to the original dataset. We want the same for derived datasets that are combinations of several original datasets. To support the implementation of such scenarios, there is ListProxy<T>, whose constructor takes an IList<T> and a list of indexes. The created object is an IList<T> where the i-th element is the element in the original list at the position specified by the i-th index. In a similar manner, the constructor of CombinedList<T> concatenates two lists.
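A minimal sketch of how the two classes behave (the list contents are made up for illustration):

var original = new List<int> { 10, 20, 30, 40 };
// element i of the proxy is original[indexes[i]]: behaves like { 40, 20 }
var proxy = new ListProxy<int>(original, new int[] { 3, 1 });
// concatenation without copying: behaves like { 10, 20, 30, 40, 40, 20 }
var combined = new CombinedList<int>(original, proxy);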
Pairs
Pair<T, U> is a tuple data type that can be used for example for representing
user-item combinations, or combinations of item IDs and rating values if the
user ID is known.
A.4.2 Entity Mappings
MyMediaLite represents user, item, and attribute IDs internally as integers, starting from zero. In imported data, IDs are often arbitrary integers with large gaps in between, or even arbitrary strings.

The interface IEntityMapping serves the purpose of mapping external string IDs to internal integer IDs. EntityMapping is the standard implementation, while IdentityMapping merely transforms the string to the integer it represents, without consuming any memory. This can save memory when handling large datasets, and be useful for debugging purposes. Giving the --no-id-mapping option to the command-line tools (see section A.2) enables IdentityMapping.
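A minimal sketch of the intended use; note that the method names ToInternalID() and ToOriginalID() are assumed here for illustration and may differ in the actual API:

var user_mapping = new EntityMapping();
int internal_id = user_mapping.ToInternalID("u5951");        // e.g. 0 for the first ID seen
string original_id = user_mapping.ToOriginalID(internal_id); // back to "u5951"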
A.4.3 Rating Data
Rating data is represented by the IRatings interface, which inherits from IDataSet. It is usually read from text files or a database.

Rating data files have at least three columns: the user ID, the item ID, and the rating value. Valid user and item IDs are strings, and the rating value is a single-precision (32-bit) floating point number. Date/time information or numerical timestamps will be used if necessary, and ignored otherwise.

Reading: Reading in rating data is implemented in IO.RatingData, IO.StaticRatingData, and IO.MovieLensRatingData. RatingData.Read() returns a data structure that can be updated/modified, while the result of StaticRatingData.Read() is a read-only structure which cannot be updated. If you also want to read in time information, use the static class IO.TimedRatingData. MyMediaLite also supports the KDD Cup 2011 file format; the corresponding classes are in the IO.KDDCup2011 namespace.
Writing: RatingPrediction.Extensions contains an extension method WritePredictions() that lets you write the predictions to a target, either a file or a TextWriter, for example:

recommender.WritePredictions(ratings, user_mapping,
                             item_mapping, target);

This will use tabs as column separators.
If you want other separators, provide a line format string to the method:
recommender.WritePredictions(ratings, user_mapping,
item_mapping, target,
"{0}|{1}|{2}");
Examples
Tab-separated columns (.tsv):
5951	50	5
5951	223	5
5951	260	5
5951	293	5
5951	356	4
5951	364	3
5951	457	3
Space-separated columns with string IDs and non-integer ratings:
u5951 i50  5.0
u5951 i223 5.0
u5951 i260 5.0
u5951 i293 4.5
u5951 i356 4.0
u5951 i364 3.5
u5951 i457 3.0
Comma-separated columns (.csv) with timestamps:
5951,50,5,978300760
5951,223,5,978302109
5951,260,5,978301968
5951,293,5,978300275
5951,356,4,978824291
5951,364,3,978302268
5951,457,3,978300719
Rating data with dates:
5951 50  5 2005-12-04
5951 223 5 2005-12-04
5951 260 5 2005-12-04
5951 293 5 2005-11-27
5951 356 4 2005-11-27
5951 364 3 2005-11-27
5951 457 3 2005-11-27
It does not matter whether the date/time information is in quotation marks or
not.
Rating data with dates and times:
5951 50  5 "2009-08-05 00:50:30"
5951 223 5 "2009-08-02 17:19:33"
5951 260 5 "2010-05-04 21:21:03"
5951 293 5 "2009-09-25 05:04:24"
5951 356 4 "2010-06-30 02:07:57"
5951 364 3 "2010-06-11 04:54:41"
5951 457 3 "2010-06-11 14:26:32"
MovieLens 1M/10M rating data files:
5951::50::5
5951::223::5
5951::260::5
5951::293::5
5951::356::4
5951::364::3
5951::457::3
The MovieLens 1M and 10M datasets use a double colon :: as separator.
MovieLens 1M/10M data with timestamps:
5951::50::5::978300760
5951::223::5::978302109
5951::260::5::978301968
5951::293::5::978300275
5951::356::4::978824291
5951::364::3::978302268
5951::457::3::978300719
Command-Line Tools

Rating prediction: MyMediaLite's rating prediction tool supports all three variants by default for its --training-file=FILE and --test-file=FILE arguments. If all ratings are integers between 0 and 255, one can use --rating-type=byte to save memory. To read a file in the MovieLens 1M/10M file format, use --file-format=movielens_1m.

For prediction, if you use --prediction-file=FILE, you will get a tab-separated file with predictions for the test data. With --prediction-line="FORMAT" you can modify the line format, e.g. if you want the items to be in the first column: --prediction-line="{1},{0},{2}"
Item recommendation: The item recommendation tool also supports this rating data format. By default, every rating is interpreted as positive feedback, and the rating value is ignored. You can use the option --rating-threshold=NUM to define the minimum rating value that is to be interpreted as positive feedback. Ratings below this value will be ignored. The MovieLens 1M/10M format is currently not supported by the item recommendation tool.
A.4.4 Positive-Only Feedback
Positive-only feedback is stored in memory by classes implementing the interface IPosOnlyData. The standard class PosOnlyData<T> takes a type argument implementing the ISparseBooleanMatrix interface, which determines the type to be used for the user- and item-wise representation of the feedback data.

Positive-only feedback files have at least two columns: the user ID and the item ID. Additional columns are ignored, which means that the rating data examples from the previous subsection will also be read. The column separators are the same as the ones for the rating data: spaces, tabs, or commas.

Reading: The class for reading in this kind of file is IO.ItemData. To use ratings above a certain threshold as feedback, one can use the IO.ItemDataRatingThreshold class.

Writing: Writing the file format described is trivial, thus there is no code in MyMediaLite for that particular task.
Example: tab-separated columns (.tsv):
5951	50
5951	223
5951	260
5951	293
5951	356
5951	364
5951	457
A.4.5 Attributes and Relations
Recommenders that use user or item attributes, or user or item relations, implement at least one of the interfaces IUserAttributeAwareRecommender, IItemAttributeAwareRecommender, IUserRelationAwareRecommender, or IItemRelationAwareRecommender.

Both the item recommendation tool and the rating prediction tool support attribute and relation files via the options --user-attributes=FILE, --item-attributes=FILE, --user-relations=FILE, and --item-relations=FILE.
An attribute is an arbitrary property that a user or item can have. Relation files, in contrast, describe binary relations over one type of entity, for example over users or over items. An example of a user relation are the edges of a social network. An example of an item relation would be the relation "A is a sequel to B" for movies. Please note that relations are not automatically symmetric: "1,2" does not imply "2,1". If you want to have a symmetric relation, make sure that both lines are contained in the file.
Binary attribute files have exactly two columns: the entity (user or item) ID and the attribute ID. Relation files also have exactly two columns, each containing an entity ID. One line in a file means that there exists a relation between the first and the second entity mentioned in that line.
Reading: The classes for reading in attribute and relation files are IO.AttributeData and IO.RelationData, respectively.

Writing: Writing these file formats is trivial, so there is no specific class for that in MyMediaLite.
Examples
Again, the column separators are the same as the ones for the rating data:
spaces, tabs, or commas.
Attribute le with tab-separated columns (.tsv):
51
51
51
51
51
51
51
5
22
26
29
35
36
45
This means that entity 51 (which may be a user or an item) has the attributes 5, 22, 26, . . . If this example is complete, it also means that entity 51 does not have the attributes 0, 1, 2, 3, 4, 6, . . .
Relation le with comma-separated columns (.csv):
51,5
51,22
51,26
51,29
51,35
51,36
51,45
This means that entity 51 (which may be a user or an item) is in relation with the entities 5, 22, 26, . . . of the same type. If this example is complete, it also means that entity 51 does not have a relation with entities 0, 1, 2, 3, 4, 6, . . .
A.5 Recommenders
We distinguish three different recommendation tasks, and thus have three different kinds of recommenders: rating predictors, which predict explicit ratings; item recommenders, which predict scores based on positive-only feedback; and group recommenders, which combine the predictions for individual users to get good group decisions. Additionally, there are ensembles, which combine the output of several recommenders for the same user-item combination in order to achieve more accurate predictions.
A.5.1 Rating Prediction
Rating predictors implement the interface IRatingPredictor. More specialized interfaces are ITimeAwareRatingPredictor, for recommenders that can use time information for training and prediction; IIncrementalRatingPredictor, for predictors that can learn incrementally as new feedback comes in; and IFoldInRatingPredictor, for recommenders that can make predictions for anonymous users based on their interactions, without adding their data to the model.
Table A.1 lists all rating predictors in MyMediaLite. The column Class contains the class name; ur means the recommender uses user relation data, ua user attributes, ia item attributes, and t time data. Literature contains references to publications relevant to the recommendation method. The hyperparameters of the different recommenders, together with their default values and a brief explanation, are shown in Table A.2.

Class                              ur   ua   ia   t    Literature
GlobalAverage
UserAverage
ItemAverage
Random
Constant
UserItemBaseline                                        Koren [2008]
TimeAwareBaseline                                  X    Koren [2009]
TimeAwareBaselineWithFrequencies                   X    Koren [2009]
CoClustering                                            George and Merugu [2005]
SlopeOne                                                Lemire and Maclachlan [2005]
BiPolarSlopeOne                                         Lemire and Maclachlan [2005]
UserAttributeKNN                        X               Koren [2008]
UserKNNCosine                                           Koren [2008]
UserKNNPearson                                          Koren [2008]
ItemAttributeKNN                             X          Koren [2008]
ItemKNNCosine                                           Koren [2008]
ItemKNNPearson                                          Koren [2008]
MatrixFactorization                                     Rendle and Schmidt-Thieme [2008]
FactorWiseMatrixFactorization                           Bell et al. [2007]
BiasedMatrixFactorization                               Rendle and Schmidt-Thieme [2008], Menon and Elkan [2010], Gemulla et al. [2011]
SVDPlusPlus                                             Koren [2008]
SocialMF                           X                    Jamali and Ester [2010]

Table A.1: Rating predictors in MyMediaLite 2.99. Some of the methods are described in detail in section 2.3.
Class                             Hyperparameter                     Default    Description
Constant                          constant_rating                    1          rating value
UserItemBaseline                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
                                  num_iter                           10         number of iterations
TimeAwareBaseline                 num_iter                           30         number of iterations
                                  bin_size                           70         bin size in days
                                  beta                               0.4
                                  user_bias_learn_rate               0.003      user bias step size
                                  item_bias_learn_rate               0.002      item bias step size
                                  alpha_learn_rate                   1E-05      learn rate for the α parameters
                                  item_bias_by_time_bin_learn_rate   5E-06
                                  user_bias_by_day_learn_rate        0.0025
                                  user_scaling_learn_rate            0.008
                                  user_scaling_by_day_learn_rate     0.002
                                  reg_u                              0.03
                                  reg_i                              0.03
                                  reg_alpha                          50
                                  reg_item_bias_by_time_bin          0.1
                                  reg_user_bias_by_day               0.005
                                  reg_user_scaling                   0.01
                                  reg_user_scaling_by_day            0.005
TimeAwareBaselineWithFrequencies  num_iter                           40         number of iterations
                                  bin_size                           70         bin size in days
                                  beta                               0.4
                                  user_bias_learn_rate               0.00267
                                  item_bias_learn_rate               0.000488
                                  alpha_learn_rate                   3.11E-06
                                  item_bias_by_time_bin_learn_rate   1.15E-06
                                  user_bias_by_day_learn_rate        0.000257
                                  user_scaling_learn_rate            0.00564
                                  user_scaling_by_day_learn_rate     0.00103
                                  reg_u                              0.0255
                                  reg_i                              0.0255
                                  reg_alpha                          3.95
                                  reg_item_bias_by_time_bin          0.0929
                                  reg_user_bias_by_day               0.00231
                                  reg_user_scaling                   0.0476
                                  reg_user_scaling_by_day            0.019
                                  frequency_log_base                 6.76
                                  item_bias_at_frequency_learn_rate  0.00236
                                  reg_item_bias_at_frequency         1.1E-08
CoClustering                      num_user_clusters                  3          number of user clusters
                                  num_item_clusters                  3          number of item clusters
                                  num_iter                           30         number of iterations
UserAttributeKNN                  k                                  inf        number of neighbors
                                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
UserKNNCosine                     k                                  inf        number of neighbors
                                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
UserKNNPearson                    k                                  inf        number of neighbors
                                  shrinkage                          10         shrinkage factor for similarities
                                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
ItemAttributeKNN                  k                                  inf        number of neighbors
                                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
ItemKNNCosine                     k                                  inf        number of neighbors
                                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
ItemKNNPearson                    k                                  inf        number of neighbors
                                  shrinkage                          10         shrinkage factor for similarities
                                  reg_u                              10         user bias regularization
                                  reg_i                              5          item bias regularization
MatrixFactorization               num_factors                        10         number of factors per user and item
                                  regularization                     0.015      regularization constant
                                  learn_rate                         0.01       step size for learning
                                  num_iter                           30         number of iterations
                                  init_mean                          0          mean of the normal distribution used to initialize the factors
                                  init_stdev                         0.1        standard deviation of the normal distribution used to initialize the factors
FactorWiseMatrixFactorization     num_factors                        10         number of factors per user and item
                                  shrinkage                          25         regularization constant
                                  sensibility                        1E-05      convergence sensibility
                                  init_mean                          0          mean of the normal distribution used to initialize the factors
                                  init_stdev                         0.1        standard deviation of the normal distribution used to initialize the factors
                                  num_iter                           10         number of iterations
BiasedMatrixFactorization         num_factors                        10         number of factors per user and item
                                  bias_reg                           0.01       bias regularization modifier
                                  reg_u                              0.015      user factor regularization
                                  reg_i                              0.015      item factor regularization
                                  learn_rate                         0.01       step size for learning
                                  bias_learn_rate                    1          step size modifier for the biases
                                  num_iter                           30         number of iterations
                                  bold_driver                        False      learning rate adaptation heuristics
                                  init_mean                          0          mean of the normal distribution used to initialize the factors
                                  init_stdev                         0.1        standard deviation of the normal distribution used to initialize the factors
                                  loss                               RMSE       the loss to optimize for: RMSE, MAE, or LogisticLoss
                                  max_threads                        100        maximum number of parallel threads
SocialMF                          num_factors                        10         number of factors per user and item
                                  regularization                     0.015      regularization constant
                                  social_regularization              1          strength of the social regularization
                                  learn_rate                         0.01       step size for learning
                                  num_iter                           30         number of iterations
                                  init_mean                          0          mean of the normal distribution used to initialize the factors
                                  init_stdev                         0.1        standard deviation of the normal distribution used to initialize the factors

Table A.2: Rating predictor hyperparameters in MyMediaLite 2.99.
Class                 Literature
Zero
Random
MostPopular
UserAttributeKNN      Desrosiers and Karypis [2011]
UserKNN               Desrosiers and Karypis [2011]
WeightedUserKNN       Desrosiers and Karypis [2011]
ItemAttributeKNN      Desrosiers and Karypis [2011]
ItemKNN               Desrosiers and Karypis [2011]
WeightedItemKNN       Desrosiers and Karypis [2011]
ItemAttributeSVM      Hsu et al. [2003]
BPRLinear             Gantner et al. [2010a]
WRMF                  Hu et al. [2008], Pan et al. [2008]
BPRMF                 Gantner et al. [2010a]
SoftMarginRankingMF   Rendle [2010a]
WeightedBPRMF         Gantner et al. [2011a]

Table A.3: Item recommenders in MyMediaLite 2.99.
A.5.2 Item Recommendation
Item recommenders implement the interface IItemRecommender. IIncrementalItemRecommender is the interface for methods that can learn incrementally as new feedback comes in.
Table A.3 lists all item recommenders in MyMediaLite, and their hyperparameters are shown in Table A.4.
Class                 Hyperparameter                Default    Description
UserAttributeKNN      k                             80         number of neighbors
UserKNN               k                             80         number of neighbors
WeightedUserKNN       k                             80         number of neighbors
ItemAttributeKNN      k                             80         number of neighbors
ItemKNN               k                             80         number of neighbors
WeightedItemKNN       k                             80         number of neighbors
ItemAttributeSVM      c                             1          C hyperparameter for the SVM
                      gamma                         0.002      γ parameter for the RBF kernel
BPRLinear             reg                           0.015      regularization constant
                      num_iter                      10         number of iterations
                      learn_rate                    0.05       step size for learning
                      fast_sampling_memory_limit    1024       MB to be used for fast sampling data
                      init_mean                     0          mean of the normal distribution used to initialize the factors
                      init_stdev                    0.1        standard deviation of the normal distribution used to initialize the factors
WRMF                  num_factors                   10         number of factors per user and item
                      regularization                0.015      regularization constant
                      c_pos                         1          the weight put on positive observations
                      num_iter                      15         number of iterations
                      init_mean                     0          mean of the normal distribution used to initialize the factors
                      init_stdev                    0.1        standard deviation of the normal distribution used to initialize the factors
BPRMF                 num_factors                   10         number of factors per user and item
                      bias_reg                      0          item bias regularization
                      reg_u                         0.0025     user factor regularization
                      reg_i                         0.0025     positive item factor regularization
                      reg_j                         0.00025    negative item factor regularization
                      num_iter                      30         number of iterations
                      learn_rate                    0.05       step size for learning
                      uniform_user_sampling         True       sample users uniformly
                      with_replacement              False      sample examples with replacement
                      bold_driver                   False      learning rate adaptation heuristics
                      fast_sampling_memory_limit    1024       MB to be used for fast sampling data
                      update_j                      True       perform updates on negative item factors
                      init_mean                     0          mean of the normal distribution used to initialize the factors
                      init_stdev                    0.1        standard deviation of the normal distribution used to initialize the factors
SoftMarginRankingMF   num_factors                   10         number of factors per user and item
                      bias_reg                      0          item bias regularization
                      reg_u                         0.0025     user factor regularization
                      reg_i                         0.0025     positive item factor regularization
                      reg_j                         0.00025    negative item factor regularization
                      num_iter                      30         number of iterations
                      learn_rate                    0.05       step size for learning
                      bold_driver                   False      learning rate adaptation heuristics
                      fast_sampling_memory_limit    1024       MB to be used for fast sampling data
                      init_mean                     0          mean of the normal distribution used to initialize the factors
                      init_stdev                    0.1        standard deviation of the normal distribution used to initialize the factors
WeightedBPRMF         num_factors                   10         number of factors per user and item
                      bias_reg                      0          item bias regularization
                      reg_u                         0.0025     user factor regularization
                      reg_i                         0.0025     positive item factor regularization
                      reg_j                         0.00025    negative item factor regularization
                      num_iter                      30         number of iterations
                      learn_rate                    0.05       step size for learning
                      bold_driver                   False      learning rate adaptation heuristics
                      init_mean                     0          mean of the normal distribution used to initialize the factors
                      init_stdev                    0.1        standard deviation of the normal distribution used to initialize the factors

Table A.4: Item recommender hyperparameters in MyMediaLite 2.99.
Item Recommendation Files
Files containing item recommendations list the recommended items for one user per line. An entry line contains the user ID, followed by a tab character, followed by the top N recommended items with their scores.
Example:
0
[9:3.5,7:3.4,3:3.1]
means that we recommend the items 9, 7, and 3 to user 0, and that their
respective scores are 3.5, 3.4, and 3.1.
Command-line tools: The item recommendation tool supports this file format. Use --prediction-file=FILE to specify the file name and --predict-items-number=N to specify the number of items to recommend to each user.
Class
Average
Maximum
Minimum
PairwiseWins
WeightedAverage
Table A.5: Group recommenders in MyMediaLite 2.99.
There is currently no class for reading this kind of file in MyMediaLite. Given the information here, it should be easy to implement, though.

Programming: Writing item recommendations to a stream or file can be performed using the extension methods defined in ItemRecommendation.Extensions. All one needs to do is to import the MyMediaLite.ItemRecommendation namespace with a using statement. Then writing out predictions is simple:
using MyMediaLite.ItemRecommendation;
...
recommender.WritePredictions(training_data, candidate_items,
predict_items_number,
prediction_file);
If you only want predictions for specic users, provide a list of user IDs to
the method:
recommender.WritePredictions(training_data, candidate_items,
predict_items_number,
prediction_file, user_list);
If you mapped the user and item IDs to internal IDs, supply the mapping
data as arguments to the method so that the internal IDs can be turned into
their original counterparts again:
recommender.WritePredictions(training_data, candidate_items,
predict_items_number,
prediction_file, user_list,
user_mapping, item_mapping);
A.5.3 Group Recommendation
Group recommenders implement the interface IGroupRecommender. MyMediaLite's group recommenders are listed in Table A.5. See section 6.2.1 for more information on group recommendation.

Class
Average
Maximum
Minimum
PairwiseWins
WeightedAverage

Table A.5: Group recommenders in MyMediaLite 2.99.

If you want to use group recommendation methods from the command line, you can use the item recommendation tool's --user-groups=FILE and --group-recommender=METHOD options.
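As a sketch, a group recommendation run from the command line could look like this (groups.txt and the output file name are placeholders; the remaining options are described above):

item_recommendation --training-file=u1.base --recommender=BPRMF \
    --user-groups=groups.txt --group-recommender=Average \
    --prediction-file=group_recommendations.txt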
A.5.4 Ensembles
Ensembles, that is, combinations of several recommenders, inherit from the Ensemble.Ensemble class. Currently, the only inheriting class is WeightedEnsemble, a recommender that scores user-item combinations as a weighted sum of the scores emitted by the individual recommenders.
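To illustrate the underlying idea (not the actual WeightedEnsemble API, whose members should be looked up in the source), a weighted combination of two trained recommenders amounts to:

// sketch: weighted score combination over two trained recommenders;
// WeightedEnsemble generalizes this to arbitrarily many components
float WeightedScore(MyMediaLite.IRecommender rec1, MyMediaLite.IRecommender rec2,
                    float weight1, float weight2, int user_id, int item_id)
{
	return weight1 * rec1.Predict(user_id, item_id)
	     + weight2 * rec2.Predict(user_id, item_id);
}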
A.6 Using MyMediaLite Recommenders
This section describes how to use the MyMediaLite recommenders from a programmer's perspective.
A.6.1 General Remarks
Setting up a recommender so that it can make predictions requires three steps:
1. creating the recommender by calling its constructor
2. assigning the training data
3. calling the Train() method
The training data may consist of interaction data, attribute data, and relation data. The property for the interaction data is called Ratings for rating predictors and Feedback for item recommenders. For attribute and relation data, the corresponding interfaces define the properties UserAttributes, ItemAttributes, UserRelation, and ItemRelation.
After these steps, the recommender can be called using the Predict()
method.
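A compact sketch of these three steps, here with attribute data for the attribute-aware ItemAttributeKNN recommender (the attribute file name is a placeholder, and the AttributeData.Read call and the entity mapping argument are assumptions that should be checked against MyMediaLite.IO and MyMediaLite.Data):

using System;
using MyMediaLite.Data;
using MyMediaLite.IO;
using MyMediaLite.ItemRecommendation;

// 1. create the recommender
var recommender = new ItemAttributeKNN();
// 2. assign interaction and attribute data; the attribute reader call
//    and its mapping argument are assumptions, not verified API
recommender.Feedback = ItemData.Read("u1.base");
recommender.ItemAttributes = AttributeData.Read("item_attributes.txt", new EntityMapping());
// 3. train; afterwards, predictions can be made
recommender.Train();
Console.WriteLine(recommender.Predict(1, 1));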
In the following, we give examples on how to use recommenders from different programming languages. All examples used here can also be found in the examples/ directory of the MyMediaLite source code.
A.6.2 C#
using System;
using MyMediaLite.Data;
using MyMediaLite.Eval;
using MyMediaLite.IO;
using MyMediaLite.RatingPrediction;
public class RatingPrediction
{
public static void Main(string[] args)
{
// load the data
var training_data = RatingData.Read(args[0]);
var test_data = RatingData.Read(args[1]);
// set up the recommender
var recommender = new UserItemBaseline();
recommender.Ratings = training_data;
recommender.Train();
// measure the accuracy on the test dataset
var results = recommender.Evaluate(test_data);
Console.WriteLine("RMSE={0} MAE={1}", results["RMSE"], results["MAE"]);
Console.WriteLine(results);
// make a prediction for a certain user and item
Console.WriteLine(recommender.Predict(1, 1));
}
}
Cross-validation can be run directly on a recommender, for example:

var bmf = new BiasedMatrixFactorization {Ratings = training_data};
Console.WriteLine(bmf.DoCrossValidation());

An item recommendation example in C#:

using System;
using MyMediaLite.Data;
using MyMediaLite.Eval;
using MyMediaLite.IO;
using MyMediaLite.ItemRecommendation;
public class ItemPrediction
{
public static void Main(string[] args)
{
// load the data
var training_data = ItemData.Read(args[0]);
var test_data = ItemData.Read(args[1]);
// set up the recommender
var recommender = new MostPopular();
recommender.Feedback = training_data;
recommender.Train();
// measure the accuracy on the test dataset
var results = recommender.Evaluate(test_data, training_data);
foreach (var key in results.Keys)
Console.WriteLine("{0}={1}", key, results[key]);
Console.WriteLine(results);

// make a score prediction for a certain user and item
Console.WriteLine(recommender.Predict(1, 1));
}
}
A.6.3 F#
A rating prediction example in F#:

open System
open MyMediaLite.IO
open MyMediaLite.RatingPrediction
open MyMediaLite.Eval
(* load the data *)
let train_data = RatingData.Read "u1.base"
let test_data = RatingData.Read "u1.test"
(* set up the recommender *)
let recommender = new UserItemBaseline(Ratings=train_data)
recommender.Train()
(* measure the accuracy on the test dataset *)
let result = recommender.Evaluate(test_data)
Console.WriteLine(result)
(* make a prediction for a certain user and item *)
let prediction = recommender.Predict(1, 1)
Console.WriteLine(prediction)
An item recommendation example in F#:

open System
open MyMediaLite.IO
open MyMediaLite.ItemRecommendation
open MyMediaLite.Eval
(* load the data *)
let train_data = ItemData.Read "u1.base"
let test_data = ItemData.Read "u1.test"
(* set up the recommender *)
let recommender = new UserKNN(K=20u, Feedback=train_data)
recommender.Train()
(* measure the accuracy on the test dataset *)
let result = recommender.Evaluate(test_data, train_data)
Console.WriteLine(result)
(* make a prediction for a certain user and item *)
let prediction = recommender.Predict(1, 1)
Console.WriteLine(prediction)
A.6.4 Python
A rating prediction example in Python:
#!/usr/bin/env ipy
import clr
clr.AddReference("MyMediaLite.dll")
from MyMediaLite import *
# load the data
train_data = IO.RatingData.Read("u1.base")
test_data = IO.RatingData.Read("u1.test")
# set up the recommender
recommender = RatingPrediction.UserItemBaseline() # don't forget ()
recommender.Ratings = train_data
recommender.Train()
# measure the accuracy on the test dataset
print Eval.Ratings.Evaluate(recommender, test_data)
# make a prediction for a certain user and item
print recommender.Predict(1, 1)
An item recommendation example in Python:
#!/usr/bin/env ipy
import clr
clr.AddReference("MyMediaLite.dll")
from MyMediaLite import *
# load the data
train_data = IO.ItemData.Read("u1.base")
test_data = IO.ItemData.Read("u1.test")
# set up the recommender
recommender = ItemRecommendation.UserKNN() # don't forget ()
recommender.K = 20
recommender.Feedback = train_data
recommender.Train()
# measure the accuracy on the test dataset
print Eval.Items.Evaluate(recommender, test_data, train_data)
# make a prediction for a certain user and item
print recommender.Predict(1, 1)
A.6.5 Ruby
A rating prediction example in Ruby:

#!/usr/bin/env ir
require 'MyMediaLite'
# load the data
train_data = MyMediaLite::IO::RatingData.Read("u1.base")
test_data = MyMediaLite::IO::RatingData.Read("u1.test")
# set up the recommender
recommender = MyMediaLite::RatingPrediction::UserItemBaseline.new()
recommender.Ratings = train_data
recommender.Train()
# measure the accuracy on the test dataset
eval_results = MyMediaLite::Eval::Ratings::Evaluate(recommender, test_data)
eval_results.each do |entry|
puts "#{entry}"
end
# make a prediction for a certain user and item
puts recommender.Predict(1, 1)
An item recommendation example in Ruby:

#!/usr/bin/env ir
require 'MyMediaLite'
using_clr_extensions MyMediaLite
# load the data
train_data = MyMediaLite::IO::ItemData.Read("u1.base")
test_data = MyMediaLite::IO::ItemData.Read("u1.test")
# set up the recommender
recommender = MyMediaLite::ItemRecommendation::MostPopular.new()
recommender.Feedback = train_data
recommender.Train()
# measure the accuracy on the test dataset
eval_results = MyMediaLite::Eval::Items.Evaluate(
recommender, test_data, train_data)
eval_results.each do |entry|
puts "#{entry}"
end
# make a prediction for a certain user and item
puts recommender.Predict(1, 1)
A.7 Implementing MyMediaLite Recommenders
All infrastructure for implementing new recommenders is already in place, so the developer can concentrate on programming the details of the algorithm without having to worry about storing the interaction data and so on. Furthermore, there is no need to manually integrate the new recommender into the command-line tools, as these tools automatically find all available recommenders using reflection.
For implementing the basic functionality of a recommender, the necessary
steps are:
• derive the new class from RatingPredictor or ItemRecommender
• define the model data structures
• define the hyperparameters as object properties
• write the Train() method
• write the Predict() method
To get the full functionality, LoadModel(), SaveModel(), and ToString() have to be defined.
Plenty of examples of implemented recommenders can be found in the MyMediaLite source code. Simple recommenders to start with are SlopeOne, MostPopular, and UserItemBaseline. A skeleton following the steps above is sketched below.
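Here is a minimal sketch of an item recommender that scores each item by its overall popularity in the training data. The class name is made up, and member names such as Feedback.Items and MaxItemID follow the conventions used in this appendix but should be checked against the actual base class:

using System.Collections.Generic;
using MyMediaLite.ItemRecommendation;

// minimal sketch: score each item by its popularity in the training data
public class SimplePopularity : ItemRecommender
{
	IList<int> item_counts; // model data structure

	public override void Train()
	{
		item_counts = new List<int>(new int[MaxItemID + 1]);
		// Feedback.Items is assumed to list the item ID of every feedback event
		foreach (int item_id in Feedback.Items)
			item_counts[item_id]++;
	}

	public override float Predict(int user_id, int item_id)
	{
		if (item_id < 0 || item_id > MaxItemID)
			return 0;
		return item_counts[item_id];
	}
}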
Appendix B
URLs
Software          URL
Gendarme          http://www.mono-project.com/Gendarme
Mono              http://www.mono-project.com
MonoDevelop       http://monodevelop.com
Embedding Mono    http://www.mono-project.com/Embedding_Mono
IKVM              http://www.ikvm.net
git               http://git-scm.com
Doxygen           http://www.doxygen.org
Perl              http://www.perl.org
GATE              http://gate.ac.uk
OpenCV            http://opencv.willowgarage.com
R                 http://www.r-project.org
RapidMiner        http://rapid-i.com
scikit-learn      http://scikit-learn.org
Shogun            http://www.shogun-toolbox.org
Weka              http://www.cs.waikato.ac.nz/ml/weka/

Table B.1: Development tools and other software mentioned in this thesis.
Website                          URL
AppRecommender                   https://github.com/tassia/AppRecommender
Bibsonomy                        http://www.bibsonomy.org
Donation Dashboard               http://dd.berkeley.edu
Jester                           http://eigentaste.berkeley.edu
MovieLens                        http://movielens.umn.edu
Internet Movie Database (IMDb)   http://www.imdb.com/interfaces
KDD Cup 2010                     https://pslcdatashop.web.cmu.edu/KDDCup
KDD Cup 2011                     http://kddcup.yahoo.com
Netflix Prize                    http://www.netflixprize.com
MovieLens datasets               http://www.grouplens.org/node/73
Yahoo! Music Ratings             http://webscope.sandbox.yahoo.com
RecLab Prize on Overstock.com    http://overstockreclabprize.com
RecSysWiki                       http://recsyswiki.com

Table B.2: Academic/experimental systems with recommender functionality and other recommender system resources mentioned in this thesis.
Website              URL
Amazon               http://amazon.com
Asos                 http://www.asos.de
Delicious            http://delicious.com
Facebook             http://www.facebook.com
Findory              http://www.findory.com
GoodReads            http://www.goodreads.com
GMail                http://mail.google.com
Google               http://www.google.com
Google Maps          http://maps.google.com
Google News          http://news.google.com
last.fm              http://last.fm
LibraryThing         http://www.librarything.com
LinkedIn             http://www.linkedin.com
MoviePilot           http://moviepilot.com
MoviePilot Germany   http://moviepilot.de
Net-A-Porter         http://www.net-a-porter.com
Netflix              http://netflix.com
Nokia Maps           http://maps.nokia.com
Otto                 http://www.otto.de
Pandora              http://pandora.com
Songkick             http://songkick.com
TiVo                 http://tivo.com
Twitter              http://twitter.com
Xing                 http://xing.com
Yahoo!               http://www.yahoo.com
Yahoo! Answers       http://answers.yahoo.com
Yahoo! News          http://news.yahoo.com
Zalando              http://www.zalando.de

Table B.3: Websites mentioned in this thesis.
Software                           URL
MyMediaLite                        http://ismll.de/mymedialite
MyMediaLite for Java               https://github.com/zenogantner/MyMediaLiteJava
MyMediaLite GitHub repository      https://github.com/zenogantner/MyMediaLite
MyMediaLite Gitorious repository   http://gitorious.org/mymedialite
Evaluation using MyMediaLite       http://ir.ii.uam.es/evaluation/rs/
MyMediaLite for RapidMiner         http://elico.rapid-i.com/recommender-extension.html
Duine                              http://duineframework.org
GraphLab                           http://graphlab.org
LensKit                            http://lenskit.grouplens.org
Mahout                             http://mahout.apache.org
RecLab                             http://code.richrelevance.com
recommenderlab                     http://cran.r-project.org/web/packages/recommenderlab
Waffles                            http://waffles.sourceforge.net
jCOLIBRI                           http://gaia.fdi.ucm.es/research/colibri/jcolibri
MyCBR                              http://mycbr-project.net
PyRSVD                             http://code.google.com/p/pyrsvd/
CoRank                             http://cofirank.org/
SVDFeature                         http://apex.sjtu.edu.cn/apex_wiki/svdfeature
Jellyfish                          http://research.cs.wisc.edu/hazy/victor/download/
Likelike                           http://code.google.com/p/likelike/
OpenSlopeOne                       http://code.google.com/p/openslopeone/
Vogoo                              http://sourceforge.net/projects/vogoo/
Vowpal Wabbit                      https://github.com/JohnLangford/vowpal_wabbit/
Wooflix                            http://code.gustavonarea.net/wooflix.tar.gz
Recommendable                      https://github.com/davidcelis/recommendable
ActsAsRecommendable                https://github.com/maccman/acts_as_recommendable
SUGGEST                            http://glaros.dtc.umn.edu/gkhome/suggest/overview
LibFM                              http://cms.uni-konstanz.de/informatik/rendle/software/libfm/
PMF/PTF                            http://www.mit.edu/~rsalakhu/BPMF.html
BPTF                               http://www.cs.cmu.edu/~lxiong/bptf/bptf.html
MultiLens                          http://knuth.luther.edu/~bmiller/dynahome.php?page=multilens

Table B.4: Websites of recommender system software mentioned in this thesis.
Index

0/1 classification, 62
k-fold cross-validation, 82
.NET platform, 80

abstract class, 107
accuracy, 40
AdaBoost, 77
adaptivity, 41
age, 81
ALS, see alternating least squares
alternating least squares, 36, 86
Apache Mahout, 86
area under the ROC curve, 41, 61, 82
aspect model, 58
association rules, 87
attribute-based kNN, 81
attribute-to-factor mapping, 45–59
AUC, see area under the ROC curve

bagging, 77, 87
Bayesian Context-Aware Ranking, 38
Bayesian Personalized Ranking, 38–39, 61–66, 82
binary classification, 62
BPR, see Bayesian Personalized Ranking
BPR-MF, 39, 81

C, 92
C++, 80, 92
C#, 88, 92, 122
case-based reasoning, 87
categories, 81
chronological splits, 82
classification, 24, 27, 61
clustering, 24, 87
co-clustering, 80
COFI, 87
cold-start problem, 45–59
collaborative filtering, 61
command-line tools, 81
competitive collaborative filtering, 29
computer vision, 79
confidence, 40
conjugate gradient, 36
content-based filtering, 88
context, 26
context-aware recommendation, 29, 89
contextual modeling, 40
contextual post-filtering, 40
contextual pre-filtering, 40
CoRank, 38, 88
cosine similarity, 34
cost-sensitive learning, 63
coverage, 40
Crab, 87
cross-validation, 83

decision trees, 35
demographic data, 81
dense matrix, 108
diagonal, 84
distributed matrix factorization, 83
diversity, 41
documentation, 84
dot product, 45
Duine, 88, 92

e-commerce, 27
e-LICO, 96
EasyRec, 87
Eigentaste, 87
EM, see expectation-maximization
ensemble, 70, 87, 122
epoch, 35
evaluation, 40, 51, 82, 93
expectation-maximization, 36
explicit context, 29
explicit feedback, 25

F#, 80, 123
Facebook, 28
Factorization Machines, 38
feedback data, 33
Filterbots, 58
filtered recommendation, 29
fLDA, 57
folksonomy, 29
free software, 91

GATE, 79
generalized linear models, 77
Generalized Matrix Factorization, 58
geographic data, 81
GNU General Public License, 79
gradient boosted decision trees, 77
GraphLab, 86, 92
grid search, 82
group recommendation, 81, 91, 92, 121

Hadoop, 86, 88
hyperparameter tuning, 91

implicit context, 29
implicit feedback, 27, 81
incremental updates, 82, 91
inferred context, 29
information retrieval, 40, 42
interface, 107
item attributes, 34, 81
item average, 87
item ranking, 40
item recommendation, 26, 81
item recommendation from positive-only feedback, 28
item-based kNN, 34, 86

Jaccard index, 34
Java, 80, 87, 92
jCOLIBRI, 87
Jellyfish, 83, 88
Jester, 27

KDD Cup 2011, 44, 67
keywords, 81
kNN, 33, 87, 90

latent feature log-linear model, 58, 89
LDA, 57
learning reduction, 61
learning to rank, 61
LensKit, 86, 92
LibFM, 89
like, 28
Likelike, 88
linear regression, 35
locality sensitive hashing, 88
location, 81
logistic regression, 61, 77, 89
LogitBoost, 77
loss function, 23

machine learning, 23
MAE, see mean absolute error
Mahout, 86, 92
Markov Chain Monte Carlo, 36, 86
Markov property, 31
match making, 29
Matlab, 89
matrix, 107
matrix factorization, 36, 39, 80, 81, 86–88, 91, 92
mean absolute error, 41, 82, 103, 118
mean average precision, 82
mean reciprocal rank, 82
memory usage, 93
MinHash, 34
Mono, 93
most popular items, 33
MovieLens, 44
multi-core, 83, 91, 92
multi-label classification, 27
multi-task learning, 62
MultiLens, 90
Multiverse recommendation, 40
music recommendation, 31
MyCBR, 87
MyMediaLite, 79–96, 98, 101–127
MySQL, 89

Naive Bayes, 35
natural language processing, 79
neighborhood-based models, 33
Nelder-Mead method, 82
Netflix Prize, 44, 82
neural networks, 77
new-item problem, 98
new-user problem, 98
non-negative matrix factorization, 86
normalized discounted cumulative gain, 82
novelty, 40
NSVD1, 56

one-class classification, 28
one-class feedback, 27
online shopping, 87
open source, 91
OpenCV, 79
OpenSlopeOne, 89
ordinal regression, 26

PageRank, 40
pairwise classification, 61
pairwise interaction tensor factorization, 40
Pairwise Preference Regression, 58
pairwise ranking, 61
Pandora, 25
PARAFAC, 48
parallel processing, 83
parallel SGD, 83
PCA, 87
Pearson correlation, 33
per-user AUC, 42
personalized recommendations, 26
PHP, 89
PITF, 40
playlist recommendation, 31
positive-only feedback, 27, 81, 91
post-filtering, 40
pre-filtering, 40
precision, 42
precision at n, 42, 82
prediction accuracy, 40
prediction tasks, 23
privacy, 41
probabilistic latent semantic indexing, 86
probabilistic matrix factorization, 89
probabilistic tensor factorization, 89
property, 108
publicly available software, 89, 91
PyRSVD, 88
Python, 80, 87–89, 124

R, 79, 87
R recommenderlab, 87
Rails, 89
random coordinate descent, 77
Random Forests, 77
ranking, 61
RapidMiner, 79, 96
rating prediction, 25, 80, 91, 102
recall, 42
recall at n, 42
RecLab, 87
recommenderlab, 87
reduction, 61
reflection, 126
regression, 24, 26
Regression-based Latent Factor Models, 58
regret, 62
reinforcement learning, 99
repeat buying, 31
repeated events, 31
ridge regression, 50
risk, 41
RMSE, see root mean square error
robustness, 41
root mean square error, 41, 82, 103, 118
Ruby, 80, 89, 125

scalability, 41, 91, 93
scikit-learn, 79
sequential recommendation, 31
serendipity, 40
serialization, 84
SGD, see stochastic gradient descent
Shogun, 79
similar items, 26
similarity measure, 87
Slope-One, 80, 86, 87, 89
social network, 29, 81
SocialMF, 82
soft clustering, 66
SQL, 82
stochastic gradient descent, 35, 63, 86
structured prediction, 27
sub-epoch, 84
SUGGEST, 92
supervised machine learning, 19, 23
support vector machine, 89
support vector machines, 35
SVD++, 37, 86, 89
SVDFeature, 38, 88

tag recommendation, 29, 40
Taste, 86
Taste.NET, 88
taxonomies, 81
tensor factorization, 86, 89
text mining, 79
thumbs up, 28
Tied Boltzmann Machines, 58
time-aware Bayesian probabilistic tensor factorization, 38
time-aware rating prediction, 91
time-aware recommendation, 29
timeSVD, 37
timeSVD++, 38
trust, 40
Tucker decomposition, 48
TV series, 81
Twitter, 28
two-way aspect model, 58

Unified Boltzmann Machine, 58
unit testing, 107
unsupervised learning, 24
Untied Boltzmann Machines, 58
user attributes, 34, 81
user clustering, 66
user preference, 40
user recommendation, 29
user-based collaborative filtering, 33
user-based kNN, 33, 87
utility, 41

vector, 107
virtual machine, 93
Vogoo, 89
Vowpal Wabbit, 88

Waffles, 87
WBPR-MF, 81
weighted regularized matrix factorization, 39
Weka, 79
Wooflix, 89
WR-MF, see weighted regularized matrix factorization

Yahoo! Music Ratings, 44