Recommender Systems
Daniel Rodriguez, University of Alcala
Windsor, Aug 5-16, 2013 (Erasmus IP)
Some slides and examples based on Chapter 9, Mining of Massive Datasets, Rajaraman et al., 2011 (draft).

Outline
1 Introduction
2 Recommender Systems
3 Algorithms
4 Evaluation of Recommender Systems
5 Dimensionality Reduction
6 Tools
7 Problems and Challenges
8 Conclusions
9 References

Introduction

What are recommender systems?
Definition (by Ricci et al. [10]): Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user. The suggestions relate to various decision-making processes, such as what items to buy, what music to listen to, or what online news to read.
Item is the general term used to denote what the system recommends to users. An RS normally focuses on a specific type of item (e.g., CDs or news), and accordingly its design, its graphical user interface, and the core recommendation technique used to generate the recommendations are all customized to provide useful and effective suggestions for that specific type of item.

From http://en.wikipedia.org/wiki/Recommender_system: Recommender systems (RS) are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' that a user would give to an item (such as music, books, or movies) or social element (e.g., people or groups) they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user's social environment (collaborative filtering approaches).
The goal of these systems is to serve the right items to a user in a given context, or to create bundles of similar products, in order to optimize long-term business objectives.
Foundations based on data mining, information retrieval, statistics, etc. Established area of research with an ACM conference: http://recsys.acm.org/

Recommender systems: Applications
Product recommendations: perhaps the most important use of recommender systems is at on-line retailers; e.g., Amazon and similar on-line vendors strive to present users with suggestions of products that they might like to buy. These suggestions are not random, but based on decisions made by similar customers.
News articles: news services classify news that is interesting for some people in order to offer it to similar users. The similarity might be based on the similarity of important words in the documents, or on the articles that are read by people with similar reading tastes. (The same principles apply to recommending blogs, videos, etc.)
Movie/music recommendations: e.g., Netflix offers customers recommendations of movies they might be interested in, based on ratings provided by users.
Application stores, social networks, etc.

Recommender systems: Examples
Figure: Amazon recommendations
Figure: Netflix recommendations
Figure: Spotify recommendations

Long Tail Phenomenon
Physical institutions can only provide what is popular (shelf space is scarce), while on-line institutions can make everything available [2]. The greater the choice, the greater the need for better filters.
Recommender systems take advantage of the large amounts of data available on the Internet to create recommendations that would be impossible to produce in bricks-and-mortar stores. This is called the long tail phenomenon, and it explains why physical stores cannot offer personalized recommendations.

Recommender systems: History
The first recommender system, Tapestry, was designed to recommend documents from newsgroups. Its authors also introduced the term collaborative filtering, as they used social collaboration to help users cope with a large volume of documents.
P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: An Open Architecture for Collaborative Filtering of Netnews, Proc. of Computer Supported Cooperative Work (CSCW), pp. 175-186, 1994. http://ccs.mit.edu/papers/CCSWP165.html

The Netflix Prize
The importance of predicting ratings accurately is so high that in October 2006 Netflix decided to create a contest, the Netflix Prize, to try to improve its recommendation algorithm: http://www.netflixprize.com/
Netflix offered $1M to the first team capable of proposing an algorithm 10% better than its then-current algorithm (CineMatch). To do so, Netflix published a dataset with:
  approximately 17,000 movies
  500,000 users
  100 million ratings (training set of 99 million ratings)
The root-mean-square error (RMSE) was used to measure the performance of the algorithms. CineMatch had an RMSE of 0.9525.

The Netflix Prize: Results
The contest finished almost 3 years later, and the winning team, BellKor's Pragmatic Chaos, improved on the algorithm by 10.6%. The winners also had to share the algorithm and its source code with Netflix. More than 40,000 teams from 186 different countries participated in the contest. The winning algorithm was a composition of different algorithms that had been developed independently. Moshfeghi et al. [7] took another approach, adding semantics to the algorithms to improve accuracy.

Characteristics
A recommender system must be reliable, providing good recommendations and showing information about the recommendations (explanations, details, etc.).
Another important point of these systems is how they display information about the recommended products:
  The recommended item must be easy for the user to identify.
  The item must also be easy to evaluate/correct (I don't like it, I already have it, etc.).
  The ratings must be easy to understand and meaningful.
  Explanations must provide a quick way for the user to evaluate the recommendation.

Characteristics: Degree of Personalisation
The degree of personalisation of these recommendations can differ from site to site. Galland [4] classified recommendations into 4 groups:
  Generic: everyone receives the same recommendations.
  Demographic: everyone in the same category receives the same recommendations.
  Contextual: the recommendation depends only on the current activity.
  Persistent: the recommendation depends on long-term interests.

Other Characteristics
A recommender system usually involves [1]:
  Large-scale machine learning and statistics:
    off-line models (capture global and stable characteristics);
    on-line models (incorporate dynamic components);
    explore/exploit (active and adaptive experimentation).
  Multi-objective optimization: click-through rates (CTR), engagement, advertising revenue, diversity, etc.
  Inferring user interest: constructing user profiles, natural language processing to understand content (topics, aboutness, entities, follow-up of a story, breaking news, etc.).

Recommender Systems

Utility Matrix
Recommender systems are often seen as a function:
  C, the set of customers
  I, the set of items
  R, the set of ratings, e.g., [1-5], [+,-], [a+,E-], etc.
  Utility function: u : C × I → R

        I1   I2   ...   In
  U1     4    ?          6
  U2     1    5
  ...    5    7     6
  Un          3          3

Blanks correspond to items that have not been rated. Objective: would U1 like I2?

Sparse Matrices
Typically most of the values are empty; such matrices are called sparse matrices.
  Users rate just a small percentage of all available items.
  An unknown rating implies that we have no explicit information about the user's preference for that particular item.
The goal of an RS is to predict the blanks (fill the gaps) in the matrix. Generally, it is not necessary to fill in all of the blanks, because we are usually not interested in low-rated items: the majority of RSs recommend a few of the highest-rated items.

Sparse Matrices - Representation
In order to fit in memory, zero values are not explicitly represented. For instance, in ARFF format (Weka),
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
is represented by attribute number and value as:
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
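The same idea carries over to other tools. As a minimal sketch in R (using the Matrix package, which is our choice for illustration and is not mentioned in the slides), a sparse utility matrix stores only the observed entries:

library(Matrix)   # sparse matrix support (assumption: not part of the slides)

# Store only the observed ratings as (row, column, value) triplets
ratings <- sparseMatrix(i = c(1, 1, 2),   # user indices
                        j = c(2, 4, 3),   # item indices
                        x = c(5, 3, 4),   # rating values
                        dims = c(2, 4))
ratings               # printed with "." for the unrepresented entries
object.size(ratings)  # far smaller than a dense matrix for large, sparse data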
Workflow in recommendations
There are many different ways to generate recommendations, but the most used are based on previous knowledge about similar users or similar contents [5].

Naïve Recommender Systems
  Editorial recommendations.
  Simple aggregates: Top 10, Most Popular, etc.
From now on, we will refer to systems tailored to individual users (Amazon, Netflix, etc.).

Recommender Systems: Classification
Typically, as in Rajaraman et al. [8], recommender systems are classified according to the technique used to create the recommendation (i.e., to fill the blanks in the utility matrix):
  Content-based systems examine properties of the items recommended and offer similar items.
  Collaborative filtering (CF) systems recommend items based on similarity measures between users and/or items; the items recommended to a user are those preferred by similar users.
  Hybrid systems mix both previous approaches.
Figure: Classification of recommender systems [8]

Recommender Systems: Content-based
Content-based systems examine properties of the items in order to recommend items that are similar in content to items the user has liked in the past, or that match attributes of the user. For instance:
  Movie recommendations with the same actors, director, genres, etc. If a Netflix user has watched many action movies, recommend movies classified in the database as having the action genre.
  Textual content (news, blogs, etc.): recommend other sites, blogs, or news with similar content (we will cover how to measure similar content).

Item Profiles
For each item, we need to create an item profile:
  A profile is a set of features and is context specific (e.g., for films: actors, director, genre, title, etc.).
  Documents: sets of important words. Important words are usually selected using the TF.IDF (term frequency times inverse document frequency) metric.

Term Frequency-Inverse Document Frequency
The term frequency-inverse document frequency (TF-IDF) is a statistic which reflects how important a word is to a document (http://en.wikipedia.org/wiki/Tf-idf). The profile of a document is the set of words with the highest tf-idf, which assigns a weight to the term t in a document d:

  $tf.idf_{t,d} = tf_{t,d} \cdot idf_t$   (1)

where
  $tf_{t,d}$ is the number of times term t occurs in document d (there are better approaches), and
  $idf_t = \log \frac{N}{df_t}$, where $df_t$ is the number of documents that contain the term t and N is the total number of documents.
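As a quick sanity check of Eq. (1), here is a toy computation in R (our own example, not from the slides):

# Toy corpus: counts of the word "action" in 4 documents
tf  <- c(d1 = 3, d2 = 0, d3 = 1, d4 = 0)   # occurrences of the term per document
N   <- length(tf)                          # total number of documents
df  <- sum(tf > 0)                         # documents containing the term
idf <- log(N / df)                         # inverse document frequency
tfidf <- tf * idf                          # Eq. (1): high for d1, zero where absent
tfidf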
Tags
Tags are also used to create item profiles; the tags are then used to find similar content. For example, it is difficult to automatically extract features from images, so users are asked to tag them. A real example is del.icio.us, in which users tag Web pages: http://delicious.com/
The drawback of this approach is that it is difficult to get users to tag items.

User Profiles
We also need to create vectors with the same components that describe the user's preferences. Profile information is classified as implicit or explicit:
  Implicit refers to observing the user's behaviour with the system, for example watching certain films, listening to music, reading certain kinds of news, or downloading applications/documents.
  Explicit refers to the user providing information to the system.

Collaborative Filtering
In collaborative filtering (CF), a user is recommended items based on the past ratings of all users collectively. CF can be of two types:
  User-based collaborative filtering:
    1 Given a user U, find the set of other users D whose ratings are similar to the ratings of U.
    2 Use the ratings of those like-minded users found in step 1 to calculate a prediction for U.
  Item-based collaborative filtering:
    1 Build an item-item matrix determining the relationships between pairs of items.
    2 Using this matrix and data on the current user, infer the user's taste.

Hybrid Methods
  Implement two separate recommenders and combine their predictions.
  Add content-based methods to collaborative filtering:
    item profiles to deal with the new-item problem;
    demographics to deal with the new-user problem.

Ratings
A rating is a (usually numeric) value given by a user to a specific item/user [10]. The way we populate the ratings matrix is very important; however, acquiring data from which to build a ratings matrix is often a very difficult step. There are two general approaches to discovering the users' rating values:
  Ask the user directly to enter ratings. This approach is typically used by movie sites (Netflix) but is limited by the willingness of users to rate items.
  Observe the behaviour of the user. If a user watches a movie, reads an article, or buys an item, we can say that the user likes that particular item. In this case, the ratings matrix is filled with the value 0 if the user does not buy/watch/read the item and 1 if the user does (see the sketch below).

Ratings Classification
  Implicit ratings: based on interaction and time (purchases, clicks, browsing/page-view time, etc.), used to generate an implicit numeric rating.
  Explicit ratings:
    Numeric ratings: a numeric scale with between 2 (+/-) and 15 values (the more levels, the more variance; should be normalized).
    Partial order: comparison between two items.
    Semantic information: tags, labels.
  Hybrid: mixing both previous approaches.
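A minimal sketch of the implicit approach in R (hypothetical log data, our own example): turning interaction logs into a 0/1 ratings matrix.

# Hypothetical interaction log: one row per user-item event
logs <- data.frame(user = c("u1", "u1", "u2", "u3", "u3"),
                   item = c("i1", "i3", "i1", "i2", "i3"))

Rimp <- table(logs$user, logs$item)  # counts of interactions per user-item pair
Rimp[Rimp > 1] <- 1                  # collapse repeat interactions into a binary rating
Rimp                                 # 1 = bought/watched/read, 0 = no interaction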
Measuring Similarity
One of the more complicated tasks is to determine the similarity between users/items (the duality of similarity [8]). Following the example by Rajaraman and Ullman [8] below, we can observe that users A and C rated the items TW1 and SW1 in totally different ways, so intuitively we could say that there is a large distance between them. We next review different methods that can be used to measure the similarity between users.

        HP1  HP2  HP3  TW1  SW1  SW2  SW3
  A      4              5    1
  B      5    5    4
  C                     2    4    5
  D           3                        3

Table: Utility matrix [8]

Jaccard Distance
The Jaccard index (aka Jaccard similarity coefficient) measures the similarity of two sets (http://en.wikipedia.org/wiki/Jaccard_index):

  $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$   (2)

The Jaccard distance measures dissimilarity and is defined as the complement of Eq. (2):

  $J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$   (3)

The Jaccard distance only takes into account the number of items rated in common; the actual ratings are discarded, so it loses accuracy when detailed recommendations are needed.

Jaccard Distance: Example
Following the previous example [8]: A and B have an intersection of size 1, i.e., |A ∩ B| = 1, and a union of size 5, |A ∪ B| = 5 (items rated). Thus their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; i.e., they are very far apart. A and C have a Jaccard similarity of 2/4, so their Jaccard distance is the same, 1/2. Although A thus appears to be closer to C than to B, that conclusion seems intuitively wrong: A and C disagree on the two movies they both watched, while A and B seem both to have liked the one movie they watched in common [8].

Cosine Distance
  Items are represented as vectors over the user space.
  Similarity is the cosine of the angle between two vectors.
  The range is between 1 (perfect) and -1 (opposite).
Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes as (http://en.wikipedia.org/wiki/Cosine_similarity):

  $\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$   (4)

In information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (tf-idf weights) cannot be negative; the angle between two term-frequency vectors cannot be greater than 90 degrees.

Cosine Distance: Example
Following the previous example [8], the cosine of the angle between users A and B is

  $\frac{4 \cdot 5}{\sqrt{4^2 + 5^2 + 1^2} \cdot \sqrt{5^2 + 5^2 + 4^2}} = 0.380$   (5)

and the cosine of the angle between A and C is

  $\frac{5 \cdot 2 + 1 \cdot 4}{\sqrt{4^2 + 5^2 + 1^2} \cdot \sqrt{2^2 + 4^2 + 5^2}} = 0.322$   (6)

A larger (positive) cosine implies a smaller angle and therefore a smaller distance, so this measure tells us that A is slightly closer to B than to C [8]. Note that empty values are set to 0 here, a questionable choice, as it could suggest that users dislike an item rather than that no rating was given.
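These numbers are easy to verify. A small R sketch (our own code, using the utility matrix above with 0 for missing ratings, as the slides note):

# Utility matrix from [8]; 0 stands for "not rated"
M <- rbind(A = c(4, 0, 0, 5, 1, 0, 0),
           B = c(5, 5, 4, 0, 0, 0, 0),
           C = c(0, 0, 0, 2, 4, 5, 0),
           D = c(0, 3, 0, 0, 0, 0, 3))
colnames(M) <- c("HP1", "HP2", "HP3", "TW1", "SW1", "SW2", "SW3")

jaccard <- function(x, y)             # sets = the items each user rated
  sum(x > 0 & y > 0) / sum(x > 0 | y > 0)
cosine  <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

jaccard(M["A", ], M["B", ])           # 0.2 -> Jaccard distance 0.8
jaccard(M["A", ], M["C", ])           # 0.5 -> Jaccard distance 0.5
cosine(M["A", ], M["B", ])            # 0.380, Eq. (5)
cosine(M["A", ], M["C", ])            # 0.322, Eq. (6)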
Pearson Correlation
Another common measure of similarity is the Pearson correlation coefficient between the ratings of two users, a and u:

  $Pearson_{a,u} = \frac{\sum_{i \in I} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in I} (r_{a,i} - \bar{r}_a)^2} \sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}}$   (7)

where
  I is the set of items rated by both users,
  $r_{u,i}$ is the rating given to item i by user u, and
  $\bar{r}_a$, $\bar{r}_u$ are the mean ratings given by users a and u respectively.

Rounding Data
Rounding the data is a pre-processing step applied to the information before any distance measure. Following the example, we could consider ratings of 3, 4, and 5 as 1, and consider ratings 1 and 2 as unrated (blanks):

        HP1  HP2  HP3  TW1  SW1  SW2  SW3
  A      1              1
  B      1    1    1
  C                          1    1
  D           1                        1

The Jaccard distance between A and B is now 3/4, while between A and C it is 1. Now C appears further away from A than B does, which is intuitively correct. Applying the cosine distance leads to the same conclusion.

Normalizing Ratings
As with rounding, this is a preprocessing step applied before any distance measure. Normalised ratings are calculated by subtracting from each rating the average rating of that user. Low ratings then become negative and high ratings positive. Following the example, the normalised rating matrix is as follows:

        HP1   HP2   HP3   TW1   SW1   SW2   SW3
  A     2/3               5/3  -7/3
  B     1/3   1/3  -2/3
  C                      -5/3   1/3   4/3
  D            0                             0

Normalizing Ratings: Example
Applying the Jaccard distance to this normalised matrix, note that the ratings of user D become 0. The ratings given by user D are probably not interesting, as this user has always rated items with the same value. The cosine similarity for users A and B is now

  $\frac{(2/3) \cdot (1/3)}{\sqrt{(\frac{2}{3})^2 + (\frac{5}{3})^2 + (-\frac{7}{3})^2} \sqrt{(\frac{1}{3})^2 + (\frac{1}{3})^2 + (-\frac{2}{3})^2}} = 0.092$   (8)

and the cosine similarity for users A and C with normalized values is

  $\frac{(5/3) \cdot (-5/3) + (-7/3) \cdot (1/3)}{\sqrt{(\frac{2}{3})^2 + (\frac{5}{3})^2 + (-\frac{7}{3})^2} \sqrt{(-\frac{5}{3})^2 + (\frac{1}{3})^2 + (\frac{4}{3})^2}} = -0.559$   (9)

A and C are now much further apart than A and B. This result makes sense: A and C rated two films in common with very different values, while A and B rated only one film in common, with similar values.
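Normalization and the Pearson coefficient are closely related: centering each user's ratings and then taking the cosine over co-rated items gives Eq. (7). A short R sketch (our own code, reusing the matrix M defined earlier, with NA for missing ratings):

Mna <- M; Mna[Mna == 0] <- NA                   # treat 0 as "not rated"
centered <- Mna - rowMeans(Mna, na.rm = TRUE)   # subtract each user's mean rating
round(centered, 2)                              # reproduces the normalised matrix above

# Pearson correlation over the items rated by both users, Eq. (7)
pearson <- function(a, u) {
  both <- !is.na(a) & !is.na(u)                 # items co-rated by both users
  cor(a[both], u[both])
}
pearson(Mna["A", ], Mna["C", ])                 # negative: A and C disagree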
Algorithms

Knowledge Discovery in Databases (KDD)
RSs and natural language processing (NLP) build on a subfield of computer science that tries to discover patterns in large data sets, called data mining or Knowledge Discovery in Databases (KDD). In fact, data mining is typically one step within the KDD process. The main goal of data mining (the analysis step of the KDD process [3]) is to extract information from a data set and transform it into an understandable structure for further use. This transformation involves several steps, including database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and on-line updating. The step which requires most of the time is the preparation of the data.

KDD Process
KDD is composed of the following steps [3]:
1 Selection
2 Pre-processing
3 Transformation
4 Data Mining
5 Interpretation/Evaluation
Figure: Data mining steps (Fayyad et al., 96)

Text Mining
Text mining, aka text analytics or information retrieval, tries to obtain computer-usable information from unstructured (non-measurable) data. The Oxford Dictionary defines text mining as the process or practice of examining large collections of written resources in order to generate new information, typically using specialized computer software. Typically, non-measurable data refers to e-mails, newspapers, research papers, etc. The process of text mining is usually divided into 4 different steps:
1 Information retrieval (IR) systems
2 Natural language processing (NLP)
3 Information extraction (IE)
4 Data mining (DM)

Web Mining
Web mining uses data mining tools to extract information from:
  Web pages (Web content mining)
  Links (Web structure mining)
  Users' navigation data (Web usage mining), based on Web server logs

Nearest Neighbour Algorithm
The k-Nearest Neighbour (k-NN) algorithm is one of the simplest machine learning algorithms but works very well in practice (http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm). The idea is to predict a classification based on the k nearest neighbours of the item we want to classify. No model is actually learned; instead, the active case is used to search for the k most similar cases.

A simple illustration of kNN, where we need to classify the green circle:
1 If we select k = 3, the nearest neighbours are 2 red triangles and 1 blue square (inside the solid black circle), so the class of the circle will be red triangle.
2 In the case of k = 5 (dashed black circle), the nearest neighbours are 2 red triangles and 3 blue squares, so the class of the circle will be blue square.
The selection of k is very important (it can change the class).

kNN: Complexity
1 The expensive step is finding the k most similar customers. This is too expensive to do at runtime, so it needs to be pre-computed.
2 Clustering and partitioning can be used as alternatives, but quality degrades.
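In a recommender setting, kNN predicts a user's rating from the k most similar users. A minimal user-based sketch in R (our own code, not the slides', reusing M and the cosine function defined earlier):

predict_knn <- function(M, user, item, k = 2) {
  others <- setdiff(rownames(M), user)
  # Similarity of the active user to every other user (cosine over raw ratings)
  sims <- sapply(others, function(u) cosine(M[user, ], M[u, ]))
  # Keep the k most similar users that actually rated the item
  rated <- others[M[others, item] > 0]
  top   <- head(rated[order(sims[rated], decreasing = TRUE)], k)
  # Similarity-weighted average of their ratings
  sum(sims[top] * M[top, item]) / sum(sims[top])
}
predict_knn(M, "A", "HP2")   # predicted rating of user A for item HP2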
Clustering
With little data (few ratings), or to reduce the size of the matrix, clustering can be used, e.g., a hierarchical clustering run until a desired level. Clustering the Harry Potter films into HP and the Star Wars films into SW in the running example gives:

        HP    TW   SW
  A      4     5    1
  B    4.67
  C            2   4.5
  D      3          3

Association Rules
Association rules are a data mining technique to find relations between variables, initially used for shopping-behaviour analysis. Associations are represented as rules, Antecedent → Consequent: if a customer buys A, that customer also buys B. An association between A and B means that the presence of A in a record implies the presence of B in the same record, e.g.:

  {Nappies, Chips} → Beer

The quality of a rule requires a minimum support and confidence:
  Support: the proportion of times that the rule applies.
  Confidence: the proportion of times that the rule is correct.

Slope One
This algorithm was introduced in 2005 and is the simplest form of item-based algorithm using ratings. For a better understanding, we explain it with a small example. First we need a dataset; the next table shows the ratings given by 3 people to 3 different items.

  Customer   Item A   Item B   Item C
  John         5        3        2
  Mark         3        4        -
  Lucy         -        2        5

Table: Example for the Slope One algorithm

The algorithm is based on the average differences between users who rated both the item we want to predict and a reference item. Suppose we want to know the rating that Lucy would give to item A, taking item B as the reference item:
  John: item A rating 5, item B rating 3, difference 2
  Mark: item A rating 3, item B rating 4, difference -1
  Average difference between item A and item B: (2 + (-1))/2 = 0.5
  Rating from Lucy for item B: 2
  Estimated rating from Lucy for item A: 2 + 0.5 = 2.5

Similarly, suppose we want to know the rating of Mark for item C. We calculate the average differences of the other users between item C and the reference item (item B):
  John: item C rating 2, item B rating 3, difference -1
  Lucy: item C rating 5, item B rating 2, difference 3
  Average difference between item C and item B: ((-1) + 3)/2 = 1
  Rating from Mark for item B: 4
  Estimated rating from Mark for item C: 4 + 1 = 5

Therefore, the final table of ratings is completed as follows:

  Customer   Item A   Item B   Item C
  John         5        3        2
  Mark         3        4        5
  Lucy        2.5       2        5
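A compact R sketch of this procedure (written for this toy table; the full algorithm averages over all co-rated reference items, not just one):

# Ratings table; NA = unknown
Rso <- rbind(John = c(A = 5, B = 3, C = 2),
             Mark = c(A = 3, B = 4, C = NA),
             Lucy = c(A = NA, B = 2, C = 5))

# Predict a user's rating for `target` from one reference item `ref`
slope_one <- function(R, user, target, ref) {
  both <- !is.na(R[, target]) & !is.na(R[, ref])   # users who rated both items
  dev  <- mean(R[both, target] - R[both, ref])     # average item-to-item difference
  R[user, ref] + dev
}
slope_one(Rso, "Lucy", "A", "B")   # 2 + 0.5 = 2.5
slope_one(Rso, "Mark", "C", "B")   # 4 + 1   = 5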
Decision Trees
A decision tree is a collection of nodes arranged as a binary tree and constructed top-down. The leaves of the tree correspond to classes (decisions such as likes or dislikes); each interior node is a condition on a feature, with branches for its associated values. To classify a new instance, one simply examines the features tested at the nodes of the tree and follows the branches corresponding to their observed values in the instance. Upon reaching a leaf, the process terminates and the class at the leaf is assigned to the instance. The most popular decision tree algorithm is C4.5, which uses the gain ratio criterion to select the attribute at every node of the tree.

Naïve Bayes
The naïve Bayes algorithm uses Bayes' theorem to predict the class for each case, assuming that the predictive attributes are independent given the class. A Bayesian classifier assigns a set of attributes $A_1, A_2, \ldots, A_n$ to the class C such that $P(C | A_1, A_2, \ldots, A_n)$ is maximal, i.e., the probability of the class value given the attribute instances is maximal.

Evaluation of Recommender Systems

Evaluation Measures: Confusion Matrix

Table: Confusion matrix for two classes

                       Actual Positive        Actual Negative
  Predicted Positive   True Positive (TP)     False Positive (FP),
                                              Type I error (false alarm)
  Predicted Negative   False Negative (FN),   True Negative (TN)
                       Type II error

Derived measures:
  Recall = Sensitivity = TPr = TP / (TP + FN)
  Specificity = TNr = TN / (FP + TN)
  Positive Predictive Value (PPV) = Confidence = Precision = TP / (TP + FP)
  Negative Predictive Value (NPV) = TN / (FN + TN)

Evaluation Measures using the Confusion Matrix
  True positive rate (TP / (TP + FN)): the proportion of positive cases correctly classified as belonging to the positive class.
  False negative rate (FN / (TP + FN)): the proportion of positive cases misclassified as belonging to the negative class.
  False positive rate (FP / (FP + TN)): the proportion of negative cases misclassified as belonging to the positive class.
  True negative rate (TN / (FP + TN)): the proportion of negative cases correctly classified as belonging to the negative class.
There is a trade-off between the false positive and false negative rates, as the objective is to minimize both (or, conversely, to maximize the true positive and true negative rates). Both can be combined into single metrics. For example, the predictive accuracy is defined as:

  $accuracy = \frac{TP + TN}{TP + TN + FP + FN}$   (10)

Another metric from the information retrieval field that is widely used to measure the performance of classifiers is the f-measure, the harmonic mean of two proportions:

  $F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$   (11)

where
  precision (TP / (TP + FP)) is the proportion of positive predictions that are correct (no. of good items recommended / no. of all recommendations), and
  recall is the true positive rate, TP / (TP + FN) (no. of good items recommended / no. of good items).
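As a small worked sketch (hypothetical counts, not from the slides), these measures are one-liners in R:

TP <- 40; FP <- 10; FN <- 20; TN <- 30   # hypothetical confusion-matrix counts

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                 # Eq. (10): 0.70
precision <- TP / (TP + FP)                                  # 0.80
recall    <- TP / (TP + FN)                                  # 0.67, true positive rate
f1        <- 2 * precision * recall / (precision + recall)   # Eq. (11): 0.73
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)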
Receiver Operating Characteristic (ROC) Curve
The ROC curve provides a graphical visualisation of the relationship between the true positive rate and the false positive rate. The optimal point in the graph is the top-left corner, i.e., all positive instances are classified correctly and no negative instances are misclassified as positive. The diagonal line on the graph represents the scenario of randomly guessing the class, and so it represents the minimum baseline for all classifiers. The Area Under the ROC Curve (AUC) also provides a quality measure as a single value and is used to compare different algorithms.
Figure: ROC curve

Rule Evaluation Measures
The support of a rule is the ratio between the number of instances satisfying both the antecedent and the consequent parts of the rule and the total number of instances:

  $Sup(R_i) = \frac{n(Antecedent \cdot Consequent)}{N} = \frac{TP}{N}$   (12)

where $n(Antecedent \cdot Consequent)$ corresponds to the TP and N is the total number of instances.

The confidence (Conf), also known as precision or positive predictive value (PPV), of a rule is the percentage of positive instances of the rule, i.e., the relative frequency of the number of instances satisfying both the condition (antecedent) and the consequent over the number of instances satisfying the condition alone:

  $Conf(R_i) = \frac{n(Antecedent \cdot Consequent)}{n(Antecedent)} = \frac{TP}{TP + FP}$   (13)

The coverage (Cov) of a rule is the percentage of instances covered by the rule within the induced set of rules:

  $Cov(R_i) = p(Cond) = \frac{n(Cond)}{N}$   (14)

where $R_i$ is a single rule, $n(Cond)$ is the number of instances covered by condition Cond, and N is the total number of instances.

Numeric Evaluation
With numeric evaluation (e.g., the difference between the predicted rating and the actual rating), the following measures are typically used.
Mean absolute error:

  $MAE = \frac{\sum_{(u,i)} |p_{(u,i)} - r_{(u,i)}|}{N}$   (15)

Root mean squared error:

  $RMSE = \sqrt{\frac{\sum_{(u,i)} (p_{(u,i)} - r_{(u,i)})^2}{N}}$   (16)

RMSE penalises larger errors more. Others are also possible, such as the mean squared error and the relative squared error.
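Both measures are straightforward to compute; a quick R sketch with hypothetical predictions:

actual    <- c(4, 3, 5, 2)     # hypothetical held-out ratings
predicted <- c(3.5, 3, 4, 4)   # hypothetical model predictions

mae  <- mean(abs(predicted - actual))        # Eq. (15): 0.875
rmse <- sqrt(mean((predicted - actual)^2))   # Eq. (16): 1.146, penalises the 2-point miss
c(MAE = mae, RMSE = rmse)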
Dimensionality Reduction
A different approach to estimating the blank entries in the utility matrix is to break the utility matrix into the product of smaller matrices. This can be done with SVD (Singular Value Decomposition), or with a more concrete approach based on SVD called UV-decomposition.

Singular Value Decomposition
Singular Value Decomposition is a way of breaking up (factoring) a matrix into three simpler matrices:

  $M = U_{m,m} \times S_{m,n} \times V^T_{n,n}$

where S is a diagonal matrix with non-negative values sorted in decreasing order.
Finding new recommendations:
1 A new user B comes in with some ratings in the original feature space, e.g., [0, 0, 2, 0, 4, 1].
2 Map B to a k-dimensional vector, $B_k = B \times U_k \times S_k^{-1}$, e.g., $B_k$ = [-0.25, -0.15].
3 Any distance measure can be used in the reduced space to provide recommendations.
Example taken from: http://www.ischool.berkeley.edu/node/22627

Singular Value Decomposition: Example using R
N <- matrix(c(5,2,1,1,4,
              0,1,1,0,3,
              3,4,1,0,2,
              4,0,0,0,3,
              3,0,2,5,4,
              2,5,1,5,1), byrow = TRUE, ncol = 5)
svd <- svd(N)
S <- diag(svd$d)
svd$u %*% S %*% t(svd$v)   # reconstructs N exactly

Selecting k = 3 (a rank-3 approximation):
svd <- svd(N, nu = 3, nv = 3)
S <- diag(svd$d[1:3])
svd$u %*% S %*% t(svd$v)
           [,1]         [,2]      [,3]       [,4]     [,5]
[1,] 4.747427  2.077269730 1.0835420  0.9373748 4.237575
[2,] 1.602453  0.513149695 0.3907818  0.4122864 1.507976
[3,] 3.170104  3.964360658 0.5605769  0.1145642 1.913918
[4,] 3.490307  0.136595437 0.6203060 -0.2117173 3.392280
[5,] 3.064186 -0.008805885 1.7257425  5.0637172 3.988441
[6,] 1.821143  5.042020528 1.3558031  4.8996100 1.111007

New user with ratings 0, 0, 2, 0, 4, 1:
b <- matrix(c(0,0,2,0,4,1), byrow = TRUE, ncol = 6)
S1 <- diag(1/svd$d[1:3])
b %*% svd$u %*% S1
           [,1]       [,2]       [,3]
[1,] -0.2550929 -0.1523131 -0.3437509

UV Decomposition
$M_{m,n}$ is the utility matrix, decomposed as $M = U_{m,k} \times V_{k,n}$:

  $\begin{pmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,n} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,n} \\ \vdots & & \ddots & \vdots \\ m_{m,1} & m_{m,2} & \cdots & m_{m,n} \end{pmatrix} = \begin{pmatrix} u_{1,1} & \cdots & u_{1,k} \\ u_{2,1} & \cdots & u_{2,k} \\ \vdots & \ddots & \vdots \\ u_{m,1} & \cdots & u_{m,k} \end{pmatrix} \times \begin{pmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,n} \\ \vdots & & \ddots & \vdots \\ v_{k,1} & v_{k,2} & \cdots & v_{k,n} \end{pmatrix}$

UV Decomposition: Example
With k = 2, a 5 × 5 utility matrix (blanks are unknown ratings) is decomposed into a 5 × 2 matrix U and a 2 × 5 matrix V:

  5 2 4 4 3       u_{1,1} u_{1,2}
  3 1 2 4 1       u_{2,1} u_{2,2}      v_{1,1} v_{1,2} v_{1,3} v_{1,4} v_{1,5}
  2   3 1 4   =   u_{3,1} u_{3,2}  ×   v_{2,1} v_{2,2} v_{2,3} v_{2,4} v_{2,5}
  2 5 4 3 5       u_{4,1} u_{4,2}
  4 4 5 4         u_{5,1} u_{5,2}
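A minimal sketch of how U and V can actually be found: simple gradient descent on the squared error of the known entries (our own illustration; the slides themselves do not give an algorithm, and Rajaraman et al. [8] describe an element-wise optimization instead). The matrix is named Muv here to avoid clashing with the earlier M:

set.seed(1)
Muv <- matrix(c(5, 2, 4, 4, 3,
                3, 1, 2, 4, 1,
                2,NA, 3, 1, 4,
                2, 5, 4, 3, 5,
                4, 4, 5, 4,NA), byrow = TRUE, ncol = 5)
k <- 2
U <- matrix(runif(nrow(Muv) * k), ncol = k)   # random initial factors
V <- matrix(runif(k * ncol(Muv)), nrow = k)

known <- !is.na(Muv)
for (step in 1:5000) {              # gradient descent on the known entries only
  E <- Muv - U %*% V                # residuals
  E[!known] <- 0                    # ignore the blanks
  U <- U + 0.002 * E %*% t(V)       # gradient step for U, small learning rate
  V <- V + 0.002 * t(U) %*% E       # gradient step for V
}
round(U %*% V, 2)                   # approximates Muv; the blanks are now predictions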
Tools

Apache Mahout
The Apache Mahout machine learning library is written in Java and designed to be scalable, i.e., to run over very large data sets. It achieves this by ensuring that most of its algorithms are parallelizable (the map-reduce paradigm on Hadoop). The Mahout project was started by people involved in the Apache Lucene project with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization. The community was initially driven by the paper "Map-Reduce for Machine Learning on Multicore" but has since evolved to cover other machine-learning approaches. http://mahout.apache.org/

Mahout is designed mainly for the following use cases:
1 Recommendation mining takes users' behaviour and from that tries to find items users might like.
2 Clustering takes, e.g., text documents and groups them into groups of topically related documents.
3 Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.
4 Frequent item-set mining takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.

Mahout is built on top of Hadoop (http://hadoop.apache.org/), which implements the MapReduce model; Hadoop is the most used implementation of MapReduce, and Mahout is one of the projects of the Apache Software Foundation. Hadoop divides data into small pieces to process it and, in case of failure, repeats only the pieces that failed. To achieve this, Hadoop uses its own file system, HDFS (Hadoop Distributed File System).
Figure: Apache Mahout workflow

R and extensions
R, with recommenderlab and other packages for text mining:
  recommenderlab: http://lyle.smu.edu/IDA/recommenderlab/ and http://cran.r-project.org/web/packages/recommenderlab/
  arules (for association rules): http://lyle.smu.edu/IDA/arules/ and http://cran.r-project.org/web/packages/arules/
  Text mining: http://cran.r-project.org/web/packages/tm/
Examples and further pointers: http://www.rdatamining.com/
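An illustrative sketch of the recommenderlab workflow (based on the package's documented interface; the exact calls and dataset are our assumption, not from the slides):

library(recommenderlab)
data(MovieLense)   # example ratings data shipped with the package

rec  <- Recommender(MovieLense[1:500], method = "UBCF")   # user-based CF model
pred <- predict(rec, MovieLense[501:502], n = 5)          # top-5 items per user
as(pred, "list")   # recommended movie titles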
Other tools with recommendation support:
  RapidMiner and its plug-in for recommendation
  KNIME
  Weka

Problems and Challenges

Cold Start Problem
When a new user or a new item is added to the system, the system knows nothing about them, and it is therefore difficult to produce recommendations.

Explore/Exploit
One option is to recommend items at random to a small number of randomly chosen users; to do so, we need to remove one of the highly ranked items to include the random ones. There is a trade-off between recommending new items and recommending highly ranked items:
  Exploiting the model to improve quality.
  Exploring new items that may turn out to be good and help improve the model.
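One simple way to implement this trade-off is an ε-greedy rule (our example, not from the slides): with a small probability, substitute a random unexplored item for one of the top-ranked ones.

set.seed(42)
recommend <- function(scores, n = 5, eps = 0.1) {
  top <- names(sort(scores, decreasing = TRUE))[1:n]   # exploit: best-ranked items
  if (runif(1) < eps) {                                # explore with probability eps
    candidates <- setdiff(names(scores), top)
    top[n] <- sample(candidates, 1)                    # swap one slot for a random item
  }
  top
}
scores <- c(i1 = 4.8, i2 = 4.5, i3 = 4.1, i4 = 3.9, i5 = 3.2, i6 = 2.0, i7 = 1.5)
recommend(scores, n = 3)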
Conclusions
Nowadays, recommender systems are gaining ever more notoriety due to their high number of applications. Users cannot manage all the information available on the Internet, which is why some kind of filtering or recommendation is necessary; companies, in turn, want to offer users specific information to increase purchases. The contest organized by Netflix produced a big jump in recommender systems research: more than 40,000 teams tried to create a good algorithm. The work of Krishnan et al. [6] is also very interesting, comparing the recommendation results of machines against humans; as stated in the article, the machine won in most of the comparisons, because machines can handle more data than humans can.

Current and Future Work
The following improvements can be applied to recommender systems [9]:
  Improve security regarding false ratings or users. E.g., the second Netflix challenge was cancelled in part because it was possible to identify users based on their votes on IMDb.
  Take advantage of social networks: users like the same things as their friends.
  Explore the social graph of the users: potential friends could be suggested using recommendation techniques.
  Improve the evaluation method when there are no ratings.
  Proactive recommender systems.
  Privacy-preserving recommender systems.
  Recommending a sequence of items (e.g., a play list).
  Recommender systems for mobile users.

References
[1] Deepak Agarwal. Offline components: Collaborative filtering in cold-start situations. Technical report, Yahoo!, 2011.
[2] Chris Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[3] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34, November 1996.
[4] A. Galland. Recommender systems. Technical report, INRIA-Saclay, 2012.
[5] Joseph A. Konstan and John Riedl. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction, 22(1-2):101-123, April 2012.
[6] Vinod Krishnan, Pradeep Kumar Narayanashetty, Mukesh Nathan, Richard T. Davies, and Joseph A. Konstan. Who predicts better? Results from an online study comparing humans and an online recommender system. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys '08, pages 211-218, New York, NY, USA, 2008. ACM.
[7] Yashar Moshfeghi, Deepak Agarwal, Benjamin Piwowarski, and Joemon M. Jose. Movie recommender: Semantically enriched unified relevance model for rating prediction in collaborative filtering. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 54-65, Berlin, Heidelberg, 2009. Springer-Verlag.
[8] Anand Rajaraman, Jure Leskovec, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.
[9] Francesco Ricci. Information search and retrieval, 2012.
[10] Francesco Ricci, Lior Rokach, and Bracha Shapira. Recommender Systems Handbook, chapter Introduction to Recommender Systems Handbook, pages 1-35. Springer, 2011.