Movie Advisor
Tel Aviv University, Faculty of Engineering
M.Sc. Project
Doron Harlev
Supervisor: Dr. Dana Ron
February 12, 2016

Table of Contents

Introduction
Database Characteristics
Data Sets
Evaluation Criteria
    Mean Average Error (MAE)
    Coverage
Base Algorithms
    All Average
    Movie Average
    User Average
    User Movie Average
    Summary
Prediction Methods
    Pearson R Algorithm
        Base Algorithm
        Additions To The Base Pearson R Algorithm
    Mean Square Difference (MSD) Algorithm
    Genre Based Algorithms
        Genre Statistics
        Base Algorithm
        Genre Algorithm
        Hybrid Genre Algorithm
    Algorithm Comparison
        MAE Vs. Coverage For All Algorithms
        Database Alteration
Conclusions
Reference
Appendix I: Implementation
    Introduction
    Matlab Code
        General
        Database Statistics
        Base Algorithms
        Pearson r Algorithm
        MSD
        Genre

Introduction

This paper will analyze existing and proposed algorithms designed to provide predictions of a user's rating of a particular movie. The predictions will rely on the user's own ratings of other movies, as well as on the scores provided by neighboring users. The ability to predict a user's rating may prove useful in many contexts. One such context may be enhancing a user's experience in an online store by recommending some items while helping avoid others. The problem set described in this paper is studied in the field of Recommender Systems, or Collaborative Filtering.

This paper will start by examining the statistical properties of the chosen database. Once the base and test data sets have been characterized, evaluation criteria such as Mean Average Error (MAE) and coverage will be explained. These criteria will then be used to measure the performance of trivial prediction methods applied to the base and test data sets. The performance of the trivial methods will provide a reference for more sophisticated algorithms.

The first of these algorithms makes use of the Pearson R coefficient. The Pearson R coefficient provides a correlation measure between two vectors and can be used to provide a distance metric between two users. Use of the Pearson R correlation coefficient is quite common in the field of collaborative filtering, and results obtained with this method will be used to gauge the performance of other algorithms. The Pearson R algorithm will be further enhanced in accordance with improvements proposed by other papers as well as this one. Another baseline algorithm to be examined is the Mean Square Difference (MSD) algorithm. Although generally considered inferior to the Pearson R algorithm, elements of the MSD algorithm will prove valuable in the newly proposed algorithms.

The database used in this paper also provides genre information for all movies. The statistics of genre information in the database will be analyzed.
Novel prediction algorithms relying on genre information will then be proposed. As a reference, a new trivial prediction algorithm will be presented and its performance compared to the previous trivial methods. Several genre-based prediction algorithms will then be proposed and analyzed with respect to the test and base data sets. Finally, all algorithms will be compared. This comparison will include a different instantiation of the initial data set, as well as altered versions of the entire database.

Database Characteristics

The database used for the project is the GroupLens database, available on the internet at http://www.cs.umn.edu/Research/GroupLens/. The database contains 100,000 ratings on a scale of 1-5. The ratings are made for 1682 movies by 943 users. If the database were viewed as a matrix with users designating rows and movies designating columns, the matrix would be extremely sparse, with values in merely 6% of its entries. Despite its apparent sparseness, the database is sufficient to allow the analysis of the prediction algorithms discussed in this paper.

The mean score over all entries in the database is 3.53, the median is 4 (more precisely 4-, that is to say, a score of 4 is included in the upper half of the median) and the standard deviation is 1.13. Figure 1 depicts a histogram of all user ratings:

Figure 1: Histogram of all movie scores

From the histogram and average score we learn that users tend to rate movies they liked more than movies they disliked.

The average number of scores given to each movie is around 60 and the median number of scores is 57. Figure 2 depicts a histogram of the number of scores given to each movie:

Figure 2: Histogram of the number of scores given to each movie

The histogram shows that while some movies have a significant number of ratings, more than 500, many others have 10 ratings or less.

The average number of scores given by each user is 106 and the median is 65. The minimum number of scores given by each user is 20. Figure 3 depicts a histogram of the number of scores given by users:

Figure 3: Histogram of the number of scores given by each user

This histogram shows a distribution similar to that of the movie ratings. While some users rated as many as 700 movies, many others rated the minimum allowed. Figure 4 depicts a graphical representation of the database:

Figure 4: Graphical representation of rated movies

Each rated item in the database is designated by a blue dot. The parabolic decline of rated movies as a function of user ID implies that when the database was established, each new user was provided with a more elaborate list of movies. It is also evident that movies with higher IDs average fewer ratings than those with lower IDs.

Data Sets

To assess the prediction methods, the data set is divided into two parts: base and test. The test data set contains 5 entries for 10% of the users, yielding a total of 470 entries. The base data set contains all remaining entries (99,530). The test data set is used as a reference for the accuracy of the predictions.
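The division can be sketched in a few lines of Matlab (a minimal fragment with illustrative variable names, assuming the ratings are held in a sparse user-by-movie matrix mat; the full routine, mkdata.m, appears in Appendix I):

    % Pick 10% of the users at random and move 5 of each user's ratings
    % from the base matrix into the test list.
    [unum, mnum] = size(mat);
    p = randperm(unum);                    % random user order
    for i = 1:round(0.1*unum)
        idx = find(mat(p(i),:) ~= 0);      % movies this user rated
        rs  = idx(randperm(length(idx)));  % the same movies, in random order
        for j = 1:5
            test((i-1)*5+j,:) = [p(i) rs(j) full(mat(p(i),rs(j)))];
            mat(p(i),rs(j))   = 0;         % remove the entry from the base set
        end
    end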
Since the division into base and test data sets is made randomly, the performance of the tested algorithms will vary depending on the specific division that was made. To stabilize the results throughout the paper, all results are calculated for one specific base/test division. This issue will be further addressed in the final analysis.

Evaluation Criteria

Two types of evaluation criteria will be used in this paper.

Mean Average Error (MAE)

Once predictions for the test data set are obtained, $E_i$ is defined as the difference between the prediction, $S_i$, and the actual score given by the user, $R_i$. The MAE is

$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|S_i - R_i\right|$

where n is the length of the predicted data set. Other error measures, such as the standard deviation and ROC, may be used. However, related papers show that MAE is consistent with other measures, and it proves to be the most widely used.

Coverage

As will later be shown, predictions cannot be made for all movies in the test data set. Coverage is a measure of the percentage of movies in the test data set that can be predicted. In general, the smaller the MAE, the smaller the coverage becomes. The importance of coverage depends on the application using the recommender system. Some applications may require predictions for most items in the database, while others may choose to compromise coverage for improved accuracy.

Base Algorithms

Four base algorithms are analyzed in this section. These algorithms are "basic" in that they rely on trivial aspects of the users' entries to provide predictions.

All Average

In this method the average rating of all entries in the base data set is calculated. This value is then used as the predicted score for all values in the test data set. This method yields an MAE of 1.046 and coverage of 100%.

Movie Average

In this method the average score given to each movie in the base data set is calculated. This average score is then used to predict values for all occurrences of the movie in the test data set. This method yields a vast improvement over the previous one: the MAE drops to 0.887. Since not all movies in the base data set have scores, some values in the test data set cannot be predicted. The coverage of this method is 99.8%.

User Average

In this method the average score given by each user in the base data set is calculated. This average score is then used to predict values for the user in the test data set. This method yields an improvement over the previous one: the MAE drops to 0.849. The coverage of this method is 100%, since all users provided at least 20 ratings. Figure 5 depicts a histogram of the error using the user average base algorithm.

Figure 5: Histogram of error using the User Average prediction

User Movie Average

This method is the most sophisticated of the basic methods. It combines the user average and the movie average to produce a prediction. To produce a prediction for user a and movie i we perform the following calculation:

$S_{a,i} = \bar{r}_a + \frac{1}{n}\sum_{u}\left(r_{u,i} - \bar{r}_u\right)$

where $\bar{r}_a$ is user a's average rating, u is an index running over all users who rated the movie, and n is the number of users who rated the movie. This method yields yet another improvement over the previous one. The MAE drops to 0.83 while the coverage is 99.8%, identical to the movie average method.
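The calculation, together with the MAE, can be written compactly in Matlab (a sketch with illustrative names, assuming the base ratings sit in a sparse user-by-movie matrix R with zeros marking unrated entries; basepr.m in Appendix I is the full version):

    rated  = R ~= 0;
    u_mean = sum(R,2) ./ sum(rated,2);               % each user's average rating
    dev    = (R - u_mean*ones(1,size(R,2))).*rated;  % ratings minus user means
    m_dev  = sum(dev,1) ./ max(sum(rated,1),1);      % mean deviation per movie
    pred   = u_mean(a) + m_dev(i);                   % prediction for user a, movie i
    % Given a vector of predictions preds and the true test scores truth:
    mae    = mean(abs(preds - truth));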
Summary

It is clear that any desirable prediction method must improve upon the results obtained in this section. Table 1 summarizes the results:

Method               MAE    Coverage [%]
All Average          1.046  100
Movie Average        0.887  99.8
User Average         0.849  100
User Movie Average   0.830  99.8

Table 1: Summary of base algorithm results

Prediction Methods

The following prediction methods attempt to provide further improvement over the base algorithms. The improvement is made possible through the use of additional information. One form of additional information is to intelligently weigh other users' ratings; the Pearson R and MSD methods use this approach to improve prediction. Another form of additional information is to make use of the genre information provided with the database. Methods using this approach will be presented in the ensuing discussion.

Pearson R Algorithm

The Pearson R algorithm relies on the Pearson R coefficient to produce a correlation metric between users. This correlation is then used to weigh the score of each relevant user. The Pearson R algorithm is used widely in the study of recommender systems, and it serves as a reference in this paper.

Base Algorithm

The Pearson R correlation between users a and u is defined as:

$P_{a,u} = \frac{\sum_{i=1}^{m}\left(r_{a,i} - \bar{r}_a\right)\left(r_{u,i} - \bar{r}_u\right)}{\sigma_a\,\sigma_u}$

where m is the number of movies that both users rated, $r_{a,i}$ is the score user a gave movie i, and $\bar{r}_a$ is the average score user a gave all movies. Since the base data set is sparse, most pairs of users have only a few overlapping scored movies when the Pearson R coefficient is calculated. Only movies ranked by both users are taken into account; this affects the sum in the numerator and the variance terms in the denominator. Figure 6 depicts a histogram of the average number of mutually rated movies (MRM) for all users in the test data set:

Figure 6: Histogram of the average number of MRM for all users in the test data set

The mean MRM is 17.6 and the median is 12. We note that many users have an average MRM below 10, while some have a relatively high average MRM of 60. Users who provide more entries will tend to have a higher average MRM.

Figure 7 depicts a histogram of the Pearson R coefficient between a particular user (user 104) and all other users:

Figure 7: Histogram of the Pearson R coefficient for user 104

It is evident that most users are slightly positively correlated. While there is such a thing as two people with similar tastes, it is highly unlikely to find two users with opposite tastes. Opposite tastes would require one user to consistently dislike what the other user likes, and vice versa. This observation is further discussed in Shardanand [1].

Once the Pearson R correlation between a user and all other users is obtained, the predicted movie score is calculated as:

$S_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n}\left(r_{u,i} - \bar{r}_u\right)P_{a,u}}{\sum_{u=1}^{n}\left|P_{a,u}\right|}$

This approach is very similar to that of the user movie average discussed earlier. The difference is that the Pearson R coefficient is used to weigh the score given by each user, whereas in the user movie average algorithm all users are weighted equally. Applying the Pearson R base algorithm to the test data set yields an MAE of 0.79 with coverage of 99.8%. The coverage is slightly limited because not all movies in the test data set have a score in the base data set.
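In Matlab the coefficient and the weighted prediction reduce to a few matrix operations (a condensed, threshold-free sketch reusing dev and u_mean from the previous fragment; the full routine, pearsnn.m, is listed in Appendix I):

    ref = ones(size(dev,1),1)*dev(a,:);      % replicate active user's centered row
    mrm = (dev~=0) & (ref~=0);               % mutually rated movies for each pair
    cv  = sqrt(sum((dev.^2).*mrm,2) .* sum((ref.^2).*mrm,2));
    P   = sum(dev.*ref.*mrm,2) ./ (cv + (cv==0)); % Pearson R vs. every user
    scr = dev(:,i);                          % centered scores for movie i
    S   = u_mean(a) + (P'*scr) / sum(abs(P).*(scr~=0)); % weighted prediction

Note that the sums in cv and P run only over the mutually rated movies, as discussed above, and that the denominator of the prediction uses the absolute values of the correlations.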
Since no limitations have been placed on the predictions, the coverage is identical to that of the movie average and user movie average base algorithms. Figure 8 depicts a histogram of the prediction error using the Pearson R base algorithm.

Figure 8: Histogram of prediction error for the Pearson R base algorithm

The MAE obtained is clearly an improvement over all base methods.

Note that in order to normalize the weights, all user scores are divided by the sum of the absolute values of the correlations. Since the Pearson R correlation may be either positive or negative, adding the raw correlations may produce low values in the denominator. These low values may cause the prediction to be either too low or too high, at times exceeding a value of 5. Using the absolute value has a drawback in that it produces an unbalanced normalization factor. Results obtained without the absolute value were extensively tested and shown to produce substantially poorer results. In further discussion, only the absolute value approach will be applied.

Additions To The Base Pearson R Algorithm

Several modifications can be made to the base Pearson R algorithm to improve its performance. These modifications set thresholds in order to reduce the effects of 'noisy' data.

Pearson R Thresholding

One modification, initially suggested by Shardanand [1], is to avoid making use of users who are not highly correlated. To implement this approach we modify $P_{a,u}$ as follows:

$P'_{a,u} = \begin{cases} \frac{P_{a,u} - L}{1 - L} & \text{if } P_{a,u} \ge L \\ \frac{P_{a,u} + L}{1 - L} & \text{if } P_{a,u} \le -L \\ 0 & \text{otherwise} \end{cases}$

where L is the Pearson R threshold. After calculating $P'_{a,u}$ we simply use it in place of $P_{a,u}$ and produce the expected score. Figure 9 shows the effect of increasing the Pearson R threshold on the MAE and coverage:

Figure 9: MAE and coverage as a function of the Pearson R threshold (users TH=3, Herlocker TH=1)

The results were obtained with two additional thresholds: users and Herlocker. The users threshold is the number of correlated users required to make a prediction. When the threshold is not met, the algorithm will not make a prediction for the item in the test data set. For consistency, all methods in this paper are examined with a threshold of 3 users. A Herlocker threshold of 1 is identical to no threshold at all; this threshold will be explained later.

The Pearson R threshold initially reduces the MAE slightly (MAE=0.7867 at a Pearson threshold of 0.1) without compromising coverage significantly. Beyond that point, the Pearson R threshold has the effect of increasing the MAE and decreasing coverage, both undesirable. The decreasing coverage is caused by an increasing number of movies that cannot be predicted due to an insufficient number of correlated users.

The user average plot, present in the upper chart, is the MAE of the user average algorithm calculated on the predicted test data set. The test data set changes as the coverage changes, and these changes cause slight variations in the results of the user average method. The plot of the user average adds information about the type of movies being omitted as coverage decreases. This measure will prove more significant in later discussion.
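In code, the thresholding described above is a one-liner (a sketch mirroring the weight computation in pearsnn.m):

    L  = 0.1;                                        % Pearson R threshold
    Pt = (abs(P)>=L) .* sign(P) .* (abs(P)-L)/(1-L); % thresholded, rescaled weights

Dividing by 1-L rescales the surviving weights back onto the full [-1, 1] range rather than merely clipping them.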
Significance Weighting

Herlocker et al. [2] suggested that it may be advisable to make use of the number of mutually rated movies (MRM). Recall Figure 6, which shows a histogram of the average MRM for all users in the test data set. Herlocker suggested applying a weighting that decreases the correlation coefficient of neighbors with a small MRM. The weighting scheme suggested alters the Pearson R coefficient as follows:

$P'_{a,u} = \begin{cases} \frac{MRM}{H}\,P_{a,u} & \text{if } MRM \le H \\ P_{a,u} & \text{if } MRM > H \end{cases}$

where H is a given threshold (designated the Herlocker threshold in this paper).

Figure 10 depicts the effect of increasing the Herlocker threshold with a Pearson R threshold of 0.1.

Figure 10: MAE and coverage as a function of the Herlocker threshold, Pearson threshold=0.1

As we can see, an increase in the threshold reduces coverage, while also slightly improving the MAE. In an attempt to improve Herlocker's method, a slightly altered weighing scheme was applied:

$P'_{a,u} = \begin{cases} \left(\frac{MRM}{H}\right)^2 P_{a,u} & \text{if } MRM \le H \\ P_{a,u} & \text{if } MRM > H \end{cases}$

This slight alteration has the effect of truncating users with low MRMs faster, while still making use of 'borderline' users. Figure 11 depicts the effect of increasing the Herlocker threshold with a Pearson R threshold of 0.1. Both the squared and non-squared approaches are shown.

Figure 11: MAE vs. coverage with two types of Herlocker thresholding (users TH=3, Pearson TH=0.10)

As we can see, the performance of the squared and non-squared approaches is relatively similar. In future discussion, only the non-squared approach will be used.

Another alteration to Herlocker's threshold was attempted. Instead of a rigid threshold, a threshold based on each user's average MRM was calculated: the threshold is a constant multiplied by the average MRM of each user. The logic behind this approach is that some users tend to rate more movies than others, and will therefore tend to have higher MRMs. Using a constant threshold for all users' MRMs does not take this variance into account. Figure 12 depicts the results obtained with this method.

Figure 12: MAE and coverage as a function of the multiplication threshold (users TH=3, Pearson TH=0.100)

As we can see, this method performs worse than Herlocker's initial proposal and will be discarded.

Mean Square Difference (MSD) Algorithm

The MSD algorithm is very similar to the Pearson R algorithm, except that it relies on the mean square distance, instead of the Pearson R correlation, as its distance metric. The distance between two users is calculated as:

$D_{a,u} = \frac{1}{m}\sum_{i=1}^{m}\left[\left(r_{a,i} - \bar{r}_a\right) - \left(r_{u,i} - \bar{r}_u\right)\right]^2$

where m denotes the number of MRM between users a and u. Once the distance vector $D_a$ is obtained we calculate $P_a$ as:

$P_{a,u} = \begin{cases} \frac{L - D_{a,u}}{L} & \text{if } D_{a,u} < L \\ 0 & \text{otherwise} \end{cases}$

where L is a threshold on the distance measure. The larger the threshold, the more distant the users we take into account to produce a prediction. Once $P_a$ is obtained, the score is calculated identically to the Pearson R algorithm.
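A sketch of the distance and weight computation, reusing dev, ref, and mrm from the Pearson fragment above (msd.m in Appendix I is the full routine, which works with the root of this quantity; the root is monotone, so the ordering of neighbors is unchanged):

    m = sum(mrm,2);                                % number of mutually rated movies
    D = sum(((dev-ref).^2).*mrm,2) ./ (m+(m==0)); % mean square difference per user
    L = 0.8;                                       % distance threshold
    P = (m~=0) .* (D<L) .* (L-D)/L;                % nearer users get larger weights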
Note that all distances are positive, so the absolute value on the distance measure is not necessary. Figure 13 depicts the effect of decreasing the L threshold on the MAE and coverage:

Figure 13: MAE vs. coverage for the MSD algorithm (users TH=3)

We note that this algorithm does not perform as well as Pearson R at any coverage. We also note a phenomenon of improved user average MAE with decreased coverage. This effect was not present with the Pearson R algorithm, where the user average performance was relatively constant with decreasing coverage. This decrease implies that the algorithm tends to make predictions for items that are closer to the user's average. Figure 14 may help in understanding this effect:

Figure 14: MAE vs. coverage of discarded items for the user average algorithm

The figure depicts the MAE of the user average algorithm for items that were omitted by the MSD algorithm at a given coverage. To understand this curve we need to examine it in three parts. At high coverage (>0.95) very few items are present in the calculation, yielding very noisy results. At mid to high coverage (0.8-0.95) we note that the MSD algorithm actually omits items whose user average MAE is higher than that of items that are not omitted. At lower coverages the MSD omits items with better user average MAE, so the MAE begins to decrease.

It is possible that movies that are difficult to predict with MSD are esoteric movies that few people rated. Since few people rated these movies, the MSD algorithm eventually omits them; and since the movies are esoteric, a user's scores for them will tend to vary greatly from his mean score.

Genre Based Algorithms

The GroupLens database provides additional information about each movie in the database. The following description is copied verbatim from the README file downloaded with the database:

    u.item -- Information about the items (movies); this is a tab separated list of
    movie id | movie title | release date | video release date | IMDb URL | unknown |
    Action | Adventure | Animation | Children's | Comedy | Crime | Documentary |
    Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
    Thriller | War | Western |
    The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0
    indicates it is not; movies can be in several genres at once.

Since the genres will be referenced by number in the ensuing discussion, the following table summarizes genre names and numbers.

#   Name         #    Name          #    Name        #    Name
1   Unknown      6    Comedy        11   Film-Noir   16   Sci-Fi
2   Action       7    Crime         12   Horror      17   Thriller
3   Adventure    8    Documentary   13   Musical     18   War
4   Animation    9    Drama         14   Mystery     19   Western
5   Children's   10   Fantasy       15   Romance

Table 2: Summary of genre numbers

The genre algorithms presented in this paper attempt to make use of this additional information to improve prediction accuracy. Only genre information will be used; dates are omitted.
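For reference, the 19 genre flags can be read into a movie-by-genre indicator matrix g with something like the following (a hypothetical loader, not one of the project's listings; it assumes textread can skip the leading text fields of the pipe-delimited u.item file, and may need adjustment for empty fields):

    fmt = ['%d %*s %*s %*s %*s' repmat(' %d',1,19)]; % id, 4 skipped text fields, 19 flags
    c = cell(1,20);
    [c{:}] = textread('u.item', fmt, 'delimiter', '|');
    g = zeros(max(c{1}), 19);
    g(c{1},:) = [c{2:20}];                           % row per movie id, column per genre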
Genre Statistics

Figure 15 depicts the average score and the number of user ratings for each genre:

Figure 15: Average scores and number of user ratings for all genres

To produce the average number of user ratings for each genre, all entries in the database are taken into account. For each entry, counters for all genres present in the movie are incremented. Once all entries have been processed, the counters are divided by the number of users.

The average score is calculated in a similar manner. Initially, a new matrix of user ratings is produced; this matrix subtracts each user's average score from all of his ratings. Once the new matrix is obtained, all entries are scanned again. For each item, a counter for each genre present is incremented, and the value from the calculated matrix is added to an accumulator. Note that since user averages were subtracted from the matrix, the accumulators may contain negative values. Once all entries have been processed, the accumulator for each genre is divided by the counter for that genre.

Not surprisingly, the most popular genres are action, comedy and drama. The negative average score received by both comedy and action implies that, although they are popular, they are generally rated below users' averages. Drama is the only genre that is both widely rated and liked.

Base Algorithm

Once we obtain genre information, we can construct another base algorithm. We start by calculating how much a user likes or dislikes a particular genre. To do so we calculate the vector of centered scores $R = r_a - \bar{r}_a$ over all movies the user rated. For each movie we add the score in vector R to all genres associated with the movie. After going through all movies, we divide the sum obtained for each genre by the number of movies in which the genre appeared. This calculation is similar to the one performed to obtain average genre scores for the entire database, except that it is performed per user. We thus obtain a vector with an average score for each genre for every user. For future reference this vector is designated $G_a$, and the matrix for all users G. A positive score indicates that the user likes a particular genre.

To produce a prediction for a particular item, the base algorithm averages the user's scores for the genres present in the movie and adds the result to the user's average:

$S_{a,i} = \bar{r}_a + \frac{1}{n}\sum_{g} G_{a,g}$

where g runs over the genres present in the movie and n is their number. This approach yields the following results for the base and test data sets:

MAE     Coverage [%]
0.836   98.72

Table 3: Results for the genre base algorithm

Recalling Table 1, it should be noted that this approach provides an improvement over the user average base algorithm. This implies that genre information is useful in making predictions.

Genre Algorithm

In this section a novel prediction algorithm based on genre information is proposed. This approach makes use of other users' scores for the movie as well as genre information. First the matrix G is calculated for all users. This matrix is then used to calculate the mean square distance (MSD) between users, as shown previously. Once distances for a particular user are obtained, they are used to weigh the movie scores made for the movie by other users.
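The whole matrix G reduces to two matrix products (a sketch mirroring basegen.m in Appendix I; dev denotes the user-centered rating matrix and g the movie-by-genre indicator matrix):

    cnt = (dev~=0) * g;                % number of rated movies per genre, per user
    G   = (dev*g) ./ (cnt + (cnt==0)); % average centered score per genre (0 if none)

Each row of G is the 19-element genre profile used in the algorithms below.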
Figure 16 depicts the performance of the genre algorithm. To reduce the MAE at the cost of coverage, the distance threshold L is decreased.

Figure 16: MAE vs. coverage for the genre algorithm using MSD (users TH=3)

The performance of this algorithm is better than that of Pearson R at low coverages, and comparable at high coverages. Moreover, the calculations necessary to produce predictions are much simpler. To produce each prediction with the Pearson R algorithm, it is necessary to perform calculations on the entire database, containing 1682 x 943 ≈ 1.6·10^6 items. If the algorithm used is efficient, the calculation complexity may be reduced to the number of entries, 100,000. To produce a single result with the genre algorithm, only 943 x 19 ≈ 18,000 items need to be taken into account. In practice, predicting with the genre algorithm proved to be much faster.

As with the MSD algorithm, a phenomenon of decreased user average MAE with decreased coverage is encountered. The three regions of the MSD curve are present in the genre algorithm as well, and the explanation for this phenomenon is identical to that of MSD. An attempt to provide further explanation by analyzing average genre scores of omitted items proved inconclusive. Figure 17 depicts the user average MAE for omitted items as a function of coverage.

Figure 17: MAE vs. coverage of discarded items for the user average algorithm

Another implementation of the genre algorithm may rely on Pearson R as the distance measure. The results obtained with this approach are depicted in Figure 18:

Figure 18: MAE vs. coverage for the genre algorithm using Pearson correlation (users TH=3, Pearson TH=0.1)

As we can see, the performance of this approach is generally worse than the MSD approach and does not improve with reduced coverage. This approach will be discarded from the ensuing discussion.

Hybrid Genre Algorithm

While the genre algorithm makes use of genre information to calculate the distance between the user and other users, it does not directly use the user's own genre preferences when forming the predicted score. In this section a hybrid algorithm is proposed that takes into account the weighted scores given by other users as well as the user's own average scores for the genres present in the movie itself. The prediction is calculated as:

$S_{a,i} = \bar{r}_a + \alpha\,\frac{\sum_{u=1}^{n}\left(r_{u,i} - \bar{r}_u\right)P_{a,u}}{\sum_{u=1}^{n}\left|P_{a,u}\right|} + \left(1-\alpha\right)\frac{1}{n_g}\sum_{g} G_{a,g}$

where $\alpha$ designates the weight ratio between other users' ratings of the movie and the user's own average ratings of its genres, and g runs over the $n_g$ genres present in the movie. When $\alpha$=0 the algorithm reduces to the base genre algorithm, and when $\alpha$=1 the proposed genre algorithm is used alone.

This approach produces a further improvement. Figure 19 shows the MAE as a function of $\alpha$ for two thresholds.

Figure 19: MAE vs. $\alpha$ for the hybrid genre algorithm (users TH=3; L=0.9, coverage=98.1; L=0.5, coverage=71.3)

An examination of both curves shows that they have an absolute minimum at approximately $\alpha$=0.65. At this value, other users' opinions are mostly taken into account, but a strong weight is still given to the genres present in the movie. To test the robustness of the optimal $\alpha$, the same plot was produced for a different instance of the base and test data sets. Figure 20 depicts the result obtained:

Figure 20: MAE vs. $\alpha$ for the hybrid genre algorithm, different data set (users TH=3; L=0.9, coverage=95.5; L=0.5, coverage=78.1)

As can be seen, the optimal $\alpha$ is still obtained around 0.65.
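In code, the hybrid score of the equation above is a one-line combination of the earlier fragments (a sketch; P here denotes the genre-based MSD weights, scr the centered scores for movie i, and G and g the matrices introduced above):

    alpha = 0.65;
    gi    = find(g(i,:));                   % genres present in movie i
    S     = u_mean(a) + alpha*(P'*scr)/sum(abs(P).*(scr~=0)) ...
                      + (1-alpha)*mean(G(a,gi));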
To test the robustness of the value of the optimal , the same plot was obtained for a different instance of the base and test data sets. Figure 20 depicts the result obtained: Users TH=3 0.84 0.83 L=0.9, coverage=95.5 L=0.5, coverage=78.1 MAE 0.82 0.81 0.8 0.79 0.78 0.77 0 0.2 0.4 0.6 0.8 Ratio Figure 20: MAE Vs. for hybrid genre algorithm, different data-set 1 Movie Advisor Project 25/51 As can be seen, the optimal is still obtained around 0.65. February 12, 2016 Figure 21 compares the genre algorithm to the hybrid genre algorithm with =0.65. Users TH=3, rat=0.65 0.82 Hybrid Method Genre Method 0.8 MAE 0.78 0.76 0.74 0.72 0.7 0.68 0.5 0.6 0.7 Coverage 0.8 0.9 1 Figure 21: MAE Vs. Coverage of genre and hybrid genre ( =0.65) algorithms The figure shows that the hybrid genre algorithm performs better than the genre algorithms at all coverages. The additional complexity of the hybrid algorithm is negligible. Algorithm Comparison MAE Vs. Coverage For All Algorithms Figure 22 shows a comparison of the main algorithms discussed in this paper. 0.82 0.8 MAE 0.78 0.76 0.74 Hybrid Genre MSD Pearson R 0.72 0.7 0.5 0.6 0.7 0.8 0.9 Coverage Figure 22: MAE Vs. Coverage for hybrid genre, MSD, and Pearson R algorithms 1 Movie Advisor Project 26/51 February 12, 2016 It is clear that the hybrid genre algorithm performs better at low coverages, while Pearson R performs better at high coverages. It should be noted that the hybrid genre algorithm is the only algorithm able to significantly improve its performance with decreased coverage. Database Alteration In this section the robustness of the results will be tested against several alterations to the data set. Instantiation As noted in earlier discussion, results may differ if a different instance of division into base and test data sets is used. Table 4 shows results obtained with base algorithms for new division of base/test data sets: Method All Average Movie Average User Average User Movie Average Genre Base MAE 0.948 0.832 0.85 0.785 0.85 Coverage [%] 100 99.8 100 99.8 98.7 Table 4: Base algorithm results for new instance As Table 4 shows, the user average method produces results similar to the genre base method. This suggests, that it is more difficult for genre based algorithms to produce accurate results for this division of test and base data sets. Figure 23 depicts a comparison of the main algorithms for the new instance of test and base data sets: 0.84 Hybrid Genre MSD Pearson R 0.82 MAE 0.8 0.78 0.76 0.74 0.72 0.55 0.6 0.65 0.7 0.75 0.8 Coverage 0.85 0.9 0.95 1 Figure 23: MAE Vs. Coverage for Hybrid Genre, MSD, and Pearson R algorithms In this instance, the Pearson R algorithm seems superior at most coverages. The hybrid genre algorithms still manages to improve accuracy with reduced coverage, yielding the lowest MAE at coverages below 0.65. The general behavior of the three algorithms is still maintained. Movie Advisor Project 27/51 February 12, 2016 User Truncation In this section 4/5 of the users are randomly removed from the database. The test data set is comprised of 5 scores for 20% of the remaining users, yielding 190 entries. Since the data sets have been altered, the results for base algorithms are recalculated. 
As Table 5 shows, it is generally simpler to predict for the new data sets:

Method               MAE    Coverage [%]
All Average          0.989  100
Movie Average        0.818  99.5
User Average         0.807  100
User Movie Average   0.750  99.5
Genre Base           0.807  99.5

Table 5: Base algorithm results for the database with 1/5 of the users

As Table 5 indicates, genre information is once more at a disadvantage, since it fails to improve the accuracy of the user average method. Figure 24 shows a comparison of the main algorithms discussed in this paper with respect to the new data sets.

Figure 24: MAE vs. coverage for the tested algorithms with 1/5 of the users in the data sets

The results of this section are somewhat of an anomaly with respect to all other results. The MSD algorithm seems to perform best at almost all coverages, while Pearson R and hybrid genre switch roles with regard to performance at low and high coverages.

Movie Truncation

In this section 2/3 of the movies are removed from the database, leaving 336 movies. The test data set comprises 5 scores for 10% of the remaining users, producing 470 entries. Since the data sets have been altered, the results for the base algorithms are recalculated. As Table 6 shows, it is generally simpler to predict for the new data sets:

Method               MAE    Coverage [%]
All Average          0.931  100
Movie Average        0.786  100
User Average         0.855  100
User Movie Average   0.771  100
Genre Base           0.884  98.9

Table 6: Base algorithm results for the database with 1/3 of the movies

Once more, the genre information is at a disadvantage, doing more harm than good with respect to the user average base algorithm. Figure 25 shows a comparison of the main algorithms discussed in this paper with respect to the new data sets.

Figure 25: MAE vs. coverage for the tested algorithms with 1/3 of the movies in the data sets

Again, the results are generally in line with the previous data sets. The Pearson R algorithm performs very well consistently. One of the reasons behind this improved performance may be the truncation of movies with generally fewer ratings. Recall Figure 4, which indicated that movies with higher IDs tended to have fewer ratings since they were introduced to the database later. Figure 26 shows a histogram of the number of scores given to each movie:

Figure 26: Histogram of the number of scores given to each movie

Comparing this figure to Figure 2, we learn that movies in the new data set generally tend to have more ratings: the median number of scores per movie has risen from 57 to 129. This reduced sparseness of the database may improve the performance of the Pearson R algorithm.

Database Reduction

The purpose of this section is to test the performance of the algorithms on a sparser database. First, 2/3 of the entries for each user are randomly removed. From the remaining entries, one score from 20% of the users is used as the test data set, yielding 189 entries. Since the data sets have been altered, the results for the base algorithms are recalculated.
Table 7 summarizes the results obtained with the base algorithms:

Method               MAE    Coverage [%]
All Average          0.998  100
Movie Average        0.868  98.9
User Average         0.807  100
User Movie Average   0.809  98.9
Genre Base           0.778  98.4

Table 7: Base algorithm results for the database with 1/3 of the entries

Figure 27 shows a comparison of the main algorithms discussed in this paper with respect to the new data sets.

Figure 27: MAE vs. coverage for the tested algorithms with 1/3 of the entries

Results obtained with the diluted database show the hybrid genre algorithm to be superior at most coverages. This superiority may arise from the fact that the hybrid genre algorithm uses as few as 19 features to represent each user. These 19 features may stabilize quickly, even with a very sparse data set.

Conclusions

Several prediction algorithms were analyzed using the GroupLens database. Existing methods such as Pearson R and MSD were presented and compared to proposed algorithms based on genre information.

The Pearson R algorithm performs quite well under variations of the database; it is always able to yield an improvement over the trivial methods. The best of the proposed algorithms, hybrid genre, was shown to generally excel at low coverages and under increased sparseness. Besides performing better on specific problem sets, the hybrid genre algorithm is substantially less complex. Coupled with its ability to perform well in sparse environments, the hybrid genre algorithm may prove to be very practical in real-world applications. One practical approach may be to intentionally dilute the database in order to reduce computational complexity.

Further research may be conducted to improve the performance of the hybrid genre algorithm. One direction may be to collapse several genres into one, or to omit unpopular genres altogether. This direction was shown to improve results while working on this paper, but is not presented here. The logic behind this approach is that unpopular genres are given less of an opportunity to affect results for the popular genres, thereby reducing 'noise'. Another direction is the complete opposite: several genres may be designated as new genres when appearing together. For example, people may have a specific affection for action-comedies.

Reference

1. Upendra Shardanand, "Social Information Filtering for Music Recommendation", MIT EECS M.Eng. Thesis; also TR-94-04, Learning and Common Sense Group, MIT Media Laboratory, 1994.
2. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J., 1999. An algorithmic framework for performing collaborative filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval.

Appendix I: Implementation

Introduction

All results in this paper were calculated and plotted using Matlab 5.3. To improve efficiency, the databases were defined as sparse matrices, and for-loops were avoided as much as possible by using lookup tables and matrix multiplication. The Pearson R algorithms were rather slow to produce results; an attempt to compile the m-files into MEX files showed no significant improvement and was discarded. The following lists the Matlab code used to generate the results in this paper.
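As an overview, a hypothetical driver script (not part of the original listings) would chain the routines below as follows:

    [uid, mid, r] = textread('u.data','%u %u %u %*u');    % load the ratings list
    mat = list2mat([uid mid r]);                          % list -> sparse rating matrix
    [base_mat, test_list] = mkdata(mat);                  % split into base and test sets
    basepr(base_mat, test_list);                          % base-algorithm results
    list_pred = pearsnn(base_mat, test_list, 0.1, 3, 25); % Pearson R predictions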
Matlab Code

General

List2Mat.m
This m-file is used to convert a list of entries into a sparse matrix.

function mat=list2mat(list)
% USAGE: mat=list2mat(list)
% This function converts a list (UID, MID, R) into a sparse matrix U x M = R
mat=0;
mat=sparse(mat);
l=length(list);
h=waitbar(0,'Processing List2Mat');
for i=1:l,
   if mod(i,2000)==0
      waitbar(i/l,h);
   end
   mat(list(i,1),list(i,2))=list(i,3);
end
close(h)

Mkdata.m
This m-file is used to generate base and test data sets.

function [base_mat, test_list]=mkdata(mat_base,varargin)
% This function resizes an input sparse matrix and divides the data into a
% base matrix (UxM<-r) and a test list (uid mid r). Base and test data are
% divided according to parameters r and tnum.
%
% USAGE: [base_mat, test_list]=mkdata(mat_base,r,tnum,mnum,unum)
% base_mat-  base matrix
% test_list- test list
% mat_base-  input sparse matrix containing all data
% r-         ratio of users tested (default = 0.1)
% tnum-      number of movies tested per user (default = 5)
% mnum-      number of movies in output data (default = size of mat_base)
% unum-      number of users in output data (default = size of mat_base)

r=0.1;                       % percent of users tested
[unum, mnum]=size(mat_base); % default matrix size, untruncated
tnum=5;
if nargin>=2
   r=varargin{1};
end
if nargin>=3
   tnum=varargin{2};
end
if nargin>=4
   mnum=varargin{3};
end
if nargin==5
   unum=varargin{4};
end
if (nargin<1 | nargin>5)
   error('Improper number of input arguments');
end

base_mat=mat_base(1:unum,1:mnum); % truncate matrix if necessary
n_u_tst=round(r*unum);            % number of users tested
p=randperm(unum);                 % randomize user index
for i=1:n_u_tst,
   idx=find(base_mat(p(i),:)~=0); % indices of rated movies of user
   ns=length(idx);                % number of scored items
   rs=randperm(ns);               % randomize scored movie index
   for j=1:tnum,                  % select tnum random movies for each user
      test_list((i-1)*tnum+j,:)=[p(i) idx(rs(j)) mat_base(p(i),idx(rs(j)))];
      base_mat(p(i),idx(rs(j)))=0;
   end
end
[tmp,index]=sort(test_list,1);    % sort test_list according to user number
test_list=test_list(index(:,1),:);

Database Statistics

Stats.m
This m-file is used to generate general statistics about the GroupLens database.

% This code is used to generate statistics for the entire MovieLens database
[uid, mid, r]=textread('u.data','%u %u %u %*u');
list=[uid mid r];
clear uid mid r
rnum=length(list);
mat_base=list2mat(list);
% load data_load <- used to load file and avoid processing of above...

% calculate mean and median
fprintf('\nMean Value is %1.2f, median is %1.2f\n',mean(list(:,3)),median(list(:,3)));

% plot score histogram
hist(list(:,3),[1 2 3 4 5]);
xlabel('Score')

% number of movie ratings, histogram and mean
mr=full(sum(mat_base~=0));
fprintf('The average number of ratings per movie is %2.1f\n',mean(mr));
hist(mr,50)
xlabel('Number of Scores for Movie')

% number of user ratings
mu=full(sum(mat_base~=0,2));
fprintf('The average number of ratings per user is %2.1f\n',mean(mu));
hist(mu,50)
xlabel('Number of Scores Per User')

% view score matrix
spy(mat_base)
xlabel('Movie ID')
ylabel('User ID')

Base Algorithms

Basepr.m
This m-file is used to calculate results for the base algorithms.
function [ave_pred, u_pred, m_pred, um_pred]=basepr(mat_base,list_test);
% This function takes a rating matrix (UxM,R) and a list of observations
% (uid, mid, r) and returns a list of predictions using several methods.
% USAGE: [ave_pred, u_pred, m_pred, um_pred]=basepr(mat_base,list_test)
% ave_pred- average score in rating matrix
% u_pred-   average user score predicted for each movie
% m_pred-   average movie score predicted for each movie (0 if not available)
% um_pred-  average user and movie score predicted for each movie (0 if not available)

[unum,mnum]=size(mat_base);
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean scores of each user
rated=sum(mat_base~=0)~=0;                     % is the movie rated at all?
mov_ave=rated.*sum(mat_base)./(sum(mat_base~=0)+~rated); % mean score of each movie, 0 if not rated
mrated=rated(list_test(:,2));                  % vector designating rated movies for list_test
mrated_i=find(mrated~=0);                      % indices of movies predicted with mean score of movie

ave_pred=full(sum(mov_ave)./sum(mov_ave~=0)); % average rating of all movies
u_pred=full(user_mean(list_test(:,1)));       % predict movie score based on user's average scores
m_pred=full(mov_ave(list_test(:,2)));         % predict movie score based on movie's average scores

mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus mean of each user, 0 for unrated movies
mat_ave=rated.*sum(mat_ave)./(sum(mat_ave~=0)+~rated);
um_pred=full(user_mean(list_test(:,1))+mat_ave(list_test(:,2))');

r=list_test(:,3);
% output mean and STD for each method...
fprintf('\nAll Average\tMAE=%1.3f\tstd=%1.3f\n',mean(abs(r-ave_pred)),std(r-ave_pred))
fprintf('User average\tMAE=%1.3f\tstd=%1.3f\n',mean(abs(u_pred-r)),std(u_pred-r))
fprintf('Movie average\tMAE=%1.3f\tstd=%1.3f\tPredicted %2.1f\n',...
   full(sum(mrated.*abs(m_pred-r'))./sum(mrated)),full(std(m_pred(mrated_i)-r(mrated_i)')),...
   (length(r)-length(find(m_pred==0)))/length(r)*100);
fprintf('UM average\tMAE=%1.3f\tstd=%1.3f\tPredicted %2.1f\n',...
   full(sum(mrated'.*abs(um_pred-r))./sum(mrated)),full(std(um_pred(mrated_i)-r(mrated_i))),...
   (length(r)-length(find(m_pred==0)))/length(r)*100);

Pearson r Algorithm

Pearsnn.m
This m-file is used to produce predictions using the Pearson R algorithm. The m-file also outputs several statistics to the standard output.

function list_pred=pearsnn(mat_base,list_test, varargin);
% This function takes a rating matrix (UxM,R) and a list of observations
% (uid, mid, r) and returns a list of predictions using the Pearson r coefficient.
%
% USAGE: list_pred=pearsnn(mat_base,list_test,pears_th, users_th, herlck_th)
% list_pred- a list of [uid mid r pred u_ave]
%
% mat_base-  base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% Optional Parameters:
% pears_th-  Pearson threshold (default=0.1)
% users_th-  minimal number of correlated users to make a prediction (default=3)
% herlck_th- number of matching rated movies between users (default=25)

herlck_th=25;
pears_th=0.1; % default Pearson r threshold
users_th=3;   % default minimum number of 'affecting users'
if nargin>=3
   pears_th=varargin{1};
end
if nargin>=4
   users_th=varargin{2};
end
if nargin==5
   herlck_th=round(varargin{3});
end
if (nargin<2 | nargin>5)
   error('Improper number of input arguments');
end

h=waitbar(0,'Processing Pears...');
[tmp,index]=sort(list_test,1); % sort list_test according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus mean of each user, 0 for unrated movies

for i=1:num_test, % calculate expected score for each test item
   user=list_test(i,1);
   mid=list_test(i,2);
   if prev_user~=user; % the distance is recalculated whenever a new user is encountered
      prev_user=user;
      ref=repmat(mat_ave(user,:),unum,1); % repeated matrix of tested user minus mean, 0 for unrated movies
mean(abs(list_pred(:,3)-list_pred(:,4)))); else fprintf('No predictions made'); end Movie Advisor Project 41/51 February 12, 2016 MSD Msd.m This m-file is used to produce predictions using the MSD algorithm. The m-file also outputs several statistics to the standard output. function [list_pred, list_npred]=msd(mat_base,list_test, varargin); % % % % % % % % % % % % % This function takes a rating matrix (UxM,R) a list of observations (uid, mid, r) and returns a list of predictions using the mean square difference. USAGE: [list_pred, list_npred]=msd(mat_base,list_test, g, L, users_th) list_pred- a list of predicted values [uid mid r pred u_ave] list_npred- a list of values that were not predicted [uid mid r u_ave] mat_base- base dataset in the form of a sparse matrix list_test- test list in the form of [uid mid r], not presumed to be sorted Optional Parameter: Lmsd threshold (default = 0.2) users_thminimal number of related users to make a prediction (default=3) users_th=3; L=.8; if nargin>=3 L=varargin{1}; end if nargin>=4 users_th=varargin{2}; end if (nargin<2 | nargin>4) error('Improper number of input arguements'); end h=waitbar(0,'Processing msd...'); [tmp,index]=sort(list_test,1); % sort _list_test_ according to user number list_test=list_test(index(:,1),:); Movie Advisor Project 42/51 February 12, 2016 [unum,mnum]=size(mat_base); num_test=size(list_test,1); prev_user=0; count=1; countn=1; user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus mean of each user, 0 for unrated movies for i=1:num_test, % calculate expected score for each test item user=list_test(i,1); mid=list_test(i,2); if prev_user~=user; % the distance is recaulculated whenever a new user is encounterd prev_user=user; ref=repmat(mat_ave(user,:),unum,1); % repeated matrix of tested user minus mean, 0 for unrated movies rated=(mat_ave~=0).*(ref~=0); % mutually rated movies... srated=sum(rated,2); % number of mutually rated movies pr=sqrt(sum((mat_ave-ref).^2.*rated,2)./(srated+(srated==0))); w=(srated~=0).*(pr<L).*(L-pr)/L; % weight used instead of pearson r coef, used to incorporate threshholding end mscr=full(mat_ave(:,mid)); % vector of movie scores minus user average sumw=w'*(mscr~=0); % sum of w for correlated users who scored movie rel_users=full(sum((w~=0).*(mscr~=0))); % if sumw~=0 % predict only if other matching users rated the movie... if rel_users>users_th score= user_mean(user)+full(w'*mscr./sumw); % expected score score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % truncate exceptional values list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)]; count=count+1; else list_npred(countn,:)=[user mid list_test(i,3) user_mean(user)]; countn=countn+1; end Movie Advisor Project 43/51 February 12, 2016 if mod(i,25)==0 waitbar(i/num_test,h); end end close(h) if exist('list_pred') fprintf('\nL TH=%.3f, Users TH=%i\n',L, users_th); fprintf('Coverage=%.4f, MAE Average=%.4f, MAE msd=%.4f\n', ... (count-1)/num_test, ... mean(abs(list_pred(:,3)-list_pred(:,5))), ... mean(abs(list_pred(:,3)-list_pred(:,4)))); else fprintf('No predictions made'); end if exist('list_npred') fprintf('ommited values, MAE Average=%1.4f\n',mean(abs(list_npred(:,3)-list_npred(:,4)))) end Genre Basegen.m This m-file is used to generate results for the baseline genre algorithm. 
Genre

Basegen.m
This m-file is used to generate results for the baseline genre algorithm.

function g_pred=basegen(mat_base,list_test, g); % base genre prediction algorithm

[tmp,index]=sort(list_test,1);     % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
user_mean=sum(mat_base,2)./sum(mat_base~=0,2);              % vector of the mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus each user's mean, 0 for unrated movies

ugen=mat_ave*g;             % matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g;      % number of rated movies in each genre, for normalization
ugenum1=ugenum+(ugenum==0); % avoid division by 0
ugen=ugen./ugenum1;         % normalize

% modify the user average by adding an average of the user's average rating
% for each genre in the movie
gen_ave=ugen(list_test(:,1),:).*g(list_test(:,2),:); % users' average genre scores restricted to the movie's genres
sgen=sum(gen_ave~=0,2)~=0; % avoid division by 0 for movies that cannot be predicted
idx=find(sgen);
g_pred=full(user_mean(list_test(:,1))+sum(gen_ave,2)./(sum(gen_ave~=0,2)+~sgen));
% predict the user's rating based solely on his average ratings and the movie's genres
r=list_test(:,3);
fprintf('Genre average\tMAE=%1.3f\tCoverage=%2.2f\n',...
    mean(abs(g_pred(idx)-r(idx))),length(idx)./num_test*100);
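A minimal usage sketch, assuming g is the movies-by-genres 0/1 matrix used throughout this appendix (one row per movie, one column per genre):

% Usage sketch: baseline genre prediction. basegen prints the MAE and
% coverage itself; g_pred(i) is the genre-adjusted prediction for test item i.
g_pred = basegen(mat_base, list_test, g);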
Peargen.m
This m-file is used to produce predictions using the Pearson genre algorithm. The m-file also outputs several statistics to the standard output.

function list_pred=peargen(mat_base,list_test, g, varargin);
%
% This function takes a rating matrix (UxM,R), a list of observations (uid, mid, r)
% and a list of genres, and returns a list of predictions using the pearson r
% coefficient.
%
% USAGE: list_pred=peargen(mat_base,list_test, g, pears_th, users_th, herlck_th)
% list_pred- a list of [uid mid r pred u_ave]
%
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% g- list of genres
% Optional Parameters:
% pears_th- pearson threshold (default=0.1)
% users_th- minimal number of correlated users to make a prediction (default=3)
% herlck_th- number of matching rated movies between users (default=30)

herlck_th=30;
pears_th=0.1;   % default pearson r threshold
users_th=3;     % default minimum number of 'affecting users'
if nargin>=4
    pears_th=varargin{1};
end
if nargin>=5
    users_th=varargin{2};
end
if nargin==6
    herlck_th=round(varargin{3});
end
if (nargin<3 | nargin>6)
    error('Improper number of input arguments');
end

h=waitbar(0,'Processing Peargen...');
[tmp,index]=sort(list_test,1);     % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2);              % vector of the mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus each user's mean, 0 for unrated movies

ugen=mat_ave*g;             % matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g;      % number of rated movies in each genre, for normalization
ugenum1=ugenum+(ugenum==0); % avoid division by 0
ugen=ugen./ugenum1;         % normalize

for i=1:num_test, % calculate the expected score for each test item
    user=list_test(i,1);
    mid=list_test(i,2);
    if prev_user~=user; % the distance is recalculated whenever a new user is encountered
        prev_user=user;
        ref=repmat(ugen(user,:),unum,1); % repeated matrix of the tested user's genre preferences
        rated=(ugen~=0).*(ref~=0);       % mutually rated genres
        srated=sum(rated,2);             % number of mutually rated genres
        cv=sqrt(sum((ugen.^2).*rated,2).*sum((ref.^2).*rated,2))+(sum(rated,2)==0);
        % cov between _user_ and the others, 1 if no match (to avoid division by 0)
        pr=(sum(rated,2)~=0).*sum(ref.*ugen.*rated,2)./cv;
        % pearson r coef. between _user_ and the others; if there is no overlap pr=0
        % should be in [-1,1], higher absolute values signify stronger correlation
        pr=(srated/herlck_th).*pr.*(srated<=herlck_th)+pr.*(srated>herlck_th);
        % significance weighting ('herlock'), used to lessen the influence of users with few co-rated genres
        w=(abs(pr)>=pears_th).*sign(pr).*(abs(pr)-pears_th)/(1-pears_th);
        % weight used instead of the pearson r coef, used to incorporate thresholding
    end
    mscr=full(mat_ave(:,mid)); % vector of movie scores minus user averages
    sumw=abs(w)'*(mscr~=0);    % sum of w over correlated users who scored the movie
    rel_users=full(sum((w~=0).*(mscr~=0)));
    % if sumw~=0 % predict only if other matching users rated the movie
    if rel_users>users_th
        score=user_mean(user)+full(w'*mscr./sumw); % expected score
        score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % clip the score to the valid range [1,5]
        list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
        count=count+1;
    end
    if mod(i,25)==0
        waitbar(i/num_test,h);
    end
end
close(h)
fprintf('\nPearson TH=%.3f, Herlock TH=%i, Users TH=%i\n',pears_th, herlck_th, users_th);
fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Pearson=%.4f\n', ...
    (count-1)/num_test, ...
    mean(abs(list_pred(:,3)-list_pred(:,5))), ...
    mean(abs(list_pred(:,3)-list_pred(:,4))));
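A usage sketch, under the same workspace assumptions as above: peargen can first be run with its defaults and then with a higher significance-weighting cutoff for comparison (both calls print their own statistics).

% Usage sketch: genre-based Pearson prediction.
list_pred = peargen(mat_base, list_test, g);             % default thresholds
list_pred = peargen(mat_base, list_test, g, 0.1, 3, 60); % stricter herlck_th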
Genmsd.m
This m-file is used to produce predictions using the genre MSD algorithm. The m-file also outputs several statistics to the standard output.

function [list_pred, list_npred]=genmsd(mat_base,list_test, g, varargin);
%
% This function takes a rating matrix (UxM,R), a list of observations (uid, mid, r)
% and a list of genres, and returns a list of predictions using the mean square
% difference.
%
% USAGE: [list_pred, list_npred]=genmsd(mat_base,list_test, g, L, users_th)
% list_pred- a list of predicted values [uid mid r pred u_ave]
% list_npred- a list of values that were not predicted [uid mid r u_ave]
%
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% g- list of genres
% Optional Parameters:
% L- rms threshold (default=0.4)
% users_th- minimal number of correlated users to make a prediction (default=10)

users_th=10;
L=.4;
if nargin>=4
    L=varargin{1};
end
if nargin>=5
    users_th=varargin{2};
end
if (nargin<3 | nargin>5)
    error('Improper number of input arguments');
end

h=waitbar(0,'Processing Genmsd...');
[tmp,index]=sort(list_test,1);     % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
countn=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2);              % vector of the mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus each user's mean, 0 for unrated movies

ugen=mat_ave*g;             % matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g;      % number of rated movies in each genre, for normalization
ugenum1=ugenum+(ugenum==0); % avoid division by 0
ugen=ugen./ugenum1;         % normalize

for i=1:num_test, % calculate the expected score for each test item
    user=list_test(i,1);
    mid=list_test(i,2);
    if prev_user~=user; % the distance is recalculated whenever a new user is encountered
        prev_user=user;
        ref=repmat(ugen(user,:),unum,1); % repeated matrix of the tested user's genre preferences
        rated=(ugen~=0).*(ref~=0);       % mutually rated genres
        srated=sum(rated,2);             % number of mutually rated genres
        pr=sqrt(sum((ugen-ref).^2.*rated,2)./(srated+(srated==0))); % rms difference; guard against division by 0
        w=(srated~=0).*(pr<L).*(L-pr)/L;
        % weight used instead of the pearson r coef, used to incorporate thresholding
    end
    mscr=full(mat_ave(:,mid)); % vector of movie scores minus user averages
    sumw=w'*(mscr~=0);         % sum of w over correlated users who scored the movie
    rel_users=full(sum((w~=0).*(mscr~=0)));
    % if sumw~=0 % predict only if other matching users rated the movie
    if rel_users>users_th
        score=user_mean(user)+full(w'*mscr./sumw); % expected score
        score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % clip the score to the valid range [1,5]
        list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
        count=count+1;
    else
        list_npred(countn,:)=[user mid list_test(i,3) user_mean(user)];
        countn=countn+1;
    end
    if mod(i,25)==0
        waitbar(i/num_test,h);
    end
end
close(h)
if exist('list_pred')
    fprintf('\nL TH=%.3f, Users TH=%i\n',L, users_th);
    fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Genre=%.4f\n', ...
        (count-1)/num_test, ...
        mean(abs(list_pred(:,3)-list_pred(:,5))), ...
        mean(abs(list_pred(:,3)-list_pred(:,4))));
else
    fprintf('No predictions made');
end
if exist('list_npred')
    fprintf('Omitted values, MAE Average=%1.4f\n',mean(abs(list_npred(:,3)-list_npred(:,4))))
end
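Because genmsd also returns the items it declined to predict, an overall error figure can be sketched by falling back to the user average for those items. The values of L and users_th below are illustrative, not recommended settings.

% Usage sketch: overall MAE when the user average serves as a fallback
% for the items genmsd did not predict.
[lp, lnp] = genmsd(mat_base, list_test, g, 0.4, 10);
err = [abs(lp(:,3)-lp(:,4)); abs(lnp(:,3)-lnp(:,4))]; % predicted and fallback errors
fprintf('Overall MAE=%.4f\n', mean(err));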
Hgenmsd.m
This m-file is used to produce predictions using the hybrid genre algorithm. The m-file also outputs several statistics to the standard output.

function [list_pred, list_npred]=hgenmsd(mat_base,list_test, g, varargin);
%
% This function takes a rating matrix (UxM,R), a list of observations (uid, mid, r)
% and a list of genres, and returns a list of predictions using the mean square
% difference. The algorithm is a hybrid of the genre algorithm, mixing in a
% weighting of the user's own genre preferences.
%
% USAGE: [list_pred, list_npred]=hgenmsd(mat_base,list_test, g, L, rat, users_th)
% list_pred- a list of predicted values [uid mid r pred u_ave]
% list_npred- a list of values that were not predicted [uid mid r u_ave]
%
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% g- list of genres
% Optional Parameters:
% L- msd threshold (default=0.7)
% rat- ratio between the weight given to other users and to the user himself (default=0.65)
% users_th- minimal number of correlated users to make a prediction (default=3)

% obtain parameters
users_th=3;
L=.7;
rat=.65;
if nargin>=4
    L=varargin{1};
end
if nargin>=5
    rat=varargin{2};
end
if nargin>=6
    users_th=varargin{3};
end
if (nargin<3 | nargin>6)
    error('Improper number of input arguments');
end

h=waitbar(0,'Processing hgenmsd...');
[tmp,index]=sort(list_test,1);     % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
countn=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2);              % vector of the mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0); % rating matrix minus each user's mean, 0 for unrated movies

ugen=mat_ave*g;             % matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g;      % number of rated movies in each genre, for normalization
ugenum1=ugenum+(ugenum==0); % avoid division by 0
ugen=ugen./ugenum1;         % normalize

for i=1:num_test, % calculate the expected score for each test item
    user=list_test(i,1);
    mid=list_test(i,2);
    if prev_user~=user; % the distance is recalculated whenever a new user is encountered
        prev_user=user;
        ref=repmat(ugen(user,:),unum,1); % repeated matrix of the tested user's genre preferences
        rated=(ugen~=0).*(ref~=0);       % mutually rated genres
        srated=sum(rated,2);             % number of mutually rated genres
        pr=sqrt(sum((ugen-ref).^2.*rated,2)./(srated+(srated==0))); % rms difference; guard against division by 0
        w=(srated~=0).*(pr<L).*(L-pr)/L;
        % weight used instead of the pearson r coef, used to incorporate thresholding
    end
    mscr=full(mat_ave(:,mid)); % vector of movie scores minus user averages
    sumw=w'*(mscr~=0);         % sum of w over correlated users who scored the movie
    rel_users=full(sum((w~=0).*(mscr~=0)));
    % if sumw~=0 % predict only if other matching users rated the movie
    if rel_users>users_th
        idx=find(g(mid,:)~=0);                      % genres of the predicted movie
        u_gen_mid=sum(ugen(user,idx))./length(idx); % the user's own average preference for those genres
        score=user_mean(user)+rat*full(w'*mscr./sumw)+(1-rat)*u_gen_mid; % expected score
        score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % clip the score to the valid range [1,5]
        list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
        count=count+1;
    else
        list_npred(countn,:)=[user mid list_test(i,3) user_mean(user)];
        countn=countn+1;
    end
    if mod(i,25)==0
        waitbar(i/num_test,h);
    end
end
close(h)
if exist('list_pred')
    fprintf('\nL TH=%.3f, Users TH=%i, ratio=%.2f\n',L, users_th,rat);
    fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Hybrid Genre=%.4f\n', ...
        (count-1)/num_test, ...
        mean(abs(list_pred(:,3)-list_pred(:,5))), ...
        mean(abs(list_pred(:,3)-list_pred(:,4))));
else
    fprintf('No predictions made');
end
if exist('list_npred')
    fprintf('Omitted values, MAE Average=%1.4f\n',mean(abs(list_npred(:,3)-list_npred(:,4))))
end
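Finally, the effect of the hybrid ratio can be sketched by sweeping rat between its two extremes; hgenmsd prints the statistics for each setting. At rat=1 the prediction relies entirely on the neighbouring users, while at rat=0 it reduces to the user's mean plus his own average genre offset.

% Usage sketch: sweep the hybrid ratio between the neighbour-based and
% own-genre terms (L fixed at its default of 0.7).
for rat = 0:0.25:1
    [lp, lnp] = hgenmsd(mat_base, list_test, g, 0.7, rat);
end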