Eric Eisenberg and Varun Rao

Collaborative Filtering of the Jester Data Set

Collaborative filtering is the term for any of a number of algorithms that use subjective ratings from many users of many items in order to predict an individual user's rating of an individual item. It is used extensively by such web sites as amazon.com and bn.com (Barnes and Noble) to make personalized recommendations of products for users to buy. In our project, we analyze the performance of a number of collaborative filtering algorithms in predicting users' ratings for the jokes in the Jester data set.

The simplest algorithm we use is Predict Zero, which merely predicts a rating of zero. The next simplest algorithms we implement, Predict User Average and Predict Item Average, use the mean of the user's ratings and the mean of the item's ratings, respectively, as the prediction. However, these baselines do not take advantage of the correlations between different users. The more advanced algorithms we implement are Pearson correlation-based collaborative filtering and Singular Value Decomposition (SVD). Pearson correlation makes use of the overall similarity between users to determine which users' ratings of a given item are most relevant in predicting the queried user's rating of that item. Finally, we implement a collaborative algorithm of our own design, described later in this paper. Fundamentally, our algorithm makes use of the similarity between items in order to better correlate users, giving the highest weight to users who rate items similar to the queried item the way the user in question does. (A sketch of the baseline predictors appears in code at the end of this introduction.) We implemented these algorithms with a combination of Java and Matlab; Matlab is used only to implement part of the singular value decomposition algorithm.

The Jester data set is a matrix of six thousand users by one hundred jokes. Each entry in the matrix is a particular user's rating of a particular joke, a value between -10 and 10. These ratings were generated by having users click along a continuum, which explains why the ratings have quite a few significant digits. While the full data set is large, we tested the algorithms only on subsets of it, most frequently 1000 of the 6000 users but all 100 jokes; our testing is discussed later in this paper. However, our algorithms can scale (with more computing power and space) to larger data sets. In our analysis of SVD, we found that a rank 90 matrix worked well compared to other ranks, and this is what we chose for our final implementation.

In addition to the basic project requirements, we make several enhancements. The first two are in the realm of improving our evaluation methodology, so as to improve our understanding of the relative advantages of different collaborative filtering algorithms. The first enhancement to our testing suite involves altering the density of our joke database. As has been noted, different algorithms scale with sparsity in different ways, so we increase the sparsity of the matrix until it is only about 5% full, to test the algorithms on a more realistically sparse data set. This better helps us understand which algorithms work better on sparser data. The second enhancement is to improve the evaluation metric itself.
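Before detailing these enhancements, here is the promised sketch of the three baseline predictors, in Java. It is illustrative rather than our exact code; in particular, representing the ratings matrix as Double[][] with null marking missing entries is an assumption made for this sketch.

    // Minimal sketch of the baseline predictors (illustrative only).
    // Missing ratings are represented as null.
    public class Baselines {
        // Predict Zero: always predict a rating of 0.
        static double predictZero(Double[][] r, int user, int joke) {
            return 0.0;
        }

        // Predict User Average: mean of all ratings the user has given.
        static double predictUserAverage(Double[][] r, int user, int joke) {
            double sum = 0; int n = 0;
            for (Double v : r[user]) {
                if (v != null) { sum += v; n++; }
            }
            return n == 0 ? 0.0 : sum / n;
        }

        // Predict Item Average: mean of all ratings the joke has received.
        static double predictItemAverage(Double[][] r, int user, int joke) {
            double sum = 0; int n = 0;
            for (Double[] row : r) {
                if (row[joke] != null) { sum += row[joke]; n++; }
            }
            return n == 0 ? 0.0 : sum / n;
        }
    }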
Currently the project only requires us to use a normalized mean absolute error (NMAE) evaluation to test how well a predicted rating of a joke matches its actual rating. However, this may not be the best evaluation methodology. If we are putting the collaborative filtering algorithm to use in a conventional context (such as an Amazon.com-like recommendation system), then we are not interested only in how closely our estimate of a rating matches the actual rating. What might be more interesting to us is, say, the top ten rated items. If we can correctly select the top rated items, then our work is done; whether we rate those items too high or too low (compared to the actual ratings) does not affect our results if all we care about is which items to recommend. However, looking only at the top ten items does not give us much data on which to analyze performance.

Extending this idea, we come to realize that item ratings are often not intended by users to be cardinal. When a user evaluates ten jokes, he or she may not actually be rating jokes on a -10 to 10 scale, but rather rating jokes in comparison to each other: having given the first joke an approximate goodness rating, the user may rate the rest of the jokes against that joke in an ordinal manner. If one joke is rated 5 and another 6, that may merely mean that the second joke is better than the first. The argument can thus be made that we should check how well our predicted ratings match the actual ratings in terms of relative ranking rather than numeric correspondence. We therefore propose an evaluation metric based on how different a user's ranking of jokes under our predictions is from that user's actual ranking, which tells us how correctly our algorithms estimate the ordinal, or relative, rating of jokes. (A sketch of this metric in code appears just before the Results section.)

Our method calculates, for each prediction, the absolute difference between two rank positions: the rank of the predicted value amongst all of the predicted and non-eliminated ratings of that user, and the rank of the corresponding correct value amongst all of that user's correct values. Each difference is normalized by dividing by the maximum possible distance between two of the user's rankings, which is the number of ratings that user has minus one (the distance between the first- and last-ranked entries). The metric is the average of these normalized differences.

Finally, our last enhancement is a new collaborative filtering algorithm. The idea behind the algorithm is simple. Users often have mostly homogeneous preferences over most types of items, but particular (and different) preferences about specific types of items. Lawyers may like the same kinds of jokes as most other people but singularly dislike jokes that make fun of lawyers. Now if we are trying to evaluate a lawyer's rating of a lawyer joke, and use the unweighted general similarity of other users to the user in question to derive a rating for the joke, we will be significantly wrong (assuming, of course, that the general population likes lawyer jokes). Our algorithm corrects this by weighting users by how similar they are when it comes to rating the type of item concerned, rather than by mere general similarity.

Our Algorithm:
Input: a query for a prediction of user x's rating of joke y, and one half of the data within the Jester data set (the remainder is stripped out and withheld for evaluation).
Output: a predicted value for user x's rating of joke y.
First, a similarity metric is used to determine the similarity of joke y to each other joke. The similarity metric we use for this step is Pearson correlation. Second, a similarity metric is used to determine the "joke y-based" likeness of user x to all the other users. This metric weighs users more heavily on their similarity to user x's ratings of jokes like joke y than on their similarity to user x's ratings of jokes less like joke y. Third, this "joke y-based" likeness of users is used to compute a weighted average of the other users' ratings of joke y. This weighted average is returned as the output.
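For concreteness, here is a minimal sketch of these three steps in Java. It is illustrative rather than our production code: the matrix representation (null for missing ratings), the use of squared Pearson similarity as the per-joke weight in step two, and the normalization by the sum of absolute similarities in step three are all assumed details of this sketch.

    // Minimal sketch of the item-based user weighting (IBUB) prediction.
    public class Ibub {
        // Pearson correlation of two rating vectors over co-rated entries,
        // with a per-entry weight on each co-rated position.
        static double weightedPearson(Double[] a, Double[] b, double[] w) {
            double sw = 0, ma = 0, mb = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] != null && b[i] != null) {
                    sw += w[i]; ma += w[i] * a[i]; mb += w[i] * b[i];
                }
            }
            if (sw == 0) return 0;
            ma /= sw; mb /= sw;
            double cov = 0, va = 0, vb = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] != null && b[i] != null) {
                    cov += w[i] * (a[i] - ma) * (b[i] - mb);
                    va  += w[i] * (a[i] - ma) * (a[i] - ma);
                    vb  += w[i] * (b[i] - mb) * (b[i] - mb);
                }
            }
            return (va == 0 || vb == 0) ? 0 : cov / Math.sqrt(va * vb);
        }

        static double predict(Double[][] r, int x, int y) {
            int nUsers = r.length, nJokes = r[0].length;

            // Step 1: similarity of joke y to every other joke (Pearson over
            // the matrix columns, i.e. over users who rated both jokes).
            double[] ones = new double[nUsers];
            java.util.Arrays.fill(ones, 1.0);
            double[] itemWeight = new double[nJokes];
            for (int j = 0; j < nJokes; j++) {
                Double[] colY = new Double[nUsers], colJ = new Double[nUsers];
                for (int u = 0; u < nUsers; u++) { colY[u] = r[u][y]; colJ[u] = r[u][j]; }
                double s = weightedPearson(colY, colJ, ones);
                itemWeight[j] = s * s; // assumed weighting: emphasizes jokes like joke y
            }

            // Step 2: "joke y-based" likeness of user x to each other user,
            // i.e. a Pearson correlation in which co-rated jokes are weighted
            // by their similarity to joke y.
            // Step 3: weighted average of the other users' ratings of joke y.
            double num = 0, den = 0;
            for (int u = 0; u < nUsers; u++) {
                if (u == x || r[u][y] == null) continue;
                double sim = weightedPearson(r[x], r[u], itemWeight);
                num += sim * r[u][y];
                den += Math.abs(sim);
            }
            return den == 0 ? 0 : num / den;
        }
    }

A query for user x's rating of joke y is then just predict(r, x, y).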
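Since both metrics drive the results below, here is a minimal sketch of the evaluation step, covering NMAE and the ordinal metric proposed above. It is a simplification of what we described: it ranks each user's held-out entries only, whereas our metric ranks predictions amongst all of a user's predicted and non-eliminated ratings, and ties here are broken arbitrarily.

    // Minimal sketch of the evaluation metrics. For each user we have
    // parallel arrays of predicted and actual ratings for the held-out
    // entries. Ratings lie in [-10, 10], so the NMAE range is 20.
    import java.util.Arrays;

    public class Metrics {
        // Normalized mean absolute error over all held-out predictions.
        static double nmae(double[][] pred, double[][] actual) {
            double sum = 0; int n = 0;
            for (int u = 0; u < pred.length; u++) {
                for (int i = 0; i < pred[u].length; i++) {
                    sum += Math.abs(pred[u][i] - actual[u][i]) / 20.0;
                    n++;
                }
            }
            return sum / n;
        }

        // Rank position of each entry of v when v is sorted ascending
        // (ties broken arbitrarily for simplicity).
        static int[] ranks(double[] v) {
            Integer[] idx = new Integer[v.length];
            for (int i = 0; i < v.length; i++) idx[i] = i;
            Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));
            int[] rank = new int[v.length];
            for (int pos = 0; pos < v.length; pos++) rank[idx[pos]] = pos;
            return rank;
        }

        // Ordinal error: average absolute difference between each prediction's
        // rank among the user's predictions and the correct value's rank among
        // the user's correct values, normalized by the maximum distance n - 1.
        static double ordinalError(double[][] pred, double[][] actual) {
            double sum = 0; int n = 0;
            for (int u = 0; u < pred.length; u++) {
                int m = pred[u].length;
                if (m < 2) continue; // no ranking distance is defined
                int[] rp = ranks(pred[u]), ra = ranks(actual[u]);
                for (int i = 0; i < m; i++) {
                    sum += Math.abs(rp[i] - ra[i]) / (double) (m - 1);
                    n++;
                }
            }
            return sum / n;
        }
    }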
Results

As can be seen from the graphs below, even our best method (our own algorithm, IBUB) does only about 20%-25% better than the trivial Predict Zero algorithm on the normalized mean absolute error metric. This is largely a function of the difficulty of our task; we are trying to predict complex results of users' preferences using simplistic means. However, when we examine these results under the lens of our ordinal metric, we find more substantial improvements. Some of this may be more reflective of our particular ordinal metric than of any actual improvement. Nevertheless, using a different kind of metric seems to give us significantly more information about the relative advantages of the different algorithms.

We find, as expected, that the SVD algorithm does about as well as the Predict User Average algorithm. Our particular SVD algorithm normalizes off user average values only, and thus its results are close to those of the Predict User Average method. SVD is more useful when applied to collaborative filtering with many more items, since one of its advantages is the ease of computing a rating for any particular item. Similarly, while our IBUB algorithm does better than any other method we implement, it builds off of, and is similar to, the Pearson correlation method and thus behaves similarly. The time tradeoffs with IBUB are significant: its small improvement in accuracy requires much more computation time. It would probably be a much better algorithm than Pearson correlation when used in combination with nearest-neighbour approaches. Alternatively, using IBUB only when the information gain is known to be significant, and other methods the rest of the time, would probably give as good results in less time.

[Figure: Comparison of Algorithms -- Number of Users (2000, 1000, 500) vs. Normalized Mean Absolute Error (roughly 16% to 23%), for Predict 0, Predict Item Av, Predict User Av, SVD (rank 90), Pearson Corr, and IBUB.]

As can be seen from this chart, decreasing the number of users has a similar effect upon each of the algorithms, increasing the Normalized Mean Absolute Error by a bit less than 1 percent for each of them.

[Figure: Comparison of Algorithms -- Number of Users (2000, 1000, 500) vs. Ordinal Error (roughly 22% to 28%), for the same six algorithms.]

[Figure: Comparison of Algorithms -- Density (7%, 35%, 64%, at 1000 users) vs. Normalized Mean Absolute Error (roughly 16% to 23%), for the same six algorithms.]

[Figure: Comparison of Algorithms -- Density (7%, 35%, 64%, at 1000 users) vs. Ordinal Error (roughly 22% to 34%), for the same six algorithms.]
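Since our own SVD step lives in Matlab, the following sketch shows the rank-k prediction step in Java using the Apache Commons Math library as a stand-in. Filling each user's missing entries with that user's average (consistent with normalizing off user averages, as our implementation does) and the library choice itself are assumptions of this sketch.

    // Minimal sketch of rank-k SVD prediction (we used Matlab for this step;
    // Apache Commons Math stands in here). Missing entries are first filled
    // with the user's average rating.
    import org.apache.commons.math3.linear.*;

    public class SvdPredictor {
        // Returns a dense rank-k reconstruction whose entries are the predictions.
        static RealMatrix predictAll(Double[][] r, int k) {
            int nUsers = r.length, nJokes = r[0].length;
            double[][] filled = new double[nUsers][nJokes];
            for (int u = 0; u < nUsers; u++) {
                double sum = 0; int n = 0;
                for (Double v : r[u]) if (v != null) { sum += v; n++; }
                double avg = n == 0 ? 0 : sum / n;
                for (int j = 0; j < nJokes; j++)
                    filled[u][j] = r[u][j] == null ? avg : r[u][j];
            }
            SingularValueDecomposition svd =
                new SingularValueDecomposition(new Array2DRowRealMatrix(filled));
            int p = Math.min(k, svd.getSingularValues().length);
            // Keep only the first p singular values/vectors: U_k * S_k * V_k^T.
            RealMatrix u  = svd.getU().getSubMatrix(0, nUsers - 1, 0, p - 1);
            RealMatrix s  = svd.getS().getSubMatrix(0, p - 1, 0, p - 1);
            RealMatrix vt = svd.getVT().getSubMatrix(0, p - 1, 0, nJokes - 1);
            return u.multiply(s).multiply(vt);
        }
    }

For our final configuration (rank 90 on a 1000-user slice), the call would be predictAll(r, 90), reading the prediction for user x and joke y off entry (x, y) of the result.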
Individual report for Eric Eisenberg:

The ideas in this report were generated through discussion between myself and Varun Rao. We came up with several ideas for new collaborative filtering methods and debated whether they would be both effective at filtering and reasonably efficient from a time-analysis perspective. Since the group is only two people, division of labor did not appear to be greatly necessary; rather, since two people are generally needed for debugging, we did our coding together. Implementation went fairly smoothly.

Individual report for Varun Rao:

The work we did on this project was essentially collaborative in nature. Since we are a team of two, we thought it unwise to assign any task to a single person, and instead worked together on all of it. Thus we did the research together, discussing ideas for various algorithms and methodologies. In particular, we came up with our item-weighted algorithm in close consultation with each other. We worked together to write and debug code as well. We structured the code as follows: a generation module strips data from the data set and passes it to any one of several algorithm modules. The algorithm modules impute ratings only for the stripped-out data and pass them, along with the correct full data set, to an evaluation module. The evaluation module then evaluates the imputation based on the metrics outlined before.

Works Consulted:

J. S. Breese, D. Heckerman, and C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI), 1998.

J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An Algorithmic Framework for Performing Collaborative Filtering. In Proceedings of the 22nd ACM SIGIR Conference, 1999.

B. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Application of Dimensionality Reduction in Recommender Systems: A Case Study. In the ACM WebKDD Web Mining for E-Commerce Workshop, 2000.

B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. "Item-Based Collaborative Filtering Recommendation Algorithms." Accepted for publication at the WWW10 Conference, May 2001.

B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. "Analysis of Recommendation Algorithms for E-Commerce."

Sample page from Numerical Recipes in C: The Art of Scientific Computing (ISBN 0-521-43108-5), 1988-1992, Cambridge University Press.

Eric W. Weisstein. "Singular Value Decomposition." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/SingularValueDecomposition.html

K. Goldberg and R. Hennessy. Jester: The On-Line Joke Recommender. http://shadow.ieor.berkeley.edu/humor/ Retrieved February 2005.