From: AAAI Technical Report WS-98-08. Compilation copyright © 1998, AAAI. All rights reserved.

Recommender Systems: A GroupLens Perspective

Joseph A. Konstan*†, John Riedl*†, Al Borchers*, and Jonathan L. Herlocker*

*GroupLens Research Project, Dept. of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455
†Net Perceptions, Inc., 11200 West 78th Street, Suite 300, Minneapolis, MN 55344

ABSTRACT

In this paper, we review the history and research findings of the GroupLens Research project¹ and present the four broad research directions that we feel are most critical for recommender systems.

INTRODUCTION: A History of the GroupLens Project

The GroupLens Research project began at the Computer Supported Cooperative Work (CSCW) Conference in 1992. One of the keynote speakers at the conference lectured on his vision of an emerging information economy, in which most of the effort in the economy would revolve around production, distribution, and consumption of information, rather than physical goods and services. Paul Resnick, then a student at MIT, and now a professor at the University of Michigan, and one of us (Riedl) were moved by the talk to consider the technical challenges that would have to be overcome to enable the information economy. We realized that as the amount of information increased enormously, while people’s ability to process information remained stable, one of the critical challenges would be technology that would automate matching people with the information they would find most valuable.

There were two main thrusts of research activity in this area that we knew of: (1) Artificial Intelligence (AI) research to develop tools that would serve as a "knowledge robot", or knowbot, continually seeking out information, reading and understanding it, and returning with the information that the knowbot determined would be most valuable to its user. (2) Information Filtering (IF) research to develop even more efficient tools for selecting documents that contain keywords of interest to a user. These techniques were, and continue to be, fruitful, but we felt they each have one serious weakness. In the case of the knowbot, the weakness is that we are still a significant distance from technology that can understand articles in the way a human does. In the case of Information Filtering, the weakness is that identifying sets of articles by keyword does not scale to a situation in which there are thousands of articles that contain any imaginable set of keywords. Taken together, these two weaknesses represented an opportunity for a new type of filtering that would focus on finding which available articles match human notions of quality and taste. Such a system would be able to produce a list of articles that each user would like, independent of their content.

We decided to apply our ideas in the domain of Usenet news. Usenet screams for better information filtering, with hundreds of thousands of articles posted daily. Many of the articles in each Usenet newsgroup are on the same topic, so syntactic techniques that identify topic are much less valuable in Usenet. Further, different people value very different sets of articles, with some people participating in long discussion threads that other people couldn’t imagine even reading.

We developed a system that falls into the class that is now called automatic collaborative filtering. It collects ratings from people on articles, combines the ratings statistically, and produces recommendations for other people of how much they are likely to like each article. We invited people from all over the Internet to participate in using GroupLens, and studied the effect of the system on users. Users resisted our early attempts to establish multidimensional rating schemes, including characteristics such as quality of the writing and suitability of the topic for the newsgroup. Rating on multiple dimensions was too much work.
We changed to single-dimension ratings, with the dimension being "What score would you have liked GroupLens to predict for you for this article?" We found that users did change behavior in response to the recommendations, reading a much higher percentage of the articles that GroupLens predicted they would like than of either randomly selected articles or articles GroupLens predicted they would not like. However, there were many articles for which GroupLens was unable to provide ratings, because even with two to three hundred users, there were simply too many articles in the six newsgroups we were studying. A greater density of ratings by article would have improved the usability of the system for most users.

The low ratings density was compounded by the first rater problem, which is the problem that a pure collaborative filtering system cannot possibly make recommendations to the first person that reads each article. One effect of these two problems is that some beginning users of the system saw little value from GroupLens initially, and hence never developed the habit of contributing ratings, though they continued to use GroupLens-enabled news readers.

¹ GroupLens™ is a trademark of Net Perceptions, Inc., which develops and markets the GroupLens Recommendation Engine. Net Perceptions allows the University of Minnesota to use the name "GroupLens Research" for continuity. The ideas and opinions expressed in this paper are those of the authors and do not represent opinions of Net Perceptions, Inc.

Because most users did not like most articles, and because GroupLens was effective at identifying articles users would like, users requested the ability to scan a newsgroup for the articles that were predicted to be of high interest to them. This led to our exploring a different style of interface to a collaborative filtering system, the TopN interface.
Rather than predicting a score for each article, a TopN interface greedily seeks articles that are likely to have high scores for an individual user, and recommends those articles to that user. Eventually, such an interface might be able to present each of us with a list of the 20-30 most interesting articles for us from all of Usenet each morning.

Our key lesson learned was that a very high volume, low quality system like Usenet would require a very large number of users for collaborative filtering to be successful. For our research purposes, we needed a lower volume, higher density testbed. Our colleagues from Digital Equipment Corporation were closing down their research system on movie recommendations, and offered us the data to jump-start a similar system using GroupLens. We launched our system in the summer of 1997, and have been running it since. The system, MovieLens, is entirely web-based and has several thousand regular users. Users rate movies, and MovieLens recommends other movies to them.

Over the past six years of research, we have learned that people are hungry for effective tools for information filtering, and that collaborative filtering is an exciting complement to existing filtering systems. Users value both the taste-based recommendations and the sense of community they get by participating in a group filtering process. However, there are still many open research problems in collaborative filtering. Below we discuss our early results on some of these problems, and outline the remaining problems we feel to be most important to the evolution of the field of collaborative filtering.

CURRENT RESEARCH RESULTS

Our recent research has focused on improving the quality and efficiency of collaborative filtering systems. We have taken a broad approach, seeking solutions that improve the efficiency, accuracy, and coverage of the system.
Specifically, we’ve examined partitioning users and items, incorporating filtering agents into the collaborative filtering framework, and using existing data sets to start up a collaborative filtering recommendation system.

Partitioning Users and Items

Both performance and accuracy concerns led us to explore the use of partitioning. If user tastes are more consistent within a partition of items, and therefore user agreement is more consistent, then partitioning the items in the system may yield more accurate recommendations. Even if the increased accuracy is offset by the smaller number of items available to establish user correlations, partitioning may be valuable because it can help scale the performance of the system; each partition can be run in parallel on a separate server.

To explore the potential of item partitioning, we considered three partitioning strategies for MovieLens: random partitions, partitions by movie genre, and partitions generated algorithmically by clustering based on ratings. Clustering-based partitions produced a slight loss in prediction accuracy as partitions grew smaller, but showed promise for a reasonable trade-off between performance and accuracy. Movie genre partitions yielded less accurate recommendations than cluster-based ones (though some genres were much more accurate, and others much less so). Random partitions were slightly worse still. The value of item partitions clearly depends on the domain of the recommendation system and the density of ratings within and across potential partitions (our earlier Usenet work found that mixing widely different newsgroups together reduced accuracy). One advantage of the clustering result is that it may be more broadly applicable in domains where items lack obvious attributes for partitioning.

We also looked at the value of user partitioning, starting with the extreme case of pre-computed symmetric neighborhoods based on our clustering algorithm; these were small partitions of about 200 users.
If symmetric neighborhoods yield good results, time per recommendation can be reduced dramatically, since substantial per-neighborhood computation can be performed incrementally and amortized across the neighbors. We found that the accuracy of recommendations was almost as good as using the full data set, but that the coverage (i.e., the number of movies for which we could compute a recommendation) fell by 14%.

To restore coverage we introduced a two-level hierarchy of users. Users from each other neighborhood were collapsed into a single composite user. Each neighborhood then had all users represented: similar users were represented at full resolution, and the more distant users were represented at the much lower resolution of one composite user per neighborhood. This restored full coverage, and the quality of predictions was only slightly degraded, by about 1% from the unpartitioned case. We are continuing to explore these hierarchical approaches.

Filterbots

One problem with pure collaborative filtering is that users cannot receive recommendations for an item until enough other users have rated it. Content-based information filtering approaches, by contrast, avoid this problem by establishing profiles that can be used to evaluate items (e.g., keyword preferences). To combine the best of both approaches, we developed filterbots--rating agents that use content information to generate ratings systematically. These ratings are entered into the recommendation system by treating the filterbots as additional users. This approach has the benefit of allowing us to use simple-minded or controversial filterbots; if a user agrees with a particular filterbot, that filterbot becomes part of the user’s neighborhood and gains influence in recommendations. If a user does not agree with the filterbot, it does not become part of that user’s neighborhood and is therefore ignored.

To test this concept, we created several simple-minded filterbots for Usenet news.
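The idea of a filterbot as "just another user" can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the project's actual code: the function names, the `>` quoting convention, the 1-5 rating scale, and the dictionary-of-dictionaries ratings table are all hypothetical.

```python
def quoted_text_filterbot(article_text, scale=(1, 5)):
    """Hypothetical filterbot: rate an article by how much of it is new text.

    Lines starting with '>' are treated as quoted (included) text; this
    quoting convention and the 1-5 scale are assumptions of the sketch.
    """
    lines = [ln for ln in article_text.splitlines() if ln.strip()]
    low, high = scale
    if not lines:
        return low  # nothing to read: lowest rating
    quoted = sum(1 for ln in lines if ln.lstrip().startswith(">"))
    original_fraction = 1.0 - quoted / len(lines)
    # Map the fraction of original text linearly onto the rating scale.
    return round(low + original_fraction * (high - low))


def add_filterbot_ratings(ratings, bot_name, bot, articles):
    """Enter the bot's ratings into the table as if it were one more user.

    ratings:  {user_name: {article_id: rating}}
    articles: {article_id: article_text}
    """
    ratings[bot_name] = {aid: bot(text) for aid, text in articles.items()}
    return ratings
```

Because the bot's ratings sit in the same table as human ratings, an unmodified collaborative filtering algorithm decides, per user, whether the bot is a useful neighbor; no special handling is required.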
We found that a spell checking filterbot not only increased coverage dramatically (as much as 514%), but also increased accuracy as much as 74%. In rec.humor, a notoriously high noise group, all three of our simple filterbots (spell checker, percentage of included text, and message length) improved coverage and quality. In other newsgroups, some filterbots helped while others did not. Fortunately, the cost of filterbots is quite low, particularly since simple ones appear to have significant value. We plan to continue exploring filterbots, looking both at simple content filtering algorithms and at learning agents.

Jump-Starting a Recommendation Engine

Collaborative filtering systems face a start-up problem: until a critical mass of ratings has been collected, there is not enough data to compute recommendations. Accordingly, early users receive little value for their contribution. In our MovieLens system we had the good fortune of starting our system seeded with a database of over 2.8 million ratings from the earlier EachMovie recommender system. For privacy reasons the database we received from EachMovie had only anonymous users; although we could not associate these users with our own, they could still serve as recommenders for our users. We call this "dead data."

We took advantage of this rare opportunity to evaluate the experience of new users in systems with and without the dead data. We retrospectively evaluated the recommendation accuracy, coverage, and user satisfaction for early users of EachMovie and MovieLens. For our accuracy and coverage experiments, we held the recommendation algorithm constant, and found that the jump-started case (MovieLens) had better coverage (nearly 100%, as compared with 89%) and higher accuracy (increases as high as 19%, depending on the metric used).

To assess user satisfaction, we retrospectively compared user retention and participation in our current MovieLens system with that of the early EachMovie system.
By looking at the sessions, ratings, and overall length of active use of corresponding early EachMovie and MovieLens users (all of which could be reconstructed from logs), we were able to measure indicators of user satisfaction. We found that early MovieLens users were more active than early EachMovie users in all categories, with dramatic increases in the number of ratings and number of sessions. Accordingly, it appears that the start-up problem is a real one--user retention and participation improve when users receive value--and using historical or "dead" data may be a useful technique for improving start-up.

WHAT’S NEXT: A RESEARCH AGENDA

Based on our prior work, we’ve identified four broad problem areas that are particularly critical to the success of recommender systems. We discuss these in general, highlighting work known to be underway, but also presenting open questions that are ripe for research.

Supporting Users and Decision-Making

Early work in recommender systems focused on the technology of making recommendations. Papers cited measures of system accuracy such as mean absolute error, measures of throughput and latency, and occasionally a general metric indicating that people used the system, or perhaps that they said they liked it. Now that the technological feasibility of recommender systems is well established, we must face the challenge of designing and evaluating systems to support users and their decision-making processes.

While there are many factors that affect the decision-making value of a recommendation system for users, three critical issues have arisen in each system we’ve studied: accuracy, confidence, and user interface.

Accuracy is the measure of how closely the recommendations generated by the system predict the actual preferences of the user. Measurement of accuracy is itself a challenging issue that we discuss below. However, for any sensible definition of accuracy, recommender systems still are far from perfect.
Both Maes’ work on Ringo and our own work suggest that today’s pure collaborative filtering systems typically achieve at best an accuracy of plus-or-minus one on a seven-point scale. Further research is needed to determine the theoretical limit of accuracy, based on user rate/re-rate differences and empirically determined variances. Then, significant work is needed on a wide range of approaches to improve accuracy for user tasks. These approaches include those discussed below and special filtering models tuned for precision and for recall.

Confidence is a measure of how certain the recommendation system is of its recommendation. While statistical confidence measures often are expressed as confidence intervals or expected distributions, current recommendation systems generally provide no more than a simple "high, medium, or low" confidence score. Part of the difficulty with expressing confidence as an interval, distribution, or variance is the complexity of the statistics underlying collaborative filtering. Unlike dense analytic techniques such as multiple regression, collaborative filtering lacks a well-understood measure of error or variance. Sources of error include: the user’s own variance in rating, the imperfect matching of neighbors, the degree to which past agreement really does predict future agreement, the number of items rated by the user, the number of items in common with each neighbor, rounding effects, and many others.

At the same time, measures of confidence are critical to users trying to determine whether to rely upon a recommendation. Without confidence measures, it is extremely difficult to provide recommendations in situations where users are risk averse. Perhaps even worse is the poor reputation that a recommendation engine will receive if it delivers low-confidence recommendations. Accordingly, a key research priority is the development of computable and usable confidence measures.
When algorithms permit, analytic solutions are desirable, but we are also investigating empirical confidence measures that can be derived from and applied to an existing system.

User interface issues in collaborative filtering span a range of questions including:
• When and how to use multi-dimensional ratings
• When and how to use implicit ratings
• How a recommendation should be displayed

Multi-dimensional ratings (and therefore predictions) seem natural in certain applications. Restaurants are often evaluated separately on food, service, and value. Tasks are often rated for importance and urgency. Today’s recommendation engines can accept these dimensions separately, but further research is needed on cross-dimension correlation and recommendation.

Implicit ratings are observational measures of a user’s preference for an item. For example, in Usenet news we found that time spent reading an article is a good measure of preference for that article. Similarly, listening to music, viewing art, and purchasing consumer goods are all good indicators of preference. Today’s systems lack automated means for evaluating and calibrating implicit ratings; further research would be valuable.

There are many ways to display recommendations, ranging from simply listing recommended items (without order), to marking recommended items in a list of items, to displaying a predicted rating for items. Many research questions need to be answered through real user studies. In one study we’ve conducted, we saw that the correctness of user decision making is directly affected by the type of display used. In other work, we are examining the question of whether the expected value of a predicted rating distribution is more or less valuable than the probability of the rating exceeding a "worthwhile" cutoff. For example, is a user more interested that a movie is likely to be three-and-a-half stars, or that it has a 40% chance of being four or five stars? Many similar questions remain unanswered.

Beyond Collaborative Filtering

The second major research issue for recommender systems is integrating technologies other than collaborative filtering into recommendation systems. Content analysis, demographics, data mining, machine learning, and other approaches to learning from data each have advantages that can help offset some of the fundamental limitations of collaborative filtering (e.g., sparsity and the early rater problem). Some of the interesting open research questions include:
• How to integrate content analysis techniques from information retrieval, information filtering, and agents research into recommendation systems. Our filterbot work is a first step in this direction, as is MIT’s collaborating agent work and Stanford’s Fab system. More research is needed to discover which techniques work for which applications.
• How to take advantage of user demographics and rule-based knowledge in recommendation systems.
• How to integrate the power of data mining with the real-time capabilities of recommender systems. Particularly interesting questions include identifying temporal trends in preferences (e.g., people who prefer A at time t are more likely to prefer B at time t+1).
• How to take advantage of machine learning techniques in recommender systems.

In all of these cases, a key question will be whether one technology can be incorporated into the framework of another, or whether a new architecture is needed to merge the two types of knowledge.

Scale, Sparsity, and Algorithmics

The fundamental problem of producing accurate recommendations efficiently from a sparse set of ratings is inherent to collaborative filtering. Indeed, if the ratings set were dense, there would be little value in producing recommendations. Accordingly, there is still great need for continued research into fundamental issues in performance, scalability, and applicability.
One particularly interesting research area that we, along with others, are actively investigating is techniques for reducing the computational complexity of recommendation. As discussed above, the complexity of recommendations generally grows with the size of the database, which is the product of the number of users and the number of items. Neighborhoods, partitioning, and factor analysis all attempt to reduce this size by limiting the elements considered along the user dimension, the item dimension, or both. Neighborhood techniques use only a subset of users in the computation of recommendations. Many variants have been proposed and implemented; research is needed to assess the varying merits of symmetric vs. asymmetric neighborhoods, on-demand vs. longer-term neighborhoods, threshold vs. size limit, etc. Partitioning is discussed above; factor analysis is a different approach that tries to decompose users or items into a combination of shared "taste" vectors. Each of these techniques has promise for reducing storage and computation.

A second critical issue is to continue to address the challenge of ratings sparsity, particularly for applications where few users ever should rate an item. Clustering, factor analysis, and hierarchical techniques that combine individual items with clusters or factors can provide one set of solutions. Integration with other technologies will provide others.

Finally, there is still significant work needed on the central algorithmics of collaborative filtering to improve performance and accuracy. The use of Pearson correlations for neighbor selection and weighting is common in recommendation systems, yet many alternatives may be more suitable, depending on the distribution of ratings. Similarly, the results from Ringo, along with some of our own, suggest that the common Pearson algorithm overvalues neighbors with low correlations. Further work, particularly including empirical work, is needed to evaluate candidate algorithms.
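To make the role of Pearson correlation in neighbor selection and weighting concrete, here is a minimal sketch of one common formulation: a user's mean rating plus a correlation-weighted average of neighbors' deviations from their own means. It is offered as an illustration under stated assumptions, not as the algorithm used in any particular system. The names `pearson`, `predict`, and `min_corr`, and the dictionary-based ratings table, are hypothetical; `min_corr` shows one possible way to discount low-correlation neighbors, the concern raised above.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items both users rated.

    a, b: {item_id: rating}. Returns 0.0 when the correlation is
    undefined (fewer than two common items, or zero variance).
    """
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)
               * sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, user, item, min_corr=0.0):
    """Predict `user`'s rating of `item`, or None if no neighbor qualifies.

    ratings:  {user_name: {item_id: rating}}
    min_corr: hypothetical threshold; neighbors whose absolute
              correlation does not exceed it are ignored.
    """
    mine = ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(mine, theirs)
        if abs(w) <= min_corr:
            continue
        their_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - their_mean)
        den += abs(w)
    if den == 0.0:
        return None
    return my_mean + num / den
```

Note that the prediction is not clamped to the rating scale, and every other user is scanned for each prediction; the neighborhood and partitioning techniques discussed above are ways of shrinking that scan.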
Metrics and Benchmarks

Sadly, evaluation is one of the weakest parts of current recommender system research. Systems are not compared against each other directly, and published results use a variety of metrics, often incomparable ones. Both research and collaboration are needed to establish an accepted set of metrics and benchmarks for evaluating recommendation systems. Three areas are particularly important and promising.

Accuracy metrics. There are nearly a dozen different metrics that have been used to measure recommendation system accuracy. Statistical measures of error include the mean absolute error between predicted and actual rating, the root mean squared error (to more heavily weigh large errors), and the correlation between predicted and actual ratings. Other metrics attempt to assess the prevalence of large errors. Reversal measures tally the frequency with which "embarrassingly bad" recommendations are made. A different set of metrics attempts to assess the effectiveness of the recommendation engine in filtering items. Precision and recall statistics, borrowed from information retrieval, together with receiver operating characteristic measurements from signal processing, discount errors that do not affect usage, and more heavily weigh errors near the decision point. For example, the difference between 1.0 and 2.5 on a five point scale may be unimportant, since both are rejected, while the difference between 3.0 and 4.5 is extremely relevant. Finally, ordering metrics assess the effectiveness of top-N algorithms in identifying true "top" items. The different metrics in each category should be evaluated, with the most useful ones identified as those expected for publication and comparison.

Coverage combined with accuracy. Accuracy metrics alone are not useful for many types of comparison. By setting the confidence threshold high, most systems can increase accuracy at the expense of coverage.
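The trade-off just described can be seen in a small sketch that reports mean absolute error and root mean squared error together with coverage at a chosen confidence threshold. The function name `evaluate`, the tuple layout, and the numeric confidence values are assumptions made for illustration; as noted earlier, existing systems do not necessarily expose a confidence score of this form.

```python
from math import sqrt


def evaluate(predictions, threshold=0.0):
    """Report coverage, MAE, and RMSE for predictions above a threshold.

    predictions: list of (predicted, actual, confidence) tuples, where
                 `confidence` is a hypothetical per-prediction score.
    Only predictions with confidence >= threshold count as covered.
    """
    kept = [(p, a) for p, a, c in predictions if c >= threshold]
    coverage = len(kept) / len(predictions) if predictions else 0.0
    if not kept:
        return {"coverage": coverage, "mae": None, "rmse": None}
    errors = [p - a for p, a in kept]
    mae = sum(abs(e) for e in errors) / len(errors)
    # Squaring weighs large errors more heavily than MAE does.
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    return {"coverage": coverage, "mae": mae, "rmse": rmse}
```

Calling `evaluate` over a range of thresholds yields a coverage/accuracy curve, which allows two systems to be compared at equal coverage rather than on accuracy alone.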
Combined metrics are needed to allow meaningful evaluation of coverage/accuracy trade-offs. We are working on a combined decision-support metric, but others are needed as well.

Full-system benchmarks. The greatest need right now is for corpora and benchmarks that can be used for comparison. In the future, these should be integrated with economic models to evaluate, in monetary terms, the value added by a recommender system.

ACKNOWLEDGMENTS

We would like to acknowledge the financial support of the National Science Foundation and Net Perceptions, Inc. We also would like to thank the dozens of individuals, mostly students, who have contributed their effort to the GroupLens Research project. Finally, we would like to thank our users.

FOR ADDITIONAL INFORMATION

Several good sources of bibliographic information already exist in the area of collaborative information filtering and recommender systems. Rather than duplicate that work here, we refer the reader to:

1. The March 1997 issue of Communications of the ACM, edited by Hal Varian and Paul Resnick. In addition to containing articles on several relevant systems, the section introduction and articles contain extensive bibliographic information.

2. The Collaborative Filtering Resources web page; this page grew out of a March 1996 workshop on collaborative filtering held at Berkeley. It includes pointers to other reference pages.