Recommender Systems: A GroupLens Perspective

From: AAAI Technical Report WS-98-08. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.
Joseph A. Konstan*†, John Riedl*†, Al Borchers†, and Jonathan L. Herlocker*

*GroupLens Research Project
Dept. of Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
http://www.cs.umn.edu/Research/GroupLens/

†Net Perceptions, Inc.
11200 West 78th Street, Suite 300
Minneapolis, MN 55344
http://www.netperceptions.com/
ABSTRACT

In this paper, we review the history and research findings of the GroupLens Research project¹ and present the four broad research directions that we feel are most critical for recommender systems.
INTRODUCTION: A History of the GroupLens Project
The GroupLens Research project began at the Computer Supported Cooperative Work (CSCW) Conference in 1992. One of the keynote speakers at the conference lectured on his vision of an emerging information economy, in which most of the effort in the economy would revolve around production, distribution, and consumption of information, rather than physical goods and services. Paul Resnick, then a student at MIT and now a professor at the University of Michigan, and one of us (Riedl) were moved by the talk to consider the technical challenges that would have to be overcome to enable the information economy. We realized that as the amount of information increased enormously, while people's ability to process information remained stable, one of the critical challenges would be technology that would automate matching people with the information they would find most valuable.
There were two main thrusts of research activity in this area that we knew of: (1) Artificial Intelligence (AI) research to develop tools that would serve as a "knowledge robot," or knowbot, continually seeking out information, reading and understanding it, and returning with the information that the knowbot determined would be most valuable to its user. (2) Information Filtering (IF) research to develop even more efficient tools for selecting documents that contain keywords of interest to a user. These techniques were, and continue to be, fruitful, but we felt they each have one serious weakness. In the case of the knowbot, the weakness is that we are still a significant distance from technology that can understand articles in the way a human does. In the case of Information Filtering, the weakness is that identifying sets of articles by keyword does not scale to a situation in which there are thousands of articles that contain any imaginable set of keywords. Taken together, these two weaknesses represented an opportunity for a new type of filtering, one that would focus on finding which available articles match human notions of quality and taste. Such a system would be able to produce a list of articles that each user would like, independent of their content.

We decided to apply our ideas in the domain of Usenet news. Usenet screams for better information filtering, with hundreds of thousands of articles posted daily. Many of the articles in each Usenet newsgroup are on the same topic, so syntactic techniques that identify topic are much less valuable in Usenet. Further, different people value very different sets of articles, with some people participating in long discussion threads that other people couldn't imagine even reading.
We developed a system that falls into the class that is now called automatic collaborative filtering. It collects ratings from people on articles, combines the ratings statistically, and produces recommendations for other people of how much they are likely to like each article.
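The paper describes the computation only at this level of detail. As a concrete illustration, the sketch below implements one classic form of automatic collaborative filtering: a correlation-weighted average of other users' rating deviations. The toy data, the function names `pearson` and `predict`, and the 1-5 scale are our assumptions for illustration, not details of the GroupLens system.

```python
import math

# Toy ratings table: ratings[user][article] = score on a 1-5 scale (ours).
ratings = {
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 2, "c": 5, "d": 4},
    "cay": {"a": 1, "b": 5, "d": 2},
}

def pearson(u, v):
    """Pearson correlation between two users over their co-rated items."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    ru = [ratings[u][i] for i in common]
    rv = [ratings[v][i] for i in common]
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    num = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    den = math.sqrt(sum((a - mu) ** 2 for a in ru) *
                    sum((b - mv) ** 2 for b in rv))
    return num / den if den else 0.0

def predict(user, item):
    """Predict a rating: the user's own mean, adjusted by the
    correlation-weighted deviations of other users who rated the item."""
    base = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        w = pearson(user, other)
        other_mean = sum(ratings[other].values()) / len(ratings[other])
        num += w * (ratings[other][item] - other_mean)
        den += abs(w)
    return base + num / den if den else base

print(round(predict("ann", "d"), 2))  # predicted score for an unseen article
```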
We invited people from all over the Internet to participate in using GroupLens, and studied the effect of the system on users. Users resisted our early attempts to establish multi-dimensional rating schemes, including characteristics such as quality of the writing and suitability of the topic for the newsgroup. Rating on multiple dimensions was too much work. We changed to single-dimension ratings, with the dimension being "What score would you have liked GroupLens to predict for you for this article?"
We found that users did change behavior in response to the recommendations, reading a much higher percentage of the articles that GroupLens predicted they would like than of either randomly selected articles or articles GroupLens predicted they would not like. However, there were many articles for which GroupLens was unable to provide ratings, because even with two to three hundred users, there were simply too many articles in the six newsgroups we were studying. A greater density of ratings by article would have improved the usability of the system for most users. The low ratings density was compounded by the first-rater problem, which is the problem that a pure collaborative filtering system cannot possibly make recommendations to the first person that reads each article. One effect of these two problems is that some beginning users of the system saw little value from GroupLens initially, and hence never developed the habit of contributing ratings, though they continued to use GroupLens-enabled news readers.

¹ GroupLens™ is a trademark of Net Perceptions, Inc., which develops and markets the GroupLens Recommendation Engine. Net Perceptions allows the University of Minnesota to use the name "GroupLens Research" for continuity. The ideas and opinions expressed in this paper are those of the authors and do not represent opinions of Net Perceptions, Inc.
Because most users did not like most articles, and because GroupLens was effective at identifying articles users would like, users requested the ability to scan a newsgroup for the articles that were predicted to be of high interest to them. This led to our exploring a different style of interface to a collaborative filtering system, the Top-N interface. Rather than predicting a score for each article, a Top-N interface greedily seeks articles that are likely to have high scores for an individual user, and recommends those articles to that user. Eventually, such an interface might be able to present each of us with a list of the 20-30 most interesting articles for us from all of Usenet each morning.
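As a sketch of how a Top-N interface differs from per-article prediction, the fragment below (reusing the toy `ratings` and hypothetical `predict` from the earlier example) ranks only unread articles and keeps the highest-scoring few, rather than attaching a score to every article for display.

```python
import heapq

def top_n(user, candidate_articles, n=20):
    """Return the n unread articles with the highest predicted scores."""
    unread = [a for a in candidate_articles if a not in ratings[user]]
    return heapq.nlargest(n, unread, key=lambda a: predict(user, a))

print(top_n("ann", ["a", "b", "c", "d"], n=2))
```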
Our key lesson learned was that a very high volume, low quality system like Usenet would require a very large number of users for collaborative filtering to be successful. For our research purposes, we needed a lower volume, higher density testbed. Our colleagues from Digital Equipment Corporation were closing down their research system on movie recommendations, and offered us the data to jump-start a similar system using GroupLens. We launched our system in the summer of 1997, and have been running it since at www.movielens.umn.edu. MovieLens is entirely web-based, and has several thousand regular users. Users rate movies, and MovieLens recommends other movies to them.
Over the past six years of research, we have learned that people are hungry for effective tools for information filtering, and that collaborative filtering is an exciting complement to existing filtering systems. Users value both the taste-based recommendations and the sense of community they get by participating in a group filtering process. However, there are many open research problems still in collaborative filtering. Below we discuss our early results on some of these problems, and outline the remaining problems we feel to be most important to the evolution of the field of collaborative filtering.
CURRENT RESEARCH RESULTS
Our recent research has focused on improving the quality and efficiency of collaborative filtering systems. We have taken a broad approach, seeking solutions that improve the efficiency, accuracy, and coverage of the system. Specifically, we've examined partitioning users and items, incorporating filtering agents into the collaborative filtering framework, and using existing data sets to start up a collaborative filtering recommendation system.
Partitioning Users and Items
Both performance and accuracy concerns led us to explore the use of partitioning. If user tastes are more consistent within a partition of items, and therefore user agreement is more consistent, then partitioning the items in the system may yield more accurate recommendations. Even if the increased accuracy is offset by the smaller number of items available to establish user correlations, partitioning may be valuable because it can help scale the performance of the system; each partition can be run in parallel on a separate server.
To explore the potential of item partitioning, we considered three partitioning strategies for MovieLens: random partitions, partitions by movie genre, and partitions generated algorithmically by clustering based on ratings. Clustering-based partitions produced a slight loss in prediction accuracy as partitions grew smaller, but showed promise for a reasonable trade-off between performance and accuracy. Movie genre partitions yielded less accurate recommendations than cluster-based ones (though some genres were much more accurate, and others much less so). Random partitions were slightly worse still. The value of item partitions clearly depends on the domain of the recommendation system and the density of ratings within and across potential partitions (our earlier Usenet work found that mixing widely different newsgroups together reduced accuracy). One advantage of the clustering result is that it may be more broadly applicable in domains where items lack obvious attributes for partitioning.
We also looked at the value of user partitioning, starting with the extreme case of pre-computed symmetric neighborhoods based on our clustering algorithm; these were small partitions of about 200 users. If symmetric neighborhoods yield good results, time per recommendation can be reduced dramatically, since substantial per-neighborhood computation can be performed incrementally and amortized across the neighbors. We found that the accuracy of recommendations was almost as good as using the full data set, but that the coverage (i.e., the number of movies for which we could compute a recommendation) fell by 14%. To restore coverage we introduced a two-level hierarchy of users.
Users from each other neighborhood were collapsed into a single composite user. Each neighborhood then had all users represented: similar users were represented at full resolution, and more distant users were represented at the much lower resolution of one composite user per neighborhood. This restored full coverage, and the quality of predictions was degraded only slightly, by about 1% from the unpartitioned case. We are continuing to explore these hierarchical approaches.
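The paper does not specify how a composite user is formed; a natural reading is that each outside neighborhood is summarized by its mean rating per item. The sketch below implements that assumption; the function name `composite_user` and the averaging rule are ours.

```python
def composite_user(neighborhood):
    """Collapse a list of per-user rating dicts into one mean-rating profile.

    Assumption (ours): the composite user rates an item with the mean of
    the ratings given by the neighborhood members who rated that item.
    """
    totals, counts = {}, {}
    for user_ratings in neighborhood:
        for item, score in user_ratings.items():
            totals[item] = totals.get(item, 0) + score
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}

# Each distant neighborhood becomes a single row in the ratings matrix,
# restoring coverage while keeping the matrix small.
print(composite_user([{"a": 5, "b": 3}, {"a": 3, "c": 4}]))
# -> {'a': 4.0, 'b': 3.0, 'c': 4.0}
```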
Filterbots
One problem with pure collaborative filtering is that users cannot receive recommendations for an item until enough other users have rated it. Content-based information filtering approaches, by contrast, avoid this problem by establishing profiles that can be used to evaluate items (e.g., keyword preferences). To combine the best of both approaches, we developed filterbots--rating agents that use content information to generate ratings systematically. These ratings are entered into the recommendation system by treating the filterbots as additional users. This approach has the benefit of allowing us to use simple-minded or controversial filterbots; if a user agrees with a particular filterbot, that filterbot becomes part of the user's neighborhood and gains influence in recommendations. If a user does not agree with the filterbot, it does not become part of that user's neighborhood and is therefore ignored.
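The mechanism is simply the injection of machine-generated ratings as ordinary user rows; everything downstream (correlation, neighborhood formation) is unchanged. Here is a minimal sketch of one filterbot the paper names, rating articles by the percentage of included (quoted) text, extending the toy `ratings` table from the first example. The linear mapping to a 1-5 scale is our assumption, not a detail from the paper.

```python
def included_text_bot(articles):
    """Rate each article 1-5 by how little of it is quoted ('>') text.

    Mostly-original articles score high; mostly-quoted articles score low.
    The linear 1-5 mapping is our assumption, not from the paper.
    """
    scores = {}
    for article_id, body in articles.items():
        lines = body.splitlines() or [""]
        quoted = sum(1 for line in lines if line.lstrip().startswith(">"))
        fraction_original = 1 - quoted / len(lines)
        scores[article_id] = round(1 + 4 * fraction_original)
    return scores

# The bot's output is inserted as if it were another user's rating row.
ratings["included-text-bot"] = included_text_bot({
    "d": "I agree.\n> long quoted passage\n> more quoting",
})
```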
To test this concept, we created several simple-minded filterbots for Usenet news. We found that a spell-checking filterbot not only increased coverage dramatically (as much as 514%), but also increased accuracy as much as 74%. In rec.humor, a notoriously high-noise group, all three of our simple filterbots (spell checker, percentage of included text, and message length) improved coverage and quality. In other newsgroups, some filterbots helped while others did not. Fortunately, the cost of filterbots is quite low, particularly since simple ones appear to have significant value. We plan to continue exploring filterbots, looking both at simple content filtering algorithms and at learning agents.
Jump-Starting a Recommendation Engine
Collaborative filtering systems face a start-up problem: until a critical mass of ratings has been collected, there is not enough data to compute recommendations. Accordingly, early users receive little value for their contribution. In our MovieLens system we had the good fortune of starting our system seeded with a database of over 2.8 million ratings from the earlier EachMovie recommender system. For privacy reasons the database we received from EachMovie had only anonymous users; although we could not associate these users with our own, they could still serve as recommenders for our users. We call this "dead data."
We took advantage of this rare opportunity to evaluate the experience of new users in systems with and without the dead data. We retrospectively evaluated the recommendation accuracy, coverage, and user satisfaction for early users of EachMovie and MovieLens. For our accuracy and coverage experiments, we held the recommendation algorithm constant, and found that the jump-started case (MovieLens) had better coverage (nearly 100%, as compared with 89%) and higher accuracy (increases as high as 19%, depending on the metric used).
To assess user satisfaction, we retrospectively compared user retention and participation in our current MovieLens system with that of the early EachMovie system. By looking at the session, rating, and overall length of active use of corresponding early EachMovie and MovieLens users (all of which could be reconstructed from logs), we were able to measure indicators of user satisfaction. We found that early MovieLens users were more active than early EachMovie users in all categories, with dramatic increases in the number of ratings and number of sessions. Accordingly, it appears that the start-up problem is a real one--user retention and participation improve when users receive value--and using historical or "dead" data may be a useful technique for improving start-up.
WHAT'S NEXT: A RESEARCH AGENDA
Based on our prior work, we've identified four broad problem areas that are particularly critical to the success of recommender systems. We discuss these in general, highlighting work known to be underway, but also presenting open questions that are ripe for research.
Supporting Users and Decision-Making
Early work in recommender systems focused on the technology of making recommendations. Papers cited measures of system accuracy such as mean absolute error, measures of throughput and latency, and occasionally a general metric indicating that people used the system, or perhaps that they said they liked it. Now that the technological feasibility of recommender systems is well established, we must face the challenge of designing and evaluating systems to support users and their decision-making processes.

While there are many factors that affect the decision-making value of a recommendation system for users, three critical issues have arisen in each system we've studied: accuracy, confidence, and user interface.
Accuracy is the measure of how closely the recommendations generated by the system predict the actual preferences of the user. Measurement of accuracy is itself a challenging issue that we discuss below. However, for any sensible definition of accuracy, recommender systems are still far from perfect. Both Maes' work on Ringo and our own work suggest that today's pure collaborative filtering systems typically achieve at best an accuracy of plus-or-minus one on a seven-point scale. Further research is needed to determine the theoretical limit of accuracy, based on user rate/re-rate differences and empirically determined variances. Then, significant work is needed on a wide range of approaches to improve accuracy for user tasks. These approaches include those discussed below and special filtering models tuned for precision and for recall.
Confidence is a measure of how certain the recommendation system is of its recommendation. While statistical confidence measures often are expressed as confidence intervals or expected distributions, current recommendation systems generally provide no more than a simple "high, medium, or low" confidence score. Part of the difficulty with expressing confidence as an interval, distribution, or variance is the complexity of the statistics underlying collaborative filtering. Unlike dense analytic techniques such as multiple regression, collaborative filtering lacks a well-understood measure of error or variance. Sources of error include the user's own variance in rating, the imperfect matching of neighbors, the degree to which past agreement really does predict future agreement, the number of items rated by the user, the number of items in common with each neighbor, rounding effects, and many others.
At the same time, measures of confidence are critical to users trying to determine whether to rely upon a recommendation. Without confidence measures, it is extremely difficult to provide recommendations in situations where users are risk averse. Perhaps even worse is the poor reputation that a recommendation engine will receive if it delivers low-confidence recommendations. Accordingly, a key research priority is the development of computable and usable confidence measures. When algorithms permit, analytic solutions are desirable, but we are also investigating empirical confidence measures that can be derived from and applied to an existing system.
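As one illustration of an empirical confidence measure of the kind described here, the sketch below scores a prediction by how many neighbors contributed to it and how tightly they agree. The specific formula and the `min_neighbors` parameter are our assumptions, not a measure from the paper.

```python
import statistics

def empirical_confidence(neighbor_ratings, min_neighbors=5):
    """Score confidence in [0, 1] from neighbor support and agreement.

    Assumption (ours): confidence grows with the number of contributing
    neighbors and shrinks with the spread of their ratings on a 1-5 scale.
    """
    n = len(neighbor_ratings)
    if n == 0:
        return 0.0
    support = min(n / min_neighbors, 1.0)
    spread = statistics.pstdev(neighbor_ratings) if n > 1 else 2.0
    agreement = max(0.0, 1.0 - spread / 2.0)  # spread of 2+ stars -> 0
    return support * agreement

print(empirical_confidence([4, 5, 4, 4, 5]))  # high support, tight agreement
```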
User interface issues in collaborative filtering span a range of questions, including:

- When and how to use multi-dimensional ratings
- When and how to use implicit ratings
- How a recommendation should be displayed
Multi-dimensional ratings (and therefore predictions) seem natural in certain applications. Restaurants are often evaluated separately on food, service, and value. Tasks are often rated for importance and urgency. Today's recommendation engines can accept these dimensions separately, but further research is needed on cross-dimension correlation and recommendation.
Implicit ratings are observational measures of a user's preference for an item. For example, in Usenet news we found that time spent reading an article is a good measure of preference for that article. Similarly, listening to music, viewing art, and purchasing consumer goods are all good indicators of preference. Today's systems lack automated means for evaluating and calibrating implicit ratings; further research would be valuable.
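To make the calibration problem concrete, here is a toy sketch that converts observed reading times into the 1-5 scale used elsewhere in this paper. The per-user percentile mapping is our assumption of one plausible calibration, not a method from the paper.

```python
def calibrate_read_times(read_times):
    """Map one user's article reading times (seconds) onto a 1-5 scale.

    Assumption (ours): rank each article's reading time within this user's
    own history, so fast and slow readers are calibrated individually.
    """
    ordered = sorted(read_times.values())
    scores = {}
    for article_id, t in read_times.items():
        percentile = ordered.index(t) / max(len(ordered) - 1, 1)
        scores[article_id] = round(1 + 4 * percentile)
    return scores

print(calibrate_read_times({"a": 5, "b": 40, "c": 90, "d": 12, "e": 300}))
# -> {'a': 1, 'b': 3, 'c': 4, 'd': 2, 'e': 5}
```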
There are many ways to display recommendations, ranging from simply listing recommended items (without order), to marking recommended items in a list of items, to displaying a predicted rating for items. Many research questions need to be answered through real user studies. In one study we've conducted, we saw that the correctness of user decision making is directly affected by the type of display used. In other work, we are examining the question of whether the expected value of a predicted rating distribution is more or less valuable than the probability of the rating exceeding a "worthwhile" cutoff. For example, is a user more interested that a movie is likely to be three-and-a-half stars, or that it has a 40% chance of being four or five stars? Many similar questions remain unanswered.
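The two displays contrasted above are easy to compute from the same predicted rating distribution; the sketch below shows both, using a made-up distribution of our own for illustration.

```python
# Predicted probability of each star rating for one movie (toy numbers, ours).
distribution = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.35, 5: 0.25}

# Display option 1: the expected value of the predicted rating.
expected = sum(stars * p for stars, p in distribution.items())

# Display option 2: the probability the rating meets a "worthwhile" cutoff.
p_worthwhile = sum(p for stars, p in distribution.items() if stars >= 4)

print(f"expected rating: {expected:.1f} stars")    # about 3.6 stars
print(f"chance of 4-5 stars: {p_worthwhile:.0%}")  # 60%
```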
Beyond Collaborative Filtering

The second major research issue for recommender systems is integrating technologies other than collaborative filtering into recommendation systems. Content analysis, demographics, data mining, machine learning, and other approaches to learning from data each have advantages that can help offset some of the fundamental limitations of collaborative filtering (e.g., sparsity and the early rater problem). Some of the interesting open research questions include:

- How to integrate content analysis techniques from information retrieval, information filtering, and agents research into recommendation systems. Our filterbot work is a first step in this direction, as is MIT's collaborating agent work and Stanford's Fab system. More research is needed to discover which techniques work for which applications.
- How to take advantage of user demographics and rule-based knowledge in recommendation systems.
- How to integrate the power of data mining with the real-time capabilities of recommender systems. Particularly interesting questions include identifying temporal trends in preferences (e.g., people who prefer A at time t are more likely to prefer B at time t+1).
- How to take advantage of machine learning techniques in recommender systems.

In all of these cases, a key question will be whether one technology can be incorporated into the framework of another, or whether a new architecture is needed to merge the two types of knowledge.
Scale, Sparsity, and Algorithmics
The fundamental problem of producing accurate recommendations efficiently from a sparse set of ratings is inherent to collaborative filtering. Indeed, if the ratings set were dense, there would be little value in producing recommendations. Accordingly, there is still great need for continued research into fundamental issues in performance, scalability, and applicability.
One particularly interesting research area that we, along with others, are actively investigating is techniques for reducing the computational complexity of recommendation. As discussed above, the complexity of recommendation generally grows with the size of the database, which is the product of the number of users and the number of items. Neighborhoods, partitioning, and factor analysis all attempt to reduce this size by limiting the elements considered along the user dimension, the item dimension, or both.
Neighborhood techniques use only a subset of users in the computation of recommendations. Many variants have been proposed and implemented; research is needed to assess the varying merits of symmetric vs. asymmetric neighborhoods, on-demand vs. longer-term neighborhoods, threshold vs. size limits, etc. Partitioning is discussed above; factor analysis is a different approach that tries to decompose users or items into a combination of shared "taste" vectors. Each of these techniques has promise for reducing storage and computation.
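To illustrate the shared "taste" vector idea, the sketch below takes a truncated SVD of a small ratings matrix, one common way to realize the decomposition described here. The paper does not commit to a specific factorization; the rank, the toy data, the crude use of 0 for "unrated," and the use of numpy are our choices.

```python
import numpy as np

# Toy user-by-item ratings matrix; 0 marks "unrated" (ours, for illustration).
R = np.array([[5., 3., 4., 0.],
              [4., 2., 5., 4.],
              [1., 5., 0., 2.]])

# Decompose into k shared "taste" vectors via truncated SVD.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # each user: weights on k taste vectors
item_factors = Vt[:k, :]          # each item: loadings on the same vectors

# Reconstructing R from the factors gives scores for unrated cells, while
# storage drops from users*items to k*(users + items).
R_hat = user_factors @ item_factors
print(R_hat.round(1))
```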
A second critical issue is to continue to address the challenge of ratings sparsity, particularly for applications where few users should ever rate an item. Clustering, factor analysis, and hierarchical techniques that combine individual items with clusters or factors can provide one set of solutions. Integration with other technologies will provide others.
Finally, there is still significant work needed on the central algorithmics of collaborative filtering to improve performance and accuracy. The use of Pearson correlations for neighbor selection and weighting is common in recommendation systems, yet many alternatives may be more suitable, depending on the distribution of ratings. Similarly, the results from Ringo, along with some of our own, suggest that the common Pearson algorithm overvalues neighbors with low correlations. Further work, particularly including empirical work, is needed to evaluate candidate algorithms.
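One natural remedy for the overvaluing problem is significance weighting: shrink any correlation computed over few co-rated items. The sketch below applies such a weight on top of the hypothetical `pearson` and `ratings` from the first example; the cutoff value and the linear form are our assumptions, not an algorithm from the paper.

```python
def significance_weighted_pearson(u, v, cutoff=50):
    """Pearson correlation devalued when few items were co-rated.

    Assumption (ours): scale the correlation by n/cutoff when the number
    of co-rated items n is small, so that agreements based on little data
    carry less weight in neighbor selection.
    """
    n = len(set(ratings[u]) & set(ratings[v]))
    weight = min(n / cutoff, 1.0)
    return weight * pearson(u, v)
```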
Metrics and Benchmarks
Sadly, evaluation is one of the weakest parts of current recommender system research. Systems are not compared against each other directly, and published results use a variety of metrics, often incomparable ones. Both research and collaboration are needed to establish an accepted set of metrics and benchmarks for evaluating recommendation systems. Three areas are particularly important and promising.
Accuracy metrics. There are nearly a dozen different metrics that have been used to measure recommendation system accuracy. Statistical measures of error include the mean absolute error between predicted and actual ratings, the root mean squared error (to more heavily weigh large errors), and the correlation between predicted and actual ratings. Other metrics attempt to assess the prevalence of large errors. Reversal measures tally the frequency with which "embarrassingly bad" recommendations are made.
A different set of metrics attempts to assess the effectiveness of the recommendation engine in filtering items. Precision and recall statistics, borrowed from information retrieval, together with receiver operating characteristic (ROC) measurements from signal processing, discount errors that do not affect usage and more heavily weigh errors near the decision point. For example, the difference between 1.0 and 2.5 on a five-point scale may be unimportant, since both are rejected, while the difference between 3.0 and 4.5 is extremely relevant. Finally, ordering metrics assess the effectiveness of top-N algorithms in identifying true "top" items. The different metrics in each category should be evaluated, with the most useful ones identified as expected for publication and comparison.
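As an illustration of why decision-point metrics differ from error averages, the sketch below computes precision and recall after thresholding predicted and actual ratings at a "worthwhile" cutoff. The 4-star cutoff and the toy data are ours.

```python
def precision_recall(pairs, cutoff=4.0):
    """Precision and recall after thresholding ratings at `cutoff`.

    An error far below the cutoff never changes the accept/reject
    decision, so, unlike MAE or RMSE, these metrics ignore it.
    """
    recommended = [(p, a) for p, a in pairs if p >= cutoff]
    relevant = [(p, a) for p, a in pairs if a >= cutoff]
    hits = [(p, a) for p, a in recommended if a >= cutoff]
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall([(4.2, 5), (3.1, 3), (2.0, 4), (4.8, 4)]))
# -> (1.0, 0.666...): every recommendation was good, but one good item was missed
```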
Coverage combined with accuracy. Accuracy metrics alone are not useful for many types of comparison. By setting the confidence threshold high, most systems can increase accuracy at the expense of coverage. Combined metrics are needed to allow meaningful evaluation of coverage/accuracy trade-offs. We are working on a combined decision-support metric, but others are needed as well.
Full-system benchmarks. The greatest need right now is for corpora and benchmarks that can be used for comparison. In the future, these should be integrated with economic models to evaluate, in monetary terms, the value added by a recommender system.
ACKNOWLEDGMENTS

We would like to acknowledge the financial support of the National Science Foundation and Net Perceptions, Inc. We also would like to thank the dozens of individuals, mostly students, who have contributed their effort to the GroupLens Research project. Finally, we would like to thank our users, all
FOR ADDITIONAL INFORMATION
Several good sources of bibliographic information already exist in the area of collaborative information filtering and recommender systems. Rather than duplicate that work here, we refer the reader to:
1. The March 1997 issue of Communications of the ACM, edited by Hal Varian and Paul Resnick. In addition to containing articles on several relevant systems, the section introduction and articles contain extensive bibliographic information.

2. The Collaborative Filtering Resources web page, at http://www.sims.berkeley.edu/resources/collab; this page grew out of a March 1996 workshop on collaborative filtering held at Berkeley. It includes pointers to other reference pages.