CN710: Group 11

Data Mining: User-Driven Similarity Matching
Ogi Ogas, Greg Amis, Sai Chaitanya
Core Unifying Ideas Behind Recommendation and Search Systems
The two core ideas which unify all of the algorithms under discussion are:
The user of the algorithm drives the operation of the algorithm.
The algorithm performs some kind of similarity matching.
Collaborative Filtering
Collaborative filtering systems are methods or algorithms which make recommendations (of
products, content, items, or individuals) to a user (usually referred to as an active user)
based upon the past activity of other users.
All collaborative filters use community-based similarity matches.
1. Amazon uses a variety of collaborative filters:
To recommend books based upon the user’s previous book purchases.
To recommend products based upon what the user is currently browsing.
To recommend products based upon an "item-to-item" algorithm, which is basically a
community-based similarity match made across shopping baskets, rather than user
To help users make user-driven trust-based similarity matches based upon product
reviews that other users have made, and the ratings other users have made of those
product reviews.
2. Netflix uses several of the same types of collaborative filters as Amazon:
To suggest movies the user may like based upon the user’s previous movie rentals.
To help users make user-driven trust-based similarity matches based upon movie
reviews that other users have made, and the ratings other users have made of those
movie reviews.
3. The first core reading describes a number of collaborative filtering algorithms.
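As a rough illustration of the "item-to-item" idea mentioned under item 1, the Python sketch below counts how often pairs of items appear together in shopping baskets and suggests the items that most often co-occur with a given item. The basket data and the function names (co_occurrence_counts, recommend_for_item) are hypothetical, and this is not Amazon's actual algorithm; it is only a minimal sketch of a similarity match made across baskets rather than across users.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_occurrence_counts(baskets):
    """Count, for each item, how often every other item shares a basket with it."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # Use a set so a duplicated item in one basket is counted once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend_for_item(counts, item, n=3):
    """Return up to n items most often bought together with `item`."""
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]

# Hypothetical shopping baskets (made-up data for illustration only).
baskets = [
    ["book_a", "book_b", "lamp"],
    ["book_a", "book_b"],
    ["book_a", "lamp"],
    ["book_b", "pen"],
]
counts = co_occurrence_counts(baskets)
print(recommend_for_item(counts, "book_a"))
```

Note that the active user's identity never appears here: the match is driven entirely by the item currently being viewed and by what the community has bought alongside it.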
Information Retrieval (Searches)
Information retrieval methods rely on the user to provide search information (e.g., search
terms, a sample URL, a sample picture, a personal profile) which the method uses to
generate similarity matches. Google and CiteSeer are two examples of information retrieval
systems.
Information retrieval methods may use either community-based similarity
matches or content-based similarity matches.
These systems use one or both of two types of user data:
Explicit data are user preferences that were solicited directly from the user. For
example, on the web site for Microsoft’s Developers Network (MSDN), each article
has a form at the bottom of the page asking users to vote on how useful the page is.
Implicit data are user preferences that were inferred from the user’s actions. For
example, when you buy something from Amazon, their recommendation engine
assumes that the product was useful to you, so it will recommend other products
that “similar users” also bought.
There is some overlap between information retrieval methods and collaborative filtering
methods. However, the greatest distinction is that information retrieval methods usually
require the user to explicitly specify some search information, while collaborative filtering
methods are generally automatic, and simply use implicit user behavior. Nevertheless,
sometimes the “information retrieval” label is applied to automatic methods, and sometimes
“collaborative filtering” is applied to explicit methods.
User-Driven Algorithms
All collaborative filters and search systems require the participation of an active user. The
activity (explicit or implicit) of the user drives the operation of these algorithms: most
collaborative filters use the implicit activity of the user to generate matches, while most
searches require the user to explicitly specify some form of search criteria.
(An example of a non-user driven recommendation algorithm would be one that simply
reported the most popular items, without regard to the user. For example, the NY Times
Bestseller list is a non-user driven algorithm.)
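For contrast, a non-user-driven recommender can be sketched in a few lines of Python. The function name most_popular and the purchase log are made up for illustration. The point is only that the active user never enters the computation, so every visitor receives the same list.

```python
from collections import Counter

def most_popular(purchases, n=2):
    """Return the n most purchased items, ignoring who is asking."""
    counts = Counter(item for _, item in purchases)
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

# Hypothetical (user, item) purchase log.
purchases = [
    ("u1", "x"), ("u2", "x"), ("u3", "y"), ("u1", "z"), ("u2", "y"),
]
print(most_popular(purchases))
```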
Similarity Matches
All collaborative filters and searches perform some kind of similarity match between the
user-driven input and the database. The similarity match is the engine of the algorithm, and
is where both theory and engineering combine to produce an effective (or ineffective)
There are two broad classes of similarity matches:
 Community-based Matches
These algorithms rely on the activity of other users to generate a match. Some methods
employed by these algorithms include:
Preference: The user’s previous activity (such as movie rentals or book purchases) is used
to match the past activity of other users. Yahoo's LAUNCH and MusicMatch's Radio services
both rely on preference matches.
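A preference match of this kind can be sketched as follows, assuming each user's past activity is stored as a sparse dictionary of votes and that cosine similarity is used to compare users. Cosine similarity is only one common choice of measure, the data and function names here are hypothetical, and nothing in this sketch is claimed to reflect how LAUNCH or MusicMatch actually work.

```python
import math

def cosine_similarity(votes_a, votes_b):
    """Cosine similarity between two sparse vote vectors stored as dicts."""
    shared = set(votes_a) & set(votes_b)
    dot = sum(votes_a[i] * votes_b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in votes_a.values()))
    norm_b = math.sqrt(sum(v * v for v in votes_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(active, others, n=2):
    """Score items the active user has not seen by similarity-weighted votes."""
    scores = {}
    for votes in others.values():
        sim = cosine_similarity(active, votes)
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for item, vote in votes.items():
            if item not in active:
                scores[item] = scores.get(item, 0.0) + sim * vote
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

# Hypothetical vote data: item -> rating on a 1-5 scale.
active = {"movie_a": 5, "movie_b": 4}
others = {
    "user_1": {"movie_a": 5, "movie_b": 5, "movie_c": 4},
    "user_2": {"movie_d": 5, "movie_e": 3},
}
print(recommend(active, others))
```

In this toy data only user_1 overlaps with the active user, so only user_1's unseen item is suggested. With real, very sparse vote vectors, many user pairs share no items at all, which is one reason sparsity is a practical concern for this formulation.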
Trust: Matches are filtered according to the respect or level of trust afforded other users.
For example, movie reviews by Roger Ebert might be given higher weight than reviews by
high school students. (Or, the opposite.) Amazon's and Netflix's “Was this review helpful?”
feature is an example of this.
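One way to picture a trust filter is as a weighted average in which each review's rating counts in proportion to how helpful other users found that review. The sketch below is an assumption-laden illustration: the smoothing formula (helpful + 1) / (total + 2), the data, and the function name trust_weighted_rating are all hypothetical choices, not a description of what Amazon or Netflix do.

```python
def trust_weighted_rating(reviews):
    """Average ratings, weighting each review by its smoothed helpfulness.

    `reviews` is a list of (rating, helpful_votes, total_votes) tuples.
    Returns None when there are no reviews.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for rating, helpful, total in reviews:
        # Assumed smoothing: a review with no votes gets a neutral weight of 0.5.
        weight = (helpful + 1) / (total + 2)
        weighted_sum += weight * rating
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight

# Hypothetical reviews: one trusted 5-star review, one distrusted 1-star review.
reviews = [(5, 9, 10), (1, 0, 8)]
print(trust_weighted_rating(reviews))
```

The unweighted mean of these two ratings is 3.0, while the trust-weighted value is pulled toward the review that the community judged helpful.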
 Content-based Matches
These algorithms rely on user-independent information to perform their matches.
Content-based matches are much more difficult to execute because of the difficulty in finding a
comprehensive feature set. However, in some cases it may not be possible to perform
collaborative filtering because of very sparse consumption (for example, the matrimonial
website eHarmony).
Probabilistic matches: These use information about the frequency of items or the co-occurrence
of items. Searches on eDonkey and Kazaa do this when finding files to download by looking
for identical filenames, though other header information might differ.
Feature-based matches (basic classifiers): These use the similarity between the features in the
search criteria and the features of the searched items. Examples include Windows XP Search and
the Pandora music recommendation system. eHarmony markets itself as 'the only relationship site
on the web that creates compatible matches based on 29 dimensions scientifically proven to
predict happier, healthier relationships'. These 29 dimensions are grouped into what are called
Core Traits and Vital Attributes. Self Concept, Obstreperousness, Conflict Resolution, and
Communicative Style are some example dimensions. The problem of finding unbiased
values for these features is quite apparent.
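A feature-based match can be sketched as a nearest-neighbor lookup over feature vectors. In the Python sketch below, the three numeric features per item, the catalog, and the function names are all invented for illustration, and Euclidean distance is just one assumed choice of similarity measure. Real systems would need the comprehensive, unbiased feature set that the text above identifies as the hard part.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_items(query, catalog, n=2):
    """Return the names of the n catalog items closest to the query features."""
    ranked = sorted(
        catalog.items(),
        key=lambda kv: (euclidean_distance(query, kv[1]), kv[0]),
    )
    return [name for name, _ in ranked[:n]]

# Hypothetical catalog: each item is described by three made-up features in [0, 1].
catalog = {
    "song_a": (0.9, 0.1, 0.3),
    "song_b": (0.8, 0.2, 0.4),
    "song_c": (0.1, 0.9, 0.7),
}
query = (0.9, 0.1, 0.2)  # features of the search criteria
print(nearest_items(query, catalog))
```

No other user's activity is consulted here, which is what distinguishes this from a community-based match.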
Discussion Questions
1. How much “theory” can really be applied to these problems, and how much is
intimately problem-specific, system-specific, environment-specific engineering hacks?
2. In a similar vein, since technology is changing so very rapidly, what are the core
ideas and skills that one must know (if there are any) in order to stay relevant in data
mining?
3. What are some other ways to evaluate “trust” in community-based systems?
4. 'Trust' in the recommendation literature usually implies our judgment of other
users' tastes, not their intentions. It is also possible that ratings are engineered to
achieve an artificial similarity/correlation between products to induce buyers. Consider
the possibility that the movie B recommendation by Netflix is the result of a few users
'liking' it and some popular movie A, thus deliberately linking the two. Other users who
have documented their approval of the popular movie A might now have movie B
recommended to them. Is it possible to weed out such mis-users?
5. One well known problem with Google's PageRank algorithm is that it is susceptible
to "link farming." Webmasters can create large numbers of fake web pages that all link
to their site, giving their site an artificially high PageRank and moving their site higher
on the search result list. Is the algorithm proposed in Dean & Henzinger's paper
vulnerable to similar abuses? If so, what could be changed to prevent such abuses?
6. The Breese et al. paper on collaborative filtering structures the problem in a very
popular way. A user is profiled as a vector of “votes” indicating (via implicit or explicit
mechanisms) what products the user prefers, and recommendations are drawn by
comparing this vector to the vectors of other users. Given that these vectors are very
sparse (e.g., the median number of votes for an EachMovie user is 26 of 1,623
movies), is this the best way to formulate the problem?
7. How reliable do you think implicit preference data are? Just because someone
buys a product doesn’t necessarily mean that they are interested in that product or that
they want to buy similar items. Similarly, just because a user clicks on a link doesn’t
mean that the page for that link was actually useful in the end. Should we use implicit
data at all?
8. Some systems use explicit preference data, e.g., a user indicates their preference
by explicitly voting—click “this was helpful” or “this wasn’t what I was looking for.” Will a
dataset based on explicit voting necessarily provide a useful basis for making
recommendations? Why or why not?
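As background for question 5, the link-farming effect can be reproduced on a toy graph with a simplified PageRank computed by power iteration. This is a minimal sketch under stated assumptions: the damping factor of 0.85, the fixed iteration count, the tiny link graphs, and the page names are all hypothetical, and the code is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outlinks."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                targets = pages  # a page with no outlinks spreads its rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web: "spam" has no inbound links at first.
links_honest = {"a": ["b"], "b": ["a"], "spam": ["a"]}

# The same web after the webmaster adds five fake pages that all link to "spam".
links_farmed = dict(links_honest)
for i in range(5):
    links_farmed[f"farm_{i}"] = ["spam"]

without_farm = pagerank(links_honest)
with_farm = pagerank(links_farmed)
print(without_farm["spam"], with_farm["spam"])
```

In this toy graph the score of "spam" rises once the farm pages are added, even though none of the fake pages is linked to by any legitimate page.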