CN710: Group 11
Data Mining: User-Driven Similarity Matching
Ogi Ogas, Greg Amis, Sai Chaitanya

Core Unifying Ideas Behind Recommendation and Search Systems

Two core ideas unify all of the algorithms under discussion:

1. The user of the algorithm drives the operation of the algorithm.
2. The algorithm performs some kind of similarity matching.

Collaborative Filtering

Collaborative filtering systems are methods or algorithms that make recommendations (of products, content, items, or individuals) to a user (usually referred to as the active user) based upon the past activity of other users. All collaborative filters use community-based similarity matches.

EXAMPLES:

1. Amazon uses a variety of collaborative filters:
   - To recommend books based upon the user's previous book purchases.
   - To recommend products based upon what the user is currently browsing.
   - To recommend products based upon an "item-to-item" algorithm, which is basically a community-based similarity match made across shopping baskets, rather than user profiles.
   - To help users make user-driven, trust-based similarity matches based upon product reviews that other users have written, and the ratings other users have given those reviews.

2. Netflix uses several of the same types of collaborative filters as Amazon:
   - To suggest movies the user may like based upon the user's previous movie rentals.
   - To help users make user-driven, trust-based similarity matches based upon movie reviews that other users have written, and the ratings other users have given those reviews.

3. The first core reading describes a number of collaborative filtering algorithms.

Information Retrieval (Searches)

Information retrieval methods rely on the user to provide search information (e.g., search terms, a sample URL, a sample picture, a personal profile), which the method uses to generate similarity matches. Google and CiteSeer are two examples of information retrieval systems.
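The "item-to-item" matching mentioned under Collaborative Filtering above can be sketched as a simple co-occurrence count over shopping baskets. This is only a toy illustration with invented data, not Amazon's actual algorithm:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical shopping baskets: each set is one customer's purchase.
baskets = [
    {"book_a", "book_b", "dvd_x"},
    {"book_a", "book_b"},
    {"book_a", "book_b"},
    {"book_b", "dvd_x"},
    {"book_a", "dvd_x"},
]

# Count how often each pair of items appears in the same basket.
co_counts = defaultdict(int)
for basket in baskets:
    for i, j in combinations(sorted(basket), 2):
        co_counts[(i, j)] += 1

def recommend(item, k=2):
    """Return the items most often co-purchased with `item`."""
    scores = defaultdict(int)
    for (i, j), c in co_counts.items():
        if i == item:
            scores[j] += c
        elif j == item:
            scores[i] += c
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("book_a"))  # -> ['book_b', 'dvd_x']
```

Note that the match is driven entirely by community activity (other users' baskets), with no features of the items themselves.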
Information retrieval methods may use either community-based similarity matches or content-based similarity matches. These systems use one or both of two types of user data:

- Explicit data are user preferences that were solicited directly from the user. For example, on the web site for the Microsoft Developer Network (MSDN), each article has a form at the bottom of the page asking users to vote on how useful the page is.

- Implicit data are user preferences that were inferred from the user's actions. For example, when you buy something from Amazon, its recommendation engine assumes that the product was useful to you, so it will recommend other products that "similar users" also bought.

There is some overlap between information retrieval methods and collaborative filtering methods. However, the greatest distinction is that information retrieval methods usually require the user to explicitly specify some search information, while collaborative filtering methods are generally automatic and simply use implicit user behavior. Nevertheless, sometimes the "information retrieval" label is applied to automatic methods, and sometimes "collaborative filtering" is applied to explicit methods.

User-Driven Algorithms

All collaborative filters and search systems require the participation of an active user. The activity (explicit or implicit) of the user drives the operation of these algorithms: most collaborative filters use the implicit activity of the user to generate matches, while most searches require the user to explicitly specify some form of search criteria. (An example of a non-user-driven recommendation algorithm would be one that simply reported the most popular items, without regard to the user. The NY Times Bestseller list is a non-user-driven algorithm.)

Similarity Matches

All collaborative filters and searches perform some kind of similarity match between the user-driven input and the database.
The similarity match is the engine of the algorithm, and it is where theory and engineering combine to produce an effective (or ineffective) algorithm. There are two broad classes of similarity matches:

Community-based Matches

These algorithms rely on the activity of other users to generate a match. Some methods employed by these algorithms include:

- Preference: The user's previous activity (such as movie rentals or book purchases) is matched against the past activity of other users. Yahoo's LAUNCH and MusicMatch's Radio services both rely on preference matches.

- Trust: Matches are filtered according to the respect or level of trust afforded to other users. For example, movie reviews by Roger Ebert might be given higher weight than reviews by high school students. (Or the opposite.) Amazon's and Netflix's "Was this review helpful?" feature is an example of this.

Content-based Matches

These algorithms rely on user-independent information to perform their matches. Content-based matches are much more difficult to execute because of the difficulty of finding a comprehensive feature set. However, in some cases collaborative filtering may not be possible because consumption data are very sparse (as on the matrimonial web site eHarmony).

- Probabilistic matches: use information about the frequency of items or the co-occurrence of items. Searches on eDonkey and KaZaA do this when finding files to download, by looking for identical filenames even though other header information might differ.

- Feature-based matches (basic classifiers): use the similarity of features in the search criteria to the searched items. Examples include Windows XP Search and the Pandora music recommendation system (www.pandora.com).

eHarmony markets itself as "the only relationship site on the web that creates compatible matches based on 29 dimensions scientifically proven to predict happier, healthier relationships." These 29 dimensions are grouped into what are called Core Traits and Vital Attributes.
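A feature-based match of the kind described above can be sketched as cosine similarity between feature vectors. The profiles and dimension values below are invented for illustration; they are not eHarmony's or Pandora's actual features:

```python
import math

# Hypothetical profiles: each is a vector over the same hand-picked
# feature dimensions (choosing those dimensions is the hard part).
profiles = {
    "alice": [0.9, 0.1, 0.5],
    "bob":   [0.8, 0.2, 0.4],
    "carol": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine of the angle between two feature vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(name):
    """Return the other profile whose feature vector is most similar."""
    others = [(cosine(profiles[name], vec), other)
              for other, vec in profiles.items() if other != name]
    return max(others)[1]

print(best_match("alice"))  # -> bob
```

Unlike the community-based matches, this computation uses only the items' (or people's) own features, so it works even when no consumption history exists.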
Self Concept, Obstreperousness, Conflict Resolution, and Communicative Style are some example dimensions. The problem of finding unbiased values for these features is quite apparent.

Discussion Questions

1. How much "theory" can really be applied to these problems, and how much is intimately problem-specific, system-specific, environment-specific engineering hacks?

2. In a similar vein, since technology is changing so rapidly, what are the core ideas and skills that one must know (if there are any) in order to stay relevant in data mining?

3. What are some other ways to evaluate "trust" in community-based systems?

4. "Trust" in the recommendation literature usually refers to our judgment of other users' tastes, not their intentions. It is also possible that ratings are engineered to achieve an artificial similarity/correlation between products in order to induce buyers. Consider the possibility that a movie B recommendation by Netflix is the result of a few users "liking" both it and some popular movie A, thus deliberately linking the two. Other users who have documented their approval of the popular movie A might now have movie B recommended to them. Is it possible to weed out such mis-users?

5. One well-known problem with Google's PageRank algorithm is that it is susceptible to "link farming." Webmasters can create large numbers of fake web pages that all link to their site, giving their site an artificially high PageRank and moving it higher on the search result list. Is the algorithm proposed in Dean & Henzinger's paper vulnerable to similar abuses? If so, what could be changed to prevent such abuses?

6. The Breese et al. paper on collaborative filtering structures the problem in a very popular way. A user is profiled as a vector of "votes" indicating (via implicit or explicit mechanisms) which products the user prefers, and recommendations are drawn by comparing this vector to the vectors of other users.
Given that these vectors are very sparse (e.g., the median number of votes for an EachMovie user is 26 out of 1,623 movies), is this the best way to formulate the problem?

7. How reliable do you think implicit preference data are? Just because someone buys a product doesn't necessarily mean that they are interested in that product or that they want to buy similar items. Similarly, just because a user clicks on a link doesn't mean that the page behind that link was actually useful in the end. Should we use implicit data at all?

8. Some systems use explicit preference data; e.g., a user indicates their preference by explicitly voting, clicking "this was helpful" or "this wasn't what I was looking for." Will a dataset based on explicit voting necessarily provide a useful basis for making recommendations? Why or why not?
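The vote-vector formulation discussed in question 6 can be sketched as follows: each user is a sparse mapping from items to votes, and similarity between users is computed (as in memory-based collaborative filtering) with Pearson correlation over the items both users rated. The ratings below are invented toy data:

```python
import math

# Sparse vote vectors: user -> {movie: rating}. Hypothetical data.
votes = {
    "active": {"m1": 5, "m2": 1, "m3": 4},
    "u1":     {"m1": 4, "m2": 2, "m3": 5, "m4": 3},
    "u2":     {"m1": 1, "m2": 5, "m4": 4},
}

def pearson(a, b):
    """Pearson correlation over the movies that both users rated."""
    common = votes[a].keys() & votes[b].keys()
    if len(common) < 2:
        return 0.0  # too few co-rated items to correlate
    xs = [votes[a][m] for m in common]
    ys = [votes[b][m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs)) *
           math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den if den else 0.0

def neighbors(a):
    """Other users, most similar first."""
    return sorted((u for u in votes if u != a),
                  key=lambda u: pearson(a, u), reverse=True)

print(neighbors("active"))  # -> ['u1', 'u2']
```

The sparsity problem raised in question 6 is visible even here: the correlation for each pair rests on only the handful of items both users happened to rate.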