CS 8803 Advanced Internet Application Development
Project Proposal

SELF-LEARNING PERSONALIZED SEARCH

Submitted by: Amogh Budhkar & Swapneel Ambre
Date: 02/12/2007

1. Motivation and Objectives

Personalized search has long been promised as an important next step for increasing relevance. We believe it is one of the directions in which search may head in the future. A search engine should be intelligent enough not only to understand the entered query but also to infer what the user intends to ask. A five-year-old and a fifty-year-old querying for the same term do not expect the same results. The ranking approach currently in use is therefore inadequate. We propose a search engine that logs a user’s searches and clicks and learns over time which topics interest that user, thus narrowing the search to these topics for that particular user. This makes search truly personalized. Our aim is to give users the results they expect.

2. Related work

Several efforts are under way to make search engines personalized. A good example is A9, which keeps track of searches and can retrieve that information so that people can repeat the same searches at a later time. [4] Google has come up with its own subject-based form of personalized search, in which the user sets up a profile and search results can be filtered to the user’s areas of interest. Google’s personalized search reorders search results based on the user’s history of past searches, giving more weight to topics that interest the user. Google maintains search tracking for Google Account holders, which records Web pages, images, videos, music and more. Previously called Personalized Search, it also includes bookmarks, search trends and item recommendations.
[8] Another interesting community-based search comes from Eurekster [5], a search technology company, whose swicki features include personalized social search, “buzzclouds” of recent searches, and ranked results reflecting the expertise and interests of the host and community.

3. Proposed work

The basic idea of our work is that users often have a particular field in mind when they search for a term. A search engine, however, returns all results that are ranked sufficiently high, regardless of the user’s field of interest. The problem, then, is knowing what is relevant for a particular user.

We can learn a user’s areas of interest from what he searches for and which links he then clicks. But a single search and click does not mean the user is really interested in that area. For example, someone who is sick and searches for information on a particular medicine is not necessarily going to stay interested in that medicine, or even in the field of medicine. To overcome this problem, the search engine should store a long history of user searches and clicks in the form of logs. Such logs represent the user’s interests better than any single query.

A single term like “Jordan” is used in multiple contexts with different meanings. The search engine needs to determine what the user means when he enters this term as a query or as part of a query. This can be predicted fairly accurately if the logs are used to infer the user’s areas of interest. It also means the search engine must have some understanding of page content, so that it can categorize pages in order to satisfy user queries.

Our vision of a personalized search engine, then, is one that logs user searches (queries and clickstreams) and learns about the user’s areas of interest over time. When the user enters a query, the search is narrowed down to only those fields in which the user is interested.
The user is presented with the highest-ranked results from the field of interest most relevant to him. Because the search engine is constantly learning, it can adjust to changes of interest over time.

Another feature that could be implemented is that, whenever the user logs in, the search engine displays a list of his most frequently visited pages. This differs from most search engines with a personalized search feature in that bookmarks need not be added manually.

Explanation:

1. Users will be provided with a user interface like that of a normal search engine, through which they can submit a query. The user does not have to bother with any settings, which improves usability. The only thing the user needs to do is log in to his account, if he has not already done so, using his own computer.

2. If we do not have sufficient history for the user, there are two possible approaches:
   i) The search engine searches the entire web, and the results are clustered and presented to the user by field (category). The query and the corresponding fields are logged.
   ii) The search engine keeps track of what other users have clicked for the same query, clusters the results accordingly, and shows them to other users.

3. If the user clicks on a particular link from a particular cluster, this click is also logged. Over time, this log of query and click combinations helps build a user profile that tells us the user’s primary fields of interest.

4. For long-time users, we already know their primary fields of interest. When a query is submitted, the search is performed only on a narrowed-down subset of the entire search space (based on the words in the query and the user’s primary fields of interest). This gives specialized results relevant to the user’s primary interests. These results are then displayed.
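
Steps 2 to 4 above can be sketched in Java, the language planned for the implementation. This is only a minimal sketch under stated assumptions, not the proposed system: the class names Result, UserProfile and PersonalizedSearch, the MIN_CLICKS threshold, and the premise that every result already carries a field label are hypothetical. Simple grouping by field and click counting stand in for the clustering and learning algorithms, which the proposal leaves open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PersonalizedSearch {
    /** Step 2(i): group results by field for users without enough history. */
    static Map<String, List<Result>> clusterByField(List<Result> results) {
        Map<String, List<Result>> clusters = new TreeMap<>();
        for (Result r : results) {
            clusters.computeIfAbsent(r.field, f -> new ArrayList<>()).add(r);
        }
        return clusters;
    }

    /** Step 4: keep only results in the user's primary fields, best rank first. */
    static List<Result> narrow(List<Result> results, UserProfile profile) {
        List<String> fields = profile.primaryFields();
        List<Result> narrowed = new ArrayList<>();
        for (Result r : results) {
            if (fields.contains(r.field)) {
                narrowed.add(r);
            }
        }
        narrowed.sort(Comparator.comparingDouble((Result r) -> r.rank).reversed());
        return narrowed;
    }

    public static void main(String[] args) {
        // Made-up results for the query "Jordan"; URLs and ranks are illustrative.
        List<Result> results = new ArrayList<>();
        results.add(new Result("example.org/jordan-country", "travel", 0.9));
        results.add(new Result("example.org/michael-jordan", "sports", 0.8));
        results.add(new Result("example.org/jordan-stats", "sports", 0.7));

        UserProfile profile = new UserProfile();
        // New user: show the clusters and let the user pick a field.
        System.out.println(clusterByField(results).keySet());
        // Step 3: the user clicks a sports result three times over several sessions.
        for (int i = 0; i < 3; i++) {
            profile.logClick(results.get(1));
        }
        // Step 4: later searches are narrowed to the learned field.
        for (Result r : narrow(results, profile)) {
            System.out.println(r.url);
        }
    }
}

/** One search result, assumed to be labelled with a field by an upstream step. */
final class Result {
    final String url;
    final String field;
    final double rank; // higher is better

    Result(String url, String field, double rank) {
        this.url = url;
        this.field = field;
        this.rank = rank;
    }
}

/** Per-user profile built from the click log. */
final class UserProfile {
    // Hypothetical threshold: a field becomes a primary interest only after this
    // many clicks, so a one-off search does not reshape the profile.
    static final int MIN_CLICKS = 3;

    private final Map<String, Integer> clicksPerField = new HashMap<>();

    /** Logs one click on a result. */
    void logClick(Result clicked) {
        clicksPerField.merge(clicked.field, 1, Integer::sum);
    }

    /** Fields with at least MIN_CLICKS clicks, most clicked first. */
    List<String> primaryFields() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(clicksPerField.entrySet());
        entries.removeIf(e -> e.getValue() < MIN_CLICKS);
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            fields.add(e.getKey());
        }
        return fields;
    }
}
```

In this run the new user first sees the two clusters (sports and travel). After three clicks on a sports result, the same query returns only the two sports pages, ordered by rank. The click threshold is one simple way to keep a single search, such as the medicine example, from changing the profile.
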
The architectural design is shown below:

[Figure 1: Architectural Design. Labels in the figure: Indexer, Robot, UI, Retrieval Engine, Logs Analysis, Processing & Learning, URL DB (Web pages, RDB), Relational DB, Index (keyword and URL entries), User DB with logs.]

The final aim is to provide users with what they want, not what a ranking algorithm considers the highest-ranked page.

There will be times when a user is looking for results from a completely new field that is not among his areas of interest according to his search history. In such cases the user may have to refine his query to get better results from our search engine. Experienced users will soon learn how to train the search engine according to their needs.

Because of privacy concerns, a user may want to remain anonymous. We therefore provide anonymous search, which still gives the user the benefit of clustered search results.

4. Plan of action

Software: Open source clustering algorithms, learning algorithms, Java
Hardware: Stand-alone computer
Operating System: Windows XP

Schedule with milestones:

Week         Date                   Scheduled work
Weeks 1-5    Feb 18 – Mar 24        Study of clustering algorithms;
                                    implementation of clustering algorithm;
                                    study of Machine Learning algorithms;
                                    implementation of inductive learning
                                    personalized search using log analysis
Weeks 6-7    Mar 25 – April 07      Implementation of User Interface
Week 8       April 08 – April 14    Testing and Enhancements
Week 9       April 15 – April 22    Extra time for handling delays

5. Evaluation and Testing Method

Testing our implementation essentially requires a large amount of user history logs. Evaluation and testing will be based on user experience.

Test Cases and Evaluation:

1. An experienced user logs in.
   - The search engine should display the list of his most frequently visited pages.

2. An experienced user with a good search history enters a search query (normal operation).
   - The search results should correspond to his primary fields of interest according to his search history.

3. An experienced user with a good search history enters a search query with a completely new field of interest in mind.
   - The search engine will still try to show the results most relevant to the user’s principal areas of interest according to the search history.

4. An experienced user fine-tunes the query after getting unexpected results.
   - The search engine should adjust the results according to the field suggested in the fine-tuned query.

5. A completely new user without any search history.
   i) In the first approach, the search results should be clustered and presented to the user in an easy-to-use format. The user can select a link according to his field of choice. The user query and click data are logged.
   ii) In the second approach, the search results displayed to the user should be based on what other users have clicked for the same query.

6. Anonymous search: the user is not logged in.
   - The search engine should perform a normal search, cluster the results and present them in an easy-to-use format. No data needs to be logged.

6. Bibliography

1. Google Relaunches Personal Search - This Time, It Really Is Personal. http://blog.searchenginewatch.com/blog/050628-073541
2. John Battelle, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. http://www.webpronews.com/topnews/2004/08/03/is-personalized-search-the-future (issues in personalized search are discussed here)
3. Google History FAQ page. http://www.google.com/support/bin/topic.py?topic=1593
4. Is Personalized Search the Future? http://www.webpronews.com/topnews/2004/08/03/is-personalized-search-the-future
5. Eurekster. http://www.eurekster.com/about
6. Eurekster Launches Personalized Social Search. http://searchenginewatch.com/showPage.html?page=3301481
7. Search Engines get personal. http://www.bruceclay.com/newsletter/1004/personalizedsearch.html
8. List of Google Products. http://en.wikipedia.org/wiki/List_of_Google_products