Advanced Internet Application Development Project Proposal SELF-LEARNING PERSONALIZED SEARCH CS 8803

Submitted by: Amogh Budhkar & Swapneel Ambre
Date: 02/12/2007
1. Motivation and Objectives
Personalized search has long been promised as an important next step for increasing
relevancy. We feel that it is one of the directions in which search may head in the future.
A search engine should be intelligent enough not only to understand the entered query but
also to realize what the user intends to ask. A five-year-old and a fifty-year-old querying
for the same term don’t expect the same results, so a single ranking applied to every user,
as is used currently, is inadequate.
We propose a search engine that logs a user’s searches and clicks and learns over time
which topics interest that particular user, thus narrowing the search to these topics
for that user. This makes search truly personalized. We aim to give the user what
he expects.
2. Related work
Several efforts are under way to make search engines personalized. A good
example is A9, which keeps track of searches and can retrieve that information to
allow people to repeat the same searches at a later time. [4]
Google has come up with its own subject-based form of personalized search, where the
user sets up a profile and search results can be filtered to the user's areas of interest.
Google's personalized search reorders search results based on the user's history of past
searches, giving more weight to topics that interest the user. Google maintains search
tracking for Google Account holders, which records Web pages, images, videos, music and more.
Previously called Personalized Search, it also includes bookmarks, search trends and
item recommendations. [8]
Another interesting approach is community-based search from Eurekster [5], a leading
search technology company. Its swicki feature provides personalized social search, with
"buzzclouds" of recent searches and ranked results reflecting the expertise and interests
of the host and community.
3. Proposed work
The basic idea of our work is that users often have a particular field in
mind when they search for a term. A search engine, however, returns all results
that are ranked sufficiently high, regardless of the user’s field of interest.
The problem, then, is knowing what is relevant for a particular user. We can learn a
user’s areas of interest from what he searches for and which links he then clicks.
But just because a user searches for something and clicks it once doesn’t mean he is
really interested in that area. For example, if someone is sick and searches for information
on a particular medicine, that doesn’t mean he will always be interested in that medicine,
or even in the field of medicine. To overcome this problem, the search engine should
store a long history of user searches and clicks in the form of logs. A long log represents
the user’s interests better than any single search does.
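A minimal sketch of such a log is given here, assuming each clicked page already carries a field (category) label. The class name `InterestLog`, its methods and the `minClicks` threshold are hypothetical names chosen for illustration; the proposal does not fix a log format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a per-user click log and a profile derived from it.
class InterestLog {
    // One log entry: the query, the clicked URL and the field of that page.
    static final class Click {
        final String query;
        final String url;
        final String field;

        Click(String query, String url, String field) {
            this.query = query;
            this.url = url;
            this.field = field;
        }
    }

    private final List<Click> history = new ArrayList<>();

    void log(String query, String url, String field) {
        history.add(new Click(query, url, field));
    }

    // Count clicks per field and drop fields with fewer than minClicks clicks,
    // so that a one-off search does not count as an interest.
    Map<String, Integer> interests(int minClicks) {
        Map<String, Integer> counts = new HashMap<>();
        for (Click c : history) {
            counts.merge(c.field, 1, Integer::sum);
        }
        counts.values().removeIf(n -> n < minClicks);
        return counts;
    }

    public static void main(String[] args) {
        InterestLog log = new InterestLog();
        log.log("aspirin dosage", "http://example.org/aspirin", "medicine");
        log.log("nba finals", "http://example.org/nba", "sports");
        log.log("jordan stats", "http://example.org/jordan", "sports");
        // Only "sports" reaches the threshold of two clicks.
        System.out.println(log.interests(2));
    }
}
```

The single medicine click stays in the log but does not enter the profile, which mirrors the sick-user example above.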
A single term like “Jordan” is used in multiple contexts with different meanings. The
search engine needs to distinguish what the user means when he enters this term as a
query or as part of a query. This can be predicted fairly accurately if the logs are used to
infer the user’s areas of interest. It also means the search engine must have some
understanding of page content, so that it can categorize pages and match them to
user queries.
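One way to picture this disambiguation is sketched here, assuming every result already has a field label and a base rank, and that the user profile is a map from field to interest weight. `FieldNarrowing`, `Result` and `narrow` are hypothetical names, and the fallback to the full result list is an assumption rather than part of the proposal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only results from the user's strongest matching field.
class FieldNarrowing {
    static final class Result {
        final String url;
        final String field;
        final double rank; // base rank from the ordinary ranking algorithm

        Result(String url, String field, double rank) {
            this.url = url;
            this.field = field;
            this.rank = rank;
        }
    }

    static List<Result> narrow(List<Result> results, Map<String, Double> interests) {
        // Find the field, among those present in the results, that the user cares about most.
        String best = null;
        double bestWeight = 0.0;
        for (Result r : results) {
            double w = interests.getOrDefault(r.field, 0.0);
            if (w > bestWeight) {
                bestWeight = w;
                best = r.field;
            }
        }
        // Keep that field only; if no field matches the profile, keep everything.
        List<Result> out = new ArrayList<>();
        for (Result r : results) {
            if (best == null || r.field.equals(best)) {
                out.add(r);
            }
        }
        // Highest base rank first.
        out.sort(Comparator.comparingDouble((Result r) -> r.rank).reversed());
        return out;
    }
}
```

For a user whose profile is dominated by sports, a query for "Jordan" would keep only the sports pages, ordered by their base rank.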
Our vision of a personalized search engine, then, is one that logs user searches
(queries and clickstreams) and learns about the user’s areas of interest over time. When
the user enters a query, the search is narrowed down to only those fields in which the user
is interested. The user is presented with the highest-ranked results from the field
most relevant to him. Because the search engine is constantly learning, it can adjust
to a change of interest over time.
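The adjustment to changing interests could be modelled in several ways. The sketch below uses exponential decay of field weights, which is an assumed technique and not one the proposal specifies. `DecayingProfile` and the decay factor are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: field weights that fade unless reinforced by new clicks.
class DecayingProfile {
    private final Map<String, Double> weights = new HashMap<>();
    private final double decay; // assumed factor between 0 and 1

    DecayingProfile(double decay) {
        this.decay = decay;
    }

    // Call once per logged click: all old weights shrink, the clicked field gains 1.
    void recordClick(String field) {
        weights.replaceAll((f, w) -> w * decay);
        weights.merge(field, 1.0, Double::sum);
    }

    double weight(String field) {
        return weights.getOrDefault(field, 0.0);
    }

    // The field with the largest current weight, or null if nothing was logged.
    String topInterest() {
        String best = null;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            if (e.getValue() > bestWeight) {
                bestWeight = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DecayingProfile p = new DecayingProfile(0.8);
        p.recordClick("sports");
        p.recordClick("sports");
        p.recordClick("sports");
        p.recordClick("cooking");
        p.recordClick("cooking");
        System.out.println(p.topInterest());
    }
}
```

With a decay factor of 0.8, three sports clicks followed by two cooking clicks are enough to move the top interest from sports to cooking; a factor closer to 1 makes the profile change more slowly.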
Another feature that could be implemented: whenever the user logs in, the search
engine displays a list of his most frequently visited pages. This differs from most
search engines with a personalized search feature in that the bookmarks need not be
added manually.
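Deriving this list from the click log can be sketched as a simple frequency count. `FrequentPages` and `topVisited` are hypothetical names, and breaking ties alphabetically by URL is an assumption made only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: derive "automatic bookmarks" from the logged clicks.
class FrequentPages {
    // Return up to n URLs, most frequently clicked first.
    static List<String> topVisited(List<String> clickedUrls, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : clickedUrls) {
            counts.merge(url, 1, Integer::sum);
        }
        List<String> urls = new ArrayList<>(counts.keySet());
        urls.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b); // ties: alphabetical
        });
        return new ArrayList<>(urls.subList(0, Math.min(n, urls.size())));
    }
}
```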
Explanation:
1. Users are provided with an interface like that of a normal search engine,
through which they can submit a query. The user doesn’t have to
bother with any settings, which improves usability. The only thing the user needs
to do is log in to his account, if he has not already, from his own computer.
2. If we don’t have sufficient history for the user, there are two possible approaches:
i) The search engine searches the entire web, and the results are clustered and
presented to the user by field (category). The query and the
corresponding fields are logged.
ii) The search engine keeps track of what other users have clicked for
the same query, clusters the results accordingly and shows them to
users who submit that query later.
3. If the user clicks on a particular link from a particular cluster, this click is also
logged. Over time, this log of query and click combinations helps build a user
profile that tells us the user’s primary fields of interest.
4. For long-time users, we already know their primary fields of interest. When a query
is submitted, the search is performed only on a narrowed-down subset of the entire
search space, based on the words in the query and the user’s primary fields of interest.
This gives specialized results relevant to the user’s primary interests.
These results are then displayed.
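The choice between steps 2 and 4, together with approach ii) for new users, can be sketched as follows. `SearchFlow`, `Mode`, `chooseMode` and the `minHistory` threshold are hypothetical; the proposal does not say how much history counts as sufficient.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pick a search mode and use community clicks for new users.
class SearchFlow {
    enum Mode { CLUSTER_ALL, NARROW_TO_INTERESTS }

    // Step 2 applies while the user's log is short, step 4 once it is long enough.
    static Mode chooseMode(int loggedClicks, int minHistory) {
        return loggedClicks >= minHistory ? Mode.NARROW_TO_INTERESTS : Mode.CLUSTER_ALL;
    }

    // query -> (field -> clicks by any user); a shared log assumed for approach ii).
    private final Map<String, Map<String, Integer>> communityClicks = new HashMap<>();

    void logClick(String query, String field) {
        communityClicks.computeIfAbsent(query, q -> new HashMap<>())
                       .merge(field, 1, Integer::sum);
    }

    // Order the fields for a query by how often other users clicked in them.
    List<String> fieldsByPopularity(String query) {
        Map<String, Integer> counts =
            communityClicks.getOrDefault(query, Collections.<String, Integer>emptyMap());
        List<String> fields = new ArrayList<>(counts.keySet());
        fields.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b); // ties: alphabetical
        });
        return fields;
    }
}
```

A new user querying "jordan" would then see clusters ordered by what earlier users clicked for that query, while a long-time user would skip straight to the narrowed search.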
The architectural design is shown below:
[Figure components: Robot, Indexer, Index (keyword and URL entries), URL DB (Web pages, RDB),
Relational DB, UI, Retrieval Engine, Logs, Analysis, Processing & Learning, User DB with logs]
Figure 1: Architectural Design
The final aim is to provide users with what they want, not what a ranking algorithm
considers the highest-ranked page. There will be times when a user is looking for
results from a completely new field that is not among his areas of interest according to
his search history. In such cases the user may have to refine his query to get better results
from our search engine. Experienced users will soon learn how to train the search engine
according to their needs.
Because of privacy concerns, a user may want to remain anonymous. We therefore provide
anonymous search, which still gives the user the benefits of clustered search results.
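The clustered presentation for anonymous users might look like the sketch below. Grouping by a given field label stands in for a real clustering algorithm, which the proposal leaves to open source libraries; `ClusteredResults`, `Page` and `groupByField` are hypothetical names. Nothing is written to any log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: present ranked results grouped by field, without logging.
class ClusteredResults {
    static final class Page {
        final String url;
        final String field; // assumed to come from a separate clustering step

        Page(String url, String field) {
            this.url = url;
            this.field = field;
        }
    }

    // Groups pages by field; clusters and pages keep the order of the ranked input.
    static Map<String, List<String>> groupByField(List<Page> ranked) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (Page p : ranked) {
            clusters.computeIfAbsent(p.field, f -> new ArrayList<>()).add(p.url);
        }
        return clusters;
    }
}
```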
4. Plan of action
Software: Open source clustering algorithms, Learning algorithms, Java
Hardware: Stand-alone Computer
Operating System: Windows XP
Schedule with milestones:
Week      Date
Week 1    Feb 18 – Feb 24
Week 2    Feb 25 – Mar 03
Week 3    Mar 04 – Mar 10
Week 4    Mar 11 – Mar 17
Week 5    Mar 18 – Mar 24
Week 6    Mar 25 – Mar 31
Week 7    April 01 – April 07
Week 8    April 08 – April 14
Week 9    April 15 – April 22
Scheduled work:
- Study of clustering algorithms
- Implementation of clustering algorithm
- Study of Machine Learning algorithms
- Implementation of inductive learning personalized search using log analysis
- Implementation of User Interface
- Testing and Enhancements
- Extra time for handling delays
5. Evaluation and Testing Method
Testing our implementation essentially requires a large amount of history logs for the user.
Evaluation and testing will be based on user experience.
Test Cases and Evaluation:
1. An experienced user logs in.
- The search engine should display the list of his most frequently visited pages.
2. An experienced user with a good search history enters a search query (normal operation).
- The search results should correspond to his primary fields of interest according to his
search history.
3. An experienced user with a good search history enters a search query (with a completely
new field of interest in mind).
- The search engine will still try to show the results most relevant to the user’s principal
areas of interest according to the search history.
4. An experienced user fine-tunes the query after getting unexpected results.
- The search engine should adjust the results according to the field suggested in the
fine-tuned query.
5. A completely new user without any search history.
i) In the first approach, the search results should be clustered and presented to the
user in an easy-to-use format. The user can select a particular link according to his choice
of field. The user query and click data are logged.
ii) In the second approach, the search results should be displayed to the user based on
what other users have clicked for the same query.
6. Anonymous search: the user is not logged in.
- The search engine should perform a normal search, cluster the results and present them in an
easy-to-use format. No data needs to be logged.
6. Bibliography
1. Google Relaunches Personal Search - This Time, It Really Is Personal
http://blog.searchenginewatch.com/blog/050628-073541
2. The Search: How Google and Its Rivals Rewrote the Rules of Business and
Transformed Our Culture, by John Battelle
http://www.webpronews.com/topnews/2004/08/03/is-personalized-search-the-future
(issues in personalized search are discussed here)
3. Google History FAQ page
http://www.google.com/support/bin/topic.py?topic=1593
4. Is Personalized Search the Future?
http://www.webpronews.com/topnews/2004/08/03/is-personalized-search-the-future
5. Eurekster
http://www.eurekster.com/about
6. Eurekster Launches Personalized Social Search
http://searchenginewatch.com/showPage.html?page=3301481
7. Search Engines Get Personal
http://www.bruceclay.com/newsletter/1004/personalizedsearch.html
8. List of Google Products
http://en.wikipedia.org/wiki/List_of_Google_products