Ferret: 360° Search
Prafulla Mahindrakar
Aniket Patil
Ketan Umare
Advisor: Dr. Ling Liu
CS8803: Advanced Internet Application Development, Group Project, Spring 09
Table of Contents
1. MOTIVATION AND OBJECTIVES
2. RELATED WORK
2.1 GOOGLING
2.2 SOCIALLY RELEVANT SEARCH
2.2 CATEGORIZATION OF SEARCH RESULTS
3. ARCHITECTURE
3.1 SYSTEM ARCHITECTURE DIAGRAM
3.2 PATTERN ORIENTED ARCHITECTURE
3.2.1 DESIGN PATTERNS
3.2.2 DESIGN PATTERNS USED IN FERRET
3.3 HIGH PERFORMANCE
3.3.1 THREADING
3.3.2 CACHING
3.4 DATABASE SCHEMA
3.4.1 ER DIAGRAM
3.4.2 USER TABLE
3.4.3 PAGE KEYWORD TABLE
3.4.4 USER SESSION TABLE
4. COMPONENTS
4.1 AUTHENTICATION
4.2 STANFORD TAGGER
4.2 WEB SEARCH
4.3 MEDIA SEARCH
4.4 PRODUCT SEARCH
4.4.1 CLUSTERING OF RESULTS
4.5 SOCIAL SEARCH
4.5.1 SESSIONS
4.5.2 LISTEN TO USER CLICKS
4.5.3 HEARTBEAT MESSAGES
4.6 CATEGORIZATION ENGINE
4.6.1 WHY DOCUMENT CLUSTERING
4.6.2 APPROACHES
4.6.3 BUILDING BLOCKS
4.6.4 LINGO ALGORITHM
4.7 PRESENTATION ENGINE
4.7.1 SEARCH RESULTS TAB CREATOR
4.8 VIEW
5. EVALUATION FRAMEWORK
5.1 SOCIAL NETWORK SIMULATION
5.1.1 ER DIAGRAM
5.2 JMETER
5.2.1 TEST CASES
5.3 COMPARISON TO OTHER SEARCH ENGINES
6. TESTING AND RESULTS
6.1 PROTOTYPE SYSTEM
6.1.1 SOFTWARE
6.1.2 HARDWARE
6.1.3 OPERATING SYSTEM
6.2 RESULTS
6.2.1 WITHOUT MEMCACHED
6.2.2 WITH MEMCACHED
6.2.3 LOAD RESPONSE WITH MEMCACHED
7. FUTURE WORK
8. CONCLUSIONS
9. BIBLIOGRAPHY
9. APPENDIX
1. Motivation and Objectives
fer·ret (v) (\ˈfer-ət\)
to find and bring to light by searching
Imagine trying to find a pair of the latest Ray-Ban glasses in the Lenox Square Mall.
It is not an easy task! Now think about doing the same across the World Wide Web.
Feeling dizzy? The World Wide Web, with its astronomical amount of information,
presents an enormous challenge for resource discovery. Precise navigation is
impossible with the increasingly large collection of hyperlinks that users must
traverse. Commercial search engines like Google and Yahoo have solved the
problem at a fundamental level by making available a hypertext-based index of
pages across the web. Web users can query the index for documents about a specific
topic to find the desired document.
While search engines have become quite popular and are helping to redefine how
people access information scattered across the wide-area network, they are not well
suited to the case when users do not know what exactly they are looking for. In such
a situation, using one of the popular search engines can be a messy, frustrating
experience. What do you do when you don’t know where to start? Give Ferret a try!
For any topic in the universe, Ferret provides a neatly organized view of the web.
Our category guides bring meaningful and relevant information that makes
browsing for a topic fast. Rather than the messy back-and-forth clicking of search
results, we do the processing so that you can learn, explore and discover the things
that matter to you. Ferret offers you a new way to discover the Web – it’s the place
you should be when you want to browse and discover everything the Web has to
offer. Come to Ferret when you want to learn about a topic or explore what’s
happening now on the Web. We’ll show you content that you may have never
discovered otherwise and we’ll give you an at-a-glance look at everything related
to the query. Think of Ferret as your guide for exploring the Web.
For instance, consider the search term ‘Transformers’. A Google search returns a
serially arranged list whose first page covers the movie ‘Transformers’ and
electrical transformers. However, a user who is interested in
the class ‘Transformer’ in Java or in the Transformers comics
needs to browse several pages before such results are discovered. Our system
graphically arranges and classifies results into categories such as text, multimedia,
entertainment, discussions, blogs and more. A user needs only a single click to
have a 360-degree view of the content associated with the query term.
Ferret. Your guide to the world!
2. Related Work
Search is a constantly evolving and continuously researched topic. There
have been great success stories and even greater debacles in this industry. Web
search has become such an important part of our lives that it has even contributed to our
vocabulary. The following are some of the most distinctive systems
currently available online, from which we draw our inspiration.
Figure 1: Taxonomy of Existing Search Technologies
2.1 Googling
In their seminal work [1], the authors described a new way of ranking web
documents, based on the idea of citation. The search engine instantly became a hit
and overtook all of its competitors. The webpage [2] is among the most visited pages
online and everyone knows “The Google Story”.
Google uses a simple keyword-based search; its key contribution is the
ranking of content. Google thus demonstrates that content alone is not enough:
the way results are ordered and presented matters just as much. Google has
continued to innovate and add new features, but it still
has a long way to go.
2.2 Socially relevant search
Social search or a social search engine is a type of web search method that determines
the relevance of search results by considering the interactions or contributions of
users.[3]
Delver [4] builds on this simple idea: it uses the social network of a user to
come up with better recommendations. It enables you to find, experience and
benefit from the wealth of information created and referenced by your social world.
Socially relevant search can really benefit a user, since what matters to a user is usually
what matters to his peers. Paper [5] discusses the benefits of integrating web
search with social search, quantifies them, and delineates the
challenges in doing so.
2.2 Categorization of search results
Search results categorization is another important way to present the search results.
Take an example of the word Transformers. For the same word we could have
different implications – an electrical device, a movie, the cartoon series, a toy, there
could be a review about the movie, or some news about the invention of some new
efficient transformer, etc. So how do you show these results? Which is more
important?
These questions are almost impossible to answer. Papers [6-9] show a variety of
ways in which web search results can be classified and quantify the benefits.
Kosmix [10] is one of the most promising sites to have
leveraged this idea. It uses the search provided by Google and wraps it
in its own classification system. It has been voted one of the best new
startups [11], which underlines the importance of classifying
results.
3. Architecture
The following sections give an outline of the System architecture and a small
description of the important components.
3.1 System Architecture Diagram
Figure 2 System Architecture
Ferret can operate in two modes: logged-in mode or private mode. Each of these modes
is described in detail in later sections. In logged-in mode, along with the
typical web results, Ferret also provides socially relevant search results, using the
user's profile from one of the major social networks, for example Facebook
or Twitter. The typical web results are divided into three broad categories, namely
Web Search, Media Search and Product Search. Each category is further subdivided
using our clustering algorithm.
3.2 Pattern Oriented Architecture
The aim while developing Ferret was to keep it flexible enough that
new features can be added with relative ease. Performance was also a major concern, so each
component was built with a large-scale system in mind. A pattern-oriented
architecture made both goals easier to achieve. The following sections describe the
patterns used in Ferret.
3.2.1 Design Patterns
A design pattern describes a particular recurring design problem that arises in
specific design contexts, and presents a well-proven generic scheme for its solution.
The solution scheme is specified by describing its constituent components, their
responsibilities and relationships, and the ways in which they collaborate [12, 13].
3.2.2 Design Patterns used in Ferret
Ferret uses the following design patterns.
3.2.2.1 Front Controller
The Front Controller pattern centralizes request processing, so that
changes to the layers below are transparent to the presentation layer. Concerns such as
communication and threading can also be abstracted away from the presentation layer.
3.2.2.2 Abstract Factories
Abstract Factory is a creational pattern that separates the creation of objects from the place
where they are used. This makes it easy to add new modules.
3.2.2.3 Strategy
The Strategy pattern allows Ferret to swap clustering algorithms, so that
new algorithms can be plugged in with relative ease. This was
especially useful while testing various algorithms.
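As an illustration of how the Strategy pattern makes clustering algorithms interchangeable, here is a minimal Python sketch. The class names (ClusteringStrategy, FirstWordClustering, SingleClusterClustering, CategorizationEngine) and the toy algorithms are hypothetical and are not taken from the Ferret code base.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ClusteringStrategy(ABC):
    """Common interface for every clustering algorithm (hypothetical name)."""

    @abstractmethod
    def cluster(self, results):
        """Return a dict mapping a cluster label to a list of results."""


class FirstWordClustering(ClusteringStrategy):
    """Toy algorithm: group result titles by their first word."""

    def cluster(self, results):
        groups = defaultdict(list)
        for title in results:
            words = title.split()
            label = words[0].lower() if words else ""
            groups[label].append(title)
        return dict(groups)


class SingleClusterClustering(ClusteringStrategy):
    """Toy algorithm: put every result into one cluster."""

    def cluster(self, results):
        return {"all": list(results)}


class CategorizationEngine:
    """Context object: delegates to whichever strategy it currently holds."""

    def __init__(self, strategy):
        self.strategy = strategy

    def categorize(self, results):
        return self.strategy.cluster(results)
```

Swapping algorithms then only requires assigning a different strategy object to the engine; the calling code does not change.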
3.2.2.4 Adapter
Adapter pattern is used to abstract the search/fetch/cluster logic from the
presentation generator. This generator can also be modified easily irrespective of
changes to the prior system.
3.2.2.5 Singleton
Several resources need only a single instance. We keep the thread
controllers in singletons to reduce thread creation overhead. The
tagger library is likewise loaded just once, which avoids the cost of
re-reading it.
3.2.2.6 Spring (OS) Doors
As in the Spring system developed at Sun Labs, Ferret has a Controller that abstracts
data access from the presentation layer. This allows individual
subsystems to be deployed remotely, which could be used in the future for large-scale distributed
computing.
3.3 High Performance
For a search engine, performance is critical. Ferret achieves performance through large-scale
threading, distributed caching and the ability to separate modules onto
separate physical hosts.
3.3.1 Threading
Ferret uses pre-spawned thread pools to offset the overhead of thread spawning. It
also uses threads to perform searches across various domains in parallel.
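The idea of a shared pool that fans one query out to several search domains can be sketched as follows. This is a Python illustration only: the three search functions are hypothetical stand-ins for the real back ends, and the pool size of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the web, media and product search back ends.
def search_web(query):
    return ["web result for " + query]


def search_media(query):
    return ["media result for " + query]


def search_products(query):
    return ["product result for " + query]


SEARCHERS = {"web": search_web, "media": search_media, "products": search_products}

# One pool shared by all queries: worker threads are reused
# instead of being created for every request.
POOL = ThreadPoolExecutor(max_workers=8)


def search_all(query):
    """Run every searcher in parallel and collect the results by domain."""
    futures = {name: POOL.submit(fn, query) for name, fn in SEARCHERS.items()}
    # result() blocks until the corresponding worker has finished.
    return {name: future.result() for name, future in futures.items()}
```

With this structure the response time of a query is roughly that of the slowest domain rather than the sum of all domains.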
3.3.2 Caching
Ferret uses memcached [14, 15] to cache recent results. To maintain the freshness of
the results, each cached entry is associated with an expiry value. Currently the expiry
time is fixed arbitrarily, but future work would aim to derive this number
using a learning algorithm. For example, if it were known that Google does not refresh its
index for at least n hours, results could be cached until the index is updated.
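The expiry mechanism can be sketched with a small in-memory cache. This is a stand-in for memcached written for illustration, not the memcached client API; the names ExpiringCache and cached_search are hypothetical.

```python
import time


class ExpiringCache:
    """In-memory stand-in for memcached: each entry carries an expiry time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested
        self._entries = {}            # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]    # stale entry: drop it and report a miss
            return None
        return value


def cached_search(cache, query, search_fn):
    """Return cached results for the query, calling search_fn only on a miss."""
    results = cache.get(query)
    if results is None:
        results = search_fn(query)
        cache.set(query, results)
    return results
```

A repeated query within the expiry window is served from the cache; after the window the search is executed again and the entry is refreshed.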
3.4 Database Schema
The database used by Ferret is deliberately minimal, which helps
performance. The following section describes the schema in detail.
3.4.1 ER Diagram
usr_user (PK uid): username, password, name, profession, imageUrl
puk_pagekeyword (PK pageid): page, keyword, title
uss_usersession (PK uid): pageid, historycount, timestamp, sessionid
Figure 3 Database Model for Social Search
Table usr_user:
Column Name    Description
Uid            Auto-generated primary key for usr_user table
Username       Login name for the user
Password       User’s password
Name           User’s name
Profession     User’s profession
ImageUrl       Pathname for the user image
Table puk_pagekeyword:
Column Name    Description
pageid         Auto-generated primary key for puk_pagekeyword table
page           URL of the page
keyword        Processed query term for which page was retrieved
title          Title for page
Table uss_usersession:
Column Name    Description
uid            Auto-generated primary key for uss_usersession table
pageid         Refers to puk_pagekeyword.pageid
historycount   Frequency of usage of search results
timestamp      Time at which user selected a page for reading
sessionid      Server-generated session id for user
3.4.2 User Table
The user table is needed to maintain the login information of the user in case the Google
authentication system isn’t used. It also stores the uids. These could be taken
directly from Facebook, but Ferret's own uids would be needed if multiple networks are supported.
3.4.3 Page Keyword Table
This table is used to maintain a list of popular keyword and page combinations
accessed by the users. Based on freshness criteria, this table should be cleaned
every x number of days.
3.4.4 User Session table
This table is used to track the user and his favorite links. This table is essential to
implement the Good page Bad page algorithm.
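The three tables can be sketched as SQL definitions, created here in an in-memory SQLite database from Python. The column types and constraints are assumptions, since the report only lists column names and descriptions. In particular, this sketch treats uss_usersession.uid as a reference to usr_user.uid, whereas the table description above calls it an auto-generated key.

```python
import sqlite3

# Column names follow the tables above; types and constraints are assumed.
SCHEMA = """
CREATE TABLE usr_user (
    uid        INTEGER PRIMARY KEY AUTOINCREMENT,
    username   TEXT NOT NULL UNIQUE,
    password   TEXT NOT NULL,
    name       TEXT,
    profession TEXT,
    imageUrl   TEXT
);
CREATE TABLE puk_pagekeyword (
    pageid  INTEGER PRIMARY KEY AUTOINCREMENT,
    page    TEXT NOT NULL,
    keyword TEXT NOT NULL,
    title   TEXT
);
CREATE TABLE uss_usersession (
    uid          INTEGER NOT NULL REFERENCES usr_user(uid),
    pageid       INTEGER NOT NULL REFERENCES puk_pagekeyword(pageid),
    historycount INTEGER NOT NULL DEFAULT 0,
    timestamp    INTEGER,
    sessionid    TEXT
);
"""


def create_schema(conn):
    """Create the three Ferret tables on the given connection."""
    conn.executescript(SCHEMA)
```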
4. Components
This section explains the various modules that constitute the Ferret search engine.
4.1 Authentication
Ferret uses its own database to authenticate the user, although it would be easy to use the
Google OpenID system for authentication instead. If the user does not log in, the system
treats the user as a guest and does not track the user's activities. This enables private browsing.
4.2 Stanford Tagger
The Stanford Tagger used by Ferret is a part-of-speech tagger: a piece of
software that reads text in some language and assigns a part of speech, such as
noun, verb or adjective, to each word (and other token).
The tagger is used to identify relevant keywords in a query and store them in the
database. It is used in the following components:
• Dictionary Search: The tagger identifies nouns (proper and common, both
singular and plural) as keywords to be sent to WordNet for query.
• Social Search: The tagger identifies nouns (proper and common, both
singular and plural), verbs and adverbs from the user’s query.
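The keyword selection described in the two bullets can be sketched as a filter over already-tagged tokens. The sketch assumes the tagger emits Penn Treebank tags, as the Stanford tagger's English models do, and takes (word, tag) pairs as input rather than calling the tagger itself. The function names are hypothetical.

```python
# Penn Treebank tag groups (assumed tag set of the tagger's English models).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}      # common/proper, singular/plural
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ADVERB_TAGS = {"RB", "RBR", "RBS"}


def keywords(tagged_tokens, allowed_tags):
    """Keep the lower-cased words whose tag is in allowed_tags, in query order."""
    return [word.lower() for word, tag in tagged_tokens if tag in allowed_tags]


def dictionary_keywords(tagged_tokens):
    """Dictionary Search: nouns only."""
    return keywords(tagged_tokens, NOUN_TAGS)


def social_keywords(tagged_tokens):
    """Social Search: nouns, verbs and adverbs."""
    return keywords(tagged_tokens, NOUN_TAGS | VERB_TAGS | ADVERB_TAGS)
```

For the tagged query [("what", "WP"), ("drug", "NN"), ("treats", "VBZ"), ("a", "DT"), ("headache", "NN")], the dictionary keywords are "drug" and "headache", while the social keywords also include "treats".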
4.2 Web Search
This engine is multithreaded: it accepts the raw query and dispatches it to
worker threads, which collect search results from a variety of
search engines such as Google [2], A9 [16] and IMDB [17]. The worker threads use WSDL
interfaces to communicate with the various search engines. The external interface is extensible,
since collecting results from a new search engine simply requires the
implementation of a WSDL interface. This enables our system to be augmented with
additional search results from Yahoo, Windows Live or any other search engine.
4.3 Media Search
This engine is multithreaded: it accepts the raw query and dispatches it to
worker threads, which collect search results from a variety of
search engines such as Google [2], A9 [16] and IMDB [17]. The worker threads use WSDL
interfaces to communicate with the various search engines. The external interface is extensible,
since collecting results from a new search engine simply requires the
implementation of a WSDL interface. This enables our system to be augmented with
additional search results from Yahoo, Windows Live or any other search engine.
4.4 Product Search
Ferret product search uses the Amazon E-Commerce API to retrieve product
information. The API exposes Amazon's product data and e-commerce functionality.
This allows Ferret to leverage the data that Amazon uses to power its own business.
Ferret is able to retrieve product results over a huge range of categories. For every
product, Ferret retrieves the product name, the price listed on Amazon and a
product image.
All searches are performed for US locale. In the future, it may be possible to detect
the geographical region from where the query originates and adjust the locale
accordingly.
4.4.1 Clustering of Results
The search results are clustered dynamically on the basis of categories that are
retrieved for the query term. All products belonging to a single category are
arranged together using seed list based clustering. Any Amazon product can be
classified into one of the following categories:
Apparel, Automotive, Baby, Beauty, Blended, Books, Classical, Digital Music, DVD,
Electronics, Foreign Books, Gourmet Food, Health Personal Care, Hobbies, Home Garden,
Jewelry, Kitchen, Magazines, Merchants, Miscellaneous, Music, Musical Instruments, Music
Tracks, Office Products, Outdoor Living, PC Hardware, Pet Supplies, Photo, Restaurants,
Software, Software Video Games, Sporting Goods, Tools, Toys, VHS, Video, Video Games,
Wireless, Wireless Accessories
Figure 4 List of product categories in Amazon
We use the above categories as a seed list and use the retrieved product
information to detect the category and cluster the results accordingly.
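A minimal sketch of seed-list clustering in Python follows. It assumes each retrieved product is a dictionary with hypothetical "name" and "category" fields, uses only an excerpt of the category list, and sends products whose category is not in the seed list to "Miscellaneous". The fallback rule is an assumption, not something stated in the report.

```python
from collections import defaultdict

# Excerpt of the Amazon category seed list from Figure 4.
SEED_CATEGORIES = {
    "Apparel", "Books", "DVD", "Electronics", "Toys", "Video Games", "Miscellaneous",
}


def cluster_products(products):
    """Group product names by category, using the seed list as the set of clusters."""
    clusters = defaultdict(list)
    for product in products:
        category = product.get("category")
        if category not in SEED_CATEGORIES:
            category = "Miscellaneous"   # assumed fallback for unknown categories
        clusters[category].append(product["name"])
    return dict(clusters)
```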
Due to the extensible nature of the product search component, results from other
e-commerce providers such as Buy.com and eBay can easily be added. We also plan
to integrate functionality to sort results by cost and social relevance.
4.5 Social Search
Ferret adds a new spin to search: social networking. One of the most innovative
features of Ferret is the ability to retrieve search results that are relevant to the
user’s social network.
The feature allows the user to leverage searches performed by the user’s friends.
Social search recommends the best pages found by people in the user’s network that
are relevant to the user’s query. Ferret’s social search tries to match the user’s query
term with a larger set of searchers in the user’s social network that are looking for
the same things. The results are clustered by friend’s name and are listed
serially. Each result contains the page name and a clickable page URL.
The feature is an opt-in: no one can see what the user is searching for unless the
user logs in. This ensures user privacy.
Social Search is implemented using the following primitives:
• Sessions
• Listen to user clicks
• Heartbeat Messages
4.5.1 Sessions
In Ferret, a session stores the state of communication between a server and the user
enabling the server to identify that user across multiple page requests or visits to
that site.
A session is created when a user logs in with his username and password. The
session for a user stores the following attributes:
• User ID: The primary key generated by the database for the user
• Query: The query term currently being searched by the user
• Page URL: The page currently being viewed by the user
• Timestamp: The time at which the user clicked on the page URL
4.5.2 Listen to user clicks
A request is sent to the server each time the user clicks on a page for browsing. This
request is used to associate the query term processed by the Stanford Tagger (the keywords)
with the user ID previously stored in the session, for future processing. The
algorithm for handling user clicks is as follows:
If session is invalid
    Return
Else if no timestamp exists in session
    Insert URL into session
    Insert Keyword into session
    Insert Timestamp into session
    Return
Figure 5 User-Click Algorithm
4.5.3 Heartbeat Messages
A heartbeat message is an event-driven message that is sent to the server when
a search results page is reloaded. This message is used to detect whether the user
likes the page he has just viewed. We use a heuristic to differentiate such a page from
one the user does not like.
The heuristic Ferret uses is as follows: if the user spends more time on a certain
page, we assume he does so because he likes the page. If the user returns
from a page “quickly”, we assume he does not like the page. Currently, we use a threshold of
30 seconds to differentiate a good page from a bad page. If the user spends 30
seconds or more on a specific page, the system records the page as a good page
and stores it in the database. If the user returns from the page in less than 30
seconds, the page is not associated with the user.
The algorithm for the heartbeat process can be summarized as follows:
If session is invalid
    Return
Else if no timestamp exists in session
    Return
Else if page is liked by user
    Associate page-keyword combination with userid
    Return
Else if page is disliked by user
    Remove page-keyword association with user
    Return
Figure 6 Heartbeat Algorithm
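The click and heartbeat algorithms of Figures 5 and 6 can be sketched together in Python. The session is modeled as a plain dictionary, an invalid session as None, and the database as a set of (uid, url, keyword) tuples; all of these representations are assumptions. Clearing the click data at the end of the heartbeat is also an assumption, added so that the next click can be recorded.

```python
GOOD_PAGE_SECONDS = 30  # threshold stated in the text


def on_click(session, url, keyword, now):
    """Figure 5: remember which page the user opened and when."""
    if session is None:                 # invalid session
        return
    if "timestamp" not in session:
        session["url"] = url
        session["keyword"] = keyword
        session["timestamp"] = now


def on_heartbeat(session, liked_pages, now):
    """Figure 6: when the results page reloads, judge the page just viewed."""
    if session is None:                 # invalid session
        return
    if "timestamp" not in session:      # no click was recorded
        return
    key = (session["uid"], session["url"], session["keyword"])
    if now - session["timestamp"] >= GOOD_PAGE_SECONDS:
        liked_pages.add(key)            # good page: associate with the user
    else:
        liked_pages.discard(key)        # bad page: remove any association
    # Assumption: reset the click data so the next click can be recorded.
    for field in ("url", "keyword", "timestamp"):
        del session[field]
```

A click followed by a heartbeat 40 seconds later records the page as good; a later visit to the same page that lasts only 10 seconds removes the association again.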
In the future, social search can be improved by deducing the “meaning” of the query
being searched using natural language processing query techniques and using the
meaning to retrieve search results. For instance, if the user is searching for “what
drug treats a headache” Ferret can process the semantic relationships between
words and may deduce that someone searching for “what medicine relieves
migraines” is a match. In addition, it may be possible to rank a set of results
retrieved for a specific user’s friend by freshness or relevance to the query.
4.6 Categorization Engine
The results collected through the various websites are then categorized using Lingo
clustering, and then grouped into different categories.
4.6.1 Why Document Clustering
With the enormous growth of the Internet, it has become very difficult for users
to find relevant documents. In response to the user’s query, currently available
search engines return a ranked list of documents along with their partial content
(snippets). If the query is general, it is extremely difficult to identify the specific
document in which the user is interested. Users are forced to sift through a long
list of off-topic documents. Moreover, internal relationships among the documents in
the search result are rarely presented and are left for the user to work out.
One approach is to automatically group search results into thematic groups
(clusters), which helps the user to see various perspectives on the same query
grouped into categories.
4.6.2 Approaches
Clustering of web search results was first introduced in the Scatter-Gather system.
Several algorithms followed. Suffix Tree Clustering (STC), implemented in the
Grouper system, pioneered the use of recurring phrases as the basis for deriving
conclusions about the similarity of documents. MSEEC and SHOC also made explicit use
of word proximity in the input documents. Apart from phrases, graph-partitioning
methods have been used in clustering search results.
All the above approaches follow a scheme where cluster content discovery is
performed first, and then, based on the content, the labels are determined. But very
often intricate measures of similarity among documents do not correspond well
with plain human understanding of what a cluster’s “glue” element has been. To
avoid such problems, the Lingo algorithm reverses this process: it first attempts to ensure
that it can create a human-perceivable cluster label and only then assigns documents
to it. This is the approach we have followed in our implementation of clustering web
results.
4.6.3 Building Blocks
The following sections describe the building blocks of the
clustering algorithm used in Ferret.
4.6.3.1 Vector Space model
Vector Space Model (VSM)[18] is a technique of information retrieval that
transforms the problem of comparing textual data into a problem of comparing
algebraic vectors in a multidimensional space. Once the transformation is done,
linear algebra operations are used to calculate similarities among the original
documents. Every unique term (word) from the collection of analyzed documents
forms a separate dimension in the VSM and each document is represented by a
vector spanning all these dimensions.
For example, if vector v represents document j in a k-dimensional space, then
component t of vector v, where t ∈ 1 . . . k, represents the degree of the relationship
between document j and the term corresponding to dimension t. This relationship is
best expressed as a t × d matrix A, usually called a term-document matrix, where t
is the number of unique terms and d is the number of documents. Element a_ij of
matrix A is therefore a numerical representation of the relationship between term i and
document j. There are many methods for calculating a_ij, commonly referred to as
term weighting methods.
4.6.3.2 Calculating Relevance
We use the tf-idf method for calculating the term weights. The tf–idf weight (term
frequency–inverse document frequency) is a weight often used in information
retrieval and text mining. This weight is a statistical measure used to evaluate how
important a word is to a document in a collection or corpus. The importance
increases proportionally to the number of times a word appears in the document
but is offset by the frequency of the word in the corpus.
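To make the weighting concrete, the following Python sketch builds a small term-document matrix with tf-idf weights. It uses one common variant, a_ij = tf_ij × log(d / df_i), where tf_ij is the count of term i in document j, d is the number of documents and df_i is the number of documents containing term i. The report does not say which variant Ferret uses, so this choice is an assumption, as is the whitespace tokenization.

```python
import math


def tfidf_matrix(documents):
    """Return (terms, A) where A[i][j] is the tf-idf weight of term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    matrix = []
    for term in terms:
        # Document frequency: number of documents that contain the term.
        df = sum(1 for tokens in tokenized if term in tokens)
        idf = math.log(n_docs / df)
        matrix.append([tokens.count(term) * idf for tokens in tokenized])
    return terms, matrix
```

A term that occurs in every document gets weight log(1) = 0 everywhere, which is how the corpus frequency offsets the raw term count.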
4.6.3.3 Suffix Arrays
Let A = a_1 a_2 a_3 . . . a_n be a sequence of objects. Let us denote by A_i the suffix of A
starting at position i ∈ 1 . . . n, such that A_i = a_i a_{i+1} a_{i+2} . . . a_n. An empty suffix is also
defined for every A as A_{n+1} = ∅. A suffix array [19] is a lexicographically ordered array of all suffixes
of A. Suffix arrays are an efficient data structure for verifying whether a
sequence of objects B is a substring of A. The complexity of this operation is O(P + log N), where
P is the length of B and N is the length of A. A suffix array can be built in O(N log N).
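A minimal Python sketch of a suffix array and the substring test is given below. For brevity it sorts whole suffixes, which is slower than the O(N log N) constructions mentioned above, and its plain binary search costs O(P log N) rather than O(P + log N). Both simplifications are deliberate and do not reflect how Ferret builds its arrays. The sequence may be a string or a list of words.

```python
def build_suffix_array(seq):
    """Start positions of all suffixes of seq, in lexicographic order.

    Sorting whole suffixes is simple but not the fastest construction.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def contains(seq, suffix_array, pattern):
    """Binary-search the suffix array for a suffix that starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if seq[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(suffix_array):
        return False
    start = suffix_array[lo]
    return seq[start:start + m] == pattern
```

Because the array works on any comparable sequence, the same functions can look up a phrase such as ["to", "be"] in a list of words, which is how phrases rather than characters would be handled.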
4.6.3.4 Singular Value Decomposition
An algebraic method of matrix decomposition called Singular Value
Decomposition (SVD) [20] is used for discovering the orthogonal basis of the original
term-document matrix. This basis consists of orthogonal vectors that, at least
hypothetically, correspond to topics present in the original term-document matrix.
SVD breaks a t × d matrix A into three matrices U, Σ and V, such that A = UΣV^T. U is
a t × t orthogonal matrix whose column vectors are called the left singular vectors of
A, V is a d × d orthogonal matrix whose column vectors are called the right singular
vectors of A, and Σ is a t × d diagonal matrix having the singular values of A ordered
decreasingly along its diagonal. The rank r_A of matrix A is equal to the number of its
non-zero singular values. The first r_A columns of U form an orthogonal basis for the
column space of A, an essential fact used by Lingo.
4.6.4 Lingo Algorithm
At a very high level, Lingo [21] first finds frequent phrases in the input
documents, hoping they are the most informative source of human-readable topic
descriptions. Next, by performing reduction of the original term-document matrix
using SVD, it tries to discover any existing latent structure of diverse topics in the
search result. Finally, it matches group descriptions with the extracted topics and
assigns relevant documents to them.
4.6.4.1 Preprocessing
The aim of the preprocessing phase is to prune from the input all characters and
terms that can possibly affect the quality of group descriptions. Two steps are
performed. First, text filtering removes HTML tags, entities and non-letter characters
except for sentence boundaries. Next, stemming and stop-word
removal complete the preprocessing phase.
4.6.4.2 Phrase Extraction
We define frequent phrases as recurring ordered sequences of terms appearing in
the input documents. Intuitively, when writing about something, we usually repeat
the subject-related keywords to keep a reader’s attention. Obviously, in a good
writing style it is common to use synonymy and pronouns and thus avoid annoying
repetition. To be a candidate for a cluster label, a frequent phrase or a single term
must:
• appear in the input documents at least a certain number of times (term
frequency threshold),
• not cross sentence boundaries,
• be a complete phrase,
• neither begin nor end with a stop word.
We use suffix arrays to find such complete phrases.
4.6.4.3 Cluster Label Induction
Once frequent phrases (and single frequent terms) that exceed term frequency
thresholds are known, they are used for cluster label induction. There are three
steps to this: term-document matrix building, abstract concept discovery, and
phrase matching with label pruning.
The term-document matrix is constructed out of single terms that exceed a
predefined term frequency threshold. The weight of each term is calculated using the
standard term frequency, inverse document frequency (tf-idf) formula. Terms
appearing in document titles are additionally scaled by a constant factor.
In abstract concept discovery, the Singular Value Decomposition method is applied to
the term-document matrix to find its orthogonal basis. Vectors of this basis (SVD’s U
matrix) represent the abstract concepts appearing in the input documents.
The phrase matching and label pruning step, where group descriptions are discovered,
relies on an important observation: both abstract concepts and frequent phrases
are expressed in the same vector space, the column space of the original
term-document matrix A. The classic cosine distance is used to calculate how “close” a
phrase or a single term is to an abstract concept. Let us denote by P a matrix of size
t × (p+t), where t is the number of frequent terms and p is the number of frequent
phrases. Having the P matrix and the i-th column vector of the SVD’s U matrix, a
vector m_i of cosines of the angles between the i-th abstract concept vector and the
phrase vectors can be calculated:
m_i = U_i^T P
The phrase that corresponds to the maximum component of the m_i vector should be
selected as the human-readable description of the i-th abstract concept.
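The label selection step can be sketched in Python for a single abstract concept. The sketch takes the concept vector (one column of U) and the phrase vectors (the columns of P) as plain lists in term space, normalizes them so that the dot products are cosines, and picks the phrase with the largest cosine. The function names and the toy vectors in the usage note are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def best_label(concept, phrase_vectors, phrase_labels):
    """Return the phrase closest to the concept, plus all cosines (the vector m_i)."""
    u = normalize(concept)
    cosines = [
        sum(a * b for a, b in zip(u, normalize(phrase)))
        for phrase in phrase_vectors
    ]
    best = max(range(len(cosines)), key=cosines.__getitem__)
    return phrase_labels[best], cosines
```

With terms ordered as (java, class, movie), a concept vector [0.7, 0.7, 0.1] and the phrases "java class" = [1, 1, 0], "movie" = [0, 0, 1] and "java" = [1, 0, 0], the phrase "java class" has the largest cosine and becomes the label.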
4.6.4.4 Cluster Content Discovery
In the cluster content discovery phase, the classic Vector Space Model is used to
assign the input documents to the cluster labels induced in the previous phase. In a
way, we re-query the input document set with all induced cluster labels. The
assignment process resembles document retrieval based on the VSM. Let us
define matrix Q, in which each cluster label is represented as a column vector. Let C
= Q^T A, where A is the original term-document matrix for the input documents. This way,
element c_ij of the C matrix indicates the strength of membership of the j-th
document in the i-th cluster. A document is added to a cluster if c_ij exceeds some
threshold, which is yet another control parameter of the algorithm. Documents not assigned
to any cluster end up in an artificial cluster called Others.
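The assignment rule can be sketched in Python as follows. The columns of Q and A are passed as lists of term-space vectors, assumed to be length-normalized so that each c_ij behaves like a cosine. The function name, the toy vectors and the threshold value in the usage note are hypothetical.

```python
def assign_documents(label_vectors, document_vectors, labels, threshold):
    """Assign document indices to clusters using c_ij = (Q^T A)_ij.

    label_vectors:    columns of Q, one term-space vector per cluster label
    document_vectors: columns of A, one term-space vector per document
    """
    clusters = {label: [] for label in labels}
    clusters["Others"] = []
    for j, document in enumerate(document_vectors):
        assigned = False
        for i, label_vector in enumerate(label_vectors):
            c_ij = sum(q * a for q, a in zip(label_vector, document))
            if c_ij > threshold:
                clusters[labels[i]].append(j)
                assigned = True
        if not assigned:
            clusters["Others"].append(j)   # matched no label
    return clusters
```

Note that a document can land in more than one cluster when it exceeds the threshold for several labels, and that lowering the threshold shrinks the Others cluster at the cost of looser groups.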
4.7 Presentation Engine
This module is responsible for displaying and painting the results for the user
browser. It uses the Adapter pattern to abstract the search part from the display
part.
4.7.1 Search Results Tab creator
This interface creates a tab, and each type of tab can be implemented in a separate
class. The most important functions are written in the base class, so whenever a tab
needs to behave differently, a small subclass can easily be written.
4.8 View
The clustered results and the socially relevant search results are then shown to the
end user in a tabbed format, which allows the user to easily find the appropriate
content. The view uses MooTools [22], an open-source JavaScript
framework, which enables it to be browser agnostic.
The following chart shows a performance comparison of MooTools with other
JavaScript frameworks. Its performance, along with its ease of use, makes it one
of the preferred choices.
Figure 7 Performance comparison of various JavaScript frameworks (source: Blog)
5. Evaluation Framework
Ferret combines many old ideas and some new ones into a new,
exciting product. Hence evaluation of such a system is critical. The evaluation falls
under three broad categories:
• Social network based relevance
• Performance
• Comparison to other contemporary search engines
5.1 Social Network Simulation
Ferret needs the social network to provide information about a user and his friends
so that it can produce and maintain socially relevant search results. Though a
Facebook engine is ready, the Facebook authentication system requires a static IP or a URL
to work with. Due to this limitation, it became necessary to simulate the social
network. The following section describes a simple social network simulation.
5.1.1 ER Diagram
Ferret presently simulates a social network to implement Social Search. The
database model used is as follows:
[ER diagram: three tables, each marked with a primary key (PK):
ufl_userfriends (uid, fid), uhb_userhobbies (uid, hobbies),
usk_usersearchpage (uid, pageid)]
Figure 8 Database Model for Social Network
Table ufl_userfriends:
Column Name    Description
uid            Refers to usr_user.uid
fid            Refers to usr_user.uid

Table uhb_userhobbies:
Column Name    Description
uid            Refers to usr_user.uid
hobbies        User hobby name

Table usk_usersearchpage:
Column Name    Description
uid            Refers to usr_user.uid
pageid         Refers to puk_pagekeyword.pageid
Ferret does not currently use the uhb_userhobbies table in the simulation. It could
be used to take the hobbies of the user's friends into account when recommending
social search results to the user.
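To make the role of these tables concrete, the lookup behind a social search can be sketched in memory as follows. This is only an illustration under stated assumptions: the class SocialNetworkSim and its methods are hypothetical, the two maps stand in for the ufl_userfriends and usk_usersearchpage tables, a friendship row is treated as one-directional, and the filtering of pages by search keyword (via puk_pagekeyword) is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the simulated social network tables.
class SocialNetworkSim {
    // Mirrors ufl_userfriends: uid -> set of fid.
    private final Map<Integer, Set<Integer>> friendsByUser = new HashMap<Integer, Set<Integer>>();
    // Mirrors usk_usersearchpage: uid -> set of pageid.
    private final Map<Integer, Set<Integer>> pagesByUser = new HashMap<Integer, Set<Integer>>();

    // Adds one (uid, fid) row. Assumption: rows are one-directional.
    void addFriend(int uid, int fid) {
        put(friendsByUser, uid, fid);
    }

    // Adds one (uid, pageid) row.
    void addSearchPage(int uid, int pageid) {
        put(pagesByUser, uid, pageid);
    }

    // Returns the pages recorded for any friend of the given user,
    // in insertion order and without duplicates.
    Set<Integer> socialPages(int uid) {
        Set<Integer> pages = new LinkedHashSet<Integer>();
        for (Integer fid : get(friendsByUser, uid)) {
            pages.addAll(get(pagesByUser, fid));
        }
        return pages;
    }

    private static void put(Map<Integer, Set<Integer>> table, int key, int value) {
        Set<Integer> values = table.get(key);
        if (values == null) {
            values = new LinkedHashSet<Integer>();
            table.put(key, values);
        }
        values.add(value);
    }

    private static Set<Integer> get(Map<Integer, Set<Integer>> table, int key) {
        Set<Integer> values = table.get(key);
        return values == null ? Collections.<Integer>emptySet() : values;
    }
}
```

In a database-backed version the same lookup would be a join between the two tables on the friend's uid; the maps are used here only to keep the sketch self-contained.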
5.2 JMeter
The performance of a search engine is critical. JMeter is an open-source tool that
can simulate multiple clients sending POST requests [23, 24], and it can also load
test the application. Ferret was tested using JMeter and various performance
statistics were collected. This section provides details on the test cases.
5.2.1 Test Cases
The first screenshot shows the JMeter test plan setup screen. The test plan is
called Ferret Testplan. The second screenshot shows the parameter to be passed,
namely the search query, and the type of HTTP request to be sent, for example POST
or GET. The third screenshot shows the expected load (number of users), the number
of times each query is executed, and the gap between consecutive queries.
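The three load settings described above (number of users, repetitions per user, and the gap between queries) can be illustrated with a small stand-alone driver. This is not JMeter and not how JMeter is implemented; it is a hypothetical sketch (the class LoadSketch and its parameters are invented for illustration) that only shows what the settings mean. The request is passed in as a Runnable, so no particular URL or HTTP client is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical load driver: each simulated user runs on its own thread.
class LoadSketch {
    // Runs `users` (must be >= 1) concurrent clients. Each client issues the
    // request `loops` times and sleeps `gapMillis` after every request.
    // Returns one elapsed time in nanoseconds per request.
    static List<Long> run(int users, final int loops, final long gapMillis,
                          final Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Future<List<Long>>> pending = new ArrayList<Future<List<Long>>>();
            for (int u = 0; u < users; u++) {
                pending.add(pool.submit(new Callable<List<Long>>() {
                    public List<Long> call() throws InterruptedException {
                        List<Long> nanos = new ArrayList<Long>();
                        for (int i = 0; i < loops; i++) {
                            long start = System.nanoTime();
                            request.run();
                            nanos.add(System.nanoTime() - start);
                            Thread.sleep(gapMillis);
                        }
                        return nanos;
                    }
                }));
            }
            // Wait for every simulated user and merge the timings.
            List<Long> all = new ArrayList<Long>();
            for (Future<List<Long>> f : pending) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a simulated user failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A real test plan adds much more (ramp-up, assertions on the response, reporting), which is why a dedicated tool such as JMeter was used for the actual measurements.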
5.3 Comparison to Other search engines
An important component of a search engine's performance is the quality of the
results it returns for a particular query. Such an evaluation is very subjective. To
compare Ferret's results with those of contemporary search engines, a survey was
used.
6. Testing And Results
6.1 Prototype system
We have built a prototype system for the demo using the hardware and software
listed in the following sections.
6.1.1 Software
• Java 1.6
• Eclipse IDE
• J2EE 1.4
• Apache Tomcat 5.5
• MySQL 5.0
• Clustering Algorithms (Developed by us)
• Mootools
• Multibox
• MySQL JDBC Connector
• JUnit 4.4
• Open Source Web / REST APIs for Google, IMDB, Facebook etc.
6.1.2 Hardware
We need only simple commodity hardware, since this is a proof of concept rather
than a live system. Currently, a desktop PC with a browser and Internet connectivity
suffices. We primarily develop on our laptops.
6.1.3 Operating system
The primary development and test platforms are:
• Windows 98/XP/Vista
• Mac OS X 10.5.5 (Leopard)
Most of the technologies we use are completely portable, however, so Ferret should
run on most systems that support Java.
6.2 Results
We collected results using memcached and Tomcat. For any search engine, response
times are very important. Since we use Google as our search provider, our response
times can never be better than Google's. Each tab is built on a separate thread, and
the page is constructed in parallel.
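Building each tab on its own thread can be sketched with a thread pool as shown below. This is a minimal illustration under assumptions, not Ferret's code: the TabSource interface and the ParallelPageBuilder class are hypothetical names, and each tab is reduced to a function from the query to a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical: one implementation per tab (web, media, product, ...).
interface TabSource {
    String fetch(String query);
}

class ParallelPageBuilder {
    // Starts every tab on its own pool thread, then collects the results
    // in the order the tabs were registered.
    static Map<String, String> build(final String query, Map<String, TabSource> sources) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<String, Future<String>>();
            for (Map.Entry<String, TabSource> entry : sources.entrySet()) {
                final TabSource source = entry.getValue();
                pending.put(entry.getKey(), pool.submit(new Callable<String>() {
                    public String call() {
                        return source.fetch(query);
                    }
                }));
            }
            // get() blocks until that tab is ready; tabs still run concurrently.
            Map<String, String> page = new LinkedHashMap<String, String>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                page.put(entry.getKey(), entry.getValue().get());
            }
            return page;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while building the page", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a tab failed to build", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With this arrangement the page takes roughly as long as its slowest tab rather than the sum of all tabs, which matters when every tab waits on a different remote provider.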
6.2.1 Without Memcached
Without memcached, repeating the same query takes approximately the same
response time on every request. This is because the entire result set is constructed
all over again for every request, even when the query is unchanged.
6.2.2 With Memcached
Memcached improves performance, but a small amount of time is still spent on the
social results, which are never cached. Because they are stored locally in Ferret's
own database, however, they are not the bottleneck; the bottleneck is the remote
servers and the clustering system. When the user is not logged in, the page is
constructed entirely from cached results. The thread pools are not interrupted, and
thus the performance is very high.
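The caching behaviour described here follows the usual cache-aside pattern, which can be sketched as follows. This is an illustration under assumptions only: a ConcurrentHashMap stands in for memcached (no real memcached client API is used), and the names CachedSearch, SharedResultBuilder and SocialResultBuilder are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical: builds the expensive, user-independent part of the page
// (remote search providers plus clustering). Must not return null.
interface SharedResultBuilder {
    String build(String query);
}

// Hypothetical: builds the per-user social part from the local database.
interface SocialResultBuilder {
    String build(String user, String query);
}

class CachedSearch {
    // Stand-in for memcached: query -> cached shared results.
    private final Map<String, String> cache = new ConcurrentHashMap<String, String>();
    private final SharedResultBuilder shared;
    private final SocialResultBuilder social;

    CachedSearch(SharedResultBuilder shared, SocialResultBuilder social) {
        this.shared = shared;
        this.social = social;
    }

    // `user` is null when nobody is logged in; `query` must not be null.
    String search(String user, String query) {
        String page = cache.get(query);
        if (page == null) {
            // Cache miss: do the expensive work once and remember it.
            // (Two concurrent misses may both build; the result is the same.)
            page = shared.build(query);
            cache.put(query, page);
        }
        if (user == null) {
            return page; // served entirely from the cache
        }
        // Social results depend on the user, so they are never cached.
        return page + social.build(user, query);
    }
}
```

A real memcached deployment would also set an expiry time on each entry so that stale results are eventually rebuilt; the map above never evicts anything.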
6.2.3 Load Response with Memcached
The graph shows the response time of the Ferret system when multiple concurrent
users search for the same query. It is evident that we need a server, or a host of
servers, to handle multiple concurrent users.
7. Future Work
There are a lot of changes that we dream of, and we have a long way to go. This
serves as a good demo tool, but not a final product. Following are some things we
have planned for Ferret.
o Using our summer vacation to build on it
o Notion of Social Rank
o Adding blogs, forums, reservations, and email search to the search results
o Using the Digg interface to re-rank sites
o Learning better categories
o And the list goes on...
8. Conclusions
This was a very good learning experience. One of the most important things we
learnt was how to develop an idea and get a working prototype. From our
perspective, there are two navigation paradigms on the Web – Search and Browse.
Search lets you find specific bits of information quickly or navigate to sites you
already know. Browse gives you a more immersive way to explore a topic so that
you can learn more about something or discover something new. Ferret is about
reinventing Browse just as Google reinvented Search.
9. Bibliography
1. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998. 30(1-7): p. 107-117.
2. Page, L. and S. Brin. Google. Available from: http://www.google.com.
3. Wikipedia - The Free Encyclopedia. Available from: http://www.wikipedia.com.
4. Liad Agmon, A.y., Sagie Davidovitch (co-founders), Delver.
5. Mislove, A., K. Gummadi, and P. Druschel. Exploiting Social Networks for Internet Search. 2006.
6. Chen, H. and S. Dumais. Bringing order to the Web: automatically categorizing search results. 2000: ACM Press New York, NY, USA.
7. Thet, T., J. Na, and C. Khoo, Automatic Classification of Web Search Results: Product Review vs. Non-review Documents. Lecture Notes in Computer Science, 2007. 4822: p. 65.
8. Vogel, D., et al., Classifying search engine queries using the web as background knowledge. SIGKDD Explor. Newsl., 2005. 7(2): p. 117-122.
9. Yeung, A., N. Gibbins, and N. Shadbolt, A k-Nearest-Neighbour Method for Classifying Web Search Results with Data in Folksonomies. 2008.
10. Venky Harinarayan, A.R.C.-f. Kosmix. Available from: http://www.kosmix.com.
11. Read Write Web - Top 10 Alternative Search Engines of 2008. Available from: http://www.readwriteweb.com/archives/top_10_alternative_search_engi.phps.
12. Buschmann, F., Pattern-oriented software architecture: a system of patterns. 2002: Wiley.
13. Buschmann, F., K. Henney, and D. Schmidt, Pattern-oriented software architecture. 1996: Wiley New York.
14. Fitzpatrick, B., Distributed caching with memcached. Linux Journal, 2004. 2004(124).
15. Interactive, D., Memcached. 2006.
16. Bezos, J. A9 - Amazon's Search Engine. Available from: http://www.a9.com.
17. Needham, C. Internet Movie Database. Available from: http://www.imdb.com.
18. Wong, S., W. Ziarko, and P. Wong. Generalized vector spaces model in information retrieval. 1985: ACM New York, NY, USA.
19. Manber, U. and G. Myers. Suffix arrays: A new method for on-line string searches. 1990: Society for Industrial and Applied Mathematics Philadelphia, PA, USA.
20. Golub, G. and C. Reinsch, Singular value decomposition and least squares solutions. Numerische Mathematik, 1970. 14(5): p. 403-420.
21. Osinski, S., J. Stefanowski, and D. Weiss. Lingo: Search results clustering algorithm based on singular value decomposition. 2004: Springer.
22. Proietti, V., MooTools - the compact JavaScript framework.
23. Foundation, A., Apache JMeter.
24. Hansen, K., Load Testing your Applications with Apache JMeter. Java Boutique Internet, http://javaboutique.internet.com/tutorials/JMeter/, as viewed November, 2004.
9. Appendix
Figure 9: Home Page
Searching for ‘metallica’ in Ferret…
Figure 10: Search Query
Search (Web) results for ‘metallica’ in Ferret.
Figure 11: Search Results Page > Web Search Tab
Media results for ‘metallica’ in Ferret. Results are clustered using the STC and
Lingo algorithms.
Figure 12: Search Results Page: Media Results Tab
Playing ‘Nothing Else Matters’ video.
Figure 13: Search Results Page > Media Results Tab > Media Player
Product results for ‘metallica’ in Ferret. Results are clustered by seed-list based
clustering.
Figure 14: Search Results Page > Product Search
Social results for ‘metallica’ in Ferret. No results shown since user is not logged in.
Figure 15: Search Results Page > Social Search (Not Logged in)
User logs into Ferret to see social search results.
Figure 16: Ajax Login Option
User’s friend has already searched for ‘metallica’ and his favorite ‘metallica’ pages are
displayed.
Figure 17: Search Results Page > Social Results Tab (Logged in)
Search (Web) results for ‘ipl’ in Ferret. User ‘ketan’ is logged in and clicks on a URL.
Figure 18: Search Results Page > Web Search > On Clicking a Query
User ‘ketan’ logs out and ‘praful’ logs in and searches for ipl again.
Figure 19: When a Friend Logs in!
Social results for ‘ipl’ display the URL ketan had liked when he searched for ipl.
Figure 20: The Socially Relevant Query turns up on Friends Page
Search (Web) results for ‘yoyo’ in Ferret. Dictionary Search is able to get a definition
for ‘yoyo’
Figure 21: Wikipedia, WordNet Dictionary search, Images from Yahoo and Google