Ferret - 360° Search: Story of development of an Advanced Internet Application

Outline

Motivation

Is it gonna be the same old boring presentation? Surprise!!

Existing systems...

What’s different?

How is it done...

Evaluation

What we dream about...

Motivation

What would you say is the biggest invention of the modern era?

Web or the Search Engine???

I feel like pulling my hair at times...

Clicking back and forth...

Modifying the search query....

Do we always know what we are looking for?

We thought how cool would it be to build it...

The Showdown

Google

Ferret

How is it done...

Architecture

Extensible Design

High Performance

Clustering

Layout

Rich Client AJAX

Good Page Bad Page

Session Management

Architecture

Extensibility: Pattern-oriented architecture

Front Controller: Centralized request processing

Abstract Factories: Ease of adding modules

Strategy: Change Algorithm with ease

Adapter: Page Painter can be modified with one class

Singleton: Performance boost

Spring (OS) Doors: abstracted interfaces to hide local and remote computing
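As an illustration of the Strategy bullet above, here is a minimal sketch of how an algorithm could be swapped without touching the code that uses it. The names `PageBuilder`, `by_first_letter` and `single_group` are hypothetical and do not come from Ferret; the grouping rules are toy placeholders, not the clustering Ferret performs.

```python
from typing import Callable, Dict, List

# A strategy is any function that groups result titles under labels.
ClusterStrategy = Callable[[List[str]], Dict[str, List[str]]]


def by_first_letter(results: List[str]) -> Dict[str, List[str]]:
    """Toy strategy: group titles by their first letter."""
    groups: Dict[str, List[str]] = {}
    for title in results:
        groups.setdefault(title[0].upper(), []).append(title)
    return groups


def single_group(results: List[str]) -> Dict[str, List[str]]:
    """Toy strategy: put every title in one group."""
    return {"All": list(results)}


class PageBuilder:
    """Hypothetical consumer that only knows the strategy interface."""

    def __init__(self, strategy: ClusterStrategy):
        self.strategy = strategy

    def build_page(self, results: List[str]) -> Dict[str, List[str]]:
        # The algorithm is chosen at construction time, not hard-coded here.
        return self.strategy(results)
```

Switching algorithms is then a one-argument change, for example `PageBuilder(single_group)` instead of `PageBuilder(by_first_letter)`.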

High Performance

Thread Pools

Threads are pre-spawned, and work is allocated to them

Caching can be easily added (memcache)

Search Controller can query the fast cache before constructing the page
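The thread pool and cache-first lookup described above could be sketched roughly as follows. This is an assumption-laden illustration: `fetch_web`, `fetch_images` and this `SearchController` class are hypothetical stand-ins, a plain dict plays the role of memcache, and Python's `ThreadPoolExecutor` creates its workers on demand rather than strictly pre-spawning them.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the real back ends; names are illustrative only.
def fetch_web(query):
    return f"web results for {query}"


def fetch_images(query):
    return f"image results for {query}"


SOURCES = {"web": fetch_web, "images": fetch_images}


class SearchController:
    def __init__(self, workers=4):
        # Worker threads are owned by the pool and reused across requests.
        self.pool = ThreadPoolExecutor(max_workers=workers)
        # A plain dict stands in for an external cache such as memcache.
        self.cache = {}

    def search(self, query):
        # Query the fast cache before constructing the page.
        if query in self.cache:
            return self.cache[query]
        # Hand one task per source to the pool, then collect the results.
        futures = {name: self.pool.submit(fn, query) for name, fn in SOURCES.items()}
        page = {name: future.result() for name, future in futures.items()}
        self.cache[query] = page
        return page
```

A second call with the same query returns the cached page without touching the pool.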

Clustering Results

Background

Vector space model of documents

Conversion of textual data to numeric data in multidimensional space

Each term is a separate dimension

Example vector for a document

$V_d = (R_i)\,td_i + (R_j)\,td_j + \dots + (R_k)\,td_k$

$R_i$: relevance of term $td_i$ to doc $d$

Term-document matrix ($t \times d$) for multiple documents

Calculating relevance

Term frequency $T_f$

Inverse document frequency $ID_f$

Relevance: $\mathrm{TFIDF} = T_f \times ID_f$

$ID_f$ diminishes the weight of terms that occur very frequently
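A small sketch of the relevance calculation above, using one common variant of the weighting. The exact formulas are assumptions, since the slides do not spell them out: term frequency is the count divided by document length, and inverse document frequency is `log(N / df)`. The function name `tfidf` and the whitespace tokenizer are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Weight = (term frequency) x (inverse document frequency).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors
```

With this variant, a term that appears in every document gets weight zero, which is the diminishing effect of the inverse document frequency.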

Suffix arrays and Singular value decomposition

Compare sequences of entities

In our case, phrases or words

Reducing term frequency by removing noisy components

SVD: Singular value decomposition

Dimensionality reduction technique

Discovering orthogonal vectors corresponding to topics present

Lingo Algorithm

Primitive way?

The Lingo way?

Extract frequent phrases – potential labels

SVD of TD matrix – diverse topics

Matching phrases with extracted topics – select label

Assigning documents to labels

Preprocessing and Phrase Extraction

Preprocessing

Removal of HTML tags, stop words, non-letter characters

Frequent phrase extraction – Suffix arrays

Candidate phrase properties

Term frequency threshold

Does not cross sentence boundaries

Complete phrase
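To make the candidate phrase properties concrete, here is a simplified sketch that counts word n-grams directly instead of building a suffix array, so it is a stand-in for the technique named on the slide, not an implementation of it. It enforces only the frequency threshold and the sentence-boundary rule; the "complete phrase" check and stop-word removal are left out. The function name and the parameters `min_count` and `max_len` are hypothetical.

```python
import re
from collections import Counter


def frequent_phrases(text, min_count=2, max_len=3):
    """Return {phrase: count} for phrases seen at least min_count times."""
    counts = Counter()
    # Split on sentence punctuation so phrases never cross a boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Keep letters only, dropping digits and other non-letter characters.
        words = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Apply the term frequency threshold.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}
```

For real result sets a suffix array finds the same repeated phrases without enumerating every n-gram, which is why the slides name it.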

Cluster Label Induction

Building term document matrix

Abstract concept discovery using SVD

U matrix – abstract concepts

First k vectors – determine number of clusters

Frobenius coefficient to estimate value of k

Cosine similarity between abstract concepts and phrases

Deciding best label for abstract concept

Cluster content discovery

Strength of membership for a document to a cluster label

Sorted for display based on the distance
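The label selection and document assignment steps above can be sketched with plain cosine similarity. This assumes the abstract concept vectors have already been obtained from the SVD, which is not shown here. The function names, the `threshold` value and the vectors in the usage note are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_label(concept, phrases):
    """Pick the phrase whose vector is closest to the abstract concept."""
    return max(phrases, key=lambda name: cosine(concept, phrases[name]))


def assign(docs, label_vec, threshold=0.5):
    """Return (doc, score) pairs at or above the threshold, strongest first."""
    scored = [(name, cosine(vec, label_vec)) for name, vec in docs.items()]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])
```

For example, `best_label([1.0, 0.0], {"web search": [0.9, 0.1], "pet care": [0.1, 0.9]})` picks `"web search"`, and `assign` then orders the member documents for display by their score.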

Layout

Web

Wikipedia, Images, Definition, News.....

Media

Shopping

Social

Good Page-Bad Page

Used for socially relevant search results

What does the user “Digg”?

Sessions

HeartBeat Messages

Listen to user clicks

Sessions

Optional, needed if user wishes socially relevant results

Used to implement private browsing

Sessions used to identify user and maintain state

User ID

Query

Page URL

Timestamp

Listen to user clicks

A request sent to Ferret each time user selects a page to browse

Used to associate keyword and page with user id for future processing

Algorithm

1. If session is invalid

2. return

3. Else if no timestamp exists in session

3.1. Insert URL, Timestamp, Keyword into session

3.2. return
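A minimal sketch of the click-listening steps listed above. Several things are assumptions: the session is modelled as a plain dict, an invalid session is modelled as `None`, and the function name `on_click` and the `now` parameter are hypothetical. The slides do not say what happens when a timestamp is already present, so that case is left as a no-op.

```python
import time


def on_click(session, url, keyword, now=None):
    """Record the clicked page in the session, following the listed steps."""
    # An invalid session (modelled here as None) is ignored.
    if session is None:
        return
    # Record the click only if no timestamp is stored yet.
    if "timestamp" not in session:
        session["url"] = url
        session["timestamp"] = time.time() if now is None else now
        session["keyword"] = keyword
        return
    # The source does not specify the case where a timestamp already exists.
```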

HeartBeat Messages

A request sent to Ferret every time the search results are reloaded

Used to detect if a user likes the page just seen

Algorithm

1. If session is invalid

1.1. return

2. Else if session has no timestamp

2.1. return

3. Else if page is liked by user

3.1. Page-keyword combination is associated with user’s id as a favorite

3.2. return

4. Else if page is disliked by user

4.1. Remove page-keyword association with user

4.2. return
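The heartbeat handling above might look roughly like this. How Ferret decides that a page is liked or disliked is not stated in the slides, so the sketch takes that verdict as an input (`True`, `False`, or `None` for undecided). The `favorites` store, the session layout and the function name `on_heartbeat` are hypothetical.

```python
def on_heartbeat(session, favorites, user_id, liked):
    """Update a user's favorites from a heartbeat; liked is True/False/None."""
    # An invalid session (modelled here as None) is ignored.
    if session is None:
        return
    # Without a recorded click there is no page to judge.
    if "timestamp" not in session:
        return
    pair = (session["keyword"], session["url"])
    if liked is True:
        # Associate the page-keyword combination with the user as a favorite.
        favorites.setdefault(user_id, set()).add(pair)
        return
    if liked is False:
        # Remove the page-keyword association, if any.
        favorites.get(user_id, set()).discard(pair)
        return
```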

Plan for evaluation

Specially created load generators

Commercially available profiling software

JProbe by Quest Software

JVMStat by Sun

What we dream of...

Using up our summer vacation to build on it

Notion of Social Rank

Adding blogs, forums, reservations, email search to search results

Using Digg interface to re-rank sites

Learning better categories

And the list goes on...

User Comments

“A definite startup idea” - Prof. (Dr.) Richard Vuduc

“Definitely launch it” - Santi Ontanon, Post Doc (GVU)

“Are you kidding me.....” - Ravi Sastry, PhD Student

..... What do you guys think? :)

Questions / Suggestions / Critiques
