Story of development of an Advanced Internet Application
Motivation
Is it gonna be the same old boring presentation? Surprise!!
Existing systems...
What’s different?
How is it done...
Evaluation
What we dream about...
What would you say is the biggest invention of the modern era?
Web or the Search Engine???
I feel like pulling my hair at times...
Clicking back and forth...
Modifying the search query....
Do we always know what we are looking for?
We thought how cool would it be to build it...
Ferret
Architecture
Extensible Design
High Performance
Clustering
Layout
Rich Client AJAX
Good Page Bad Page
Session Management
Extensibility: Pattern-Oriented Architecture
Front Controller: Centralized request processing
Abstract Factories: Ease of adding modules
Strategy: Change Algorithm with ease
Adapter: Page Painter can be modified with one class
Singleton: Performance boost
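As a rough illustration of the Strategy item above, the sketch below swaps clustering algorithms behind one interface. The names `ClusteringStrategy`, `SingleCluster`, `ByFirstWord` and `SearchController` are hypothetical and chosen for this example only; they are not Ferret's actual classes.

```python
from abc import ABC, abstractmethod


class ClusteringStrategy(ABC):
    """Hypothetical interface: any clustering algorithm plugs in here."""

    @abstractmethod
    def cluster(self, snippets):
        """Map a list of snippet strings to {label: [snippets]}."""


class SingleCluster(ClusteringStrategy):
    """Trivial strategy: everything lands in one group."""

    def cluster(self, snippets):
        return {"All results": list(snippets)}


class ByFirstWord(ClusteringStrategy):
    """Toy strategy: group snippets by their lowercased first word."""

    def cluster(self, snippets):
        groups = {}
        for snippet in snippets:
            words = snippet.split()
            key = words[0].lower() if words else ""
            groups.setdefault(key, []).append(snippet)
        return groups


class SearchController:
    """Holds a strategy; the algorithm can change without touching callers."""

    def __init__(self, strategy):
        self.strategy = strategy

    def handle(self, snippets):
        return self.strategy.cluster(snippets)
```

Swapping the algorithm is then a single assignment, for example `controller.strategy = SingleCluster()`.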
Spring (OS) Doors: abstracted interfaces to hide local and remote computing
Thread Pools
Threads are pre-spawned, and work is allocated
Caching can be easily added (memcache)
Search Controller can query the fast cache before constructing the page
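A minimal sketch of the idea, assuming a plain dict as a stand-in for memcache and Python's standard thread pool. Note that `ThreadPoolExecutor` creates workers on demand up to `max_workers` and then reuses them, which is close to, but not the same as, strictly pre-spawned threads. The functions `build_page`, `handle_search` and `submit_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for memcache: a plain dict keeps the sketch self-contained.
page_cache = {}

# Worker threads are created by the pool (up to 4) and reused across requests.
pool = ThreadPoolExecutor(max_workers=4)


def build_page(query):
    """Placeholder for the expensive search + clustering + layout step."""
    return "<results for %s>" % query


def handle_search(query):
    """Check the fast cache first; build and store the page only on a miss."""
    cached = page_cache.get(query)
    if cached is not None:
        return cached
    # Two concurrent misses may both build the page; the result is the same.
    page = build_page(query)
    page_cache[query] = page
    return page


def submit_search(query):
    """Hand the request to a pooled worker; returns a Future."""
    return pool.submit(handle_search, query)
```

A caller would use `submit_search("ferret").result()` to wait for the page.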
Background
Vector Space model documents
Conversion of textual data to numeric data in multidimensional space
Each term is a separate dimension
Example vector for a document
$V_d = (R_i)\,td_i + (R_j)\,td_j + \dots + (R_k)\,td_k$
$R_i$: relevance of term $td_i$ to doc $d$
Term Document Matrix t x d for multiple documents
Term frequency $T_f$
Inverse Document Frequency $ID_f$
Relevance TFIDF $= T_f \times ID_f$
$ID_f$ diminishes the weight of terms that occur very frequently
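To make the weighting concrete, here is a small sketch that builds a term-document matrix of TFIDF values. It assumes raw counts for term frequency and `log(N / df)` for inverse document frequency; the slides do not say which variants Ferret uses, and `tfidf_matrix` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (terms, matrix) with matrix[t][d] = tf * idf for term t in doc d."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({w for words in tokenized for w in words})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for words in tokenized for w in set(words))
    matrix = []
    for term in terms:
        idf = math.log(n_docs / df[term])  # assumed IDF variant
        matrix.append([words.count(term) * idf for words in tokenized])
    return terms, matrix
```

A term that appears in every document gets an IDF of zero under this variant, so its whole row is zero: it carries no weight for telling documents apart.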
Suffix arrays
Compare sequence of entities
In our case, phrases or words
Reducing term frequency by removing noisy components
SVD: Singular Value Decomposition
Dimensionality reduction technique
Discovering orthogonal vectors corresponding to topics present
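For reference, the decomposition behind these bullets can be written out as below. The symbols $A$ (the $t \times d$ term-document matrix), $k$ and the threshold $q$ are notation assumed here; the Frobenius-norm rule for picking $k$ is one common criterion and not necessarily the exact one Ferret uses.

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \qquad
\|A\|_F = \sqrt{\sum_{i}\sum_{j} a_{ij}^{2}} = \sqrt{\sum_{i} \sigma_i^{2}}

k = \min \left\{ k' : \frac{\|A_{k'}\|_F}{\|A\|_F} \ge q \right\}
```

The columns of $U$ are mutually orthogonal, so the first $k$ of them can be read as $k$ distinct topics in term space.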
Primitive way?
The Lingo way?
Extract frequent phrases – potential labels
SVD of TD matrix – diverse topics
Matching phrases with extracted topics – select label
Assigning documents to labels
Preprocessing
Removal of HTML tags, stop words, non-letter characters
Frequent phrase extraction – Suffix arrays
Candidate phrases properties
Term frequency threshold
Does not cross sentence boundaries
Complete phrase
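The sketch below shows one way a word-level suffix array can surface candidate phrases. It applies the frequency threshold and the sentence-boundary rule, but leaves out the "complete phrase" check. The function names, the `"."` boundary token and the default parameters are assumptions made for this example.

```python
def suffix_array(words):
    """Indices of all suffixes of `words`, sorted lexicographically."""
    return sorted(range(len(words)), key=lambda i: words[i:])


def frequent_phrases(words, min_freq=2, max_len=3, boundary="."):
    """Count phrases of 1..max_len words occurring at least min_freq times.

    Equal phrases are prefixes of neighbouring suffixes in the sorted
    array, so one linear pass per phrase length finds every repeat.
    """
    sa = suffix_array(words)
    result = {}

    def record(phrase, count):
        # Keep phrases that meet the threshold and stay inside one sentence.
        if phrase is not None and count >= min_freq and boundary not in phrase:
            result[" ".join(phrase)] = count

    for n in range(1, max_len + 1):
        run_phrase, run_count = None, 0
        for i in sa:
            phrase = tuple(words[i:i + n])
            if len(phrase) < n:
                phrase = None  # suffix too short to hold an n-word phrase
            if phrase is not None and phrase == run_phrase:
                run_count += 1
            else:
                record(run_phrase, run_count)
                run_phrase, run_count = phrase, 1
        record(run_phrase, run_count)
    return result
```

Sorting whole suffixes like this is quadratic in the worst case; it keeps the sketch short, whereas a production system would use a dedicated suffix-array construction.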
Building term document matrix
Abstract concept discovery using SVD
U matrix – abstract concepts
First k vectors – determines number of clusters
Frobenius coefficient to estimate value of k
Cosine similarity between abstract concepts and phrases
Deciding best label for abstract concept
Strength of membership for a document to a cluster label
Sorted for display based on the distance
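The last few steps can be sketched as follows, assuming concepts, phrases and documents are all plain lists of numbers in the same term space. The helper names and the `threshold` value are hypothetical; the slides do not give Ferret's actual cut-off.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_label(concept, phrase_vectors):
    """Pick the phrase whose vector is closest (by cosine) to the concept."""
    return max(phrase_vectors, key=lambda p: cosine(concept, phrase_vectors[p]))


def assign_documents(label_vector, doc_vectors, threshold=0.2):
    """Return (doc_id, score) pairs at or above the threshold, strongest first."""
    scored = [(d, cosine(label_vector, v)) for d, v in doc_vectors.items()]
    kept = [(d, s) for d, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The score doubles as the strength of membership, so sorting by it gives the display order within a cluster.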
Web
Wikipedia, Images, Definition, News.....
Media
Shopping
Social
Used for socially relevant search results
What does the user “Digg”?
Sessions
HeartBeat Messages
Listen to user clicks
Optional, needed if user wishes socially relevant results
Used to implement private browsing
Sessions used to identify user and maintain state
User ID
Query
Page URL
Timestamp
A request sent to Ferret each time user selects a page to browse
Used to associate keyword and page with user id for future processing
Algorithm
1. If session is invalid
1.1. return
2. Else if no timestamp exists in session
2.1. Insert URL, Timestamp, Keyword into session
2.2. return
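A minimal sketch of this click handler, assuming the session is a plain dict and that `None` stands for an invalid session. The function name, the field names and the `now` parameter are hypothetical.

```python
import time


def handle_click(session, url, keyword, now=None):
    """Record the clicked page in the session.

    Returns True only when a new click record was stored.
    """
    if session is None:  # invalid session: nothing to do
        return False
    if "timestamp" not in session:  # no click recorded yet
        session["url"] = url
        session["timestamp"] = time.time() if now is None else now
        session["keyword"] = keyword
        return True
    # The slide lists no further branch, so an existing record is left as-is.
    return False
```

Passing `now` explicitly makes the handler easy to test; in normal use it defaults to the current time.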
A request sent to Ferret every time for a search result reload
Used to detect if a user likes the page just seen
Algorithm
1. If session is invalid
1.1. return
2. Else if session has no timestamp
2.1. return
3. Else if page is liked by user
3.1. Page-keyword combination is associated with user’s ID as a favorite
3.2. return
4. Else if page is disliked by user
4.1. Remove page-keyword association with user
4.2. return
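The reload steps can be sketched in the same style. This assumes a dict session (`None` when invalid) and a `favorites` dict mapping a user ID to a set of (keyword, URL) pairs. How a page is judged liked or disliked is not specified in the slides, so the sketch takes that verdict as an argument; all names are hypothetical.

```python
def handle_reload(session, favorites, verdict):
    """Update the user's favorites when the result page reloads.

    `verdict` is "liked", "disliked" or None (no decision).
    """
    if session is None:  # invalid session
        return
    if "timestamp" not in session:  # no page was clicked before the reload
        return
    pair = (session["keyword"], session["url"])
    user_pairs = favorites.setdefault(session["user_id"], set())
    if verdict == "liked":
        user_pairs.add(pair)  # remember the page-keyword pair as a favorite
    elif verdict == "disliked":
        user_pairs.discard(pair)  # drop the association if it exists
```

Using a set keeps repeated likes idempotent, and `discard` makes a dislike of a never-liked page a harmless no-op.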
Specially created load generators
Commercially available profiling software
JProbe by Quest Software
JVMStat by Sun
Using up our summer vacation to build on it
Notion of Social Rank
Adding blogs, forums, reservations, email search to search results
Using Digg interface to re-rank sites
Learning better categories
And the list goes on...
“A definite startup idea” - Prof. (Dr.) Richard Vuduc
“Definitely launch it” - Santi Ontanon, Post Doc (GVU)
“Are you kidding me.....” - Ravi Sastry, PhD Student
..... What do you guys think? :)
Questions / Suggestions / Critiques