CSE 450 – Web Mining Seminar
Professor Brian D. Davison
Fall 2005
A Presentation on
Searching the Workplace Web
R. Fagin, R. Kumar, K. McCurley, J. Novak, D. Sivakumar, J. Tomlin & D. Williamson
WWW2003, Budapest, Hungary
by Osama Ahmed Khan
11/03/2005 (It’s my birthday!)
Intranet Search vs. Internet Search
A case study of IBM’s intranet
Intranet: A corporate network that is both similar to and different from the Internet
Internet (Democratic): Reflects the collective voice of many authors
Interesting Content: Attracting user traffic (Axiom 1)
Targets various ‘Best Answers’ (Axiom 2)
Spam-influenced: Various authorities contributing (Axiom 3)
Search-engine-friendly (Axiom 4)
Intranet (Autocratic): Reflects the view of the entity that it serves
Informative Content (Axiom 1)
Targets a single ‘Right Answer’ (Axiom 2)
Spam-free: Small number of authorities for building content (Axiom 3)
Search engine: Bad idea (Axiom 4)
1. Identify a variety of ranking functions based on heuristic and experimental analysis of intranet structure
2. Combine them using a Rank Aggregation Architecture
Unbiased: May apply to other organizations
1. Crawler: Stores and produces structured data
2. Duplicate Elimination: Selects a favored representative from each group of similar pages
3. Inverted Indexing: 3 indices (Content, Title, Anchortext)
4. Global Ranking: 7 static lists (PageRank, Indegree, Discovery date, URL words, URL length, URL depth, Discriminator)
5. Query Runtime System
6. Result Markup and Presentation
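The Global Ranking step produces static, query-independent lists. As a rough illustration, two of them (URL length and URL depth) might be computed as in the sketch below. It assumes that shorter and shallower URLs rank higher and that depth is the number of non-empty path segments; the function names and the `w3.example.com` URLs are hypothetical and are not taken from the paper.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /a/b/c.html has depth 3."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def static_rankings(urls):
    """Build two query-independent ranked lists, best first.

    Assumption: shorter URLs and shallower URLs are preferred.
    Ties are broken alphabetically so the output is deterministic.
    """
    by_length = sorted(urls, key=lambda u: (len(u), u))
    by_depth = sorted(urls, key=lambda u: (url_depth(u), u))
    return {"url_length": by_length, "url_depth": by_depth}


# Hypothetical intranet URLs, for illustration only.
urls = [
    "http://w3.example.com/hr/benefits/2003/plan.html",
    "http://w3.example.com/hr/",
    "http://w3.example.com/travel/policy.html",
]
rankings = static_rankings(urls)
```

Each list is computed once, independently of any query, so it can be stored alongside the indices and handed to the aggregation step at query time.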
Input: Multiple ranked lists from various heuristics
Output: Final ranked list minimizing the total ‘inversions’ with respect to the individual ranked lists
Plug-and-Play: Allows addition and removal of individual heuristics
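To make the input/output contract above concrete, here is a minimal sketch that counts inversions between two orderings and then searches every possible ordering for the one with the fewest total inversions. This brute-force search is only an illustration for a handful of documents and is not the aggregation method used in the paper; it also assumes that every heuristic ranks the same set of documents. The document IDs and function names are hypothetical.

```python
from itertools import combinations, permutations


def inversions(candidate, ranked_list):
    """Count pairs that `candidate` orders one way and `ranked_list` the other."""
    position = {doc: i for i, doc in enumerate(ranked_list)}
    return sum(
        1 for a, b in combinations(candidate, 2) if position[a] > position[b]
    )


def total_inversions(candidate, lists_by_heuristic):
    """Sum of inversions between `candidate` and every heuristic's list."""
    return sum(inversions(candidate, lst) for lst in lists_by_heuristic.values())


def aggregate(lists_by_heuristic):
    """Try every ordering and keep one with the fewest total inversions.

    Exponential in the number of documents: for illustration only.
    """
    docs = sorted(next(iter(lists_by_heuristic.values())))
    best = min(
        permutations(docs),
        key=lambda cand: total_inversions(cand, lists_by_heuristic),
    )
    return list(best)


# Hypothetical ranked lists, one per heuristic (best document first).
heuristics = {
    "pagerank": ["a", "b", "c", "d"],
    "indegree": ["b", "a", "c", "d"],
    "url_depth": ["a", "c", "b", "d"],
}
final_ranking = aggregate(heuristics)

# Plug-and-play: dropping a heuristic only means omitting its list.
without_depth = {k: v for k, v in heuristics.items() if k != "url_depth"}
ranking_without_depth = aggregate(without_depth)
```

Because each heuristic contributes nothing more than a ranked list, adding or removing one changes the input dictionary but not the aggregation code.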
Rank Aggregation Architecture (Contd.)
Intranet and Internet possess different structures
Separating ranking functions helps select a combination of best heuristics