CSE 450 – Web Mining Seminar Searching the Workplace Web


CSE 450 – Web Mining Seminar

Professor Brian D. Davison

Fall 2005

A Presentation on

Searching the Workplace Web

R. Fagin, R. Kumar, K. McCurley, J. Novak, D. Sivakumar, J. Tomlin & D. Williamson

WWW2003, Budapest, Hungary

by Osama Ahmed Khan

11/03/2005 (It’s my birthday!)

Problem

Intranet Search vs. Internet Search

Solution

A case study of IBM’s intranet

Definition

Intranet: A corporate network that resembles the Internet in some respects and differs from it in others

Internet

Democratic: Reflects collective voice of many authors

Interesting Content: Attracting user traffic (Axiom 1)

Targets various ‘Best Answers’ (Axiom 2)

Spam-influenced: Various authorities contributing (Axiom 3)

Search-engine-friendly (Axiom 4)

Intranet

Autocratic: Reflects the view of the entity that it serves

Informative Content (Axiom 1)

Targets a single ‘Right Answer’ (Axiom 2)

Spam-free: Small number of authorities for building content (Axiom 3)

Not search-engine-friendly (Axiom 4)

Two-phase Approach

1. Identify a variety of ranking functions based on heuristic and experimental analysis of intranet structure

2. Rank Aggregation Architecture

IBM’s Dataset

Unbiased: May apply to other organizations

System Architecture

1. Crawler: Stores and produces structured data

2. Duplicate Elimination: Keeps one preferred representative from each group of similar pages

3. Inverted Indexing: 3 indices (Content, Title, Anchortext)

4. Global Ranking: 7 static lists (PageRank, Indegree, Discovery date, URL words, URL length, URL depth, Discriminator)

5. Query Runtime System

6. Result Markup and Presentation
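As a rough illustration of steps 3 and 4 above, the sketch below builds one inverted index per field (content, title, anchortext) and two of the query-independent orderings (URL length and URL depth). The toy pages, the function names, and the choice to rank shorter and shallower URLs first are assumptions made for this example, not details taken from the slides.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical toy crawl: url -> (title, body text, anchortext of inbound links)
PAGES = {
    "http://w3.example.com/hr/benefits/index.html": (
        "Employee Benefits", "health plan and retirement benefits",
        ["benefits", "hr benefits"]),
    "http://w3.example.com/travel": (
        "Travel", "book travel and submit travel expenses",
        ["travel reservations"]),
    "http://w3.example.com/hr/forms/2005/expenses.html": (
        "Expense Forms", "forms for travel expenses",
        ["expense form"]),
}

def build_indices(pages):
    """Build one inverted index (term -> set of URLs) per field."""
    indices = {
        "content": defaultdict(set),
        "title": defaultdict(set),
        "anchortext": defaultdict(set),
    }
    for url, (title, body, anchors) in pages.items():
        for term in body.lower().split():
            indices["content"][term].add(url)
        for term in title.lower().split():
            indices["title"][term].add(url)
        for anchor in anchors:
            for term in anchor.lower().split():
                indices["anchortext"][term].add(url)
    return indices

def url_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def static_rankings(pages):
    """Two query-independent orderings.

    Assumption: shorter and shallower URLs rank first; ties break by URL.
    """
    urls = list(pages)
    return {
        "url_length": sorted(urls, key=lambda u: (len(u), u)),
        "url_depth": sorted(urls, key=lambda u: (url_depth(u), u)),
    }

def lookup(indices, term):
    """Return the matching URLs for a single term, separately per index."""
    return {field: sorted(index.get(term.lower(), set()))
            for field, index in indices.items()}
```

Calling `lookup(build_indices(PAGES), "travel")` returns a separate hit list for each of the three indices. Keeping the lists separate, rather than merging them into one score, is what lets a later aggregation step treat each heuristic as an independent ranking.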

Rank Aggregation Architecture

Input: Multiple ranked lists from various heuristics

Output: Final ranked list minimizing the total ‘inversions’ with respect to the individual ranked lists

Plug-and-Play: Allows addition and removal of individual heuristics
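The input/output contract above can be sketched in a few lines. The example below counts an inversion as a pair of documents that the final list orders one way and an input list orders the other way, then searches all permutations for the ordering with the fewest total inversions. This brute-force search is only an illustration for tiny inputs and is not claimed to be the aggregation algorithm used in the paper; the function names and the document labels are hypothetical.

```python
from itertools import combinations, permutations

def inversions(candidate, ranked_list):
    """Count pairs that `candidate` and `ranked_list` order differently.

    Pairs with an item missing from `ranked_list` are skipped, so the
    input lists may be partial.
    """
    pos = {item: i for i, item in enumerate(ranked_list)}
    count = 0
    # combinations() preserves order, so `a` precedes `b` in candidate.
    for a, b in combinations(candidate, 2):
        if a in pos and b in pos and pos[a] > pos[b]:
            count += 1
    return count

def aggregate(ranked_lists):
    """Return an ordering minimizing total inversions over all input lists.

    Exhaustive search: cost grows factorially, so this is for toy inputs only.
    """
    items = sorted({item for lst in ranked_lists for item in lst})
    best = min(
        permutations(items),
        key=lambda perm: sum(inversions(perm, lst) for lst in ranked_lists),
    )
    return list(best)

# Hypothetical per-heuristic rankings of four documents.
heuristics = {
    "pagerank": ["A", "B", "C", "D"],
    "indegree": ["B", "A", "C", "D"],
    "url_depth": ["A", "C", "B", "D"],
}
final = aggregate(list(heuristics.values()))
```

Because `aggregate` only sees a collection of ranked lists, adding or removing a heuristic amounts to adding or deleting an entry in `heuristics`, which mirrors the plug-and-play property.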

Rank Aggregation Architecture (Contd.)

Experimental Results

Experimental Results (Contd.)

Conclusion

Intranet and Internet possess different structures

Separating ranking functions helps select a combination of best heuristics

Thank You
