CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB University of Louisiana

CONTENT AND LINK ANALYSIS
FOR SEARCHING THE WEB
Kemal Efe, Vijay Raghavan, and Arun Lakhotia
University of Louisiana
Presented by Lan Nie
09/01/2005, Lehigh University
Introduction

Search engine

Crawl, index and retrieve information about web pages.
- find all of the relevant pages
- rank them by relevance to the user query
- present a rank-ordered result


Recall and Precision
Early Search Engine




Solely keyword matching
Lots of low-quality pages; rankings rarely agreed with users' interests
Synonymy and polysemy
Modern Search Engine

Linkage structure provides valuable information.
- Link analysis combined with content analysis
- Substantially improves the search quality
Link Analysis

Authority Flow Model

Link: a channel for authority flow
- A page q with authority rank r_q(i) at iteration i distributes all its current
authority equally among its F(q) outgoing links.
$r_p(i+1) = \sum_{q:\, q \to p} r_q(i) / F(q)$
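One iteration of the authority flow update can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three-page graph, the function name `authority_flow_step`, and the choice to let pages without outlinks pass on nothing are all assumptions made for the example.

```python
def authority_flow_step(ranks, outlinks):
    """One iteration: each page q splits r_q(i) equally among its F(q) outlinks."""
    new_ranks = {page: 0.0 for page in ranks}
    for q, targets in outlinks.items():
        if not targets:
            continue  # assumption: a page with no outlinks passes on nothing
        share = ranks[q] / len(targets)  # r_q(i) / F(q)
        for p in targets:
            new_ranks[p] += share
    return new_ranks

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
outlinks = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = authority_flow_step({"a": 1.0, "b": 1.0, "c": 1.0}, outlinks)
print(ranks)
```

Page c ends up with the most authority because it receives the full share of b plus half the share of a.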
However..

An authoritative page on a subject is likely to be co-cited with other
authoritative pages on the same subject
The rank of a page should be augmented by its co-citation degree.

Random Walk Model


A surfer walks on the web graph and makes random decisions about where
to go next
PageRank is a combination of authority flow model and random walk model
Continued..
Authority and Hub



Co-citation matrix $A^T A$
- Entry (p,q): the number of joint co-citations received by p and q;
- Entry (p,p): the indegree of page p.
Bibliographic coupling matrix $A A^T$
Authority / Hub
$a[i+1] = (A^T A) \, a[i]$
$h[i+1] = (A A^T) \, h[i]$

- Diagonal term: authority is influenced by the number of citations
- Non-diagonal term: authority is influenced by the degree of co-citation
- Influence (co-citation) >> Influence (citation)
- A more general model: different weights for diagonal terms and
non-diagonal terms in the above computation
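The co-citation and bibliographic coupling matrices, and the authority and hub iterations built on them, can be illustrated on a small adjacency matrix. The sketch below assumes A[p][q] = 1 when page p links to page q, uses a made-up four-page graph, and rescales the vector to unit length after every step; that normalization is an assumption added here to keep the numbers bounded and does not appear in the iteration above.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def iterate(m, steps=50):
    """Repeatedly apply v <- m v, rescaling to unit length (assumed normalization)."""
    v = [1.0] * len(m)
    for _ in range(steps):
        v = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2, 1 -> 3, 2 -> 3
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
cocitation = matmul(transpose(A), A)  # entry (p,q): pages citing both p and q
coupling = matmul(A, transpose(A))    # entry (p,q): pages cited by both p and q
authority = iterate(cocitation)
hub = iterate(coupling)
print(cocitation[2][3], cocitation[3][3], coupling[0][1])
```

In this graph pages 2 and 3 are co-cited by pages 0 and 1, and page 3 has indegree 3, which is what the diagonal entry of the co-citation matrix reports.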

The HITS algorithm combines the authority and hub ideas
Content Analysis
Which pages are important in the Web Graph? (Link Analysis)
Which pages are relevant to the query?
(Content Analysis)
Tasks of Content Analysis

How relevant a page is to the user query
- Similarity between documents in vector space
- Cosine similarity
- Okapi measure, Three Level Scoring, Cover Density Ranking
- Where on the page to search for the query terms
- Fields: Title, Anchor text, Abstract
- Properties: Font, Highlighting, Capitalization, distance between
subquery terms
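Cosine similarity in vector space, mentioned in the list above, can be sketched with raw term-frequency vectors. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and no term weighting such as TF-IDF. The function name and the sample strings are made up for the example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document matches nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity("jaguar car", "jaguar cat"))
print(cosine_similarity("jaguar car", "rainforest animals"))
```

Documents sharing no terms score zero, which is exactly where synonymy hurts plain keyword vectors.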

Dealing with synonymy and polysemy
- LSI, GVSM
- Applications in classification, document search and relevance
ranking.
Combining Content and Link Analysis (PageRank)
PageRank: A Random Surfer (Brin and Page [1998])
With probability 1-d, the surfer jumps to a random page; with probability d, it follows a random outlink.
Rank is independent of the query/topic.
$PR(p) = d \sum_{q:\, q \to p} PR(q) / F(q) + (1-d) \frac{1}{N}$
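The random surfer formula can be evaluated by simple fixed-point iteration. The sketch below is a minimal version under stated assumptions: every page has at least one outlink (no dangling-page handling), d = 0.85 is a commonly quoted value rather than one given on the slide, and the graph and function name are invented for the example.

```python
def pagerank(outlinks, d=0.85, iterations=50):
    """Iterate PR(p) = d * sum_{q -> p} PR(q) / F(q) + (1 - d) / N."""
    pages = list(outlinks)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}  # random-jump term
        for q, targets in outlinks.items():
            for p in targets:
                new[p] += d * pr[q] / len(targets)  # link-following term
        pr = new
    return pr

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

Because every page here has outlinks, the ranks keep summing to 1 across iterations.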
Topic-Sensitive PageRank: Multiple Focused Surfers (Haveliwala [2002])
A set of predefined topics (top-level categories of ODP), with Ct as the set
of URLs in the ODP category t.
Each page is assigned a rank vector, one rank for each topic.
Each surfer is focused on a specific topic t.
With probability 1-d, the surfer jumps to a page in Ct; with probability d, it follows a random outlink.
$TSPR_t(p) = d \sum_{q:\, q \to p} TSPR_t(q) / F(q) + (1-d) \cdot \begin{cases} 1/|C_t| & \text{if } p \in C_t \\ 0 & \text{if } p \notin C_t \end{cases}$
For a given query, a page's query-sensitive score is the inner product of the
page's rank vector and the query's topic distribution vector.
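A focused surfer and the query-sensitive score can be sketched together. The code follows the surfer description above (jump into Ct with probability 1-d, follow an outlink with probability d). The topic names, the topic page sets, the query's topic distribution, and both function names are hypothetical, and the sketch assumes every page has at least one outlink.

```python
def topic_sensitive_pagerank(outlinks, topic_pages, d=0.85, iterations=50):
    """PageRank whose random jumps land only on pages in the topic set C_t."""
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}
    jump = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) * jump[p] for p in pages}
        for q, targets in outlinks.items():
            for p in targets:
                new[p] += d * rank[q] / len(targets)
        rank = new
    return rank

def query_sensitive_score(page, topic_ranks, topic_probs):
    """Inner product of the page's rank vector and the query's topic distribution."""
    return sum(topic_probs[t] * topic_ranks[t][page] for t in topic_probs)

outlinks = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
topics = {"sports": {"a"}, "cars": {"b"}}    # hypothetical C_t sets
topic_ranks = {t: topic_sensitive_pagerank(outlinks, c) for t, c in topics.items()}
topic_probs = {"sports": 0.25, "cars": 0.75}  # hypothetical query topic distribution
print(query_sensitive_score("a", topic_ranks, topic_probs))
```

The per-topic rank vectors depend only on the graph and the topic sets, so they can be computed offline; only the inner product is evaluated at query time.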
Combining Content and Link Analysis (HITS)
HITS



Sampling
- Use the query to collect a root set of pages from a textual search engine
- Expand the root set into a base set by adding pages linked to and from the
root set
Calculation of Authorities and Hubs
$h = A a = A A^T h$
$a = A^T h = A^T A a$
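The authority and hub calculation on a base set can be sketched as an alternating update. This is a minimal illustration rather than a reference implementation: the base set is a made-up graph given directly (the sampling step is skipped), and the unit-length normalization after each round is an assumption added so the weights stay bounded.

```python
def normalize(weights):
    norm = sum(v * v for v in weights.values()) ** 0.5
    return {p: (v / norm if norm else 0.0) for p, v in weights.items()}

def hits(outlinks, iterations=50):
    """Alternate a = A^T h and h = A a over the base set."""
    pages = set(outlinks) | {p for targets in outlinks.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for q, targets in outlinks.items():
            for p in targets:
                auth[p] += hub[q]  # a(p) = sum of h(q) over q -> p
        hub = {q: sum(auth[p] for p in outlinks.get(q, [])) for q in pages}
        auth, hub = normalize(auth), normalize(hub)
    return auth, hub

# Hypothetical base set: three hub-like pages pointing at two authorities
base = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
auth, hub = hits(base)
print(sorted(auth.items()), sorted(hub.items()))
```

Page a2 is cited by all three hubs and therefore gets the highest authority weight; h3 cites only one authority and gets the lowest nonzero hub weight.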
Problems

Tightly Knit Community (TKC) effect:
HITS converges to the regions of the web graph that are most highly
connected. What if the TKC is irrelevant to the topic?
- A page propagates the same authority weight to each page it links to
- The result is dominated by one community; a page is deemed
unimportant if it is popular only within a smaller community. Example: Jaguar
Continued..
Improvement of HITS

Chakrabarti et al. [1999]
- Outlinks in different parts of a page may point to different topics
- Page splitting: outlinks within each smaller page tend to be on the same topic
Li et al. [2002]
- A good hub is likely to be cited, so the hub weights of pages are increased
depending on their authority weights
Cohn and Chang [2001]
- PHITS: a probabilistic model that ranks a page within its own community
rather than within the entire base set
Dean et al. [1999]
- What's Related? Given the seed page, find its parents, the children of its
parents, its children, and the parents of its children
- Given a seed page, find the pages that link to it, and what else they link to
- Output the pages that are most frequently co-cited with the seed URL
Bharat and Henzinger [1998], Chakrabarti et al. [1998]
- Weighted HITS
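The co-citation idea attributed to Dean et al. in the list above (find the pages that link to the seed, then see what else they link to) can be sketched as a simple count. The graph, the page names, and the function `related_by_cocitation` are hypothetical, and the sketch ignores the sampling limits and ranking refinements a real system would need.

```python
from collections import Counter

def related_by_cocitation(seed, outlinks, top_k=3):
    """Rank pages by how many parents of the seed also link to them."""
    counts = Counter()
    for parent, targets in outlinks.items():
        if seed in targets:
            for sibling in set(targets):
                if sibling != seed:
                    counts[sibling] += 1
    return [page for page, _ in counts.most_common(top_k)]

# Hypothetical graph: p1 and p2 cite the seed; p3 does not
graph = {
    "p1": ["seed", "x", "y"],
    "p2": ["seed", "x"],
    "p3": ["x", "z"],
}
print(related_by_cocitation("seed", graph))
```

Page x is co-cited with the seed twice and y once, while z is never co-cited because p3 does not link to the seed.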


Weighted HITS
CLEVER project (Chakrabarti et al. [1998])
- A relevance weight is computed for each link
- w(p, q): the number of query matches in the text surrounding the link p -> q
$a(p) = \sum_{q:\, q \to p} (1 + w(q, p)) \, h(q)$
$h(p) = \sum_{q:\, p \to q} (1 + w(p, q)) \, a(q)$
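One round of the link-weighted update can be sketched as follows. The weight of each link is keyed by (source, target), the match counts are invented for the example, and no normalization is applied because only a single step is shown; the function name `weighted_hits_step` is an assumption.

```python
def weighted_hits_step(auth, hub, link_weight):
    """link_weight maps (source, target) to the query-match count for that link."""
    new_auth = {p: 0.0 for p in auth}
    for (source, target), w in link_weight.items():
        new_auth[target] += (1 + w) * hub[source]
    new_hub = {p: 0.0 for p in hub}
    for (source, target), w in link_weight.items():
        new_hub[source] += (1 + w) * new_auth[target]
    return new_auth, new_hub

# Hypothetical links: the text around h -> a1 matches the query twice
links = {("h", "a1"): 2, ("h", "a2"): 0}
ones = {"h": 1.0, "a1": 1.0, "a2": 1.0}
auth, hub = weighted_hits_step(ones, ones, links)
print(auth, hub)
```

With all weights set to zero the update reduces to the unweighted HITS step, so the link text only ever adds to what plain link counting would give.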
Query Expansion (Bharat and Henzinger [1998])
- A relevance weight is assigned to each page
- Broader query Q: the concatenation of the first 1000 words from each document in
the root set
- w(p): cosine similarity between page p and the broader query Q
$a(p) = w(p) \sum_{q:\, q \to p} h(q)$
$h(p) = w(p) \sum_{q:\, p \to q} a(q)$
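The page-level weighting can be sketched end to end: build the broader query from the root set, score each page against it, and scale the authority update by that score. Everything concrete here is an assumption for illustration: whitespace tokenization, raw term counts, the three sample pages, and the helper names.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_broad_query(root_docs, words_per_doc=1000):
    """Concatenate the first words_per_doc words of every root-set document."""
    terms = []
    for text in root_docs:
        terms.extend(text.lower().split()[:words_per_doc])
    return Counter(terms)

def relevance_weights(page_texts, broad_query):
    return {p: cosine(Counter(t.lower().split()), broad_query)
            for p, t in page_texts.items()}

def authority_update(hub, outlinks, w):
    """a(p) = w(p) * sum of h(q) over pages q linking to p."""
    auth = {p: 0.0 for p in w}
    for q, targets in outlinks.items():
        for p in targets:
            auth[p] += hub[q]
    return {p: w[p] * auth[p] for p in auth}

pages = {"hub": "jaguar car reviews", "on": "jaguar car dealer", "off": "rainforest animals"}
Q = build_broad_query(["jaguar car dealer", "jaguar car reviews"])
w = relevance_weights(pages, Q)
auth = authority_update({p: 1.0 for p in pages}, {"hub": ["on", "off"]}, w)
print(w, auth)
```

The off-topic page receives the same number of links as the on-topic one, yet its authority is zero because its relevance weight is zero.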