Block-level Link Analysis Presented by Lan Nie 11/08/2005, Lehigh University

Introduction

- A web page often contains multiple semantics.
- Different parts of a page differ in importance and topic.
- Links contained in different semantic blocks point to pages on different topics.



- The importance of a page may be miscalculated by PageRank, and topic drift may occur in HITS, because both treat the whole page as a single node.
- Split each page into semantic blocks.
- Apply link analysis at the block level.
Vision-Based Page Segmentation
Construct a semantic tree for each page based on its layout structure:
- Extract blocks from the HTML DOM tree.
- Organize the blocks into a semantic tree based on separators.
- Node: a block with a degree-of-coherence (DOC) value that indicates how coherent the content in the block is.
Block Level Web Graph
P: set of all pages
B: set of all blocks
X: page-to-block matrix (layout structure)

$$X_{ij} = \begin{cases} f_{p_i}(b_j) & \text{if } b_j \in p_i \\ 0 & \text{otherwise} \end{cases}$$

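f is the block importance function: a large, centered block counts for more than a small block at the margin.

$$f_p(b) = \alpha \, \frac{\mathrm{size}(b)}{\mathrm{dist}(b, p)}, \qquad \sum_{b \in p} f_p(b) = 1$$

Here $\alpha$ is the normalization factor that makes the importance values of the blocks in page p sum to 1.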
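As a minimal sketch of the importance function above, the following Python snippet computes size(b)/dist(b, p) for every block of one page and normalizes the values so that they sum to 1. The function name `block_importance`, the (size, dist) tuple layout, and the sample numbers are assumptions made for illustration; they do not come from the slides.

```python
def block_importance(blocks):
    """Return normalized importance values for the blocks of one page.

    blocks: list of (size, dist) pairs, where dist is assumed to be a
    positive distance between the block and the center of the page.
    """
    raw = [size / dist for size, dist in blocks]
    total = sum(raw)  # dividing by this plays the role of the normalization factor
    return [r / total for r in raw]


# Hypothetical page: a large centered content block and a small block at the margin.
weights = block_importance([(600.0, 1.0), (100.0, 4.0)])
print(weights)
```

The large centered block ends up with almost all of the importance mass, which matches the intuition stated on the slide.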
Z: block-to-page matrix (link structure)

$$Z_{ij} = \begin{cases} 1/s_i & \text{if block } i \text{ links to page } j \\ 0 & \text{otherwise} \end{cases}$$

where $s_i$ is the number of pages that block i links to.
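The block-to-page matrix can be built directly from the out-links of each block. The sketch below is one possible way to do it in plain Python; the function name `block_to_page_matrix` and the input format (one set of target page indices per block) are assumptions for illustration.

```python
def block_to_page_matrix(block_links, num_pages):
    """Build Z, where Z[i][j] = 1/s_i if block i links to page j, else 0.

    block_links[i] is the set of page indices that block i links to,
    so s_i = len(block_links[i]).
    """
    Z = []
    for targets in block_links:
        row = [0.0] * num_pages
        for j in targets:
            row[j] = 1.0 / len(targets)
        Z.append(row)
    return Z


# Hypothetical example: block 0 links to pages 1 and 2, block 1 links to page 0,
# and block 2 contains no links at all.
Z = block_to_page_matrix([{1, 2}, {0}, set()], num_pages=3)
print(Z)
```

A block without links simply yields an all-zero row.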
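WP: Page-to-Page Graph

$$W_P = XZ$$

For pages $\alpha$ and $\beta$:

$$\begin{aligned} W_P(\alpha, \beta) &= \mathrm{prob}(\beta \mid \alpha) \\ &= \sum_{b \in \alpha} \mathrm{prob}(\beta \mid b)\, \mathrm{prob}(b \mid \alpha) \\ &= \sum_{b \in \alpha} Z(b, \beta)\, X(\alpha, b) \\ &= \sum_{b \in \alpha,\; b \to \beta} \frac{1}{s_b}\, f_\alpha(b) \end{aligned}$$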
WP is a weighted adjacency matrix: links in blocks with a high importance value receive more weight than links in blocks with a low importance value.
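A small worked example may help to see how the product XZ weights links. The matrices below are hypothetical: page 0 has a main block (importance 0.9) and an advertisement-like block (importance 0.1), and each of the other two pages consists of a single block. The helper `matmul` is a plain-Python stand-in for a matrix library.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# X: page-to-block matrix (3 pages x 4 blocks), hypothetical importance values.
X = [[0.9, 0.1, 0.0, 0.0],   # page 0 owns blocks 0 and 1
     [0.0, 0.0, 1.0, 0.0],   # page 1 owns block 2
     [0.0, 0.0, 0.0, 1.0]]   # page 2 owns block 3

# Z: block-to-page matrix (4 blocks x 3 pages), each row holds 1/s_i per target.
Z = [[0.0, 1.0, 0.0],        # block 0 links to page 1
     [0.0, 0.0, 1.0],        # block 1 links to page 2
     [1.0, 0.0, 0.0],        # block 2 links to page 0
     [0.5, 0.5, 0.0]]        # block 3 links to pages 0 and 1

W_P = matmul(X, Z)
for row in W_P:
    print(row)
```

In this example the link from the main block of page 0 carries weight 0.9, while the link from its low-importance block carries only 0.1.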
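WB: Block-to-Block Graph (not used in this paper)

$$W_B = ZX$$

$$\begin{aligned} W_B(a, b) &= \mathrm{prob}(b \mid a) \\ &= \sum_{\beta \in P} \mathrm{prob}(\beta \mid a)\, \mathrm{prob}(b \mid \beta) \\ &= Z(a, \beta)\, f_\beta(b) \end{aligned}$$

In the last line, $\beta$ is the page that contains block b; since every block belongs to exactly one page, only that page contributes to the sum.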
Extension: the probability of jumping from block a to block b within a page is the DOC value of the smallest block containing both a and b.

$$W_B = (1 - t)\, ZX + t\, D^{-1} U$$
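The extended block graph can be sketched as follows. The slide does not define D, U, or t, so this example makes explicit assumptions: U[a][b] holds the DOC value of the smallest block containing both a and b when they lie on the same page and 0 otherwise, D is taken to be the diagonal matrix of the row sums of U (so that D^{-1}U is row-normalized), and t is a mixing constant. All numbers are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def block_graph(Z, X, U, t):
    """Compute (1 - t) * ZX + t * D^-1 * U, assuming D = diag(row sums of U)."""
    ZX = matmul(Z, X)
    n = len(U)
    W = []
    for a in range(n):
        row_sum = sum(U[a])
        row = []
        for b in range(n):
            within_page = U[a][b] / row_sum if row_sum > 0 else 0.0
            row.append((1 - t) * ZX[a][b] + t * within_page)
        W.append(row)
    return W


# Hypothetical data: 3 pages, 4 blocks; blocks 0 and 1 share page 0.
X = [[0.9, 0.1, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
Z = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
# Assumed DOC-based matrix: non-zero only for blocks on the same page.
U = [[0.8, 0.4, 0.0, 0.0],
     [0.4, 0.9, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

W_B = block_graph(Z, X, U, t=0.2)
for row in W_B:
    print(row)
```

With these inputs every row of the result still sums to 1, so it can be read as a transition probability matrix over blocks.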
Block Level PageRank (BLPR)

- Apply PageRank to the weighted adjacency matrix WP:

$$(dU + (1 - d)M)^T p = p$$

- Each edge is weighted by the importance value of the block that contains the link.
- Pages pointed to by advertisement hyperlinks are unlikely to receive a large score, since such links typically sit in less important blocks.
- Block-level PageRank can therefore reflect the semantic structure of the web.
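The fixed-point equation for block-level PageRank can be approximated by power iteration. The sketch below assumes that M is WP with each row normalized to sum to 1, that U is the uniform matrix with entries 1/n, and that d = 0.15 is the weight on U; none of these choices are stated on the slide. Rows of WP without any out-links are replaced by a uniform row, which is a further assumption.

```python
def block_level_pagerank(W_P, d=0.15, iterations=100):
    """Approximate p in (d*U + (1 - d)*M)^T p = p by power iteration."""
    n = len(W_P)
    # Row-normalize W_P into M; pages without out-links jump uniformly (assumption).
    M = []
    for row in W_P:
        s = sum(row)
        M.append([w / s for w in row] if s > 0 else [1.0 / n] * n)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # New score of page j: sum over source pages i of (d/n + (1-d)*M[i][j]) * p[i].
        p = [sum((d / n + (1 - d) * M[i][j]) * p[i] for i in range(n))
             for j in range(n)]
    return p


# Hypothetical W_P: page 0 links to page 1 from an important block (0.9)
# and to page 2 from an unimportant block (0.1).
W_P = [[0.0, 0.9, 0.1],
       [1.0, 0.0, 0.0],
       [0.5, 0.5, 0.0]]
scores = block_level_pagerank(W_P)
print(scores)
```

In this toy graph, page 2 is reached only through the low-importance block and ends up with a lower score than page 1.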
Block Level HITS (BLHITS)

- Apply HITS to the block-to-page matrix Z:

$$A = Z^T H, \qquad H = ZA$$

- A page has only an authority score A, and a block has only a hub score H.
- Different parts of a page act as separate hubs, so the links they contain are treated differently.
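The two update rules can be iterated until the scores stabilize. The following sketch alternates A = Z^T H and H = Z A and rescales both vectors to unit length after each round; the L2 normalization, the fixed iteration count, and the sample matrix are assumptions borrowed from the usual HITS formulation rather than details given on the slide.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def block_level_hits(Z, iterations=50):
    """Return (A, H): authority scores for pages and hub scores for blocks."""
    num_blocks, num_pages = len(Z), len(Z[0])
    H = [1.0] * num_blocks
    A = [1.0] * num_pages
    for _ in range(iterations):
        # A = Z^T H: a page collects the hub scores of the blocks linking to it.
        A = [sum(Z[i][j] * H[i] for i in range(num_blocks)) for j in range(num_pages)]
        # H = Z A: a block collects the authority scores of the pages it links to.
        H = [sum(Z[i][j] * A[j] for j in range(num_pages)) for i in range(num_blocks)]
        A, H = l2_normalize(A), l2_normalize(H)
    return A, H


# Hypothetical Z (3 blocks x 3 pages): every block links to page 1.
Z = [[0.5, 0.5, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.5, 0.5]]
A, H = block_level_hits(Z)
print(A, H)
```

Here page 1 receives the highest authority score, and block 1, which links only to page 1, receives the highest hub score.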
Main differences between BLHITS and HITS





- Links run from blocks to pages rather than from pages to pages.
- The root set is made up of top-ranked blocks rather than top-ranked pages.
- When expanding the root set, only the out-links contained in the top-ranked blocks of a page are considered, instead of all links.
- Content analysis is combined at the block level instead of the page level.
- Links are weighted by the importance value of their block divided by the maximum block importance value.
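The link weighting in the last point amounts to a simple rescaling. The snippet below is a minimal sketch; it assumes the maximum is taken over the list of importance values passed in (for example, the blocks of one page), which the slide does not specify, and the function name `link_weights` is hypothetical.

```python
def link_weights(importance_values):
    """Weight for links in each block: block importance / maximum block importance."""
    top = max(importance_values)
    return [value / top for value in importance_values]


# Hypothetical importance values for three blocks.
weights = link_weights([0.6, 0.3, 0.1])
print(weights)
```

Links in the most important block get weight 1, and all other links are scaled down proportionally.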
Experiments




- Dataset: TREC2003
- Relevance weighting: BM2500
- PR and BLPR
- HITS and BLHITS
  - Size of root set: 200
  - In-link parameter d: 50
  - Adopting Bharat and Henzinger’s ideas:
    - Eliminate mutually reinforcing relationships between hosts
    - Combine connectivity and content analysis
Results on PR & BLPR
1. First 15 pages in the .GOV dataset
2. Results on TREC2003
Combine the relevance score (using BM2500) and the importance score (using the ranking algorithm):

$$\alpha \cdot \mathrm{rank}_{\mathrm{relevance}}(d) + (1 - \alpha) \cdot \mathrm{rank}_{\mathrm{importance}}(d)$$
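The linear combination of the two rankings can be illustrated as follows. This sketch assumes that both inputs are rank positions (1 is best), so a lower combined value is better, and that the mixing weight, called `alpha` here, lies between 0 and 1; the document identifiers and rank values are made up.

```python
def combine_ranks(rank_relevance, rank_importance, alpha):
    """Combine two rankings: alpha * relevance rank + (1 - alpha) * importance rank."""
    return {doc: alpha * rank_relevance[doc] + (1 - alpha) * rank_importance[doc]
            for doc in rank_relevance}


# Hypothetical rank positions for three documents.
rank_relevance = {"a": 1, "b": 2, "c": 3}
rank_importance = {"a": 3, "b": 1, "c": 2}

combined = combine_ranks(rank_relevance, rank_importance, alpha=0.5)
ordering = sorted(combined, key=combined.get)  # best (lowest) combined value first
print(combined, ordering)
```

Varying `alpha` shifts the final ordering between pure relevance ranking and pure link-based importance ranking.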
Results on HITS & BLHITS
Summary