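Block-level Link Analysis
Presented by Lan Nie
11/08/2005, Lehigh University

Introduction
- A web page often contains multiple semantics
  - Different parts of the page have different importance and topics
  - Links contained in different semantic blocks point to pages on different topics
- The importance of a page may be miscalculated by PageRank, and topic drift may happen in HITS
- Split the page into semantic blocks
- Apply link analysis at the block level

Vision-Based Page Segmentation
- Construct a semantic tree for a page based on its layout structure
  - Extract blocks from the HTML DOM tree
  - Organize the blocks into a semantic tree based on the separators between them
- Node: a block with a value (DOC, degree of coherence) that indicates how coherent the content of the block is

Block Level Web Graph
- P: set of all pages
- B: set of all blocks
- X: page-to-block matrix (layout structure)
  $X_{ij} = f_{p_i}(b_j)$ if $b_j \in p_i$, and $0$ otherwise
- f is the block importance function: large size and centered position vs. small size and margin position
  $f_p(b) \propto \mathrm{size}(b) / \mathrm{dist}(b, p)$, normalized so that $\sum_{b \in p} f_p(b) = 1$
  (dist(b, p) grows as b moves from the center of page p toward its margin)
- Z: block-to-page matrix (link structure)
  $Z_{ij} = 1/s_i$ if there is a link from block $i$ to page $j$, and $0$ otherwise
  $s_i$ is the number of pages that block $i$ links to

W_P: Page-to-Page Graph
- $W_P = XZ$
- $W_P(\alpha, \beta) = \mathrm{prob}(\beta \mid \alpha) = \sum_{b \in \alpha} \mathrm{prob}(\beta \mid b)\,\mathrm{prob}(b \mid \alpha) = \sum_{b \in \alpha} Z(b, \beta)\, X(\alpha, b) = \sum_{b \in \alpha,\ b \to \beta} (1/s_b)\, f_\alpha(b)$
  where $\alpha$ and $\beta$ are pages and the last sum runs over the blocks of $\alpha$ that link to $\beta$
- A weighted adjacency matrix: links in blocks with high importance values get more weight than those in blocks with low importance values

W_B: Block-to-Block Graph (not used in this paper)
- $W_B = ZX$
- $W_B(a, b) = \mathrm{prob}(b \mid a) = \sum_{\beta \in P} \mathrm{prob}(\beta \mid a)\,\mathrm{prob}(b \mid \beta) = Z(a, \beta)\, f_\beta(b)$
  where $\beta$ in the final term is the page that contains block $b$
- Extension: the probability of jumping from block a to block b within a page is the DOC value of the smallest block containing both a and b
  $W_B = (1 - t)\, ZX + t\, D^{-1} U$
  where $U(a, b)$ holds that DOC value, $D^{-1}$ normalizes the rows of $U$, and $t$ weights the within-page jumps

Block-Level PageRank (BLPR)
- Apply PageRank on the weighted adjacency matrix W_P
  $(dU + (1 - d)M)^T p = p$
  where $M$ is $W_P$ with each row normalized to sum to 1, $U$ here is the uniform transition matrix (not the DOC matrix of the W_B extension), $d$ is the random-jump probability, and $p$ is the rank vector
- Each edge is weighted by the block's importance value
- Pages pointed to by advertisement hyperlinks might not be assigned a large score, since such links are always in less important blocks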
- Block-level PageRank can reflect the semantic structure of the web

Block-Level HITS (BLHITS)
- Apply HITS on the block-to-page matrix Z
  $A = Z^T H$, $H = ZA$
- A page has only an authority score A, and a block has only a hub score H
- Different parts of a page are treated differently, so the links in these hubs are treated differently

Main difference between BLHITS and HITS
- Links from blocks to pages vs. links from pages to pages
- The root set is made up of top-ranked blocks rather than top-ranked pages
- When expanding the root set, only the out-links contained in the top-ranked blocks of a page are considered, instead of all links
- Content analysis is combined at the block level instead of the page level
- Link weight: importance value of the block / maximum block importance value

Experiments
- Data set: TREC 2003
- Relevance weighting: BM2500
- PR and BLPR
- HITS and BLHITS
  - Size of root set: 200
  - In-link parameter d: 50
  - Adopting Bharat and Henzinger's ideas
    - Eliminate mutually reinforcing relationships between hosts
    - Combine connectivity and content analysis

Results on PR & BLPR
1. First 15 pages in the .GOV data set
2. Results on TREC 2003
   - Combine the relevance score (using BM2500) and the importance score (using the ranking algorithm) of a document d, with a mixing weight $\alpha \in [0, 1]$:
     $\alpha \cdot \mathrm{rank}_{\mathrm{relevance}}(d) + (1 - \alpha) \cdot \mathrm{rank}_{\mathrm{importance}}(d)$

Results on HITS & BLHITS

Summary
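
To make the block-level graph, BLPR, and BLHITS definitions concrete, here is a minimal Python sketch on a made-up example. Everything specific in it is an assumption for illustration only and is not taken from the paper or its experiments: the three pages, four blocks, importance values, links, the value d = 0.15, the handling of pages without out-links, the L1 normalization in the HITS loop, and the helper names `matmul`, `block_to_page`, `blpr`, and `blhits`.

```python
# Toy illustration of W_P = XZ, BLPR, and BLHITS.
# All data below is hypothetical: page 0 has a main block (importance 0.8)
# and an advertisement block (importance 0.2); pages 1 and 2 have one block each.

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block_to_page(links, n_pages):
    """Build Z: Z[i][j] = 1/s_i if block i links to page j, else 0."""
    z = []
    for targets in links:
        row = [0.0] * n_pages
        for j in targets:
            row[j] = 1.0 / len(targets)
        z.append(row)
    return z

def blpr(w_p, d=0.15, iters=100):
    """Power iteration for (dU + (1-d)M)^T p = p, with M = row-normalized W_P."""
    n = len(w_p)
    m = []
    for row in w_p:
        total = sum(row)
        # Assumption: a page without out-links jumps uniformly at random.
        m.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [d / n + (1 - d) * sum(p[i] * m[i][j] for i in range(n))
             for j in range(n)]
    return p

def blhits(z, iters=50):
    """Iterate A = Z^T H and H = Z A, rescaling both to sum to 1 each round."""
    n_blocks, n_pages = len(z), len(z[0])
    h = [1.0] * n_blocks
    a = [0.0] * n_pages
    for _ in range(iters):
        a = [sum(z[i][j] * h[i] for i in range(n_blocks)) for j in range(n_pages)]
        h = [sum(z[i][j] * a[j] for j in range(n_pages)) for i in range(n_blocks)]
        sa, sh = sum(a) or 1.0, sum(h) or 1.0
        a = [v / sa for v in a]
        h = [v / sh for v in h]
    return a, h

# X: page-to-block matrix (3 pages x 4 blocks); each row sums to 1.
X = [[0.8, 0.2, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

# Block 0 (main content) links to page 1, block 1 (ad) links to page 2,
# and the blocks of pages 1 and 2 both link back to page 0.
Z = block_to_page([[1], [2], [0], [0]], n_pages=3)

W_P = matmul(X, Z)          # page-to-page graph weighted by block importance
ranks = blpr(W_P)           # block-level PageRank scores, one per page
authority, hubs = blhits(Z) # authority per page, hub score per block

print("W_P:", W_P)
print("BLPR:", ranks)
print("BLHITS authority:", authority)
print("BLHITS hubs:", hubs)
```

In this toy graph an ordinary PageRank would treat the two out-links of page 0 equally, whereas the block-weighted W_P sends most of page 0's weight to the page linked from its main block, so the page reached only through the advertisement block ends up with the lower BLPR score.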