Part I: Web Structure Mining Chapter 2: Hyperlink Based Ranking • • • • • Social Network Analysis PageRank Authorities and Hubs Link Based Similarity Search Enhanced Techniques for Page Ranking Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 1 Social Networks • Directed graph with weights assigned to its edges • Nodes represent documents and the edges – citations from one document to other documents. • Prestige can be associated with the number of input edges to a node (in-degree). • Prestige has a recursive nature. depends on the authority (or again, the prestige) of citations Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 2 Social Networks • adjacency matrix A – if document cites document A(u, v) 1 – otherwise A(u, v) 0 • prestige score p(u) v A(v, u) p(v) Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 3 Social Networks • Computing prestige T P A P • Eigen decomposition P A P T – Eigenvector P – Eigenvalue Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 4 Social Networks a 0.548 c 0.726 b 0.413 0.548 0 0 1 0.548 1.325 0.414 1 0 0 0.414 0.726 1 1 0 0.726 0 1 1 A 0 0 1 1 0 0 0 0 1 T A 1 0 0 1 1 0 P AT P 1.325 P 0.548 0.414 0.726 Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search T 5 Social Networks Power Iteration • P P0 • Loop: QP P A TQ P • While 1 P P P Q Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 6 PageRank • “Random web surfer” keeps clicking on hyperlinks at random with uniform probability • Implements random walk on the web graph • Page u links to N u web pages • Probability of visiting page v will be 1 Nu • Amount of prestige that page v receives from page u is 1 Nu of the prestige of u Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 7 PageRank Propagation of page rank R(u ) 10 20 20 20 60 20 30 26 4 2 13 13 Nv w A(v, w) 6 12 A(v, u ) R(v) R(u ) Nv v 6 6 Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 8 PageRank Calculation of page rank 0 0.5 0.5 A 0 0 1 1 0 0 a 0.4 0.4 0.2 0.2 c 0.4 0.2 b 0.2 PT 0.666 0.333 0.666 Norm X 2 x12 x22 ... xn2 P A P T 0.4 0 0 1 0.4 0 . 2 0 .5 0 0 0 . 2 0 . 4 0 .5 1 0 0 . 4 Norm X 1 x1 x2 ... xn PT 2 1 2 Integers Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval an Web Search 9