Adaptive On-Line Page Importance Computation
Serge, Mihai, Gregory
Presented by Liang Tian
7/13/2010

Overview
• What is OPIC?
• Why should we care?
• Advantages vs. off-line algorithms
• How does it work?
• Scenario of OPIC
• Challenge
• Mathematical model
• Algorithm
• Pros and cons

What is OPIC?
• OPIC stands for On-line Page Importance Computation.

Why should we care?
• OPIC provides a more effective way of computing page importance than older algorithms.

Advantages vs. off-line algorithms
• Works on-line on a large, dynamic graph
• Uses far fewer resources; e.g., it does not require storing the link matrix
• Can focus crawling on the most interesting pages
• Is fully integrated in the crawling process

How does it work?
• It is on-line in that it continuously refines its estimate of page importance while the web graph is visited.

Scenario of OPIC
• Initially, distribute some cash to each page.
• Each page, when it is crawled, distributes its current cash equally to all pages it points to.
• Record the credit history of each page: when a page is crawled, its current cash is sent to its children, and that amount is added to the page's credit history.
• The importance of a page = (credit history + current cash) / (total history + total current cash)

Challenge
How do we find the values of current cash and history?
Intuitively, cash flows from parent nodes to child nodes, in an inductive way.

Mathematical model
• Let G be any directed graph with n vertices. Fix an arbitrary ordering between the vertices.
G can be represented as a matrix L[i,j] such that L[i,j] ≥ 0, and L[i,j] > 0 iff there is an edge from i to j.
• The basic idea is to define the importance of a page in an inductive way and then compute it using a fixpoint.
• If the graph contains n nodes, the importance is represented as a vector x̄ in an n-dimensional space.

Mathematical model (cont.)
• Importance is defined inductively by the equation x̄(k+1) = L x̄(k)
• Given a linear transformation A, a non-zero vector x is defined to be an eigenvector of the transformation if it satisfies the eigenvalue equation Ax = λx.

Find a fixpoint
• By definition, such a fixpoint is an eigenvector of L with a real positive eigenvalue: L x̄ = λ x̄
• Problems
  • Multiple solutions
  • Iteration may not converge
• Solution
  • Google defines L[i,j] = 1/d[i] iff there is an edge from i to j, where d[i] is the out-degree of i, and L'[i,j] = L[i,j] + ε, where ε is a small real.
  • This corresponds to a new graph G', which is G plus a small-weight edge for every pair i, j.
  • Convergence of the iteration is guaranteed because these small edges make G' strongly connected and aperiodic.

Algorithm for static graphs
At each step, an estimate of any page k's importance is (H[k] + C[k]) / (G + 1), where H[k] is the credit history of page k, C[k] is its current cash, G is the total history over all pages, and the total cash is normalized to 1.

Crawling strategies
There are two main strategies here:
• Random: we choose the next page to crawl randomly, with equal probability.
• Greedy: we read next the page with the highest cash. This is a greedy way to decrease the value of the error factor.
The choice of strategy affects the convergence speed.

The Adaptive OPIC algorithm (for changing graphs)
• Based on a time window
  • Fixed window
  • Variable window
  • Interpolation
• Two main dimensions
  • The page selection strategy that is used (e.g., Greedy or Random)
  • The window policy that is considered (e.g., Fixed Window or Interpolation)
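
The cash and history bookkeeping from the scenario, together with the Greedy and Random page selection strategies, can be sketched in Python as follows. This is a minimal illustration for a static graph, not the authors' implementation: the function name `opic`, its parameters, and the dictionary-based graph are assumptions made for the example, and every page is assumed to have at least one out-link that points to a page in the graph.

```python
import random


def opic(graph, steps, strategy="greedy", seed=0):
    """Estimate page importance on a static graph (illustrative sketch).

    graph: dict mapping each page to a non-empty list of pages it links to
           (hypothetical representation chosen for this example).
    steps: number of pages to "crawl".
    strategy: "greedy" (highest cash first) or "random" (uniform choice).
    """
    rng = random.Random(seed)
    pages = list(graph)
    for page in pages:
        if not graph[page]:
            raise ValueError(f"page {page!r} has no out-links")

    n = len(pages)
    cash = {p: 1.0 / n for p in pages}   # C[k]: total cash is normalized to 1
    history = {p: 0.0 for p in pages}    # H[k]: credit history
    total_history = 0.0                  # G: sum of all history

    for _ in range(steps):
        if strategy == "greedy":
            page = max(pages, key=lambda p: cash[p])
        else:
            page = rng.choice(pages)

        amount = cash[page]
        history[page] += amount          # record the cash in the history
        total_history += amount
        cash[page] = 0.0                 # the page gives away all its cash

        share = amount / len(graph[page])
        for child in graph[page]:        # equal split among children
            cash[child] += share

    # Importance estimate: (H[k] + C[k]) / (G + 1)
    return {p: (history[p] + cash[p]) / (total_history + 1.0) for p in pages}
```

Calling `opic({"a": ["b", "c"], "b": ["c"], "c": ["a"]}, steps=3000)` returns one estimate per page, and the estimates sum to 1 up to rounding because the total cash never changes. The cash of the crawled page is reset before it is redistributed so that a page linking to itself keeps its own share.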

Pros
• It may start even when a (large) part of the matrix is still unknown.
• It is integrated in the crawling process.
• It works on-line, even while the graph is being updated.
• It requires fewer storage resources than standard algorithms.
• It requires less CPU, memory, and disk access than standard algorithms.

Cons
• It is strictly tailored to the computational cost model of crawling the Web.
• It converges more slowly than other algorithms after reading the same number of pages.

Reference
• K. Bharat and A. Broder. Estimating the relative size and overlap of public web search engines. 7th International World Wide Web Conference (WWW7), 1998.
• Andrei Z. Broder et al. Graph structure in the web. WWW9/Computer Networks, 2000.
• S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. 8th World Wide Web Conference, 1999.
• J. Dean and M.R. Henzinger. Finding related pages in the world wide web. 8th International World Wide Web Conference, 1999.
• Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web, 1998.
• S. Abiteboul, G. Cobena, J. Masanes, and G. Sedrati. A first experience in archiving the French web. ECDL, 2002.

Q&A