Adaptive On-Line Page Importance Computation

Serge, Mihai, Gregory
Presented By
Liang Tian
7/13/2010
Adaptive On-Line Page Importance Computation
1
Overview:
• What is OPIC?
• Why should we care?
• Advantages vs. off-line algorithms
• How does it work?
• Scenario of OPIC
• Challenge
• Mathematical model
• Algorithm
• Pros and cons
What is OPIC?
• OPIC stands for On-line Page Importance
Computation.
Why should we care?
• OPIC provides a cheaper way of computing page
importance than older off-line algorithms, and it
can run while the crawl is still in progress.
Advantages vs. off-line algorithms
• Works on-line on large, dynamic graphs
• Uses far fewer resources; e.g., it does not require
storing the link matrix
• Can focus crawling on the most interesting pages
• Is fully integrated in the crawling process
How does it work?
• It is on-line in that it continuously refines its
estimate of page importance while the web graph is
visited.
Scenario of OPIC
• Initially, distribute some cash to each page.
• Each page, when it is crawled, distributes its current
cash equally to all the pages it points to.
• Record the credit history of each page: when a page is
crawled, its current cash is sent to its children and the
same amount is added to its credit history, so the history
totals the cash the page held at each past crawl.
• Importance of a page = (its credit history + its current
cash) / (total credit history + total current cash)
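The scenario above can be traced by hand on a tiny graph. The following is a minimal Python sketch, not the authors' code: the three-page graph and the names `links`, `cash`, `history`, `crawl`, and `importance` are all hypothetical, and exact fractions are used only so the arithmetic is easy to check.

```python
from fractions import Fraction

# Hypothetical three-page graph: page -> pages it points to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# Initially every page gets the same cash; credit histories start at zero.
cash = {page: Fraction(1, len(links)) for page in links}
history = {page: Fraction(0) for page in links}

def crawl(page):
    """Record the page's cash in its history, then split it among its children."""
    amount = cash[page]
    history[page] += amount
    cash[page] = Fraction(0)
    share = amount / len(links[page])
    for child in links[page]:
        cash[child] += share

def importance(page):
    """(credit history + current cash) / (total history + total current cash)."""
    total = sum(history.values()) + sum(cash.values())
    return (history[page] + cash[page]) / total

crawl("a")
print(importance("a"), importance("b"), importance("c"))  # 1/4 3/8 3/8
```

After crawling "a", its 1/3 of cash is recorded in its history and split into 1/6 for each of "b" and "c", so the total cash in the system stays at 1 while the total history grows.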
Challenge
How do we find the values of the current cash and the credit history?
Intuitively, cash flows from parent nodes to child nodes, so
importance is defined in an inductive way.
Mathematical model
• Let G be any directed graph with n vertices. Fix an
arbitrary ordering of the vertices. G can be
represented as an n × n matrix L such that
L[i,j] >= 0, and L[i,j] > 0 iff there is an edge from i to j.
• The basic idea is to define the importance of a page
in an inductive way and then compute it using a
fixpoint.
• If the graph contains n nodes, the importance is
represented as a vector x in an n-dimensional space.
Mathematical model (cont.)
Importance is defined inductively by the equation
x_{k+1} = L x_k
Given a linear transformation A, a non-zero vector x
is defined to be an eigenvector of the transformation
if it satisfies the eigenvalue equation Ax = λx for some scalar λ.
Find a fixpoint
• By definition, such a fixpoint is an eigenvector of L
with a real positive eigenvalue: Lx = λx
• Problems
  • Multiple solutions
  • Iteration may not converge
• Solution
  • Google defines L[i,j] = 1/d[i] iff there is an edge from i to j,
  where d[i] is the out-degree of page i, and then uses
  L'[i,j] = L[i,j] + ϵ, where ϵ is a small positive real.
  • This gives a new graph G', which is G plus a small edge for every pair i, j.
  • Convergence of the iteration is guaranteed because these small edges
  make G' strongly connected and aperiodic.
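The fixpoint iteration on L' can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Google's or the authors' implementation: the function name `power_iteration`, the value `eps=0.01`, the fixed step count, and the example graph are hypothetical. Importance moves along edges here, so each step sums x[i] * L'[i][j] into page j; reading the slide's Lx in that row-vector convention is also an assumption.

```python
def power_iteration(links, eps=0.01, steps=200):
    """Iterate importance over L'[i][j] = L[i][j] + eps, renormalizing each step."""
    pages = sorted(links)
    x = {p: 1.0 / len(pages) for p in pages}
    for _ in range(steps):
        nxt = {p: 0.0 for p in pages}
        for i in pages:
            out_degree = len(links[i])
            for j in pages:
                # L[i][j] = 1/d[i] if there is an edge i -> j, else 0; then add eps.
                weight = eps + (1.0 / out_degree if j in links[i] else 0.0)
                nxt[j] += x[i] * weight
        total = sum(nxt.values())
        x = {p: v / total for p, v in nxt.items()}  # keep the vector summing to 1
    return x

# Hypothetical graph: page -> pages it points to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = power_iteration(links)
print(ranks)
```

Because every pair of pages is joined by at least the small eps edge, the iteration cannot get trapped in a disconnected component or oscillate around a cycle, which is the point of the L' construction.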
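Algorithm for static graphs
Each page k has current cash C[k] and credit history H[k]. G is the total
history (the sum of all H[k]), and the total cash is normalized to 1.
At each step, an estimate of any page k’s importance is (H[k] + C[k]) / (G + 1)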
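A minimal Python sketch of the static-graph loop, using the slide's names C, H, and G. It assumes that every page has at least one out-link and that pages are picked uniformly at random; the function name `opic`, the `seed` parameter, and the example graph are hypothetical.

```python
import random

def opic(links, steps, seed=0):
    """Run `steps` crawl steps and return (H[k] + C[k]) / (G + 1) for every page k."""
    rng = random.Random(seed)
    pages = sorted(links)
    C = {p: 1.0 / len(pages) for p in pages}  # current cash, sums to 1
    H = {p: 0.0 for p in pages}               # credit history
    G = 0.0                                   # total history, G = sum of H
    for _ in range(steps):
        i = rng.choice(pages)                 # page selection: random here
        amount = C[i]
        H[i] += amount                        # record the cash in the history
        G += amount
        C[i] = 0.0                            # the page gives away all its cash
        share = amount / len(links[i])
        for j in links[i]:
            C[j] += share                     # equal share to each child
    return {k: (H[k] + C[k]) / (G + 1.0) for k in pages}

# Hypothetical graph: page -> pages it points to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
estimate = opic(links, steps=20000)
print(estimate)
```

Only the two vectors C and H and the scalar G are stored; the link matrix itself is never materialized, since each page's out-links are only needed at the moment it is crawled.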
Crawling strategies
There are two main strategies here:
• Random: We choose the next page to crawl
randomly, with equal probability for every page.
• Greedy: We next read the page with the highest cash.
This is a greedy way to decrease the value of the
error factor.
The choice of strategy affects the convergence speed.
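The two strategies differ only in how the next page is chosen from the current cash values. A minimal Python sketch, with hypothetical function names and a hypothetical cash table:

```python
import random

def pick_random(cash, rng):
    """Random strategy: every page is equally likely to be crawled next."""
    return rng.choice(sorted(cash))

def pick_greedy(cash):
    """Greedy strategy: crawl the page that currently holds the most cash."""
    return max(sorted(cash), key=cash.get)

# Hypothetical cash values for three pages.
cash = {"a": 0.1, "b": 0.6, "c": 0.3}
print(pick_greedy(cash))                    # b
print(pick_random(cash, random.Random(0)))  # one of a, b, c
```

Greedy moves the largest available amount of cash into the history at each step, which is why the slide describes it as a greedy way to reduce the error factor.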
The Adaptive OPIC algorithm (for changing graphs)
• Based on a time window
  • Fixed window
  • Variable window
  • Interpolation
• Two main dimensions
  • The page selection strategy that is used (e.g., Greedy or Random)
  • The window policy that is considered (e.g., Fixed Window or Interpolation)
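One way to picture the fixed-window policy is a per-page log that forgets old cash. The sketch below is an assumption-laden illustration rather than the algorithm from the talk: it assumes each crawl of a page is logged as a (time, cash) pair and that only entries newer than the window length count toward the page's history. The class name `FixedWindowHistory`, its methods, and the time unit are hypothetical.

```python
from collections import deque

class FixedWindowHistory:
    """Credit history of one page, restricted to the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # (time, cash) pairs, oldest first

    def record(self, time, cash):
        """Log the cash a page held when it was crawled at `time`."""
        self.events.append((time, cash))
        self._expire(time)

    def _expire(self, now):
        # Drop entries that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def total(self, now):
        """History that still counts at time `now`."""
        self._expire(now)
        return sum(cash for _, cash in self.events)

h = FixedWindowHistory(window=10)
h.record(0, 0.5)
h.record(4, 0.2)
h.record(12, 0.1)    # the entry from time 0 is now outside the window
print(h.total(12))
```

Because old cash eventually drops out, a page whose in-links disappear loses importance over time, which a history that only ever grows cannot express.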
Pros
• It can start even when a (large) part of the matrix is
still unknown
• It is integrated in the crawling process
• It works on-line, even while the graph is being
updated
• It requires less storage than standard
algorithms
• It requires less CPU, memory, and disk access than
standard algorithms
Cons
• It is strictly tailored to the computational cost model
of crawling the Web
• It converges more slowly than other algorithms after
reading the same pages
Q&A