Accelerating Ranking-system using WebGraph

Project Report
by
Padmaja Adipudi
Outline of My Talk
• Needle Search Engine/Ranking-System
• Ranking-System Issue/Resolution
– Accelerating Ranking-System using WebGraph
– Ranking Algorithms Overview
– Google’s PageRank, ClusterRank, SourceRank & Truncated
PageRank
• Experimental Results
– Efficiency Measure
– Quality Measure
• Conclusion
– Which algorithm is better in terms of Efficiency & Quality
Search Engine
• The Web is a terrific place to get
information on any topic.
• A Search Engine is a useful application for
information retrieval on the WWW.
• A Search Engine has five basic
components: a Crawler, a Parser, a
Ranking-System, a Repository and a
Front-End.
Ranking-System
• Determines the importance of a Web
page.
• Google's PageRank algorithm is a
famous Ranking-System and is based
on URL link structure.
• In Google’s PageRank, the importance
of a Web page is based on the
importance of its parent Web pages.
Needle Search Engine
• A Search Engine developed by former
students at UCCS.
• The ClusterRank algorithm is implemented
as the Ranking-System.
• A former student, Yi Zhang, developed
the Cluster ranking system, which takes
an average of 3 hours to rank 300,000
URLs.
Ranking-System Issue
• The major issue with the current ranking
system is its long update time: 3
hours for 300K URLs.
• As the number of pages increases, this
will become a severe problem.
Project Goal
• Accelerate the existing Ranking-System
of the Needle Search Engine at UCCS
using a package called “WebGraph”.
• Upgrade the Needle Search Engine
system from 50K crawled Web pages
to 1 Million crawled Web pages.
Steps to reach Goal
• Use WebGraph package to represent the
graph efficiently using compression
techniques.
• Compute the Page-Rank using algorithms
namely ClusterRank, SourceRank and
Truncated PageRank.
• Compare the ClusterRank results with those of
SourceRank and Truncated PageRank on time and
quality measures, and choose the best algorithm
for the Needle Search Engine.
Work Flow
[Diagram: Compressed Graph → ClusterRank, SourceRank, Truncated PageRank → Page Rank Results]
Why Truncated & Source
Algorithms
• These are the latest papers available in
the Page Ranking area.
• The authors used the WebGraph package for
their experiments while developing the
algorithms.
Node Graph
• Node graph is used in ranking system.
• Node graph consists of nodes and
directed links from node to node.
• URLs are represented by nodes and the
hyperlinks are represented as directed
links between nodes.
• Compression techniques are used to represent
the Node graph efficiently.
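A minimal sketch of how such a node graph might be built from (source URL, target URL) link pairs. The function name `build_node_graph` and the input layout are assumptions made for illustration; they are not taken from the Needle code base.

```python
def build_node_graph(links):
    """Map URLs to node ids 0..n-1 and collect each node's successors.

    `links` is an iterable of (source_url, target_url) pairs (hypothetical input).
    """
    node_id = {}       # URL -> integer node id, assigned in order of first appearance
    successors = []    # successors[i] = set of node ids that node i links to

    def id_of(url):
        if url not in node_id:
            node_id[url] = len(successors)
            successors.append(set())
        return node_id[url]

    for src, dst in links:
        s, d = id_of(src), id_of(dst)
        successors[s].add(d)   # a directed link from s to d

    # Sorted lists make the structure easy to serialize later.
    return node_id, [sorted(s) for s in successors]
```

With the pairs (a, b), (a, c), (b, c), (b, a), the URLs a, b and c receive the ids 0, 1 and 2, and the successor lists are [1, 2], [0, 2] and an empty list.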
Google’s PageRank
• Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd
from Stanford University, 1999.
• Importance of a page is based on the incoming link count and also
on how important those incoming links are.
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– PR(Tn): Each page has a notion of its own self-importance.
That’s “PR(T1)” for the first page in the web all the way up to
PR(Tn) for the last page.
– C(Tn): Each page spreads its vote out evenly amongst all of its
outgoing links. The count, or number, of outgoing links for page
1 is C(T1), C(Tn) for page n, and so on for all pages.
– PR(Tn)/C(Tn): if a page (page A) has a back link from page N,
the share of the vote page A gets is PR(Tn)/C(Tn).
– d: All these fractions of votes are added together but, to stop
the other pages having too much influence, this total vote is
"damped down" by multiplying it by 0.85 (the factor d).
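The formula can be evaluated by repeated substitution. The sketch below assumes the graph is given as a dictionary from each page to the list of pages it links to, and that every linked page also appears as a key; the fixed iteration count and the handling of pages without out-links are simplifications of this sketch.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                 # start every page at rank 1
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}     # the (1 - d) term
        for t, targets in out_links.items():
            if not targets:
                continue                         # a page with no out-links casts no votes here
            share = d * pr[t] / len(targets)     # d * PR(T) / C(T)
            for a in targets:
                new_pr[a] += share
        pr = new_pr
    return pr
```

For example, `pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})` ranks C highest, since C collects votes from both A and B.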
ClusterRank
• Yi Zhang, a student at UCCS is the author,
2006.
• Algorithm is based on Google’s PageRank.
• Designed to speed up PageRank calculation
and also to group
similar Web pages together into clusters.
• The original PageRank algorithm is applied
to Clusters.
• The rank is then distributed to the members of
each cluster by weighted average.
ClusterRank (Cont’d)
• Group all pages into clusters.
• Perform first level clustering for dynamically
generated pages.
• URLs that differ only after a “?” or “#” are grouped together.
• Example: All URLs below will be grouped into one
Cluster
– http://www.uccs.edu/057/cs_sub.shtml
– http://www.uccs.edu/057/cs_sub.shtml#news
– http://www.uccs.edu/057/cs_sub.shtml#dates
– http://www.uccs.edu/057/cs_sub.shtml#spotlight
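A minimal sketch of this first-level grouping, assuming the cluster key is simply the URL with its query string (“?”) and fragment (“#”) removed. The function names are illustrative, not those of the ClusterRank implementation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def first_level_key(url):
    """Drop everything from the first "?" or "#" onward."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def first_level_clusters(urls):
    """Group URLs that share the same key into one cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[first_level_key(url)].append(url)
    return dict(clusters)
```

Applied to the four example URLs, this yields a single cluster keyed by http://www.uccs.edu/057/cs_sub.shtml.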
ClusterRank (Cont’d)
• Perform second level clustering on virtual
directory and graph density.
• URLs are grouped based on the last “/”
symbol of the URL.
• Density is calculated for the proposed
clusters.
• Approve the cluster based on the pre-set
threshold value.
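A rough sketch of the second-level step. The slides do not define density, so this sketch assumes it is the number of links between members of a proposed cluster divided by the n * (n - 1) possible directed links, and that a cluster is approved when its density reaches the threshold. The function names and the default threshold are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def virtual_directory(url):
    """Return the URL up to and including the last "/" of its path."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def density(members, links):
    """Assumed definition: internal links divided by the n * (n - 1) possible ones."""
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(1 for src, dst in links
                   if src != dst and src in members and dst in members)
    return internal / (n * (n - 1))


def approved_clusters(urls, links, threshold=0.5):
    """Propose one cluster per virtual directory and keep the dense ones."""
    proposed = defaultdict(set)
    for url in urls:
        proposed[virtual_directory(url)].add(url)
    unique_links = set(links)   # count each directed link once
    return {d: m for d, m in proposed.items()
            if density(m, unique_links) >= threshold}
```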
ClusterRank (Cont’d)
• Calculate the rank for each cluster using the
original PageRank algorithm.
• Distribute the rank number to its members by
weighted average by using:
– PR = CR * Pi/Ci.
– The notations here are:
– PR: The rank of a member page
– CR: The cluster rank from previous stage
– Pi: The incoming links of this page
– Ci: Total incoming links of this cluster.
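The distribution step PR = CR * Pi/Ci can be sketched as follows. The even split used when a cluster has no incoming links at all is an assumption of this sketch, not something stated in the slides.

```python
def distribute_cluster_rank(cluster_rank, incoming_links):
    """Split a cluster's rank CR among its pages in proportion to Pi / Ci.

    `incoming_links` maps each member page to its incoming link count Pi.
    """
    total = sum(incoming_links.values())          # Ci
    if total == 0:
        # Assumption: with no incoming links, share the rank evenly.
        share = cluster_rank / len(incoming_links)
        return {page: share for page in incoming_links}
    return {page: cluster_rank * pi / total
            for page, pi in incoming_links.items()}
```

For a cluster with rank 2.0 whose two pages have 3 and 1 incoming links, the pages receive 1.5 and 0.5.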
SourceRank
• James Caverlee, Ling Liu, and S. Webb from
Georgia Institute of Technology, 2007.
• The Web graph is represented as Sources.
• The Source is a logical collection of Web
pages.
• Assigns a score to each page based on the
overall quality of the source that the page
belongs to, through a random walk over Web
sources.
SourceRank (Cont’d)
• Group all pages into Sources based on
“Domain”.
• URLs are grouped based on the first “/” symbol of
the URL
• Example: All URLs below will be grouped in to
one Source
– http://office.microsoft.com/en-us/default.aspx
– http://office.microsoft.com/en-us/assistance/default.aspx
– http://office.microsoft.com/enus/assistance/CH790018071033.aspx
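A minimal sketch of grouping by domain, assuming the source of a page is the host part of its URL (everything between “://” and the first “/”). The function names are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def source_of(url):
    """Return the host part of the URL, used here as the source name."""
    return urlsplit(url).netloc.lower()


def group_into_sources(urls):
    """Group URLs that share the same host into one source."""
    sources = defaultdict(list)
    for url in urls:
        sources[source_of(url)].append(url)
    return dict(sources)
```

The three example URLs all fall into the single source office.microsoft.com.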
SourceRank (Cont’d)
• Calculate the rank for each Source with the original PageRank
algorithm
• Distribute the rank number to its members by weighted average
by using:
– PR = SR * Si
– The notations here are:
– PR: The rank of a member page
– SR: The source rank from previous stage
– Si: Total incoming unique links of this source
Truncated PageRank
• L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and
R. Baeza-Yates from Italy, 2006.
• In PageRank, a Web page can gain a high PageRank score
from supporters (in-links) that are
topologically “close” to the target node.
• Spammers can afford to influence only a few levels.
• Truncated PageRank is similar to PageRank, except
that the supporters that are too “close” to a target
node do not contribute towards its ranking.
Truncated PageRank (Cont’d)
• $PR(p) = \sum_{t} C \alpha^{t} \cdot M^{t} = \sum_{t} \mathrm{damping}(t) \cdot M^{t}$
The notations here are:
C: Normalization constant
$\alpha$: The damping factor
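A minimal sketch of the truncation idea, under stated assumptions: M is taken to be the link matrix in which each page splits its weight evenly over its out-links, damping(t) is set to 0 for the first `truncation` levels and proportional to alpha^t afterwards, and the constant C is replaced by a final normalization. The parameter names are chosen for this sketch.

```python
def truncated_pagerank(out_links, alpha=0.85, truncation=1, max_steps=50):
    """Sum the alpha^t * M^t contributions for t > truncation only."""
    pages = list(out_links)
    n = len(pages)
    contribution = {p: 1.0 / n for p in pages}   # the t = 0 term
    score = {p: 0.0 for p in pages}
    for t in range(max_steps + 1):
        if t > truncation:                       # close supporters (small t) are skipped
            for p in pages:
                score[p] += contribution[p]
        nxt = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            if targets:
                share = alpha * contribution[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        contribution = nxt                       # now holds the t + 1 term
    total = sum(score.values())
    # Normalizing plays the role of the constant C in this sketch.
    return {p: s / total for p, s in score.items()} if total else score
```

On the two-page graph A → B, B is ranked with `truncation=0` but receives no score with `truncation=1`, because its only supporter is one link away.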
WebGraph Package
• Paolo Boldi and Sebastiano Vigna from Italy, 2004.
• Represents the Node graph efficiently using the
Differential compression technique.
• Allows applications to encode compactly a new
version of data with respect to a previous or
reference version of same data.
• WebGraph can compress the WebBase graph (118
Mnodes, 1 Glinks) in as little as 3.08 bits per link, and
its transposed version in as little as 2.89 bits per link.
• WebBase is a repository of Web pages crawled
at Stanford University.
WebGraph Package (Cont’d)
• Node graph initial representation:
• Node graph with Reference compression:
WebGraph Package (Cont’d)
• Node graph with Differential compression:
• Differential compression allows a link to be coded in less
than one bit (not possible with plain Reference
compression).
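To give a feel for why storing differences saves space, here is a sketch of plain gap encoding, a simpler relative of the schemes above: each successor is stored as the distance from the previous one, which turns large node ids into small numbers. This illustrates the general idea only and is not WebGraph's actual codec; the function names are made up for the example.

```python
from itertools import accumulate


def to_gaps(successors):
    """Replace a sorted successor list by first value + consecutive differences."""
    gaps = []
    previous = 0
    for s in successors:
        gaps.append(s - previous)
        previous = s
    return gaps


def from_gaps(gaps):
    """Recover the successor list as the running sum of the gaps."""
    return list(accumulate(gaps))
```

For the successor list 13, 15, 16, 17, 18, 19, 23, 24, 203 the gaps are 13, 2, 1, 1, 1, 1, 4, 1, 179. Most of them are small, so they fit in few bits under a variable-length code.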
WebGraph Package (Cont’d)
[Diagram: Link Structure from DB → Graph in ASCII format → Graph in BV format → PageRank Module]
BVGraph Details
• BVGraph: Boldi Vigna Graph
• BVGraph is generated using a graph that is
represented in ASCII format.
• The first line contains the number of nodes
‘n’. Then ‘n’ lines follow, the i-th line containing
the successors of node ‘i’ in
increasing order (nodes are numbered from 0
to n-1). The successors are separated by a
single space.
BVGraph Details (Cont’d)
• For example, consider a graph of three
vertices, a, b, and c, consisting of the
following edges:
• (a, b) (a, c) (b, c) (b, a)
• (a:0, b:1, c:2)
• This graph could be expressed as below
[Diagram: graph with nodes A, B and C]
3
1 2
0 2
1
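A minimal sketch of a writer for the ASCII format described above, assuming the graph is already available as one successor list per node. The function name `to_ascii_graph` is hypothetical.

```python
def to_ascii_graph(successors):
    """Serialize successor lists into the ASCII graph text format.

    successors[i] is the list of successors of node i (nodes are 0..n-1).
    """
    lines = [str(len(successors))]                # first line: number of nodes n
    for succ in successors:
        # One line per node, successors in increasing order, single spaces.
        lines.append(" ".join(str(s) for s in sorted(succ)))
    return "\n".join(lines) + "\n"
```

For the successor lists [1, 2], [0, 2] and [1], the function returns the four lines 3, 1 2, 0 2 and 1. A node without successors produces an empty line.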
BVGraph – Current
Implementation
• The URLLinkStructure table in the Database holds the
linking information.
• An ASCII graph is generated from the data in the
URLLinkStructure table, and then the BVGraph is
generated from it.
• The ASCII graph is stored as basename.graph-txt
• The BVGraph is generated using the command:
– java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph basename bvbasename
BVGraph – Current Implementation
(Cont’d)
• The graph can be generated for incoming
links as well as outgoing links.
• BVnode-in, BVnode-out, BVSource-in graphs
are generated.
• A BVGraph can be loaded using one of two loading
methods, load and loadOffline.
• The load method is used for small graphs.
• The loadOffline method is used for large
graphs.
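The incoming-link graph is the transpose of the outgoing-link graph. A minimal sketch of that transposition on plain successor lists follows; it is independent of WebGraph's own tooling, and the function name is illustrative.

```python
def transpose(successors):
    """Turn out-link lists into in-link lists.

    successors[i] lists the nodes that node i links to; the result lists,
    for each node, the nodes that link to it.
    """
    predecessors = [[] for _ in successors]
    for src, targets in enumerate(successors):
        for dst in targets:
            predecessors[dst].append(src)   # the link src -> dst becomes dst <- src
    return predecessors
```

Because source nodes are visited in increasing order, each in-link list comes out already sorted.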
ClusterRank Using BVGraph
URLs    Without BVGraph           With BVGraph
        (per iteration, in sec)   (per iteration, in sec)
300K    9452                      7737
ClusterRank Using BVGraph (Cont’d)
• Time gain using WebGraph for 300K URLs
[Bar chart: total time in seconds for 300K URLs. Without WebGraph: 9452; with WebGraph: 7737.]
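The size of the gain follows directly from the two measured times for 300K URLs (9452 seconds without and 7737 seconds with WebGraph). The short calculation below is only a worked check; the function names are chosen for this sketch.

```python
def time_saving(before_sec, after_sec):
    """Fraction of the original running time that was saved."""
    return (before_sec - after_sec) / before_sec


def speedup(before_sec, after_sec):
    """How many times faster the new run is."""
    return before_sec / after_sec


saving = time_saving(9452, 7737)   # about 0.18, i.e. roughly 18% less time
factor = speedup(9452, 7737)       # about 1.22
```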
Time Measure for Algorithms
(in Seconds)
Data set                      633061 URLs   289503 URLs   4 M URLs
Node InLinks                  2905183       21781790      28346447
Average InLinks per Node      4.6           78.06         5.82
Clusters                      48271         164136        256919
Cluster InLinks               983579        18210270      9120926
Average InLinks per Cluster   16.35         109.35        32.54
Sources                       425           14892         482
Source InLinks                75217         9988138       509693
Average InLinks per Source    176.98        670.8         1057.45

Algorithm                     633061 URLs   289503 URLs   4 M URLs
Cluster Rank                  422           6780          2520
Source Rank                   3             660           21
Truncated PageRank            2             12            17
Time Measure for Algorithms
(Cont’d)
[Chart: Time measure between algorithms per iteration, in seconds, against Node InLinks (1: 2905183, 2: 21781790, 3: 28346447). Cluster Rank: 422, 6780, 2520. Source Rank: 3, 660, 21. Truncated PageRank: 2, 12, 17.]
Time Measure for Algorithms
(Cont’d)
[Chart: Cluster Rank time measure based on Cluster InLinks, in seconds. Cluster InLinks (1: 983579, 2: 9120926, 3: 18210270); times: 422, 2520, 6780.]
Time Measure for Algorithms
(Cont’d)
[Chart: Source Rank time measure based on Source InLinks, in seconds. Source InLinks (1: 75217, 2: 509693, 3: 9988138); times: 3, 21, 660.]
Time Measure for Algorithms
(Cont’d)
[Chart: Truncated PageRank time measure based on Node InLinks, in seconds. Node InLinks (1: 2905183, 2: 21781790, 3: 28346447); times: 2, 12, 17.]
Node In-Link Distribution across
Nodes (4M URLs)
[Chart: Distribution of Nodes and InLinks for 4M. x-axis: # of InLinks (0 to 65578); y-axis: # of Nodes (0 to 4500000).]
Cluster In-Link Distribution
across Clusters (4M URLs)
[Chart: Distribution of Clusters and InLinks for 4M. x-axis: # of InLinks (0 to 54223); y-axis: # of Clusters (0 to 140000).]
Source In-Link Distribution
across Sources (4M URLs)
[Chart: Distribution of Sources and InLinks for 4M. x-axis: # of InLinks (0 to 13950); y-axis: # of Sources (0 to 100).]
Quality Measure for Algorithms
• Survey performed on quality of ranking algorithms, using 25
search keywords, by a group of people.
• Obtained keywords from Google’s Keyword tool at:
https://adwords.google.com/select/KeywordToolExternal
• Listed below are the keywords identified:
pictures, university, faculty, stadium, undergraduate, map,
admissions, scholarships, loan, mba, alumni, computer, graduate,
business, research, students, technology, accommodation, campus,
vacations, dean, aid, parking, department, gpa
Quality Measure for Algorithms
(Cont’d)
• Survey performed to identify the following from
keyword search:
– First page accuracy
– Second page accuracy
– Result order on the first page
– Result order on the second page
– Overall, are the important pages showing up
early?
– Overall, what percentage of the result hits are
relevant?
Quality Measure For Algorithms
(Cont’d)
Algorithm             Quality measure on a scale of 1 to 5 (1 being the best)
ClusterRank           2.06
SourceRank            1.65
Truncated PageRank    2.94
Conclusion
• The ClusterRank computation can be accelerated
using WebGraph.
• The SourceRank algorithm takes less time for
Page-Rank calculation than ClusterRank
and is close to Truncated PageRank for the
existing 4M URLs.
• SourceRank has the best quality score of
the three algorithms.
• Considering both Efficiency and Quality,
SourceRank is the best of the three for the
existing data, based on the experiments performed.
Success Criteria
• Identified the efficiency of the Page-Rank
computation algorithms using time measures
generated by experiments.
• Identified the quality of the algorithms using
manual survey results.
• Implemented the efficient algorithm for the
Needle Search Engine at UCCS.
• Upgraded the existing Needle Search Engine
to 1 Million pages (crawled, actual URLs are
4 Million) from the current 50K URLs
(crawled, actual URLs are 300K).
References
• [1] Paolo Boldi, Sebastiano Vigna. The
WebGraph Framework I: Compression
Techniques.
http://www2004.org/proceedings/docs/1p595.pdf
• [2] Yen-Yu Chen, Qingqing Gan, Torsten
Suel. I/O-Efficient Techniques for Computing
PageRank.
http://cis.poly.edu/suel/papers/pagerank.pdf
• [3] Taher H. Haveliwala. Efficient
Computation of PageRank.
References (Cont’d)
• [4] Yi Zhang. Design and Implementation of a
Search Engine with the Cluster Rank
Algorithm.
• [5] John A. Tomlin. A New Paradigm for
Ranking Pages on the World Wide Web.
• [6] Lawrence Page, Sergey Brin, Rajeev
Motwani, Terry Winograd. The PageRank
Citation Ranking: Bringing Order to the Web.
http://www.cs.huji.ac.il/~csip/1999-66.pdf
References (Cont’d)
• [7] Ricardo Baeza-Yates, Paolo Boldi, Carlos
Castillo. Generalizing PageRank: Damping
Functions for Link-Based Ranking Algorithms.
http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdf
• [8] Gonzalo Navarro. Compressing Web
Graphs like Texts.
• [9] The Spider's Apprentice.
http://www.monash.com/spidap1.html
References (Cont’d)
• [10] James Caverlee, Ling Liu, S. Webb. Spam-Resilient
Web Ranking via Influence Throttling.
http://wwwstatic.cc.gatech.edu/~caverlee/pubs/caverlee07ipdps.pdf
• [11] G. Jeh, J. Widom. SimRank: A Measure of
Structural-Context Similarity.
http://www-csstudents.stanford.edu/~glenj/simrank.pdf
• [12] L. Becchetti, C. Castillo, D. Donato, S. Leonardi,
and R. Baeza-Yates. Using Rank Propagation and
Probabilistic Counting for Link-Based Spam Detection.
Technical report, 2006.