Accelerating Ranking-System Using WebGraph
Project Report by Padmaja Adipudi

Outline of My Talk
• Needle Search Engine/Ranking-System
• Ranking-System Issue/Resolution
  – Accelerating Ranking-System using WebGraph
  – Ranking Algorithms Overview
  – Google’s PageRank, ClusterRank, SourceRank & Truncated PageRank
• Experimental Results
  – Efficiency Measure
  – Quality Measure
• Conclusion
  – Which algorithm is better in terms of Efficiency & Quality

Search Engine
• The Web is a rich source of information on almost any topic.
• A Search Engine is an application for information retrieval on the WWW.
• A Search Engine has five basic components: a Crawler, a Parser, a Ranking-System, a Repository and a Front-End.

Ranking-System
• Determines the importance of a Web page.
• Google's PageRank algorithm is the best-known Ranking-System and is based on the URL link structure.
• In Google’s PageRank, the importance of a Web page is based on the importance of its parent Web pages.

Needle Search Engine
• A Search Engine developed by former students at UCCS.
• The ClusterRank algorithm is implemented as the Ranking-System.
• Former student Yi Zhang developed the Cluster ranking system, which takes an average of 3 hours to rank 300,000 URLs.

Ranking-System Issue
• The major issue with the current ranking system is its long update time: 3 hours for 300K URLs.
• As the number of pages increases, this becomes a severe problem.

Project Goal
• Accelerate the existing Ranking-System of the Needle Search Engine at UCCS using a package called “WebGraph”.
• Upgrade the Needle Search Engine from 50K crawled Web pages to 1 Million crawled Web pages.

Steps to reach Goal
• Use the WebGraph package to represent the graph efficiently using compression techniques.
• Compute the Page-Rank using three algorithms: ClusterRank, SourceRank and Truncated PageRank.
• Compare the time and quality results of ClusterRank with those of SourceRank and Truncated PageRank, and choose the best algorithm for the Needle Search Engine.

Work Flow
[Diagram: the Compressed Graph is the input to ClusterRank, SourceRank and Truncated PageRank, which produce the Page Rank Results]

Why Truncated & Source Algorithms
• These are among the most recent papers available in the Page Ranking area.
• Their authors used the WebGraph package for their experiments while developing the algorithms.

Node Graph
• A Node graph is used in the ranking system.
• A Node graph consists of nodes and directed links from node to node.
• URLs are represented by nodes, and hyperlinks are represented as directed links between nodes.
• Compression techniques are used to represent the Node graph efficiently.

Google’s PageRank
• Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd from Stanford University, 1999.
• The importance of a page is based on the number of incoming links and on how important the pages providing those links are.
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  – PR(Tn): Each page has a notion of its own self-importance. That is PR(T1) for the first page linking to A, all the way up to PR(Tn) for the last one.
  – C(Tn): Each page spreads its vote out evenly amongst all of its outgoing links. The number of outgoing links is C(T1) for page T1, C(Tn) for page Tn, and so on for all pages.
  – PR(Tn)/C(Tn): If page A has a back link from page Tn, the share of the vote page A gets is PR(Tn)/C(Tn).
  – d: All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor d).

ClusterRank
• Yi Zhang, a student at UCCS, is the author, 2006.
• The algorithm is based on Google’s PageRank.
• Designed to speed up the PageRank calculation and to group similar Web pages together into clusters.
• The original PageRank algorithm is applied on Clusters.
• The rank is then distributed to the members of the cluster by weighted average.

ClusterRank (Cont’d)
• Group all pages into clusters.
• Perform first-level clustering for dynamically generated pages.
• URLs are grouped based on the “?” and “#” symbols.
• Example: All URLs below will be grouped into one Cluster
  – http://www.uccs.edu/057/cs_sub.shtml
  – http://www.uccs.edu/057/cs_sub.shtml#news
  – http://www.uccs.edu/057/cs_sub.shtml#dates
  – http://www.uccs.edu/057/cs_sub.shtml#spotlight

ClusterRank (Cont’d)
• Perform second-level clustering on virtual directory and graph density.
• URLs are grouped based on the last “/” symbol of the URL.
• Density is calculated for the proposed clusters.
• A proposed cluster is approved based on the pre-set threshold value.

ClusterRank (Cont’d)
• Calculate the rank for each cluster using the original PageRank algorithm.
• Distribute the rank to its members by weighted average using:
  – PR = CR * Pi/Ci
  – The notations here are:
  – PR: The rank of a member page
  – CR: The cluster rank from the previous stage
  – Pi: The incoming links of this page
  – Ci: Total incoming links of this cluster

SourceRank
• James Caverlee, Ling Liu, and S. Webb from Georgia Institute of Technology, 2007.
• The Web graph is represented as Sources.
• A Source is a logical collection of Web pages.
• Assigns a score to each page based on the overall quality of the Source that the page belongs to, through a random walk over Web Sources.

SourceRank (Cont’d)
• Group all pages into Sources based on “Domain”.
• URLs are grouped based on the first “/” symbol after “http://”, that is, on the host name.
• Example: All URLs below will be grouped into one Source
  – http://office.microsoft.com/en-us/default.aspx
  – http://office.microsoft.com/en-us/assistance/default.aspx
  – http://office.microsoft.com/en-us/assistance/CH790018071033.aspx

SourceRank (Cont’d)
• Calculate the rank for each Source with the original PageRank algorithm.
• Distribute the rank to its members by weighted average using:
  – PR = SR * Si
  – The notations here are:
  – PR: The rank of a member page
  – SR: The source rank from the previous stage
  – Si: Total incoming unique links of this source

Truncated PageRank
• L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates from Italy, 2006.
• In PageRank, a Web page can gain a high PageRank score from supporters (in-links) that are topologically “close” to the target node.
• Spammers can afford to influence only a few levels.
• Truncated PageRank is similar to PageRank, except that the supporters that are too “close” to a target node do not contribute towards its ranking.

Truncated PageRank (Cont’d)
• PR(p) = Σ_t C·α^t · M^t = Σ_t damping(t) · M^t
• The notations here are:
  – C: Normalization constant
  – α: The damping factor
• In Truncated PageRank, damping(t) is set to 0 for the first T levels (t ≤ T), so supporters within T links of the target node do not contribute.

WebGraph Package
• Paolo Boldi and Sebastiano Vigna from Italy, 2004.
• Represents the Node graph efficiently using a Differential compression technique.
• Differential compression allows applications to compactly encode a new version of data with respect to a previous or reference version of the same data.
• WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.
• WebBase is a repository of crawled Web pages from Stanford University.
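
The idea of encoding one list with respect to a reference list can be sketched in a few lines of Python. This is a simplified illustration only, not the actual BVGraph bit-level format: the function names and the sample successor lists are hypothetical, and the real package adds further steps (such as compact integer codes) that are left out here. A node's successor list is stored as a copy list (one bit per entry of a reference node's list) plus the extra nodes that the reference does not have; the copy list is then run-length encoded into blocks.

```python
def encode_with_reference(reference, successors):
    """Describe `successors` relative to `reference` (both sorted lists).

    Returns (copy_list, extra_nodes):
      copy_list   - one bit per reference entry, 1 if it is also a successor
      extra_nodes - successors that do not appear in the reference list
    """
    successor_set = set(successors)
    reference_set = set(reference)
    copy_list = [1 if node in successor_set else 0 for node in reference]
    extra_nodes = [node for node in successors if node not in reference_set]
    return copy_list, extra_nodes


def to_copy_blocks(copy_list):
    """Run-length encode the copy list as alternating run lengths.

    The first run counts 1s (and may have length 0), the next counts 0s,
    and so on.
    """
    blocks = []
    current_bit = 1
    run_length = 0
    for bit in copy_list:
        if bit == current_bit:
            run_length += 1
        else:
            blocks.append(run_length)
            current_bit = bit
            run_length = 1
    blocks.append(run_length)
    return blocks


def from_copy_blocks(blocks):
    """Expand alternating run lengths back into the copy list."""
    copy_list = []
    bit = 1
    for run_length in blocks:
        copy_list.extend([bit] * run_length)
        bit = 1 - bit
    return copy_list


def decode_with_reference(reference, copy_list, extra_nodes):
    """Rebuild the sorted successor list from the reference-based encoding."""
    copied = [node for node, bit in zip(reference, copy_list) if bit]
    return sorted(copied + extra_nodes)
```

For example, with a reference list `[13, 15, 16, 17, 18, 19, 23, 24, 203]` and a successor list `[15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]`, five of the ten successors are copied from the reference and only five extra nodes need to be stored explicitly. When two pages share long runs of links, a whole run of copied links costs a single number in the block form, which is why this kind of encoding can be very compact.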

WebGraph Package (Cont’d)
• Node graph initial representation: [figure not recovered]
• Node graph with Reference compression: [figure not recovered]

WebGraph Package (Cont’d)
• Node graph with Differential compression: [figure not recovered]
• Differential compression makes it possible to code a link in less than one bit (not possible with plain Reference compression).

WebGraph Package (Cont’d)
[Diagram: Link Structure from DB → Graph in ASCII format → Graph in BV format → PageRank Module]

BVGraph Details
• BVGraph: Boldi Vigna Graph
• A BVGraph is generated from a graph that is represented in ASCII format.
• The first line contains the number of nodes ‘n’; then ‘n’ lines follow, the i-th line containing the successors of node ‘i’ in increasing order (nodes are numbered from 0 to n-1). The successors are separated by a single space.

BVGraph Details (Cont’d)
• For example, consider a graph of three vertices, a, b, and c, consisting of the following edges:
• (a, b) (a, c) (b, c) (b, a)
• Node numbering: a:0, b:1, c:2
• This graph could be expressed as below:
    3
    1 2
    0 2
    (empty line: node c has no successors)

BVGraph – Current Implementation
• The URLLinkStructure table in the Database holds the linking information.
• The ASCII graph is generated from the data in the URLLinkStructure table, and then the BVGraph is generated from it.
• The ASCII graph is stored as basename.graph-txt
• The BVGraph is generated using the command:
  – java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph basename bvbasename

BVGraph – Current Implementation (Cont’d)
• The graph can be generated for incoming links as well as outgoing links.
• BVnode-in, BVnode-out and BVSource-in graphs are generated.
• A BVGraph can be loaded using one of two loading methods, load and loadOffline.
• The load method is used for small graphs.
• The loadOffline method is used for large graphs.

ClusterRank Using BVGraph
URLs    Without BVGraph (per iteration, in sec)    With BVGraph (per iteration, in sec)
300K    9452                                       7737

ClusterRank Using BVGraph (Cont’d)
• Time gain using WebGraph for 300K URLs
[Chart: Total time gain using WebGraph for 300K URLs; time in seconds: 9452 without WebGraph, 7737 with WebGraph]

Time Measure for Algorithms (in Seconds)
• Data set with 633061 URLs
  – Node InLinks: 2905183; Average InLinks per Node: 4.6
  – Clusters: 48271; Cluster InLinks: 983579; Average InLinks per Cluster: 16.35
  – Sources: 425; Source InLinks: 75217; Average InLinks per Source: 176.98
• Data set with 289503 URLs
  – Node InLinks: 21781790; Average InLinks per Node: 78.06
  – Clusters: 164136; Cluster InLinks: 18210270; Average InLinks per Cluster: 109.35
  – Sources: 14892; Source InLinks: 9988138; Average InLinks per Source: 670.8
• Data set with 4 M URLs
  – Node InLinks: 28346447; Average InLinks per Node: 5.82
  – Clusters: 256919; Cluster InLinks: 9120926; Average InLinks per Cluster: 32.54
  – Sources: 482; Source InLinks: 509693; Average InLinks per Source: 1057.45

URLs      Cluster Rank    Source Rank    Truncated PageRank
633061    422             3              2
289503    6780            660            12
4 M       2520            21             17

Time Measure for Algorithms (Cont’d)
[Chart: Time measure between algorithms per iteration; x-axis: Node InLinks (1: 2905183, 2: 21781790, 3: 28346447); y-axis: time in seconds; series: Cluster Rank, Source Rank, Truncated PageRank]

Time Measure for Algorithms (Cont’d)
[Chart: Cluster Rank time measure based on Cluster InLinks; time in seconds: 422 at 983579, 2520 at 9120926, 6780 at 18210270 Cluster InLinks]

Time Measure for Algorithms (Cont’d)
[Chart: Source Rank time measure based on Source InLinks; time in seconds: 3 at 75217, 21 at 509693, 660 at 9988138 Source InLinks]

Time Measure for Algorithms (Cont’d)
[Chart: Truncated PageRank time measure based on Node InLinks; time in seconds: 2 at 2905183, 12 at 21781790, 17 at 28346447 Node InLinks]

Node In-Link Distribution across Nodes (4M URLs)
[Chart: Distribution of Nodes and InLinks for 4M; x-axis: # of InLinks; y-axis: # of Nodes]

Cluster In-Link Distribution across Clusters (4M URLs)
[Chart: Distribution of Clusters and InLinks for 4M; x-axis: # of InLinks; y-axis: # of Clusters]

Source In-Link Distribution across Sources (4M URLs)
[Chart: Distribution of Sources and InLinks for 4M; x-axis: # of InLinks; y-axis: # of Sources]

Quality Measure for Algorithms
• A survey on the quality of the ranking algorithms was performed by a group of people, using 25 search keywords.
• The keywords were obtained from Google’s Keyword tool at: https://adwords.google.com/select/KeywordToolExternal
• The keywords identified are listed below:
  pictures, university, faculty, stadium, undergraduate, map, admissions, scholarships, loan, mba, alumni, computer, graduate, business, research, students, technology, accommodation, campus, vacations, dean, aid, parking, department, gpa

Quality Measure for Algorithms (Cont’d)
• The survey was performed to assess the following for each keyword search:
  – First page accuracy
  – Second page accuracy
  – Result order on the first page
  – Result order on the second page
  – Overall, are the important pages showing up early?
  – Overall, what percentage of the result hits is relevant?

Quality Measure For Algorithms (Cont’d)
Algorithm             Quality measure on a scale of 1 to 5 (1 being the best)
ClusterRank           2.06
SourceRank            1.65
Truncated PageRank    2.94

Conclusion
• The ClusterRank computation can be accelerated using WebGraph.
• The SourceRank algorithm takes less time for Page-Rank calculation than ClusterRank and is close to Truncated PageRank for the existing 4M URLs.
• SourceRank has the best quality score of the three algorithms.
• Considering both Efficiency and Quality, SourceRank is the best of the three for the existing data, based on the experiments performed.

Success Criteria
• Identified the efficiency of each Page-Rank computation algorithm using the time measures generated by experiments.
• Identified the quality of each algorithm using manual survey results.
• Implemented the efficient algorithm for the Needle Search Engine at UCCS.
• Upgraded the existing Needle Search Engine to 1 Million pages (crawled, actual URLs are 4 Million) from the current 50K URLs (crawled, actual URLs are 300K).

References
• [1] Paolo Boldi, Sebastiano Vigna. The WebGraph Framework I: Compression Techniques. http://www2004.org/proceedings/docs/1p595.pdf
• [2] Yen-Yu Chen, Qingqing Gan, Torsten Suel. I/O-Efficient Techniques for Computing PageRank. http://cis.poly.edu/suel/papers/pagerank.pdf
• [3] Taher H. Haveliwala. Efficient Computation of PageRank.

References (Cont’d)
• [4] Yi Zhang. Design and Implementation of a Search Engine with the Cluster Rank Algorithm.
• [5] John A. Tomlin. A New Paradigm for Ranking Pages on the World Wide Web.
• [6] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://www.cs.huji.ac.il/~csip/1999-66.pdf

References (Cont’d)
• [7] Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo. Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms.
  http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdf
• [8] Gonzalo Navarro. Compressing Web Graphs like Texts.
• [9] The Spider's Apprentice. http://www.monash.com/spidap1.html

References (Cont’d)
• [10] James Caverlee, Ling Liu, S. Webb. Spam-Resilient Web Ranking via Influence Throttling. http://www-static.cc.gatech.edu/~caverlee/pubs/caverlee07ipdps.pdf
• [11] G. Jeh, J. Widom. SimRank: A Measure of Structural-Context Similarity. http://www-cs-students.stanford.edu/~glenj/simrank.pdf
• [12] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using rank propagation and probabilistic counting for link-based spam detection. Technical report, 2006.