CS 8803 – AIAD, Prof. Ling Liu
Project Proposal
Page Rank for the Peer-to-Peer Web Crawler
Deepak K Narasimhaiah, Mario El Khoury

Motivation and Objectives

Search engines like Google have become an integral part of our lives. A search engine usually consists of several parts: crawling, indexing, searching and page ranking. A highly efficient web crawler is needed to download the billions of web pages to be indexed. Most current commercial search engines use a central server model for crawling. The major problems with centralized systems are a single point of failure, expensive hardware, and administrative and troubleshooting challenges. A decentralized crawling architecture can overcome these difficulties. Currently, a Peer-to-Peer Crawler based on a decentralized crawling architecture is being built at the College of Computing under the guidance of Prof. Ling Liu [4][5].

The World Wide Web creates many new challenges for information retrieval. It is very large and heterogeneous, and web pages are extremely diverse, ranging from simple sentences to journals about information retrieval. In addition to these challenges, search engines on the Web must also contend with inexperienced users and pages engineered to manipulate search engine ranking functions. However, unlike "flat" document collections, the World Wide Web is hypertext and provides considerable auxiliary information on top of the text of the web pages, such as link structure and link text. Larry Page and Sergey Brin [1] took advantage of this link structure to produce a global importance ranking of every web page. This ranking, called Page Rank, helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.

Currently, the P2P Web Crawler team in the College of Computing is working on building the various components of the crawler. There is not yet a finished search engine that incorporates all the features such as crawling, indexing and page ranking. As part of the course project, we are going to implement the page ranking part of the P2P Web Crawler. Our implementation will apply page ranking algorithms to the web pages crawled by the P2P crawler. The crawled pages are maintained in a web archive built by the P2P crawler, which contains the list of all the pages it has crawled. We will use this list to build the web graph and apply the page ranking algorithm to it. As a final step, we will try to incorporate Source Ranking into our system [3][6].

The following blocks will constitute the final product:

• We will implement a parser, or possibly import an existing one. Since the link relationships between the pages are not provided, we will use the parser to go through every web page in the web archive and extract the links it contains. This is needed in order to determine the outgoing links of every page. We may also have to fetch pages from the web in case they are not archived (a small sketch follows this list).
• We will then devise a representation scheme for the relationship matrix and map the information derived previously into such a matrix. The matrix contains one row per web page, with the outgoing links of that page as its columns.
• After the matrix is built, we will feed it into the web graph block to generate the graph representation of the web corresponding to the matrix.
• Once the web graph is obtained, it will be fed to the Page Rank or Source Rank algorithm, which will rank each page according to its importance.
• Finally, we will build a GUI to neatly present, manipulate and query the list of ranked URLs.
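To illustrate the first block, here is a minimal out-link extraction sketch in Java. It is only an assumption of how this step might look: it assumes the archived pages are available as local HTML files and uses a simple regular expression rather than a full HTML parser; the class and file names are purely illustrative and not part of the existing P2P crawler code.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative out-link extractor: reads an archived HTML page and returns
// the href targets of its anchor tags. A real parser would also have to
// resolve relative URLs and cope with malformed markup.
public class OutLinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(Path archivedPage) throws Exception {
        String html = new String(Files.readAllBytes(archivedPage));
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical archive layout; the actual location depends on the P2P crawler.
        for (String link : extractLinks(Paths.get("webarchive/page0001.html"))) {
            System.out.println(link);
        }
    }
}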
Related work

Sergey Brin and Larry Page presented the page ranking algorithm for the World Wide Web in [1]. The proposed ranking of pages is based on the link structure of the web. Every page has some number of forward links (out-edges) and back links (in-edges). Web pages vary greatly in the number of back links they have, and highly linked pages are generally more "important" than pages with few links. Simple citation counting has even been used to speculate on future winners of the Nobel Prize. Page Rank provides a more sophisticated method of citation counting. The reason Page Rank is interesting is that there are many cases where simple citation counting does not correspond to our common-sense notion of importance. For example, if a web page has a link from the Yahoo home page, it may be just one link, but it is a very important one; such a page should be ranked higher than pages with more links from obscure places. Page Rank is an attempt to see how good an approximation to "importance" can be obtained from the link structure alone.

Figure: A and B are back links of C (source: [1])

Page Rank-based algorithms suffer from several critical problems, such as a very long update time and the problem of web spam. The first problem is a direct consequence of the huge size of the World Wide Web and the inherent expense of Page Rank's eigenvector-based iterative algorithm. The World Wide Web comprises terabytes of data, and search engines are capable of crawling and indexing a few billion pages. A web graph comprising these billions of pages must be analyzed by the Page Rank algorithm, so updating the Page Ranks of the web pages can take weeks or months. As the size of the World Wide Web continues to grow, there are no clear-cut solutions for dealing with the complexity of the Page Rank computation. An obvious solution is to reduce the size of the graph, which can be done by grouping pages together according to some criteria. In [3], the authors propose using source-based link analysis to reduce the complexity of these operations. A source can be loosely identified as a collection of pages, such as the pages belonging to the same domain.

Distributed processing of the web graph can also help speed up the Page Rank computation. A logical division of the web pages into collections is required for such an algorithm to work, and Source Rank provides a methodology for dividing the web graph into separate collections. The web graph can be split into different sources, and a distributed system can be used to calculate the Page Rank of pages within each source and the Source Rank of that source relative to the other sources. We plan to implement the Source Rank (and Page Rank) algorithm on top of the Peer-to-Peer Web Crawler.
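For reference, the simplified Page Rank formulation given in the cited papers [1][2] (not spelled out in this proposal, but useful to keep in mind) is

    PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)

where T_1, ..., T_n are the pages that link to A, C(T_i) is the number of outgoing links of T_i, and d is a damping factor, usually set to about 0.85. The ranks are computed iteratively until they converge, and it is this iterative eigenvector computation over the full web graph that makes global updates so expensive.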
Proposed work

Applying a ranking algorithm to the Peer-to-Peer Web Crawler is not straightforward, and several blocks need to be put in place to obtain the intended result. First, the link relationships between the crawled URLs need to be determined, represented compactly, and used to build the web graph. The ranking algorithms are then run over this representation of the web, and finally the ranked pages are presented to the user, who will be able to extract meaningful results from any query he may issue.

First, we will have to process the URL list. As discussed previously, we may need to write our own parser and run it on every cached page in order to extract the out-links of every URL. For this purpose, we will assign a unique identifier to every URL and then record the identifiers of the URLs it cites in its page. Identifiers can be assigned incrementally, starting from one, according to the position of each URL in the list.

Figure: Block diagram of our system

Once the URL relationships are set, we will build our matrix. The matrix will consist of N rows, where N is the number of URLs in the list, and each row will have as many columns as that URL has out-links. More specifically, the first column will contain the identifier of the URL being processed, and the succeeding columns will hold the unique identifiers of the links parsed out of that URL's web page.

After the matrix is built, we might need to convert it into a representation format compatible with the input of the Web Graph block. That block will then be run, and the graph representation of the crawled domains will be determined and drawn from the information extracted from the list and mapped to the matrix. The resulting graph will be used as input to the Page Rank or Source Rank algorithm, and the relative rank of each URL will be derived; we might need some fine tuning to make the output of the Web Graph block compatible with the input of the ranking algorithm (a small sketch of this representation and ranking step is given at the end of this section). The final step will be to feed the resulting URL rankings into the GUI and display them in a user-friendly format. We will provide several controls and functionalities so that a user may query the database of ranked URLs or manipulate the data to obtain more meaningful or interesting results.

An important topic we will have to deal with is scalability. Building a system that accommodates a couple of hundred URLs is very different from building one that handles millions of URLs. Hence, we need an algorithm and a data representation scheme that can be parallelized and distributed among several threads with as little communication overhead as possible.
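To make the above pipeline concrete, here is a minimal sketch, in Java, of the kind of representation and ranking step we have in mind: URLs are mapped to integer identifiers, out-links are stored as adjacency lists (our "relationship matrix"), and ranks are computed by straightforward power iteration with the usual damping factor of 0.85 from [2]. All class and method names are illustrative assumptions, not part of the existing P2P crawler code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory web graph: one integer ID per URL plus adjacency lists.
public class SimpleWebGraph {
    private final Map<String, Integer> idOf = new HashMap<>();
    private final List<String> urlOf = new ArrayList<>();
    private final List<List<Integer>> outLinks = new ArrayList<>();

    // Assign IDs incrementally, in the order URLs are first seen.
    public int idFor(String url) {
        Integer id = idOf.get(url);
        if (id == null) {
            id = urlOf.size();
            idOf.put(url, id);
            urlOf.add(url);
            outLinks.add(new ArrayList<Integer>());
        }
        return id;
    }

    public void addLink(String fromUrl, String toUrl) {
        outLinks.get(idFor(fromUrl)).add(idFor(toUrl));
    }

    // Basic power iteration for Page Rank. Dangling pages (no out-links) are
    // ignored in this simplified version; a full implementation would
    // redistribute their rank mass.
    public double[] pageRank(double d, int iterations) {
        int n = urlOf.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n);
            for (int page = 0; page < n; page++) {
                List<Integer> links = outLinks.get(page);
                if (links.isEmpty()) continue;
                double share = d * rank[page] / links.size();
                for (int target : links) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public String url(int id) { return urlOf.get(id); }
}

As a usage example, after running the out-link extractor over the archive, each (page, link) pair would be added with addLink(), and pageRank(0.85, 50) would return the scores handed to the GUI. Whether we keep such a single in-memory structure or move to a distributed, source-partitioned one is exactly the scalability question raised above.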
Plan of action

We will implement our system in Java, since most of the building blocks already offered are coded in Java. We might need to cluster some computers, or use specialized machines, if we have to parse the web pages ourselves, especially if the cached archive is not available and we must fetch the pages from the web before parsing them. For this purpose, we will seek the help of the P2P Web Crawler team and use the resources they use to run their system.

We will implement the building blocks in the order described in the section entitled "Proposed work". We will move gradually through the project: even if we implemented some of the later blocks first, we might get stuck in one of their prerequisites and that work would be wasted. It is therefore better to serialize our implementation and make it easy for other people to continue from the point we reach. Here is an approximate timeline for the project design and implementation:

• Two weeks for building the parser and a page-fetching program, and extracting the out-links.
• Two weeks for understanding the interface to the Web Graph block and building an efficient matrix representation accordingly.
• One week for building the web graph for the crawled domains.
• One week for applying the ranking algorithms to the obtained web graph, assuming we already have the code and the only work is to integrate it into our system.
• Two weeks for building the GUI.
• The remaining time is to be spent on running experiments, performance analysis and fine tuning of the different parts, as well as writing the paper and preparing the final presentation for the project.

Evaluation and Testing Method

We are not yet sure whether we will be able to find benchmarks against which to test our system; however, the results will be represented graphically, and through common sense we will be able to judge whether the output is correct. The system evaluation is primarily based on experimentation and analysis of the www.gatech.edu domain. We will analyze the importance scores our algorithm assigns to the pages in this domain to see whether important pages are ranked higher. We might also compare the results of one-word search queries with those of Google, Yahoo and other search engines to see how close our results are. However, we cannot guarantee identical results, since we will be dealing with only a small part of the web graph while those search engines use far more data than we have; furthermore, we cannot guarantee that they use a ranking algorithm similar to ours.

Another evaluation parameter will be the performance of the Source Rank algorithm for updating the rankings. A comparison of the Page Rank and Source Rank algorithms on this particular domain will be provided in order to highlight the advantages of Source Rank over Page Rank. We will also provide graphs that demonstrate the scalability of our system at specific blocks wherever possible, showing the overhead incurred by the system, or by parts of it, as the number of URLs increases. We will also try to identify possible bottlenecks in the system in order to resolve them in future work. Finally, we will measure the total delay needed for a URL list of length N to be processed and for the results to be displayed; reporting these measurements for multiple values of N will help demonstrate the scalability of the system. Most of our results will be obtained on real data rather than from a simulation. We hope that our system will be of help to the P2P Web Crawler team, perhaps being incorporated into their final product, and will help advance towards a P2P search engine.
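As one possible, purely illustrative way of quantifying "how close" our rankings are to those of an existing search engine, we could measure the overlap of the top-k URLs returned for the same query. The sketch below is only a suggestion; it assumes both result lists are already available as ordered lists of URLs, and the class and method names are hypothetical.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative comparison metric: fraction of the top-k URLs shared by two ranked lists.
public class RankingComparison {
    public static double topKOverlap(List<String> ours, List<String> reference, int k) {
        Set<String> ourTop = new HashSet<>(ours.subList(0, Math.min(k, ours.size())));
        Set<String> refTop = new HashSet<>(reference.subList(0, Math.min(k, reference.size())));
        ourTop.retainAll(refTop);           // keep only URLs present in both top-k sets
        return (double) ourTop.size() / k;  // 1.0 means identical top-k sets
    }
}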
Bibliography

[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Computer Science Department, Stanford University, 1998.
[2] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[3] J. Caverlee and L. Liu. Enhancing PageRank through Source-Based Link Analysis. Manuscript under preparation.
[4] A. Singh, M. Srivatsa, L. Liu, and T. Miller. Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, Lecture Notes in Computer Science, Volume 2924.
[5] V. J. Padliya and L. Liu. PeerCrawl: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. Technical report, Georgia Institute of Technology, May 2006.
[6] B. Bamba, L. Liu, et al. DSphere: A Source-Centric Approach to Crawling, Indexing and Searching the World Wide Web.
[7] J. Caverlee, L. Liu, and W. B. Rouse. Link-Based Ranking of the Web with Source-Centric Collaboration (invited). 2nd International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Atlanta, 2006.
[8] J. Caverlee, M. Srivatsa, and L. Liu. Countering Web Spam Using Link-Based Analysis. 2nd Cyber Security and Information Infrastructure Research Workshop (CSIIRW), Oak Ridge National Laboratory, 2006.