CS 8803 – AIAD
Prof. Ling Liu
Project Proposal
Page Rank for the Peer to Peer Web Crawler
Deepak K Narasimhaiah
Mario El Khoury
Motivation and Objectives
Search engines like Google have become an integral part of our lives. A search
engine usually consists of many parts: crawling, indexing, searching and page ranking.
A highly efficient web crawler is needed to download the billions of web pages to be indexed.
Most current commercial search engines use a central server model for crawling.
The major problems with centralized systems are a single point of failure, expensive hardware,
and administrative and troubleshooting challenges. A decentralized crawling
architecture can overcome these difficulties. Currently, a Peer to Peer Crawler based on
such a decentralized architecture is being built at the College of Computing under the
guidance of Prof. Ling Liu [4][5].
The World Wide Web creates many new challenges for information retrieval. It is
very large and heterogeneous. Web pages are extremely diverse, ranging from simple
sentences to journals about information retrieval. In addition to these major challenges,
search engines on the Web must also contend with inexperienced users and pages
engineered to manipulate search engine ranking functions. However, unlike “flat”
document collections, the World Wide Web is hypertext and provides considerable
auxiliary information on top of the text of the web pages, such as link structure and link
text. Larry Page and Sergey Brin [1] took advantage of this link structure of the Web to
produce a global importance ranking of every web page. This ranking, called Page Rank,
helps search engines and users quickly make sense of the vast heterogeneity of the World
Wide Web.
Currently, the P2P Web Crawler team in the College of Computing is working on
building various components for the crawler. There is not yet a finished search engine
that incorporates all the features such as crawling, indexing and page ranking. As our
course project, we are going to implement the page ranking part of the P2P
WebCrawler. Our implementation will apply the page ranking algorithms to the web
pages crawled by the P2P crawler. The crawled pages are maintained in a web archive
built by the P2P crawler, which contains the list of all the pages it has crawled. We will
use this list to build the web graph and apply the page ranking algorithm to it. As a final
step, we will try to incorporate Source Rank into our system [3][6].
The following blocks will constitute the final product:
• We will have to implement a parser, or possibly import an already designed one. Since
the link relationships between the pages are not provided, we will use the parser to go
through every web page in the web archive and extract its links. This is needed in order
to determine the outgoing links of every page (a minimal parsing sketch follows this list).
• We may also have to fetch pages from the web in case they are not archived.
• We will then devise a representation scheme for the relationship matrix and map the
information derived previously into such a matrix. Each row of the matrix corresponds
to a web page, and its columns hold the outgoing links of that page.
• After the matrix is built, we will feed it into the web graph block to generate the
matrix-relative graph representation of the web.
• Once the web graph is obtained, it will be fed to the Page Rank or Source Rank
algorithms, which will rank each page according to its importance.
• Finally, we will build a GUI to neatly present, manipulate and query the list of
ranked URLs.
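
To make the parsing step concrete, here is a minimal sketch of the kind of link extraction we have in mind, assuming the archived pages are available as HTML strings. The class and method names (LinkExtractor, extractLinks) are illustrative and are not part of the existing P2P crawler code.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive out-link extraction from an archived HTML page. Only absolute
// http links are captured; a production parser would also resolve
// relative URLs and tolerate malformed markup.
public class LinkExtractor {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}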
Related work
Sergey Brin and Larry Page presented the page ranking algorithm for the World
Wide Web in [1]. The proposed ranking of the pages was based on the link structure of
the web. Every page has some number of forward links (out edges) and back links (in
edges). Web pages vary greatly in terms of the number of back links they have.
Generally, highly linked pages are more “important” than pages with few links. Simple
citation counting has been used to speculate on the future winners of the Nobel Prize.
Page Rank provides a more sophisticated method for doing citation counting. The reason
that Page Rank is interesting is that there are many cases where simple citation counting
does not correspond to our common sense notion of importance. For example, if a web
page has a link off the Yahoo home page, it may be just one link but it is a very important
one. This page should be ranked higher than many pages with more links but from
obscure places. Page Rank is an attempt to see how good an approximation to
“importance” can be obtained just from the link structure.
Figure: A and B are back links of C (source: [1])
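
For reference, the ranking described in [1] can be written in its simplest form as

    PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

where T1, ..., Tn are the pages that link to A, C(Ti) is the number of out-links of Ti, and d is a damping factor (typically around 0.85). The Page Rank vector is the principal eigenvector of the normalized link matrix of the web, which is why it is computed iteratively.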
Page Rank-based algorithms suffer from several critical problems, such as very
long update times and web spam. The first problem is a direct consequence of the huge
size of the World Wide Web and the inherent expense of Page Rank's eigenvector-based
iterative algorithm. The World Wide Web comprises terabytes of data, and search engines
are capable of crawling and indexing a few billion pages. A web graph comprising these
billions of pages needs to be analyzed by the Page Rank algorithm, so attempts to perform
such a link-based analysis can spend weeks or months updating the Page Ranks of the web
pages. As the size of the World Wide Web continues to grow, there are no clear-cut
solutions available to deal with the complexity of the Page Rank computation.
An obvious solution is to somehow reduce the size of the graph. The size of the
graph can indeed be reduced by grouping pages together according to some criteria. In
[3], the authors propose the idea of using a source-based link analysis to reduce the
complexity of these operations. A source can be loosely identified as a collection of
pages, such as pages belonging to the same domain. Distributed processing of the web
graph can also help speed up the page rank computation, and a logical division of the
web pages into collections is required for such an algorithm to work. Source Rank
provides us with a methodology for dividing the web graph into separate collections:
the web graph can be split into different sources, and a distributed system can be used
to calculate the Page Rank of pages within each source and the Source Rank of each
source in relation to other sources. We plan to implement the Source Rank (and
Page Rank) algorithms on top of the Peer to Peer Web Crawler.
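
As a simple illustration of the grouping that source-based analysis relies on, the sketch below (our own illustrative code, not taken from [3]) buckets URLs into sources identified by their host names:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group URLs into "sources" keyed by host name, one loose notion of a
// source (pages belonging to the same domain).
public class SourceGrouper {
    public static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> sources = new HashMap<>();
        for (String u : urls) {
            try {
                String host = new URI(u).getHost();
                if (host == null) continue;
                sources.computeIfAbsent(host, h -> new ArrayList<>()).add(u);
            } catch (URISyntaxException e) {
                // Skip malformed URLs.
            }
        }
        return sources;
    }
}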
Proposed work
Applying a ranking algorithm to the Peer-to-Peer WebCrawler is not
straightforward, and many blocks need to be put in place to obtain the intended result.
First of all, the link relationships between the crawled URLs need to be determined,
represented neatly, and used to build the web graph. The ranking algorithms are then run
over this relational representation of the web, and in a final step the ranked pages are
presented to the user, who will be able to extract meaningful results from any query they
issue.
First of all, we will have to process the URL list. As discussed previously, we
may need to write our own parser and run it on every cached page in order to extract the
out-links of every URL. For this purpose, we will assign a unique identifier to every
URL and then record the identifiers each URL cites in its page. The identifiers can be
assigned incrementally, starting from one, according to the position of each URL in
the list.
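
A minimal sketch of this identifier assignment, with illustrative names (UrlIdAssigner, assignIds) of our own choosing:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assign identifiers 1..N to the URLs in the order they appear in the list.
public class UrlIdAssigner {
    public static Map<String, Integer> assignIds(List<String> urlList) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 1;
        for (String url : urlList) {
            if (!ids.containsKey(url)) {
                ids.put(url, next++);
            }
        }
        return ids;
    }
}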
Below is the block diagram for our system:
Once the URL relationships are set, we will build our matrix. The matrix will
consist of N rows, where N is the number of URLs in the list, and each row will have as
many columns as the number of out-links of that URL. More specifically, the first
column will contain the identifier of the URL being processed, and the succeeding columns
will hold the unique identifiers of the links parsed out of that URL's web page.
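
Since each row can have a different number of columns, this matrix is effectively a ragged adjacency list. Below is a sketch of how one row might be built, assuming the identifier map and extracted out-links from the previous steps; the names are again illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One row of the relationship matrix: the first entry is the page's own
// identifier, the remaining entries are the identifiers of its out-links.
public class RelationshipRow {
    public static List<Integer> buildRow(String pageUrl,
                                         List<String> outLinks,
                                         Map<String, Integer> ids) {
        List<Integer> row = new ArrayList<>();
        row.add(ids.get(pageUrl));
        for (String link : outLinks) {
            Integer id = ids.get(link);
            if (id != null) {   // keep only links that point to pages in the archive
                row.add(id);
            }
        }
        return row;
    }
}

Whether links pointing outside the archive should be kept or dropped is a design decision we will revisit once we see the actual data.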
After the matrix is built, we might need to convert it into a representation format
compatible with the input expected by the Web Graph block. The algorithm will then be
run, and the graph representation of the crawled domains will be determined and drawn
based on the information extracted from the list and mapped into the matrix.
The resulting graph will be used as input to the Page Rank or Source Rank
algorithm, and the relative rank of each URL will be derived. We might need to do some
fine-tuning to make the output of the Web Graph compatible with the input to the
ranking algorithm.
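
As a rough sketch of the computation at this stage, and not the exact code we expect to integrate, a standard power-iteration Page Rank over the adjacency rows could look as follows (0-based indices, illustrative names):

import java.util.Arrays;
import java.util.List;

// Standard power-iteration Page Rank over an adjacency representation:
// outLinks.get(i) holds the 0-based indices of the pages that page i links to.
public class PageRankSketch {
    public static double[] rank(List<int[]> outLinks, double d, int iterations) {
        int n = outLinks.size();
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n);
            for (int i = 0; i < n; i++) {
                int[] targets = outLinks.get(i);
                if (targets.length == 0) {
                    // Dangling page: spread its rank uniformly over all pages.
                    for (int j = 0; j < n; j++) next[j] += d * pr[i] / n;
                } else {
                    for (int t : targets) next[t] += d * pr[i] / targets.length;
                }
            }
            pr = next;
        }
        return pr;
    }
}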
The final step would be to feed the resulting URL rankings into the GUI and
display them in a visually appealing format. We will provide several control metrics and
functionalities so that a user may query the database of ranked URLs or manipulate the
data to obtain more meaningful or interesting results.
An important topic we will have to deal with is scalability. Building a system that
accommodates a couple of hundred URLs is very different from building one that handles
millions of URLs. Hence, we need to derive an algorithm and a data representation
scheme that can be parallelized and distributed among several threads with as little
communication overhead as possible.
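
One direction we are considering, sketched below with illustrative names, is to partition the rows of the matrix across worker threads so that each thread scatters the rank contributions of its own pages into a private array and only the final accumulation needs to be combined:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of one parallel Page Rank iteration: pages are split into chunks,
// each thread accumulates the contributions of its own chunk into a private
// array, and the partial arrays are summed at the end, keeping cross-thread
// communication to a minimum. Dangling pages are ignored here for brevity.
public class ParallelRankStep {
    public static double[] step(List<int[]> outLinks, double[] pr, double d, int threads)
            throws InterruptedException, ExecutionException {
        int n = outLinks.size();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<double[]>> parts = new ArrayList<>();
        int chunk = (n + threads - 1) / threads;

        for (int start = 0; start < n; start += chunk) {
            final int lo = start;
            final int hi = Math.min(n, start + chunk);
            parts.add(pool.submit(() -> {
                double[] local = new double[n];
                for (int i = lo; i < hi; i++) {
                    int[] targets = outLinks.get(i);
                    for (int t : targets) local[t] += d * pr[i] / targets.length;
                }
                return local;
            }));
        }

        double[] next = new double[n];
        Arrays.fill(next, (1.0 - d) / n);
        for (Future<double[]> f : parts) {
            double[] local = f.get();
            for (int j = 0; j < n; j++) next[j] += local[j];
        }
        pool.shutdown();
        return next;
    }
}

A distributed version along the lines of Source Rank would follow the same pattern, with sources rather than row ranges as the unit of partitioning.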
Plan of action
We will implement our system in Java, since most of the building blocks already
offered are coded in Java. We might need to cluster several computers or use specialized
machines to parse the web pages, especially if we do not have the cached archive and
have to fetch the pages from the web before parsing them. For this purpose, we will seek
the help of the P2P WebCrawler team and use the resources they use to run their system.
We will implement the building blocks in the order described in the previous
section, “Proposed Work”. We will proceed gradually, since if we implement later blocks
before earlier ones, we might get stuck on a prerequisite and that work would be wasted.
It is therefore better to serialize our implementations and make it easy for others to
continue from the point we reach.
Here is an approximate timeline to follow for the project design and implementation:
• Two weeks for building the parser and a page fetching program and for extracting
the out-links.
• Two weeks for understanding the interface to the Web Graph and building an
efficient matrix representation accordingly.
• One week for building the web graph for the crawled domains.
• One week for applying the ranking algorithms to the obtained web graph,
assuming we already have the code and the only work is to integrate it into
our system.
• Two weeks for building the GUI.
• The remaining time is to be spent on running experiments, performance
analysis and fine-tuning of the different parts. It will also be used to write the
paper and prepare the final presentation for the project.
Evaluation and Testing Method
We are not yet sure whether we will be able to find benchmarks against
which to test our system; however, the results will be represented graphically, and
through common sense we will be able to judge whether the output is correct.
The system evaluation is primarily based on experimentation and analysis of
the www.gatech.edu domain. We will analyze the importance scores assigned by our
algorithm to the pages in this domain to see whether important pages are ranked higher.
We might also compare the results of one-word search queries with those of
Google, Yahoo and other search engines to see how close our results come. However,
we cannot guarantee that the results will be identical, since we will be dealing with only
a subset of the web graph while those search engines make use of far more data than
we have. Furthermore, we cannot guarantee that these engines use a ranking
algorithm similar to ours.
Another parameter for evaluation will be the performance of the Source Rank
algorithm for updating the rankings. A comparison of the Page Rank and Source Rank
algorithms for this particular domain will be provided in order to emphasize the
advantages of Source Rank over Page Rank.
We will also provide graphs that demonstrate the scalability of our system at specific
blocks, wherever this is possible. We will show the overhead incurred by the system,
or by individual components, as we increase the number of URLs. We will also try to
identify possible bottlenecks in the system in order to resolve them in future
work.
Finally, we will report the total delay needed for a URL list of length N to be
processed and for the results to be displayed. Reporting these results for multiple values
of N will help demonstrate the scalability of the system. Most of our results will be
obtained on real data rather than from a simulation. We hope that our system will be of
help to the P2P Web Crawler team, and perhaps be incorporated into their final product,
helping advance towards a P2P search engine.
Bibliography
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking:
Bringing Order to the Web. Technical report, Computer Science Department, Stanford
University, 1998.
[2] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search
Engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[3] J. Caverlee and L. Liu. Enhancing PageRank through Source-Based Link
Analysis. Manuscript under preparation.
[4] A. Singh, M. Srivatsa, L. Liu, and T. Miller. Apoidea: A Decentralized
Peer-to-Peer Architecture for Crawling the World Wide Web. Proceedings of the
SIGIR 2003 Workshop on Distributed Information Retrieval, Lecture Notes in
Computer Science, Volume 2924.
[5] V. J. Padliya and L. Liu. PeerCrawl: A Decentralized Peer-to-Peer Architecture for
Crawling the World Wide Web. Technical report, Georgia Institute of Technology, May
2006.
[6] B. Bamba, L. Liu, et al. DSphere: A Source-Centric Approach to Crawling, Indexing
and Searching the World Wide Web.
[7] J. Caverlee, L. Liu, and W. B. Rouse. Link-Based Ranking of the Web with
Source-Centric Collaboration (invited). 2nd International Conference on Collaborative
Computing: Networking, Applications and Worksharing (CollaborateCom), Atlanta,
2006.
[8] J. Caverlee, M. Srivatsa, and L. Liu. Countering Web Spam Using Link-Based
Analysis. 2nd Cyber Security and Information Infrastructure Research Workshop
(CSIIRW), Oak Ridge National Laboratory, 2006.