Indexing Service for PeerCrawl: A Performance and Deployment Analysis

CS 8803 – Advanced Internet Application Development
Project Proposal
Izudin Ibrahimbegovic
Professor Ling Liu
Motivation and objectives:
Search engines are among the most important Internet applications. Commercial
search engines are centralized and search effectively, but it is worth considering
design alternatives and exploring the potential advantages and disadvantages of a
peer-to-peer (P2P) architecture for a search engine.
P2P networks have proven very resilient, even in the face of lawsuits and other
efforts to shut them down. Because of its distributed nature, a P2P search engine
could be harder to bias or censor. Other potential advantages include increased
robustness and availability. I propose to explore design and implementation
alternatives for indexing in the PeerCrawl crawler, using its distributed overlay
network both to build the index and to search the indexed documents.
Related work:
This project will extend the existing PeerCrawl crawler and use ideas presented by
Li et al. [1] and Yang et al. [2]. [1] analyzes the feasibility of peer-to-peer web
indexing and search and presents a range of optimization techniques which, if used
correctly, can bring the performance of P2P web search close to that of a
centralized architecture. [1] also lists compromises that could be considered to
narrow the remaining performance gap. These compromises exploit known patterns in
web search and the fact that users are interested only in highly relevant results.
Proposed work:
The goal of this project is to design and implement indexing and search capabilities for
the already existing PeerCrawl crawler and evaluate their performance on a large scale.
For the indexing part, two alternatives will be considered:
- Partition by Keyword
- Partition by Document
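To make the two alternatives concrete, the following is a minimal Python sketch and
not PeerCrawl code. It assumes an in-memory simulation in which each peer is a plain
dictionary. The function names and the hash-modulo placement rule in peer_for are
illustrative assumptions; a real overlay would use its own routing to place keys.

```python
import hashlib
from collections import defaultdict


def peer_for(key, num_peers):
    """Map a string key to a peer id with a stable hash (assumed placement rule)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_peers


def partition_by_keyword(docs, num_peers):
    """Each peer owns the complete posting list of every keyword hashed to it."""
    peers = [defaultdict(set) for _ in range(num_peers)]
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            peers[peer_for(word, num_peers)][word].add(doc_id)
    return peers


def partition_by_document(docs, num_peers):
    """Each peer builds a local index over only the documents assigned to it."""
    peers = [defaultdict(set) for _ in range(num_peers)]
    for doc_id, text in docs.items():
        local = peers[peer_for(doc_id, num_peers)]
        for word in set(text.lower().split()):
            local[word].add(doc_id)
    return peers


def search_keyword_partitioned(peers, words):
    """Contact only the peer owning each query word, then intersect the lists."""
    lists = [peers[peer_for(w, len(peers))].get(w, set()) for w in words]
    return set.intersection(*lists) if lists else set()


def search_document_partitioned(peers, words):
    """Send the query to every peer; each intersects locally, results are unioned."""
    result = set()
    for local in peers:
        lists = [local.get(w, set()) for w in words]
        if lists:
            result |= set.intersection(*lists)
    return result
```

In this sketch both schemes return the same answers. The difference is in the
communication pattern: keyword partitioning contacts few peers but has to move
posting lists between them, while document partitioning queries every peer but
only moves results.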
Once a partition scheme is chosen, appropriate optimization techniques will be
applied to improve the performance of the system. As proposed in [1], the following
optimizations will be considered:
- Caching and Precomputation
o Computing intersections of lists and storing posting lists received from
other peers.
- Compression
- Bloom Filters
- Clustering of similar documents
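As an illustration of the Bloom filter item in the list above, here is a minimal
Python sketch of intersecting two posting lists held by different peers. It assumes
the common pattern in which one peer sends a compact filter instead of its full
list. The BloomFilter class, its default sizes, and bloom_intersect are hypothetical
names chosen for this example, not part of PeerCrawl or of [1].

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-1 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_intersect(list_a, list_b):
    """Simulate a two-peer intersection that ships a filter instead of list_a."""
    # Peer A summarizes its posting list and sends only the filter to peer B.
    summary = BloomFilter()
    for doc_id in list_a:
        summary.add(doc_id)
    # Peer B returns the documents that might also be in A's list.
    candidates = {doc_id for doc_id in list_b if doc_id in summary}
    # Peer A discards any false positives using its exact list.
    return candidates & set(list_a)
```

The final exact check is what keeps the result correct despite false positives.
The filter size and number of hash functions trade bandwidth against the number of
false candidates sent back.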
Additionally, the compromises listed in [1] will be evaluated according to the
performance gain each one offers. If time allows, ranking alternatives will also
be considered and evaluated.
Plan of action:
The following resources will be required:
- Working crawler implementation and source code
- Previously collected documents
- A small number of machines to participate in the P2P network
Schedule:
A rough breakdown of the schedule is as follows.
Milestone dates: 2/21, 2/28, 3/21, 4/4, 4/18, 4/24
Tasks:
- Installation of PeerCrawl
- Familiarization with source code
- Evaluation of partitioning schemes
- Development of indexing and search
- Implementation of proposed optimizations
- Ranking system implementation
- Testing and evaluation
- Final project demo
Evaluation and testing methodology:
Testing will be conducted using previously collected documents. These documents
will be indexed, and a list of common search keywords will be collected for use as
test queries. A second test method is to pick a specific collected document, select
a keyword that occurs in it, and search for that keyword to verify that it was
correctly indexed and associated with the appropriate document.
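The second test method can be sketched in Python as follows. This is a minimal
example under stated assumptions: documents are held in a dictionary of id to text,
the index maps each keyword to a set of document ids, and build_index and
spot_check are hypothetical helper names introduced only for illustration.

```python
import random
from collections import defaultdict


def build_index(docs):
    """Build a simple keyword -> set of document ids index (stand-in for the real one)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def spot_check(docs, index, samples, seed=0):
    """Pick random (document, keyword) pairs and report those missing from the index."""
    rng = random.Random(seed)  # fixed seed so a failing run can be reproduced
    doc_ids = sorted(docs)
    failures = []
    for _ in range(samples):
        doc_id = rng.choice(doc_ids)
        words = docs[doc_id].lower().split()
        if not words:
            continue  # nothing to verify for an empty document
        word = rng.choice(words)
        if doc_id not in index.get(word, set()):
            failures.append((doc_id, word))
    return failures
```

An empty list from spot_check means every sampled keyword led back to its document.
Any returned pair identifies a document and keyword to investigate.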
Bibliography:
[1] Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David R.
Karger, Robert Morris, “On the Feasibility of Peer-to-Peer Web Indexing and Search”,
MIT Lab for Computer Science, UC Berkeley
[2] Yong Yang, Rocky Dunlap, Michael Rexroad, Brian F. Cooper, “Performance of Full
Text Search in Structured and Unstructured Peer-to-Peer Systems”, College of
Computing, Georgia Institute of Technology