CS 8803 – Advanced Internet Application Development
Project Proposal

Indexing Service for PeerCrawl: A Performance and Deployment Analysis

Izudin Ibrahimbegovic
Professor Ling Liu

Motivation and objectives:
One of the most important Internet applications is the search engine. Commercial search engines are centralized and are relatively successful at searching, but it is interesting to consider design alternatives and explore the potential advantages and disadvantages of using a peer-to-peer (P2P) architecture for a search engine. The P2P architecture has proven itself to be very resilient, even in the wake of lawsuits and efforts to shut down the networks. This resilience would allow a search engine to be unbiased and free of censorship due to its distributed nature. Other advantages include increased robustness and availability. I believe that it would be interesting to explore different design and implementation alternatives for indexing in the PeerCrawl crawler, while utilizing its distributed overlay network both for indexing documents and for searching the resulting index.

Related work:
This project will add to the existing crawler PeerCrawl and use ideas presented by Li et al. [1] and Yang et al. [2]. Li et al. [1] analyze the feasibility of peer-to-peer web indexing and search and present a range of optimization techniques which, if used correctly, can bring the performance of P2P web search within a comparable distance of a centralized architecture. They also provide a list of compromises that could be considered in order to boost performance and bring it closer to that of centralized systems. These compromises exploit known patterns in web search and the fact that users are interested only in highly relevant results.

Proposed work:
The goal of this project is to design and implement indexing and search capabilities for the already existing PeerCrawl crawler and evaluate their performance on a large scale.
For the indexing part, two alternatives will be considered:
- Partition by Keyword
- Partition by Document

Once a partitioning scheme is chosen, appropriate optimization techniques will be applied in order to improve the performance of the system. As proposed in [1], the following optimizations will be considered:
- Caching and Precomputation
  o Computing intersections of lists and storing posting lists received from other peers.
- Compression
- Bloom Filters
- Clustering of similar documents

Additionally, attractive compromises will be evaluated based on the gain in performance. If time allows, ranking alternatives will be considered and evaluated.

Plan of action:
The following resources will be required:
- Working crawler implementation and source code
- Previously collected documents
- A small number of machines to participate in the P2P network

Schedule:
A rough breakdown of the schedule is as follows:

Milestone dates: 2/21, 2/28, 3/21, 4/4, 4/18, 4/24

Tasks, in order:
- Installation of PeerCrawl
- Familiarization with source code
- Evaluation of partitioning schemes
- Development of indexing and search
- Implementation of proposed optimizations
- Ranking system implementation
- Testing and evaluation
- Final project demo

Evaluation and testing methodology:
Testing will be conducted using previously collected documents. These documents will be indexed and a list of common search keywords will be collected. Another testing method is to pick a specific collected document, select a keyword within it, and search for that keyword in order to verify that it was correctly indexed and associated with the appropriate document.

Bibliography:
[1] Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David R. Karger, Robert Morris, “On the Feasibility of Peer-to-Peer Web Indexing and Search”, MIT Lab for Computer Science, UC Berkeley
[2] Yong Yang, Rocky Dunlap, Michael Rexroad, Brian F. Cooper, “Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems”, College of Computing, Georgia Institute of Technology
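
Illustrative sketch of the partitioning alternatives:
To make the two indexing alternatives named under Proposed work concrete, the following is a minimal Python sketch of partition by keyword and partition by document, with a conjunctive (AND) keyword search over each. It is not PeerCrawl code. The peer count, the SHA-1 based `peer_for` mapping (a stand-in for overlay routing), the whitespace tokenizer, and all function names are assumptions made purely for illustration, and the "peers" are simulated as in-memory dictionaries.

```python
import hashlib
from collections import defaultdict

NUM_PEERS = 4  # assumed network size, for illustration only


def peer_for(key, num_peers=NUM_PEERS):
    """Map a string key to a peer id with a stable hash (stand-in for overlay routing)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_peers


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return set(text.lower().split())


def build_keyword_partitioned(docs):
    """Partition by keyword: a term's entire posting list lives on one peer."""
    peers = [defaultdict(set) for _ in range(NUM_PEERS)]
    for doc_id, text in docs.items():
        for term in tokenize(text):
            peers[peer_for(term)][term].add(doc_id)
    return peers


def build_document_partitioned(docs):
    """Partition by document: each peer indexes only the documents assigned to it."""
    peers = [defaultdict(set) for _ in range(NUM_PEERS)]
    for doc_id, text in docs.items():
        local_index = peers[peer_for(doc_id)]
        for term in tokenize(text):
            local_index[term].add(doc_id)
    return peers


def search_keyword_partitioned(peers, terms):
    """Fetch one posting list per term from its owning peer, then intersect them."""
    lists = [peers[peer_for(t)].get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()


def search_document_partitioned(peers, terms):
    """Ask every peer; each intersects its local lists, and the results are unioned."""
    hits = set()
    for local_index in peers:
        lists = [local_index.get(t, set()) for t in terms]
        if lists:
            hits |= set.intersection(*lists)
    return hits
```

Both schemes return the same result set for the same query; what differs is the communication pattern. In this sketch the keyword-partitioned search touches only the peers that own the query terms but has to bring whole posting lists together to intersect them, which is the step that the caching, compression, and Bloom filter optimizations listed above are meant to make cheaper. The document-partitioned search intersects locally but must contact every peer.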