The GUWgle Search Engine

Christopher V. Kaukl (cvkaukl@cs.washington.edu)
Sean McManus (seanmc@cs.washington.edu)
Tam Vu (tvu@cs.washington.edu)

Abstract

This paper describes GUWgle, a search engine designed to provide high-performance searching within the University of Washington web space. GUWgle illustrates how constraining the scope of a search engine allows optimization of performance and functionality while permitting a simple, expandable implementation. This paper explains GUWgle's design at the system level, then examines each component and how the aforementioned constraints influence its design. How simplicity of implementation affects scalability is also examined.

I. Introduction

Search engines, like all software, must make tradeoffs and strike balances. In addition to the compromises between memory and processor utilization found in all software, search engines require unusual focus on more complex issues, among them performance versus feature set and simplicity versus scalability. Given an undefined or arbitrarily large scope of operation, emphasizing one of these aspects at the cost of another is inevitable. Constricting a search engine's scope of operation, however, removes the necessity for such polarization. The following factors motivated GUWgle's design:

Following the constraints of the project. GUWgle was created for the CSE 490i class, which required that search engines created for the class provide minimal search engine functionality, the most important requirements being that the search scope be constrained to the University of Washington web space and that the crawler component obey polite crawling rules.

Alleviating problems with the starter code. GUWgle is loosely based on the WebSPHINX crawler. This code was found to be buggy and inefficient; meeting the project requirements required heavy modification and extension.

Providing a simple implementation. Simple code is easy to modify and extend. On a pragmatic level, simple code also makes it easier for team members to work independently on different aspects of the search engine.

Creating a good search engine. GUWgle is designed to provide fast, comprehensive, and feature-rich searches.

The rest of this paper examines GUWgle's implementation of a search engine. After an overview of how GUWgle's components interact, each component is analyzed in terms of its basic functionality and how, if at all, the principle of restricted scope allows enhancement of that functionality beyond what would otherwise be possible. Functionality that is incomplete or does not work as well as desired is enumerated next, followed by an explanation of group member roles and the conclusion.

II. Overview of components

As its name suggests, GUWgle is based on the Google model. The environment dictates the final form of the engine, though: GUWgle's constricted scope allows a simpler implementation than Google's.

[Diagram: data flows from the UW network into the crawler, from the crawler into the repository, from the repository through the indexer into the index, and from the index and repository into the query engine, which answers user queries.]

The diagram above provides a graphical representation of GUWgle's components and the manner in which data flows between them. Data is harvested from the web by the crawler, which stores it in the repository. The indexer pulls this data from the repository, organizes and ranks it, and stores the resulting data in the index. The user sends query data to the query engine, which then pulls data from the index and repository as needed to answer the user's query.
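As a rough illustration of these boundaries, the following minimal Java sketch shows one way the component interactions described above could be expressed. The interface and method names here are hypothetical, chosen only for the example; they are not GUWgle's actual classes.

    // Hypothetical component boundaries mirroring the data flow described above.
    interface Crawler {
        void crawl(String seedUrl, Repository repository);   // harvest pages into the repository
    }

    interface Repository {
        void store(int docId, String url, String html);      // forward entry keyed by document ID
        String fetchHtml(int docId);
        Integer lookupDocId(String url);                      // reverse entry keyed by URL
    }

    interface Indexer {
        Index build(Repository repository);                   // organize and rank repository contents
    }

    interface Index {
        java.util.Set<Integer> documentsContaining(String word);
    }

    interface QueryEngine {
        java.util.List<String> answer(String query, Index index, Repository repository);
    }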
For a deeper understanding of Google-based search engine behavior, refer to "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Brin and Page.

III. The Crawler

The crawler component of the GUWgle search engine maintains politeness through the use of multithreaded queues. GUWgle is based on WebSPHINX; therefore, the greater volume of code conforms to the preexisting code base. However, WebSPHINX's fundamental behavior was radically changed for GUWgle.

Previously, WebSPHINX used a naïve multithreaded model to crawl pages. URLs for pages that had not been downloaded were stored in a synchronized priority queue called the fetchQueue. On startup, a predefined number of threads were created, all of which contended for the opportunity to remove a link from the fetchQueue for downloading. After downloading a page and extracting its links, threads again waited for access to the fetchQueue in order to enqueue the newfound links, and the process began once more. This method would frequently result in WebSPHINX downloading all content from a given server in a matter of seconds, then moving on to another server, which is far from polite.

GUWgle's first priority is permitting parallel downloads while maintaining politeness. Toward this end, WebSPHINX's threading paradigm was radically changed. Instead of maintaining a single fetchQueue and crawler-wide threads, the crawler contains a hash table mapping IP addresses to Worms (Worms are abstractions of threads containing thread-specific data and methods). Every unique IP address is mapped to a single Worm, which is responsible for handling documents from the server at that IP address. Each Worm contains its own fetchQueue. When a Worm extracts links from an HTML document, an IP lookup is performed on each link's URL. The resulting IP address is used as a key into the IP hash table to retrieve the appropriate Worm. The original Worm waits for the Worm that owns the IP address to relinquish control of its fetchQueue, then inserts the new link into that Worm's fetchQueue. If the IP address of the link does not have a corresponding Worm in the IP hash table (i.e., no Worm is retrieved by the initial hash table lookup), a new Worm is created, inserted into the hash table, and started on the link. After finishing processing a page, the Worm sleeps for six seconds in order to relinquish CPU cycles to other Worms and ensure that the server is never accessed more than ten times a minute.

This implementation has an interesting side effect: GUWgle performs a host-wise breadth-first search. After a thread has finished processing a page belonging to a server at a certain IP address, the subsequent six seconds of sleep provide ample time for other threads to process their pages. If a thread has not seen activity after being woken twice, it is killed to free memory. In practice, the number of hosts in the University of Washington IP address space ensures that over 200 threads are operating at any given time, which, when combined with network latency and bandwidth limitations, results in an average per-server access rate well under the ten accesses per minute dictated in the code. Due to the relatively small size of the domain to be searched, scalability concerns such as the number of threads that would be created while crawling the entire internet may be discounted.
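A minimal Java sketch of this per-host scheme appears below. The class and method names (PoliteCrawler, enqueue, processPage) are hypothetical and the details are simplified; this is not WebSPHINX's or GUWgle's actual code, only an illustration of the dispatch and politeness logic described above.

    import java.net.InetAddress;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of per-host crawling threads ("Worms"): each unique IP address maps to
    // one Worm, each Worm owns its own fetchQueue, and each Worm sleeps six seconds
    // between pages so that no server is hit more than ten times a minute.
    class PoliteCrawler {
        private final Map<String, Worm> wormsByIp = new HashMap<>();

        // Route a newly discovered link to the Worm responsible for its server.
        void enqueue(URL link) throws Exception {
            String ip = InetAddress.getByName(link.getHost()).getHostAddress();
            Worm worm;
            synchronized (wormsByIp) {
                worm = wormsByIp.get(ip);
                if (worm == null) {              // no Worm for this host yet: create, register, start
                    worm = new Worm();
                    wormsByIp.put(ip, worm);
                    worm.start();
                }
            }
            worm.fetchQueue.add(link);           // insert into that Worm's own queue
        }

        class Worm extends Thread {
            final Queue<URL> fetchQueue = new LinkedBlockingQueue<>();

            @Override
            public void run() {
                int idleWakeups = 0;
                while (idleWakeups < 2) {        // killed after two wakeups without activity
                    URL next = fetchQueue.poll();
                    if (next == null) {
                        idleWakeups++;
                    } else {
                        idleWakeups = 0;
                        processPage(next);       // download, fingerprint, store, enqueue extracted links
                    }
                    try {
                        Thread.sleep(6000);      // six-second politeness pause after each page
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }

            private void processPage(URL url) {
                // Download the page, check its fingerprint, store it in the
                // repository, extract its links, and pass each one to enqueue().
            }
        }
    }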
Should application in a larger web space be desired, though, the crawler's relatively intuitive implementation allows easy addition of thread pools and/or meta-queues, which would provide scalability.

One aspect of crawler page processing is fingerprinting, used to avoid storing duplicate pages. After the text of a page is downloaded but before it is processed, an MD5 checksum is calculated over the page's data. This checksum is used as a key into a hash table. If no hash table value corresponding to the checksum key exists, an arbitrary value is placed in the hash table entry indexed by the checksum and page processing proceeds. If a value already exists, though, the page is discarded to save space and time. After enabling the checksum, it was found that approximately 8% of all pages in the UW web space are duplicates.

Regardless of the constrained crawling space, memory inefficiencies in the original WebSPHINX crawler limited the maximum duration of crawls to approximately 30 minutes, after which physical memory ran out and disk paging began, effectively halting the crawler after it had crawled only about 1,000 pages. After memory optimizations, the crawler now uses a static amount of memory. The longest crawl to date lasted 52 hours, over which time the crawler retrieved 434,000 pages. This crawl was terminated only because of a diminishing number of new pages being retrieved.

IV. The repository

Google uses a compact, monolithic repository structure with a specialized internal layout. With respect to this design decision, GUWgle deviates significantly from Google. Since GUWgle must only store the unique pages found on UW servers, this efficient specialization could be jettisoned in favor of something much simpler. The decision to create a simple repository that takes advantage of existing code libraries and data structures turned out to be the cause of several problems; however, the same simplicity that caused the problems also simplified their solution.

The data structure for a stored web page has not changed throughout the evolution of the repository. Web pages are stored as zip entries indexed by their document IDs (a simple sequential number assigned during page processing). The entry contains as its first line the URL corresponding to the page; the subsequent lines constitute the complete HTML content of the page. In anticipation of cached browsing, which would allow a user to navigate through entirely cached pages, reverse repository entries were also specified. A reverse repository entry indexes a stored document by its URL, as opposed to its document ID. The only content of a reverse entry is the document ID for that document, which allows all other data pertaining to it to be retrieved.

Initially, the repository was to be a single zip file containing both regular entries and reverse entries. However, the zip file specification would not allow that many entries. To avoid this problem, the repository was broken up into a directory of zip files, each of which contained a single entry and was named after the document ID of the entry. The reverse repository entries were maintained in a single zip file in order to avoid filesystem problems with the URL names. This solution created too many files for the filesystem, though, so a hybrid of the two earlier attempts was created: 434 separate zip files are assembled, each holding one thousand entries. The reverse repository entries remain the same.
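To illustrate this layout, the sketch below shows how a stored page might be read back given its document ID. The batching rule (document ID divided by 1,000 selects the zip file) and the file naming are assumptions made for the example, not necessarily GUWgle's actual scheme.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Illustrative reader for a repository of many zip files, each holding a batch
    // of entries named by document ID, where an entry's first line is the page URL
    // and the remaining lines are the page's HTML.
    class RepositoryReader {
        private static final int ENTRIES_PER_ZIP = 1000;  // assumed batch size per zip file
        private final String repositoryDir;

        RepositoryReader(String repositoryDir) {
            this.repositoryDir = repositoryDir;
        }

        // Returns {url, html} for a document ID, or null if it is not stored.
        String[] fetch(int docId) throws Exception {
            // Assumption: entries are batched into files named "<docId / 1000>.zip".
            String zipPath = repositoryDir + "/" + (docId / ENTRIES_PER_ZIP) + ".zip";
            try (ZipFile zip = new ZipFile(zipPath)) {
                ZipEntry entry = zip.getEntry(String.valueOf(docId));
                if (entry == null) {
                    return null;
                }
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                    String url = in.readLine();               // first line: the page URL
                    StringBuilder html = new StringBuilder();
                    String line;
                    while ((line = in.readLine()) != null) {  // remaining lines: the HTML content
                        html.append(line).append('\n');
                    }
                    return new String[] { url, html.toString() };
                }
            }
        }
    }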
V. Indexer and index

The indexer resembles the repository in both philosophy and design: a zip file is used for its ease of implementation and conceptual approachability. As with Google, the indexer uses a previously created repository to build the index. For each document in the repository, the indexer parses out all applicable words and performs any trimming and stemming on each word before using the word as a key into a word occurrence hash table. Each occurrence list is implemented simply as a string of page numbers and page locations. The occurrence list is written to the index zip file as a zip entry indexed by the word itself.

VI. Query engine

The query engine is best addressed in terms of its constituent functionality. Searching by complex Boolean queries is allowed, as is exact phrase search. Thanks to the limited size of the repository, it is possible to combine these searches in an intuitive fashion. The query page allows the user to submit queries for any combination of any words, all words, no words, and exact phrases. Independent document ID lookups are performed on the index for each word, and the resulting sets of document IDs are then subjected to set operations depending on the criteria by which the words were selected. Sets produced by "any word" queries are unioned, while "all word" sets are intersected. "No word" sets are also unioned.

Exact phrase search compares query words' occurrence information found in the index. First, all documents containing the phrase's first word are retrieved, with all occurrences noted. As the rest of the words are processed, this set of documents is filtered, eliminating documents in which no occurrence of the first word is followed by the sequence of query words processed thus far. The resulting set of document IDs represents all documents containing the exact phrase. At this point, the "any word", "all word", and exact phrase sets are intersected, and any resulting document IDs that appear in the "no word" set are subtracted.

Once a set of document IDs corresponding to the search parameters has been acquired, the search results are displayed. Each result consists of the URL of a page, a text snippet, and a link to the cached page. The URL is retrieved from the first line of the repository entry for the given document ID. The snippet is simply a certain number of words before and after the query word, whose position is found in the index's occurrence information. Cached page links are Java Server Pages accessed with a document ID as a parameter; the JSP retrieves the document from the repository and prints its HTML. All results are ordered according to page rank.

Although the total number of operations involved in answering a query may be high, the actual procedures involved are not conceptually difficult. The tradeoff between conceptual simplicity and performance proved practically impossible to measure reliably, though. The platform on which the query engine was run (cubist.cs.washington.edu) operated under wildly varying levels of load and, at times, reliability. Under ideal conditions, the query engine demonstrated response times of under one second, even for complex queries. However, more conclusive experimental results could not be obtained.
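The exact-phrase filtering described above can be illustrated with the following sketch. It assumes a simplified in-memory representation of the occurrence data (word to document ID to word positions), which stands in for GUWgle's string-encoded occurrence lists; the class and method names are chosen for the example only.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative exact-phrase filter over positional occurrence data.
    class PhraseSearch {
        // occurrences.get(word).get(docId) = positions of that word in that document
        static List<Integer> documentsContainingPhrase(
                List<String> phrase, Map<String, Map<Integer, List<Integer>>> occurrences) {
            // Start from every occurrence of the phrase's first word.
            Map<Integer, List<Integer>> candidates =
                    new HashMap<>(occurrences.getOrDefault(phrase.get(0), new HashMap<>()));

            // For each later word, keep only those starting positions that are still
            // followed by the sequence of query words processed so far.
            for (int i = 1; i < phrase.size(); i++) {
                Map<Integer, List<Integer>> wordOcc =
                        occurrences.getOrDefault(phrase.get(i), new HashMap<>());
                Map<Integer, List<Integer>> filtered = new HashMap<>();
                for (Map.Entry<Integer, List<Integer>> doc : candidates.entrySet()) {
                    List<Integer> positions = wordOcc.get(doc.getKey());
                    if (positions == null) {
                        continue;                     // document lacks this word: eliminate it
                    }
                    List<Integer> surviving = new ArrayList<>();
                    for (int start : doc.getValue()) {
                        if (positions.contains(start + i)) {
                            surviving.add(start);     // word i appears i positions after the start
                        }
                    }
                    if (!surviving.isEmpty()) {
                        filtered.put(doc.getKey(), surviving);
                    }
                }
                candidates = filtered;
            }
            return new ArrayList<>(candidates.keySet());
        }
    }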
VII. Future expansion

Not all functionality hoped for and/or intended for GUWgle could be implemented, even given the limited scope of the project.

Cached browsing was one casualty of time, platform instability, and redesign. Although the repository was originally designed to support the ability to follow links on a cached page to another cached page, the changing repository structure and the need to continually rebuild the repository delayed the implementation of this feature to the point that server troubles and time constraints prevented it. While the ability to view cached pages is implemented, cached browsing remains a future expansion.

One intriguing development, which never reached full implementation in the interest of time and core functionality, was a distributed version of the GUWgle system. Distributed GUWgle consists of five separate programs: one acts as a coordinating server, one as an indexer, and three as crawlers. These programs create an index on the fly, but the index proved to be unsorted due to multithreading issues. Due to lack of time and the need for attention elsewhere, index sorting was abandoned.

VIII. Team member roles

Tam Vu: fingerprinting; index/indexer design; complex Boolean queries; phrase searches; snippets; search engine user interface.

Sean McManus: crawler performance optimization; crawler/indexer design and implementation; crawler/indexer user interface; page rank and anchor text; repository design; distributed GUWgle.

Chris Kaukl: crawler design; repository design and implementation; cached pages; documentation.

IX. Conclusions

Given parameters that require extensive or indefinite scalability, tradeoffs between performance and simplicity in a search engine must be made. By designing within environment-dictated parameters, though, one may create implementations that combine conceptual simplicity and high performance. This alone does not ensure success: as shown by the evolution of the repository, naïve implementations tend to encounter obstacles as they scale. Taking these tendencies into consideration, GUWgle demonstrates that careful design and constrained operational parameters grant the luxury of both high performance and conceptual simplicity.