The GUWgle Search Engine
Christopher V. Kaukl (cvkaukl@cs.washington.edu)
Sean McManus (seanmc@cs.washington.edu)
Tam Vu (tvu@cs.washington.edu)
Abstract
This paper describes GUWgle, a search engine designed to provide high-performance searching within the University of Washington web space. GUWgle illustrates how constraining the scope of a search engine allows optimization of performance and functionality while permitting a simple, expandable implementation. This paper explains GUWgle's design at the system level, then examines each component and how the aforementioned constraints influence its design. How simplicity of implementation affects scalability is also examined.
I. Introduction
Search engines, like all software, must make tradeoffs and strike balances. In addition to the compromises between memory and processor utilization found in all software, search engines require unusual focus on more complex issues, among them performance versus feature set and simplicity versus scalability. Given an undefined or arbitrarily large scope of operation, emphasizing one of these aspects at the cost of another is inevitable. Constraining a search engine's scope of operation, however, removes the necessity for such polarization.
The following factors motivated GUWgle's design:
• Following the constraints of the project. GUWgle was created for the CSE 490i class, which required that search engines built for the class provide minimal search engine functionality, most importantly constraining the search scope to the University of Washington web space and ensuring that the crawler component of the search engine obeys polite crawling rules.
• Alleviating problems with the starter code. GUWgle is loosely based on the WebSPHINX crawler. This code was found to be buggy and inefficient; heavy modification and extension were necessary to meet the project requirements.
• Providing a simple implementation. Simple code is easy to modify and extend. On a pragmatic level, simple code facilitates working independently on different aspects of the search engine.
• Creating a good search engine. GUWgle is designed to provide fast, comprehensive, and feature-rich searches.
The rest of this paper will examine GUWgle’s implementation of a search engine. After
an overview of how GUWgle’s components interact, each component will be analyzed in
terms of its basic functionality and how, if at all, the principle of restricted scope allows
enhancement of that functionality beyond what would otherwise be possible.
Functionality that is not complete or does not work as well as desired will be enumerated
next, followed by an explanation of group member roles and the conclusion.
II. Overview of components
As its name suggests, GUWgle is based on the Google model. The environment dictates the final form of the engine, though: GUWgle's constricted scope allows a simpler implementation than Google's.
[Figure: component diagram showing the flow of data among the crawler, repository, indexer, index, query engine, the UW network, and the user.]
The above diagram provides a graphical representation of GUWgle’s components and the
manner in which data flows between them. Data is harvested from the web by the
crawler, which stores it in the repository. The indexer pulls this data from the repository,
organizes and ranks it, and stores the resulting data in the index. The user sends query
data to the query engine, which then pulls data from the index and repository as needed to
answer the user’s query. For deeper understanding of Google-based search engine
behavior, please refer to “The Anatomy of a Large-Scale Hypertextual Web Search
Engine” by Brin and Page.
III. The Crawler
The crawler component of the GUWgle search engine maintains politeness through the
use of multithreaded queues. GUWgle is based on WebSPHINX; therefore, the greater
volume of code conforms to the preexisting code base. However, WebSPHINX’s
fundamental behavior was radically changed for GUWgle. Previously, WebSPHINX used
a naïve multithreaded model to crawl pages. URLs for pages that had not been
downloaded were stored in a synchronized priority queue called the fetchQueue. On
startup, a predefined number of threads were created, all of which contended for the
opportunity to remove a link from the fetchQueue of links for downloading. After
downloading a page and extracting its links, threads again waited for access to the
fetchQueue in order to enqueue the newfound links. The process began once more after
adding the new links to the fetchQueue. This method would frequently result in
WebSPHINX downloading all content from a given server in a matter of seconds, then
moving on to another server, which is far from polite.
GUWgle’s first priority is permitting parallel downloads while maintaining politeness.
Toward this end, WebSPHINX’s threading paradigm was radically changed. Instead of
maintaining a single fetchQueue and crawler-wide threads, the crawler contains a hash
table mapping IP addresses to Worms (Worms are abstractions of threads containing
thread-specific data and methods). Every unique IP address is mapped to a single Worm,
which is responsible for handling documents from the server at that IP address. Each
Worm contains its own fetchQueue. When a Worm extracts links from an HTML
document, an IP lookup is performed on the link’s URL. The resulting IP address is used
as a key for the IP hash table to retrieve the appropriate Worm. The original Worm waits for the Worm to which the IP address belongs to relinquish control of its fetchQueue, then inserts the new link into that Worm's fetchQueue. If the IP address of the link does
not have a corresponding Worm in the IP hash table (i.e. no Worm is retrieved by the
initial hash table lookup), a new Worm is created, inserted into the hash table, and started
on the link. After finishing processing a page, the Worm sleeps for six seconds in order to relinquish CPU cycles to other Worms and ensure that the server is never accessed more than ten times a minute.
This implementation has an interesting side effect: GUWgle performs a host-wise
breadth-first search. After a thread has finished processing a page belonging to a server
at a certain IP address, the subsequent six seconds of sleep provide ample time for other
threads to process their pages. If a thread has not seen activity after being woken twice, it is killed to free memory. In practice, the number of hosts in the University of Washington IP address space ensures that over 200 threads are operating at any given time, which, when combined with network latency and bandwidth limitations, results in an average per-server access rate well under the ten accesses per minute dictated in the code.
Due to the relatively small size of the domain to be searched, scalability concerns such as
the number of threads that would be created while searching the entire internet may be
discounted. Should application in a larger web space be desired, though, the crawler's relatively intuitive implementation allows easy addition of thread pools and/or meta-queues, which would provide the necessary scalability.
One aspect of crawler page processing is fingerprinting in order to avoid storing duplicate
pages. After the text of a page is downloaded but before it is processed, an MD5
checksum is calculated on the page’s data. This checksum is used to index a hash table.
If no hash table value corresponding to the checksum key exists, an arbitrary value is
placed in the hash table entry indexed by the checksum and page processing proceeds. If
a value exists, though, the page is discarded to save space and time. After enabling the
checksum, it was found that approximately 8% of all pages in the UW web space are
duplicates.
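A minimal sketch of that duplicate check, assuming an in-memory hash table; the Fingerprinter class and seenChecksums field are illustrative names rather than GUWgle's actual identifiers.

    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class Fingerprinter {
        // Checksums of every page body seen so far; the mapped value is arbitrary.
        private final Map<String, Boolean> seenChecksums = new ConcurrentHashMap<>();

        // Returns true if this page body has already been stored and should be discarded.
        boolean isDuplicate(byte[] pageData) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(pageData);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // putIfAbsent returns the previous value, i.e. non-null if the key was present.
            return seenChecksums.putIfAbsent(hex.toString(), Boolean.TRUE) != null;
        }
    }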
Even within the constrained crawling space, memory inefficiencies in the original
WebSPHINX crawler limited the maximum duration of crawls to approximately 30
minutes, after which time physical memory ran out and disk paging began, effectively
halting the crawler after it had only crawled about 1,000 pages. After memory
optimizations, the crawler now uses a static amount of memory. The maximum crawl
duration to date has been 52 hours, over which time the crawler has retrieved 434,000
pages. This crawl was terminated only due to a diminishing number of pages being
retrieved.
IV. The repository
Google uses a compact, monolithic repository structure with a specialized internal layout.
With respect to this design decision, GUWgle deviates significantly from Google. Since
GUWgle must only store unique pages found on UW servers, this efficient specialization
could be jettisoned in favor of something much simpler. The decision to create a simple
repository which takes advantage of existing code libraries and data structures turned out
to be the cause of several problems; however, the same simplicity that caused the problems also simplified their solution.
The data structure for a stored web page has not changed throughout the evolution of the
repository. Web pages are stored as zip entries indexed by their document IDs (a simple
sequential number assigned during page processing). Each zip entry contains as its first line
the URL corresponding to the page. The subsequent lines constitute the complete HTML
content of the page. In anticipation of cached browsing, which would allow a user to
navigate through entirely cached pages, reverse repository entries were also specified. A
reverse repository entry indexes a stored document by its URL, as opposed to the
document ID. The only content of a reverse entry is the document ID itself, which allows all other data pertaining to that document to be retrieved.
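A sketch of reading a forward entry in this format with java.util.zip; the assumption that entries are named with the decimal document ID is illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    class RepositoryReader {
        // Returns { url, html } for the repository entry with the given document ID.
        static String[] readEntry(ZipFile zip, long docId) throws Exception {
            ZipEntry entry = zip.getEntry(Long.toString(docId));
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(zip.getInputStream(entry)));
            String url = in.readLine();               // first line: the page's URL
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {  // remaining lines: the HTML content
                html.append(line).append('\n');
            }
            in.close();
            return new String[] { url, html.toString() };
        }
    }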
Initially, the repository was to be a single zip file containing both regular entries and
reverse entries. However, the zip file specification would not allow that many entries. In
order to avoid this problem, the repository was broken up into a directory of zip files,
each of which contained a single entry and was named after the document ID of the entry.
The reverse repository entries were maintained in a single zip file in order to avoid
filesystem problems with the URL names. This solution created too many files for the
filesystem, though, so a hybrid of the two earlier attempts was created: 434 separate zip
files are assembled, each holding one thousand entries. The reverse repository entries
remain the same.
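The final layout amounts to a simple mapping from document ID to zip file; the directory name and file naming convention in the sketch below are assumptions for illustration.

    class RepositoryLayout {
        static final int ENTRIES_PER_ZIP = 1000;

        // Which zip file holds a given document (e.g. docId 123456 -> repository/123.zip).
        static String zipFileFor(long docId) {
            return "repository/" + (docId / ENTRIES_PER_ZIP) + ".zip";
        }

        // The entry name inside that zip file.
        static String entryNameFor(long docId) {
            return Long.toString(docId);
        }
    }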
V. Indexer and index
The indexer resembles the repository in both philosophy and design: a zip file is used for
its ease of implementation and conceptual approachability. As with Google, the indexer
uses a previously created repository to create the index. For each document in the
repository, the indexer parses out all applicable words and performs any trimming and
stemming on each word before using the word to index a word occurrence hash table.
The occurrence list is simply implemented as a string of page numbers and page
locations. The occurrence list is written to the index zip file as a zip entry indexed by the
word name.
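The sketch below illustrates this occurrence-list construction, with tokenization, trimming, and stemming elided. The docId:position string encoding is one plausible reading of "a string of page numbers and page locations", and all identifiers are illustrative rather than GUWgle's actual code.

    import java.io.FileOutputStream;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    class Indexer {
        // word -> "docId:position docId:position ..."
        private final Map<String, StringBuilder> occurrences = new HashMap<>();

        // Record every (document, position) pair for each stemmed word in a document.
        void indexDocument(long docId, List<String> stemmedWords) {
            for (int pos = 0; pos < stemmedWords.size(); pos++) {
                occurrences.computeIfAbsent(stemmedWords.get(pos), k -> new StringBuilder())
                           .append(docId).append(':').append(pos).append(' ');
            }
        }

        // Write one zip entry per word, named after the word itself.
        void writeIndex(String indexPath) throws Exception {
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(indexPath))) {
                for (Map.Entry<String, StringBuilder> e : occurrences.entrySet()) {
                    out.putNextEntry(new ZipEntry(e.getKey()));
                    out.write(e.getValue().toString().getBytes());
                    out.closeEntry();
                }
            }
        }
    }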
VI. Query engine
The query engine is best addressed in terms of its constituent functionality. Searching by
complex Boolean queries is allowed, as is exact phrase search. Thanks to the limited size of the repository, it is possible to combine these searches in an intuitive fashion. The
query page allows the user to submit queries for any combination of any words, all
words, no words, and exact phrases. Independent document ID lookups are performed on
the index for each word, and the resulting sets of document IDs are then subjected to set
operations depending on the criteria by which the words were selected. Sets produced by
“any word” queries are unified, while “all word” sets are intersected. “No word” sets are
also unified. Exact phrase search compares query words' occurrence information found
in the index. First, all documents containing the phrase’s first word are retrieved, with all
occurrences noted. As the rest of the words are processed, this set of documents is
filtered, eliminating documents in which there is no occurrence of the first word that is
followed by the sequence of query words processed thus far. The resulting set of
document IDs represents all documents containing the exact phrase. At this point, “any
word”, “all words”, and exact phrase sets are intersected and any resulting document IDs
which can be found in the set of “no word” results are subtracted.
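This reduces to standard union, intersection, and difference operations over sets of document IDs, as in the sketch below; the names are illustrative, and in practice criteria the user left empty would simply be skipped rather than intersected.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class QueryCombiner {
        // "Any word": union of the per-word document ID sets.
        static Set<Long> union(List<Set<Long>> perWord) {
            Set<Long> out = new HashSet<>();
            for (Set<Long> s : perWord) out.addAll(s);
            return out;
        }

        // "All words": intersection of the per-word document ID sets.
        static Set<Long> intersect(List<Set<Long>> perWord) {
            if (perWord.isEmpty()) return new HashSet<>();
            Set<Long> out = new HashSet<>(perWord.get(0));
            for (Set<Long> s : perWord) out.retainAll(s);
            return out;
        }

        // Final combination: intersect the positive criteria, then subtract "no word" hits.
        static Set<Long> combine(Set<Long> any, Set<Long> all,
                                 Set<Long> phrase, Set<Long> none) {
            Set<Long> result = new HashSet<>(any);
            result.retainAll(all);
            result.retainAll(phrase);
            result.removeAll(none);
            return result;
        }
    }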
Once a set of document IDs corresponding to the search parameters has been acquired,
the search results are displayed. Each result consists of the URL to a page, a text snippet,
and a link to the cached page. The URL is retrieved from the first line of the repository
entry for the given document ID. The snippet is simply a certain number of words before
and after the query word, the position of which is found in the index’s occurrence
information. Cached page links are Java Server Pages accessed with a document ID as a
parameter. The JSP retrieves the document from the repository and prints the HTML. It
should be noted that all results are ordered according to page rank.
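The snippet logic is a fixed window of words around a hit position taken from the index, roughly as in the sketch below; the window size and identifiers are assumptions.

    class Snippets {
        static final int WINDOW = 10;   // words shown on each side of the query-word hit

        // pageWords is the page text split into words; hitPosition comes from the index.
        static String snippet(String[] pageWords, int hitPosition) {
            int start = Math.max(0, hitPosition - WINDOW);
            int end = Math.min(pageWords.length, hitPosition + WINDOW + 1);
            StringBuilder sb = new StringBuilder("...");
            for (int i = start; i < end; i++) {
                sb.append(' ').append(pageWords[i]);
            }
            return sb.append(" ...").toString();
        }
    }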
Although the total number of operations involved in retrieving information about search
data may be high, the actual procedures involved are not conceptually difficult. The
tradeoff between conceptual simplicity and performance proved practically impossible to
reliably measure, though. The platform on which the query engine was run (cubist.cs.washington.edu) operated under wildly varying levels of load and, at times, reliability. Under ideal conditions, the query engine demonstrated response times of less than one second, even when dealing with complex queries. However, more conclusive
experimental results could not be obtained.
VII. Future expansion
Not all functionality hoped and/or intended for GUWgle could be implemented, even
given the limited scope of the project.
Cached browsing was one casualty of time, platform instability, and redesign. Although
the repository was originally designed to support the ability to follow links on a cached
page to another cached page, the changing repository structure and need to continually
rebuild the repository delayed the implementation of this feature to the point that server
troubles and time issues prevented it. While the ability to view cached pages is
implemented, cached browsing remains a future expansion.
One intriguing development, which never reached full implementation in the interest of time and core functionality, was a distributed version of the GUWgle system. Distributed
GUWgle is five separate programs, one of which acts as a coordinating server, one of
which acts as an indexer, and three of which act as crawlers. These programs create an
index on the fly, but the index proved to be unsorted due to multithreading issues. Due to
lack of time and need for attention elsewhere, index sorting was abandoned.
VIII. Team member roles
Tam Vu:
• Fingerprinting
• Index/indexer design
• Complex Boolean queries
• Phrase searches
• Snippets
• Search engine user interface
Sean McManus:
• Crawler performance optimization
• Crawler/indexer design/implementation
• Crawler/indexer user interface
• Page rank and anchor text
• Repository design
• Distributed GUWgle
Chris Kaukl:
• Crawler design
• Repository design/implementation
• Cached pages
• Documentation
IX. Conclusions
Given parameters that require extensive or indefinite scalability, tradeoffs between
performance and simplicity in a search engine must be made. By designing within
environment-dictated parameters, though, one may create implementations that combine
conceptual simplicity and high performance. However, this does not ensure success. As
shown by the evolution of the repository, naïve implementations tend to encounter
obstacles as they scale. Taking these tendencies into consideration, GUWgle
demonstrates that careful design and constrained operational parameters grant the luxury
of both high performance and conceptual simplicity.