TEST BED FOR SEARCH ENGINES

Adrian SARBU, Bogdan BUCUR, Paul-Alexandru CHIRITA, Radu STOICESCU, Alexandru TICA, Florin RADULESCU

Computer Science Department, "POLITEHNICA" University of Bucharest
313 Splaiul Independentei St., Sector 6, Bucharest, Romania
asarbu@cs.pub.ro, bbucur@ubisoft.ro, paul_chirita@yahoo.com, radu.stoicescu@philips.com, atica@ubisoft.ro, florin@cs.pub.ro

Abstract: The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. Crawler-based search engines, such as Google, create their listings automatically. The key component of every crawler-based search engine is the evaluation of page importance and the sorting of query results accordingly. Recent research has focused on improving page-rank algorithms, leading to the emergence of many new variants. This paper presents an open, modular search-engine architecture that allows easy testing of different approaches to page importance assessment.

Keywords: search engine, page rank, crawler, web, client-server, information retrieval.

1. BACKGROUND

Recent research on improving search engines has concentrated on two main aspects: search personalization and search speed. The former has mostly focused on developing personalized pagerank algorithms (Jeh and Widom, 2002a; Jeh and Widom, 2002b; Haveliwala, 2002). These algorithms extend the original Google pagerank algorithm and exploit the personalization vector introduced by Brin and Page (1998). Personalization targets either user profiles (Jeh and Widom, 2002a) or the similarity between the user query and specific topics (Haveliwala, 2002). The former approach computes one personalized pagerank vector (PPV) for each user and then decomposes these vectors into two parts: one shared by several users and computed offline, and one computed at run-time. This way, extensive storage of each PPV is partially avoided. The latter approach starts by computing offline a set of 16 pagerank vectors oriented on the 16 main topics of the Open Directory Project (ODP). At run-time, a similarity between the user query and each of these topics is computed, and the 16 page ranks are combined into a single one using these similarities as weights (a sketch of this combination step is given at the end of this section).

Furthermore, other researchers have tried to build topic-oriented search engines (Frankel et al., 1997). While these provide better results than general-purpose search engines, users find it difficult to switch between many engines when querying on different topics.

A more sensitive subject is search engine speed. It involves crawling speed, index access speed and pagerank computation speed. Future solutions will probably exploit the distributed nature of the WWW. Some researchers are already trying to build distributed indexes or to compute the pagerank in a distributed manner (Kamvar et al., 2003a; Haveliwala, 1999). The latter approach has proved quite effective: a local pagerank is first computed for each strongly connected component of the WWW graph, and these local ranks are then combined into an initial approximation of the Google pagerank. The parallelism available in the first step is obvious.

All existing search engines have weaknesses, even Google (link searches must be exact, it does not support full Boolean queries, it indexes only a part of each web page or PDF file, etc.). Search engine test beds are therefore still a necessity, and other work has been done in this area. The Stanford WebBase Project, for example, is built on the earlier Google project and is intended to be an infrastructure for searching, clustering and mining the Web.
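As a sketch, the topic-sensitive combination step mentioned above can be written as follows (the notation is ours, not Haveliwala's):

\[ \mathrm{rank}_q(p) \;=\; \sum_{j=1}^{16} \mathrm{sim}(q, c_j) \cdot PR_j(p) \]

where $PR_j$ is the pagerank vector biased towards ODP topic $c_j$ and $\mathrm{sim}(q, c_j)$ is the similarity, computed at run-time, between query $q$ and topic $c_j$.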
2. ENGINE ARCHITECTURE

This search engine has three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled". The spider returns to the site on a regular basis, such as every month or two, to look for changes. Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalogue, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, this book is updated with the new information. The third part is the search engine software, the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.

The three major elements outlined above are broken into five modules for improved manageability, development and customization of the search. These five modules are interconnected in a client-server architecture using sockets. This architecture allows for easy testing and evaluation of new modules (e.g., new page rank algorithms) and lets the user choose the optimum module for each search. The modules are written in C++, but thanks to the socket communication any programming language can be used, as long as the inter-module communication protocols are respected. The five basic modules, as presented in Fig 1, are:

URL Manager: Controls the web crawling activity, manages the page rank computation module and also feeds data to the word indexer.

Digger: Contains the minimal web page download mechanism. It is entirely driven by the URL Manager, which supplies it with URLs to be retrieved from the web servers.

Link Manager: Computes the page ranks as the URL Manager retrieves pages from the web.

Word Counter: The main module that handles creation and maintenance of the word dictionary and document index. This is a critical component of every search engine, as it must handle large amounts of data at very high speed.

Search: A CGI application running under Apache that provides the graphical interface for querying the indexed pages.

Fig 1. System architecture (data flows between the Digger, URL Manager, Link Manager, Word Counter and Search modules)

3. URL MANAGER

The URL Manager's main functionality consists in coordinating multiple Digger processes as they download web pages. To keep things simple, this component uses a client/server architecture, allowing a number of Diggers to connect to it. If there are no disk backups, the module assumes this is its first launch and queues the URLs found in the configuration file in the unvisited links structure. The seed URLs were www.yahoo.com and www.google.com, chosen so that the number of derived links would be sufficient. The unvisited links queue is used to feed the connected Digger processes. A Digger receives a URL from the URL Manager, downloads it (sending an HTTP request on port 80 of the host found in the URL), and feeds the result back to the Manager.
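As an illustration, a minimal Digger-style fetch could look as follows. This is a sketch under our own assumptions (POSIX sockets, a plain HTTP/1.0 GET, and a fetchPage name of our choosing); the actual Digger implementation is not listed here.

#include <string>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

// Fetch "http://host/path" with a bare HTTP/1.0 GET on port 80,
// as the Digger does; returns the raw response (headers + body).
std::string fetchPage(const std::string& host, const std::string& path) {
    addrinfo hints{}, *res = nullptr;
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host.c_str(), "80", &hints, &res) != 0)
        return "";                                 // host lookup failed

    int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sock < 0) { freeaddrinfo(res); return ""; }
    if (connect(sock, res->ai_addr, res->ai_addrlen) < 0) {
        freeaddrinfo(res);
        close(sock);
        return "";                                 // the "Error msg." case of Fig 1
    }
    freeaddrinfo(res);

    std::string req = "GET " + path + " HTTP/1.0\r\n"
                      "Host: " + host + "\r\n\r\n";
    send(sock, req.c_str(), req.size(), 0);

    std::string page;
    char buf[4096];
    ssize_t n;
    while ((n = recv(sock, buf, sizeof buf, 0)) > 0)
        page.append(buf, n);                       // accumulate the downloaded page
    close(sock);
    return page;                                   // handed back to the URL Manager
}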
When a valid response arrives, the URL Manager parses the HTML for links found in tags such as A, DIV, AREA, BASE, LINK or SPAN, and for the words found between the tags. These two kinds of information are then filtered. Relative links are converted to absolute links, using the information found in a BASE tag when present, and links pointing to documents other than HTML are dropped. A word is dropped if it consists only of digits, is longer than 30 characters or shorter than 2, starts or ends with a non-letter character, or begins with "&" and ends with ";". The filtered links are added to the unvisited links queue, and the words are packed and queued in a document data structure that will feed the Word Counter. To be able to restore the process after a crash, this module caches its data structures on disk.

The module allows another client, the Link Manager, to connect using the same client/server architecture. The latter module receives the web structure as pairs of parent document - child document associations and uses them to compute page ranks.

4. LINK MANAGER

This module handles the computation of page ranks. It also allows the Search process to connect to it for sorting query results by page importance. There are many brands of page ranking algorithms, among which the most popular seems to be the one used by Google. This search engine implements two page ranking algorithms: the first is Google's page rank, and the second, still under development, uses the geographical proximity of a web site to rank pages.

The implementation of the Google pagerank algorithm uses a Jacobi-like iteration called the power method. Suppose A is the adjacency matrix of the Web and x the pagerank vector. The computation starts with the initial solution

x^{(0)} = (1, 1, \ldots, 1)^T

At each iteration a new pagerank vector is computed using the same formula as in the Google algorithm (Brin and Page, 1998):

x^{(n+1)} = d \cdot A \cdot x^{(n)} + (1 - d) \cdot e

Here d is a constant called the damping factor, whose value is 0.85 in our case, and e is a vector of the same size as x, containing only 0 or 1 values. Brin and Page call this vector the personalization vector, and the general model the Random Surfer Model. It states that an imaginary surfer will follow, with probability d, an outgoing link of the page he is currently visiting (represented by the first term on the right-hand side of the equation) or will get bored, with probability (1-d), and jump to a completely different page (chosen using the e vector). It has been proved that A is a Markov matrix and that the power method converges for such matrices (Kamvar et al., 2003b). The convergence rate equals the damping factor, 0.85, which means that the computation will be fast even for a larger Web; a proof of this convergence rate was given by Kamvar and Haveliwala (2003).

The Link Manager has two parts: one that communicates with the URL Manager and one that communicates with the Search Manager. Communication is realized using Windows TCP/IP sockets. The first part is actually a client, which reads from the URL Manager structures of the form {parentID, childID} and first adds them to a database. For optimization reasons, the database is actually a vector containing structures of the type {parentID, vector_with_all_children}.
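As a sketch, one iteration of the power method over this {parentID, vector_with_all_children} structure could look as follows; PageLinks, powerStep and the uniform spreading of each page's rank over its out-links are our illustrative choices, not the engine's exact code.

#include <cstdint>
#include <vector>

// Mirrors the Link Manager's database: one entry per parsed page.
struct PageLinks {
    uint32_t parentID;
    std::vector<uint32_t> childIDs;   // all pages the parent links to
};

// One step x^(n+1) = d*A*x^(n) + (1-d)*e of the power method, with A
// spreading each page's rank uniformly over its out-links and e = (1,...,1).
std::vector<double> powerStep(const std::vector<PageLinks>& web,
                              const std::vector<double>& x,
                              double d = 0.85) {
    std::vector<double> next(x.size(), 1.0 - d);   // the (1-d)*e term
    for (const PageLinks& p : web) {
        if (p.childIDs.empty()) continue;          // dangling page, no out-links
        double share = d * x[p.parentID] / p.childIDs.size();
        for (uint32_t c : p.childIDs)
            next[c] += share;                      // the d*A*x term
    }
    return next;
}

The Link Manager would repeat this step until the difference between successive pagerank vectors falls below a chosen threshold.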
As the URL Manager discovers new links, the Link Manager adds them to the already existing structures (if the parent page was already parsed but the link was not) or creates new structures for them. After enough new pages have been added or enough modified pages have appeared, the Link Manager launches a separate Windows thread, which computes the new page ranks using the new graph of the Web pages. The page ranks are stored in a vector of structures of the type {pageID, pageRank}. The resulting new ranks are saved on disk, both as the vector of page reference structures and as the page rank vector.

The second part of the Link Manager is a Windows server, waiting for requests from the Search Manager. It receives a flow of page IDs representing the pages chosen as results for a query. This flow ends with a key page ID called QUERY_DONE. As the requests come in, they are added to a queue; after the flow ends, the Link Manager sends back a pagerank for each page ID popped from the queue. When the queue is empty, it sends QUERY_DONE back to the Search Manager. Finally, when the Search Manager wishes to end its activity, it sends another key ID, called QUERY_OVER, to the Link Manager, which then terminates the Windows thread serving that Search Manager. Several Search Managers can be served at the same time, so multiple searches can be solved using multiple threads.

The implementation of the new GeoProx module, still under development, mines the crawled web pages for addresses and then compares them to the user's location using a simple street graph. The pages containing locations closest to the user get a higher rank. As the test bed is already created, this project will further expand towards improved page ranking algorithms and indexing.

Fig 2. Inter-module communication scheme

5. WORD COUNTER

This module is the critical component of the search engine. It performs document and word indexing and handles query response. It is both a client and a server. Its client side is connected to the URL Manager, receiving parsed documents as lists of words separated by spaces. This information is then stored in a hash table having words as keys (a dictionary), used to map words to word IDs. Every word ID is associated with a position vector that gives the positions of the word in the document, and the document itself is stored as a vector of word IDs. This way of storing data provides lower memory fragmentation and higher access speed.

The server side of the module waits for queries from the user interface. These are strings, possibly containing quoted parts to ensure that the terms between the quotes are searched for as an exact phrase, with no other words in between. The words are looked up in the index, and a list of document IDs along with the corresponding match scores is sent back to the client that initiated the query. It is the client's task to sort these pages by their page rank.

6. SEARCH MODULE

This is the user interface that communicates with the rest of the components. It is a CGI application that displays a form expecting the user to input the text to be searched for. When the submit button is pressed, the application connects to the Word Counter and sends it the query string. After the latter finishes the search, the CGI application receives the document IDs and the matching scores. It then connects to the URL Manager to resolve the document IDs to actual URL strings that can be displayed to the user in a meaningful fashion. The last step is to connect to the Link Manager in order to retrieve the page ranks for the document IDs. These ranks are combined with the matching scores to sort the results as they will be displayed in the web page.
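This last exchange with the Link Manager could be sketched as below. The sentinel values, the raw binary wire format and the POSIX-style socket calls are our assumptions (the paper names QUERY_DONE and QUERY_OVER but does not give their values, and the original uses Windows sockets); partial send()/recv() handling is omitted for brevity.

#include <cstdint>
#include <vector>
#include <sys/socket.h>

// Illustrative sentinel IDs; the actual constants are not published.
const uint32_t QUERY_DONE = 0xFFFFFFFFu;
const uint32_t QUERY_OVER = 0xFFFFFFFEu;

// Stream the result page IDs of one query to the Link Manager and
// read back one pagerank per ID, in order, followed by QUERY_DONE.
std::vector<double> fetchRanks(int sock, const std::vector<uint32_t>& pageIDs) {
    for (uint32_t id : pageIDs)
        send(sock, &id, sizeof id, 0);             // the flow of page IDs...
    send(sock, &QUERY_DONE, sizeof QUERY_DONE, 0); // ...ended by the sentinel

    std::vector<double> ranks(pageIDs.size());
    for (double& r : ranks)
        recv(sock, &r, sizeof r, MSG_WAITALL);     // one rank per queued ID
    uint32_t done = 0;
    recv(sock, &done, sizeof done, MSG_WAITALL);   // expect QUERY_DONE back
    return ranks;
}

// On shutdown the Search module would send QUERY_OVER, telling the
// Link Manager to terminate the thread serving this connection:
//     send(sock, &QUERY_OVER, sizeof QUERY_OVER, 0);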
7. CONCLUSIONS

Thanks to the modular architecture, new modules can be written in any programming language and under any operating system. There are currently two versions of this search engine, one under Windows and one under Linux. Even when executed on different machines, modules from the two implementations are completely compatible. This test bed is meant to encourage developers and researchers to implement and test new algorithms that improve the quality of search.

As search engines evolve, more features are required to satisfy the users' requests. The page rank algorithm was a big step forward, and Google was pushed to the top of the search engine community by the quality of the returned results. In order to improve that quality, and in fact even anticipate the users' requests, advanced data mining techniques should be implemented in the next generation of search engines. The crawled pages should be completely cached for data extraction, and the query strings should be better processed to provide valuable user information, which will eventually lead to the partial personalization of the search. Two such improvements are the geographical localization of a company's website and the completion of the query string by examining the search history. The first is done by mining addresses out of web pages and comparing them with the user's location, while the second uses association rule mining on the query strings to find correlations between string atoms and complete, where possible, the search items. Both methods should modify the page rank algorithm.

8. REFERENCES

Brin, S. and L. Page (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, Volume 30, Pages 107-117.

Frankel, C., M. Swain and V. Athitsos (1997). WebSeer: An image search engine for the World Wide Web. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference.

Haveliwala, T. (1999). Efficient computation of PageRank. Stanford University Technical Report.

Haveliwala, T. (2002). Topic-sensitive PageRank. In Proceedings of the WWW2002 Conference.

Jeh, G. and J. Widom (2002a). Scaling personalized Web search. Stanford University Technical Report.

Jeh, G. and J. Widom (2002b). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Kamvar, S. and T. Haveliwala (2003). The second eigenvalue of the Google matrix. Stanford University Technical Report.

Kamvar, S., T. Haveliwala, C. Manning and G. Golub (2003a). Exploiting the block structure of the Web for computing PageRank. Stanford University Technical Report.

Kamvar, S., T. Haveliwala, C. Manning and G. Golub (2003b). Extrapolation methods for accelerating PageRank computations. Stanford University Technical Report.

http://google.stanford.edu/
http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
http://www.yahoo.com/
http://www.altavista.com/
http://www.lycos.com/
http://www.alltheweb.com/
http://www.culturalheritage.net/
http://www.searchuk.com/
http://www.webcrawler.com/