TEST BED FOR SEARCH ENGINES
Adrian SARBU, Bogdan BUCUR,
Paul-Alexandru CHIRITA, Radu STOICESCU,
Alexandru TICA, Florin RADULESCU
Computer Science Department, “POLITEHNICA” University of Bucharest
313 Splaiul Independentei St., Sector 6, Bucharest, Romania
asarbu@cs.pub.ro, bbucur@ubisoft.ro,
paul_chirita@yahoo.com, radu.stoicescu@philips.com,
atica@ubisoft.ro, florin@cs.pub.ro
Abstract: The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. Crawler-based search engines, such as Google, create their listings automatically. The key component of every crawler-based search engine is the evaluation of page importance and the sorting of query results accordingly. Recent research has focused on the improvement of page-rank algorithms, leading to the emergence of many new variants. This paper presents an open search-engine modular architecture that allows easy testing of different approaches to page importance assessment algorithms.
Keywords: search engine, page rank, crawler, web, client-server, information retrieval.
1. BACKGROUND
Recent research on improving search engines has concentrated on two main aspects: search personalization and search speed. The former has mostly focused on developing personalized pagerank algorithms (Jeh and Widom, 2002a; Jeh and Widom, 2002b; Haveliwala, 2002). These algorithms are extensions of the original Google pagerank algorithm and exploit the personalization vector introduced in (Brin and Page, 1998). They focus either on user profiles (Jeh and Widom, 2002a) or on the similarity between the user query and specific topics (Haveliwala, 2002). The former presents a method to compute one personalized pagerank vector (PPV) for each user and then decompose these vectors into two parts, one shared among several users and computed offline, and the other computed at run time. This way, extensive storage of each PPV is partially avoided. The latter starts by computing offline a set of 16 pagerank vectors oriented on the 16 main topics of the Open Directory Project (ODP). Then, the similarity between the user query and each of these topics is computed at run time, and finally the 16 page ranks are combined into a single one using different weights.
Furthermore, other researchers have tried to build topic-oriented search engines (Frankel, et al., 1997). While these provide better results than general search engines, users find it difficult to switch between many engines when they want to query on different topics.
A more sensitive subject is search engine speed, which involves crawling speed, index access speed and pagerank computation speed. Future solutions will probably exploit the distributed nature of the WWW. Some researchers are already trying to build distributed indexes or to compute the pagerank in a distributed manner (Kamvar, et al., 2003a; Haveliwala, 1999). The latter approach has proved quite effective: a local pagerank is first computed for each strongly connected component of the WWW graph, and these ranks are then combined into an initial approximation of the Google pagerank. The first step is obviously parallelizable.
All existing search engines have weaknesses, even Google (link searches must be exact, it does not support full Boolean queries, it indexes only a part of the web pages and PDF files, etc.). Search engine test beds are therefore still a necessity, and other work has been done in this area. The Stanford WebBase Project, for example, is built on the previous Google project and is intended to be an infrastructure for searching, clustering and mining the Web.
2. ENGINE ARCHITECTURE
This search engine has three major elements. First is
the spider, also called the crawler. The spider visits a
web page, reads it, and then follows links to other
pages within the site. This is what it means when
someone refers to a site being "spidered" or
"crawled." The spider returns to the site on a regular
basis, such as every month or two, to look for
changes. Everything the spider finds goes into the
second part of the search engine, the index. The
index, sometimes called the catalogue, is like a giant
book containing a copy of every web page that the
spider finds. If a web page changes, then this book is
updated with new information. Search engine
software is the third part of a search engine. This is
the program that sifts through the millions of pages
recorded in the index to find matches to a search and
rank them in order of what it believes is most
relevant.
The three major elements outlined above are broken into five modules for improved manageability, development and customization of the search. These five modules are interconnected through a client-server architecture using sockets. This architecture allows for easy testing and evaluation of new modules (e.g. new page rank algorithms) and allows the user to choose the optimum module for his search. The modules are written in C++, but due to the socket-based communication (illustrated by the sketch following Fig 1) any programming language can be used, as long as the inter-module communication protocols are respected.
The five basic modules, as presented in Fig 1, are:

URL Manager:
Controls the web crawling activity, manages the page rank computation module and also feeds data to the word indexer.

Digger:
Contains the minimal web page download mechanism. It is entirely driven by the URL Manager, which supplies it with URLs to be retrieved from the web servers.

Link Manager:
Computes the page ranks as the URL Manager retrieves the pages from the web.

Word Counter:
The main module that handles the creation and maintenance of the word dictionary and document index. This is a critical component of every search engine, as it must handle large amounts of data at very high speed.

Search:
A CGI application running under Apache that provides the graphical interface for querying the indexed pages.

Fig 1. System architecture (data flow between the Internet web servers, Digger, URL Manager, Link Manager, Word Counter and the Search module: downloaded pages, download requests and error messages, page IDs and URLs, packets containing DocIDs and word lists, packets containing parent DocIDs and referred DocIDs, page ranks, and the query page shown in the browser)
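To illustrate the socket-based interconnection of these modules, a minimal client sketch is given below. The port number, host and newline-terminated text command are purely illustrative assumptions, and Berkeley sockets are used for brevity; the actual inter-module packet formats of Fig 1 are not reproduced here.

// Minimal sketch of a module client connecting over TCP (Berkeley sockets).
// The port number, host and newline-terminated text command are illustrative
// assumptions; the real inter-module packet formats are not reproduced here.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                      // assumed module port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);  // module on the same host

    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    std::string request = "GET_URL\n";                // hypothetical command
    send(fd, request.data(), request.size(), 0);

    char reply[1024];
    ssize_t n = recv(fd, reply, sizeof(reply) - 1, 0);
    if (n > 0) { reply[n] = '\0'; std::printf("module replied: %s\n", reply); }

    close(fd);
    return 0;
}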
3. URL MANAGER
URLManager’s main functionality consists of coordinating multiple Digger processes as they download web pages. To keep things simple, this component uses a client/server architecture, allowing a number of Diggers to connect to it. If there are no disk backups, the module assumes this is its first launch and queues the URLs found in the configuration file in the unvisited links structure. In our setup these seed URLs were www.yahoo.com and www.google.com, chosen so that the number of derived links would be sufficient. The unvisited links queue is used to feed
the connected Digger processes. A Digger receives a URL from the URL Manager, downloads it (sending an HTTP request on port 80 of the host found in the URL) and feeds the result back to the Manager. The latter, in case of a valid response, parses the HTML for links found in tags such as A, DIV, AREA, BASE, LINK or SPAN, and for the words found between the tags. These two kinds of information are then filtered: relative links are converted to absolute links based on the information possibly found in a BASE tag, and links that point to documents other than HTML are dropped. Words are dropped if they consist only of numbers, are longer than 30 characters or shorter than 2, start or end with a character other than a letter, or begin with an & and end with ; (HTML entities). The filtered links are added to the unvisited links queue, and the words are packed and queued in a document data structure that will
feed the Word Counter. To be able to restore the
process after a crash, this module caches its data
structures on the disk. The module allows another
client, the Link Manager, to connect using the same
client/server architecture. The latter module receives
the web structure as pairs of parent document – child
document associations and uses them to compute
page ranks.
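As an illustration of the word filtering rules above, a possible filter is sketched below; the function name is an assumption and the thresholds follow the text.

// Sketch of the word filter described above: a word is kept only if it is
// 2 to 30 characters long, is not an HTML entity of the form "&...;",
// starts and ends with a letter, and does not consist only of digits.
// The function name is illustrative.
#include <algorithm>
#include <cctype>
#include <string>

bool keepWord(const std::string& w) {
    if (w.size() < 2 || w.size() > 30) return false;            // length limits
    if (w.front() == '&' && w.back() == ';') return false;      // entity like &nbsp;
    if (!std::isalpha(static_cast<unsigned char>(w.front())) ||
        !std::isalpha(static_cast<unsigned char>(w.back()))) return false;
    bool allDigits = std::all_of(w.begin(), w.end(), [](unsigned char c) {
        return std::isdigit(c) != 0;                             // purely numeric words
    });
    return !allDigits;
}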
4. LINK MANAGER
This module handles the computation of page ranks. It also allows the Search process to connect in order to sort query results by page importance. There are many variants of page ranking algorithms, among which the most popular seems to be the one used by Google. This search engine implements two page ranking algorithms: the first one is Google’s page rank, and the second one, still under development, uses the geographical proximity of a web site to rank pages.
The implementation of the Google pagerank algorithm uses a Jacobi-like iterative method, called the Power Method. Suppose A is the adjacency matrix of the Web and x the pagerank vector. The computation starts with an initial solution:
x^{(0)} = (1, 1, \ldots, 1)^T
At each iteration, a new pagerank vector is computed using the same formula as in the Google algorithm (Brin and Page, 1998):

x^{(n+1)} = d \cdot A \cdot x^{(n)} + (1 - d) \cdot e
Here d is a constant called the damping factor, and its value is 0.85 in our case. Furthermore, e is a vector of the same size as x, containing only 0 or 1 values. Brin and Page call this vector the personalization vector and the general model the Random Surfer Model. It states that an imaginary surfer will follow, with probability d, an outgoing link of the page it is currently visiting (represented by the first term on the right-hand side of the equation), or will get bored, with probability (1-d), and will jump to a completely different page (chosen using the e vector).
It was proved that matrix A is a Markov matrix and
the power method converges for such matrices
(Kamvar, et al., 2003b). The convergence rate is
0.85, which means that the computation will be fast,
even with a larger Web size. A proof for this
convergence rate was given by Kamvar and
Haveliwala (Kamvar and Haveliwala, 2003).
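A simplified sketch of this power iteration is given below, assuming the Web graph is stored as adjacency lists normalized by out-degree (the usual pagerank convention) and that e is the all-ones vector; dangling pages and the exact stopping test used by the Link Manager are not reproduced.

// Simplified sketch of the power iteration described above. outLinks[p] holds
// the pages referred to by page p; the rank of p is divided among its
// out-links. Iteration stops after maxIter steps or when the L1 change in the
// rank vector falls below eps. Parameter values are illustrative.
#include <cmath>
#include <vector>

std::vector<double> pageRank(const std::vector<std::vector<int>>& outLinks,
                             double d = 0.85, int maxIter = 50,
                             double eps = 1e-8) {
    const std::size_t n = outLinks.size();
    std::vector<double> x(n, 1.0);                  // x(0) = (1, ..., 1)
    for (int it = 0; it < maxIter; ++it) {
        std::vector<double> next(n, 1.0 - d);       // (1 - d) * e
        for (std::size_t p = 0; p < n; ++p) {       // distribute the rank of p
            if (outLinks[p].empty()) continue;
            double share = d * x[p] / outLinks[p].size();
            for (int q : outLinks[p]) next[q] += share;
        }
        double delta = 0.0;                         // L1 distance between iterates
        for (std::size_t i = 0; i < n; ++i) delta += std::fabs(next[i] - x[i]);
        x.swap(next);
        if (delta < eps) break;
    }
    return x;
}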
The Link Manager has two parts: one that communicates with the URL Manager and one that communicates with the Search Manager. Communication is realized using Windows TCP/IP sockets.
The first part is actually a client, which will read
from the URL Manager structures of the form
{parentID, childID} and will first add them to a
database. For optimization reasons, the database is
actually a vector containing structures of the type
{parentID, vector_with_all_children}. As the URL
Manager discovers new links, the Link Manager adds
them to the already existing structures (if the page
was already parsed, but the link was not), or it creates
a new structure for them. After enough new pages have been added or enough existing pages have been modified, the Link Manager launches a separate Windows thread, which computes the new page ranks using the updated graph of the Web pages. The page ranks are stored using a vector with structures of the type {pageID, pageRank}. The resulting ranks are saved on disk, both in the form of the vector of link structures (page references) and in the form of the page ranks vector.
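The structures described above can be sketched as follows; the type and field names are illustrative rather than the actual Link Manager declarations.

// Sketch of the in-memory structures described above; names are illustrative.
#include <vector>

struct LinkRecord {              // {parentID, vector_with_all_children}
    int parentID;
    std::vector<int> children;   // DocIDs referred to by the parent page
};

struct RankRecord {              // {pageID, pageRank}
    int pageID;
    double pageRank;
};

std::vector<LinkRecord> webGraph;   // updated as the URL Manager reports links
std::vector<RankRecord> pageRanks;  // recomputed by the background thread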
The second part of the Link Manager is a Windows server, waiting for requests from the Search Manager. It receives a flow of page IDs representing the pages chosen as results for a query. This flow ends with a key page ID called QUERY_DONE. As the requests come in, they are added to a queue, and after the flow ends, the Link Manager sends back a pagerank for each page ID popped from the queue. When the queue is empty, it sends QUERY_DONE back to the Search Manager. Finally, when the Search Manager wishes to end its activity, it sends another key ID to the Link Manager, called QUERY_OVER. This way, the Link Manager terminates the Windows thread serving that Search Manager. Several Search Managers can be served at the same time, so multiple searches can be handled using multiple threads.
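A sketch of this per-connection loop is given below, assuming a hypothetical wire format (one integer page ID per request, one double per reply), assumed sentinel values, and a placeholder lookupRank() function standing in for the page rank vector lookup; Berkeley sockets are used in the sketch for brevity.

// Sketch of the per-connection serving loop described above. Sentinel values,
// the wire format and lookupRank() are assumptions, not the real protocol.
#include <sys/socket.h>
#include <queue>

const int QUERY_DONE = -1;   // assumed sentinel: end of one query's ID flow
const int QUERY_OVER = -2;   // assumed sentinel: the Search Manager is finished

double lookupRank(int pageID);   // hypothetical lookup in the page ranks vector

void serveSearchManager(int sock) {
    std::queue<int> pending;
    for (;;) {
        int id = 0;
        if (recv(sock, &id, sizeof(id), 0) <= 0) break;   // connection closed
        if (id == QUERY_OVER) break;                      // terminate this thread
        if (id == QUERY_DONE) {                           // answer the buffered IDs
            while (!pending.empty()) {
                double rank = lookupRank(pending.front());
                pending.pop();
                send(sock, &rank, sizeof(rank), 0);
            }
            int done = QUERY_DONE;
            send(sock, &done, sizeof(done), 0);           // signal end of replies
        } else {
            pending.push(id);
        }
    }
}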
The implementation of the new GeoProx module,
still under development, mines the crawled web
pages for addresses and then compares them to the
user’s location using a simple street graph. The pages
containing locations closest to the user receive a higher rank. As the test bed is already in place, this project will further expand towards improved page ranking algorithms and indexing.
Fig 2. Inter-module communication scheme
5. WORD COUNTER
This module is the critical component of the search
engine. It performs document and word indexing and
handles query response. It is both a client and a
server. Its client side is connected to the URL
Manager, receiving parsed documents as lists of
words separated by a space. This information is then
stored in a hash table having words as keys (a
dictionary) that is used to map words to word IDs.
Every word ID is associated with a position vector that gives the positions of the word in the document.
The document is stored as a vector of word IDs. This
way of storing data provides for lower memory
fragmentation and higher access speed.
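The index structures described above can be sketched as follows; the names are illustrative, not the actual Word Counter declarations.

// Sketch of the index structures described above: a dictionary mapping each
// word to a numeric word ID, and per-document storage of the word IDs together
// with the positions where each word occurs. Names are illustrative.
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, int> dictionary;          // word -> word ID

struct Document {
    int docID;
    std::vector<int> words;                                // document as word IDs
    std::unordered_map<int, std::vector<int>> positions;   // word ID -> positions
};

int wordToID(const std::string& w) {
    auto it = dictionary.find(w);
    if (it != dictionary.end()) return it->second;
    int id = static_cast<int>(dictionary.size());          // assign the next free ID
    dictionary.emplace(w, id);
    return id;
}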
The server side of the module waits for queries from the user interface. These are strings, possibly containing quotes to ensure that the terms between them will be searched for as an exact phrase, without other words in between. The words are looked up in the index, and a list of document IDs along with the corresponding match scores is sent back to the client that initiated the query. It is the client’s task to sort these pages by their page rank.
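A possible way to use the position vectors for quoted phrases is sketched below: the quoted terms must occur at consecutive positions within the same document. The parameter and function names are assumptions.

// Sketch of phrase matching over the per-document position lists described
// above: phraseIDs holds the word IDs of the quoted terms, in order, and the
// function checks whether they occur at consecutive positions. Names are
// illustrative.
#include <algorithm>
#include <unordered_map>
#include <vector>

bool hasPhrase(const std::unordered_map<int, std::vector<int>>& positions,
               const std::vector<int>& phraseIDs) {
    if (phraseIDs.empty()) return false;
    auto first = positions.find(phraseIDs[0]);
    if (first == positions.end()) return false;
    for (int start : first->second) {                      // candidate start position
        bool match = true;
        for (std::size_t k = 1; k < phraseIDs.size(); ++k) {
            auto it = positions.find(phraseIDs[k]);
            if (it == positions.end() ||
                std::find(it->second.begin(), it->second.end(),
                          start + static_cast<int>(k)) == it->second.end()) {
                match = false;                             // term k not at start + k
                break;
            }
        }
        if (match) return true;
    }
    return false;
}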
6. SEARCH MODULE
This is the user interface that communicates with the
rest of the components. It is a CGI application that
displays a form expecting a user to input the text to
be searched for. When the submit button is pressed,
the application connects to the Word Counter and
sends it the query string. After the latter finishes the
search, the CGI application receives the document
IDs and the matching scores. It then connects to the
URL Manager to resolve the document IDs to actual URL strings that can be displayed to the user in a meaningful fashion. The last step is to connect to the Link Manager in order to retrieve the page ranks for the document IDs. These ranks are combined with the matching scores to sort the results before they are displayed in the web page.
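The final sorting step could look as sketched below. The exact combination formula is not detailed here, so a simple weighted sum with an assumed weight alpha is used purely for illustration.

// Sketch of the final ranking step: results are sorted by a combined score of
// the Word Counter match score and the Link Manager page rank. The weighted
// sum and the weight alpha are assumptions made for illustration only.
#include <algorithm>
#include <string>
#include <vector>

struct Result {
    std::string url;
    double matchScore;   // from the Word Counter
    double pageRank;     // from the Link Manager
};

void sortResults(std::vector<Result>& results, double alpha = 0.5) {
    std::sort(results.begin(), results.end(),
              [alpha](const Result& a, const Result& b) {
                  double sa = alpha * a.matchScore + (1.0 - alpha) * a.pageRank;
                  double sb = alpha * b.matchScore + (1.0 - alpha) * b.pageRank;
                  return sa > sb;              // highest combined score first
              });
}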
7. CONCLUSIONS
By implementing a modular architecture, new
modules can be written using any programming
language and any operating system. There are currently two versions of this search engine: one under Windows and one under Linux. If executed on
different machines, modules from the two
implementations are completely compatible. This
test-bed is meant to encourage developers and
researchers to implement and test new algorithms to
improve the quality of the search.
As search engines evolve, more features are required to satisfy the users’ requests. The Page Rank algorithm was a big step forward, and Google was pushed to the top of the search engine community by the quality of the returned results. In order to improve the quality and, in fact, even anticipate the users’ requests, advanced data mining techniques should be implemented in the next generation of search engines. The crawled pages should be completely cached for data extraction, and the query strings should be better processed to provide valuable user information, which will eventually lead to the partial personalization of the search. Two such improvements are the geographical localization of a company’s website and the completion of the query string by examining the search history. The first one is done by mining addresses out of web pages and comparing them with the user’s location, while the second uses association rule mining on the query strings to find correlations between query terms and completes the search items where possible. Both methods should modify the Page Rank algorithm.
8. REFERENCES
Brin, S. and L. Page (1998). The Anatomy of a
Large-Scale Hypertextual Web Search Engine.
Computer Networks and ISDN Systems Journal,
Volume 30, Pages 107-117.
Frankel, C., M. Swain and V. Athitsos (1997).
WebSeer: An Image Search Engine for the World
Wide Web. In Proceedings of the IEEE Computer
Vision and Pattern Recognition Conference.
Haveliwala, T. (1999). Efficient computation of
pagerank. Stanford University Technical Report.
Haveliwala, T. (2002). Topic-sensitive pagerank. In
Proceedings of the WWW2002 Conference.
Jeh, G. and J. Widom (2002a). Scaling personalized
Web search. Stanford University Technical Report.
Jeh, G. and J. Widom (2002b). SimRank: A
measure of structural-context similarity. In
Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining.
Kamvar, S. and T. Haveliwala (2003). The second
eigenvalue of the Google matrix. Stanford University
Technical Report.
Kamvar, S., T. Haveliwala, C. Manning and G.
Golub (2003a). Exploiting the block structure of the
Web for computing pagerank. Stanford University
Technical Report.
Kamvar, S., T. Haveliwala, C. Manning and G.
Golub (2003b). Extrapolation Methods for
Accelerating Pagerank Computations. Stanford
University Technical Report.
http://google.stanford.edu/
http://wwwdiglib.stanford.edu/~testbed/doc2/WebBase/
http://www.yahoo.com/
http://www.altavista.com/
http://www.lycos.com/
http://www.alltheweb.com/
http://www.culturalheritage.net/
http://www.searchuk.com/
http://www.webcrawler.com/