Deep Search : Concepts and Architecture for a
Next-Generation Information Search Engine
Simon Shepherd
(Professor)
Calum Cookson
(Ph.D. Student)
Master of Science Programme in Information Technology in eCommerce
School of Engineering, Design and Technology,
University of Bradford, BD7 1DP, UK.
Background
“If only I had more information…” is the recurrent cry! In the age of broadband Internet, we are not short of
information – we are drowning in it. A broadband connection can deliver as much information to the desktop
in 3½ hours as a human speed-reader can take in over a lifetime¹. The success of the Google engine is largely
due to its ability to present Web pages in a rank order that puts the pages the user is most likely to want to see at
the top of the list. This ability is due to its PageRank algorithm [1], which works on the principle that what
has been useful to other users with similar interests will most likely be useful to the current user. However, a
query on the words “Paris Hilton” soon shows up the shortcomings of even the Google search. Perhaps the user
is looking for travel information on a Hilton hotel in Paris. Sadly, all he will find on the first dozen or so pages of a
Google search is sites about the socialite’s latest party exploits.

¹ Assuming a 10 Mbps connection, that the data is text, and an estimated human speed-reading rate of 80 baud, it would take 50 years of continuous speed reading to take in the information.
It is becoming increasingly important for users to be able to take very large amounts of information and turn it
into genuinely useful intelligence, a process that the military has undertaken for centuries and which is
traditionally a very labour-intensive human task. When the police are investigating a major crime, they
sometimes fail to make a connection because of the sheer overwhelming number of individual facts and pieces
of paper. In order to make best use of the enormous information resources that are becoming available to
organisations, corporations and individuals, it is essential that the problem of accurate “Do what I mean, not
what I say” information search and retrieval be addressed.
Deep Search is a first attempt to address this hugely ambitious goal. The system is designed to be used not
only for public information retrieval from the Web but also for use by companies, corporations, police,
government, military and intelligence services to make more effective use of the collective corporate
knowledge base, built up often at great cost and over many years of careful, painstaking work. Deep Search
allows a much greater leverage on this investment.
In order to understand some of the philosophy behind its concepts, it is informative to look at the portal
screenshot shown in Figure 1.
Like all search engines, Deep Search has a box (1) where the user can type query keywords. Unlike other
search engines, however, it also has a box (2) and a Browse button for entering paths to the local file system. A
selector button (3) allows the user to choose between searching the Web or a local file system. In its default
state, Deep Search will act like any other simple search engine. It will simply find every document whose
keywords match the query keywords and return the list in the order it finds them. This is functional, but often
unhelpful. The real power of Deep Search lies in the set of mathematical and heuristic algorithms that
underlie the Rank, Link and Cluster selectors (4, 5, 6). These give Deep Search the ability to act like a super-fast
librarian who walks into a huge library where all the books are scattered on the floor. She can pick up and sort
the books onto the shelves almost instantly, according to several different classification criteria (including
criteria not generally available in real libraries), thus making information access and retrieval vastly simpler,
quicker and more rewarding.
Figure 1. Screenshot of the Deep Search web portal.

• RANK by connectivity (4) allows the user to switch on a ranking algorithm similar to Google's
PageRank but somewhat more powerful: a full forwards-and-backwards Markov transition
probability analysis of the connectivity link matrix. In other words, every page is analysed for
(hyper)links in the case of the Web, or references in the case of documents, to all other pages
(documents) in the collection, to see which pages/documents refer to others that may be of interest to the
reader. This allows Deep Search to implement fully both the “Random Surfer” and the “Interested
Reader” models. Not only can the user read a succession of related pages/documents in logical order
based on an analysis of the most likely forward links, he can also follow links back in a direct sequence
to obtain background and reference material, again in the most logical way. (A minimal sketch of this
style of connectivity ranking appears after this list.)

• LINK by semantics (5) allows the user to avail himself of the enormous power of Latent Semantic
Indexing (LSI) [3]. This is essentially a rank-reduction technique applied to the term-document matrix
to discover hidden (or latent) semantic links between documents that are not evident from their
keywords. Thus, information that could be of value to the reader is retrieved that would otherwise be
missed: cases where none of the entered keywords match keywords in a document that is nonetheless
relevant. (A toy LSI example also follows the list.)

• CLUSTER by affinity (6) allows the user to pigeonhole documents according to their similarity with
related documents and their dissimilarity with unrelated ones. For example, a Web search on
“SOAP” will yield many hundreds of documents covering washing soap, handmade soap, chemical
processes for making soap, soap operas and the Simple Object Access Protocol known by its acronym
‘SOAP’. The clustering algorithm will attempt to place documents concerning these very different
topics, which share only a keyword, in different clusters so that each topic may be distinguished
clearly. This allows both horizontal and vertical searches to be made very easily, with a powerful
ability to “drill down” into topics of particular interest. (A hedged clustering sketch follows the list
as well.)
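
Deep Search's actual ranking algorithm is not disclosed here, but the flavour of a forwards-and-backwards
connectivity analysis can be conveyed with a minimal power-iteration sketch over a toy link matrix. Everything
below (the function name, the damping factor, the four-page graph) is an illustrative assumption rather than
Deep Search internals; the point is that iterating on the adjacency matrix gives the forward “Random Surfer”
view, while iterating on its transpose gives the backward “Interested Reader” view.

```python
import numpy as np

def rank_by_connectivity(adjacency, damping=0.85, tol=1e-9):
    """Power iteration on the Markov transition matrix of a link graph.

    adjacency[i, j] = 1 if page i links to page j. Passing adjacency.T
    instead ranks pages by the links that point *into* them, i.e. the
    backward (reference-following) view of the same collection.
    """
    n = adjacency.shape[0]
    # Row-normalise to transition probabilities; a page with no outgoing
    # links ("dangling" page) is treated as linking everywhere uniformly.
    out_degree = adjacency.sum(axis=1, keepdims=True)
    transition = np.where(out_degree > 0,
                          adjacency / np.maximum(out_degree, 1),
                          1.0 / n)
    rank = np.full(n, 1.0 / n)
    while True:
        new_rank = (1 - damping) / n + damping * (rank @ transition)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank

# Toy collection of four pages: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {2}.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
forward = rank_by_connectivity(A)     # "Random Surfer": follow links onward
backward = rank_by_connectivity(A.T)  # "Interested Reader": trace references back
print(forward, backward)
```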
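
LSI itself is described in [3]; the following toy example shows the rank-reduction step: a truncated SVD of the
term-document matrix lets a query score documents that share no literal keyword with it. The vocabulary, the
rank k = 2 and the query-folding details are all our own illustrative choices, not Deep Search internals.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
terms = ["soap", "opera", "wash", "protocol", "xml"]
X = np.array([[1, 1, 1, 1],   # "soap" appears in every document
              [1, 0, 0, 0],   # doc 0 is about soap operas
              [0, 1, 0, 0],   # doc 1 is about washing soap
              [0, 0, 1, 0],   # doc 2 is about the SOAP protocol
              [0, 0, 1, 1]],  # doc 3 mentions "xml" but never "protocol"
             dtype=float)

# Truncated SVD: keep only the k strongest latent "concepts".
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T          # documents in concept space

def query_scores(query_terms):
    """Fold a keyword query into concept space and score all documents."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    q_vec = (q @ U[:, :k]) / s[:k]                 # standard LSI query folding
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    return (doc_vectors @ q_vec) / np.maximum(norms, 1e-12)

# "protocol" never occurs in doc 3, yet doc 3 can still score well, because
# the shared term "xml" links it to doc 2 through a latent concept.
print(query_scores({"protocol"}))
```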
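
The affinity measure and clustering method used by Deep Search are not given in this document (spectral
approaches such as [2] are one candidate). As a hedged stand-in, the sketch below runs a plain k-means over
concept-space document vectors like those produced by the LSI example; the data points and k = 2 are invented
for illustration.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster, until the centroids settle."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([points[labels == c].mean(axis=0)
                                  if np.any(labels == c) else centroids[c]
                                  for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

# Invented concept-space vectors: two documents about washing soap and two
# about the SOAP protocol, well separated along the two latent axes.
doc_vectors = np.array([[ 1.9,  0.1],
                        [ 2.1, -0.2],
                        [-0.1,  1.8],
                        [ 0.2,  2.2]])
print(kmeans(doc_vectors, k=2))  # e.g. [0 0 1 1]: one cluster per topic
```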
By combining these three powerful algorithms in a unique way, Deep Search sets out to solve the problems of
both synonymy and polysemy. The synonymy problem (where several different words mean the same thing) is
relatively straightforward and represents the current state-of-the-art in search engine technology. The polysemy
problem (where one word means several different things) is much harder to solve. Deep Search attempts to
overcome the polysemy problem by using logical links, semantic analysis and clustering in such a way as to
present search results in order of potential usefulness, with clusters of documents linked semantically by
meaning and context rather than simply by matching keywords and content.
Overview of the Technology
The mathematical ideas underlying the ranking, linking and clustering algorithms are not trivial and rely
fundamentally on eigenspace analyses of various matrices derived from the document collections under search.
However, the successful implementation of these algorithms is more of an engineering problem than a
mathematical one – the main problem is one of sheer scale. In the case of the Web, we are currently looking at
the eigenvector analysis of a square matrix of order roughly 3½ billion, growing at a rate of one million documents
per day. As large as this figure is, it represents only the visible Web, that is, those resources which are
indexable by textual keywords. This is only a tiny fraction of the true amount of information available in what
is known as the Deep Web (or “hidden Web”), which comprises such resources as images, video, audio, binary
executables and other content not easily indexed. Current estimates put this at around 600 billion items.
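
To make the scale problem concrete, a back-of-envelope calculation (our own arithmetic, using the figure
quoted above; the average of ten links per page is an assumed value) contrasts dense storage of such a matrix
with the sparse storage its link structure actually permits:

```python
# Back-of-envelope storage for the Web link matrix quoted above.
n = 3.5e9                 # square matrix of order ~3.5 billion pages
bytes_per_value = 8       # one double-precision float per stored entry

dense = n * n * bytes_per_value
print(f"dense:  {dense:.1e} bytes")   # ~9.8e19 bytes, i.e. ~100 exabytes

links_per_page = 10       # ASSUMED average out-degree; the matrix is sparse
sparse = n * links_per_page * (bytes_per_value + 8)   # value + column index
print(f"sparse: {sparse:.1e} bytes")  # ~5.6e11 bytes, i.e. ~560 gigabytes
```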
For such huge problems, conventional eigensolver techniques cannot be applied (since users will be prepared to
wait at most tens of minutes, rather than tens of years, for the results of their search query). We have largely solved
the theoretical problems for small-scale examples, so the basic mathematics is understood. We now need to
implement the algorithms “in anger” on real databases. An overview of the technologies is as follows:
We regret that, for commercial reasons, details of the
technology have been removed from this document.
References
[1] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Networks
and ISDN Systems, 30, 1998, 107–117.
[2] S. D. Kamvar et al., “Spectral Learning”, IJCAI, 2003.
[3] T. Kolda, “Latent Semantic Indexing…”
[4] T. H. Wei, “The Algebraic Foundations of Ranking Theory”, Cambridge University Press, London, 1952.