Deep Search: Concepts and Architecture for a Next-Generation Information Search Engine

Simon Shepherd (Professor)
Calum Cookson (Ph.D Student)

Master of Science Programme in Information Technology in eCommerce
School of Engineering, Design and Technology, University of Bradford, BD7 1DP, UK.

Background

“If only I had more information…” is the recurrent cry! In the age of broadband Internet, we are not short of information – we are drowning in it. A broadband connection can deliver as much information to the desktop in 3½ hours as a human speed-reader can take in over a lifetime (see footnote 1).

The success of the Google engine is largely due to its ability to present Web pages in a rank order that puts the pages the user is most likely to want to see at the top of the list. This ability is due to its PageRank algorithm [1] which works on the principle that what has been useful to other users with similar interests will most likely be useful to the current user. However, a query on the words “Paris Hilton” soon shows up the shortcomings in even the Google search. Perhaps the user is looking for travel information on the Paris hotel. Sadly, all he will find on the first dozen or so pages of a Google search is sites about the socialite’s latest party exploits.

It is becoming increasingly important for users to be able to take very large amounts of information and turn it into genuinely useful intelligence, a process that the military has undertaken for centuries and which is traditionally a very labour-intensive, human task. When the police are investigating a major crime they sometimes fail to make a connection because of the sheer overwhelming number of individual facts and pieces of paper. In order to make best use of the enormous information resources that are becoming available to organisations, corporations and individuals, it is essential that the problem of accurate “Do what I mean, not what I say” information search and retrieval is addressed.
Deep Search is a first attempt to address this hugely ambitious goal. The system is designed to be used not only for public information retrieval from the Web but also by companies, corporations, police, government, military and intelligence services to make more effective use of the collective corporate knowledge base, often built up at great cost over many years of careful, painstaking work. Deep Search allows much greater leverage on this investment.

In order to understand some of the philosophy behind its concepts, it is informative to look at the portal screenshot shown in Figure 1. Like all search engines, Deep Search has a box (1) where the user can type query keywords. Unlike other search engines, however, it also has a box (2) and a Browse button for entering paths to the local file system. A selector button (3) allows the user to choose between searching the Web or a local file system.

In its default state, Deep Search will act like any other simple search engine: it finds every document whose keywords match the query keywords and returns the list in the order it finds them. This can be useful, but is often unhelpful. The real power of Deep Search lies in the set of mathematical and heuristic algorithms that underlie the Rank, Link and Cluster selectors (4, 5, 6). These give Deep Search the ability to act like a super-fast librarian who walks into a huge library where all the books are scattered on the floor. She can pick up and sort the books onto the shelves almost instantly according to several different classification criteria (including criteria not generally available in real libraries), thus making information access and retrieval vastly simpler, quicker and more rewarding.

Figure 1. Screenshot of the Deep Search web portal. (Callouts 1 to 6 in the figure mark the controls referred to in the text.)

Footnote 1: Assuming a 10 Mbps connection, that the data is text, and an estimated human fast-reading rate of 80 baud, it would take 50 years of continuous speed reading to take in the information.
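The figures in footnote 1 can be checked with a few lines of arithmetic. The sketch below assumes that 80 baud corresponds to 80 bits per second and that a year has 365.25 days; neither assumption is stated in the text, and the variable names are illustrative only.

```python
# Check of the footnote's estimate: 3.5 hours of download versus years of reading.
# Assumptions (not stated in the text): 1 baud = 1 bit/s, 1 year = 365.25 days.
LINK_RATE_BPS = 10_000_000   # 10 Mbps connection
DOWNLOAD_HOURS = 3.5         # download time quoted in the Background section
READING_RATE_BPS = 80        # "80 baud" fast-reading rate

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Total bits delivered by the connection in the download window.
bits_delivered = LINK_RATE_BPS * DOWNLOAD_HOURS * 3600

# Time needed to read the same number of bits continuously.
reading_years = bits_delivered / READING_RATE_BPS / SECONDS_PER_YEAR

print(f"{bits_delivered:.3g} bits delivered")
print(f"{reading_years:.1f} years of continuous reading")
```

Under these assumptions the result comes out at roughly 50 years, in line with the footnote.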
RANK by connectivity (4) allows the user to switch on a ranking algorithm similar to the Google PageRank but somewhat more powerful. This is a full forwards-and-backwards Markov transition probability analysis of the connectivity link matrix. In other words, every page (in the case of the Web) or document is analysed for hyperlinks or references to all other pages or documents in the collection, to see which ones refer to others that may be of interest to the reader. This allows Deep Search to implement fully both the “Random Surfer” and the “Interested Reader” models. Not only can the user read a succession of related pages or documents in logical order, based on an analysis of the most likely forward links, he can also follow links back in a direct sequence to obtain background and reference material, again in the most logical way.

LINK by semantics (5) allows the user to avail himself of the enormous power of Latent Semantic Indexing (LSI) [3]. This is essentially a rank-reduction technique applied to the term-document matrix to discover hidden (or latent) semantic links between documents that are not evident from their keywords. Thus, documents that are relevant to the reader can be retrieved even when none of the entered keywords match the keywords in those documents, cases in which they would otherwise be missed.

CLUSTER by affinity (6) allows the user to pigeonhole documents according to their similarity to related documents and their dissimilarity to all others. For example, a Web search on “SOAP” will yield many hundreds of documents covering washing soap, handmade soap, chemical processes for making soap, soap operas and the Simple Object Access Protocol, known by its acronym ‘SOAP’. The clustering algorithm will attempt to place documents concerning these very different topics, related only by a shared keyword, in different clusters so that each topic may be distinguished clearly.
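The rank-reduction idea behind LINK by semantics can be illustrated at toy scale. The sketch below is not the Deep Search implementation, whose details are not given in this document; it is a minimal textbook-style version of LSI under stated assumptions: a hypothetical five-term vocabulary and five tiny documents, raw term counts as weights, and a two-dimensional latent space spanned by the two leading eigenvectors of the term-by-term matrix A·Aᵀ (equivalently, the two leading left singular vectors of the term-document matrix A), found by power iteration with deflation. All function and variable names are illustrative.

```python
import math

# Hypothetical vocabulary and documents, chosen only for illustration.
TERMS = ["car", "automobile", "engine", "soap", "opera"]
DOCS = [
    ["car", "engine"],
    ["automobile", "engine"],            # never mentions "car"
    ["car", "automobile", "engine"],
    ["soap", "opera"],
    ["soap", "opera"],
]


def term_vector(words):
    """Raw term counts over TERMS (one column of the term-document matrix)."""
    return [float(words.count(t)) for t in TERMS]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot(a, b) / (na * nb)


def leading_eigenvectors(sym, k, iters=500):
    """Top-k eigenvectors of a symmetric PSD matrix: power iteration + deflation."""
    m = [row[:] for row in sym]
    size = len(m)
    found = []
    for _ in range(k):
        v = [1.0] * size
        for _ in range(iters):
            w = [dot(row, v) for row in m]
            length = math.sqrt(dot(w, w))
            v = [x / length for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in m])
        found.append(v)
        # Deflate: subtract the component just found so the next pass finds the next one.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(size)]
             for i in range(size)]
    return found


doc_vectors = [term_vector(d) for d in DOCS]
# Term-by-term matrix A * A^T, where the columns of A are the document vectors.
gram = [[sum(d[i] * d[j] for d in doc_vectors) for j in range(len(TERMS))]
        for i in range(len(TERMS))]
basis = leading_eigenvectors(gram, k=2)


def to_latent(vec):
    """Project a term-space vector onto the reduced (latent) space."""
    return [dot(u, vec) for u in basis]


query = term_vector(["car"])
keyword_score = cosine(query, doc_vectors[1])
latent_score = cosine(to_latent(query), to_latent(doc_vectors[1]))
unrelated_score = cosine(to_latent(query), to_latent(doc_vectors[3]))

print(f"keyword match with 'automobile engine': {keyword_score:.2f}")
print(f"latent match with 'automobile engine':  {latent_score:.2f}")
print(f"latent match with 'soap opera':         {abs(unrelated_score):.2f}")
```

In this toy collection the query “car” shares no keyword with the document “automobile engine”, so plain keyword matching scores it zero, yet the two lie close together in the reduced space because “car” and “automobile” both co-occur with “engine”. The “soap opera” documents remain far from the query.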
Clustering in this way allows both horizontal and vertical searches to be made very easily, with a powerful ability to “drill down” into topics of particular interest.

By combining these three powerful algorithms in a unique way, Deep Search sets out to solve the problems of both synonymy and polysemy. The synonymy problem (where several different words mean the same thing) is relatively straightforward and represents the current state of the art in search engine technology. The polysemy problem (where one word means several different things) is much harder to solve. Deep Search attempts to overcome the polysemy problem by using logical links, semantic analysis and clustering in such a way as to present search results in order of potential usefulness, with clusters of documents linked semantically by meaning and context rather than simply by matching keywords and content.

Overview of the Technology

The mathematical ideas underlying the ranking, linking and clustering algorithms are not trivial and rely fundamentally on eigenspace analyses of various matrices derived from the document collections under search. However, the successful implementation of these algorithms is more of an engineering problem than a mathematical one: the main problem is one of sheer scale. In the case of the Web, we are currently looking at the eigenvector analysis of a matrix 3½ billion elements square and growing at a rate of one million documents per day. As large as this figure is, it represents only the visible Web, that is, those resources which are indexable by textual keywords. This is only a tiny fraction of the true amount of information available in what is known as the Deep Web (or “hidden Web”), which comprises such resources as images, video, audio, binary executables and other content not easily indexed. Current estimates put the Deep Web at around 600 billion items.
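The eigenvector analysis referred to above can be illustrated at toy scale with power iteration, which repeatedly applies the Markov transition matrix instead of solving for eigenvectors directly. The sketch below is a standard PageRank-style iteration in the spirit of [1], not the Deep Search ranking algorithm, whose details are not given in this document. Running it on the link graph and again on the reversed link graph is one plausible reading of the “forwards-and-backwards” analysis described under RANK by connectivity, and is an assumption. The five-document graph, the damping factor of 0.85 and all names are hypothetical.

```python
def stationary_rank(links, damping=0.85, iters=100):
    """Power iteration on the damped random-surfer Markov chain over `links`."""
    pages = sorted(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives the same small "random jump" share.
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page with no outgoing links is treated as linking to every page.
            targets = links[p] or pages
            share = damping * score[p] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score


def reverse_links(links):
    """Transpose the link graph: every edge p -> t becomes t -> p."""
    reversed_graph = {p: [] for p in links}
    for p, targets in links.items():
        for t in targets:
            reversed_graph[t].append(p)
    return reversed_graph


# Hypothetical collection: an index page, three chapters and one shared reference.
LINKS = {
    "index": ["intro", "methods", "results"],
    "intro": ["reference"],
    "methods": ["reference"],
    "results": ["reference"],
    "reference": [],
}

forward = stationary_rank(LINKS)                   # following links forwards
backward = stationary_rank(reverse_links(LINKS))   # following links backwards

print("top page, forward analysis: ", max(forward, key=forward.get))
print("top page, backward analysis:", max(backward, key=backward.get))
```

In the forward direction the score concentrates on the document that the others cite (`reference`); on the reversed graph it concentrates on the document from which the others are reached (`index`). Each iteration touches every link once, so the cost per pass grows with the number of links rather than with the square of the number of documents.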
For problems on this scale, conventional eigensolver techniques cannot be applied, since users will be prepared to wait at most tens of minutes, rather than tens of years, for the results of their search query. We have largely solved the theoretical problems for small-scale examples and so the basic mathematics is understood. We now need to implement the algorithms “in anger” on real databases. An overview of the technologies is as follows:

We regret that, for commercial reasons, details of the technology have been removed from this document.

References

[1] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Networks and ISDN Systems, 30, 1998, 107–117.
[2] S. D. Kamvar et al., “Spectral Learning”, IJCAI, 2003.
[3] T. Kolda, “Latent Semantic Indexing…”
[4] T. H. Wei, “The Algebraic Foundations of Ranking Theory”, Cambridge University Press, London, 1952.