Information Retrieval from Distributed Databases

By Ananth Anandhakrishnan
Autumn 2004

Abstract

Although distributed information retrieval improves certain aspects of information retrieval, such as scalability and removing a single point of failure, there are still problems in the selection and merging components used to retrieve relevant resources. This paper looks at various methods and algorithms used to improve resource selection and result merging in distributed information retrieval systems. The paper begins with a basic introduction to information retrieval and distributed computing. It then moves into the field of distributed information retrieval, examining the sub-components of a distributed information retrieval system. These are the same as in any information retrieval system, but they operate in different ways because of the environments they exist in. The main focus will be on the resource selection and result merging components. The standards and protocols used in information retrieval are also covered; these are a very important set of rules which help the interoperability of information of different data types coming from different types of network systems. Examples of applications that use distributed information retrieval are reviewed in this paper, such as Emerge and Harvest, which have roughly the same function of information retrieval but use different standards to obtain information. SETI@HOME is another distributed application, used for distributed data mining.

Contents

Discussion Notes
Questions for Discussion
Introduction
Introduction to Distributed Systems
Centralised Information Retrieval
Distributed Information Retrieval
The Sub-components
    Resource selection and ranking methods
    Merging ranked lists
    Semi Supervised Learning model for result merging
    The selection algorithm
    The Lemur toolkit
Federated Search Engines
Other issues
    Improve scalability by replication
    Standards and Protocols
Other examples of distributed information retrieval
    Emerge
    SETI@Home
    Harvest
Conclusion
References
Discussion Notes

This section gives the reader an introduction to distributed information retrieval, to provide some background knowledge of the field before reading the rest of this report.

An important feature of distributed information retrieval is the indexing of documents. Keeping an up-to-date index of the web can benefit everyone in obtaining the most relevant and latest documents. A distributed approach to this has been implemented called GRUB, a client program which, when downloaded, uses the CPU power and internet bandwidth of the client machine to do the web crawling. The article below [1] discusses the current state of distributed searching on the web.

The state of distributed search

Grub aims to take participants' unused bandwidth and CPU cycles to crawl more than 10 billion Web documents and maintain an up-to-date index of the Web. Distributed search, which is still very much an emerging field, works in a way that will be familiar to people who donate processor time to projects like distributed.net and SETI@Home. By downloading and running a screensaver, you contribute some of your bandwidth and CPU cycles to spider and analyze Web pages. When you're not using your computer, the Grub client is activated. Your computer crawls the Web, following hyperlinks from document to document. The more people who do this, the bigger the index gets.

Web crawling for updated indexes

Andre Stechert, Grub's director of technology, says, "distributed search makes it possible to deliver results that are fresher, more comprehensive, and more relevant. It achieves this by making it feasible to maintain up-to-date crawl data and metadata in the 10+-billion document range and to use heavyweight page analysis algorithms that look deeper than keywords."

Incorporating peer-to-peer technology has been a goal of Web searchers since the days of Napster, but until now has been mainly the domain of academic research papers, with many saying that it's an unachievable pipe dream. If you're Google, though, with a multi-million dollar cluster of Linux servers doing your spidering, it's more like a nightmare -- and the ever-decreasing cost of domestic bandwidth means that, dream or nightmare, it's starting to look more and more possible. We asked Google for a reaction to Grub and P2P search, but the company is currently subject to the SEC "quiet period" before its IPO.
Grub doesn't directly provide a search engine -- rather, Grub clients crawl the Web to provide an index that can be used either to improve an existing search engine or create an entirely new one. Today only WiseNut, which is also part of corporate parent LookSmart, is using crawl results from Grub to improve its index, but Grub makes its results available to anyone who wants to use them, with an XML-based query API.

A few test searches using WiseNut show that while the concept may be good, the execution is incomplete. For any search term you care to try, WiseNut returns far fewer results than the big search engines, in part because the Grub index is currently small compared to most search engines' indexes.

Andre Stechert believes that community Web crawling projects could help. Distributed crawl indexes, he says, "are today where [Internet] directories were before the advent of open directories like dmoz. They are proprietary and, for the most part, redundant. Creating and maintaining them incurs a large cost for the companies owning these infrastructures and for ISPs and Webmasters who are hosting the sites being crawled.... We think it's important to provide a community-based crawl database because it's a distraction from the main problem at hand: providing relevant, speedy answers to user queries. The challenge of bringing up a production-scale, production-quality crawling infrastructure is significant enough to chew up lots of time from any team. By lowering the barrier to entry for search engine research, we hope to focus more of the research effort where value is being created for the world at large."

That's what distributed search is all about: lowering barriers to entry. Just as open directories helped Google and others to challenge Yahoo's seemingly unassailable position (just look at the list of sites using data from DMOZ, which includes many well-known search engines), distributed crawling could do the same for Google's competitors. Imagine a world where any new player in the search market can focus solely on providing an excellent user interface, without having to worry about the logistics of crawling the enormous World Wide Web. That is the future that projects like Grub promise. As Stechert says, "distributed computing is an enabling platform for the future of search." Grub currently spiders about 100 million URLs per day.

P2P searching

Peer-to-peer (P2P) is a model in which communication responsibilities are taken away from a dedicated server and shared among client machines. The first well-known system to appear from this technology was Napster, an online file sharing application which allowed users to swap the latest chart music in mp3 format. Napster ran as an illegal file sharing service until 2001. It now operates as a commercial service, following the lawsuit filed against it at the end of 2000, and many other P2P file sharing applications such as Kazaa and WinMX were created and later took over from Napster.

Napster relied on a centralised list being produced, where all results from a user's music query would be merged and presented in a single list, so there was still a need for a centralised server. Gnutella is another P2P technology which is based on a decentralised file list, where the service is not bound to a centralised server.
However, this new technology did not last very long: there were huge problems with the scalability of the software, making the service unreliable [2].

Google is a centralised search engine: a central organisation controls the indexing and presentation of the information. P2P searching allows users to publish information into the search index for other users, which makes the method more dynamic and up to date, as with GRUB, which builds the most up-to-date web indexes. Grub is still new, and the index it has built up is very small compared to the likes of Google. If it gains wide support it could prove to be a very useful tool for obtaining the most up-to-date indexes.

P2P has proven to be a great tool for file sharing between communities, but for a wider service or commercial purpose there is still a lot of work to be done in building effective searching and merging techniques.

Questions for Discussion

The questions I raise in this report are for the reader's benefit, to give a clear picture of what is involved in distributed information retrieval and how things work. The questions were also used to guide me in learning about certain aspects of distributed information retrieval, because the whole area was very new to me. Some of these questions were only raised once I had gained in-depth knowledge of how distributed information retrieval works.

Distributed computer systems: what are they and how do they work? Is the WWW an example of a distributed system?

Centralised information retrieval: why might it be desirable to keep a centralised repository of all the resources? What are the drawbacks? How can distributed information retrieval help?

Distributed information retrieval: how can concepts in distributed computing help with distributed information retrieval? How does it work?

Components of a distributed information retrieval system: what are the main components of the system? What problems are there currently with these components? How do resource selection algorithms work? What is the aim of the Semi Supervised Learning model? Are there any applications that use this?

Improving distributed information retrieval: how can applying standards and protocols help the retrieval of information from distributed heterogeneous sites?

1 Introduction

In this report I will begin by introducing the reader to the basics of distributed computer systems, to give a better understanding of how a distributed information retrieval system works and why there is a need for it. The main body of the report is based around some of the important sub-components of distributed information retrieval, specifically focusing on the selection of resources from distributed databases and the merging of the ranked lists of resources into a single list. This will involve examining mathematical algorithms that use statistics and probability.

There is no doubt that distributed information retrieval solves many of the problems faced by centralised information retrieval systems. However, centralised systems must have their benefits, and these benefits should be able to aid distributed information retrieval in some way. An investigation into this will also be carried out, along with a look at current applications available on the web which use distributed information retrieval technology.

The following section describes what information retrieval is and how it works, to give the reader some background knowledge of the field as a whole.

What is information retrieval?
Information retrieval is 'the retrieval of information from a collection of written documents. The retrieved documents aim at satisfying a user information need usually expressed in natural language [3].' This information or set of data can be anything from documents to media files. Searching for specific information can be improved by basing search results on the information embedded in the file. This embedded information is called metadata: data about data.

How does information retrieval work?

This section draws on Susan Feldman's paper "NLP Meets the Jabberwocky: Natural Language Processing in Information Retrieval" [4], to show what stages are involved in information retrieval.

An information retrieval system consists of three separate files:
- A list of all documents, which includes the indexing information or resource description, that is, information about each document (author, title, date).
- A dictionary, which is a collection of all the unique words in the database.
- An inversion list, which stores the locations of every occurrence of each word in the dictionary.

Document Processing - documents stored in an information retrieval system have unique identifiers attached, so that a document can be grouped with other documents of the same kind and also identified individually.

Query Processing - retrieval systems use one of two approaches, or a combination of the two, to process the query. One possibility is the use of Boolean operators: AND, OR and NOT statements are used to match the query against a document. The other is to use probability and statistics; this matches all the words in the query against the documents to see how often they occur.

Query Matching - when the system searches with Boolean operators, it will only return the documents which satisfy the query condition. For example, if the user enters the two words "rebellion" and "Angola" and requires them to be within 15 words of each other, the system will only pick out the documents which obey this condition. If a document contains the two words but they are separated by 16 words, it will not be retrieved. With the probability and statistics approach, the system brings back all the matches for the query, with the most frequently matching documents first.

Ranking and Sorting - the system returns documents sorted in a specified order, usually with the most relevant at the top.

Figure 1.1 shows the basic structure of what happens in an information retrieval system, from the point where the user enters a query at the user interface.

Figure 1.1: how information retrieval works. A query entered at the user interface is translated and run against the centralised database; a ranking component produces a ranked list of page titles, from which the display (HTML pages) listing the documents is generated. Framework of the diagram borrowed from Tanenbaum and Van Steen, "Distributed Systems" [5].

2 Introduction to Distributed Systems

Distributed systems have now been in use for more than 40 years. In 1961 MIT introduced the Compatible Time Sharing System, which allowed separate terminals in different offices to access the same hardware. In 1969 the United States developed ARPANET to allow computers at remote sites to communicate; from this the internet was born. The internet gave rise to the World Wide Web, which now hosts a vast number of distributed services [5].

What is a distributed system?
'A distributed system is a collection of independent computers that appears to its users as a single coherent system.' [5]

The statement that the system appears to the user as a single coherent system is not true of all distributed systems. The World Wide Web is an example: when the user does a search on the web, the returned results are obtained from different sites from all around the world. However, the user is made aware of where the results are coming from, because the returned list comes back with URL links.

How is it built?

A distributed system in a homogeneous environment is managed by a distributed operating system, which is able to manage the sharing of resources between the computers and hide the fact that they are a collection of computers. In a heterogeneous environment the system is managed by a network operating system, but this is unable to provide the single view a distributed operating system provides. The single view is instead achieved by software: middleware is an additional layer of software between the distributed application and the operating system, used to hide the heterogeneity of the collection of underlying platforms [5].

Figure 2.1: general structure of a distributed system. Servers A to D are connected by a network, and a software level above them (supporting the distributed application) hides the differences in the underlying platforms and provides system transparency [5].

As shown in figure 2.1, servers are networked together on a Local Area Network (LAN) or a Wide Area Network (WAN). The servers can be set up in many ways. One possibility is a replicated system such as a cluster, where every node holds the same information; there is one primary node, and the others are secondary. If the primary server goes down, one of the secondary servers takes over as primary until the problem with the original primary server has been fixed. This is one type of cluster; there is also clustered computing, where the nodes share their resources or workload between them. The Google search engine is an example of clustered server computing; the service runs on a cluster of over 4,000 Linux servers which do the web crawling [1].

What are the goals of distributed systems?

- Allow users to access remote resources and share them with other users in a controlled way.
- Hide the fact that the system is a distributed system; this is called making the system transparent. There are different types of transparency, for example access transparency: hiding differences in data representation and in how a resource is accessed.
- The system should be able to adapt to changes, and provide the same level of performance as the system becomes larger (scalability).
- The system should be fault tolerant: able to maintain the same level of performance after a partial failure of the system.

What types of distributed systems are there?

The client-server model is one where the server hosts services such as file and printing services, and the clients, which are normal workstations, request those services. There are two variants of this model: vertical distribution, used in organisations located at one site with multiple servers, and horizontal distribution, used in organisations whose servers and users are located in different geographic areas.

Database systems like ORACLE have distributed database management functionality.
Database tables can be split into fragments so that users of the database system only see what is relevant to them. For example, if a company only needs information about its customers in England, then the database holding information about customers all around the UK can be fragmented so that only the customers from England appear on that company's system.

Air traffic control systems have their services distributed geographically. Each service is supported with redundancy, so if there is a partial failure of the service there is a backup server to take its place.

3 Centralised Information Retrieval

There is a reason why it is desirable to keep a centralised repository of information:
- Centralised systems are usually in control of the resources in their databases, and so can build accurate descriptions of the resources. As a result, the selection and merging of information in a centralised system produce more relevant results, because the operations take place at one site which uses a single method of selection, ranking and merging.

However, having a centralised repository of a huge set of resources is not ideal when millions of users are accessing the same search engine to retrieve information. This increases network traffic, reducing the effectiveness of the service. With information ever growing, and different types of information (file formats, various sizes) emerging, it is not scalable to keep everything on a centralised system. If the centralised site were to crash, the whole system would be unavailable to its users, so a single point of failure must be avoided. Centralised retrieval systems that hold information specific to a particular field of study, in formats that are unrecognisable to other systems, can only be hosted on a server that is capable of handling that information.

Distributed information retrieval solves some of these problems:
- Since the search is distributed, more than one machine can process the retrieval of information.
- It improves scalability, reducing server load and network traffic by distributing the service to groups of users in different geographic areas (replicated systems).
- There is no single point of failure: if one server goes down, there is always another server to search from.
- There are now standards and protocols defined to enable heterogeneous networks, which hold heterogeneous information, to work together and provide a uniform view for the user. These standards and protocols are discussed later.

The selection and merging of information from distributed sites is an area where distributed information retrieval finds it difficult to match the performance of a centralised information retrieval system. We will look at how the problem of selecting and merging resources is tackled in distributed environments later in this report.

4 Distributed Information Retrieval

The section above describes the differences between distributed information retrieval and centralised information retrieval. I have not yet given a definition of distributed information retrieval, but from reading the introduction its purpose should be clear. 'The goal of distributed information retrieval is to enable the identification and retrieval of data sets relevant to a general description or query, wherever those data sets may be located or hosted' [6].

There are different ways to think of distributed information retrieval systems.
The main point is that information is gathered from distributed sites, but where are these sites? Are they within an organisation, or a collection of independent resource providers? I think we can break distributed information retrieval systems into two types of environments: cooperative and uncooperative.

Cooperative environments

There are different types of cooperative environment. Figure 4.1 shows an environment where the distributed system belongs to a single organisation.

Figure 4.1: basic structure of a distributed information retrieval system, with a user application sitting in front of a set of distributed databases.

A library as a cooperative environment

Say we have a library organisation with sites all around London, where each library has a collection of internet-accessible resources (e-journals, e-books) in different categories (literature, science, computing, geography, history, sports), and libraries may hold resources of the same type. Each library maintains its own database with the resources, a unique identifier for each resource, detailed descriptions of the resources, and statistical information about the resource content. The library organisation has an online search engine which enables users to search for any online resource in any category across all the libraries' databases.

Example of a query

A user requests a list of e-journals and e-books related to 'natural language and information retrieval' in the category of computing. The search application passes this request to all the individual library databases. These databases return lists of unique identifiers of the relevant resources, which are merged together by the search engine to present a single ranked list to the user. If the user finds a resource they want to view, the resource identifier is used to retrieve it.

The other type of cooperative environment is one where the distributed information system is linked to various resource providers who also supply information about their documents to make searching more accurate. Each provider may hold the same types of documents; this area is covered in more detail in the next section.

An uncooperative environment is one where resource providers do not provide additional descriptions of the documents residing in their databases; this makes retrieving accurate results difficult. A method used to overcome this problem is discussed in the next section.

5 The Sub-components

Note: this section is broken into two parts. The first part breaks down the technical details of the components of distributed information retrieval systems, which are the same as in any other information retrieval system but work in a different way. The second part focuses on the resource selection algorithm and shows how it works mathematically. By reading the first part you can get a good understanding of how resource selection and merging work in a distributed environment. Readers who wish to learn more about how the algorithms work can follow this section all the way through; otherwise, swiftly press page down when you start seeing maths.

There are four parts of distributed information retrieval that are important for obtaining the best possible results: resource description, resource selection, query translation and result merging [7].

Resource Description - for each database it is essential to build a resource description to give more knowledge of what type of information the database holds. START is a protocol for acquiring and describing database contents from the source provider.
Contents are described using vocabulary and term frequency information. The resource description also contains a list of the words in the documents, the number of occurrences of each word, and a unique id for each document; this is used at the resource selection stage. Information that is provided includes the average document length, inverse document frequency, and detailed resource descriptions. The START protocol can be used in cooperative environments, but it cannot be used in uncooperative environments. To obtain resource descriptions from uncooperative environments a method called query-based sampling is used, which involves querying the databases to obtain resource descriptions from them. This has proven to be an effective method of obtaining resource descriptions.

Resource Selection - identifying the databases that contain documents relevant to a query, based on the resource descriptions. This process is carried out by an algorithm (a set of instructions) which uses probability and statistics. These statistics are used to select and rank the databases by their relevance to the query, and the results are passed on to the merging component.

Query Translation - the representation of the query entered by the user is translated so that it maps onto the relevant resources in the selected databases.

Result Merging - after results are returned from the selected databases they are compiled into a single list, similar to how it would be presented by a centralised information retrieval system, and presented to the user.

Resource selection and ranking methods

Resource selection involves identifying a small set of databases within the distributed information retrieval system that contain documents relevant to a query requested by a user. The resource selection component is an important part of the distributed information retrieval system model; it resides at the core of the system, where it has access to all the databases and up-to-date records of the information in each database.

"If the distribution of relevant documents across the different databases were known, databases could be ranked by the number of relevant documents they contain; such a ranking is called a relevance based ranking (RBR). Resource ranking algorithms are evaluated by how closely they approximate a relevance-based ranking" [8].

There are many different types of algorithms used in resource selection and ranking:

CORI "models each document database as a very large document, and ranks each database using a modified version of a well-known document ranking algorithm" [9]. This algorithm uses resource descriptions that consist of vocabulary, term frequency and corpus information.

The KL divergence algorithm is based on language modelling. "In the Language Model retrieval algorithm, each document is considered as a sample of text generated from a specific language. The language model for each document is estimated based on the word frequency patterns observed in the text. The relevance of a document to a query is estimated by how likely the query is to be generated by the language model that generated the document" [11].

The Relevant Document Distribution Estimation (ReDDE) algorithm "estimates the distribution of relevant documents across the databases for each user query and ranks databases according to this distribution of relevant documents" [8]. The ReDDE algorithm uses constants to model the probability of relevance given a document and to determine how much of a centralised complete database ranking should be used to model relevance.
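To make description-based ranking more concrete, the short sketch below scores each database using only the term statistics and document counts held in its resource description. It is an illustration of the general idea behind description-based algorithms such as CORI, not the published CORI formula, and the sample descriptions and weighting in it are invented for the example.

from math import log

# Resource descriptions as they might be built from cooperative providers or
# query-based sampling: term occurrence counts plus an estimated document
# count per database. Databases and numbers are invented for illustration.
resource_descriptions = {
    "library_computing": {"terms": {"retrieval": 420, "language": 310, "protein": 2},
                          "num_docs": 50000},
    "library_biology":   {"terms": {"retrieval": 15, "language": 40, "protein": 900},
                          "num_docs": 80000},
}

def rank_databases(query_terms, descriptions):
    """Score each database by how often the query terms occur in its resource
    description, dampened by how many databases mention the term at all."""
    num_dbs = len(descriptions)
    scores = {}
    for name, desc in descriptions.items():
        score = 0.0
        for term in query_terms:
            tf = desc["terms"].get(term, 0)
            # number of databases whose description contains the term
            df = sum(1 for d in descriptions.values() if term in d["terms"])
            if tf and df:
                # frequency relative to database size, weighted by term rarity
                score += (tf / desc["num_docs"]) * log(1 + num_dbs / df)
        scores[name] = score
    # the highest-scoring databases are the ones selected for the query
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_databases(["language", "retrieval"], resource_descriptions))

Real selection algorithms such as CORI and ReDDE refine this idea with corpus statistics and estimates of database size, as described in the following sections.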
Evaluation of these algorithms against test databases of various sizes shows that the CORI algorithm is not suited to large distributed environments. The KL-divergence algorithm performs to some extent, but its performance is inconsistent when more than ten databases are involved. The ReDDE algorithm was found to be more effective in these environments, showing consistent results throughout the tests.

Merging ranked lists

After resource ranking and selection, the query is forwarded to the selected databases, and then a result-merging algorithm merges the ranked lists returned from the different databases into a single, final ranked list. One problem in result merging is that the databases do not use the same retrieval algorithm to produce their lists of returned documents, which means they produce different statistics. Applying a standard so that a single algorithm is used everywhere would solve this problem, but it would not be effective. Result merging with current solutions such as raw score merging (merging based on the document scores as returned) and round robin merging (taking documents from each returned list in turn, without regard to their relevance) occurs at the client-end application, so it is isolated from the rest of the distributed information system. The information available for normalising database-specific document scores is very limited, and so these solutions are based on assumptions.

Semi Supervised Learning model for result merging

The Semi Supervised Learning (SSL) model involves producing a single ranked list of documents that closely approximates the ranked list that would be produced if all of the documents from all of the databases were stored in a single, global database [5]. This is achieved by running a centralised sample database in parallel with the distributed databases, as a reference against which the merged results are calibrated.

The centralised sample database

This database is built up using the query-based sampling method: the distributed databases are queried to retrieve documents, from which resource descriptions are created. Once the documents are retrieved they are scanned and resource descriptions are built from them, after which the documents can be discarded, because the resource selection stage of retrieval does not need the documents themselves to build a ranked list. However, 'The documents obtained by sampling all of the available databases can be combined into a single searchable database. This centralized sample database is a representative sample of the single global database that would be created if all of the documents from all of the databases were combined into a single database. The vocabulary and frequency information in the centralized sample database would be expected to approximate the vocabulary and frequency patterns across the complete set of available databases' [6].

The SSL algorithm is used for merging the documents returned from the distributed databases. Merging is more effective when it occurs where the selection of documents occurs, rather than being isolated at the client. A sketch of how the sampling that builds the centralised sample database might be carried out is given below.
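The sketch below shows one way query-based sampling might be organised. The function search_database() is a hypothetical stand-in for whatever query interface a (possibly uncooperative) database exposes; everything else (the seed probe term, number of queries, documents kept per query) is an assumption made for the example, not a prescription from the papers.

import random
from collections import Counter

def query_based_sampling(database_names, search_database, num_queries=75, top_k=4):
    """Build per-database resource descriptions and a centralised sample
    database by sending single-term probe queries to each database and
    keeping the text of the top-ranked documents that come back.
    search_database(name, term, top_k) is assumed to return a list of
    (doc_id, text) pairs for the given database."""
    descriptions = {}          # database name -> term statistics
    centralised_sample = {}    # (database name, doc_id) -> document text
    for name in database_names:
        terms = Counter()
        seen = set()
        probe = "information"  # arbitrary seed term for the first query
        for _ in range(num_queries):
            for doc_id, text in search_database(name, probe, top_k):
                if doc_id in seen:
                    continue   # count each sampled document only once
                seen.add(doc_id)
                terms.update(text.lower().split())
                centralised_sample[(name, doc_id)] = text
            # choose later probe terms from the vocabulary seen so far,
            # so the sample gradually reflects the database's own language
            if terms:
                probe = random.choice(list(terms))
        descriptions[name] = {"terms": terms, "sampled_docs": len(seen)}
    return descriptions, centralised_sample

The sampled documents can then be indexed to form the centralised sample database described above, while the term statistics serve as the resource descriptions used for selection.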
How distributed information retrieval works in more detail

- A user enters a query.
- The query is used to rank the collection of databases, from which a set of databases is selected.
- The query is then broadcast to all the selected databases, each of which produces a ranked list of matches with document ids and scores. The document ids and scores are fed into the merging algorithm.
- The query is also broadcast to the centralised sample database running in parallel, and its ranked list of document ids and scores is also fed into the merging algorithm. The ranked list provided by the centralised sample database influences how the resources from the distributed databases are merged.

The SSL algorithm specifically models result merging as the task of transforming sets of database-specific document scores into a single set of database-independent document scores, using the documents acquired by query-based sampling as training data.

Figure 5.1: a more detailed diagram of a distributed information retrieval system using the SSL method. The query is sent both to the databases chosen by the selection algorithm, which return individual ranked lists with database-specific scores, and to the centralised sample database (which holds resource descriptions of the documents on all databases, obtained by querying). The merging component combines the document rankings, converting database-specific scores into database-independent scores, and produces a single merged list ranked by relevance.

The selection algorithm

This example is taken from papers by Luo Si and Jamie Callan, who are members of the Information Retrieval Group, which "studies a wide range of issues related to finding, organizing, analyzing, and communicating information". Most of the explanations are taken from the papers written by Luo Si and Jamie Callan; I have added additional writing to try to make the mathematics clearer [7] [8] [9] [10] [11].

The goal of resource selection is to select a small set of resources that contain many relevant documents. The Relevant Document Distribution Estimation (ReDDE) algorithm tries to estimate the distribution of relevant documents among the available databases. The number of documents in database $C_j$ that are relevant to query $q$ is estimated as:

Equation 1: estimated number of relevant documents in the $j$th database for a specific query

$$\widehat{Rel}_q(j) = \sum_{d_i \in C_j} P(rel \mid d_i)\, P(d_i \mid C_j)\, N_{C_j}$$

where:
- $N_{C_j}$ is the number of documents in database $C_j$;
- $P(d_i \mid C_j)$ is the probability of selecting document $d_i$ from the collection of documents in database $C_j$;
- $P(rel \mid d_i)$ is the probability that document $d_i$ is relevant.

In cooperative environments we can say that the probability $P(d_i \mid C_j)$ is $1/N_{C_j}$. The same holds in uncooperative environments: even though there is no cooperation from the providers over resource descriptions, it is possible to build resource descriptions through query-based sampling. As mentioned in the previous section, this is done by submitting queries to the database and examining the documents that are returned. Substituting this probability into Equation 1, and summing over the sampled documents rather than the whole database, gives:

Equation 2: estimated number of relevant documents in the $j$th database, given a sample of documents from the database

$$\widehat{Rel}_q(j) \approx \sum_{d_i \in C_{j\_samp}} P(rel \mid d_i)\, \frac{N_{C_j}}{N_{C_{j\_samp}}}$$

where $N_{C_{j\_samp}}$ is the number of sampled documents from database $C_j$.

A centralised complete database is the union of all the documents in all the databases. $P(rel \mid d_i)$ is the probability of relevance given a specific document, calculated by reference to the centralised complete database. $P(rel \mid d_i)$ can be simplified to the probability of relevance given the document's rank when the centralised complete database is searched by an effective retrieval method. The probability distribution function is a step function: for the documents at the top of the ranked list the probability is a positive constant, and it is 0 otherwise.
This can be modelled as follows:

Equation 3: the probability of relevance given the document's rank

$$P(rel \mid d_i) = \begin{cases} C_q & \text{if } Rank\_central(d_i) < ratio \times N_{all} \\ 0 & \text{otherwise} \end{cases}$$

where:
- $C_q$ is a query-dependent constant;
- $Rank\_central(d_i)$ is the rank of the document in the centralised complete database;
- $N_{all}$ is the number of documents in all the databases.

The ReDDE algorithm uses these constants to model the relevance given a document and to determine how much of the centralised complete database ranking to consider. The ratio controls what fraction of the centralised complete database is searched for relevant documents. For example, with a ratio of 0.003 and 1,000,000 documents in total, the top 3,000 documents are considered for relevance ranking.

A centralised database that actually keeps all the documents of all the databases is impractical. However, the rank of a document in the centralised complete database can be estimated from its rank within the centralised sample database, which is the union of all the documents sampled from the different databases by query-based sampling. The query is submitted to the centralised sample database and a document ranking is returned. Given representative resource descriptions and database size estimates, we can estimate how documents from the different databases would be arranged in a ranked list if the centralised complete database existed and were searched. The rank of a document in the centralised complete database is estimated from the centralised sample database as:

Equation 4: estimated rank of a document in the centralised complete database

$$Rank\_central(d_i) = \sum_{d_j:\; Rank\_samp(d_j) < Rank\_samp(d_i)} \frac{N_{C(d_j)}}{N_{C(d_j)\_samp}}$$

where $C(d_j)$ is the database that sampled document $d_j$ came from, so each document ranked above $d_i$ in the centralised sample database stands in for $N_{C(d_j)}/N_{C(d_j)\_samp}$ documents of the complete database.

The last step is to normalise the estimates in Equation 2 to remove the query-dependent constant, which gives the distribution of relevant documents among the databases:

Equation 5: estimated distribution of relevant documents

$$\widehat{dist}_q(j) = \frac{\widehat{Rel}_q(j)}{\sum_{k} \widehat{Rel}_q(k)}$$
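The following sketch pulls Equations 2 to 5 together, assuming the centralised sample database can already be searched. Here search_sample_db() is a hypothetical function returning the sampled documents that match the query, best first, as (database name, document id) pairs, and the database sizes and sample sizes are assumed to be known or estimated.

def redde_rank(query, databases, search_sample_db, ratio=0.003, c_q=1.0):
    """Rough sketch of the ReDDE resource-ranking calculation.
    databases maps each database name to {"size": N_Cj, "sample_size": N_Cj_samp}."""
    # scale factor: each sampled document stands in for N_Cj / N_Cj_samp documents
    scale = {name: d["size"] / d["sample_size"] for name, d in databases.items()}
    n_all = sum(d["size"] for d in databases.values())
    threshold = ratio * n_all          # how far down the estimated complete ranking
                                       # documents still count as relevant
    rel = {name: 0.0 for name in databases}
    rank_central = 0.0
    for db_name, _doc_id in search_sample_db(query):
        if rank_central < threshold:
            # Equations 2 and 3: constant probability of relevance near the top,
            # accumulated per database and scaled up to the full database size
            rel[db_name] += c_q * scale[db_name]
        # Equation 4: advance the estimated rank in the centralised complete database
        rank_central += scale[db_name]
    # Equation 5: normalise to a distribution of relevant documents
    total = sum(rel.values()) or 1.0
    return sorted(((name, r / total) for name, r in rel.items()),
                  key=lambda kv: kv[1], reverse=True)

Databases would then be selected in order of their estimated share of the relevant documents.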
Resource selection algorithms are typically compared using the recall metric $R_k$, with $B$ as a baseline ranking, which is often the RBR (relevance based ranking), and $E$ as the ranking provided by the resource selection algorithm:

$$R_k = \frac{\sum_{i=1}^{k} E_i}{\sum_{i=1}^{k} B_i}$$

where $E_i$ is the number of relevant documents in the $i$th ranked database of $E$, and $B_i$ is the number of relevant documents in the $i$th ranked database of $B$.

After resource ranking and selection, the query is forwarded to the selected databases, and a result-merging algorithm merges the ranked lists returned from the different databases into a single, final ranked list. The documents gathered from the distributed databases are merged using the SSL method. The SSL method is available in the Lemur toolkit, a system widely used in information retrieval, written in C and C++ and designed as a research system to run under Unix operating systems, although it can also run under Windows [12].

The Lemur toolkit

"The Lemur toolkit is designed to facilitate research in language modelling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification" [12]. Lemur has many applications for indexing and retrieval that are fully functional for many purposes, and the source code is provided to allow users to build their own classes using the existing methods.

In distributed IR Lemur provides applications for:
- query-based sampling
- database ranking (CORI)
- result merging (CORI, single-regression and multi-regression merge)

The regression merge applications use the SSL algorithm, which can be combined with the ReDDE algorithm for merging the results. The regression maps the database-specific scores onto database-independent scores so that a better merged list can be built with minimal redundancy; a sketch of this mapping follows.
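As an illustration of the single-regression idea, the sketch below fits, for each selected database, a linear mapping from its own scores to the scores the centralised sample database gave the same documents for the same query, and then applies that mapping to everything the database returned. The data layout and the fallback used when there is too little overlap are assumptions made for the example, not details taken from the Lemur implementation.

def ssl_merge(db_results, sample_scores):
    """db_results maps a database name to a list of (doc_id, score) pairs as
    returned by that database; sample_scores maps (database name, doc_id) to
    the score the centralised sample database gave the same document for the
    same query. Documents present in both act as the training data."""
    merged = []
    for db_name, results in db_results.items():
        pairs = [(score, sample_scores[(db_name, doc_id)])
                 for doc_id, score in results
                 if (db_name, doc_id) in sample_scores]
        if len(pairs) >= 2:
            # least-squares fit of centralised_score = a * database_score + b
            n = len(pairs)
            sx = sum(x for x, _ in pairs)
            sy = sum(y for _, y in pairs)
            sxx = sum(x * x for x, _ in pairs)
            sxy = sum(x * y for x, y in pairs)
            denom = n * sxx - sx * sx
            a = (n * sxy - sx * sy) / denom if denom else 0.0
            b = (sy - a * sx) / n
        else:
            a, b = 1.0, 0.0    # too little overlap: keep the scores unchanged
        # map every returned document onto the database-independent scale
        merged.extend((doc_id, a * score + b, db_name) for doc_id, score in results)
    # one final list for the user, ranked by the normalised scores
    return sorted(merged, key=lambda item: item[1], reverse=True)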
6 Federated Search Engines

A federated search engine combines multiple 'channels of information' to provide resources to the user through a single interface: resources such as e-journals, subscription databases, electronic print collections, other digital repositories, and the Internet. There are different types of federated search engine: ones which allow the user to select which databases they want to search for their documents, and ones where the user does not need to know which databases are used for the service. Vendors in the federated search area such as MuseSearch and WebFeat offer their products as a way to search multiple subscription resources at one time through an easy-to-use front end.

Federated search engines make use of resource selection and ranking and of the SSL merging algorithm to retrieve documents from distributed databases. A query is broadcast to a group of heterogeneous databases which specialise in a specific area of study (such as medical physics); the results retrieved from the databases are merged together and presented to the user in a unified format with minimal duplication.

Prism [13] is a commercial search engine and translator system developed by WebFeat which is used with the Thomson ISI Web of Knowledge [14] system to provide its users with a federated search engine giving simultaneous access to a large variety of resources:
- ISI® resources
- ISI partner resources
- Subscribed databases
- Freely available databases
- Editorially evaluated Web sites
- Select publishers' full-text documents
- Library catalogue holdings
- Proprietary databases

With it an institution can:
- Develop a single, convenient portal to all of the institution's electronic research resources (translation of any database)
- Manage and organize diverse library collections
- Customize the search interface to meet access and format preferences
- Receive detailed usage reports
- Have an extended library automation system solution with advanced integration services
- Receive simple set-up and ongoing maintenance by WebFeat specialists

Screenshots of the Thomson ISI Web of Knowledge system, taken from [15], show the CrossSearch facility (covering 9,000+ international journals, 100,000+ meetings, symposia and reports, and 11.3 million patented inventions), the query entry screen, and the ability to filter results by specific database, which is especially helpful in identifying particular information, such as patent data, within the results list.

7 Other issues

Improve scalability by replication

One of the main problems with centralised information retrieval is maintaining excessive workloads and a large index of documents. There is also the problem that the centralised system has a single point of failure, so if the main connection goes down, everything goes down. Within distributed computing there is a process called replication, which distributes the excessive workload of a single server onto replicated servers located in different areas (closer to groups of users). This method can be applied to distributed information retrieval. Figure 7.1 is an example of how replication of information may work in a distributed information retrieval environment [16].

Figure 7.1: the hierarchical structure of a replication system, with the original collection at the top, replicas below it, and replica selection indexes serving clusters of users.

How it works

When a user issues a query from 'cluster 1', the query goes to replica selection index 1, which determines which of replica 1-1, replica 1 and the original collection has the best matches for the query. If replica 1-1 has the best match, the query can be sent to any of the levels linked to cluster 1, depending on which has the lowest workload. If replica 1 has the best match, the query is sent to either replica 1 or the original collection.

Standards and Protocols

Having standards and protocols for resources and for the distributed information environment is important for the selection of resources and for presenting users with information from heterogeneous networks.

Dublin Core

Metadata is data about data: information which is known about a specific document, image, video or audio file, enabling direct access to it. With effective use of metadata it is possible to uniquely identify a piece of information which might be specific to a user request, or to generalise over a group of information; it depends on how much metadata you use. The Dublin Core metadata standard is a simple yet effective element set for describing a wide range of networked resources. The Dublin Core standard comprises fifteen elements, the semantics of which have been established through consensus by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, the museum community, and other related fields of scholarship [15]. The fifteen elements are: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights. For this report the elements might be filled in as follows:

Title: Retrieval of information from DIR
Creator: Ananth Anandhakrishnan
Subject: Distributed Information Retrieval
Description: Resource selection, ranking, merging
Publisher: Dave Inman
Contributor: Luo Si, Jie Lu, and Jamie Callan
Date: 24/11/2004
Type: report
Format: Word document
Identifier: G2
Language: English
Relation: none

How metadata is embedded into HTML

In HTML there is a tag called META, which is used to describe properties of a document such as its creator or the other Dublin Core elements listed above. Dublin Core metadata is indicated by using the DC prefix before specifying the element. Example below:

<html>
<head>
<title>Distributed Information Retrieval</title>
<link rel = "schema.DC" href = "http://purl.org/DC/elements/1.0/">
<meta name = "DC.Title" content = "Retrieval of information from DIR">
<meta name = "DC.Creator" content = "Ananth Anandhakrishnan">
<meta name = "DC.Type" content = "report">
<meta name = "DC.Date" content = "24/11/2004">
<meta name = "DC.Format" content = "text/html">
<meta name = "DC.Language" content = "en">
</head>
<body>
Distributed Information Retrieval
Abstract
Intro
Body
Conclusion
</body>
</html>

How metadata is embedded into RDF/XML

RDF (Resource Description Framework) is a model for machine-understandable metadata, used to provide standard descriptions of web resources. It is expressed in XML, and it allows metadata schemes to be read by humans as well as processed by machines.
For example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://myweb.lsbu.ac.uk/~dave/">
<dc:creator>Ananth Anandhakrishnan</dc:creator>
<dc:title>Distributed Information Retrieval</dc:title>
<dc:description>How information retrieval from distributed databases works</dc:description>
<dc:date>2004-10-24</dc:date>
</rdf:Description>
</rdf:RDF>

Enforcing standards in metadata will be of great assistance to resource selection in cooperative environments.

Z39.50 search protocol

Z39.50 is an ISO standard network protocol for information retrieval across different networks of computers. It specifies formats and procedures governing the exchange of messages between a client and a server, where the client has access to a large amount of heterogeneous information residing on very different networks of computers. Z39.50 recognises that information retrieval consists of two primary components, selection of information based upon some criteria and retrieval of that information, and it provides a common language for both activities. Z39.50 standardises the manner in which the client and the server communicate and interoperate even when there are differences between computer systems, search engines, and databases. Interoperability is achieved through standardisation of:
- Codifying mechanics - a standard way of encoding the data for transmission over the network; and
- Content semantics - a standard data model with shared semantic knowledge for specific communities, to allow interoperable searching and retrieval within each of these domains [16].

This protocol is used in federated search engines such as WebFeat to allow the heterogeneous networks to combine and provide a single view to the user.

8 Other examples of distributed information retrieval

Emerge

Emerge is software built in Java with an XML-based translation engine which can perform metadata mapping and query translation. It attempts to solve the problem of retrieving heterogeneous information residing on heterogeneous networks using distributed information retrieval. Emerge focuses on the retrieval of scientific data sets, which are much more complex than the "document-like" data found on the web. There are many scientific data formats, many of them non-standard [17]. It would be impractical to store these data sets in a centralised system because of scalability issues; it would not be able to provide an effective service. Distributed information retrieval depends on standards and search protocols to make information retrieval more interoperable. Emerge uses the Dublin Core metadata standard so the data sets can be uniquely identified and generalised, and it uses the Z39.50 protocol to allow searching over different networks.

SETI@Home

SETI@Home (Search for Extra-Terrestrial Intelligence) is a screensaver program developed at Berkeley which uses the CPU power of the machines of users who choose to download the program, helping the Berkeley team in the search for extraterrestrial life [18]. Data gathered from the Arecibo radio telescope in Puerto Rico is stored on disks and tapes labelled with the information required to uniquely identify it. When the data needs to be processed it is loaded onto the Berkeley server, which distributes packets of the data to client machines all around the world for processing. Once the data is processed, the client machines send the results back to the server, which does some analysis to see if there are any possible hits.
Figure 8.1: SETI@HOME screenshot, showing the user information area used to keep track of the data being processed.

The server is able to know which machine is processing which information by keeping an index of the data packets sent out. In the screenshot above you can see a user information area, which holds information about the user, and other information boxes stating which data is being processed.

Harvest

Harvest is a search system which is distributed over different computers to collect information and make it searchable through a web interface. Harvest can collect information on the internet and on intranets using http, ftp and nntp, as well as from local files such as data on hard disk, CD-ROM and file servers. The current list of supported formats, in addition to HTML, includes TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft Word/Excel, SGML, C sources and many more [19].

How it works

Harvest consists of three subsystems. The gatherer subsystem collects indexing information from resources available at the provider sites, such as FTP and HTTP servers. The broker subsystem retrieves indexing information from one or more gatherers, removes any duplication, incrementally indexes the collected information, and provides a query interface to it.

Figure 8.2: the Harvest structure. Gatherers collect the information available at the provider sites (providers 1 to 3); a broker collects, stores and manages the information for clients to query.

Harvest gatherers and brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF). Gatherers generate content summaries for individual objects in SOIF, and serve these summaries to brokers that wish to collect and index them. SOIF provides a means of bracketing collections of summary objects, allowing Harvest brokers to retrieve SOIF content summaries from a gatherer for many objects in a single, efficient compressed stream. Harvest brokers provide support for querying SOIF data using structured attribute-value queries and many other types of queries [19].
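To give a feel for the attribute-value format, below is a rough illustration of what a SOIF content summary for a single object might look like. The attribute names, values and byte counts shown here are invented for the example and are not copied from the Harvest manual.

@FILE { http://provider1.example.org/reports/dir.html
Title{33}:	Distributed Information Retrieval
Author{22}:	Ananth Anandhakrishnan
Type{4}:	HTML
Keywords{26}:	resource selection merging
}

Each attribute carries the byte count of its value, which is what allows a broker to read many summaries from a gatherer as one continuous stream.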
Conclusion

The aim of distributed information retrieval is to provide the user with a uniform interface to search for and retrieve a ranked list of documents from heterogeneous databases which contain heterogeneous information relevant to the user's query. This can be achieved, firstly, by applying a search protocol which enables information retrieval from heterogeneous networks; this gives the information more interoperability. To improve the selection of documents, standards have to be applied to metadata (data about data), such as the Dublin Core, which defines 15 elements used to describe and identify documents.

Distributed information retrieval depends on its sub-components for it to work effectively. Resource selection is concerned with selecting the set of databases which are relevant to a query and selecting the highest ranked documents from them. The resource selection algorithm Relevant Document Distribution Estimation (ReDDE) is seen as the most effective for doing this. The Semi Supervised Learning (SSL) method involves producing a single ranked list of documents from distributed databases that closely approximates the ranked list that would be produced if all of the documents from all of the databases were stored in a centralised global database. This has proven to be the most effective way to tackle the problem of merging ranked lists from distributed databases.

The SSL method uses a centralised sample database which contains resource descriptions and some of the documents obtained by query-based sampling. The aim of SSL is to map database-specific scores onto database-independent scores; this is done by regression. Federated search engines and current P2P searching technologies use this model to retrieve documents from heterogeneous databases. A federated search engine is a combination of multiple channels of information presented to the user through a single user interface.

Distributed information retrieval systems can be used in many fields such as space science. The SETI@HOME program is an example of distributed data mining, because the data packets processed by client machines are used to search for extraterrestrial life in the universe. Other examples of distributed information retrieval systems are Emerge, which searches for scientific information of different formats using an XML-based translation engine that can perform metadata mapping and query translation, and Harvest, a search system which is distributed over different computers to collect information and make it searchable through a web interface. Harvest consists of three subsystems: provider, gatherer and broker. Information is sent between them using SOIF, which carries the resource descriptions of the contents being exchanged. Emerge and Harvest are similar applications; the difference is the way the metadata is handled, and the Emerge method is probably the better one because it uses XML.

References

[1] Tom Walker, "The state of distributed search", NewsForge, August 2004. http://internet.newsforge.com/article.pl?sid=04/08/04/1345206
A look at a new distributed searching program called Grub.

[2] No author, "Peer-2-Peer", Wikipedia. http://en.wikipedia.org/wiki/Freenet
Computer encyclopaedia; provided information about different generations of P2P technologies.

[3] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval". http://www.sims.berkeley.edu/~hearst/irbook/glossary.html
Provides an online glossary of information retrieval terms, and some resources from the book.

[4] Susan Feldman, "NLP Meets the Jabberwocky: Natural Language Processing in Information Retrieval", Information Today, Inc., May 1999.
An easy to read introduction to NLP and information retrieval.

[5] Andrew S. Tanenbaum and Maarten van Steen, "Distributed Systems", Prentice Hall.
Core textbook for distributed computer systems.

[6] No author, "Distributed Information Retrieval of Scientific Data", Emerge. http://dlt.ncsa.uiuc.edu/archive/emerge/distributed_search.html
Emerge is a software solution attempting to overcome some of the problems in distributed information retrieval, namely the interoperability of information.

[7] Luo Si and Jamie Callan, "A Semisupervised Learning Method to Merge Search Engine Results", Carnegie Mellon University, January 2003.
Detailed paper about a new method for result merging in distributed information retrieval systems: Semi-Supervised Learning.

[8] Luo Si and Jamie Callan, "The Effect of Database Size Distribution on Resource Selection Algorithms", Carnegie Mellon University.
Investigation of various resource selection, ranking and merging algorithms.

[9] Luo Si, "Federated Search of Text Search Engines in Uncooperative Environments", Carnegie Mellon University.
PowerPoint slides showing how the ReDDE algorithm works in a distributed searching system.
[10] Luo Si, Jie Lu, and Jamie Callan, "Distributed Information Retrieval With Skewed Database Size Distributions", Carnegie Mellon University.
In-depth look at how the ReDDE algorithm works.

[11] Luo Si and Jamie Callan, "Relevant Document Distribution Estimation Method for Resource Selection", Carnegie Mellon University.

[12] No author, "The Lemur toolkit". http://www-2.cs.cmu.edu/~lemur/
The Lemur Toolkit is designed to facilitate research in language modelling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification.

[13] No author, WebFeat Prism. http://www.webfeat.org/products/prism.htm
Developers of federated search engines for commercial use.

[14] No author, Thomson ISI Web of Knowledge. http://www.isinet.com/
"A fully integrated research platform... empowering researchers and accelerating discovery."

[15] No author, PowerPoint presentation of the Thomson ISI Web of Knowledge. www.deflink.dk/upload/doc_filer/doc_alle/1152_Derwent_2003.ppt

[16] Kathryn McKinley, "Using replication in distributed information retrieval systems". http://www-ali.cs.umass.edu/Darch/

[17] Diane Hillmann, "Using Dublin Core", Dublin Core Metadata Initiative.
"This document is intended as an entry point for users of Dublin Core. For non-specialists, it will assist them in creating simple descriptive records for information resources (for example, electronic documents). Specialists may find the document a useful point of reference to the documentation of Dublin Core, as it changes and grows."

[18] Sonya Finnigan and Nigel Ward, "Z39.50 Made Simple". http://www.dstc.edu.au/Research/Projects/Z3950/zsimple.htm
Information about how the search protocol works.

[19] SETI@HOME, University of California, Berkeley. http://setiathome.ssl.berkeley.edu/
SETI@home is a scientific experiment that will harness the power of hundreds of thousands of Internet-connected computers in the Search for Extra-Terrestrial Intelligence (SETI).

[20] Darren R. Hardy, Michael F. Schwartz, Duane Wessels, and Kang-Jin Lee, "Harvest user manual", October 2002. http://harvest.sourceforge.net/harvest/doc/html/manual-1.html
The site provides information on how the software works and how to install and use it.