
Information Retrieval from Distributed Databases
By Ananth Anandhakrishnan Autumn 2004
Abstract
Although distributed information retrieval improves certain aspects of information retrieval, such as scalability and the removal of a single point of failure, there are still problems in the selection and merging components for retrieving relevant resources. This paper looks at various methods and algorithms used to improve resource selection and merging in distributed information retrieval systems.
The paper begins with a basic introduction to information retrieval and distributed computing. It then moves into the field of distributed information retrieval, examining the sub-components of a distributed information retrieval system; these are the same as in any information retrieval system, but they operate in different ways because of the environments they exist in. The main focus will be on the resource selection and result merging components.
The standards and protocols used in information retrieval are also covered; these are important sets of rules which help the interoperability of information of different data types coming from different types of network systems.
Examples of applications that use distributed information retrieval, such as Emerge and Harvest, are reviewed in this paper; they have roughly the same information retrieval function but use different standards to obtain information. SETI@HOME is another distributed application, used for distributed data mining.
Contents
Discussion Notes
Questions for Discussion
Introduction
Introduction to Distributed systems
Centralised Information Retrieval
Distributed Information Retrieval
The Sub-components
    Resource selection and ranking methods
    Merging ranked lists
    Semi Supervised Learning model for result merging
    The selection algorithm
    The Lemur toolkit
Federated Search Engines
Other issues
    Improve scalability by replication
    Standards and Protocols
Other examples of distributed information retrieval
    Emerge
    SETI@Home
    Harvest
Conclusion
References
Discussion Notes
This section gives the reader an introduction to distributed information retrieval, to provide some background knowledge of the field before reading further into this report.
An important feature of distributed information retrieval is the indexing of documents. Keeping an up-to-date index of the web can benefit everyone in obtaining the most relevant and latest documents. A distributed approach to this has been implemented, called Grub: a client-based program which, when downloaded, uses the CPU power and internet bandwidth of the client machine to do the web crawling. The article below describes the current state of distributed searching on the web [1].
The state of distributed search
Grub aims to take participants' unused bandwidth and CPU cycles to crawl more than 10 billion Web
documents and maintain an up-to-date index of the Web.
Distributed search, which is still very much an emerging field, works in a way that will be familiar to
people who donate processor time to projects like distributed.net and SETI@Home. By downloading
and running a screensaver, you contribute some of your bandwidth and CPU cycles to spider and
analyze Web pages. When you're not using your computer, the Grub client is activated. Your computer
crawls the Web, following hyperlinks from document to document. The more people who do this, the
bigger the index gets.
Web crawling for updated indexes
Andre Stechert, Grub's director of technology, says, "distributed search makes it possible to deliver
results that are fresher, more comprehensive, and more relevant. It achieves this by making it feasible
to maintain up-to-date crawl data and metadata in the 10+-billion document range and to use
heavyweight page analysis algorithms that look deeper than keywords."
Incorporating peer-to-peer technology has been a goal of Web searchers since the days of Napster, but
until now has been mainly the domain of academic research papers, with many saying that it's an
unachievable pipe dream. If you're Google, though, with a multi-million dollar cluster of Linux servers
doing your spidering, it's more like a nightmare -- and the ever-decreasing cost of domestic bandwidth
means that, dream or nightmare, it's starting to look more and more possible. We asked Google for a
reaction to Grub and P2P search, but the company is currently subject to the SEC "quiet period"
before its IPO.
Grub doesn't directly provide a search engine -- rather, Grub clients crawl the Web to provide an index
that can be used either to improve an existing search engine or create an entirely new one. Today only
WiseNut, which is also part of corporate parent LookSmart, is using crawl results from Grub to
improve its index, but Grub makes its results available to anyone who wants to use them, with an
XML-based query API.
A few test searches using WiseNut show that while the concept may be good, the execution is
incomplete. For any search term you care to try, WiseNut returns far fewer results than the big search
engines, in part because the Grub index is currently small compared to most search engines' indexes.
Andre Stechert believes that community Web crawling projects could help. Distributed crawl indexes,
he says, "are today where [Internet] directories were before the advent of open directories like dmoz.
They are proprietary and, for the most part, redundant. Creating and maintaining them incurs a large
cost for the companies owning these infrastructures and for ISPs and Webmasters who are hosting the
sites being crawled.... We think it's important to provide a community-based crawl database because it's
a distraction from the main problem at hand: providing relevant, speedy answers to user queries. The
challenge of bringing up a production-scale, production-quality crawling infrastructure is significant
enough to chew up lots of time from any team. By lowering the barrier to entry for search engine
research, we hope to focus more of the research effort where value is being created for the world at
large." That's what distributed search is all about: lowering barriers to entry. Just as open directories
helped Google and others to challenge Yahoo's seemingly unassailable position (just look at the list of
sites using data from DMOZ, which includes many well-known search engines) distributed crawling
could do the same for Google's competitors. Imagine a world where any new player in the search
market can focus solely on providing an excellent user interface, without having to worry about the
logistics of crawling the enormous World Wide Web. That is the future that projects like Grub
promise. As Stechert says, "distributed computing is an enabling platform for the future of search."
Grub currently spiders about 100 million URLs per day.
P2P searching
Peer-to-peer (P2P) is a model in which communication responsibilities are taken away from a dedicated server and placed on the client machines themselves. The first well-known system to appear from this technology was Napster, online file-sharing software which allowed users to swap the latest chart music in MP3 format. Napster ran as an illegal file-sharing service up until 2001; it now operates as a commercial service, following the lawsuit filed against it at the end of 2000. Many other P2P file-sharing programs were created, such as Kazaa and WinMX, which later took over from Napster.
Napster relied on a centralised list in which all results for a user's music query were merged and presented, so there was still a need for a centralised server.
Gnutella is another P2P technology, based on a decentralised file list, where the service is not bound to a centralised server. However, this technology did not last very long: there were huge problems with the scalability of the software, which made the service unreliable [2].
Google is a centralised search engine because it is controlled by a central organisation which controls the indexing and presentation of the information. P2P searching allows users to put information into the search index for other users; this makes the method more dynamic and up to date, as with Grub, which builds the most up-to-date web indexes.
Grub is still new, and the index it has built up is very small compared to the likes of Google. If it gains wide support it could prove to be a very useful tool for obtaining the most up-to-date indexes. P2P has proven to be a great tool for file sharing between communities, but for a wider service or commercial purpose there is still a lot of work to be done in building effective searching and merging techniques.
Questions for Discussion
The questions raised in this report are for the reader's benefit, to give a clear picture of what is involved in distributed information retrieval and how it works. The questions were also used to guide my own learning about certain aspects of distributed information retrieval, because the whole area was very new to me. Some of these questions only arose once I had gained more in-depth knowledge of how distributed information retrieval works.
Distributed computer systems: what are they and how do they work? Is the WWW an example of a distributed system?
Centralised information retrieval: why might it be desirable to keep a centralised repository of all the resources? What are the drawbacks? How can distributed information retrieval help?
Distributed information retrieval: how can concepts in distributed computing help with distributed information retrieval? How does it work?
Components of a distributed information retrieval system: what are the main components of the system? What problems are there currently with these components? How do resource selection algorithms work? What is the aim of the Semi Supervised Learning model? Are there any applications that use this?
Improving distributed information retrieval: how can applying standards and protocols help the retrieval of information from distributed heterogeneous sites?
1 Introduction
In this report I begin by introducing the reader to the basics of distributed computer systems, to give a better understanding of how a distributed information retrieval system works and why there is a need for it. The main body of the report is based around some of the important sub-components of distributed information retrieval, specifically focusing on the selection of resources from distributed databases and the merging of the ranked lists of resources into a single list. This involves examining mathematical algorithms using statistics and probability. There is no doubt that distributed information retrieval solves many of the problems faced by centralised information retrieval systems. However, centralised systems have their own benefits, and these benefits should be able to aid distributed information retrieval in some way. An investigation into this is also carried out, looking at current applications available on the web which are built on distributed information retrieval technology.
The following section describes what information retrieval is and how it works, to give the reader some background knowledge of the field as a whole.
What is information retrieval?
Information retrieval is 'the retrieval of information from a collection of written documents. The retrieved documents aim at satisfying a user's information need usually expressed in natural language' [3]. This information, or set of data, can be anything from documents to media files. Searching for specific information can be made more effective by basing search results on the embedded information of the file. This embedded information is called metadata – data about data.
How does information retrieval work?
This section is based on Susan Feldman's paper "NLP Meets the Jabberwocky: Natural Language Processing in Information Retrieval" [4], and shows what stages are involved in information retrieval.
An information retrieval system consists of three separate files:
- A list of all documents, which includes the indexing information or resource description – information about the document (author, title, date).
- A dictionary, which is a collection of all the unique words in the database.
- An inversion list, which stores all the locations where each word in the dictionary occurs.
Document Processing – documents stored in an information retrieval system have unique identifiers attached, so that a document can be grouped with others of the same kind and also be uniquely identified.
Query Processing – retrieval systems use one of two ways, or a combination of the two, to process the query. One possibility is the use of Boolean operators (AND, OR, NOT) to match the query against documents. The other is to use probability and statistics, matching the words in the query against the documents and counting how many times they occur.
Query Matching – when the system searches with Boolean operators, it returns only the documents which satisfy the query condition. For example, if the user enters two words, "rebellion" and "Angola", and requires them to be within 15 words of each other, the system will only pick out the documents which obey this condition; a document which contains the two words spaced by 16 words will not be retrieved. With the probability and statistics approach, the system brings back all the matches for the query, ordered with the highest-occurring documents first.
Ranking and Sorting – the system returns documents sorted in a specified order, usually with the most relevant documents at the top.
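To make the three files and the two styles of query processing concrete, below is a minimal Python sketch; the example documents, identifiers and the simple occurrence-count scoring are invented for illustration rather than taken from any particular retrieval system.

from collections import defaultdict

# Toy document store: identifier -> document text (invented examples).
documents = {
    "doc1": "rebellion in Angola and the history of the rebellion",
    "doc2": "natural language processing in information retrieval",
    "doc3": "Angola travel guide",
}

# Dictionary and inversion list: every unique word maps to the documents
# (and positions) where it occurs.
inversion_list = defaultdict(list)
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        inversion_list[word].append((doc_id, position))

def boolean_and(query):
    """Boolean matching: return only documents containing every query word."""
    sets = [set(doc for doc, _ in inversion_list.get(w, [])) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

def ranked(query):
    """Statistical matching: rank documents by how often the query words occur."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id, _ in inversion_list.get(word, []):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(boolean_and("rebellion Angola"))   # {'doc1'}
print(ranked("rebellion Angola"))        # [('doc1', 3), ('doc3', 1)]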
Figure 1.1 shows the basic structure of an information retrieval system, from the point where the user enters a query through the user interface.
Figure 1.1: how information retrieval works – a query request made through the user interface is translated and run against the centralised database, the ranking component produces a ranked list of page titles, and the result is returned as generated HTML pages. Framework of the diagram borrowed from Tanenbaum and Van Steen, "Distributed Systems" [5].
2 Introduction to Distributed systems
Distributed systems have now been used for more than 40 years. In 1961 IBM introduced a Compatible
Time Sharing System which allowed separate terminals in different offices to access the same hardware.
In 1972 the United States developed ARPANET to allow computers in remote sites to communicate; from
this the internet was born. The internet gave rise to the World Wide Web, which is now used to host a vast number of distributed services [5].
What is a distributed system?
‘A Distributed system is a collection of independent computers that appears to its users as a single coherent
system.’ [5]
The statement about making the system appear to the user as a single coherent system is not true of all distributed systems. The World Wide Web is an example of this: when the user does a search on the web, the returned results are obtained from different sites from all around the world. However, the user is made aware of where the results are coming from, because the returned list comes back with URL links.
How is it built?
A distributed system in a homogeneous environment is managed by a distributed operating system, which is able to manage the sharing of resources between the computers and hide the fact that they are a collection of computers.
In a heterogeneous environment the system is managed by a network operating system, but this is unable to provide the single view that a distributed operating system provides. The single view is instead achieved by software: middleware is an additional layer of software between the distributed application and the operating system, used to hide the heterogeneity of the collection of underlying platforms [5].
Figure 2.1: general structure of a distributed system. A distributed application runs on top of a software level (middleware) that provides transparency for the system, hiding the differences between the underlying networked servers (Server A to Server D) [5].
As shown in Figure 2.1, servers are networked together on a Local Area Network (LAN) or a Wide Area Network (WAN). The servers can be set up in many ways; one possibility is a replicated system, such as a cluster, where every node holds the same information, one node is the primary and the others are secondary. If the primary server goes down, one of the secondary servers takes over as primary until the problem with the original primary server has been fixed. This is one type of cluster; there is also clustered computing, where servers share resources or workload between them. The Google search engine is an example of clustered server computing; the service runs on a cluster of over 4,000 Linux servers which do the web crawling [1].
What are the goals of distributed systems?
- Allow users to access remote resources and to share them with other users in a controlled way.
- Hide the fact that the system is a distributed system; this is called making the system transparent. There are different types of transparency, for example access transparency, which hides differences in data representation and in how a resource is accessed.
- The system should be able to adapt to changes, and be able to provide the same level of performance if the system becomes larger.
- The system should be fault tolerant: able to maintain the same level of performance after a partial failure of the system.
What types of distributed systems are there?
The client-server model is where a server hosts services, such as file and printing services, and the clients, which are normal workstations, request those services. There are two types of this model: the vertical distribution model, used by organisations located at a single site with multiple servers, and the horizontal distribution model, used by organisations whose servers and users are located in different geographic areas.
Database systems like ORACLE have distributed database management functionality. Database tables can be split into fragments so that users of the database system only see what is relevant to them. For example, if a company only needs to know about its customers in England, the database holding information about customers all around the UK can be fragmented so that only the customers from England appear on that company's system.
Air traffic control systems have their services distributed geographically. Each service is supported with redundancy, so if there is a partial failure of the service, there is a backup server to take its place.
3 Centralised Information Retrieval
There is a reason why it is desirable to keep a centralised repository of information:
- Centralised systems are usually in control of the resources in their databases, and so can build accurate descriptions of those resources. As a result, the selection and merging of information in a centralised system produce more relevant results, because the operations take place at one site, which uses a single method of selection, ranking and merging.
However, keeping a centralised repository of a huge set of resources is not ideal when millions of users access the same search engine to retrieve information. This increases network traffic, reducing the effectiveness of the service. With information ever growing, and different types of information (file formats, various sizes) emerging, it is not scalable to keep everything on a centralised system. If the centralised site were to crash, the whole system would be unavailable to its users, so having a single point of failure must be avoided.
Centralised retrieval systems that hold information specific to a particular field of study, in formats unrecognisable to other systems, can only be hosted on a server capable of handling that information.
Distributed information retrieval solves some of these problems:
- Since the search is distributed, more than one machine can process the retrieval of information. This improves scalability, reducing server load and network traffic by distributing the service to groups of users in different geographic areas (replication systems).
- There is no single point of failure: if one server goes down, there is always another server to search from.
- There are now standards and protocols defined to enable heterogeneous networks, holding heterogeneous information, to work together and provide a uniform view for the user. These standards and protocols are discussed later.
The selection and merging of information from distributed sites, however, is an area where a distributed information retrieval system finds it difficult to match the performance of a centralised information retrieval system. We will look at how the problem of selecting and merging resources is tackled in distributed environments later in this report.
4 Distributed Information Retrieval
The section above describes the differences between distributed information retrieval and centralised information retrieval. I have not yet given a definition of distributed information retrieval, but from reading the introduction its purpose should be clear.
'The goal of distributed information retrieval is to enable the identification and retrieval of data sets relevant to a general description or query, wherever those data sets may be located or hosted' [6].
There are different ways to think of distributed information retrieval systems. The main point is that information is gathered from distributed sites, but where are these sites? Are they within an organisation, or spread across a collection of independent resource providers?
I think we can break distributed information retrieval systems into two types of environment: cooperative and uncooperative.
Cooperative environments
There are different types of cooperative environment; Figure 4.1 shows an environment where the distributed system belongs to a single organisation.
Figure 4.1: basic structure of a distributed information system – the user issues a query to an application, which in turn queries the distributed databases.
Library as a cooperative environment
Say we have a library organisation with sites all around London, where each library has a collection of different internet-accessible resources (e-journals, e-books) in different categories (literature, science, computing, geography, history, sports), and each library may have resources of the same type.
Each library maintains its own database holding the resources, a unique identifier for each resource, detailed descriptions of the resources, and statistical information about the resource content.
The library organisation has an online search engine which enables users to search for any online resource, in any category, across all the libraries' databases.
Example of query
User requests a list of e-journals and e-books related to ‘natural language and information retrieval’ in
the category of computing.
The search application will pass this request to all the individual library databases.
These databases return lists of unique identifiers of the relevant resources, which are merged together in the search engine to present a single ranked list to the user. If the user finds a resource they want to view, the resource identifier is used to retrieve it.
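As a rough illustration of the library example (not an implementation of any actual library system), the Python sketch below broadcasts a query to each library's own search function and merges the returned identifier lists into one ranked list; the library names, resource identifiers and scoring are all invented.

# Each library exposes a search function returning (resource identifier, score)
# pairs; the central search application broadcasts the query and merges the answers.
def search_library(library, query, category):
    """Stand-in for one library database's own search; returns (id, score) pairs."""
    results = []
    for resource_id, record in library["resources"].items():
        if record["category"] != category:
            continue
        score = sum(record["text"].lower().split().count(w) for w in query.lower().split())
        if score > 0:
            results.append((resource_id, score))
    return results

libraries = [
    {"name": "Camden", "resources": {
        "CAM-001": {"category": "computing",
                    "text": "natural language techniques for information retrieval"}}},
    {"name": "Brixton", "resources": {
        "BRX-042": {"category": "computing",
                    "text": "information retrieval from distributed databases"},
        "BRX-100": {"category": "history", "text": "the history of London libraries"}}},
]

def federated_search(query, category):
    merged = []
    for library in libraries:                        # broadcast the query
        merged.extend(search_library(library, query, category))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)  # single ranked list

print(federated_search("natural language and information retrieval", "computing"))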
The other type of cooperative environment is where the distributed information system is linked to various resource providers who also provide information about their documents to make searching more accurate. Each provider may hold the same types of documents; this area is covered in more detail in the next section.
An uncooperative environment is one where resource providers do not provide additional descriptions of the documents residing in their databases; this makes retrieving accurate results difficult. A method used to overcome this problem is discussed in the next section.
5 The Sub-components
Note: This section is broken into two parts. The first part breaks down the technical details of the components of distributed information retrieval systems, which are the same as in any other information retrieval system but work in a different way. The second part focuses on the resource selection algorithm and shows how it works mathematically. By reading the first part you can get a good understanding of how resource selection and merging work in a distributed environment. Readers who wish to learn more about how the algorithms work can follow this section all the way through; otherwise, swiftly press page down when you start seeing maths.
There are four parts of distributed information retrieval that are important for obtaining the best possible results: resource description, resource selection, query translation and result merging [7].
Resource Description – for each database it is essential to build a resource description, to give more knowledge of what type of information the database holds. START is a protocol for acquiring and describing database contents from the source provider. Contents are described using vocabulary and term frequency information. The resource description also contains a list of the words in the documents, the number of occurrences of each word and a unique identifier for each document; this is used in the resource selection stage. Other information that is provided includes the average document length, inverse document frequency and detailed resource descriptions.
The START protocol can be used in cooperative environments, but it cannot be used in uncooperative environments.
To obtain resource descriptions from uncooperative environments, a method called query-based sampling is used, which involves sending queries to each database and building its resource description from the documents that come back. This has proven to be an effective method of obtaining resource descriptions.
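The sketch below illustrates the idea of query-based sampling under some simplifying assumptions: search_fn stands in for the database's public search interface, the probe terms and corpus are invented, and real implementations use far more probes and more careful term selection.

import random
from collections import Counter

def query_based_sampling(search_fn, seed_terms, probes=4, docs_per_probe=4):
    """Build an approximate resource description for an uncooperative database
    by probing it with single-term queries and examining what comes back.
    search_fn(term) stands in for the database's public search interface."""
    vocabulary = Counter()          # term -> observed frequency
    sampled_docs = []
    terms = list(seed_terms)
    random.seed(0)
    for _ in range(probes):
        term = random.choice(terms)
        for doc in search_fn(term)[:docs_per_probe]:
            sampled_docs.append(doc)
            words = doc.lower().split()
            vocabulary.update(words)
            terms.extend(words)     # later probes use words learned so far
    return vocabulary, sampled_docs

# Toy "uncooperative" database and its search interface (invented contents).
corpus = ["distributed information retrieval systems",
          "language models for information retrieval",
          "resource selection in federated search"]
search = lambda term: [d for d in corpus if term in d.lower().split()]

description, sample = query_based_sampling(search, seed_terms=["retrieval"])
print(description.most_common(5))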
Resource Selection – identifying the databases that contain documents relevant to a query, based on the resource descriptions. This process is carried out by an algorithm (a set of instructions) which uses probability and statistics. These probabilities and statistics are used to select or rank the databases by their documents' relevance to a query, and the results are passed to the merging tool.
Query Translation – representing the query entered by a user so that it maps to the relevant resources in the selected databases.
Result Merging – after results are returned from the selected databases they are compiled into a single list, similar to how it would be presented by a centralised information retrieval system, and presented to the user.
Resource selection and ranking methods
Resource selection involves identifying a small set of databases from the distributed information retrieval
system that contains documents relevant to a query requested by a user. The resource selection component
is an important part of the distributed information retrieval system model; it resides at the core of the
system where it has access to all the databases and up-to-date records of the information in each database.
"If the distribution of relevant documents across the different databases were known, databases could be ranked by the number of relevant documents they contain; such a ranking is called a relevance based ranking (RBR). Resource ranking algorithms are evaluated by how closely they approximate a relevance-based ranking" [8].
There are many different types of algorithms used in resource selection and ranking:
CORI “models each document database as a very large document, and ranks each database using a
modified version of a well-known document ranking algorithm” [9]. This algorithm uses resource
descriptions that consist of vocabulary, term frequency and corpus information.
KL divergence algorithm is based on language modelling. “In the Language Model retrieval algorithm, each
document is considered as a sample of text generated from a specific language. The language model for each
document is estimated based on the word frequency patterns observed in the text. The relevance of a
document to a query is estimated by how likely the query is to be generated by the language model that
generated the document” [11].
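A small sketch of the query-likelihood idea behind this family of algorithms is shown below. The mixing weight used for smoothing is an arbitrary value chosen for brevity (real systems use tuned Dirichlet or Jelinek-Mercer smoothing), and the documents are invented.

import math
from collections import Counter

def query_likelihood(query, document, collection, mix=0.5):
    """Score a document by how likely its (smoothed) language model is to
    generate the query. `mix` is an arbitrary mixing weight for illustration."""
    doc_counts, coll_counts = Counter(document), Counter(collection)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / len(document)
        p_coll = coll_counts[word] / len(collection)
        p = (1 - mix) * p_doc + mix * p_coll   # smoothing avoids zero probabilities
        score += math.log(p) if p > 0 else float("-inf")
    return score

doc_a = "language models estimate word frequency patterns in text".split()
doc_b = "the rank of a document in the complete database".split()
collection = doc_a + doc_b
query = "language frequency".split()
print(query_likelihood(query, doc_a, collection))   # higher (less negative) score
print(query_likelihood(query, doc_b, collection))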
Relevant Document Distribution Estimation (ReDDE) algorithm “estimates the distribution of relevant
documents across the databases for each user query and ranks databases according to this distribution of
relevant documents” [8].
The ReDDE algorithm uses constants to model the probability of relevance given a document, and to determine how much of a centralized complete database ranking should be used to model the relevance.
Evaluation of these algorithms against different test databases of various sizes shows that the CORI
algorithm is not suited to large distributed environments. The KL-divergence algorithm performs reasonably well, but its performance is inconsistent when more than 10 databases are involved. It was found that the ReDDE algorithm is more effective in these environments, showing consistent results throughout the tests.
Merging ranked lists
After resource ranking and selection, the query is forwarded to the selected databases, and then a result-merging algorithm merges the ranked lists returned from the different databases into a single, final ranked list. Problems arise in result merging because the databases do not all use the same retrieval algorithm to produce their lists of returned documents, which means they produce different statistics.
Applying a standard so that every database uses one algorithm would solve this problem, but it would not be effective. Result merging with current solutions, such as raw-score merging (ordering results by the raw document scores) and round-robin merging (taking documents from each database in turn, without taking relevance into account), occurs at the client-end application, so it is isolated from the rest of the distributed information system. The information available for normalizing database-specific document scores is very limited, and so these solutions are based on assumptions.
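The two naive strategies can be illustrated with a short sketch; the ranked lists and scores below are invented, and the point is only to show why neither strategy copes with incomparable score scales or relevance.

# Two naive merging strategies, on invented ranked lists. Each database returns
# (document id, database-specific score) pairs.
list_a = [("a1", 0.92), ("a2", 0.80), ("a3", 0.41)]
list_b = [("b1", 14.2), ("b2", 9.7)]          # a different, incomparable score scale

def raw_score_merge(*ranked_lists):
    """Merge by the raw scores, ignoring that each database scores differently."""
    merged = [pair for ranked in ranked_lists for pair in ranked]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

def round_robin_merge(*ranked_lists):
    """Take one document from each list in turn, ignoring relevance entirely."""
    merged, depth = [], max(len(r) for r in ranked_lists)
    for i in range(depth):
        for ranked in ranked_lists:
            if i < len(ranked):
                merged.append(ranked[i])
    return merged

print(raw_score_merge(list_a, list_b))    # list_b dominates purely because its scale is larger
print(round_robin_merge(list_a, list_b))  # interleaved regardless of actual relevance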
Semi Supervised Learning model for result merging
The Semi Supervised Learning (SSL) model involves producing a single ranked list of documents that
closely approximates the ranked list that would be produced if all of the documents from all of the
databases were stored in a single, global database [5]. This is achieved by running a centralised sample database in parallel to the distributed databases, as a reference database against which the merged results can be calibrated.
The centralised sample database
This database is built up using the query-based sampling method: the distributed databases are queried, and the documents they return are retrieved. Once the documents are retrieved they are scanned and resource descriptions are built from them; after this the documents are often discarded, because the resource selection stage does not need the full documents to build a ranked list. However, 'The documents obtained by sampling all of the available
databases can be combined into a single searchable database. This centralized sample database is a
representative sample of the single global database that would be created if all of the documents from all of
the databases were combined into a single database. The vocabulary and frequency information in the
centralized sample database would be expected to approximate the vocabulary and frequency patterns
across the complete set of available databases’ [6].
The SSL algorithm is used for merging documents from distributed databases. Merging is more effective when it happens where the selection of documents occurs, rather than in an isolated client application.
How distributed information retrieval works in more detail
- A user enters a query.
- The query is used to rank the collection of databases, from which a set of databases is selected.
- The query is then broadcast to all the selected databases, each of which produces a ranked list of matches with document identifiers and scores. The document identifiers and scores are passed to the merging algorithm.
- The query is also broadcast to the centralised sample database running in parallel, and its ranked list of document identifiers and scores is also fed into the merging algorithm. The ranked list provided by the central database influences how the resources from the distributed databases are merged.
The SSL algorithm specifically models result merging as a task of transforming sets of database-specific document scores into a single set of database-independent document scores, using the documents acquired by query-based sampling as training data.
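A rough sketch of this idea is given below. It is not the actual SSL implementation from the cited papers or the Lemur toolkit: it simply fits a least-squares line per database from the overlapping (database-specific score, centralized-sample score) pairs and applies that line to the rest of the database's scores; all document identifiers and scores are invented.

# Documents that were sampled into the centralized sample database act as
# training data: for each selected database we fit a linear mapping from its
# own scores to the centralized scores, then apply it to every document it returned.
def fit_linear(pairs):
    """Least-squares fit of y = a*x + b from (x, y) training pairs (needs >= 2 pairs)."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def ssl_merge(db_results, central_scores):
    """db_results: {db: [(doc_id, db_specific_score), ...]};
    central_scores: scores of the sampled documents in the centralized sample database."""
    merged = []
    for db, ranked in db_results.items():
        overlap = [(s, central_scores[d]) for d, s in ranked if d in central_scores]
        a, b = fit_linear(overlap)
        merged.extend((d, a * s + b) for d, s in ranked)   # database-independent scores
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

db_results = {
    "db1": [("d1", 0.90), ("d2", 0.70), ("d3", 0.55)],
    "db2": [("d4", 12.0), ("d5", 8.0), ("d6", 5.0)],
}
central_scores = {"d1": 0.80, "d2": 0.60, "d4": 0.75, "d6": 0.30}  # sampled documents only
print(ssl_merge(db_results, central_scores))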
Figure 5.1: a more detailed diagram of a distributed information retrieval system using the SSL method. The query is sent both to the databases chosen by the selection algorithm, each of which returns an individual ranked list with database-specific scores, and to the centralised sample database, which holds resource descriptions of documents obtained by querying all the databases and returns its own ranked list. Combining the document rankings turns the database-specific scores into database-independent scores, and the merged results are ranked by relevance.
The selection algorithm
This example is taken from papers by Luo Si and Jamie Callan who are members of the Information
Retrieval Group - which “studies a wide range of issues related to finding, organizing, analyzing, and
communicating information”. Most of the explanations have been taken out by the papers written by Luo Si
and Jamie Callan, I have added additional written to try and make the mathematics clearer.
[7] [8] [9] [10] [11].
The goal of resource selection is to select a small set of resources that contain a lot of relevant documents.
The Relevant Document Distribution Estimation ReDDE algorithm tries to estimate the distribution of
relevant documents among the available databases.
The number of documents in database C_j that are relevant to query q is estimated as:

Equation 1: estimate of the relevant documents in the jth database for a specific query

    \hat{R}(q, C_j) = \sum_{d_i \in C_j} P(rel \mid d_i) \, P(d_i \mid C_j) \, N_{C_j}

where:
N_{C_j} – the number of documents in database C_j
P(d_i | C_j) – the probability of selecting document d_i from the collection of documents in database C_j
P(rel | d_i) – the probability that document d_i is relevant

In cooperative environments the probability P(d_i | C_j) is simply 1/N_{C_j}. The same holds for uncooperative environments: even though there is no cooperation from the providers of the resource descriptions, it is possible to build resource descriptions by query-based sampling. As mentioned in the previous section, this is done by submitting queries to the database and examining the documents that are returned.

Substituting this probability into Equation 1, and computing the sum over the sampled documents rather than the whole database, gives:

Equation 2: estimate of the relevant documents in the jth database, given a sample of documents from the database

    \hat{R}(q, C_j) \approx \sum_{d_i \in C_{j\_samp}} P(rel \mid d_i) \, \frac{N_{C_j}}{N_{C_{j\_samp}}}

where N_{C_{j\_samp}} is the number of documents sampled from database C_j.
A centralised complete database is a union of all the documents in all the databases. P(rel|di) is the
probability of relevance given a specific document. The probability is calculated by reference to the
centralized complete database.
P(rel|di) can be simplified as the probability of relevance given the document rank when the centralized
complete database is searched by an effective retrieval method.
The probability distribution function is a step function, which means that for the documents at the top
of the ranked list the probabilities are a positive constant, and are 0 otherwise. This can be modelled as
follows:
Equation 3: the probability of relevance given the document rank

    P(rel \mid d_i) =
    \begin{cases}
      C_q & \text{if } Rank\_central(d_i) < ratio \times N_{all} \\
      0   & \text{otherwise}
    \end{cases}

where:
C_q – a query-dependent constant
Rank_central(d_i) – the rank of document d_i in the centralized complete database
N_all – the number of documents in all the databases
The ReDDE algorithm uses constants to model the relevance given a document, and to determine how much of the centralized complete database ranking to model. A ratio is used so that relevant documents are only looked for in a fraction of the centralised complete database. Usually the ratio is set to 0.003, so if there are 1,000,000 documents in total, the top 0.003 × 1,000,000 = 3,000 documents are considered for relevance ranking.
A centralised database that keeps all documents of all databases is impractical. However, the rank of a
document in the centralized complete database can be estimated, from its rank within the centralized
sample database, which is the union of all the sampled documents from different databases (query-based
sampling).
The query is submitted to the centralized sample database; a document ranking is returned.
Given representative resource descriptions and database size estimates, we can estimate how documents from the different databases would be arranged in a ranked list if the centralized complete database existed and were searched. The rank of a document in the centralized complete database is estimated from its rank in the centralized sample database as:

Equation 4: estimated rank of a document in the centralized complete database

    Rank\_central(d_i) = \sum_{d_k : Rank\_samp(d_k) < Rank\_samp(d_i)} \frac{N_{C(d_k)}}{N_{C(d_k)\_samp}}

where Rank_samp(d) is the rank of a document in the centralized sample database and C(d_k) is the database that document d_k was sampled from. In other words, every document ranked above d_i in the sample database stands in for N_{C(d_k)} / N_{C(d_k)_samp} documents of its own database.

The last step is to normalize the estimates of Equation 2 (the estimated number of relevant documents in the jth database for a specific query) to remove the query-dependent constant, which provides the distribution of relevant documents among the databases:

Equation 5: estimated distribution of relevant documents across the databases

    \hat{Dist}(q, C_j) = \frac{\hat{R}(q, C_j)}{\sum_k \hat{R}(q, C_k)}

The databases are then ranked by this estimated distribution of relevant documents.
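To tie the equations together, here is a compact sketch of the ReDDE estimate on toy data. It assumes the centralized sample database search and the database size estimates already exist, drops the query-dependent constant C_q (which the normalisation removes anyway), and uses invented names and numbers; it is not the implementation used in the cited papers.

# `sample_ranking` is the ranked list returned when the query is run against the
# centralized sample database; `db_sizes` / `sample_sizes` are the (estimated)
# database sizes and the number of sampled documents per database.
def redde(sample_ranking, doc_to_db, db_sizes, sample_sizes, ratio=0.003):
    n_all = sum(db_sizes.values())
    rel = {db: 0.0 for db in db_sizes}
    rank_central = 0.0
    for doc in sample_ranking:
        db = doc_to_db[doc]
        if rank_central < ratio * n_all:                 # step function of Equation 3
            rel[db] += db_sizes[db] / sample_sizes[db]   # Equation 2, with C_q dropped
        # Each sampled document stands in for N_Cj / N_Cj_samp documents (Equation 4).
        rank_central += db_sizes[db] / sample_sizes[db]
    total = sum(rel.values()) or 1.0
    return {db: r / total for db, r in rel.items()}      # normalisation of Equation 5

sample_ranking = ["s1", "s2", "s3", "s4"]                # ranked by the sample database
doc_to_db = {"s1": "dbA", "s2": "dbB", "s3": "dbA", "s4": "dbC"}
db_sizes = {"dbA": 50_000, "dbB": 2_000, "dbC": 10_000}
sample_sizes = {"dbA": 300, "dbB": 300, "dbC": 300}
print(redde(sample_ranking, doc_to_db, db_sizes, sample_sizes))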
Resource selection algorithms are typically compared using a recall metric R_k, with B as a baseline ranking, which is often the relevance-based ranking (RBR), and E as the ranking provided by the resource selection algorithm being evaluated:

    R_k = \frac{\text{relevant documents in the top } k \text{ databases of ranking } E}{\text{relevant documents in the top } k \text{ databases of ranking } B}
After resource ranking and selection, the query is forwarded to the selected databases, and the ranked lists they return must be merged into a single, final ranked list; this is done using the SSL method. The SSL method is available in the Lemur toolkit, a system widely used in information retrieval. It is written in the C and C++ languages and is designed as a research system to run under Unix operating systems, although it can also run under Windows [12].
The Lemur toolkit
“The Lemur toolkit is designed to facilitate research in language modelling and information retrieval, where
IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR,
summarization, filtering, and classification” [12].
Lemur has many applications for indexing and retrieval that are fully functional for many purposes; the source code is also provided, so users can build their own classes using the existing methods.
For distributed IR, Lemur provides applications for:
- query-based sampling
- database ranking (CORI)
- results merging (CORI, single-regression and multi-regression merge)
The regression merge applications use the SSL algorithm, which can be combined with the ReDDE algorithm for merging results. The regression maps the database-specific scores to database-independent scores, so that a better merged list can be built with minimal redundancy.
6 Federated Search Engines
A federated search engine is a combination of multiple 'channels of information' providing resources to the user through a single interface: resources such as e-journals, subscription databases, electronic print collections, other digital repositories, and the Internet. There are different types of federated search engine: ones which allow the user to select which databases they want to search for their documents, and others where the user does not need to know which databases are used for their service. Vendors in the federated search area, such as MuseSearch and WebFeat, offer their products as a way to search multiple subscription resources at one time through an easy-to-use front-end.
Federated search engines use resource selection and ranking, and SSL-style merging algorithms, to retrieve documents from distributed databases. A query is broadcast to a group of heterogeneous databases which are specialised in a specific area of study (such as medical physics); the results retrieved from the databases are merged together and presented to the user in a unified format with minimal duplication.
Prism [13] is a commercial search engine and translator system developed by WebFeat which is used with the Thomson ISI Web of Knowledge [14] system to provide its users with a federated search engine giving access to a large variety of resources simultaneously:
 ISI® resources
 ISI partner resources
 Subscribed databases
 Freely available databases
 Editorially evaluated Web sites
 Select publishers' full-text documents
 Library catalogue holdings
 Proprietary databases
 Develop a single, convenient portal library to all of your institution's electronic research resources
(translation of any database)
 Manage and organize diverse library collections
 Customize the search interface to meet your access and format preferences
 Receive detailed usage reports
 Have an extended library automation system solution with advanced integration services
 Receive simple set up and ongoing maintenance by WebFeat specialists
Below are screen shots of the Thomson ISI Web of Knowledge system taken from [15]
Screenshot: the CrossSearch query entry page. CrossSearch covers 9,000+ international journals; 100,000+ meetings, symposia, and reports; and 11.3 million patented inventions.
You can filter results by specific database. This is especially helpful in identifying
particular information, such as patent data, within the results list.
7 Other issues
Improve scalability by replication
One of the main problems with centralised information retrieval is handling excessive workloads while maintaining a large index of documents. There is also the problem that the centralised system is a single point of failure, so if the main connection goes down, everything goes down. Within distributed computing there is a process called replication, which distributes the excessive workload of a single server onto replicated servers located in different areas (closer to groups of users). This method can be applied to distributed information retrieval.
Figure 7.1 is an example of how replication of information may work in a distributed information retrieval environment [16]; it shows the hierarchical structure of a replication system.
How it works
When a user issues a query from location 'cluster 1', the query goes to replica selection index 1 to see which level has the best matches for the query: replica 1-1, replica 1 or the original collection.
If replica 1-1 has the best match, the query can be sent to any of the levels linked to cluster 1, depending on which has the lowest workload.
If replica 1 has the best match, the query is sent to either replica 1 or the original collection.
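A rough sketch of this routing decision is given below; the match estimates, load figures and threshold are invented, and a real system would use the selection index's scores and live load measurements instead.

# The selection index estimates how well each level (partial replica 1-1,
# replica 1, original collection) matches the query, then the query is routed
# to the least-loaded server holding a good-enough level.
replicas = [
    {"name": "replica 1-1", "match": 0.72, "load": 0.30},
    {"name": "replica 1",   "match": 0.85, "load": 0.55},
    {"name": "original",    "match": 1.00, "load": 0.90},
]

def route_query(query_match_needed):
    # Keep every level that can satisfy the query...
    candidates = [r for r in replicas if r["match"] >= query_match_needed]
    # ...and among those, prefer the one with the lowest current workload.
    return min(candidates, key=lambda r: r["load"])["name"]

print(route_query(0.70))   # replica 1-1 is good enough and least loaded
print(route_query(0.80))   # falls back to replica 1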
Standards and Protocols
Having standards and protocols for resources and for the distributed information environment is important for the selection of resources and for presenting users with information from heterogeneous networks.
Dublin Core
Metadata is data about data: information which is known about a specific document, image, video or audio file, enabling direct access to it. With effective use of metadata it is possible to uniquely identify a group of information which might be specific to a user request, or to generalise that group of information; it depends on how many metadata elements you use.
The Dublin Core metadata standard is a simple yet effective element set for describing a wide range of
networked resources. The Dublin Core standard comprises fifteen elements, the semantics of which have
been established through consensus by an international, cross-disciplinary group of professionals from
librarianship, computer science, text encoding, the museum community, and other related fields of
scholarship [15].
Title: Retrieval of information from DIR
Creator: Ananth Anandhakrishnan
Subject: Distributed Information Retrieval
Description: Resource selection, ranking, merging
Publisher: Dave Inman
Contributor: Luo Si, Jie Lu, and Jamie Callan
Date: 24/11/2004
Type: report
Format: Word document
Identifier: G2
Source:
Language: English
Relation: none
Coverage:
Rights:
How metadata is embedded into HTML
In HTML there is a tag called META, which is used to describe parts of a document, such as the creator of the document or the other information listed above. Dublin Core metadata is indicated by using the DC prefix before the element name, as in the example below:
<html>
<head>
<title>Distributed Information Retrieval</title>
<link rel="schema.DC" href="http://purl.org/DC/elements/1.0/">
<meta name="DC.Title" content="Retrieval of information from DIR">
<meta name="DC.Creator" content="Ananth Anandhakrishnan">
<meta name="DC.Type" content="report">
<meta name="DC.Date" content="24/11/2004">
<meta name="DC.Format" content="text/html">
<meta name="DC.Language" content="en">
</head>
<body>
Distributed Information Retrieval
Abstract
Intro
Body
Conclusion
</body>
</html>
How metadata is embedded into RDF/XML
RDF (Resource Description Framework) is a model for machine-understandable metadata used to provide standard descriptions of web resources. It is usually expressed in XML, creating metadata schemes that can be read by humans as well as machines.
For example:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://myweb.lsbu.ac.uk/~dave/">
    <dc:creator>Ananth Anandhakrishnan</dc:creator>
    <dc:title>Distributed Information Retrieval</dc:title>
    <dc:description>How information retrieval from distributed databases works</dc:description>
    <dc:date>2004-10-24</dc:date>
  </rdf:Description>
</rdf:RDF>
Enforcing standards in metadata will be of great assistance to resource selection in cooperative environments.
Z39.50 search protocol
Z39.50 is an ISO standard network protocol for information retrieval across different networks of computers. It specifies the formats and procedures governing the exchange of messages between a client and a server, where the client has access to a large amount of heterogeneous information residing on very different networks of computers.
Z39.50 recognizes that information retrieval consists of two primary components, selection of information based upon some criteria and retrieval of that information, and it provides a common language for both activities. Z39.50 standardizes the manner in which the client and the server communicate and interoperate even when there are differences between computer systems, search engines, and databases. Interoperability is achieved through standardization of: codifying mechanics, a standard way of encoding the data for transmission over the network; and content semantics, a standard data model with shared semantic knowledge for specific communities, allowing interoperable searching and retrieval within each of these domains [16].
This protocol is used in federated search engines such as WebFeat to allow the heterogeneous networks to combine and provide a single view to the user.
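The sketch below is only a toy illustration of the two activities Z39.50 separates: a search that builds a named result set on the server, and a later "present" step that retrieves records from it. It does not model the real protocol's wire encoding, attribute sets or record syntaxes, and the class and record fields are invented.

class ToyZServer:
    def __init__(self, records):
        self.records = records
        self.result_sets = {}

    def search(self, set_name, term):
        """Select matching records server-side and keep them as a named result set."""
        hits = [r for r in self.records if term.lower() in r["title"].lower()]
        self.result_sets[set_name] = hits
        return len(hits)                      # the client first learns only the count

    def present(self, set_name, start, count):
        """Retrieve a slice of a previously built result set."""
        return self.result_sets[set_name][start:start + count]

server = ToyZServer([{"title": "Distributed Information Retrieval"},
                     {"title": "Modern Information Retrieval"},
                     {"title": "Natural Language Processing"}])
hits = server.search("default", "information retrieval")
print(hits, server.present("default", 0, 10))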
8 Other examples of distributed information retrieval
Emerge
Emerge is a piece of software built in Java with an XML-based translation engine which can perform metadata mapping and query translation. It attempts to solve the problem of retrieving heterogeneous information residing on heterogeneous networks using distributed information retrieval. Emerge focuses on the retrieval of scientific data sets, which are much more complex than the "document-like" data found on the web; there are many scientific data formats, many of them non-standard [17]. It would be impractical to store these data sets in a centralised system because of scalability issues; such a system would not be able to provide an effective service. Distributed information retrieval depends on standards and search protocols to make information retrieval more interoperable. Emerge uses the Dublin Core metadata standard so the data sets can be uniquely identified and generalised, and it uses the Z39.50 protocol to allow searching over different networks.
SETI@Home
SETI@Home (Search for Extraterrestrial Intelligence) is a screensaver program developed at the University of California, Berkeley, which uses the CPU power of client machines whose owners choose to download the program and help the Berkeley team in the search for extraterrestrial intelligence [18]. Data gathered from the Arecibo radio telescope in Puerto Rico is stored on disks and tapes labelled with the information required to uniquely identify it. When the data needs to be processed it is loaded onto the Berkeley server, which distributes packets of the data to client machines all around the world for processing. Once the data is processed, the client machines send the results back to the server, which does some analysis to see if there are any possible hits.
Figure 8.1: SETI@HOME screenshot, showing the user information area used to keep track of the data being processed.
The server knows which machine is processing which information by keeping an index of the data packets sent out. In the screenshot above you can see the user information area, which holds information about the user, and other information boxes stating which data is being processed.
Harvest
Harvest is a search system which is distributed over different computers to collect information and make it searchable through a web interface. Harvest can collect information on the internet and on intranets using HTTP, FTP and NNTP, as well as from local files such as data on hard disks, CD-ROMs and file servers. The current list of supported formats, in addition to HTML, includes TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft Word/Excel, SGML, C sources and many more [19].
How it works
Harvest consists of three subsystems:
The gatherer subsystem collects indexing information from resources available at the provider sites, such
as FTP and HTTP servers. The broker subsystem retrieves indexing information from one or more
Gatherers, removes any duplication, incrementally indexes the collected information, and provides a query
interface to it.
Figure 8.2: the Harvest structure. Gatherers collect the information available at the provider sites (Provider 1, 2 and 3); the broker collects, stores and manages that information for clients to query.
Harvest Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary
Object Interchange Format (SOIF). Gatherers generate content summaries for individual objects in SOIF,
and serve these summaries to Brokers that wish to collect and index them. SOIF provides a means of
bracketing collections of summary objects, allowing Harvest Brokers to retrieve SOIF content summaries
from a Gatherer for many objects in a single, efficient compressed stream. Harvest Brokers provide support
for querying SOIF data using structured attribute-value queries and many other types of queries [19].
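As a rough idea of what such a summary stream looks like, the sketch below serialises a gatherer-style content summary into a SOIF-like attribute-value form; the attribute names are only illustrative, and the Harvest manual defines the exact conventions of the format.

# A small sketch of a SOIF-style summary object: a template type, the object's
# URL, and a list of attributes whose values are preceded by their length.
def to_soif(template, url, attributes):
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        lines.append("%s{%d}:\t%s" % (name, len(value), value))
    lines.append("}")
    return "\n".join(lines)

summary = to_soif("FILE", "http://example.org/report.html", {
    "Title": "Distributed Information Retrieval",
    "Author": "Ananth Anandhakrishnan",
    "Type": "HTML",
})
print(summary)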
Conclusion
The aim of distributed information retrieval is to provide the user with a uniform interface to search for, and retrieve, a ranked list of documents from heterogeneous databases which contain heterogeneous information relevant to the user's query.
This can be achieved firstly by applying a search protocol which enables information retrieval from heterogeneous networks; this gives the information more interoperability. To improve the selection of documents, standards have to be applied to metadata (data about data), such as Dublin Core, which defines fifteen elements used to identify documents.
Distributed information retrieval depends on its sub-components for it to work effectively. Resource selection is concerned with selecting the set of databases that are relevant to a query and selecting the highest-ranked documents from them. The resource selection algorithm Relevant Document Distribution Estimation (ReDDE) is seen as the most effective for doing this.
The Semi Supervised Learning (SSL) method involves producing a single ranked list of documents from distributed databases that closely approximates the ranked list that would be produced if all of the documents from all of the databases were stored in a centralised global database. This has proven to be the most effective way to tackle the problem of merging ranked lists from distributed databases. The SSL method uses a centralised sample database which contains resource descriptions and some of the documents obtained by query-based sampling. The aim of SSL is to map the database-specific scores onto database-independent scores; this is done by regression.
Federated search engines and current P2P searching technologies use this model to retrieve documents from heterogeneous databases. A federated search engine is a combination of multiple channels of information presented to the user through a single user interface.
Distributed information retrieval systems can be used in many fields, such as space science. The SETI@HOME program is an example of distributed data mining, because the data packets processed by client machines are used to search for extraterrestrial life in the universe.
Other examples of distributed information retrieval systems are Emerge, which searches for scientific information in different formats using an XML-based translation engine that can perform metadata mapping and query translation, and Harvest, a search system distributed over different computers to collect information and make it searchable through a web interface. Harvest consists of three subsystems: provider, gatherer and broker. Information is sent between them using SOIF, which builds up resource descriptions of the contents being exchanged between the subsystems. Emerge and Harvest are similar applications; the main difference is the way the metadata is handled, and the Emerge approach is probably the better one because it uses XML.
References
[1] Tom Walker “The state of distributed search”, NewsForge, August 2004
http://internet.newsforge.com/article.pl?sid=04/08/04/1345206
A look into a new distributed searching program called grub.
[2] No Author, “Peer-2-Peer”, Wikipedia http://en.wikipedia.org/wiki/Freenet
Computer encyclopaedia, provided information about different generation of P2P technologies.
[3] Ricardo Baeza-Yates and Berthier Ribeiro-Neto “Modern Information Retrieval”
http://www.sims.berkeley.edu/~hearst/irbook/glossary.html
Provide an online glossary for information retrieval terms, and some resources from the book.
[4] Susan Feldman “NLP Meets the Jabberwocky: Natural Language Processing in Information Retrieval”,
Information Today, Inc. May 1999
An easy to read introduction to NLP and information retrieval.
[5] Andrew S. Tanenbaum and Maarten van Steen, "Distributed Systems: Principles and Paradigms", Prentice Hall.
Core textbook for distributed computer systems.
[6] No Author,“Distributed Information Retrieval of Scientific Data”, Emerge
http://dlt.ncsa.uiuc.edu/archive/emerge/distributed_search.html
Emerge is a software solution attempting to overcome some of the problems in distributed
information retrieval – interoperability of information.
[7] Luo Si and Jamie Callan "A Semisupervised Learning Method to Merge Search Engine Results", Carnegie Mellon University, January 2003
Detailed paper about a new method, semi-supervised learning, for merging results from distributed information retrieval systems.
[8] Luo Si and Jamie Callan “The Effect of Database Size Distribution on Resource Selection Algorithms”
Carnegie Mellon University
Investigation into various resource selection, ranking and merging algorithms.
[9] Luo Si "Federated Search of Text Search Engines in Uncooperative Environments", Carnegie Mellon University
PowerPoint slides showing how the ReDDE algorithm works in distributed searching system.
[10] Luo Si, Jie Lu, and Jamie Callan “Distributed Information Retrieval With Skewed Database Size
Distributions”, Carnegie Mellon University
In-depth look at how the ReDDE algorithm works.
[11] Luo Si and Jamie Callan “Relevant Document Distribution Estimation Method for Resource Selection”
Carnegie Mellon University
[12] No Author, http://www-2.cs.cmu.edu/~lemur/ The Lemur toolkit
The Lemur Toolkit is designed to facilitate research in language modelling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification.
[13] No Author, http://www.webfeat.org/products/prism.htm Webfeat
Developers of federated search engines for commercial use.
[14] No Author, Thomson ISI Web of Knowledge http://www.isinet.com/
“A fully integrated research platform... empowering researchers and accelerating discovery”
[15] No Author, PowerPoint Presentation of the Thomson ISI web of knowledge
www.deflink.dk/upload/doc_filer/doc_alle/1152_Derwent_2003.ppt
[16] Kathryn McKinley http://www-ali.cs.umass.edu/Darch/
“Using replication in distributed information retrieval systems”
[17] Diane Hillmann, “Using Dublin Core”, Dublin Core Metadata Initiative,
“This document is intended as an entry point for users of Dublin Core. For non-specialists, it will
assist them in creating simple descriptive records for information resources (for example, electronic
documents). Specialists may find the document a useful point of reference to the documentation of
Dublin Core, as it changes and grows.”
[18] Sonya Finnigan and Nigel Ward“Z39.50 Made Simple”
http://www.dstc.edu.au/Research/Projects/Z3950/zsimple.htm
Information about how the search protocol works.
[19] SETI@HOME http://setiathome.ssl.berkeley.edu/, Berkeley University
SETI@home is a scientific experiment that will harness the power of hundreds of thousands of Internet-connected computers in the Search for Extra-Terrestrial Intelligence (SETI).
[20] Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee “Harvest user manual”,
October 2002, http://harvest.sourceforge.net/harvest/doc/html/manual-1.html
Site provides information on how the software works and how to install and use it.