A High Performance Implementation of an OAI-Based Federation Service
Kurt Maly, Mohammad Zubair, Li Xuemei
Old Dominion University, Norfolk, Virginia, USA
{maly, zubair, xli}@cs.odu.edu
Abstract

As the number of OAI-PMH compliant digital libraries grows, we are encountering significant performance bottlenecks. Our earlier experience implementing the searching and indexing for Arc [6] on a single processor showed that standard database technology such as Oracle does not scale. In this paper, we describe how a cluster of PCs can be used to improve indexing and searching performance for an OAI-based federation service. To support parallel indexing and searching, we use the Lucene search engine. We evaluated our architecture on a total of 10 million harvested records on a Sun cluster, and we compare the performance of the new implementation with the existing single processor implementation, which uses Oracle 9i as the underlying database for searching and indexing. We demonstrate a significant performance gain over the single processor implementation.
1. Introduction
We are seeing an exponential growth of online information, and digital libraries are playing a key role in managing this information explosion. A wide variety of digital libraries [9] exist today in terms of the type of information they manage. On one end of the spectrum we have digital libraries managing unstructured information such as Web pages, popularly referred to as search engines [10][12][4]; on the other end of the spectrum we have digital libraries that manage structured information. These digital libraries differ in the services they provide to end-users and the collections they hold. Google [3] is an example of a search engine that harvests Web pages and provides a discovery service using keyword search. The ACM Digital Library [1] is an example of a digital library that stores refereed conference proceedings and journal articles along with metadata fields such as author and title. The ACM library provides discovery services over the various stored metadata fields.

A number of digital libraries managing structured information exist today. However, there is no federated service (like Google for Web sites) that provides a unified interface to all these libraries, which we believe is necessary for faster dissemination. The biggest obstacle to building a federated service is that many digital libraries use different, non-interoperable technologies.
One major effort that addresses interoperability is the Open Archives Initiative (OAI) framework [11], which facilitates the discovery of content stored in distributed archives. The OAI framework supports data providers (archives) and service providers. Service providers develop value-added services based on the information collected from cooperating archives. These value-added services can take the form of a federated search engine like Arc [6]. A typical data provider would be a digital library with its own set of publishing tools and policies, and without any constraints on how it implements its services. The major addition is a layer that exposes its metadata (e.g., fields such as creator and title) in a well-specified format. Normally, one of the fields is a link to the actual document in its collection.
Assuming that a rapid increase (e.g., several orders of magnitude) in the adoption of OAI-PMH occurs, we now have a different problem: how to efficiently discover, harvest, and index the burgeoning OAI-PMH corpus. Currently, our research group at Old Dominion University provides a federation service, Arc, pro bono publico. Since harvesting, indexing, and searching all run on the same server, performance is becoming a bottleneck and reliability is low. We are working on a project to improve performance and reliability by exploiting parallelism at all levels: harvesting, indexing, and searching. In a companion paper, we discuss how we use Grid technology to parallelize harvesting and improve performance on that part of the system [8]. In this paper, we focus on how a cluster of PCs can be used to improve indexing and searching performance for an OAI-based federation service. Our earlier experience implementing the searching and indexing for Arc [6] on a single processor showed that standard database technology such as Oracle does not scale: the indexing process on Oracle for 7 million records takes around two days, and a query that results in a large number of hits takes several minutes. We propose an architecture that parallelizes the indexing and searching process on a cluster of PCs. The two main factors that influence our design are load balancing and data synchronization. Our solution to load balancing is to provide each cluster node with the same number of records and run the indexing process on each node in parallel.

To support parallel indexing and searching, we use the open source Apache Jakarta Lucene search engine. We evaluated our architecture on a total of 10 million harvested records on a Sun cluster of 11 Sun Fire V20z machines, each powered by two 2.4 GHz AMD Opteron processors. We compare the performance of the new implementation with the existing single processor implementation, which uses Oracle 9i as the underlying database for searching and indexing, and we demonstrate a significant performance gain.
2. Architecture

In Figure 1 we have reproduced from the companion paper [8] the architecture of the DL Grid, which is what we call the parallel, high performance federation service. Of concern for this paper is only the lower part, which shows the indexing, storing, and searching over a cluster of nodes that are not part of the Grid. Of importance, though, is the Metadata Collection/Transfer Service (MCS), the interface between the searching and harvesting components of the DL Grid. It receives records from the various harvesters asynchronously and needs to distribute them to the index nodes in a balanced fashion. We need to remember that a federation service is an ongoing, continuously changing collection: new records are published in the source digital libraries, and periodic harvests bring these new records to the federation. On the other end of the system is the search node, which interfaces the end-user with the index nodes such that the latest, most up-to-date index is used and no interference with ongoing re-indexing occurs.

We have designed our architecture with Lucene as the cornerstone. Lucene is a free text-indexing and searching API written entirely in Java [5]. We combine the index server with the document server using the Lucene indexing API. Here, the index refers to a file that contains a list of terms and, for each term, pointers to the documents that contain it. In our application (searching metadata records harvested from a set of individual digital libraries), one metadata record is a document. The text of a field is tokenized and indexed, and is stored in the index to be returned along with the hits.
Figure 1. Overall architecture of DL Grid
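To make this indexing model concrete, the following minimal sketch (written against the Lucene 1.x API of the period) turns one harvested metadata record into a Lucene document. The field names, identifiers, and index path are illustrative assumptions, not taken from the Arc implementation.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: index one harvested metadata record as a Lucene document.
// Field names (identifier, title, creator, description) are illustrative.
public class RecordIndexer {
    public static void addRecord(IndexWriter writer, String id, String title,
                                 String creator, String description) throws Exception {
        Document doc = new Document();
        // Field.Text fields are tokenized, indexed, and stored, so they can
        // be searched and also returned along with the hits.
        doc.add(Field.Keyword("identifier", id));   // e.g., the OAI ID; not tokenized
        doc.add(Field.Text("title", title));
        doc.add(Field.Text("creator", creator));
        doc.add(Field.Text("description", description));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        // 'true' creates a fresh index; the index scheduler would pass
        // 'false' to add records incrementally (see Section 4).
        IndexWriter writer = new IndexWriter("/data/shard", new StandardAnalyzer(), true);
        addRecord(writer, "oai:example:0001", "Sample Title", "Doe, J.", "Sample abstract.");
        writer.optimize();
        writer.close();
    }
}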
The total process of parallelizing both indexing and searching consists of four steps: (1) distribute documents (metadata records) evenly among the available nodes; (2) index documents locally at each node in parallel; (3) bind the search engine dynamically to the latest indexes; and (4) process the query in parallel. Figure 2 illustrates these steps, and we follow with a more detailed description.
Record Distribution. The MCS node collects the records from the harvester nodes and has the task of load balancing the index nodes such that the indexes have approximately the same size. In the companion paper [8] we describe the design process behind the current choice of a single node and how we may change it in the future, as it can become another bottleneck. Here we assume a single node through which all records pass; a minimal sketch of a balanced assignment follows.
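As a sketch of such a balanced assignment (Section 3 compares the candidate methods in detail; the class name is ours), a plain round-robin dispatcher that gives each index node the same number of records could look like this:

// Sketch: round-robin assignment of incoming records to index nodes,
// so every node receives (approximately) the same number of records.
public class RoundRobinDistributor {
    private final int numNodes;
    private int next = 0;

    public RoundRobinDistributor(int numNodes) { this.numNodes = numNodes; }

    // Returns the index node that should receive the next record.
    public synchronized int nextNode() {
        int node = next;
        next = (next + 1) % numNodes;
        return node;
    }
}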
Data Synchronization. For indexing documents, we run an index scheduler on each index node. It checks whether new data records have arrived from the MCS node. As new arrivals are detected, we incrementally re-index, adding them to the existing index. Table 1 shows that the indexing time increases linearly with the number of documents (metadata records) on a single node.
Figure 2. Index and search cluster architecture
Table 1. Scaling of indexes

Indexed records    Index time (hours)    Document index size
10 million         9.8                   5.8 GB
5 million          4.7                   2.9 GB
2 million          1.9                   1.2 GB
1 million          0.95                  626 MB
Search Service. When a user enters a query, the request is sent to each node of the cluster, where the query is processed locally against the index shard on that node. A shard is a random subset of the complete index of the federation. The results from each node are merged according to a relevance scoring algorithm. The search results are returned in the form of a Hits object. A Hits object acts as an array of hits, and for each hit it provides a copy of the document (only the 'storable' fields) and the score of that hit [7].
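For illustration, the following sketch (again in the Lucene 1.x API; the index path and field names are assumptions) runs one query against a local shard and walks the resulting Hits object:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

// Sketch: query one shard and iterate the Hits object.
public class HitsExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/data/shard"); // path is illustrative
        Hits hits = searcher.search(
                QueryParser.parse("network", "description", new StandardAnalyzer()));
        System.out.println(hits.length() + " matching records");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);   // a copy of the stored fields only
            System.out.println(doc.get("title") + "  score=" + hits.score(i));
        }
        searcher.close();
    }
}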
3. Records Bookkeeping and Distribution

As outlined in Figure 1, the metadata collection service is responsible for distributing metadata records to the various indexing nodes. The two issues we need to address here are: (a) uniform distribution of records amongst the available indexing nodes, for load balancing of the search algorithms; and (b) bookkeeping for managing record updates. The second point needs more explanation. The OAI framework supports harvesting of updated records. This requires that the original record at the federated search engine be replaced by the updated metadata record. For this purpose, either the metadata collection needs to keep information about every record's location, or we need techniques that easily identify the location of a record.
We considered several techniques for this problem. One naïve way is to distribute the records uniformly using the round-robin method, along with a table that keeps track of the destination index node for each record. This technique requires every record to be checked (before shipping) against this potentially very large table to make sure that a record with the same id does not already exist. If a record with the identical id is in the table, it is shipped to that record's indexing node (and not the one assigned by the round-robin method). This process is clearly inefficient in terms of storage and time performance.
The second technique we explored uses a hash function to determine each record's destination (index) node. More specifically, we use record_id mod n, where record_id is typically the OAI ID, which is unique, and n is the number of index nodes. This technique can lead to a non-uniform distribution of records amongst the indexing nodes (because of the non-uniform distribution of this hash function).
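A minimal sketch of this assignment, under the assumption that the string OAI ID is reduced to an integer with Java's hashCode() (the class name is ours):

// Sketch: hash-based assignment, record_id mod n. Uniformity depends
// entirely on how the OAI IDs hash.
public class HashDistributor {
    private final int numNodes;

    public HashDistributor(int numNodes) { this.numNodes = numNodes; }

    public int destinationNode(String oaiId) {
        // Mask off the sign bit so the modulus is always non-negative.
        return (oaiId.hashCode() & 0x7fffffff) % numNodes;
    }

    public static void main(String[] args) {
        HashDistributor dist = new HashDistributor(10);
        System.out.println(dist.destinationNode("oai:example:0001"));
    }
}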
The third technique we explored uses the round-robin method for uniform distribution, but does not keep track of record destinations. Record updates are handled by a separate process which, for every record coming from the incremental harvests, uses the search service of the federation to determine whether such a record already exists. If the record exists, the search service also returns the source index node for the record; a sketch of this lookup is given below. The advantage of this method is that the update process can be run asynchronously in the background and only needs to keep track of the limited number of records not yet processed.
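A sketch of this lookup, assuming the OAI identifier is indexed as an untokenized keyword field (the class and field names are ours, and probing each shard's searcher stands in for the federation's search service):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

// Sketch: find which shard (if any) already holds a record, by querying
// each shard for its unique identifier field.
public class UpdateLocator {
    /** Returns the shard index holding the record, or -1 if it is new. */
    public static int locate(Searcher[] shards, String oaiId) throws Exception {
        TermQuery query = new TermQuery(new Term("identifier", oaiId));
        for (int i = 0; i < shards.length; i++) {
            Hits hits = shards[i].search(query);
            if (hits.length() > 0) return i;   // existing record: update it here
        }
        return -1;  // not found: a new record, assign by round-robin
    }
}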
We contrast these three methods in Table 2. In this table, n is the number of records, and T represents the RR/Search bookkeeping time for managing record updates, i.e., the time used to search and determine whether a record exists and on which node it resides.

Table 2. Contrast of three methods

Method      Uniform    Storage     Syn    Dis*             Update
RR          Yes        Not good    Yes    O(n) or lower    O(n)
Hash        No         Good        Yes    O(n)             O(1)
RR/Search   Yes        Good        No     O(n) or lower    T

Dis* means distribution of records.
4. Data Synchronization

The complete index for the federation is partitioned into as many shards as we have index nodes in the cluster. In Figure 3 we show the process of indexing and binding. Our design follows several guidelines:

▪ Incremental indexing is better than re-indexing (our experiments show this is true).
▪ Searching and index updates must not block each other.
▪ The number of index shards on each cluster node should be minimal.
Based on these requirements, we redesigned our scheme to use only one index shard per node. The IndexScheduler and the Search Engine bind to the same index shard. When a new set of records arrives, the IndexScheduler incrementally indexes it into that shard. Additionally, there is an "indexed" directory on each cluster node: after the IndexScheduler completes its indexing task, it moves the new set of records to the "indexed" directory. To get fresh data, the Search Engine checks for changes to the index shard and automatically rebinds if it finds any update. During indexing and dynamic rebinding, service remains uninterrupted and all parts of the index remain available. A sketch of this scheduler loop follows.
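The following sketch outlines the scheme under our own naming and path assumptions; it uses the Lucene 1.x incremental IndexWriter and an index-version check to decide when to rebind:

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

// Sketch: incrementally add newly arrived records to the local shard, move
// them to the "indexed" directory, and rebind the searcher when the shard
// version changes. Paths are illustrative.
public class IndexScheduler {
    private static final String SHARD = "/data/shard";
    private long boundVersion = -1;
    private IndexSearcher searcher;

    void indexNewArrivals(File[] newRecords) throws Exception {
        // 'false' opens the existing shard for incremental addition.
        IndexWriter writer = new IndexWriter(SHARD, new StandardAnalyzer(), false);
        for (int i = 0; i < newRecords.length; i++) {
            Document doc = new Document();
            // A real implementation would parse the XML record into fields
            // (see the indexing sketch in Section 2); here we index raw text.
            doc.add(Field.Text("contents", new FileReader(newRecords[i])));
            writer.addDocument(doc);
            newRecords[i].renameTo(new File("/data/indexed", newRecords[i].getName()));
        }
        writer.close();
    }

    IndexSearcher currentSearcher() throws Exception {
        long version = IndexReader.getCurrentVersion(SHARD);
        if (version != boundVersion) {        // shard changed: rebind
            if (searcher != null) searcher.close();
            searcher = new IndexSearcher(SHARD);
            boundVersion = version;
        }
        return searcher;
    }
}

Because a Lucene reader sees a point-in-time snapshot of the index, searches against the currently bound searcher keep working while the writer appends; the rebind simply swaps in a fresher snapshot, which is what keeps the service uninterrupted.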
Figure 3. Indexing and search engine binding in
each cluster node
5. Search Service

The search interface supports both simple and advanced search, with sorting by specific fields. Advanced search currently allows a user to search free text by title, author, and description using the Boolean operators AND and OR. Figure 4 shows the query serving architecture.

In Figure 4, searchers.xml refers to a configuration file that stores information about all remote machines. When the search service server is started, it reads the information about the remote search cluster nodes, including IP addresses and the ports their search engines run on, from this file. It then uses this information to create connections to the remote machines using RMI.

We implemented parallel multiple searches using the Lucene API. The user interface accepts a user's query and calls JSP/Servlet applications. The parallel multiple searcher then performs a parallel search over a set of remote 'searchables', which are bindings on the local index and document file system of each cluster node. The matches returned from each search node are merged by relevance score (or sorted by whatever field the user requested) before being displayed to the user.

Figure 4. Query serving architecture
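A sketch of this query path, under assumed node names (the real list would come from searchers.xml): each cluster node exports its shard over RMI with Lucene's RemoteSearchable, and the search service node fans the query out with ParallelMultiSearcher, which merges the scored hits.

import java.rmi.Naming;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

// Sketch: look up the remote searchables exported by the cluster nodes and
// run one query across all shards in parallel.
public class SearchServiceNode {
    public static void main(String[] args) throws Exception {
        // On each cluster node, the shard is exported with something like:
        //   Naming.rebind("//node1:1099/shard",
        //       new RemoteSearchable(new IndexSearcher("/data/shard")));
        String[] nodes = { "//node1:1099/shard", "//node2:1099/shard" }; // illustrative
        Searchable[] remotes = new Searchable[nodes.length];
        for (int i = 0; i < nodes.length; i++) {
            remotes[i] = (Searchable) Naming.lookup(nodes[i]);
        }
        // ParallelMultiSearcher queries each shard in its own thread and
        // merges the hits by relevance score.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(remotes);
        Hits hits = searcher.search(
                QueryParser.parse("network", "description", new StandardAnalyzer()));
        System.out.println(hits.length() + " matches across all shards");
    }
}

In the Lucene 1.x line, Searchable extends java.rmi.Remote, which is what allows the stub returned by Naming.lookup to be used directly as a searchable.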
6. Experimental Results

To test our architecture, we developed a test bed with 10 index nodes and one search node in a Sun cluster of 11 Sun Fire V20z machines, each powered by two 2.4 GHz AMD Opteron processors. We used a comprehensive HTTP client/server test application, Webserver Stress Tool [13], to test the performance of our search engine. Table 3 shows the search times of our DL GRID. For comparison, we show the times of the search engine Arc, a centralized federation service in our digital libraries holding a total of 7.1 million metadata records. Arc is a single processor implementation that uses Oracle 9i as the underlying database for searching and indexing. We translated metadata records in the Arc database directly to XML files that can be indexed by our DL GRID indexer (the harvester part of the system was not yet ready when we performed the experiments). During this conversion process we had to discard a significant number of records that could not be parsed by our XML parser (we did not try to fix the records, as we were only interested in an order-of-magnitude comparison at a comparable collection size). We obtained about 5 million records from this database in this simple harvest. Due to the different index sizes, the comparisons are only approximate. We selected two keywords, "network" and "John", whose result sets are close in size in both cases. For the keyword "network" the matches for ARC and DLGRID number about 25K; for the keyword "john" the matches for both number about 250K.
Table 3. Search times comparison between Arc and DL Grid

Engine (total records)   Keyword   Run   Display all records (s)   Display record count (s)
ARC (7.1 million)        Network   1     487.1                     3.1
                         Network   2     329.4                     3.3
                         John      1     1110.9                    16
                         John      2     832                       9.8
DLGRID (5 million)       Network   1     16.1                      0.1
                         Network   2     14.3                      0.1
                         John      1     432.1                     0.1
                         John      2     125                       0.1
Table 3 shows, for each keyword, two separate searches and the times both for displaying the entire result set and for displaying only how many records are in the result set. The speedup of the DL Grid is not just a factor of 10 (the number of index nodes) but significantly higher. This is a clear indication that the Oracle single database system is at the limit of its performance.

We wanted to see whether the index size has a significant impact on the search time, so we submitted a number of different queries to instances of the system where we distributed the 5 million records to a single cluster node, to two nodes, and to four nodes.
Table 4 shows the results for computing only the number of records rather than displaying them. To avoid distortion through caching, we list only the first search time for every keyword. We find that only when the result set is large do we see a scaling effect of decreasing search time as more nodes are added.
Table 4. Search time by number of machines on the same 5 million records

                       Search time (ms) by number of nodes
Key         Matches    1       2       4
Maly        165        54      75      44
Dynamical   11,259     124     74      44
Network     25,282     141     108     65
Computer    110,080    210     163     86
John        205,453    229     179     90
We did find that the search time increases with the number of records in the index. Table 5 shows the search time (and the number of records matched) as a function of the size of the index on one node.
Table 5. Search time increases with the number of records

            2.5 million            5 million              10 million
Key         Matches   Time (s)    Matches    Time (s)    Matches    Time (s)
maly        54        0.06        165        0.05        300        0.07
dynamical   4,836     0.09        11,259     0.12        22,519     0.14
computer    49,844    0.16        110,080    0.21        220,228    0.34
7. Conclusions and Future Work

We have demonstrated that we can successfully realize a large federated digital library with over 10 million records at better than acceptable performance. The total system uses Grid technology for harvesting records from over 200 individual collections, and Lucene to implement a parallel search on a cluster of PC nodes. In this paper we have discussed the issues involved in integrating the two parts, and the issues involved in parallelizing a changing set of indexes where documents may need to be updated, and thus located, before re-indexing is done. The performance increase we obtained in the demonstration test bed was a factor of 30, and it promises to scale well to collections even larger than 10 million records.

We are planning to investigate the issues arising from node failure, to provide a more reliable system. Secondly, we need to improve the relevance scoring.
A cluster web service is a complex system and much remains to be done. Our next goals include:

1. Create a new index pool that can address the problem of node failure. The same index shard should be stored on different machines, with a pool of machines serving requests for each shard [2]. This method is used by the Google search engine. Each request chooses a machine within a pool using an intermediate load balancer; in other words, each query goes to one machine (or a subset of machines) assigned to each shard. If a shard's replica goes down, the load balancer will avoid it and choose another machine that holds the same shard.

2. One of the most important measures of a search engine is the quality of its search results. We currently use Lucene's default relevance scoring, which is based on term and document frequencies [7]. It is not sufficient for our application. We are adjusting the order of results according to our own application's requirements, and we are also trying to create our own ranking algorithm to obtain more accurate results if possible.

A cluster search engine is a very rich research environment. We need smart data synchronization algorithms, cluster management, and load balancing methods.
8. References

[1] ACM Digital Library, http://portal.acm.org/dl.cfm
[2] Barroso, L.A., et al., Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003, pp. 22-28.
[3] Brin, S., Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://www-db.stanford.edu/~backrub/google.html
[4] Egothor Search Engine, http://java-source.net/open-source/search-engines/egothor
[5] Gospodnetic, O., Introduction to Text Indexing with Apache Jakarta Lucene, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html, 2003
[6] Liu, X., Maly, K., Zubair, M., Nelson, M.L., Arc: An OAI Service Provider for Cross Archive Searching, Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 24-28, 2001, pp. 65-66.
[7] Lucene, http://jakarta.apache.org/lucene/docs/index.html
[8] Maly, K., Zubair, M., Chilukumari, V., Kothari, P., GRID Based Federated Digital Library, submitted to Computing Frontiers, Ischia, Italy, May 2005.
[9] NSDL, http://faculty.kutztown.edu/vitz/NSDL/national_science_digital_librari.htm
[10] Nutch, http://www.nutch.org/docs/en/about.html
[11] Open Archives Initiative, http://www.openarchives.org/
[12] Oxyus Search Engine, http://java-source.net/open-source/search-engines/oxyus
[13] Webserver Stress Tool, http://www.paessler.com/webstress