A High Performance Implementation of an OAI-Based Federation Service

Kurt Maly, Mohammad Zubair, Li Xuemei
Old Dominion University, Norfolk, Virginia, USA
{maly, zubair, xli}@cs.odu.edu

Abstract

As the number of OAI-PMH compliant digital libraries grows, we are encountering significant performance bottlenecks. Our earlier experience implementing the searching and indexing for Arc [6] on a single processor using standard database technology such as Oracle does not scale. In this paper, we describe how a cluster of PCs can be used to improve indexing and searching performance for an OAI-based federation service. For parallel indexing and searching, we use the Lucene search engine. We evaluated our architecture on a total of 10 million harvested records on a Sun cluster, and we compare the performance of the new implementation with the existing single-processor implementation that uses Oracle 9i as the underlying database for searching and indexing. We demonstrate significant performance gains over the single-processor implementation.

1. Introduction

We are seeing exponential growth of online information, and digital libraries play a key role in managing this information explosion. A wide variety of digital libraries [9] exist today in terms of the type of information they manage. At one end of the spectrum are digital libraries managing unstructured information such as Web pages, popularly referred to as search engines [10][12][4]; at the other end are digital libraries managing structured information. These digital libraries differ in the services they provide to end users and in the collections they hold. Google [3] is an example of a search engine that harvests Web pages and provides a discovery service using keyword search. The ACM Digital Library [1] is an example of a digital library that stores refereed conference proceedings and journal articles along with metadata fields such as author and title; it provides discovery services over the various stored metadata fields.

A number of digital libraries managing structured information exist today. However, there is no federated service (like Google for Web sites) that provides a unified interface to all these libraries, which we believe is necessary for faster dissemination. The biggest obstacle to building a federated service is that many digital libraries use different, non-interoperable technologies. One major effort that addresses interoperability is the Open Archives Initiative (OAI) framework [11], which facilitates the discovery of content stored in distributed archives. The OAI framework supports data providers (archives) and service providers. Service providers develop value-added services based on the information collected from cooperating archives; such a service could take the form of a federated search engine like Arc [6]. A typical data provider is a digital library with no constraints on how it implements its services, with its own set of publishing tools and policies. The major addition is a layer that exposes its metadata (e.g., fields such as creator and title) in a well-specified format; normally, one of the fields is a link to the actual document in its collection. Assuming that a rapid increase (e.g., several orders of magnitude) in the adoption of OAI-PMH occurs, we now have a different problem: how to efficiently discover, harvest, and index the burgeoning OAI-PMH corpus.
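As a concrete illustration of this well-specified harvesting interface, the following minimal Java sketch issues an OAI-PMH ListRecords request against a hypothetical archive and prints the returned XML. The base URL is a placeholder of our own, and a real harvester would also parse the response and follow resumptionToken elements; only the verb and metadataPrefix parameters shown are part of the protocol.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Minimal sketch of an OAI-PMH harvest request (hypothetical base URL).
    // ListRecords with metadataPrefix=oai_dc returns Dublin Core records
    // (creator, title, identifier, ...) as a well-specified XML stream.
    public class OaiHarvestSketch {
        public static void main(String[] args) throws Exception {
            String baseUrl = "http://archive.example.org/oai";   // placeholder
            URL request = new URL(baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(request.openStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);   // raw XML; a real harvester would parse
            }                               // it and follow resumptionToken elements
            in.close();
        }
    }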
Currently our research group at Old Dominion University provides a federation service, Arc, pro bono publico. Since harvesting, indexing, and searching all run on the same server, performance is becoming a bottleneck and reliability is low. We are working on a project to improve performance and reliability by exploiting parallelism at all levels: harvesting, indexing, and searching. In another paper, we discuss how we use Grid technology to parallelize harvesting and improve performance on that part of the system [8]. In this paper, we focus on how a cluster of PCs can be used to improve indexing and searching performance for an OAI-based federated service. Our earlier experience implementing the searching and indexing for Arc [6] on a single processor using standard database technology like Oracle does not scale: the indexing process on Oracle for 7 million records takes around two days, and a query that results in a large number of hits takes several minutes. We propose an architecture to parallelize the indexing and searching process on a cluster of PCs. The two main factors that influence our design are load balance and data synchronization. Our solution to load balance is to provide each cluster node with the same number of records and run the indexing process on each node in parallel. For parallel indexing and searching, we use the open-source Apache Jakarta Lucene search engine. We evaluated our architecture on a total of 10 million harvested records on a Sun cluster of 11 Sun Fire V20z machines, each powered by two 2.4 GHz AMD Opteron processors. We compare the performance of the new implementation with the existing single-processor implementation that uses Oracle 9i as the underlying database for searching and indexing, and we demonstrate a significant performance gain.

2. Architecture

In Figure 1 we reproduce from the companion paper [8] the architecture of the DL Grid, our name for the parallel, high-performance federation service. Of concern for this paper is only the lower part, which shows the indexing, storing, and searching over a cluster of nodes that are not part of the Grid. Of importance, though, is the Metadata Collection/Transfer Service (MCS), the interface between the searching and harvesting components of the DL Grid. It receives records from the various harvesters asynchronously and needs to distribute them to the index nodes in a balanced fashion. We need to remember that a federation service is an ongoing, continuously changing collection: new records are published in the source digital libraries, and periodic harvests bring these new records to the federation. At the other end of the system is the search node, which interfaces the end user with the index nodes such that the latest, most up-to-date index is used and no interference with ongoing re-indexing occurs. We have designed our architecture with Lucene as the cornerstone.

[Figure 1. Overall architecture of DL Grid. D1-D3 are storage devices that store both metadata and indices; I1-I3 are indexing operations.]

Lucene is a free text-indexing and searching API written entirely in Java [5]. We combine the index server with the document server using the Lucene indexing API. Here, the index is a file that contains a list of terms and, for each term, pointers to the documents that contain it. In our application (searching metadata records harvested from a set of individual digital libraries), one metadata record is one document. The text of a field is tokenized and indexed, and is also stored in the index so that it can be returned along with the hits.
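A minimal sketch of this document model, using the Lucene 1.4-era API contemporary with the paper (the exact version is not stated there; the field names and index path are our own assumptions), might look as follows. Field.Text tokenizes, indexes, and stores the value, matching the behavior described above.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Sketch: index one harvested metadata record as one Lucene document.
    public class IndexOneRecord {
        public static void main(String[] args) throws Exception {
            // true = create a new index; pass false to append to an existing
            // shard, which is what incremental indexing would do
            IndexWriter writer =
                    new IndexWriter("/data/shard", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("identifier", "oai:example.org:42")); // not tokenized
            doc.add(Field.Text("title", "Parallel Indexing with Lucene"));
            doc.add(Field.Text("creator", "Doe, Jane"));
            doc.add(Field.Text("description", "A sample Dublin Core record."));
            writer.addDocument(doc);
            writer.optimize();   // merge segments for faster searching
            writer.close();
        }
    }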
The overall process of parallelizing both indexing and searching consists of four steps: (1) distribute the documents (metadata records) evenly among the available nodes; (2) index the documents locally at each node in parallel; (3) bind the search engine dynamically to the latest indexes; and (4) process queries in parallel. Figure 2 illustrates these steps; a more detailed description follows.

[Figure 2. Index and search cluster architecture: data distribution, Web server, and index servers (document servers)]

Record Distribution. The MCS node collects the records from the harvester nodes and has the task of load balancing the index nodes such that the indexes have approximately the same size. In the companion paper [8] we describe the design process behind the current choice of a single MCS node and how we may change it in the future, as it can become another bottleneck. Here we assume this one node, through which all records pass.

Data Synchronization. For indexing documents we run an index scheduler on each index node. It checks whether new data records have arrived from the MCS node. As new arrivals are detected, we incrementally re-index and add them to the existing index. Table 1 shows that the indexing time increases linearly with the number of documents (metadata records) on a single node.

Table 1. Scaling of indexes

  Indexed records   Index time (hours)   Index size
  10 million        9.8                  5.8 GB
  5 million         4.7                  2.9 GB
  2 million         1.9                  1.2 GB
  1 million         0.95                 626 MB

Search Service. When a user enters a query, the request is sent to each node of the cluster, where the query is processed locally against the index shard on that node. A shard is a random subset of the complete index of the federation. The results from each node are merged according to a relevance scoring algorithm. The search results are returned in the form of a Hits object, which acts as an array of hits; for each hit it provides a copy of the document (only the 'storable' fields) and the score of that hit [7].
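A minimal single-shard version of this step, again in the Lucene 1.4-era API (index path and field names are our assumptions), shows how the Hits object is consumed; the clustered version in Section 5 replaces the single IndexSearcher with remote searchables.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Sketch: run a query against one index shard and walk the Hits object,
    // which exposes the stored fields and relevance score of each hit.
    public class SearchOneShard {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/data/shard");
            Query query = QueryParser.parse("network", "description",
                                            new StandardAnalyzer());
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " matching records");
            for (int i = 0; i < Math.min(10, hits.length()); i++) {
                System.out.println(hits.score(i) + "  " +
                                   hits.doc(i).get("title"));  // stored field
            }
            searcher.close();
        }
    }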
3. Records Bookkeeping and Distribution

As outlined in Figure 1, the metadata collection service is responsible for distributing metadata records to the various indexing nodes. The two issues we need to address here are: (a) uniform distribution of records among the available indexing nodes, for load balancing of the search algorithms; and (b) bookkeeping for managing record updates. The second point needs more explanation. The OAI framework supports harvesting of updated records, which requires that the original record at the federated search engine be replaced by the updated metadata record. For this purpose, the metadata collection either needs to keep information about every record's location, or we need to develop techniques to easily identify the location of the record.

We considered several techniques for this problem. One naive way is to distribute the records uniformly using the round-robin method, along with a table that keeps track of the destination index node for each record. This technique requires every record to be checked (before shipping) against this potentially very large table to make sure that a record with the same id does not already exist; if a record with the identical id is in the table, the new record is shipped to that same indexing node (and not to the node assigned by the round-robin method). This process is clearly inefficient in terms of storage and time.

The second technique we explored uses a hash function to determine each record's destination (index) node. More specifically, we use record_id mod n, where record_id is typically the OAI ID, which is unique, and n is the number of index nodes. This technique can lead to a non-uniform distribution of records among the indexing nodes (because the hash function is not uniform over the identifiers). A sketch of this variant follows the table below.

The third technique we explored uses the round-robin method for uniform distribution but does not keep track of record destinations. Record updates are handled by a separate process, which, for every record coming from the incremental harvests, uses the search service of the federation to determine whether such a record already exists; if it does, the search service also returns the source index node for the record. The advantage of this method is that the update process can run asynchronously in the background and only needs to keep track of the limited number of records not yet processed.

We contrast these three methods in Table 2. In this table, n is the number of records, and T represents the RR/Search time for bookkeeping of record updates, i.e., the time used to search and determine whether a record already exists and on which node it resides.

Table 2. Contrast of the three methods

  Method      Uniform   Storage    Sync   Distribution    Update
  RR          Yes       Not good   Yes    O(n) or lower   O(n)
  Hash        No        Good       Yes    O(n)            O(1)
  RR/Search   Yes       Good       No     O(n) or lower   T

(Distribution refers to the distribution of records.)
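Of the three methods, the hash-based one is simple enough to sketch in a few lines. This is our own illustration, not code from the system; in particular, the OAI ID is a string, so we reduce it to a non-negative integer first, whereas the paper's record_id mod n presumes a numeric id.

    // Sketch of the second technique: derive the destination index node
    // directly from the record identifier, so updates to a record always
    // hash to the same node and no lookup table is required.
    public class HashDistributor {
        private final int numNodes;   // 'n', the number of index nodes

        public HashDistributor(int numNodes) {
            this.numNodes = numNodes;
        }

        public int destinationNode(String oaiId) {
            // mask the sign bit so the modulus is always non-negative
            return (oaiId.hashCode() & 0x7fffffff) % numNodes;
        }
    }

If the identifiers are skewed (for example, one large repository dominating the harvest with a common prefix), the buckets can be uneven, which is exactly the non-uniformity noted above.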
4. Data Synchronization

The complete index of the federation is partitioned into as many shards as there are index nodes in the cluster. Figure 3 shows the process of indexing and binding. Our design follows several guidelines:

▪ Incremental indexing is better than re-indexing from scratch (our experiments confirm this).
▪ Searching must not block while the index is being updated, and vice versa.
▪ The number of index shards on each cluster node should be minimal.

Based on these requirements, we redesigned our scheme to use only one index shard per node. The IndexScheduler and the Search Engine are bound to the same index shard. When a new set of records arrives, the IndexScheduler incrementally indexes it into that shard. Additionally, there is an "indexed" directory on each cluster node: after the IndexScheduler completes its indexing task, it moves the new set of records into the "indexed" directory. To serve fresh data, the Search Engine watches the index shard for changes and automatically rebinds if it finds any update. During indexing and dynamic rebinding, service remains uninterrupted and all parts of the index remain available.

[Figure 3. Indexing and search engine binding in each cluster node]

5. Search Service

The search interface supports both simple and advanced search, with sorting by specific fields. Advanced search currently allows a user to search free text in the title, author, and description fields using the boolean operators AND and OR. Figure 4 shows the query serving architecture. In Figure 4, searchers.xml refers to a configuration file that stores information about all remote machines. When the search service is started, it reads from this file the information about the remote search cluster nodes, including their IP addresses and the ports on which their search engines run, and uses this information to connect to the remote machines via RMI. We implemented parallel multiple searches using the Lucene API. The user interface accepts a user's query and calls JSP/Servlet applications. The parallel multiple searcher then runs a parallel search over a set of remote 'searchables', which are bindings on the local index and document file systems. The matches returned from each search node are sorted by the specified field and merged by relevance score (or whatever sorting the user requested) before being displayed to the user.

[Figure 4. Query serving architecture]
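Under the same Lucene 1.4-era API, the search node's fan-out can be sketched as below. The host names and RMI binding name are our assumptions, standing in for what searchers.xml would supply; ParallelMultiSearcher queries all shards concurrently and merges the hits by relevance score.

    import java.rmi.Naming;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    // Sketch: look up the remote Searchable each index node exports over RMI,
    // then search all shards in parallel and read the merged result list.
    public class FederatedSearchSketch {
        public static void main(String[] args) throws Exception {
            // Each index node would export its shard beforehand, e.g.:
            //   Naming.rebind("//nodeN/shard",
            //       new RemoteSearchable(new IndexSearcher("/data/shard")));
            String[] hosts = {"node1", "node2", "node3"};   // from searchers.xml
            Searchable[] shards = new Searchable[hosts.length];
            for (int i = 0; i < hosts.length; i++) {
                shards[i] = (Searchable) Naming.lookup("//" + hosts[i] + "/shard");
            }
            ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
            Query query = QueryParser.parse("network", "description",
                                            new StandardAnalyzer());
            Hits hits = searcher.search(query);   // one merged, score-ordered list
            System.out.println(hits.length() + " matches across all shards");
        }
    }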
6. Experimental Results

To test our architecture we developed a test bed with 10 index nodes and one search node on a Sun cluster of 11 Sun Fire V20z machines, each powered by two 2.4 GHz AMD Opteron processors. We used the comprehensive HTTP client/server test application Webserver Stress Tool [13] to test the performance of our search engine. Table 3 shows the search times of our DL Grid. For comparison we show the times of the search engine Arc, a centralized federation service in our digital libraries holding a total of 7.1 million metadata records. Arc is a single-processor implementation that uses Oracle 9i as the underlying database for searching and indexing. We translated the metadata records in the Arc database directly to XML files that can be indexed by our DL Grid indexer (the harvester part of the system was not yet ready when we performed the experiments). During this conversion we had to discard a significant number of records that could not be parsed by our XML parser (we did not try to fix these records, as we were only interested in collections of comparable order of magnitude). We obtained about 5 million records from this simple harvest. Because of the different index sizes, the comparisons are only approximate. We selected two keywords, "network" and "john", whose result sets are close in size in both systems: for "network" the matches for Arc and DL Grid are both about 25K, and for "john" both are about 250K.

Table 3. Search time comparison between Arc and DL Grid

  System (total records)   Keyword   Run   Display all records (s)   Display count only (s)
  Arc (7.1 million)        network   1     487.1                     3.1
                           network   2     329.4                     3.3
                           john      1     1110.9                    16
                           john      2     832                       9.8
  DL Grid (5 million)      network   1     16.1                      0.1
                           network   2     14.3                      0.1
                           john      1     432.1                     0.1
                           john      2     125                       0.1

Table 3 shows, for each keyword, two separate searches and the times both for displaying the entire result set and for displaying only how many records are in the result set. The speedup from using the DL Grid is not just a factor of 10 (the number of index nodes) but significantly higher, a clear indication that the single Oracle database system is at the limit of its performance.

We wanted to see whether the index size has a significant impact on the search time, so we submitted a number of different queries to instances of the system where we distributed the 5 million records to one, two, and four cluster nodes. Table 4 shows the results for computing only the number of matching records rather than displaying them. To avoid distortion through caching, we list only the first search time for every keyword. We find that only when the result set becomes large do we see a scaling effect of decreasing search time as nodes are added.

Table 4. Search time by number of nodes on the same 5 million records

  Keyword     Matches   Search time (ms)
                        1 node   2 nodes   4 nodes
  maly        165       54       75        44
  dynamical   11,259    124      74        44
  network     25,282    141      108       65
  computer    110,080   210      163       86
  john        205,453   229      179       90

We did find that the search time increases with the number of records in the index. Table 5 shows the search time (and the number of records matched) as a function of the size of the index on one node.

Table 5. Search time as a function of index size on one node

              2.5 million           5 million             10 million
  Keyword     Matches   Time (s)   Matches   Time (s)    Matches   Time (s)
  maly        54        0.06       165       0.05        300       0.07
  dynamical   4,836     0.09       11,259    0.12        22,519    0.14
  computer    49,844    0.16       110,080   0.21        220,228   0.34

7. Conclusions and Future Work

We have demonstrated that we can successfully realize a large federated digital library with over 10 million records and more than acceptable performance. The total system uses Grid technology for harvesting records from over 200 individual collections and Lucene to implement a parallel search on a cluster of PC nodes. In this paper we have discussed the issues involved in integrating the two parts, and the issues involved in parallelizing a changing set of indexes in which documents may need to be updated, and thus located, before re-indexing. The performance increase we obtained in the demonstration test bed was a factor of 30, and it promises to scale well beyond 10 million records. We plan to investigate the issues arising from node failure in order to provide a more reliable system; we also need to improve the relevance scoring. A cluster web service is a complex system and much remains to be done. Our next goals are:

1. Create a new index pool that can tolerate node failure. The same index shard should be stored on different machines, with a pool of machines serving the requests for each shard [2]; this method is used by the Google search engine. Each request chooses a machine within a pool using an intermediate load balancer; in other words, each query goes to one machine (or a subset of machines) assigned to each shard. If a shard's replica goes down, the load balancer avoids it and chooses another machine holding the same shard (see the sketch after this list).

2. One of the most important measures of a search engine is the quality of its search results. We currently use Lucene's default relevance scoring, which is based on term and document frequencies [7]; it is not sufficient for our application. We are adjusting the order of results according to our own application's requirements, and we are also trying to create our own ranking algorithm to obtain more accurate results where possible.

A cluster search engine is a very rich research environment. We need smart data synchronization algorithms, cluster management, and load balancing methods.
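The planned replica pool could be as simple as the following sketch; this is entirely our own illustration of the cited Google-style scheme [2], not code from the system, and the liveness probe is a placeholder.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the planned index pool: every shard is served
    // by a pool of replica machines, and queries rotate over live replicas.
    public class ShardPool {
        private final List replicas = new ArrayList();   // hostnames
        private int next = 0;

        public void addReplica(String host) { replicas.add(host); }

        // Round-robin over replicas, skipping any that fail a liveness probe.
        public String pickReplica() {
            for (int tried = 0; tried < replicas.size(); tried++) {
                String host = (String) replicas.get(next);
                next = (next + 1) % replicas.size();
                if (isAlive(host)) return host;
            }
            throw new IllegalStateException("all replicas of this shard are down");
        }

        private boolean isAlive(String host) {
            return true;   // placeholder: a real probe would ping the RMI endpoint
        }
    }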
8. References

[1] ACM Digital Library, http://portal.acm.org/dl.cfm
[2] Barroso, L.A., et al., Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003, pp. 22-28.
[3] Brin, S., Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://www-db.stanford.edu/~backrub/google.html
[4] Egothor Search Engine, http://java-source.net/open-source/search-engines/egothor
[5] Gospodnetic, O., Introduction to Text Indexing with Apache Jakarta Lucene, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html, 2003
[6] Liu, X., Maly, K., Zubair, M., Nelson, M.L., Arc: An OAI Service Provider for Cross-Archive Searching, Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 24-28, 2001, pp. 65-66.
[7] Lucene, http://jakarta.apache.org/lucene/docs/index.html
[8] Maly, K., Zubair, M., Chilukumari, V., Kothari, P., GRID Based Federated Digital Library, submitted to Computing Frontiers, Ischia, Italy, May 2005.
[9] NSDL, http://faculty.kutztown.edu/vitz/NSDL/national_science_digital_librari.htm
[10] Nutch, http://www.nutch.org/docs/en/about.html
[11] Open Archives Initiative, http://www.openarchives.org/
[12] Oxyus Search Engine, http://java-source.net/open-source/search-engines/oxyus
[13] Webserver Stress Tool, http://www.paessler.com/webstress