
Integrating Legacy Collections into an Interoperable Digital Library Framework

J. Tang, K. Maly, and M. Zubair

Old Dominion University, Computer Science Department

Norfolk VA 23529 USA

{tang_j,maly,zubair}@cs.odu.edu


Abstract

With the growth of the Internet and related tools, there has been an exponential growth of online resources. In particular, with the availability of high-quality OCR tools, it has become easy to convert an existing corpus into digital form and make it available online. However, a number of organizations have legacy collections that lack metadata, which hampers not only their discovery and dispersion over the Web, but also their interoperability with other collections. Unfortunately, manual metadata creation is expensive and time-consuming for a large collection. In this paper, we propose a new approach to extract metadata automatically from a large heterogeneous legacy collection, such as the DTIC (Defense Technical Information Center) collection, consisting of hard-copy or scanned image documents. In our approach, we first classify documents into equivalence classes. Next, a set of rules, or a template, is created using an XML-based language for each document class. By decoupling rules from code and representing them in an XML format, our approach is easy to adapt to a different document class. We also integrate two machine-learning techniques, SVM (Support Vector Machine) and HMM (Hidden Markov Model), to improve system performance. We have evaluated our approach on a testbed consisting of documents from the DTIC collection, and our results are encouraging. We also demonstrate how the extracted metadata can be utilized to create an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) compliant interoperable digital library.


Integrating Legacy Collections into an Interoperable Digital Library Framework

With high-quality OCR tools, it has become easy to convert an existing corpus into digital form and make it available online. However, the lack of metadata for these resources hampers their discovery over the Web. For example, in a collection of documents with metadata, a computer scientist is able to search for papers written by “Kurt Maly” since 2003. With full-text searching alone, resources with these characteristics may be buried by irrelevant resources, such as resources about Kurt Maly. Doane (as cited in Crystal & Land, 2003) estimated that a company might save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching, verifying, and organizing files. In addition, using metadata such as Dublin Core (DC Metadata Set, 2003) can make collections interoperable with the help of OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). OAI-PMH is a framework to provide interoperability among distributed repositories (Lagoze, Sompel, Nelson, & Warner, 2003). It defines a common format for metadata exchange that is independent of the underlying database. OAI-PMH is becoming widely accepted, and many archives are already, or soon will be, OAI-PMH compliant.

However, creating metadata for a large collection manually is an extremely time-consuming task. Rosenfeld (as cited in Crystal & Land, 2003) estimates that it takes about 60 employee-years to create metadata for 1 million documents. Thus, it becomes essential to explore approaches for automated metadata extraction for large collections.

The two main approaches to extracting metadata automatically are the rule-based approach and the machine-learning approach. In the first approach, we process documents for metadata extraction according to a fixed set of rules that are defined by observing the document set. Most existing rule-based systems use a single rule base for the whole document set. This restricts the approach to a specific set of documents with a common layout and structure. A machine-learning approach learns from samples in a collection (documents for which metadata has been extracted manually) and uses this knowledge to extract metadata for other documents in the collection. A machine-learning approach, in contrast to a rule-based approach, easily adapts to a new set of documents. However, to achieve high accuracy, it requires a large number of sample documents, and more so when the document set is not homogeneous. Reaching desirable accuracy for a large heterogeneous collection with either approach is still a challenge.

In this paper, we propose a new approach in which we first classify documents into classes based on similarity. We create a set of rules, or a template, for the documents in each class. By decoupling rules from code and representing them in an XML format, our approach is easy to adapt to a different document class. We also integrate two machine-learning techniques, SVM (Support Vector Machine) and HMM (Hidden Markov Model), to improve system performance, exploiting SVM's ability to handle a very large number of features and HMM's ability to extract patterns from a sequence. We have evaluated our approach on a testbed consisting of documents from the DTIC (Defense Technical Information Center) collection, and our results are encouraging. We also demonstrate how the extracted metadata can be integrated with an interoperable digital library framework based on OAI (Open Archives Initiative).

Background

OAI and Digital Library

A Digital Library (DL) is a network-accessible and searchable collection of digital information (Lesk, 1997). OAI is an initiative that aims to provide interoperability among heterogeneous digital libraries using different technologies and metadata schemas. In the OAI-PMH framework, there are two kinds of participants: Data Providers (DPs) and Service Providers (SPs). DPs provide metadata in response to OAI-PMH requests. SPs harvest metadata from DPs and provide high-level services based on the aggregated metadata. Each DP has to support at least the Dublin Core (DC) metadata format. OAI-PMH is based on HTTP, and the metadata exchanged between Data Providers and Service Providers is encoded in XML (Lagoze, Sompel, Nelson, & Warner, 2003).
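To make the protocol concrete, a harvesting request is an ordinary HTTP GET whose query string carries the protocol verb and metadata format. The sketch below builds such a request URL; the endpoint URL is made up for illustration, but the verb, metadataPrefix, and resumptionToken parameters are part of the OAI-PMH specification.

```python
from urllib.parse import urlencode

def build_listrecords_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    Every data provider must support the oai_dc (Dublin Core) format;
    a resumptionToken pages through large result sets and, per the
    protocol, is an exclusive argument.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint, for illustration only.
url = build_listrecords_url("http://example.org/oai")
```

A service provider issues such requests repeatedly, following resumption tokens, until the data provider signals that the result set is exhausted.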

Integrating Legacy Collections into an Interoperable Digital Library Framework 5

Metadata Extraction Approaches

Metadata are essential for creating an OAI-compliant repository. In this section, we give a brief review of the two main automated metadata extraction approaches: the rule-based approach and the machine-learning approach.

In the rule-based approach, we use rules to specify how to extract metadata from a collection of documents. The rules are mainly based on layout structure and other visual cues. Researchers in the past have developed rule-based systems that work mostly for a specific document set. Bergmark (2000) developed a tool for the PDF files from ACM, and Kim, Le and Thoma (2003) created a rule-based system for medical journals. Some efforts have targeted heterogeneous collections: Klink, Dengel and Kieninger (2000) showed that their system worked for two different types of documents. They manually created a set of rules, used fuzzy techniques to match the rules, and produced probabilistic results.

A machine-learning approach learns from samples in a collection and uses this knowledge to extract metadata for other documents in the collection. Seymore, McCallum and Rosenfeld (1999) used Hidden Markov Models to extract metadata from text document headers. Han, Giles, Manavoglu, Zha, Zhang and Fox (2003) used Support Vector Machines (SVM) to extract metadata from the same data set. Seymore et al. (1999) and Han et al. (2003) reported overall accuracies of 90.1% and 92.9%, respectively.

HMM and SVM

An HMM is a probabilistic model of a sequence of events. In an HMM, the system starts in a state, transits from state to state, and emits a symbol in each state. The underlying states cannot be observed directly, i.e., they are hidden. HMMs are powerful for finding patterns in sequences: given a sequence of observation symbols, we can compute the most probable sequence of hidden states.
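For intuition, the sketch below scores one candidate hidden-state path against a sequence of observed symbols; the two states and all probabilities are illustrative toy values, not parameters from our system. Decoding then amounts to taking the argmax of this quantity over all paths (the Viterbi algorithm).

```python
def sequence_probability(path, symbols, start_p, trans_p, emit_p):
    """Joint probability of one hidden-state path and the observed symbols."""
    prob = start_p[path[0]] * emit_p[path[0]][symbols[0]]
    for i in range(1, len(symbols)):
        # Multiply in the transition to the next state and its emission.
        prob *= trans_p[path[i - 1]][path[i]] * emit_p[path[i]][symbols[i]]
    return prob

# Illustrative two-state model for document headers (numbers made up).
start_p = {"title": 0.9, "author": 0.1}
trans_p = {"title": {"title": 0.2, "author": 0.8},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"bold": 0.8, "name": 0.2},
          "author": {"bold": 0.1, "name": 0.9}}

p = sequence_probability(["title", "author"], ["bold", "name"],
                         start_p, trans_p, emit_p)
```

Here the path "title then author" for the observations "bold then name" scores 0.9 × 0.8 × 0.8 × 0.9; a decoder would compare such scores across all candidate paths.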

SVM is a statistical model proposed by Vapnik (1995). It is widely used in pattern recognition (Burges, 1998). Joachims (2002) successfully applied it to text categorization, and Han et al. (2003) used it to extract metadata from document headers. The basic idea of SVM is to find, from pre-classified data, a hyperplane that separates two classes with the largest margin, and then to classify new data based on which side of the hyperplane it falls.

Figure 1 shows an example of determining whether a text string in a document header is a title. The plus and minus symbols represent the training data in this feature space: a plus symbol indicates that the sample is a title, and a minus symbol indicates that it is not. One key strength of SVM is that it can solve problems with very large feature sets.
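Once the hyperplane is known, the decision rule itself is simple: a sample is classified by the sign of its signed distance to the hyperplane. A minimal sketch, with a hypothetical two-feature space and made-up weights (finding the maximum-margin weights is the training step, omitted here):

```python
def svm_decide(weights, bias, features):
    """Classify by the sign of the signed distance w.x + b to the hyperplane."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "title" if score >= 0 else "not-title"

# Hypothetical feature space: [uses_largest_font, line_contains_digits].
w, b = [2.0, -1.0], -0.5
```

For instance, a line in the largest font with no digits lands on the positive side of this toy hyperplane and is labeled a title.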

Overall Architecture

In this paper, we propose a system to automate the task of converting an existing corpus into an OAI-compliant repository. The system architecture is shown in Figure 2. To start with, hard copies are scanned and stored as PDF files, which are then processed by OCR software (in our case, ScanSoft's OmniPage Pro 14.0) to create XML files that preserve formatting and layout information. Next, we extract metadata from these XML files and load it into a database (in our case, Oracle 9). We add a software layer to this database to make it OAI-PMH 2.0 compliant. We also implement a simple search engine for local search, so users can search the metadata database and access the original documents.

Metadata Extraction Approach

Our metadata extraction approach is shown in Figure 3. It integrates the rule-based approach (implemented in the template-based module) with a machine-learning approach. The proposed architecture has a feedback loop, which refines the metadata extraction techniques to improve the accuracy of the results.

Template-based Module

Instead of creating rules for a large collection of heterogeneous documents, we first classify documents into document classes based on similarity and then create a template for each document class. In this paper we forego discussion of the process of partitioning a large set of documents into equivalence classes. Suffice it to say that this in itself is a difficult problem.


In our template-based module, we decouple rules from code by keeping our templates in separate XML files. This makes our approach easy to adapt to different document classes. It also reduces rule errors: a template for a single document class is much simpler than a rule base for all documents, so the chance of rule errors is lower. We can apply a template to sample documents, check the results, and refine the template if necessary. In this way, we can correct errors early.
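To make the idea concrete, a template might encode each metadata field as a pattern over the text of the document's first page. The tag names and pattern language below are a hypothetical sketch, not our actual template schema; the point is that the rules live in data, so a new document class needs only a new XML file, not new code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template for one document class; each field carries a regex rule.
TEMPLATE = """
<template class="afrl">
  <field name="title" pattern="(?m)^TITLE:\\s*(.+)$"/>
  <field name="creator" pattern="(?m)^AUTHOR:\\s*(.+)$"/>
</template>
"""

def apply_template(template_xml, page_text):
    """Apply each field's rule to the page text and collect the matches."""
    metadata = {}
    for field in ET.fromstring(template_xml).findall("field"):
        m = re.search(field.get("pattern"), page_text)
        if m:
            metadata[field.get("name")] = m.group(1).strip()
    return metadata

page = "TITLE: Integrating Legacy Collections\nAUTHOR: J. Tang"
md = apply_template(TEMPLATE, page)
```

Refining a template then means editing the XML file and re-running the module on sample documents, which is how errors can be caught early.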

Integrating SVM with HMM

SVM can work with a very large number of features and hence is well suited to information extraction. However, SVM requires the feature set to be defined in advance, which makes it hard to exploit correlated features. For example, SVM cannot use a feature such as "the section before an author section is most probably a title section" when classifying a section as a title section. HMM, on the other hand, is good at finding sequence patterns but consumes many resources when it works with a large number of unique symbols.

Based on this observation, we propose to integrate SVM with HMM. In our approach, we first use SVM to classify each section into classes with associated probabilities. Meanwhile, we estimate the HMM state transition probabilities directly from samples. Then we use the HMM, along with its state transition matrix and the probabilities from SVM, to produce a final prediction.
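The combination can be sketched as Viterbi decoding in which the SVM's per-section class probability stands in for the HMM emission probability; every number below is illustrative, not taken from our experiments.

```python
def decode_sections(svm_probs, states, start_p, trans_p):
    """Label sequence maximizing SVM confidence x transition probability.

    svm_probs[t][s] is the SVM's estimated probability that section t
    belongs to class s; it plays the role of the emission probability.
    """
    best = [{s: start_p[s] * svm_probs[0][s] for s in states}]
    back = [{}]
    for t in range(1, len(svm_probs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * svm_probs[t][s], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    seq = [last]
    for t in range(len(svm_probs) - 1, 0, -1):
        seq.append(back[t][seq[-1]])
    return list(reversed(seq))

# Toy example: SVM is unsure about the second section on its own, but the
# transition "creator usually follows title" resolves the ambiguity.
states = ["title", "creator"]
start_p = {"title": 0.8, "creator": 0.2}
trans_p = {"title": {"title": 0.1, "creator": 0.9},
           "creator": {"title": 0.2, "creator": 0.8}}
svm_probs = [{"title": 0.9, "creator": 0.1},
             {"title": 0.5, "creator": 0.5}]
labels = decode_sections(svm_probs, states, start_p, trans_p)
```

This is the sense in which the HMM contributes the sequential context that the SVM, classifying each section independently, cannot express.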

Integration of a Rule-based Approach with a Machine-learning Approach

The motivation for integrating a machine-learning approach with our rule-based approach is to overcome two drawbacks of rule-based systems and thereby improve accuracy.

Lack of auto-correction ability: A rule-based system has difficulty correcting itself when an error occurs, because the rules are fixed.

Lack of statistical foundations: Integrating machine-learning techniques, especially statistical techniques such as SVM and HMM, can help reduce the bias caused by arbitrarily defined threshold values.


Test bed

We use one of DTIC's collections to build our test bed. This collection contains technical reports in PDF format that come from different organizations and have different layouts.

With the metadata extracted from these documents, we create a repository with an OAI layer that supports OAI-PMH requests from the Internet. Figure 5 shows a sample OAI-PMH response. We also implement a local search interface, shown in Figure 6.

Experiments

At this stage of our work, we have only experimented with the modules working independently, specifically the rule-based module and the SVM-based module.

SVM Experiments

We applied SVM to three data sets: the data set used by Seymore et al. (1999), 100 tagged first pages selected from the DTIC collection, and 33 tagged first pages with identical layout, also chosen from the DTIC collection. For simplicity, we call them Seymore935, DTIC100 and DTIC33, respectively. These three data sets differ in their degree of heterogeneity: documents in DTIC33 have identical layout, documents in Seymore935 are similar, and documents in DTIC100 are the most heterogeneous.

We used a feature extraction approach similar to that of Han et al. (2003), with LIBSVM (Chang & Lin, 2001) as our SVM tool. We measured overall accuracy in our experiments, i.e., the percentage of data classified correctly.

An accuracy of 95% for DTIC33, 93% for Seymore935, and 85% for DTIC100 indicates that the more heterogeneous the data set, the lower the overall accuracy.

Template-based Experiment

We manually assigned the documents in DTIC100 to document classes based on their layout information. After eliminating classes with fewer than three documents, we obtained seven document classes with a total of 73 documents. We then created one template for each document class and applied our template-based module to these documents. This approach is difficult to evaluate because the results depend heavily on how well the templates are designed. In our experiment, we evaluated the approach manually as follows. First, for each document class, we create the template based on our observation. Then we run our code on a document chosen at random and refine the template. After this, we apply the template to extract metadata from all documents in the class and use the results as our test results. We evaluated recall and precision and show the results in Table 1.
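Recall and precision here are the standard measures: the fraction of extracted values that are correct, and the fraction of true values that were extracted. Per field they reduce to simple counts, as in this sketch (the names in the toy example are invented):

```python
def precision_recall(extracted, correct):
    """Field-level precision and recall from sets of extracted vs. true values."""
    true_positives = len(extracted & correct)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall

# Toy example: one wrong creator extracted, one true creator missed.
p, r = precision_recall({"J. Tang", "K. Maly"}, {"J. Tang", "M. Zubair"})
```

A template that extracts a spurious value lowers precision; one that misses a value on the page lowers recall.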

The experimental results are promising: most fields are extracted with 100% precision and recall. The poorer results are mainly due to imprecise information provided by the OCR software we used.

Conclusions and Future Work

In this paper, we describe an approach to automate the task of converting an existing corpus of documents, available as hard-copy papers or PDF files, into an OAI-compliant repository. The task of metadata extraction becomes more difficult as the collection becomes more heterogeneous. In our approach, we first assign documents to groups based on similarity and create a template for the documents in each group rather than for all documents. We have applied our approach to a small test case and obtained promising results.

However, the data set is small and therefore insufficient for applying both the HMM and SVM techniques and, in particular, their integration with a rule-based approach. Based on our preliminary results for the rule-based and SVM methods, we shall create a larger test case and experiment with HMM and the combined method. Another open problem for future work is placing documents into equivalence classes automatically.


References

Bergmark, D. (2000). Automatic Extraction of Reference Linking Information from Online Documents. CSTR 2000-1821, November 2000.

Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Crystal, A., & Land, P. (2003). Metadata and Search: Global Corporate Circle DCMI 2003 Workshop. Seattle, Washington, USA. Retrieved April 3, 2004, from http://dublincore.org/groups/corporate/Seattle/

DTIC Collection (n.d.). Retrieved March 5, 2004, from http://stinet.dtic.mil/str/index.html

Dublin Core Metadata Element Set, Version 1.1: Reference Description (2003). Retrieved May 11, 2004, from http://dublincore.org/documents/dces/

Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z., & Fox, E. A. (2003). Automatic Document Metadata Extraction Using Support Vector Machines. 2003 Joint Conference on Digital Libraries (JCDL'03). Houston, Texas, USA.

Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Dissertation, Kluwer.

Kim, J., Le, D. X., & Thoma, G. R. (2003). Automated labeling algorithms for biomedical document images. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics, Vol. V (pp. 352-357). Orlando, FL.

Klink, S., Dengel, A., & Kieninger, T. (2000). Document structure analysis based on layout and textual features. Proc. Fourth IAPR International Workshop on Document Analysis Systems, DAS2000 (pp. 99-111). Rio de Janeiro, Brazil.

Lagoze, C., Sompel, H., Nelson, M., & Warner, S. (2003). The Open Archives Initiative Protocol for Metadata Harvesting. Retrieved April 3, 2004, from http://www.openarchives.org/OAI/openarchivesprotocol.html

Lesk, M. (1997). Practical Digital Libraries: Books, Bytes, and Bucks. California: Morgan Kaufmann Publishers.

Seymore, K., McCallum, A., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI Workshop on Machine Learning for Information Extraction.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Berlin: Springer.


Table 1

Rule-based Experimental Results

Template    # of Documents   Class Name    Precision   Recall
Afrl        5                Creator       85.71%      66.67%
Arl         5                Creator       100%        93.33%
                             Date          100%        96.67%
                             Title         100%        86.67%
Edgewood    4                Creator       100%        90.00%
                             Date          100%        100%
                             Title         100%        100%
Nps         15               Title         100%        100%
                             Creator       100%        100%
                             Contributor   100%        100%
                             Identifier    100%        100%
                             Right         100%        100%
Usnce       5                Title         100%        100%
                             Creator       100%        100%
                             Contributor   100%        100%
                             Date          100%        100%
                             Type          100%        100%
                             Identifier    100%        100%
Afit        6                Type          100%        100%
                             Date          100%        100%
                             Title         75.00%      100%
                             Creator       100%        100%
                             Contributor   100%        100%
                             Publisher     100%        100%
                             Identifier    100%        100%
Text        33               Date          100%        100%
                             Title         100%        83.33%
                             Creator       100%        100%
                             Identifier    100%        100%
                             Date          100%        100%
                             Title         100%        83.33%


Figures & Captions

Figure 1. SVM in Two-dimension Space

Figure 2. System Architecture

Figure 3. Metadata Extraction Approach

Figure 4. SVM Experimental Results

Figure 5. OAI Response

Figure 6. Search Interface
