Automated Metadata Extraction for Personal Publishing in Kepler

Kurt Maly
Computer Science Department, Old Dominion University
Norfolk, VA 23529, USA
maly@cs.odu.edu

Mohammad Zubair
Computer Science Department, Old Dominion University
Norfolk, VA 23529, USA
zubair@cs.odu.edu

Abstract: In this paper, we report on our experience with creating an automated, human-assisted process to extract metadata from documents in Kepler, a personal publishing tool. Kepler was developed to give individual publishers a digital library, called an archivelet, that installs in minutes even on a home machine. Kepler also supports communities: all archivelets that share a community interest, such as high school mathematics teachers in Virginia, have their metadata federated on a community server. The metadata of all these archivelets is then searchable by the general public. To make this process even more enticing to publishing researchers, we have added modules that automatically extract the metadata from the PDF file that constitutes the actual document in the archivelet. Thus the publishing process becomes even simpler than in the traditional Kepler archivelet. We believe an easy-to-use approach for building focus communities will help educators in a domain to come together and share educational material.

Introduction

In the last decade, literally thousands of digital libraries have emerged. One of the biggest obstacles to disseminating information to a user community is that many digital libraries use different, proprietary technologies that inhibit interoperability. Building interoperable digital libraries will allow communities to share information across institutional and geographic borders. One major effort that addresses interoperability is the Open Archives Initiative, which has developed a framework to facilitate the discovery of content stored in distributed archives [Lagoze 2005].
One of the efforts in this direction is the Kepler project [Kepler 2006, Maly 2003], which gives publication control to individual publishers, supports rapid dissemination, and addresses interoperability. In Kepler, the OAI Protocol for Metadata Harvesting (OAI-PMH) is used to support "personal data providers", or "archivelets". Individual publishers can be integrated with an institutional repository such as DSpace by means of a Kepler Group Digital Library (GDL). The GDL aggregates metadata and full text from archivelets and can act as an OAI-compliant data provider for institutional repositories. The Kepler framework is well suited to disseminating information among education communities. It enables the building of focus communities that help educators in a domain to come together and share educational material.

In this paper, we focus on enhancing the Kepler archivelet to enable automatic ingestion of metadata, thus making it more enticing to individual publishers. For this, we have added modules that automatically extract the metadata from the PDF file that constitutes the actual document in the archivelet. The approach to metadata extraction is based on templates, each of which describes a set of rules for one document class. An engine understands the template language and applies those rules to a document to extract the metadata. In the next two sections we describe the Kepler framework and the metadata extraction approach we have used for automatically populating the metadata for an individual publisher.

Kepler Archivelet

Here, we summarize the Kepler project, which has been reported in [Kepler 2006, Maly 2003]. The Kepler archivelet implementation supports Dublin Core (DC) and the OCLAC format, and can easily be extended to other formats. The Kepler design has a well-defined API specification that defines the functions each module implements and makes available to the other modules. Support for new metadata formats requires just the implementation of a metadata driver module.
In the Kepler software documentation, we provide developer guidelines for fast and easy implementation of a metadata driver. The following subsections describe the components shown in Fig. 1.

Figure 1: Archivelet Architecture

Webserver. This module creates a server socket and listens for OAI-PMH requests and full-text document requests. If the request is for a document, the document is retrieved from the data folder and served to the requester. If it is an OAI-PMH request, it is parsed, the appropriate method from the OAI-PMH API of the Metadata Manager is invoked, and the results are returned. The server is controllable by the user; at any time the user can turn it on or off. This allows users to control when their collection is available for harvesting.

Kepler User Interface. This module creates the main archivelet user interface, which allows the user to publish and edit metadata, view a list of published items, view full-text documents, start and stop the OAI-PMH server, and perform other configuration tasks such as changing the server port and registering with Kepler groups. It communicates with the Metadata Manager through the UI API.

Metadata Manager. This module is responsible for instantiating the various metadata drivers for the system. It also implements the OAI-PMH API, which provides a method for each of the six OAI-PMH verbs. OAI-PMH requests received by the Webserver module are forwarded to the Driver Manager, which decides which metadata drivers are involved and invokes these drivers to get a partial response from each. The Driver Manager then constructs the whole response from these partial responses. The Driver Manager also implements a User Interface API. This API contains methods that are invoked in response to user interactions with the main interface.
For example, when the user clicks “publish”, the Driver Manager brings up a simple GUI that allows the user to select which metadata format she wants to use, and then invokes the appropriate Driver to display the corresponding publishing tool.

Metadata Driver. This module implements the OAI-PMH processing and the user interface functions, such as publishing tools, for the specific metadata format that the Driver handles. The publishing interface has dynamic field types (mandatory or optional), which are determined by a configuration file, based on an XML schema, that is managed by the group server administrator. Whenever new metadata is published, the Driver invokes the Validation module to validate the metadata against the constraints specified in the configuration file, and it uses the repository API to store metadata and files.

Validation. This module exists in every driver and is responsible for periodically downloading the configuration files from the Kepler group server (in this implementation we download the file whenever the user clicks on publish). These configuration files contain regular expressions for the various metadata elements, values for the drop-down lists of elements that use predefined options (e.g., language), and the mandatory/optional status of every metadata element. The Validation module API is invoked whenever the user publishes or edits metadata and ensures that the metadata meets the constraints specified in the configuration files.

Repository. This module provides access to the local collection. Whenever the user publishes new metadata or edits existing metadata, the repository module writes this metadata to the collection as XML and also uploads the full text specified by the user. There are two implementations: a File-System Repository and a Database Repository. The File-System Repository is used in the traditional archivelet and the Database Repository is used in the server-side archivelet.
For the Database Repository, only the metadata is stored in the database and the full text is uploaded to a folder on the server machine. For the File-System Repository, both the metadata and the full text are stored in a folder in the file system.

Metadata Extraction

The state of the art in automatic metadata extraction is at the same time quite advanced and limited [Bergmark 2000, Seymore 1999]. Individual methods such as SVM, HMM, or rule-based methods work well by themselves for specific domains, typically fairly homogeneous collections of documents. They would all fail rather badly in the environment we are targeting: growing collections of a very heterogeneous nature.

The overall architecture of our approach is shown in Fig. 2. The input to the system consists of PDF files in which the text may either be encoded as strings (with font and layout information) or appear as scanned images. The OCR module produces an XML format that depends on the OCR software used. The OCR output is first fed to the document representation block, which generates the document representation from the native format of the OCR engine.

The classification stage is our primary mechanism for coping with static heterogeneity and for detecting and reacting to dynamic heterogeneity. The purpose of this stage is to identify a class, or a small set of classes, of similar documents for which a common procedure for metadata extraction has been devised. If classification is successful, the document model is passed on to the metadata extractor for the selected homogeneous class(es).

In the literature, as well as in our own experience, rule-based systems have been shown to work extremely well for homogeneous collections. We propose a template-based approach to encoding what we hope will be, thanks to successful classification, a relatively straightforward set of extraction rules. For example, a rule might say: ‘If the phrase is centered, bold and in the largest font, it is a title’.
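
The quoted rule can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the template language or extraction engine used in the system: the `Phrase` record, the `is_centered` tolerance, the dictionary-based template, and the sample page are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Phrase:
    """One text run from the document representation (hypothetical shape)."""
    text: str
    font_size: float
    bold: bool
    left: float   # x-coordinate of the left edge
    right: float  # x-coordinate of the right edge


def is_centered(p: Phrase, page_width: float, tolerance: float = 0.05) -> bool:
    """True if the phrase midpoint lies within `tolerance` of the page midpoint."""
    midpoint = (p.left + p.right) / 2
    return abs(midpoint - page_width / 2) <= tolerance * page_width


def extract_title(phrases: List[Phrase], page_width: float) -> Optional[str]:
    """Rule: a phrase that is centered, bold, and in the largest font is the title."""
    if not phrases:
        return None
    largest = max(p.font_size for p in phrases)
    for p in phrases:
        if p.bold and p.font_size == largest and is_centered(p, page_width):
            return p.text
    return None


Rule = Callable[[List[Phrase], float], Optional[str]]

# A template for one document class: field name -> rule (assumed format).
TEMPLATE_CLASS_1: Dict[str, Rule] = {"title": extract_title}


def apply_template(template: Dict[str, Rule], phrases: List[Phrase],
                   page_width: float) -> Dict[str, str]:
    """Run every rule in the template and keep the fields that matched."""
    metadata: Dict[str, str] = {}
    for field, rule in template.items():
        value = rule(phrases, page_width)
        if value is not None:
            metadata[field] = value
    return metadata


# Made-up first page of a report, 612 units wide.
page = [
    Phrase("Technical Report", 10.0, False, 400.0, 560.0),
    Phrase("Automated Metadata Extraction", 18.0, True, 156.0, 456.0),
    Phrase("Abstract", 12.0, True, 72.0, 130.0),
]
result = apply_template(TEMPLATE_CLASS_1, page, page_width=612.0)
print(result)
```

In this sketch the rules live in a per-class table that is separate from `apply_template`, so a rule can be added or replaced for one document class without touching the code that applies it.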
A template language allows changes to existing rules, or the addition of new rules to the template for a class, without having to modify the extraction engine.

[Figure 2 diagram: the OCR output of scanned documents is passed to document classification; a document in class 1, 2, ..., N is passed to metadata extraction using Template 1, 2, ..., N, which produces the metadata.]

Figure 2: Template-based Metadata Extraction

Integration of Metadata Extraction with Kepler Archivelet

We have completely automated the metadata extraction scheme described in Section 3, so that it is invoked whenever a PDF file is placed in its input directory. It processes the PDF file and writes the extracted metadata in the OAI-PMH Static Repository [Hochstenbach 2003] format, and we have modified the Kepler import feature so that this output location is its default location. The use of a Static Repository Gateway allows a harvester to harvest a static repository as if it were a regular OAI-PMH data provider. We then use the import support of the archivelet to ingest the extracted metadata. We have created a set of templates that recognize most technical report formats currently in use at Computer Science departments and use them as the default templates for the extraction engine. We have also modified the Kepler Metadata Editor so that it displays by default any metadata imported through the Extractor.

To take advantage of the extraction system, a user places the PDF file of the report they want to publish in the input directory of the extraction system before starting the Kepler archivelet. By the time the archivelet is running, the extraction process will have completed and its output will be available at the Static Repository server. The user then clicks on import before going to the metadata interface. In the metadata editor, all the extracted metadata is displayed in the proper fields.
The user can correct any errors in the output of the automated system directly in the fields. In our testing with real technical reports, the rate of correctly extracted metadata fields approaches 95%.

Conclusion

In this paper, we have proposed an approach that makes it easier for individual publishers to create and maintain their own OAI-compliant repository. This was made possible by integrating metadata extraction technology with the Kepler framework. For such tools to gain broad acceptance among high school teachers, it is critical that they be easy to use. With its easy-to-use features, the Kepler framework enables the formation of a group digital library in a domain for sharing educational material. For example, with the Kepler framework, high school math teachers in Virginia can form a community to share their teaching experiences and course materials. In future work, we need to address access and security restrictions. The current system allows anyone to access and share information, which makes it inappropriate for sharing some educational material, such as solutions to class exercises.

References

Bergmark, D. Automatic Extraction of Reference Linking Information from Online Documents. CSTR 2000-1821, November 2000.
Hochstenbach, P., Jerez, H., Van de Sompel, H. The OAI-PMH Static Repository and Static Repository Gateway. Proc. of the third ACM/IEEE Joint Conference on Digital Libraries, Houston TX (2003) 210-217.
Kepler, 2005. http://kepler.cs.odu.edu/
Lagoze, C., Sompel, H., Nelson, M., and Warner, S. The Open Archives Initiative Protocol for Metadata Harvesting. 2003. Retrieved April, 2005, from http://www.openarchives.org/OAI/openarchivesprotocol.html.
Maly, K., Nelson, M., Zubair, M., Amrou, A., Kothamasa, S., Wang, L., Luce, R., 2003. Light-Weight Communal Digital Libraries. In Proc. of the fourth ACM/IEEE Joint Conference on Digital Libraries, Tucson AZ, pp. 237-238.
Seymore, K., McCallum, A., and Rosenfeld, R. Learning hidden Markov model structure for information extraction.
In AAAI Workshop on Machine Learning for Information Extraction, 1999.