Automated Metadata Extraction for Personal Publishing in Kepler

Kurt Maly
Computer Science Department,
Old Dominion University
Norfolk, VA 23529, USA
maly@cs.odu.edu
Mohammad Zubair
Computer Science Department,
Old Dominion University
Norfolk, VA 23529, USA
zubair@cs.odu.edu
Abstract:
In this paper, we report on our experience with the creation of an automated, human-assisted
process to extract metadata from documents in Kepler, a personal publishing tool. Kepler has
been developed to help individual publishers to have a digital library, called an archivelet, that
installs in minutes on even a home machine. Kepler also supports the concept of community
support such that all archivelets that share a community interest, such as high school mathematics
teachers in Virginia, have the metadata federated on a community server. The metadata of all these
archivelets are then searchable by the general public. To make this process even more enticing to
the publishing researchers we have added modules that will automatically extract the metadata
from the PDF file that constitutes the actual document in the archivelet. Thus the publishing process
becomes even simpler than in the traditional Kepler archivelet. We believe an easy-to-use approach
for building focus communities will help educators in a domain to come together and share
educational material.
Introduction
In the last decade literally thousands of digital libraries have emerged; one of the biggest obstacles for dissemination
of information to a user community is that many digital libraries use different, proprietary technologies that inhibit
interoperability. Building interoperable digital libraries will allow communities to share information across
institutional and geographic borders. One major effort that addresses interoperability is the Open Archives Initiative (OAI),
which has developed a framework to facilitate the discovery of content stored in distributed archives [Lagoze 2005].
One of the efforts in this direction is the Kepler project [Kepler 2006, Maly 2003], which gives publication control
to individual publishers, supports rapid dissemination, and addresses interoperability. In Kepler, the OAI Protocol for
Metadata Harvesting (OAI-PMH) is used to support "personal data providers" or "archivelets". The individual publishers can be integrated with an
institutional repository like DSpace by means of a Kepler Group Digital Library (GDL). The GDL aggregates
metadata and full text from archivelets and can act as an OAI-compliant data provider for institutional repositories.
The Kepler framework is ideal for disseminating information between education communities. It enables building of
focus communities that help educators in a domain to come together and share educational material.
In this paper, we focus on enhancing the Kepler archivelet to enable automatic ingestion of metadata, thus making it
more enticing to individual publishers. For this, we have added modules that automatically extract the
metadata from the PDF file that constitutes the actual document in the archivelet. The approach for metadata
extraction is based on templates that describe a set of rules for each document class. An engine understands the
template language and applies those rules to a document to extract the metadata. In the next two sections we describe
the Kepler framework and the metadata extraction approach we have used for automatically populating the metadata
for an individual publisher.
Kepler Archivelet
Here, we summarize the Kepler project, which has been reported in [Kepler 2006, Maly 2003]. The Kepler archivelet
implementation supports the Dublin Core (DC) and OCLAC formats and can be easily extended to other formats. The
Kepler design has a well-defined API specification defining the functions that every module implements and that are
in turn available to other modules. Supporting a new metadata format requires only the
implementation of a metadata driver module. In the Kepler software documentation, we provide developer
guidelines for fast and easy implementation of a metadata driver. The following sub-sections describe the features
shown in Fig. 1.
Figure 1: Archivelet Architecture
Webserver. This module creates a server socket and listens for OAI-PMH and full-text document requests. If the
request is for a document, the document is retrieved from the data folder and served to the requester. If it is an OAI-PMH request, it is parsed and the appropriate method from the OAI-PMH API of the Metadata Manager is invoked
and the results are returned. The server is controllable by the user; at any time the user can turn it on or off. This
allows users to control when to make their collection available for harvesting.
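As an illustration of the dispatch step described above, the following Python fragment classifies an incoming request as either an OAI-PMH call or a full-text document request. It is a minimal sketch and not the Kepler implementation: the "/oai" path, the function name route_request, and the shape of the returned tuples are assumptions made for the example. Only the six verb names and the badVerb error code come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlparse, parse_qs

# The six verbs defined by OAI-PMH.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def route_request(url, oai_path="/oai"):
    """Classify a request URL as an OAI-PMH call or a document request.

    Returns ("oai", verb, args), ("document", name), or ("error", code).
    The "/oai" path and the tuple shapes are hypothetical.
    """
    parsed = urlparse(url)
    if parsed.path == oai_path:
        # parse_qs maps each key to a list of values; keep the first of each.
        args = {key: values[0] for key, values in parse_qs(parsed.query).items()}
        verb = args.pop("verb", None)
        if verb not in OAI_VERBS:
            return ("error", "badVerb")  # badVerb is an OAI-PMH error code
        return ("oai", verb, args)
    # Anything else is treated as a request for a file in the data folder.
    return ("document", parsed.path.lstrip("/"))
```

A real server would additionally check that the requested document name resolves to a file inside the data folder before serving it.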
Kepler User Interface. This module creates the main archivelet user interface which allows the user to publish/edit
metadata, view a list of published items, view full-text documents, start/stop the OAI-PMH server, and do other
configuration tasks such as changing the server port and registering with Kepler groups. It communicates with the
metadata manager through the UI API.
Metadata Manager. This module is responsible for instantiating the various metadata drivers for the system. It also
implements the OAI-PMH API that provides a method for each of the six OAI-PMH verbs. OAI-PMH requests
received by the Webserver module are forwarded to the Driver Manager that decides what metadata drivers are
involved and invokes these drivers to get partial responses from each. The Driver Manager then constructs the whole
response from these partial responses. The Driver Manager also implements a User Interface API. This API contains
methods that are invoked in response to user interactions with the main interface. For example, when the user clicks
“publish”, the Driver Manager brings up a simple GUI that allows the user to select which metadata format she wants
to use and then the Driver Manager invokes the appropriate Driver to display the appropriate publishing tool.
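The way partial responses from several drivers could be combined into one response can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the class names DictDriver and DriverManager, the method names, and the in-memory record store are all hypothetical, and only two of the six verbs are covered.

```python
class DictDriver:
    """Toy driver holding the records of one metadata format in memory."""

    def __init__(self, prefix, records):
        self.prefix = prefix      # metadata format handled by this driver
        self.records = records    # identifier -> metadata dict

    def describe_format(self):
        return [self.prefix]

    def list_identifiers(self):
        return sorted(self.records)


class DriverManager:
    """Hypothetical manager that fans a request out to the drivers."""

    def __init__(self, drivers):
        self.drivers = {driver.prefix: driver for driver in drivers}

    def select(self, args):
        """Decide which drivers are involved in a request."""
        prefix = args.get("metadataPrefix")
        if prefix is None:
            return list(self.drivers.values())
        return [self.drivers[prefix]] if prefix in self.drivers else []

    def handle(self, verb, args):
        """Invoke each selected driver and merge the partial responses."""
        involved = self.select(args)
        if verb == "ListMetadataFormats":
            parts = [driver.describe_format() for driver in involved]
        elif verb == "ListIdentifiers":
            parts = [driver.list_identifiers() for driver in involved]
        else:
            raise ValueError(f"verb not covered by this sketch: {verb}")
        # Concatenate the partial responses into the whole response.
        return [item for part in parts for item in part]
```

Adding a new metadata format in this sketch means registering one more driver object; the manager itself does not change.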
Metadata Driver. This module implements the OAI-PMH processing and the user interface functions such as
publishing tools for the specific metadata format that the Driver handles. The publishing interface has dynamic field
types (mandatory or optional), which are determined by a configuration file, based on an XML schema, managed by
the group server administrator. The Driver invokes the Validation module whenever new metadata is published to
validate the metadata against the constraints specified in the configuration file and uses the repository API to store
metadata and files.
Validation. This module exists in every driver and is responsible for downloading the configuration files from the
Kepler group server periodically (in this implementation we download the file whenever the user clicks on publish).
These configuration files contain information on regular expressions for the various metadata elements, values for
drop-down lists for elements that use predefined options (e.g., language), and the mandatory/optional status for
every metadata element. The Validation module API is invoked whenever the user publishes or edits metadata and
ensures that the metadata meets the constraints specified in the configuration files.
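The three kinds of constraint mentioned above (regular expressions, predefined options, and mandatory/optional status) can be checked with a few lines of Python. The fragment below is a sketch only: the CONFIG dictionary stands in for an already-parsed configuration file, and its field names, patterns, and option values are invented for the example rather than taken from a Kepler group server.

```python
import re

# Hypothetical, already-parsed configuration; the real file is XML
# downloaded from the group server.
CONFIG = {
    "title": {"required": True, "pattern": r".+"},
    "date": {"required": True, "pattern": r"\d{4}-\d{2}-\d{2}"},
    "language": {"required": False, "options": ["en", "de", "fr"]},
}


def validate(metadata, config=CONFIG):
    """Return a list of error messages; an empty list means valid."""
    errors = []
    for field, rule in config.items():
        value = metadata.get(field, "").strip()
        if not value:
            # Empty optional elements are fine; empty mandatory ones are not.
            if rule.get("required"):
                errors.append(f"{field}: mandatory element is missing")
            continue
        pattern = rule.get("pattern")
        if pattern and not re.fullmatch(pattern, value):
            errors.append(f"{field}: value does not match the expected pattern")
        options = rule.get("options")
        if options and value not in options:
            errors.append(f"{field}: value is not one of the predefined options")
    return errors
```

Returning all errors at once, rather than stopping at the first, lets a publishing form highlight every field that needs attention.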
Repository. This module provides access to the local collection. Whenever the user publishes new metadata or edits
existing metadata, the repository module writes this metadata to the collection as XML and also uploads the full-text
specified by the user. There are two implementations: a File-System Repository and a Database Repository. The
File-System Repository is used in the traditional archivelet and the Database Repository is used in the server-side
archivelet. For the Database Repository, only the metadata is stored in the database and the full-text is uploaded to a
folder in the server machine. For the File-System Repository, both the metadata and full-text are stored in a folder in
the file system.
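A File-System Repository of the kind described above can be approximated in a few lines. The following Python sketch writes the metadata as an XML file and copies the full-text file next to it. The function name store_item, the "record" root element, and the file naming scheme are assumptions for illustration; the actual on-disk layout used by Kepler is not specified in this paper.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def store_item(folder, item_id, metadata, fulltext_path):
    """Write metadata as XML and copy the full text into the same folder.

    The keys of `metadata` must be valid XML element names.
    Returns the paths of the XML file and of the copied full text.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # Build a flat record: one child element per metadata field.
    root = ET.Element("record", {"id": item_id})
    for name, value in metadata.items():
        ET.SubElement(root, name).text = value
    xml_path = folder / f"{item_id}.xml"
    ET.ElementTree(root).write(str(xml_path), encoding="utf-8", xml_declaration=True)

    # Keep the original file extension for the stored full text.
    source = Path(fulltext_path)
    copy_path = folder / f"{item_id}{source.suffix}"
    shutil.copyfile(source, copy_path)
    return xml_path, copy_path
```

A Database Repository variant would replace only the XML-writing half with a database insert and keep the file copy as it is.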
Metadata Extraction
The state of the art in automatic metadata extraction is at the same time quite advanced and limited [Bergmark 2000,
Seymore 1999]. Individual methods such as SVM, HMM, or rule-based methods work well by themselves for
specific domains, typically fairly homogeneous collections of documents. They would all fail rather badly in the
environment we are targeting: growing collections of a very heterogeneous nature. The overall architecture of our
approach is shown in Fig. 2. The input to the system consists of PDF files in which the text may either be encoded
as strings (with font and layout information) or may appear as scanned images. The OCR module produces an XML
format that depends on the OCR software used. The OCR output is first fed to the document representation block.
We generate the document representation from the native format of the OCR engine. The classification stage
represents our primary mechanism for coping with static heterogeneity and for detecting and reacting to dynamic
heterogeneity. The purpose of this stage is to identify a class or small set of classes of similar documents for which a
common procedure for metadata extraction has been devised.
If classification is successful, the document model is passed on to the metadata extractor for the selected
homogeneous class(es). In the literature, as well as in our own experience, rule-based systems have been shown to
work extremely well for homogeneous collections. We propose a template-based approach to encoding what we hope
will be, thanks to the successful classification, a relatively straightforward set of extraction rules. For example, a rule
might say: ‘If the phrase is centered, bold and in the largest font, it is a title’. A template language allows changes to
existing rules or the addition of new rules to a template for a class without having to modify the extraction engine.
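The separation between template and engine can be illustrated with a small Python sketch that encodes the example rule above as data. The actual template language is not shown in this paper, so the dictionary-based TEMPLATE, the Line record, and the extract function are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One text line of the document representation (hypothetical shape)."""
    text: str
    font_size: float
    bold: bool
    align: str  # "left", "center", or "right"


# A template is plain data: one rule per metadata field.
# This rule reads: centered, bold, and in the largest font means title.
TEMPLATE = [
    {"field": "title", "align": "center", "bold": True, "largest_font": True},
]


def extract(lines, template):
    """Apply each rule and keep the first line that satisfies it."""
    if not lines:
        return {}
    largest = max(line.font_size for line in lines)
    metadata = {}
    for rule in template:
        for line in lines:
            if "align" in rule and line.align != rule["align"]:
                continue
            if "bold" in rule and line.bold != rule["bold"]:
                continue
            if rule.get("largest_font") and line.font_size != largest:
                continue
            metadata[rule["field"]] = line.text
            break
    return metadata
```

Because the rules live in TEMPLATE, supporting another document class means writing another list of rules, while extract stays unchanged.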
[Figure 2 diagram. Boxes: OCR output of scanned documents; Document Classification; Document in class 1 to N; Metadata Extraction using Template 1 to N; Metadata.]
Figure 2: Template-based Metadata Extraction
Integration of Metadata Extraction with Kepler Archivelet
We have completely automated the metadata extraction scheme described in the previous section so that it is invoked
whenever a PDF file is placed in its input directory. It processes the PDF file and writes the extracted metadata in the
OAI-PMH Static Repository [Hochstenbach 2003] format, and we have modified the Kepler import feature so that
the location of this output is its default location. The use of a Static Repository Gateway allows a harvester to harvest a static
repository as if it were a regular OAI-PMH data provider. We then use the import support of the archivelet to ingest
the extracted metadata. We have created a set of templates that recognize most technical report formats
currently in use at Computer Science departments and use them as the default templates for the extraction engine.
We have also modified the Kepler Metadata Editor so that it displays by default any metadata imported through the
Extractor.
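One simple way to trigger extraction for files placed in an input directory is to poll for PDF files that have no extracted output yet. The Python sketch below shows only that bookkeeping step. The function name pending_pdfs and the convention that the output for "report.pdf" is named "report.xml" are assumptions made for the example, not a description of how the extractor is actually invoked.

```python
from pathlib import Path


def pending_pdfs(input_dir, output_dir):
    """Return PDFs in input_dir with no matching .xml in output_dir.

    Assumes (hypothetically) that the extractor names its output
    after the input file, e.g. report.pdf -> report.xml.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    done = {path.stem for path in output_dir.glob("*.xml")}
    return sorted(path for path in input_dir.glob("*.pdf") if path.stem not in done)
```

A caller would run this periodically and hand each returned path to the extraction engine.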
To take advantage of the extraction system, a user places the PDF file of the report they want to
publish in the input directory of the extraction system before starting the Kepler archivelet. By the time the archivelet is
running, the extraction process will have completed and the extracted metadata will be available at the Static Repository server. The user
then clicks on import before going to the metadata interface. In the metadata editor, all the extracted metadata are
displayed in the proper fields. The user can correct any errors in the automated system's output directly in the fields.
In our testing with real technical reports, the rate of correct metadata fields approaches 95%.
Conclusion
In this paper, we have proposed an approach that makes it easier for individual publishers to create and maintain their
OAI-compliant repositories. This was made possible by integrating metadata extraction technology with the Kepler
framework. It is critical that such tools be easy to use if they are to gain broad acceptance among high school teachers. With its
easy-to-use features, the Kepler framework enables the formation of a group digital library in a domain for sharing
educational material. For example, with the Kepler framework, high school math teachers in Virginia can
form a community to share their teaching experiences and course materials. In the future, we need to address access
and security restrictions. The current system allows anyone to access and share information, which makes it
inappropriate for sharing some educational material, such as solutions to class exercises.
References
Bergmark D. Automatic Extraction of Reference Linking Information from Online Documents. CSTR 2000-1821,
November 2000.
Hochstenbach, P., Jerez, H., Van de Sompel, H.: The OAI-PMH Static Repository and Static Repository Gateway.
Proc. of the third ACM/IEEE Joint Conference on Digital Libraries, Houston TX (2003) 210-217
Kepler, 2005. http://kepler.cs.odu.edu/
Lagoze C, Sompel H, Nelson M, and Warner S. The Open Archives Initiative Protocol for Metadata Harvesting.
2003. Retrieved April, 2005, from
http://www.openarchives.org/OAI/openarchivesprotocol.html.
Maly, K., Nelson, M., Zubair, M., Amrou, A., Kothamasa S., Wang L., Luce, R., 2003. Light-Weight Communal
Digital Libraries. In Proc. of the fourth ACM/IEEE Joint Conference on Digital Libraries, Tucson AZ, pp.237-238
Seymore K, McCallum A, and Rosenfeld, R. Learning hidden Markov model structure for information extraction. In
AAAI Workshop on Machine Learning for Information Extraction, 1999.