Readmefile - Extracting Metadata And Structure

Metadata Extraction Software

Package Version 3.2

Digital Library Research Group, Old Dominion University

Contents

1. Introduction

2. System Requirements

3. Software Installation

4. Metadata extraction process

5. Execution Flow

6. Preparing New Templates

7. Support

1. INTRODUCTION

In this file, we provide instructions for installing and using the Metadata Extraction software.

The software package is distributed as an executable Java JAR file. The package itself is platform independent. However, because the OCR software OmniPage Pro 14 only supports Microsoft Windows environments at this time, we provide installation instructions only for Windows.

2. SYSTEM REQUIREMENTS

To install and use the software, the following requirements should be met:

i. A Windows XP machine.

ii. Java J2SE (JDK)/JRE 1.5 or later. You can obtain the J2SE (JDK)/JRE from http://java.sun.com/j2se/1.5.0/download.jsp

iii. OmniPage Pro 14 Office.
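The Java requirement can be checked before installing. The following is a minimal sketch, not part of the package: it assumes the `java` launcher is on the PATH and relies on `java -version` printing a quoted version string (such as `java version "1.5.0_22"`) to standard error. The function names are hypothetical.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Return (major, minor) from the first quoted version string, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor)


def meets_requirement(version_output, required=(1, 5)):
    """True if the reported Java version is at least `required` (default 1.5)."""
    version = parse_java_version(version_output)
    return version is not None and version >= required


def installed_java_output():
    """Run `java -version`; the launcher writes the version text to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr
```

Calling `meets_requirement(installed_java_output())` on the target machine reports whether the installed Java is new enough.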

3. SOFTWARE INSTALLATION

The software installation has two parts: the OmniPage OCR installation and the Metadata Extraction software installation.

Fig 3-1 Components of software installation

3.1 Metadata Extraction software installation

Download the software from http://dtic.cs.odu.edu/deliverables/deliverables.html

It is a zip file named extractSoftwarev3.2.zip. Save it on a local machine and extract it to a directory. This yields an executable JAR file that, when double-clicked, starts the installation of the metadata extraction software.

The installation steps are as follows:

Step 1: Double-click the JAR file extracted from the zip file to start the installation.

Step 2: Specify the location where the software is to be installed. This can be any directory of the user's choice.

Step 3: Select the document collection as shown in the screen below

Step 4: Select the output directory for the software.

Step 5: Select the directory for putting the input PDF documents to be processed.

Step 6: Installation is complete! Press Continue to finish.

The Metadata Extraction software consists of a set of JAR files and other files needed to extract metadata from PDF files. A batch file (metadataextraction.bat) is provided in the installationDirectory\extract_software directory to simplify execution.

Fig 3-3 Partial directory structure of the software

(For clarity, only the directories that matter from the user's point of view are shown.)

4. METADATA EXTRACTION PROCESS

4.1 Verifying the setup

Assuming the installation is done in C:\installationDirectory, the following directories should exist:

C:\installationDirectory\work where all the intermediate files generated from the software execution reside.

C:\installationDirectory\extract_software where all the software resides.

The directory where input documents are to be put.

The directory where the output will be generated.

Note: Once the execution starts some temporary directories will be created.
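The check above can also be scripted. The following is a minimal sketch, not part of the package: the function name is hypothetical, and the input and output directories are whatever was chosen during installation, so they are passed in as arguments.

```python
import os


def missing_directories(install_dir, input_dir, output_dir):
    """Return the expected directories that do not exist yet."""
    expected = [
        os.path.join(install_dir, "work"),              # intermediate files
        os.path.join(install_dir, "extract_software"),  # the software itself
        input_dir,                                      # input PDF documents
        output_dir,                                     # generated output
    ]
    return [path for path in expected if not os.path.isdir(path)]
```

An empty result means all four directories are in place; any path in the list points at a step of the installation that should be rechecked.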

4.2 Software Execution

4.2.1 Software execution process

To start the software, double-click the file “C:\installationDirectory\extract_software\metadataextraction.bat”.

This batch file starts the software, which then continuously monitors the input document directory for PDF files.

The output information on the console looks like this:

OmniPage thread started...

.unified extractor loop

Place PDF Documents in the input directory to start the extraction process.

Note that it does not matter whether the PDF files are placed in the input directory before or after the software is started.

After some PDF files are dropped into the input directory, the software pre-processes them and then returns to monitoring the input directory. The output on the console looks like this:

.OmniPage thread started...

unified extractor loop

-PdfPreprocessor: 4 pdf files found

-PdfPreprocessor: preprocess file: ADA392625.pdf

-PdfPreprocessor: preprocess file: ADA393370.pdf

-PdfPreprocessor: preprocess file: ADA395354.pdf

-PdfPreprocessor: preprocess file: ADA396545.pdf

pdfs ready to be processed by omnipage

unified extractor loop

The metadata extractor (the second module of the software) then begins to extract metadata automatically.

Form metadata extractor

(extracts metadata from original PDF files that contain a standard form).

Non-form metadata extractor

(extracts metadata from original PDF files that do not contain a standard form).

The output on the console looks like this:

processing file -C:\testingPC_out\omnipage_xml\ADA392625.xml

.Writing metadata to resolved directory at

C:\testingPC_out\resolved\ADA392625.meta.xml

(Note: some errors may be printed to the console. These can be ignored; the metadata extractor is working correctly as long as metadata files are written to the corresponding directories.)

Files are processed continuously; meanwhile, you can check the output directory for the extracted metadata files.

The output directory has the following sub-directories:

Fig 3-4 Directory structure of the output

Check the resolved directory to get the extracted metadata files.

(Note: while the software is executing, it copies the PDF files to C:\outputDirectory\backup, where you can find all the original PDF files. PDF files that cannot be handled by the software are placed in the C:\outputDirectory\exception_pdf directory; the user should check these files. C:\outputDirectory\tmp_pdf is the directory where temporary files created by the software are placed. The output is placed in the directory “C:\outputDirectory\resolved”.

During execution, a trace of the whole process is recorded in the log file C:\installationDirectory\extract_software\extractor.log.)

4.2.2 Software execution process monitoring

While the software is running, three windows should be checked in order to monitor whether the software is running smoothly.

Command line (DOS) window:

Check this window to see whether the metadata extraction software is working properly.

The software keeps watching the input directory for new PDF files and preprocesses them; the OmniPage software then begins to OCR the documents, and the metadata files are created and placed into different directories.

If you do not add new PDF files, the software stays in a watching state. It keeps checking for PDF files until you close the DOS window.

Do not close the command-line window until you have made sure that all the input files have been processed.

-PdfPreprocessor: checking for pdf files

-PdfPreprocessor: 0 pdf files found

Windows explorer:

To check if the PDF files are being processed by OmniPage and output files are generated, look at the directory:

C:\outputDirectory\resolved
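Before closing the command-line window, the output directory can be used to see which documents are still in progress. The following is a rough sketch under stated assumptions: files in backup keep their original names, each metadata file is named after its PDF with a `.meta.xml` suffix (as in the console example ADA392625.meta.xml), and the function names are hypothetical.

```python
import os


def stems(directory, suffix):
    """Return file names in `directory` ending in `suffix`, with it removed."""
    if not os.path.isdir(directory):
        return set()
    return {
        name[:-len(suffix)]
        for name in os.listdir(directory)
        if name.lower().endswith(suffix)
    }


def pending_documents(output_dir):
    """List documents that were backed up but have no result yet."""
    submitted = stems(os.path.join(output_dir, "backup"), ".pdf")
    resolved = stems(os.path.join(output_dir, "resolved"), ".meta.xml")
    rejected = stems(os.path.join(output_dir, "exception_pdf"), ".pdf")
    # Anything neither resolved nor rejected is presumably still being processed.
    return sorted(submitted - resolved - rejected)
```

When `pending_documents(r"C:\outputDirectory")` returns an empty list, every backed-up PDF has either produced a metadata file or been moved to exception_pdf.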

5. EXECUTION FLOW

The PDF documents are fed as input to the process, which reduces each file by extracting only the first five and last five pages and also copies the original to the backup directory.
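The page selection can be illustrated with a short sketch. This is not the package's code: the function name is hypothetical, page indices are 0-based, and the handling of documents shorter than ten pages (keep every page once) is an assumption, since this file does not say what the software does in that case.

```python
def pages_to_keep(page_count, head=5, tail=5):
    """Return sorted 0-based indices of the first `head` and last `tail` pages."""
    first = range(0, min(head, page_count))
    last = range(max(page_count - tail, 0), page_count)
    # A set removes overlap when the document has fewer than head + tail pages.
    return sorted(set(first) | set(last))
```

For a 20-page document this selects pages 0 to 4 and 15 to 19; for a 7-page document it selects all seven pages.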

The reduced PDFs are then consumed by OmniPage OCR to produce OmniPage XML files. Once the OmniPage XML files are produced, the “Form Processor” applies the XML template to the XML to find the form in the document. If it finds the form in the document, the same XML template file is used to extract the metadata. The extracted metadata is saved in an XML file (metadata format) in the output directory. If no form is found in the document or the metadata cannot be extracted properly, the file is passed on to the non-form and validation part of the software, which extracts the metadata using non-form templates and validation scripts.
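The fallback from form to non-form extraction can be summarized as follows. This is an illustrative sketch only: the two extractor callables are hypothetical stand-ins for the package's modules, and the convention that the form extractor returns nothing (None or an empty result) when no form is found or extraction fails is an assumption.

```python
def extract_metadata(ocr_xml, form_extractor, non_form_extractor):
    """Try form-based extraction first, then fall back to non-form extraction.

    Returns a (route, metadata) pair, where route is "form" or "non-form".
    """
    metadata = form_extractor(ocr_xml)
    if metadata:
        return "form", metadata
    # No form found, or the form could not be extracted properly.
    return "non-form", non_form_extractor(ocr_xml)
```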

6. PREPARING NEW TEMPLATES

The software already ships with several form templates and non-form templates (a template is a file consisting of XML rules for metadata extraction), which are in the following directories:

installationDirectory\extract_software\form_templates -- templates for document collections with a form

installationDirectory\extract_software\non_form_templates -- templates for the DTIC document collection without a form

installationDirectory\extract_software\non_form_templates_nasa -- templates for the NASA document collection without a form

To write a template for a new kind of document, study one of the existing templates together with its corresponding form in the PDF file (for documents with a form) or the corresponding document structure (for documents without a form). This will give you an idea of how the templates are written. Start by modifying the rules of an existing template. Place the new template in the corresponding template directory; the software automatically picks up the templates from that directory.
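Because templates are XML files, a quick well-formedness check can catch editing mistakes before a new template is placed in a template directory. The sketch below uses only Python's standard library; the function name is hypothetical, and the check does not validate the template's rules, whose format is not described in this file.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(path):
    """Return True if the file at `path` parses as XML, False otherwise."""
    try:
        ET.parse(path)
    except ET.ParseError:
        return False
    return True
```

A template that fails this check has a syntax problem such as an unclosed tag; one that passes may still contain rules the software cannot use.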

7. SUPPORT

E-mail to <dtic@cs.odu.edu>
