Readmefile

DTIC Metadata Extraction Software
Package Version 3.0
Digital Library Research Group, Old Dominion University
1. INTRODUCTION
This file provides instructions for installing and using the DTIC Metadata Extraction
software. The software package is distributed as a set of Java JAR files and is itself
platform independent. However, because the OCR software OmniPage Pro 14 supports only
Microsoft Windows environments at this time, we provide installation instructions only
for Windows.
2. SYSTEM REQUIREMENTS
To install and use the DTIC software, you need the following:
i. A Windows XP machine.
ii. Java J2SE (JDK)/JRE 1.5 or later.
You can obtain the J2SE (JDK)/JRE from http://java.sun.com/j2se/1.5.0/download.jsp
iii. OmniPage Pro 14 Office.
3. SOFTWARE INSTALLATION
The software installation and configuration has two parts: OmniPage OCR and the
DTIC Metadata Extraction software.
[Figure: on Windows XP with a Java runtime environment, two components are installed
and configured. OmniPage OCR: OCR the PDF documents and transform them into a format
that the metadata extraction software can handle. Metadata Extraction software:
extract metadata based on the output files from OmniPage OCR.]
Fig 3-1 Components of software installation
[Figure: installation roadmap with the following steps.
0. Windows XP with a Java runtime environment
1. Download software
2. Extract software
3. Copy dticsoftware & dticdocs to C:\
4. Install OmniPage
5. Metadata extraction software configuration
6. OmniPage configuration]
Fig 3-2 Software installation and configuration roadmap
You can change the configuration of the metadata extraction software (step 5) for
different data collections. Steps 1, 2, 3, 4, and 6 are a one-time setup.
3.1 Metadata Extraction software installation
Download the software from http://dtic.cs.odu.edu/deliverables/deliverables.html. It is a
zip file named dticsoftwarev3.0.zip. Save it on a local machine and extract the files to a
directory. The archive contains two subdirectories, “dticsoftware” and “dticdocs”. Copy
these two directories to the root of the C:\ drive.
The Metadata Extraction software (C:\dticsoftware) consists of a set of JAR files and
other files needed to extract metadata from PDF files. A batch file
(metadataextraction.bat) is provided in the C:\dticsoftware directory to simplify
execution.
The “C:\dticdocs” directory is where the input files reside and where the output and
intermediate files are generated. Make sure that the following three subdirectories exist
under the C:\dticdocs directory:
C:\dticdocs\dticsoftware_input_pdf
C:\dticdocs\omnipage_pdf
C:\dticdocs\omnipage_xml
The directory structure created for metadata extraction now looks like this:
C:\
    dticdocs
        dticsoftware_input_pdf
        omnipage_pdf
        omnipage_xml
    dticsoftware
        authorityfiles
        executable
        form_templates
        non_form_templates
        non_form_templates_nasa
        pdftk
        validation_script
        validation_script_nasa
        xsl
        metadataextraction.bat
Fig 3-3 Directory structure of the software
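The check for the three required subdirectories can also be scripted. The following is a minimal Python sketch and is not part of the DTIC package; the function name `missing_subdirs` is hypothetical, and Python itself is an extra assumption, since the package only requires Java.

```python
from pathlib import Path

# Subdirectories that section 3.1 says must exist under C:\dticdocs.
REQUIRED_SUBDIRS = ("dticsoftware_input_pdf", "omnipage_pdf", "omnipage_xml")


def missing_subdirs(docs_root):
    """Return the required subdirectory names that do not exist under docs_root."""
    root = Path(docs_root)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subdirs(r"C:\dticdocs")
    if missing:
        print("Missing under C:\\dticdocs: " + ", ".join(missing))
    else:
        print("All three subdirectories are present.")
```

The script only reports what is missing and creates nothing, so it can be run safely before or after the installation steps.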
3.2 OmniPage OCR installation and configuration
The following procedure configures OmniPage Pro 14.0 to OCR the documents. Make sure
the following two directories were created during the installation of the metadata
extraction software.
C:\dticdocs\omnipage_pdf
C:\dticdocs\omnipage_xml
Step 1:
Start the “OmniPage Batch Manager”.
Select “Fresh Start” and click “Next”.
Step 2:
Select “Load Image Files” and click “Next”.
Step 3:
Click on “Folders” in the “Workflow Assistant” window. This opens the
“Browse for Folder” window. Select the directory
“c:\dticdocs\omnipage_pdf” from the tree structure, and enter
“*.pdf” in the “Files of type” text box. Click “OK”.
Step 4:
In this step, make sure that you select the “Watch folders for incoming image files
of the specified types” checkbox.
Step 5:
Select “Recognize Images” and click “Next”.
Step 6:
Accept the defaults and click “Next”.
Step 7:
Select “Save” and click “Next”.
Step 8:
Select the following:
“Text” in the “Save as” option,
“XML (*.xml)” in the “File type” drop-down list box,
“Create a new file for each image file” in “File options”.
Select the path “C:\dticdocs\omnipage_pdf\*.pdf” under “Input”, click
“Specify Output Folder”, and browse to the output folder, which should be
“C:\dticdocs\omnipage_xml”. Make sure that you see
“C:\dticdocs\omnipage_xml\<original filename>.xml” under the
“Output <multiple files, name after original>” column of the table.
Step 9:
Select “Finish Job” and click “Next”.
Step 10:
Specify “Job name” as “pdf to xml”, select the “Delete input image files”
check box, and accept the defaults for the other settings. Click “Finish”.
The OmniPage Batch Manager should now be running.
Step 11:
Click Tools > Options on the menu and set “maximum number of pages in output
documents” to 50.
4. METADATA EXTRACTION PROCESS
Please refer to Fig 5-1, METADATA EXTRACTION PROCESS, in section 5 below for an
overview of the process.
4.1 Verifying the setup
1. Two directories should exist on the C:\ drive:
C:\dticdocs, where all the PDF documents and the generated XML documents reside.
C:\dticsoftware, where all the software resides.
2. The directory C:\dticdocs should contain the following directories:
C:\dticdocs\dticsoftware_input_pdf
C:\dticdocs\omnipage_xml
C:\dticdocs\omnipage_pdf
Note: Once execution starts, some temporary directories will be created.
3. Update the c:\dticsoftware\config.properties file to match the document
collection being processed (DTIC or NASA).
For the DTIC document collection, set
non_form_templates_dir = c:\\dticsoftware\\non_form_templates
non_form_validation_spec_dir = c:\\dticsoftware\\validation_script
collection_type = dtic
For the NASA document collection, set
non_form_templates_dir = c:\\dticsoftware\\non_form_templates_nasa
non_form_validation_spec_dir = c:\\dticsoftware\\validation_script_nasa
collection_type = nasa
4. Double-check the instructions in section 3.2, OmniPage OCR installation and
configuration, to confirm that OmniPage OCR is set up correctly.
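The collection settings in step 3 are easy to mix up, for example by setting collection_type to dtic while leaving the NASA directories in place. The following is a minimal Python sketch of a consistency check and is not part of the DTIC package. The function names are hypothetical, and the parser is a simplification that handles only plain `key = value` lines with doubled backslashes; it is not a full Java properties parser.

```python
# Expected values for each collection, taken from step 3 of section 4.1.
EXPECTED = {
    "dtic": {
        "non_form_templates_dir": r"c:\dticsoftware\non_form_templates",
        "non_form_validation_spec_dir": r"c:\dticsoftware\validation_script",
    },
    "nasa": {
        "non_form_templates_dir": r"c:\dticsoftware\non_form_templates_nasa",
        "non_form_validation_spec_dir": r"c:\dticsoftware\validation_script_nasa",
    },
}

# Sample text in the same form as the NASA settings shown above.
SAMPLE = r"""
non_form_templates_dir = c:\\dticsoftware\\non_form_templates_nasa
non_form_validation_spec_dir = c:\\dticsoftware\\validation_script_nasa
collection_type = nasa
"""


def parse_properties(text):
    """Parse simple 'key = value' lines; a doubled backslash becomes a single one."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue  # skip blanks, comments, and lines without '='
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip().replace("\\\\", "\\")
    return props


def check_collection(props):
    """Return a list of problems; an empty list means the settings agree."""
    ctype = props.get("collection_type")
    expected = EXPECTED.get(ctype)
    if expected is None:
        return [f"unknown collection_type: {ctype!r}"]
    problems = []
    for key, want in expected.items():
        got = props.get(key)
        # Windows paths are case-insensitive, so compare in lower case.
        if got is None or got.lower() != want.lower():
            problems.append(f"{key} is {got!r}, expected {want!r}")
    return problems


if __name__ == "__main__":
    print(check_collection(parse_properties(SAMPLE)) or "settings are consistent")
```

To check a real installation, read the text of c:\dticsoftware\config.properties and pass it to `parse_properties` in place of `SAMPLE`.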
4.2 Software Execution
4.2.1 Software execution process and monitoring
Now it is time to start the software.
* To start the software:
Click on the file “C:\dticsoftware\metadataextraction.bat”.
This batch file starts execution and constantly monitors the directory
“C:\dticdocs\dticsoftware_input_pdf” for any new PDF file.
The output information on the console looks like this:
C:\dticsoftware>java -jar c:\dticsoftware\executable\metadata_extraction_helper.
jar c:\dticsoftware\log4j.properties c:\dticsoftware\config.properties
Trying to configure the post processor
Completed
…..
The OmniPage software will show that it is now watching the folder
C:\dticdocs\omnipage_pdf for incoming PDF files.
* Place PDF documents in the C:\dticdocs\dticsoftware_input_pdf directory to start the
extraction process.
After some PDF files are dropped there, the software pre-processes the input PDF files
and then returns to monitoring the directory “C:\dticdocs\dticsoftware_input_pdf”.
The output on the console looks like this:
C:\dticsoftware>java -jar c:\dticsoftware\executable\metadata_extraction_helper.
jar c:\dticsoftware\log4j.properties c:\dticsoftware\config.properties
Trying to configure the post processor
Completed
..........-PdfPreprocessor: 5 pdf files found
-PdfPreprocessor: preprocess file: ADA396592.pdf
-PdfPreprocessor: preprocess file: ADA402550.pdf
-PdfPreprocessor: preprocess file: ADA424733.pdf
-PdfPreprocessor: preprocess file: ADA445345.pdf
-PdfPreprocessor: preprocess file: ADA445417.pdf
pdfs ready to be processed by omnipage
The OmniPage software begins to OCR these documents (in C:\dticdocs\omnipage_pdf) and
is now in the “Running” state.
You can see that files are “recognized” and then “exported” to the
C:\dticdocs\omnipage_xml directory.
* The metadata extractor (the second module of the software) then begins to extract
metadata automatically. It has two parts:
  * Form metadata extractor
  (extracts metadata from original PDF files that contain a standard form).
  * Non-form metadata extractor
  (extracts metadata from original PDF files that do not contain a standard form).
The output on the console then looks like this:
...............--- File: ADA396592 started.
...........-file unresolved
--- File: ADA402550 started.
.Using template to extract metadata for non-form documents.
.......---File: ADA396592.xml finished: in directory c:\dticdocs\dticsoftware_ou
tput\unresolved\output
...chose template: sf298_1
--- File: ADA402550 finished: in directory c:\dticdocs\dticsoftware_output\resol
ved\meta\ADA402550.xml
--- File: ADA424733 started.
..........chose template: sf298_2
--- File: ADA424733 finished: in directory c:\dticdocs\dticsoftware_output\resol
ved\meta\ADA424733.xml
--- File: ADA445345 started.
............-file unresolved
Using template to extract metadata for non-form documents.
Error
SXXP0003: Error reported by XML parser: Premature end of file.
Error
SXXP0003: Error reported by XML parser: Premature end of file.
[Fatal Error] :-1:-1: Premature end of file.
[Fatal Error] :-1:-1: Premature end of file.
.---File: ADA445345.xml finished: in directory c:\dticdocs\dticsoftware_output\u
nresolved\output
--- File: ADA445417 started.
..........-file unresolved
.Using template to extract metadata for non-form documents.
.---File: ADA445417.xml finished: in directory c:\dticdocs\dticsoftware_output\u
nresolved\output
.............
(Note: some errors may be printed to the console. These can be ignored; the metadata
extractor is working correctly as long as metadata files are written to the
corresponding directories.)
The OmniPage software then returns to the watching state and waits for new files in
the C:\dticdocs\omnipage_pdf directory.
* As files are processed, you can check the C:\dticdocs\dticsoftware_output
directory for the extracted metadata files.
The output directory structure looks as follows:
C:\
    dticdocs
        dticsoftware_output
            resolved
                Failed
                idm
                idm_backup
                omnipage
                meta
                output (metadata for document with form)
            unresolved
                idm
                idm_backup
                omnipage
                output (metadata for document without form)
Fig 3-4 Directory structure of the output
You should check three directories to get the extracted metadata files:
1. For original PDF documents that contain a standard form:
Check C:\dticdocs\dticsoftware_output\resolved\output
Check C:\dticdocs\dticsoftware_output\resolved\failed
Note:
A post-processor (a module of the software) validates some of the fields of the
metadata files against an authority file (C:\dticsoftware\authorityfiles\authority.xml).
If the post-processing step succeeds, the metadata file is moved to the
C:\dticdocs\dticsoftware_output\resolved\output directory.
If the post-processing step fails, the metadata file is moved to the
C:\dticdocs\dticsoftware_output\resolved\failed directory.
So check both of these directories to get the metadata files for original PDF
documents that contain a standard form.
2. For original PDF documents that do not contain a standard form:
Check the C:\dticdocs\dticsoftware_output\unresolved\output directory.
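Checking the three output directories by hand becomes tedious for large batches. The following is a minimal Python sketch that lists the metadata files in each of them; it is not part of the DTIC package. The function name and the labels are hypothetical, and the sketch assumes that metadata files carry an .xml extension, as the console output above suggests.

```python
from pathlib import Path

# The three directories named above, relative to C:\dticdocs\dticsoftware_output.
OUTPUT_SUBDIRS = {
    "form, validated": Path("resolved") / "output",
    "form, failed validation": Path("resolved") / "failed",
    "non-form": Path("unresolved") / "output",
}


def collect_metadata_files(output_root):
    """Map each label to the sorted .xml file names found in its directory."""
    root = Path(output_root)
    found = {}
    for label, sub in OUTPUT_SUBDIRS.items():
        folder = root / sub
        if folder.is_dir():
            found[label] = sorted(p.name for p in folder.glob("*.xml"))
        else:
            found[label] = []  # directory not created yet
    return found


if __name__ == "__main__":
    results = collect_metadata_files(r"C:\dticdocs\dticsoftware_output")
    for label, names in results.items():
        print(f"{label}: {len(names)} file(s)")
```

Files listed under "form, failed validation" are the ones that the post-processor could not validate against the authority file, so they are the natural place to start a manual review.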
(Note: while the software is executing, it copies the PDF files to c:\dticdocs\backup,
where you can find all the original PDF files. PDF files that cannot be handled by the
software are placed in the c:\dticdocs\exception_pdffile directory (these files should
be checked by the user), and c:\dticdocs\tmpfile holds temporary files created by the
software. The output is placed in the directory “C:\dticdocs\dticsoftware_output”.
During execution, logs recording the execution trace of the whole process are written
to log files located at C:\dticsoftware\metaextract.)
4.2.2 Software execution process monitoring
While the software is running, check three windows to monitor whether it is running
smoothly.
Command line (DOS) window:
Check that the metadata extraction software is working properly.
It keeps watching the C:\dticdocs\dticsoftware_input_pdf directory for new PDF
files and pre-processes them; the OmniPage software then OCRs those documents, and
metadata files are created and placed into different directories.
If you do not add new PDF files, the software stays in a watching state and prints
output such as:
-PdfPreprocessor: checking for pdf files
-PdfPreprocessor: 0 pdf files found
It keeps checking for PDF files until you close the DOS window. Do not close the
command-line window until you have made sure that all the input files have been
processed.
OmniPage software:
Check that the OmniPage software is running correctly.
Most of the time, OmniPage works fine, but you should check whether PDF files
are being “recognized” and “exported” by OmniPage to make sure that it is
working correctly.
Sometimes OmniPage shows the “waiting” state although it is OCRing the PDF files.
Press F5 to check whether files are being “recognized” and “exported”.
You can also check whether the PDF files in C:\dticdocs\omnipage_pdf disappear
periodically after you have placed new PDF files in the
C:\dticdocs\dticsoftware_input_pdf directory:
C:\dticdocs\dticsoftware_input_pdf
    Put the input PDF files in this directory (possibly hundreds of pages per file).
C:\dticdocs\omnipage_pdf
    The PDF files that OmniPage will process (the first and last five pages of the
    original PDF files). Check this directory and the OmniPage software to make sure
    that the OmniPage step is working smoothly.
Windows Explorer:
Check that the PDF files are being processed by OmniPage and that output files are
generated in these two directories:
C:\dticdocs\dticsoftware_output\resolved
C:\dticdocs\dticsoftware_output\unresolved\output
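The folder check described above for OmniPage (PDF files should keep disappearing from C:\dticdocs\omnipage_pdf) can be approximated with a small script. The following is a minimal Python sketch and is not part of the DTIC package. The function names, the sampling window, and the rule for "stalled" are all assumptions chosen for illustration; a positive result is only a hint to look at the OmniPage window, not proof of a hang.

```python
from pathlib import Path


def count_pdfs(folder):
    """Count the PDF files currently waiting in folder (0 if it does not exist)."""
    path = Path(folder)
    if not path.is_dir():
        return 0
    return sum(1 for p in path.iterdir()
               if p.is_file() and p.suffix.lower() == ".pdf")


def looks_stalled(samples, window=3):
    """Return True if the last `window` counts are all non-zero and never decrease."""
    if len(samples) < window:
        return False  # not enough observations yet
    recent = samples[-window:]
    if min(recent) == 0:
        return False  # the folder was empty at some point, so OCR caught up
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    print("PDF files waiting:", count_pdfs(r"C:\dticdocs\omnipage_pdf"))
```

One way to use it is to record `count_pdfs(r"C:\dticdocs\omnipage_pdf")` every few minutes, append each value to a list, and pass that list to `looks_stalled`. If it returns True, press F5 in OmniPage and follow the troubleshooting steps in section 4.2.3.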
4.2.3 Software Execution Troubleshooting
You may encounter the following problems:
i. No response from the OmniPage OCR software
If the OmniPage OCR software does not respond for a long time even though there are
still files in the C:\dticdocs\omnipage_pdf directory, move all of these files to another
local directory and then copy them back to C:\dticdocs\omnipage_pdf to see whether
OmniPage enters the “Running” state.
If it still hangs for a long time, stop this OmniPage job and create a new one
following the instructions in section 3.2, OmniPage OCR installation and configuration.
ii. The OmniPage OCR software shows an error
Once OmniPage enters an error state, it continues to show "Error" from then on. Most of
the time this is harmless. Watch the directories to see whether files are disappearing
from the C:\dticdocs\omnipage_pdf directory, and press F5 to refresh the output
information on the right side of the OmniPage window to confirm that PDF files are
being “recognized” and “exported”.
5. Execution Flow
The complete process of metadata extraction is shown in the figure below.
[Figure: flow diagram. Labels: Input Documents (PDF); Extract 1st & last 5 pages;
Reduced PDF; Original PDF Backup; OCR; Omnipage XML; Form Templates (sf298_1,
sf298_2, ...); Form Processor; Resolved Documents; Unresolved; Convert to CleanXML;
Nonform Templates (au, eagle, ...); Extract Metadata; Candidate Metadata Sets;
Validation Script; Select Best Metadata; Selected Metadata; Extracted Metadata;
Authority File (Permitted Values); Post Processor; Cleaned Metadata; Final Form
Output; Final Nonform Output.]
Fig 5-1. METADATA EXTRACTION PROCESS
The PDF documents are fed as input to the process, which reduces each file by
extracting only the first five and last five pages and also copies the original to the
backup directory.
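The page reduction described above can be illustrated with a short calculation. The following is a minimal Python sketch and is not the package's own code. The function name `pages_to_keep` is hypothetical, and the handling of documents with ten pages or fewer (keep every page once) is an assumption, since this file does not say what the software does in that case.

```python
def pages_to_keep(page_count, head=5, tail=5):
    """Return the 1-based page numbers kept: the first `head` and last `tail` pages.

    For short documents the two ranges would overlap; this sketch assumes
    each page is then kept exactly once.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    first = list(range(1, min(head, page_count) + 1))
    # Start the tail after the head so that no page is listed twice.
    last_start = max(page_count - tail + 1, len(first) + 1)
    last = list(range(last_start, page_count + 1))
    return first + last


if __name__ == "__main__":
    print(pages_to_keep(200))
```

For a 200-page report, only 10 pages reach OmniPage, which is why the files in C:\dticdocs\omnipage_pdf are much smaller than the originals.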
These reduced PDFs are then consumed by OmniPage OCR to produce OmniPage XML
files. Once the OmniPage XML files are produced, the “Form Processor” applies the
XML template to the XML to find the form in the document. If it finds the form in the
document, the same XML template file is used to extract the metadata. The extracted
metadata is saved in an XML file (metadata format) in the output directory inside the
C:\dticdocs\dticsoftware_output\resolved folder. If no form is found in the document,
or the metadata cannot be extracted properly, the file is passed on to the non-form and
validation part of the software, which extracts the metadata using non-form templates
and validation scripts. The extracted metadata is saved in the
C:\dticdocs\dticsoftware_output\unresolved\output folder. The extracted metadata
output files are therefore saved in two directories:
C:\dticdocs\dticsoftware_output\resolved\output
---metadata for documents with a form
C:\dticdocs\dticsoftware_output\unresolved\output
---metadata for documents without a form
6. Preparing new Templates
The software ships with several form templates and non-form templates (a
template is a file consisting of XML rules for metadata extraction), which are in the
following directories:
C:\dticsoftware\form_templates
--templates for document collection with form
C:\dticsoftware\non_form_templates
--templates for DTIC document collection without form
C:\dticsoftware\non_form_templates_nasa
--templates for NASA document collection without form
To write a template for a new kind of document, study one of the existing templates
together with its corresponding form in the PDF file (for documents with a form) or
the corresponding document structure (for documents without a form). This will give
you an idea of how the templates are written. Start by modifying an existing template’s
rules, and place the new template in the corresponding template directory. The software
automatically picks up the templates from this directory.
7. SUPPORT
E-mail to <dtic@cs.odu.edu>