Operational manual - Extracting Metadata And Structure

Metadata Extraction Software
Package Version 3.2
Digital Library Research Group, Old Dominion University
Operations Manual
1. INTRODUCTION
In this file, we provide instructions for running and monitoring ODU Metadata Extraction
software Version 3.2. For this document we assume:
1) The input directory (the directory where we put the input documents to be
processed) was chosen during installation as C:\extractInput.
2) The output directory (the directory where we get all the output files) was chosen
during installation as C:\extractOutput.
Please note that this directory further has several sub-directories and the final
output goes to C:\extractOutput\resolved. The rest of the sub-directories are for
debugging and intermediate processing.
3) All the necessary software has been installed in the directory
C:\extractInstallation.
You will find several sub-directories under this directory:



• C:\extractInstallation\work, where all the intermediate files generated by
the software execution reside.
• C:\extractInstallation\extract_software, where all the software resides.
A batch file, metadataextraction.bat, is provided in the
C:\extractInstallation\extract_software directory to simplify execution.
Note: Once the execution starts some temporary directories will be created.
The directory structure looks like this:
Fig 1-1 Partial directory structure of the software
(Only directories important from the user’s point of view are shown for clarity)
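The layout assumed above can be checked before the first run. The following is a minimal Python sketch, not part of the ODU package: the function name check_layout is hypothetical, and the paths are the ones assumed in this manual, so adjust them if other locations were chosen during installation.

```python
from pathlib import Path

# Paths as assumed in this manual (assumption: change them to match your installation).
INPUT_DIR = Path(r"C:\extractInput")
OUTPUT_DIR = Path(r"C:\extractOutput")
INSTALL_DIR = Path(r"C:\extractInstallation")


def check_layout(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR, install_dir=INSTALL_DIR):
    """Return a dict mapping each expected path to True (present) or False (missing)."""
    input_dir, output_dir, install_dir = Path(input_dir), Path(output_dir), Path(install_dir)
    expected = [
        input_dir,
        output_dir,
        install_dir / "work",
        install_dir / "extract_software",
        install_dir / "extract_software" / "metadataextraction.bat",
    ]
    return {str(path): path.exists() for path in expected}


if __name__ == "__main__":
    for path, present in check_layout().items():
        print(("ok       " if present else "MISSING  ") + path)
```

Sub-directories of the output directory are deliberately not checked here, because some of them are created only once execution starts.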
2. METADATA EXTRACTION PROCESS
2.1 Software Execution
2.1.1 Software execution process
• To start the software:
Click on the file “C:\extractInstallation\extract_software\metadataextraction.bat”.
This batch file starts execution, constantly monitoring the input document
directory C:\extractInput for any PDF file.
The output information on the console looks like this:
OmniPage thread started...
.:..:.:.
The exact pattern of “.” and “:” markers may vary.
• Place PDF documents in the input directory C:\extractInput to start the extraction
process.
Note that it does not matter whether the PDF files are placed in the input directory
before or after the software is started.
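Placing the files can also be scripted. The following Python sketch is an illustration only: submit_pdfs is a hypothetical helper, not part of the package. The manual does not say how the software treats a file that is still being written, so as a precaution the sketch copies each file under a temporary ".part" name and renames it when the copy is complete. That the software ignores names not ending in .pdf is an assumption.

```python
import shutil
from pathlib import Path


def submit_pdfs(source_dir, input_dir=r"C:\extractInput"):
    """Copy every PDF in source_dir into input_dir; return the copied file names."""
    source_dir, input_dir = Path(source_dir), Path(input_dir)
    copied = []
    for pdf in sorted(source_dir.iterdir()):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue  # skip sub-directories and non-PDF files
        partial = input_dir / (pdf.name + ".part")  # not a .pdf name while copying
        shutil.copy2(pdf, partial)
        partial.replace(input_dir / pdf.name)  # rename once the copy is complete
        copied.append(pdf.name)
    return copied
```

For example, submit_pdfs(r"D:\reports") would copy all PDF files from a (hypothetical) D:\reports folder into C:\extractInput.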
After some PDF files are dropped, the software pre-processes the input PDF files and then
returns to monitoring the input directory. The output information on the console looks
like this:
.OmniPage thread started...
..:.
-PdfPreprocessor: 4 pdf files found
-PdfPreprocessor: preprocess file: ADA392625.pdf
-PdfPreprocessor: preprocess file: ADA393370.pdf
-PdfPreprocessor: preprocess file: ADA395354.pdf
-PdfPreprocessor: preprocess file: ADA396545.pdf
pdfs ready to be processed by omnipage
…
• The metadata extractor (the modules labeled form processing and non-form
processing) then begins to extract metadata automatically.
(Figure: processing flow. Input documents (PDF) pass through input processing and OCR to
produce an XML model of each document. Form processing applies the form templates
(sf298_1, sf298_2, ...); unresolved documents go on to nonform processing, which applies
the nonform templates (au, eagle, ...). The extracted metadata is post-processed into
cleaned metadata and then validated. Trusted outputs become final metadata output;
untrusted metadata outputs go through human review and correction, and the corrected
metadata then becomes final metadata output.)
• Form metadata extractor
(Extracts metadata from original PDF files that contain a standard form, a.k.a.
Report Documentation Page or RDP.)
• Non-form metadata extractor
(Extracts metadata from original PDF files that do not contain a standard form.)
The output on the console looks like this:
processing file -C:\extractOutput\omnipage_xml\ADA392625.xml
.Writing metadata to resolved directory at
C:\extractOutput\resolved\ADA392625.meta.xml
(Note: some errors may be printed to the console. These can be ignored: the metadata
extractor is working correctly as long as metadata files are written to the
corresponding directories.)
• Files are processed continuously; meanwhile, you can check the output
directory to get the extracted metadata files.
The output directory has the following sub-directories:
Fig 2-1 Partial Directory structure of the output
(Note: some of the sub-directories shown above are created only when needed. For
example, if no document fails the processing, no untrusted sub-directory is created.)
The software attempts to monitor its own behavior to determine whether its output is
likely to be correct or not. Metadata that the program believes to be trustworthy will be
placed in the resolved directory. Metadata that the program believes to be suspect and
that should be inspected and, if necessary, corrected by a human will be placed in the
untrusted directory. (Future releases of this software will provide support for this
correction process.) You should check the resolved and untrusted directories to get the
extracted metadata files.
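Checking the two directories can be sketched in Python as follows. This is an illustration under stated assumptions: collect_metadata is a hypothetical helper, and the ".meta.xml" suffix is taken from the console example for the resolved directory. That files in the untrusted directory use the same suffix is an assumption.

```python
from pathlib import Path


def collect_metadata(output_dir=r"C:\extractOutput"):
    """Map 'resolved' and 'untrusted' to the sorted metadata file names found there."""
    output_dir = Path(output_dir)
    found = {}
    for name in ("resolved", "untrusted"):
        folder = output_dir / name
        if folder.is_dir():
            found[name] = sorted(p.name for p in folder.glob("*.meta.xml"))
        else:
            # Sub-directories are created only when needed, so a missing one is normal.
            found[name] = []
    return found


if __name__ == "__main__":
    for name, files in collect_metadata().items():
        print(f"{name}: {len(files)} metadata file(s)")
        for file_name in files:
            print("  " + file_name)
```

A missing sub-directory is reported as an empty list rather than an error, matching the note that some sub-directories exist only once they are needed.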
(Note: while the software is executing, it copies the original PDF files to
C:\extractOutput\backup. PDF files that cannot be handled by the software are placed in
C:\extractOutput\exception_pdf; these files should be checked by the user. Temporary
files created by the software are placed in C:\extractOutput\tmp_pdf. During execution,
a trace of the whole process is recorded in the log file
C:\extractInstallation\extract_software\extractor.log.)
In what should be rare circumstances, the software may be unable to generate any
metadata for a document even though the document itself was handled. Depending on the
nature of the problem, such documents will be copied into either the error or
unresolved directories.
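When a document ends up in one of these directories, the most recent entries of the log file are a natural place to look. The following Python sketch shows one way to read them. tail_log is a hypothetical helper, and the format and encoding of extractor.log are not described in this manual, so the lines are returned as-is.

```python
from collections import deque
from pathlib import Path

# Log location as given in this manual (assumption: default installation directory).
LOG_FILE = Path(r"C:\extractInstallation\extract_software\extractor.log")


def tail_log(log_file=LOG_FILE, count=20):
    """Return the last `count` lines of the log file, or [] if it does not exist yet."""
    log_file = Path(log_file)
    if not log_file.is_file():
        return []
    # errors="replace" because the encoding of the log is not documented.
    with log_file.open("r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    for line in tail_log():
        print(line)
```

The deque with maxlen keeps only the last lines in memory, which matters if the log has grown large over a long run.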
2.1.2 Software execution process monitoring
While the software is running, three windows should be checked in order to monitor
whether the software is running smoothly.
Command line (DOS) window:
Check this window to see whether the metadata extraction software is working properly.
The software keeps watching the input directory for new PDF files and preprocesses
those files; the OmniPage software then OCRs the documents, and metadata files are
created and placed into the different output directories.
If you do not add new PDF files, the software stays in a watching state. It will keep
checking for PDF files until you close the DOS window.
Do not close the command-line window until you are sure that all the input files have
been processed.
In the watching state, the console output looks like this:
-PdfPreprocessor: checking for pdf files
-PdfPreprocessor: 0 pdf files found
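Console text in this format can also be checked mechanically. The Python sketch below assumes the console text has been copied out of the DOS window by hand, since the manual does not describe any way to redirect it. The names pdf_count and is_idle are hypothetical.

```python
import re

# Matches the report line shown in the console examples of this manual.
FOUND_LINE = re.compile(r"-PdfPreprocessor: (\d+) pdf files found")


def pdf_count(console_line):
    """Return the number of PDF files reported on this line, or None if it is another line."""
    match = FOUND_LINE.search(console_line)
    return int(match.group(1)) if match else None


def is_idle(console_lines):
    """True if the most recent 'pdf files found' report says 0 files."""
    counts = [c for c in map(pdf_count, console_lines) if c is not None]
    return bool(counts) and counts[-1] == 0
```

A count of 0 only says that the input directory was empty at the last check. It does not by itself show that every output file has been written, so the output directories should still be checked.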
Windows explorer:
To check whether the PDF files are being processed by OmniPage and output files are
being generated, look at the directories C:\extractOutput\resolved and
C:\extractOutput\untrusted.
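The same check can be sketched in Python by comparing the backed-up PDF files with the metadata files written so far. This is an illustration only: document_status is a hypothetical helper, and it assumes that the backup directory keeps the original file names and that a document named X.pdf produces X.meta.xml, as in the ADA392625 console example.

```python
from pathlib import Path


def document_status(output_dir=r"C:\extractOutput"):
    """Map each backed-up PDF name to 'resolved', 'untrusted' or 'no metadata yet'."""
    output_dir = Path(output_dir)
    backup = output_dir / "backup"
    status = {}
    if not backup.is_dir():
        return status  # nothing has been picked up by the software yet
    for pdf in sorted(backup.iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue
        meta_name = pdf.stem + ".meta.xml"  # assumed naming, from the console example
        if (output_dir / "resolved" / meta_name).is_file():
            status[pdf.name] = "resolved"
        elif (output_dir / "untrusted" / meta_name).is_file():
            status[pdf.name] = "untrusted"
        else:
            status[pdf.name] = "no metadata yet"
    return status


if __name__ == "__main__":
    for name, state in document_status().items():
        print(f"{name}: {state}")
```

Documents reported as "no metadata yet" may still be in progress, or may have been copied to the error, unresolved or exception_pdf directories.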