Metadata Extraction Software Package
Version 3.2
Digital Library Research Group, Old Dominion University

Operations Manual

1. INTRODUCTION

This document provides instructions for running and monitoring the ODU Metadata Extraction software, Version 3.2. It assumes the following:

1) The input directory (where the documents to be processed are placed) was chosen during installation as C:\extractInput.

2) The output directory (where all output files are written) was chosen during installation as C:\extractOutput. This directory has several sub-directories; the final output goes to C:\extractOutput\resolved. The other sub-directories are for debugging and intermediate processing.

3) All the necessary software has been installed in the directory C:\extractInstallation. You will find several sub-directories under this directory:

C:\extractInstallation\work, where all the intermediate files generated during software execution reside.
C:\extractInstallation\extract_software, where all the software resides.

A batch file, metadataextraction.bat, is provided in the C:\extractInstallation\extract_software directory to simplify execution.

Note: Once execution starts, some temporary directories will be created.

The directory structure looks like this:

Fig 1-1 Partial directory structure of the software (only directories important from the user’s point of view are shown, for clarity)

2. METADATA EXTRACTION PROCESS

2.1 Software Execution

2.1.1 Software execution process

To start the software: Click on the file “C:\extractInstallation\extract_software\metadataextraction.bat”. This batch file starts execution and constantly monitors the input document directory C:\extractInput for PDF files. The console output looks like this:

    OmniPage thread started...
    .:..:.:.

The exact pattern of “.” and “:” markers may vary.
Place PDF documents in the input directory C:\extractInput to start the extraction process. The order does not matter: the PDF files can be placed there before or after the software is started. After some PDF files are dropped in, the software pre-processes them and then returns to monitoring the input directory. The console output looks like this:

    .OmniPage thread started...
    ..:.
    -PdfPreprocessor: 4 pdf files found
    -PdfPreprocessor: preprocess file: ADA392625.pdf
    -PdfPreprocessor: preprocess file: ADA393370.pdf
    -PdfPreprocessor: preprocess file: ADA395354.pdf
    -PdfPreprocessor: preprocess file: ADA396545.pdf
    pdfs ready to be processed by omnipage …

The metadata extractor (the modules labeled form processing and non-form processing) then begins to extract metadata automatically.

(Figure: metadata extraction process flow. Labels shown: Input Documents (PDF); Input Processing & OCR; XML model of document; Form Templates (sf298_1, sf298_2, ...); Form Processing; Extracted Metadata; Unresolved Documents; Nonform Templates (au, eagle, ...); Nonform Processing; Post Processing; Cleaned Metadata; Validation; trusted outputs; Untrusted Metadata Outputs; Human Review & Correction; corrected metadata; Final Metadata Output.)

Form metadata extractor: extracts metadata from original PDF files that contain a standard form (a.k.a. Report Document Page, or RDP).
Non-form metadata extractor: extracts metadata from original PDF files that do not contain a standard form.

The console output looks like this:

    processing file -C:\extractOutput\omnipage_xml\ADA392625.xml
    .Writing metadata to resolved directory at C:\extractOutput\resolved\ADA392625.meta.xml

(Note: Some errors may be printed to the console. These can be ignored; the metadata extractor is working correctly as long as metadata files are being written to the corresponding directories.)

Files are processed continuously. In the meantime, you can check the output directory for the extracted metadata files.
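The console example above suggests a naming convention: the metadata for ADA392625.pdf is written to C:\extractOutput\resolved\ADA392625.meta.xml. The following Python sketch, which is not part of the distributed software, derives the expected location of a resolved metadata file from an input file name. The function name expected_metadata_path is hypothetical, and the sketch assumes the default output directory used in this manual and that every resolved file follows the <name>.meta.xml pattern seen in the example.

```python
from pathlib import PureWindowsPath

# Default output directory assumed throughout this manual.
OUTPUT_DIR = PureWindowsPath(r"C:\extractOutput")


def expected_metadata_path(pdf_name, output_dir=OUTPUT_DIR):
    """Return where the resolved metadata for pdf_name is expected.

    Assumption: the file is named <stem>.meta.xml, as in the console
    example (ADA392625.pdf -> ADA392625.meta.xml).
    """
    stem = PureWindowsPath(pdf_name).stem
    return PureWindowsPath(output_dir) / "resolved" / f"{stem}.meta.xml"


if __name__ == "__main__":
    print(expected_metadata_path("ADA392625.pdf"))
```

A document whose metadata is not written to the resolved directory will not have a file at this path, so a missing file is not by itself a sign that the software has stopped.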
The output directory has the following sub-directories:

Fig 2-1 Partial directory structure of the output

(Note: Some of the sub-directories shown above are created only when needed. For example, if no document fails the processing, no untrusted sub-directory is created.)

The software attempts to monitor its own behavior to determine whether its output is likely to be correct. Metadata that the program believes to be trustworthy is placed in the resolved directory. Metadata that the program believes to be suspect, and that should be inspected and, if necessary, corrected by a human, is placed in the untrusted directory. (Future releases of this software will provide support for this correction process.) You should check both the resolved and untrusted directories for the extracted metadata files.

(Note: While the software is executing, it copies the original PDF files to C:\extractOutput\backup. PDF files that cannot be handled by the software are placed in C:\extractOutput\exception_pdf; these files should be checked by the user. Temporary files created by the software are placed in C:\extractOutput\tmp_pdf. During execution, the trace of the whole process is recorded in the log file C:\extractInstallation\extract_software\extractor.log.)

In (hopefully) rare circumstances, the software may be unable to generate any metadata for a document even though the document itself was handled. Depending on the nature of the problem, such documents are copied into either the error or the unresolved directory.

2.1.2 Software execution process monitoring

While the software is running, three windows should be checked to monitor whether it is running smoothly.
Command line (DOS) window: Check here that the metadata extraction software is working properly. The software keeps watching the input directory for new PDF files and pre-processes them; the OmniPage software then OCRs the documents, and metadata files are created and placed into the appropriate directories. If no new PDF files are supplied, the software stays in a watching state, and the console shows:

    -PdfPreprocessor: checking for pdf files
    -PdfPreprocessor: 0 pdf files found

It keeps checking for PDF files until you close the DOS window. Do not close the command-line window until you have made sure that all the input files have been processed.

Windows Explorer: To check that the PDF files are being processed by OmniPage and that output files are being generated, look at the directories C:\extractOutput\resolved and C:\extractOutput\untrusted.
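The Windows Explorer check can also be scripted. The following Python sketch, which is not part of the distributed software, counts the files in each output sub-directory named in this manual. The function name count_output_files is hypothetical, the default path is the output directory assumed in this manual, and a sub-directory that has not been created yet is reported as 0.

```python
from pathlib import Path

# Output sub-directories named in this manual.
SUBDIRS = ("resolved", "untrusted", "error", "unresolved", "exception_pdf")


def count_output_files(output_dir=r"C:\extractOutput"):
    """Count regular files in each known output sub-directory."""
    counts = {}
    for name in SUBDIRS:
        subdir = Path(output_dir) / name
        # Some sub-directories are created only when needed.
        if subdir.is_dir():
            counts[name] = sum(1 for p in subdir.iterdir() if p.is_file())
        else:
            counts[name] = 0
    return counts


if __name__ == "__main__":
    for name, count in count_output_files().items():
        print(f"{name}: {count}")
```

Running the script a few times while documents are being processed should show the counts for resolved and untrusted growing; counts that stay unchanged while input files are still waiting may be a reason to look at the console window.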