DTIC Metadata Extraction Software Package
Version 3.0
Digital Library Research Group, Old Dominion University

1. INTRODUCTION

This file provides instructions for installing and using the DTIC Metadata Extraction software. The software package is distributed as a set of Java JAR files and is itself platform independent. However, because the OCR software, OmniPage Pro 14, currently supports only Microsoft Windows, we provide installation instructions only for Windows.

2. SYSTEM REQUIREMENTS

To install and use the DTIC software, you need the following:

i. A Windows XP machine.
ii. Java J2SE (JDK)/JRE 1.5 or later. You can obtain J2SE (JDK)/JRE from http://java.sun.com/j2se/1.5.0/download.jsp
iii. OmniPage Pro 14 Office.

3. SOFTWARE INSTALLATION

Software installation and configuration has two parts: OmniPage OCR and the DTIC Metadata Extraction software.

[Fig 3-1 Components of software installation. Both components run on Windows XP with a Java runtime environment. OmniPage OCR installation and configuration: OCRs the PDF documents and transforms them into a format that the metadata extraction software can handle. Metadata Extraction software installation and configuration: extracts metadata based on the output files from OmniPage OCR.]

[Fig 3-2 Software installation and configuration roadmap:
    0  Windows XP with Java runtime environment
    1  Download software
    2  Extract software
    3  Copy dticsoftware & dticdocs to C:\
    4  Install OmniPage
    5  Metadata extraction software configuration
    6  OmniPage configuration]

You can change the configuration of the metadata extraction software (step 5) to suit different document collections; steps 1, 2, 3, 4, and 6 are one-time setup.

3.1 Metadata Extraction software installation

Download the software from http://dtic.cs.odu.edu/deliverables/deliverables.html. It is a zip file named dticsoftwarev3.0.zip. Save it on a local machine and extract the files to a directory.
It contains two subdirectories, “dticsoftware” and “dticdocs”. Copy these two directories to the root of the C:\ drive.

The Metadata Extraction software (C:\dticsoftware) consists of a set of JAR files and other files needed to extract metadata from PDF files. A batch file (metadataextraction.bat) is provided in the C:\dticsoftware directory to simplify execution.

The “C:\dticdocs” directory is where the input files reside and where output and intermediate files are generated. Make sure that there are three subdirectories under the C:\dticdocs directory:

    C:\dticdocs\dticsoftware_input_pdf
    C:\dticdocs\omnipage_pdf
    C:\dticdocs\omnipage_xml

The directory structure created for metadata extraction now looks like this:

    C:\
      dticdocs
        dticsoftware_input_pdf
        omnipage_pdf
        omnipage_xml
      dticsoftware
        authorityfiles
        executable
        form_templates
        non_form_templates
        non_form_templates_nasa
        pdftk
        validation_script
        validation_script_nasa
        xsl
        metadataextraction.bat

Fig 3-3 Directory structure of the software

3.2 OmniPage OCR installation and configuration

The following procedure configures OmniPage Pro 14.0 to OCR the documents. Make sure the following two directories have been created by the installation of the metadata extraction software:

    C:\dticdocs\omnipage_pdf
    C:\dticdocs\omnipage_xml

Step 1: Start the “OmniPage Batch Manager”. Select “Fresh Start” and click “Next”.
Step 2: Select “Load Image Files” and click “Next”.
Step 3: Click “Folders” in the “Workflow Assistant” window. This opens the “Browse for Folder” window. Select the directory “c:\dticdocs\omnipage_pdf” from the tree structure, and enter “*.pdf” in the “Files of type” text box. Click “OK”.
Step 4: Make sure that the “Watch folders for incoming image files of the specified types” check box is selected.
Step 5: Select “Recognize Images” and click “Next”.
Step 6: Accept the defaults and click “Next”.
Step 7: Select “Save” and click “Next”.
Step 8: Select “Text” in the “Save as” option, “XML (*.xml)” in the “File type” drop-down list, and “Create a new file for each image file” in “File options”. Select the path “C:\dticdocs\omnipage_pdf\*.pdf” under “Input”, click “Specify Output Folder”, and browse to the output folder, which should be “C:\dticdocs\omnipage_xml”. Make sure that “C:\dticdocs\omnipage_xml\<original filename>.xml” appears under the “Output <multiple files, name after original>” column of the table.
Step 9: Select “Finish Job” and click “Next”.
Step 10: Specify “pdf to xml” as the “Job name”, select the “Delete input image files” check box, and accept the defaults for the other options. Click “Finish”. The OmniPage Batch Manager should now be running.
Step 11: Click “Tools”, then “Options” on the menu, and set “maximum number of pages in output documents” to 50.

4. METADATA EXTRACTION PROCESS

Please refer to Figure 5-1, METADATA EXTRACTION PROCESS, below for an overview of the process.

4.1 Verifying the setup

1. Two directories should exist on the C:\ drive:

    C:\dticdocs        where all the PDF documents and generated XML documents reside.
    C:\dticsoftware    where all the software resides.

2. The directory C:\dticdocs should contain the following directories:

    C:\dticdocs\dticsoftware_input_pdf
    C:\dticdocs\omnipage_xml
    C:\dticdocs\omnipage_pdf

   Note: Once execution starts, some temporary directories will be created.

3. Make sure to update the c:\dticsoftware\config.properties file for the document collection being processed (DTIC or NASA).

   For the DTIC document collection, set:

    non_form_templates_dir = c:\\dticsoftware\\non_form_templates
    non_form_validation_spec_dir = c:\\dticsoftware\\validation_script
    collection_type = dtic

   For the NASA document collection, set:

    non_form_templates_dir = c:\\dticsoftware\\non_form_templates_nasa
    non_form_validation_spec_dir = c:\\dticsoftware\\validation_script_nasa
    collection_type = nasa

4.
Double-check the instructions in Section 3.2, OmniPage OCR installation and configuration, to make sure that OmniPage OCR is set up correctly.

4.2 Software Execution

4.2.1 Software execution process and monitoring

Now it is time to start the software. Run the batch file “C:\dticsoftware\metadataextraction.bat” (for example, by double-clicking it). The batch file starts execution and constantly monitors the directory “C:\dticdocs\dticsoftware_input_pdf” for new PDF files.

The output on the console looks like this:

    C:\dticsoftware>java -jar c:\dticsoftware\executable\metadata_extraction_helper.jar c:\dticsoftware\log4j.properties c:\dticsoftware\config.properties
    Trying to configure the post processor
    Completed …..

The OmniPage software will show that it is now watching the folder C:\dticdocs\omnipage_pdf for incoming PDF files.

Place PDF documents in the C:\dticdocs\dticsoftware_input_pdf directory to start the extraction process. After some PDF files are dropped there, the software pre-processes the input PDF files and then returns to monitoring the directory “C:\dticdocs\dticsoftware_input_pdf”. The output on the console looks like this:

    C:\dticsoftware>java -jar c:\dticsoftware\executable\metadata_extraction_helper.jar c:\dticsoftware\log4j.properties c:\dticsoftware\config.properties
    Trying to configure the post processor
    Completed ..........-PdfPreprocessor: 5 pdf files found
    -PdfPreprocessor: preprocess file: ADA396592.pdf
    -PdfPreprocessor: preprocess file: ADA402550.pdf
    -PdfPreprocessor: preprocess file: ADA424733.pdf
    -PdfPreprocessor: preprocess file: ADA445345.pdf
    -PdfPreprocessor: preprocess file: ADA445417.pdf
    pdfs ready to be processed by omnipage

The OmniPage software then begins to OCR these documents (in C:\dticdocs\omnipage_pdf). OmniPage is now in the “Running” state. You can see that files are “recognized” and then “exported” to the C:\dticdocs\omnipage_xml directory.

The metadata extractor (the second module of the software) then begins to extract metadata automatically. It has two parts:
Form metadata extractor: extracts metadata from original PDF files that contain a standard form.
Non-form metadata extractor: extracts metadata from original PDF files that do not contain a standard form.

The output on the console then looks like this:

    ...............--- File: ADA396592 started.
    ...........-file unresolved
    --- File: ADA402550 started.
    .Using template to extract metadata for non-form documents.
    .......---File: ADA396592.xml finished: in directory c:\dticdocs\dticsoftware_output\unresolved\output
    ...chose template: sf298_1
    --- File: ADA402550 finished: in directory c:\dticdocs\dticsoftware_output\resolved\meta\ADA402550.xml
    --- File: ADA424733 started.
    ..........chose template: sf298_2
    --- File: ADA424733 finished: in directory c:\dticdocs\dticsoftware_output\resolved\meta\ADA424733.xml
    --- File: ADA445345 started.
    ............-file unresolved
    Using template to extract metadata for non-form documents.
    Error SXXP0003: Error reported by XML parser: Premature end of file.
    Error SXXP0003: Error reported by XML parser: Premature end of file.
    [Fatal Error] :-1:-1: Premature end of file.
    [Fatal Error] :-1:-1: Premature end of file.
    .---File: ADA445345.xml finished: in directory c:\dticdocs\dticsoftware_output\unresolved\output
    --- File: ADA445417 started.
    ..........-file unresolved
    .Using template to extract metadata for non-form documents.
    .---File: ADA445417.xml finished: in directory c:\dticdocs\dticsoftware_output\unresolved\output
    .............

(Note: some errors may be printed to the console. They can be ignored; the metadata extractor is working correctly as long as metadata files are written to the corresponding directories.)

The OmniPage software then returns to the watching state, waiting for new files in the C:\dticdocs\omnipage_pdf directory.

While files are being processed, you can check the C:\dticdocs\dticsoftware_output directory for the extracted metadata files. The output directory structure looks as follows:

    C:\
      dticdocs
        dticsoftware_output
          resolved
            Failed
            idm
            idm_backup
            omnipage
            meta
            output        (metadata for documents with a form)
          unresolved
            idm
            idm_backup
            omnipage
            output        (metadata for documents without a form)

Fig 3-4 Directory structure of the output

You should check three directories to get the extracted metadata files:

1. For original PDF documents that contain a standard form:

    Check C:\dticdocs\dticsoftware_output\resolved\output
    Check C:\dticdocs\dticsoftware_output\resolved\failed

   Note: A post-processor (a module of the software) validates some of the fields of the metadata files against an authority file (C:\dticsoftware\authorityfiles\authority.xml). If the post-processing step succeeds, the metadata file is moved to the C:\dticdocs\dticsoftware_output\resolved\output directory. If the post-processing step fails, the metadata file is moved to the C:\dticdocs\dticsoftware_output\resolved\failed directory. You should therefore check both directories to get the metadata files for original PDF documents that contain a standard form.

2. For original PDF documents that do not contain a standard form:

    Check C:\dticdocs\dticsoftware_output\unresolved\output

------------------------------------------------------------
(Note: While the software is executing, it copies the PDF files to c:\dticdocs\backup, where you can find all the original PDF files. PDF files that cannot be handled by the software are placed in the c:\dticdocs\exception_pdffile directory (these files should be checked by the user). c:\dticdocs\tmpfile is the directory where temporary files created by the software are placed. The software places its output in the directory “C:\dticdocs\dticsoftware_output”. During execution, the execution trace of the whole process is recorded in log files, which can be found at C:\dticsoftware\metaextract.
)

4.2.2 Software execution process monitoring

While the software is running, check three windows to monitor whether it is running smoothly.

Command line (DOS) window: Check that the metadata extraction software is working properly. It keeps watching the C:\dticdocs\dticsoftware_input_pdf directory for new PDF files and pre-processes those files; the OmniPage software then begins to OCR them, and metadata files are created and placed into the different output directories. If you do not add new PDF files, the software stays in a watching state and keeps checking for PDF files until you close the DOS window. Do not close the command-line window until you are sure that all the input files have been processed. While it is watching, the console output looks like this:

    -PdfPreprocessor: checking for pdf files
    -PdfPreprocessor: 0 pdf files found

OmniPage software: Check that the OmniPage software is running correctly. Most of the time, OmniPage works fine, but you should check whether PDF files are being “recognized” and “exported” by OmniPage to make sure. Sometimes OmniPage shows the “waiting” state although it is OCRing the PDF files; press F5 to check whether files are being “recognized” and “exported”. You can also check whether the PDF files in C:\dticdocs\omnipage_pdf disappear periodically after you have placed new PDF files in the C:\dticdocs\dticsoftware_input_pdf directory:

    C:\dticdocs\dticsoftware_input_pdf
        Put the input PDF files in this directory (each file may have hundreds of pages).

    C:\dticdocs\omnipage_pdf
        The PDF files that OmniPage will process (the first and last five pages of the original PDF files). Check this directory and the OmniPage software to make sure that the OmniPage step is working smoothly.

Windows Explorer: Check whether the PDF files are being processed by OmniPage and output files are being generated in these two directories.
    C:\dticdocs\dticsoftware_output\resolved
    C:\dticdocs\dticsoftware_output\unresolved\output

4.2.3 Software Execution Troubleshooting

You may encounter the following problems:

i. No response from the OmniPage OCR software

   If the OmniPage OCR software does not respond for a long time even though there are still files in the C:\dticdocs\omnipage_pdf directory, move all of these files to another local directory and then copy them back to C:\dticdocs\omnipage_pdf to see whether OmniPage enters the “Running” state. If it still hangs for a long time, stop this OmniPage job and create a new one following the instructions in Section 3.2, OmniPage OCR installation and configuration.

ii. The OmniPage OCR software shows an error

   Once OmniPage enters an error state, it continues to show “Error” from then on. Most of the time this is harmless. Watch the directories to see whether files are disappearing from the C:\dticdocs\omnipage_pdf directory, and press F5 to refresh the output information on the right side of the OmniPage window to see whether PDF files are being “recognized” and “exported”.

5. Execution Flow

The complete process of metadata extraction is shown in the figure below.

[Fig 5-1. METADATA EXTRACTION PROCESS. The flowchart shows these stages and artifacts: Input Documents; Extract 1st & last 5 pages; Reduced PDF; Original PDF Backup; OCR; OmniPage XML; Form Templates (sf298_1, sf298_2, ...); Form Processor; Resolved and Unresolved documents; Convert to CleanXML; IDM; Nonform Templates (au, eagle, ...); Extract Metadata; Candidate Metadata Sets; Validation Script; Select Best Metadata; Selected Metadata; Authority File (Permitted Values); Post Processor; Cleaned Metadata; Final Form Output; Final Nonform Output.]

The PDF documents are fed as input to the process, which reduces each file by extracting only the first five and last five pages and also copies the original to the backup directory.
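The page-reduction step described above can be illustrated with a short Python sketch. It only computes which page numbers a "first five and last five pages" reduction would keep; it does not read or write PDF files. The function name pages_to_keep and its parameters are hypothetical and are not part of the DTIC software. How the software treats documents with ten or fewer pages is not stated in this document, so keeping each page exactly once in that case is an assumption.

```python
def pages_to_keep(page_count, head=5, tail=5):
    """Return the 1-based page numbers kept by a first-N/last-N reduction.

    Hypothetical helper for illustration only; not part of the DTIC software.
    Assumption: when the head and tail ranges overlap (short documents),
    each page is kept exactly once.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    # Pages 1..head, capped at the document length.
    first = range(1, min(head, page_count) + 1)
    # The last `tail` pages, never starting before page 1.
    last = range(max(page_count - tail, 0) + 1, page_count + 1)
    # A set removes pages that fall in both ranges; sorting restores order.
    return sorted(set(first) | set(last))


if __name__ == "__main__":
    print(pages_to_keep(200))  # a long report
    print(pages_to_keep(7))    # a short document where the ranges overlap
```

For a 200-page report, this selects pages 1 to 5 and 196 to 200, which matches the reduced PDF described above.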
The reduced PDFs are then consumed by OmniPage OCR to produce OmniPage XML files. Once the OmniPage XML files are produced, the “Form Processor” applies each XML form template to the XML to find the form in the document. If it finds the form, the same XML template file is used to extract the metadata. The extracted metadata is saved in an XML file (metadata format) in the output directory inside the C:\dticdocs\dticsoftware_output\resolved folder.

If no form is found in the document, or the metadata cannot be extracted properly, the file is passed on to the non-form and validation part of the software, which extracts the metadata using non-form templates and validation scripts. The extracted metadata is saved in the C:\dticdocs\dticsoftware_output\unresolved\output folder.

The extracted metadata output files are therefore saved in two directories:

    C:\dticdocs\dticsoftware_output\resolved\output      -- metadata for documents with a form
    C:\dticdocs\dticsoftware_output\unresolved\output    -- metadata for documents without a form

6. Preparing new Templates

The software ships with several form templates and non-form templates (a template is a file consisting of XML rules for metadata extraction), which are in the following directories:

    C:\dticsoftware\form_templates             -- templates for document collections with a form
    C:\dticsoftware\non_form_templates         -- templates for the DTIC document collection without a form
    C:\dticsoftware\non_form_templates_nasa    -- templates for the NASA document collection without a form

To write a template for a new kind of document, study one of the existing templates together with its corresponding form in the PDF file (for documents with a form) or the corresponding document structure (for documents without a form). This will give you an idea of how the templates are written. Start by modifying an existing template’s rules. Place the new template in the corresponding template directory.
The software automatically picks up the templates from this directory.

7. SUPPORT

E-mail: <dtic@cs.odu.edu>