DTIC User Conference - Extracting Metadata And Structure

The ODU Metadata Extraction Project
March 28, 2007
Dr. Steven J. Zeil
zeil@cs.odu.edu
Outline
1. Overview
2. Recent Developments
A. Independent Document Model
B. Validation
C. Diversifying – NASA & GPO collections
3. New Issues & Future Directions
A. Post-processing
B. Image-Based Classification
1. Overview
[Diagram: processing pipeline]
• Input documents (PDF) → Input Processing & OCR → XML model of document
• XML model + form templates (sf298_1, sf298_2, ...) → Form Processing → extracted metadata, or unresolved documents
• Unresolved documents + nonform templates (au, eagle, ...) → Nonform Processing → extracted metadata
• Extracted metadata → Post-Processing → cleaned metadata → Validation
• Validation → trusted outputs → final metadata output
• Validation → untrusted metadata outputs → Human Review & Correction → corrected metadata → final metadata output
Input Processing & OCR
• Select pages of interest
• Apply Off-The-Shelf OCR software
• Convert OCR output to XML model format
Form Processing
• Scan document for form names
– Select form template
• Apply form extraction engine to document
and template
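The form-name scan above can be sketched as a simple phrase lookup. This is a minimal illustration, not the project's engine: the marker phrases are placeholders (the slides do not list the real form-name strings), and `select_form_template` is a hypothetical helper name.

```python
from typing import Dict, List, Optional

# Placeholder marker phrases: the real form-name strings for each
# template are not given in the slides, so these are assumptions.
FORM_MARKERS: Dict[str, List[str]] = {
    "sf298_1": ["example form name one"],
    "sf298_2": ["example form name two"],
}


def select_form_template(
    page_texts: List[str],
    markers: Dict[str, List[str]] = FORM_MARKERS,
) -> Optional[str]:
    """Return the first template whose marker phrase occurs on any page.

    Returns None when no form name is found, which corresponds to an
    "unresolved" document that is passed on to non-form processing.
    """
    for text in page_texts:
        # Collapse whitespace and ignore case so OCR line breaks do not matter.
        normalized = " ".join(text.lower().split())
        for template, phrases in markers.items():
            if any(phrase.lower() in normalized for phrase in phrases):
                return template
    return None


pages = ["cover page text", "REPORT DOCUMENTATION PAGE\nExample Form  Name Two"]
print(select_form_template(pages))
```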
Sample RDP
Sample RDP (cont.)
Metadata Extracted
from Sample RDP (1/3)
<metadata templateName="sf298_2">
<ReportDate>18-09-2003</ReportDate>
<DescriptiveNote>Final Report</DescriptiveNote>
<DescriptiveNote>1 April 1996 - 31 August 2003</DescriptiveNote>
<UnclassifiedTitle>VALIDATION OF IONOSPHERIC MODELS</UnclassifiedTitle>
<ContractNumber>F19628-96-C-0039</ContractNumber>
<ContractNumber></ContractNumber>
<ProgramElementNumber>61102F</ProgramElementNumber>
<PersonalAuthor>Patricia H. Doherty Leo F. McNamara
Susan H. Delay Neil J. Grossbard</PersonalAuthor>
<ProjectNumber>1010</ProjectNumber>
<TaskNumber>IM</TaskNumber>
<WorkUnitNumber>AC</WorkUnitNumber>
<CorporateAuthor>Boston College / Institute for Scientific Research 140
Commonwealth Avenue Chestnut Hill, MA 02467-3862</CorporateAuthor>
Metadata Extracted
from Sample RDP (2/3)
<ReportNumber></ReportNumber>
<MonitorNameAndAddress>Air Force Research Laboratory 29 Randolph
Road Hanscom AFB, MA 01731-3010</MonitorNameAndAddress>
<MonitorAcronym>VSBP</MonitorAcronym>
<MonitorSeries>AFRL-VS-TR-2003-1610</MonitorSeries>
<DistributionStatement>Approved for public release; distribution
unlimited.</DistributionStatement>
<Abstract>This document represents the final report for work
performed under the Boston College contract F I9628-96C-0039. This
contract was entitled Validation of Ionospheric Models. The
objective of this contract was to obtain satellite and ground-based
ionospheric measurements from a wide range of geographic locations
and to utilize the resulting databases to validate the theoretical
ionospheric models that are the basis of the Parameterized Real-time
Ionospheric Specification Model (PRISM) and the Ionospheric Forecast
Model (IFM). Thus our various efforts can be categorized as either
observational databases or modeling studies.</Abstract>
Metadata Extracted
from Sample RDP (3/3)
<Identifier>Ionosphere, Total Electron Content (TEC), Scintillation,
Electron density, Parameterized Real-time Ionospheric Specification
Model (PRISM), Ionospheric Forecast Model (IFM), Paramaterized
Ionosphere Model (PIM), Global Positioning System
(GPS)</Identifier>
<ResponsiblePerson>John Retterer</ResponsiblePerson>
<Phone>781-377-3891</Phone>
<ReportClassification>U</ReportClassification>
<AbstractClassification>U</AbstractClassification>
<AbstractLimitaion>SAR</AbstractLimitaion>
</metadata>
Non-Form Processing
• Classification – compare document
against known document layouts
– Select template written for closest matching
layout
• Apply non-form extraction engine to
document and template
Non-Form Sample (1/2)
Non-Form Sample (2/2)
Template Used for Sample
Document
<structdef pagenumber="1" templateID="au">
<identifier min="1" max="1">
<begin inclusive="current">
<stringmatch case="yes" loc="beginwith">AU/</stringmatch>
</begin>
<end>onesection</end>
</identifier>
<CorporateAuthor min="1" max="1">
<begin inclusive="current">
<stringmatch case="no" loc="beginwith">
AIR COMMAND | AIR WAR
</stringmatch>
</begin>
<end inclusive="current">
<stringmatch case="no" loc="beginwith">AIR UNIVERSITY</stringmatch>
</end>
</CorporateAuthor>
<UnclassifiedTitle min="1" max="1">
<begin inclusive="after">CorporateAuthor</begin>
<end inclusive="before">
<stringmatch case="no" loc="beginwith">by</stringmatch>
</end>
</UnclassifiedTitle>
…
Metadata Extracted From the
Title Page of the Sample Document
<paper templateid="au">
<identifier>AU/ACSC/012/1999-04</identifier>
<CorporateAuthor>AIR COMMAND AND STAFF COLLEGE
AIR UNIVERSITY</CorporateAuthor>
<UnclassifiedTitle>INTEGRATING COMMERCIAL
ELECTRONIC EQUIPMENT TO IMPROVE
MILITARY CAPABILITIES
</UnclassifiedTitle>
<PersonalAuthor>Jeffrey A. Bohler LCDR, USN</PersonalAuthor>
<advisor>Advisor: CDR Albert L. St.Clair</advisor>
<ReportDate>April 1999</ReportDate>
</paper>
Post-Processing
• Coerce extracted values into standard
formats
Validation
• Estimate quality of extracted metadata
• Untrusted outputs are referred to humans for
review and correction
2. Recent Developments
A. Independent Document Model
B. Validation
C. Diversifying – NASA and GPO
Collections
A. Independent Document Model (IDM)
• Platform-independent document model
• Motivation
– Dramatic XML schema change between OmniPage 14 and 15
– Ties the template engine to a stable specification
– Avoids coupling the engine directly to a specific OCR product
– Allows us to include statistics for enhanced feature usage
• Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)
Documents in IDM
• A document consists of pages
• Pages are divided into regions
• Regions may be divided into
– blocks of vertical whitespace
– paragraphs
– tables
– images
• Paragraphs are divided into lines
• Lines are divided into words
• All of these carry standard attributes for size, position, font, etc.
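The containment hierarchy described here can be sketched with plain data classes. This is a simplified illustration only: the class and function names are assumptions, regions hold only paragraphs (tables, images, and whitespace blocks are omitted), and font size is the only attribute carried.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Word:
    text: str
    font_size: float  # in points; the only attribute modeled in this sketch


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)


@dataclass
class Region:
    # Simplification: tables, images, and whitespace blocks are left out.
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Page:
    regions: List[Region] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


def page_words(page: Page) -> Iterator[Word]:
    """Yield every word on a page by walking regions, paragraphs, and lines."""
    for region in page.regions:
        for paragraph in region.paragraphs:
            for line in paragraph.lines:
                yield from line.words


def avg_page_font_size(page: Page) -> float:
    """Mean font size over the words of one page (0.0 for an empty page)."""
    sizes = [w.font_size for w in page_words(page)]
    return sum(sizes) / len(sizes) if sizes else 0.0


def avg_doc_font_size(doc: Document) -> float:
    """Mean font size over all words of the document (0.0 if it has none)."""
    sizes = [w.font_size for p in doc.pages for w in page_words(p)]
    return sum(sizes) / len(sizes) if sizes else 0.0


def word_count(doc: Document) -> int:
    """Total number of words in the document."""
    return sum(1 for p in doc.pages for _ in page_words(p))
```

Statistics such as these let a template compare a line against the page or document average, for example to spot a title set in larger type.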
Generating IDM
• Use XSLT 2.0 stylesheets to transform OCR output into IDM
– Supporting a new OCR schema only requires writing a new XSLT
stylesheet; the engine does not change
IDM Usage
[Diagram: IDM usage]
• OmniPage 14 XML doc → docTreeModelOmni14.xsl → IDM XML doc
• OmniPage 15 XML doc → docTreeModelOmni15.xsl → IDM XML doc
• Other OCR output XML doc → docTreeModelOther.xsl → IDM XML doc
• IDM XML doc → Form-Based Extraction and Non-Form Extraction
IDM Tool Status
• Converters completed to generate IDM from OmniPage 14 and 15 XML
– OmniPage 15 proved to have numerous errors in its
representation of an OCR’d document
– Consequently, OmniPage 15 is not recommended
• Form-based extraction engine revised to work from IDM
• Non-form engine still works from our older “CleanXML”
– converter from IDM to CleanXML completed as a stop-gap
measure
– direct use of IDM deferred pending review of other engine
modifications
B. Validation
• Given a set of extracted metadata
– mark each field with a confidence value indicating how
trustworthy the extracted value is
– mark the set with a composite confidence score
• Fields and sets with low confidence scores may be
referred for additional processing
– automated post-processing
– human intervention and correction
Validating Extracted Metadata
• Techniques must be independent of the extraction
method
• A validation specification is written for each collection,
combining
• Field-specific validation rules
– statistical models derived for each field of
• text length
• % of words from English dictionary
• % of phrases from knowledge base prepared for
that field
– pattern matching
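Two of the field-level tests listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the project's statistical model: the Gaussian-shaped length score and the function names are hypothetical, and the training lengths and dictionary would come from the collection.

```python
import math
from statistics import mean, stdev
from typing import Iterable, Sequence


def length_confidence(value: str, training_lengths: Sequence[int]) -> float:
    """Score in [0, 1] for how typical the text length is for this field.

    Assumption: lengths seen in training are summarized by mean and
    standard deviation, and the score decays with the z-score.
    Requires at least two training lengths.
    """
    mu = mean(training_lengths)
    sigma = stdev(training_lengths)
    if sigma == 0:
        return 1.0 if len(value) == mu else 0.0
    z = (len(value) - mu) / sigma
    return math.exp(-0.5 * z * z)


def dictionary_fraction(value: str, dictionary: Iterable[str]) -> float:
    """Fraction of words in the value that appear in the given dictionary."""
    known = {w.lower() for w in dictionary}
    words = [w.strip(".,;:()").lower() for w in value.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in known for w in words) / len(words)
```

A value whose length matches the training mean scores 1.0 on the length test, and the score falls toward 0 as the length becomes atypical for that field.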
Sample Validation Specification
• Combines results from multiple fields
<val:validate collection="dtic"
xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary"
>
<val:average>
<val:field name="UnclassifiedTitle">...</val:field>
<val:field name="PersonalAuthor">...</val:field>
<val:field name="CorporateAuthor">...</val:field>
<val:field name="ReportDate">...</val:field>
</val:average>
</val:validate>
Validation Spec: Field Tests
• Each field is subjected to one or more tests
…
<val:field name="PersonalAuthor">
<val:average>
<val:length/>
<val:max>
<val:phrases length="1"/>
<val:phrases length="2"/>
<val:phrases length="3"/>
</val:max>
</val:average>
</val:field>
<val:field name="ReportDate">
<val:reportFormat/>
</val:field>
...
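The nesting of val:average and val:max in the specification above suggests that tests are combined by small combinators. A minimal sketch, assuming that average means the arithmetic mean of the nested scores and max means the highest nested score; the stand-in tests below are placeholders, not the project's length or phrase tests.

```python
from typing import Callable

# A test maps a field value to a confidence in [0, 1].
Test = Callable[[str], float]


def average(*tests: Test) -> Test:
    """Combine tests by taking the arithmetic mean of their scores."""
    return lambda value: sum(t(value) for t in tests) / len(tests)


def maximum(*tests: Test) -> Test:
    """Combine tests by taking the highest of their scores."""
    return lambda value: max(t(value) for t in tests)


# Stand-in tests (assumptions for illustration only).
def length_ok(value: str) -> float:
    return 1.0 if 5 <= len(value) <= 60 else 0.0


def phrase_test(score: float) -> Test:
    return lambda value: score


# Mirrors the shape of the PersonalAuthor rule: average(length, max(phrases...)).
personal_author = average(
    length_ok,
    maximum(phrase_test(0.2), phrase_test(0.6), phrase_test(0.4)),
)

print(round(personal_author("Kathleen A. Kerrigan"), 3))
```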
Sample Input Metadata Set
<metadata>
<UnclassifiedTitle>Thesis Title: The Military
Extraterritorial Jurisdiction
Act</UnclassifiedTitle>
<PersonalAuthor>Name of Candidate: LCDR
Kathleen A. Kerrigan</PersonalAuthor>
<ReportDate>Accepted this 18th day of June 2004
by:</ReportDate>
</metadata>
Sample Validator Output
<metadata confidence="0.522">
<UnclassifiedTitle confidence="0.943">Thesis Title: The
Military Extraterritorial Jurisdiction
Act</UnclassifiedTitle>
<PersonalAuthor confidence="0.622">Name of
Candidate: LCDR Kathleen A.
Kerrigan</PersonalAuthor>
<ReportDate confidence="0.0" warning="ReportDate field
does not match required pattern">Accepted this 18th day
of June 2004 by:</ReportDate>
</metadata>
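The composite score in this output appears consistent with an unweighted mean of the three field scores present. That reading is an inference from the numbers, not something the slides state. A quick check:

```python
# Field confidences copied from the sample validator output above.
field_confidences = {
    "UnclassifiedTitle": 0.943,
    "PersonalAuthor": 0.622,
    "ReportDate": 0.0,
}

# Assumption: the set-level score is the plain mean of the fields present.
composite = sum(field_confidences.values()) / len(field_confidences)
print(f"{composite:.3f}")
```

Under this reading, the single failed ReportDate pattern match pulls the set score down to about 0.52 even though the title scored well.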
Classification (a priori)
[Diagram: a priori classification]
• Unresolved document (CleanXML) + nonform templates (au, eagle, ...) → Classify (select best template) → selected template
• Selected template → Extract Metadata → extracted metadata → final nonform output
• Previously, we had attempted various schemes for a priori
classification
– x-y trees
– bin classification
• Still investigating some
– image-based recognition
Post-Hoc Classification
[Diagram: post-hoc classification]
• Unresolved document (CleanXML) + nonform templates (au, eagle, ...) → Extract Metadata → candidate metadata sets
• Candidate metadata sets + validation rules (from validation spec.) → Select Best Metadata → selected metadata → final nonform output
• Apply all templates to document
– results in multiple candidate sets of metadata
• Score each candidate using the validator
– Select the best-scoring set
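The post-hoc selection loop can be sketched in a few lines. This is a minimal illustration: `select_best`, the callable-based template representation, and the toy scoring function are assumptions standing in for the real extraction engine and validator.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]
Extractor = Callable[[str], Metadata]   # stands in for "apply template to document"
Scorer = Callable[[Metadata], float]    # stands in for the validator


def select_best(
    document: str,
    templates: Dict[str, Extractor],
    score: Scorer,
) -> Tuple[str, Metadata, float]:
    """Apply every template, score each candidate set, return the best one."""
    if not templates:
        raise ValueError("at least one template is required")
    best_name, best_metadata, best_score = "", {}, float("-inf")
    for name, extract in templates.items():
        candidate = extract(document)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_name, best_metadata, best_score = name, candidate, candidate_score
    return best_name, best_metadata, best_score


# Toy usage: score by the fraction of non-empty fields.
def fraction_filled(metadata: Metadata) -> float:
    return sum(bool(v) for v in metadata.values()) / len(metadata)


toy_templates = {
    "au": lambda doc: {"title": doc, "author": ""},
    "eagle": lambda doc: {"title": doc, "author": "J. Smith"},
}
print(select_best("Some Title", toy_templates, fraction_filled)[0])
```

The design trades extra extraction work (every template runs on every document) for not needing a separate layout classifier up front.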
Experimental Results
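Number of documents, by manually assigned class (rows) and validator-preferred template (columns):

Manually assigned class | Au | Eagle | Rand | Title | Total
Au                      | 86 |   0   |   0  |   0   |  86
Eagle                   |  0 |   8   |  33  |   4   |  45
Rand                    |  0 |   0   |   8  |   4   |  12
Title                   |  0 |   0   |   1  |  23   |  24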
Interpretation of Results
• Validator agreed with human on 125 out of 167
cases
• Of 42 cases where they disagreed
– 37 were due to “extra” words in extracted metadata
(e.g., military ranks in author names)
• highlights need for post-processing to clean up metadata
– 2 were mistakes by template
– 2 were due to garbled characters by OCR
– 1 due to a bug in the validator
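The agreement figures can be rechecked from the results table. The sketch below assumes the validator-preferred columns appear in the same order as the rows (Au, Eagle, Rand, Title), which is the reading under which the diagonal sums to the reported 125.

```python
classes = ["Au", "Eagle", "Rand", "Title"]

# Rows: manually assigned class; columns: validator-preferred template
# (column order assumed to match the row order).
counts = [
    [86, 0, 0, 0],
    [0, 8, 33, 4],
    [0, 0, 8, 4],
    [0, 0, 1, 23],
]

total = sum(sum(row) for row in counts)
agreed = sum(counts[i][i] for i in range(len(classes)))
disagreed = total - agreed

# Causes of disagreement as listed on the slide.
causes = {
    "extra words in extracted metadata": 37,
    "template mistakes": 2,
    "characters garbled by OCR": 2,
    "validator bug": 1,
}

print(f"{agreed} of {total} agree ({agreed / total:.1%}); {disagreed} disagree")
```

Most of the disagreement sits in the Eagle row, where 33 of 45 documents were scored higher under the Rand template.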
C. Diversifying – NASA and GPO
Collections
Document collections differ in
• whether forms are used and form layout
• document layout
• what metadata fields are present & which
ones are collected
Changing Collections
• Porting to a new document collection
– identify pages of interest
– training classifiers to recognize new document layouts
(?)
– templates for forms & document layouts
– new validation scripts
• collect statistics for collection model
– new post-processing rules
• No changes required to core engines & other
software
NASA Technical Reports
• Different layouts than DTIC
– fewer total
– tend to be visually more similar
– mixture with and without RDPs
NASA Sample Document
Extracted Metadata
for NASA Sample
<paper templateid="singleAuthor">
<metadata>
<UnclassifiedTitle>
A Computationally Efficient Meshless Local Petrov-Galerkin Method
for Axisymmetric Problems
</UnclassifiedTitle>
<PersonalAuthor>
I.S. Raju* and T. Chen?
</PersonalAuthor>
<CorporateAuthor>
NASA Langley Research Center
Hampton, VA 23681
</CorporateAuthor>
<Abstract>
The Meshless Local Petrov-Galerkin (MLPG)
method is one of the recently developed element-free
…
Govt. Printing Office
• Congressional acts & reports
• EPA reports
• Preliminary study with Acts of Congress and EPA reports
• samples suggest layouts are more diverse
than DTIC or NASA
– metadata actually present in document varies
widely
GPO Sample – Act of Congress
Metadata Extracted for Act of
Congress
<paper>
<metadata>
<public_law_report_num>
118 STAT. 3984 PUBLIC LAW 108?493?DEC. 23, 2004
</public_law_report_num>
<bill_number>[H.R. 5394 ] components.</bill_number>
<congress_num>108th Congress</congress_num>
<type>An Act</type>
<acttype>
Dec. 23, 2004 To amend the Internal Revenue Code of 1986 to modify the
taxation of arrow
[H.R. 5394 ] components.
</acttype>
</metadata>
</paper>
GPO sample report
Metadata Extracted from GPO
Sample Report
<paper>
<metadata>
<title>
CHINA?S PROLIFERATION PRACTICES
AND ROLE IN THE NORTH KOREA CRISIS
</title>
<type>
HEARING BEFORE THE
U.S.-CHINA ECONOMIC AND SECURITY
REVIEW COMMISSION
</type>
<session>ONE HUNDRED NINTH CONGRESS FIRST SESSION</session>
<date>MARCH 10, 2005</date>
<use>
Printed for the use of the
U.S.-China Economic and Security Review Commission
</use>
<online>Available via the World Wide Web: http://www.uscc.gov</online>
</metadata>
</paper>
3. New Issues and Future
Directions
A. Post-Processing
B. Image-Based Classification
Post-processing
• WYSIWYG
– What You See is What You Get
• WYG != WYW
– What You Get is not What You Want
Example – DTIC Date Format
• Document may contain:
– March 28, 2007
– 3/28/2007
– 3/28/07
• DTIC requires:
– 28 MAR 2007
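A date-coercion rule of this kind might look like the sketch below. It handles only the three input styles shown on the slide; the function name is hypothetical, zero-padding of single-digit days is an assumption, and parsing of month names relies on an English locale.

```python
from datetime import datetime
from typing import Optional

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Input styles from the slide: "March 28, 2007", "3/28/2007", "3/28/07".
INPUT_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%m/%d/%y"]


def to_dtic_date(text: str) -> Optional[str]:
    """Return the date as 'DD MON YYYY', or None if no known format matches."""
    cleaned = " ".join(text.split())
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
        # Assumption: single-digit days are zero-padded.
        return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"
    return None


for sample in ["March 28, 2007", "3/28/2007", "3/28/07"]:
    print(sample, "->", to_dtic_date(sample))
```

Returning None for unparseable text, rather than guessing, leaves the value available for the validator to flag.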
Example – Personal Authors
Example – Personal Authors (cont.)
• We extract:
<PersonalAuthor>Patricia H. Doherty Leo F. McNamara Susan H.
Delay Neil J. Grossbard</PersonalAuthor>
• DTIC requires:
<PersonalAuthor>Patricia H. Doherty ;Leo F. McNamara ;Susan H.
Delay ;Neil J. Grossbard</PersonalAuthor>
• NASA requires
<author>Patricia H. Doherty</author>
<author>Leo F. McNamara</author>
<author>Susan H. Delay</author>
<author>Neil J. Grossbard</author>
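The two target formats can be produced from one list of names, as sketched below. The sketch assumes the names have already been split into a list; splitting the unseparated extracted string into individual names is the harder problem and is not attempted here. The function names are hypothetical.

```python
from typing import List
from xml.sax.saxutils import escape


def format_dtic_authors(authors: List[str]) -> str:
    """One PersonalAuthor element with names separated by ' ;' as on the slide."""
    joined = " ;".join(escape(name) for name in authors)
    return f"<PersonalAuthor>{joined}</PersonalAuthor>"


def format_nasa_authors(authors: List[str]) -> List[str]:
    """One author element per name."""
    return [f"<author>{escape(name)}</author>" for name in authors]


authors = ["Patricia H. Doherty", "Leo F. McNamara",
           "Susan H. Delay", "Neil J. Grossbard"]
print(format_dtic_authors(authors))
print("\n".join(format_nasa_authors(authors)))
```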
Post-Processing Requirements
• Post-processing rules must vary by
– metadata field
– collection
Post-Processing Architecture
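[Diagram: post-processing architecture]
• MetadataPostProcessor → registered TagProcessors, added via register (tag, processor)
• Collection-specific post-processors: DTIC PostProcessor, NASA PostProcessor, GPO PostProcessor
• Example TagProcessors: DTIC Dates, DTIC PersonalAuthors, NASA PersonalAuthors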
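The register (tag, processor) arrangement can be sketched as a small registry. This is an illustration under assumptions: the class names echo the diagram labels, but the method names, the subclass relationship, and the stand-in processor are hypothetical.

```python
from typing import Callable, Dict

TagProcessor = Callable[[str], str]


class MetadataPostProcessor:
    """Holds one processor per metadata tag and applies them to a record."""

    def __init__(self) -> None:
        self._processors: Dict[str, TagProcessor] = {}

    def register(self, tag: str, processor: TagProcessor) -> None:
        self._processors[tag] = processor

    def process(self, metadata: Dict[str, str]) -> Dict[str, str]:
        # Tags without a registered processor pass through unchanged.
        return {
            tag: self._processors[tag](value) if tag in self._processors else value
            for tag, value in metadata.items()
        }


def upper_trimmed(value: str) -> str:
    """Stand-in processor: collapse whitespace and upper-case the value."""
    return " ".join(value.split()).upper()


class DTICPostProcessor(MetadataPostProcessor):
    """Assumed design: a collection-specific class registers its own rules."""

    def __init__(self) -> None:
        super().__init__()
        self.register("ReportDate", upper_trimmed)


dtic = DTICPostProcessor()
print(dtic.process({"ReportDate": " 28 mar  2007 ", "Abstract": "unchanged"}))
```

Keying processors by tag lets the rules vary by field, while one subclass per collection lets them vary by collection, matching the two requirements stated earlier.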
Image-Based Classification
• filter to find likely candidates for
validator-based selection of template
• Looking at a variety of techniques
inspired by work in image recognition
Example: Image-Based
Classification
• Example: represent a page using
various colors to denote images, text,
bold text, etc.
• find visually most similar pages in
documents of known classes
– “vote” based on 5 most similar
documents
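The voting step can be sketched as a nearest-neighbor classifier. Everything here is a simplified stand-in: pages are reduced to small grids of category codes ("T" text, "B" bold text, "I" image, "." blank) rather than colored images, and the distance is the fraction of differing cells. Only the vote over the 5 most similar pages comes from the slide.

```python
from collections import Counter
from typing import List, Sequence, Tuple

PageGrid = Sequence[str]  # equal-length rows of category codes


def page_distance(a: PageGrid, b: PageGrid) -> float:
    """Fraction of grid cells whose category differs (grids must match in size)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("page grids must have the same dimensions")
    cells = sum(len(row) for row in a)
    differing = sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))
    return differing / cells if cells else 0.0


def classify(page: PageGrid, labeled: List[Tuple[PageGrid, str]], k: int = 5) -> str:
    """Vote among the k most similar pages of known class."""
    if not labeled:
        raise ValueError("at least one labeled page is required")
    nearest = sorted(labeled, key=lambda item: page_distance(page, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


known = [
    (["TTTT", "....", "BBBB"], "au"),
    (["TTTT", "....", "BBB."], "au"),
    (["TTT.", "....", "BBBB"], "au"),
    (["IIII", "IIII", "TTTT"], "eagle"),
    (["IIII", "IIII", "TTT."], "eagle"),
    (["IIII", "TTTT", "TTTT"], "eagle"),
]
print(classify(["TTTT", "....", "BBBB"], known))
```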
Visual Matching Example (1/2)
Visual Matching Example (2/2)
Conclusions
• Automated metadata extraction can be
performed effectively on a wide variety of
documents
– Coping with heterogeneous collections is a
major challenge
• Much attention must be paid to “support”
issues
– validation, post-processing, etc.