Metadata Extraction - Extracting Metadata And Structure

Defense Research & Engineering
Information for the Warfighter
Automated Metadata Extraction
April 5, 2006
Kurt Maly
maly@cs.odu.edu
Approved for Public Release
U.S. Government Work (17 U.S.C. §105): Not copyrighted in the U.S.
Outline
• Background and Motivation
• Challenges and Approaches
• Metadata Extraction Experience at ODU CS
• Architecture for Metadata Extraction
• Experiments with DTIC Documents
• Experiments with limited GPO Documents
• Conclusions
Digital Library Research at ODU
Digital Libraries: http://dlib.cs.odu.edu/
(Diagram: overview of ODU digital library projects)
• Content creation (new content, publication tools): Kepler, Compopt (NSF, US Navy); Arc/Archon (NSF)
• Content sharing (processing existing content, DTIC): Kepler (NSF); distributed P2P model (NSF); centralized model
• Harvesting (OAI-PMH, real time): LFDL, TRI, DL Grid (Andrew Mellon; NASA, LANL, Sandia)
• Secure DL (NSF, IBM)
Motivation
• Metadata enhances the value of a document collection
– Metadata aids resource discovery
• A company using metadata in its intranet may save about $8,200 per employee by reducing the time spent searching, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop)
– Metadata helps make collections interoperable via OAI-PMH
• Manual metadata extraction is costly and time-consuming
– Creating metadata for 1 million documents would take about 60 employee-years (estimate by Lou Rosenfeld at the DCMI 2003 workshop)
– Automatic extraction tools are essential for rapid dissemination at reasonable cost
• OCR alone is not sufficient for making 'legacy' documents searchable
Challenges
A successful metadata extraction system must:
• extract metadata accurately
• scale to large document collections
• cope with heterogeneity within a collection
• maintain accuracy, with minimal reprogramming/training cost, as the collection evolves over time
• have a validation/correction process
Approaches
• Machine Learning
– HMM
– SVM
• Rule-Based
– Ad Hoc
– Expert Systems
– Template-Based (ODU CS)
Comparison
• Machine-learning approach
– Good adaptability, but it must be trained from samples, which is very time-consuming
– Performance degrades with increasing heterogeneity
– Difficult to add new fields to be extracted
– Difficult to select the right features for training
• Rule-based approach
– No need for training from samples
– Can extract different metadata from different documents
– Rule writing may require significant technical expertise
Metadata Extraction Experience at ODU CS
• DTIC (2004, 2005)
– developed software to automate the task of
extracting metadata and basic structure from DTIC
PDF documents
• explored alternatives including SVM, HMM, expert
systems
• origin of the ODU template-based engine
• GPO (in progress)
• NASA (in progress)
– Feasibility study applying the template-based approach to the CASI collection
Meeting the Challenges
• All techniques achieved reasonable accuracy on small collections
– possible to scale to large homogeneous collections
• Heterogeneity remains a problem
– Ad hoc rule-based systems tend to become complex monoliths
– Expert systems tend toward large rule sets with complex, poorly understood interactions
– Machine learning must choose between reduced accuracy and confidence or state explosion
• Evolution is problematic for machine-learning approaches
– older documents may have a higher rate of OCR errors
– expensive retraining is required to accommodate changes in the collection
– potential lag time during which accuracy decays until sufficient training instances are acquired
• Validation: a largely unexplored area
– Machine-learning approaches offer some support via confidence measures
Architecture for Metadata Extraction
(Diagram: processing pipeline with a human-assisted feedback loop)
• Input: OCR output of scanned documents
• Document modelling: lexical analysis and semantic tagging
• Document classification; on failure, classification resolution and validation with human assistance
• For each document in class 1..M, metadata extraction using the matching template 1..M, each followed by validation
• Failed validations loop back through the human-assisted feedback loop; success yields the extracted metadata
Our Approach: Meeting the Challenges
• Bi-level architecture
– Classification based upon document similarity
– Simple templates (rule-based) written for each
emerging class
Our Approach: Meeting the Challenges
• Heterogeneity
– Classification, in effect, reduces the problem to multiple homogeneous collections
– Multiple templates are required, but each template is comparatively simple
• it only needs to accommodate one class of documents that share a common layout and style
• Evolution
– New classes of documents are accommodated by writing a new template
• templates are comparatively simple
• no lengthy retraining required
• potentially rapid response to changes in the collection
– The template engine is enriched by introducing new features that reduce template complexity
• Validation
– Exploring a variety of techniques drawn from automated software testing and validation
Metadata Extraction – Template-based
• Template-based approach
– Classify documents into classes based on similarity
– For each document class, create a template: a set of rules
– Rules are decoupled from code
• each template is kept in a separate file
• Advantages
– Easy to extend: for a new document class, just create a template
– Rules are simpler
– Rules can be refined easily
Template engine
(Diagram: scanned XML docs and a template for each class of documents feed the engine; an XML parser and data preprocessor prepare the input, the metadata extraction stage applies the template, and the metadata is output)
Document features
• Layout features
– Boldness: whether the text is in a bold font
– Font size: the point size of the text, e.g. 12 pt or 14 pt
– Alignment: whether the text is left-, right-, or center-aligned, or justified
– Geometric location: for example, a block starting at coordinates (0, 0) and ending at coordinates (100, 200)
– Geometric relation: for example, a block located below the title block
Document features
• Textual features
– Special words: for example, a string starting with "abstract"
– Special patterns: for example, a string matching the regular expression "[1-2][0-9][0-9][0-9]"
– Statistical features: for example, a string with more than 20 words, more than 100 letters, or more than 50% of its letters in upper case
– Knowledge features: for example, a string containing a last name from a name dictionary
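As a rough sketch, the textual feature tests above could be computed like this (the function name, feature names, and thresholds mirror the slide's examples and are illustrative, not the actual DTIC extractor's):

```python
import re

def textual_features(line: str) -> dict:
    """Compute the slide's example textual features for one text line."""
    words = line.split()
    letters = [c for c in line if c.isalpha()]
    upper = [c for c in letters if c.isupper()]
    return {
        # special word: the line begins with "abstract" (case-insensitive)
        "starts_with_abstract": line.strip().lower().startswith("abstract"),
        # special pattern: a four-digit year matching [1-2][0-9][0-9][0-9]
        "contains_year": bool(re.search(r"[1-2][0-9][0-9][0-9]", line)),
        # statistical features from the slide
        "more_than_20_words": len(words) > 20,
        "more_than_100_letters": len(letters) > 100,
        "mostly_uppercase": bool(letters) and len(upper) / len(letters) > 0.5,
    }

feats = textual_features("ABSTRACT OF REPORT 1998")
```

A real engine would combine these booleans with the layout features above when matching template rules against a line.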
Template language
• XML-based
• Related to document features
• Defined by an XML schema
• Simple document model
– document > page > zone > region > column > row > paragraph > line > word > character
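The document model above is a simple containment hierarchy. A minimal sketch of a few of its levels (class and attribute names are illustrative, not the actual engine's):

```python
from dataclasses import dataclass, field

# A few levels of the slide's document model:
# document > page > ... > paragraph > line > word > character.

@dataclass
class Line:
    words: list[str]
    font_size: int = 10   # layout features attached at the line level
    bold: bool = False

@dataclass
class Paragraph:
    lines: list[Line] = field(default_factory=list)

@dataclass
class Page:
    paragraphs: list[Paragraph] = field(default_factory=list)

@dataclass
class Document:
    pages: list[Page] = field(default_factory=list)

doc = Document(pages=[Page(paragraphs=[
    Paragraph(lines=[Line(words=["Automated", "Metadata", "Extraction"],
                          font_size=18, bold=True)])])])
```

Template rules would then be evaluated against nodes of this tree, e.g. by comparing a line's `font_size` to the largest on the page.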
Template sample
<?xml version="1.0" ?>
<structdef>
  <title min="0" max="1">
    <begin inclusive="current">largeststrsize(0,0.5)</begin>
    <end inclusive="before">sizechange(1)</end>
  </title>
  <creator min="0" max="1">
    <begin inclusive="after">title</begin>
    <end inclusive="before">!nameformat</end>
  </creator>
  <date min="0" max="1">
    <begin inclusive="current">dateformat</begin>
    <end inclusive="current">onesection</end>
  </date>
</structdef>
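A sketch of how an engine might load such a template into a rule table. The begin/end expressions (`largeststrsize`, `sizechange`, ...) are kept as opaque strings here, since their semantics belong to the engine; this parser is an illustration, not the actual ODU implementation:

```python
import xml.etree.ElementTree as ET

TEMPLATE = """<?xml version="1.0" ?>
<structdef>
  <title min="0" max="1">
    <begin inclusive="current">largeststrsize(0,0.5)</begin>
    <end inclusive="before">sizechange(1)</end>
  </title>
  <creator min="0" max="1">
    <begin inclusive="after">title</begin>
    <end inclusive="before">!nameformat</end>
  </creator>
</structdef>"""

def parse_template(xml_text: str) -> dict:
    """Turn a structdef template into {field: rule}, one rule per element."""
    rules = {}
    for elem in ET.fromstring(xml_text):
        rules[elem.tag] = {
            "min": int(elem.get("min", "0")),
            "max": int(elem.get("max", "1")),
            # each bound is (expression, inclusiveness)
            "begin": (elem.find("begin").text, elem.find("begin").get("inclusive")),
            "end": (elem.find("end").text, elem.find("end").get("inclusive")),
        }
    return rules

rules = parse_template(TEMPLATE)
```

Because the rules live in a separate XML file, refining a class's extraction behavior means editing the template, not the engine code, which is the decoupling advantage the previous slide describes.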
Sample document PDF
Scanned OCR output
Clean XML output
Template (part)
Metadata extracted
Results Summary from DTIC Project
PDF from scan image:
• SVM: DTIC100, 85%-100%
• Expert: DTIC10, 85%
• Template: DTIC600, 90%-100%
PDF from text:
• Template: DTIC20, 95%; DTIC30, 100%
Cover-page detection:
• SF298 location: DTIC1000, 100%
• SF298, 4 fields: DTIC1000, 95%-97%
• SF298, 27 fields: DTIC1000, 88%-99%
Experiment with Limited GPO Documents
• 14 GPO Documents having Technical Report Documentation Page
• 57 GPO Documents without Technical Report Documentation Page
• 16 Congressional Reports
• 16 Public Law Documents
GPO Report Documentation Page
GPO Document
Congressional Report
Public Law Document
DEMO
Conclusions
• OCR software works very well on current documents
• The template-based approach allows automatic metadata extraction, with a high degree of accuracy, from
– dynamically changing collections
– heterogeneous, large collections
– report documentation pages
• Extraction of structure metadata (e.g., table of contents, tables, equations, sections) is feasible
Additional Slides
Metadata Extraction: Machine-Learning Approach
• Learn the relationship between input and output from samples, then make predictions for new data
• Good adaptability, but the model has to be trained from samples
• Two techniques explored: HMM (Hidden Markov Model) and SVM (Support Vector Machine)
Machine Learning - Hidden Markov Models
• "Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series" (Alan B. Poritz: Hidden Markov Models: A Guided Tour, ICASSP 1988)
• An HMM is a probabilistic finite-state automaton
– it transits from state to state
– it emits a symbol when visiting each state
– the states are hidden
(Diagram: example automaton with states A, B, C, D)
Hidden Markov Models
• A Hidden Markov Model consists of
– a set of hidden states (e.g. coin1, coin2, coin3)
– a set of observation symbols (e.g. H and T)
– transition probabilities: the probability of moving from one state to another
– emission probabilities: the probability of emitting each symbol in each state
– initial probabilities: the probability of each state being chosen as the first state
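The coin example above can be written out as a concrete HMM specification; the probability values here are made up for illustration:

```python
import random

# Two biased coins as hidden states; H/T as observation symbols.
states = ["coin1", "coin2"]
symbols = ["H", "T"]
initial = {"coin1": 0.6, "coin2": 0.4}
transition = {"coin1": {"coin1": 0.7, "coin2": 0.3},
              "coin2": {"coin1": 0.4, "coin2": 0.6}}
emission = {"coin1": {"H": 0.9, "T": 0.1},   # coin1 is biased toward heads
            "coin2": {"H": 0.2, "T": 0.8}}   # coin2 toward tails

def generate(n: int, seed: int = 0) -> str:
    """Walk the HMM for n steps, emitting one symbol per visited state."""
    rng = random.Random(seed)
    state = rng.choices(states, weights=[initial[s] for s in states])[0]
    out = []
    for _ in range(n):
        out.append(rng.choices(symbols,
                               weights=[emission[state][o] for o in symbols])[0])
        state = rng.choices(states,
                            weights=[transition[state][s] for s in states])[0]
    return "".join(out)

seq = generate(10)
```

An observer sees only the H/T sequence; which coin produced each flip stays hidden, which is exactly the situation in metadata tagging below.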
HMM - Metadata Extraction
– A document is a sequence of words produced by some hidden states (title, author, etc.)
– The parameters of the HMM are learned from samples in advance
– Metadata extraction then finds the most probable sequence of states (title, author, etc.) for a given sequence of words
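Finding the most probable state sequence is the standard Viterbi decoding problem. A compact sketch, using log probabilities to avoid underflow (the toy tagging states and probabilities are illustrative, not trained values):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation sequence."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t-1][p] + math.log(trans_p[p][s]) + math.log(emit_p[s][obs[t]]), p)
                for p in states)
            V[t][s] = prob
            back[t-1][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy tagging: observations are coarse per-line word classes.
states = ["title", "author"]
start = {"title": 0.9, "author": 0.1}
trans = {"title": {"title": 0.6, "author": 0.4},
         "author": {"title": 0.1, "author": 0.9}}
emit = {"title": {"capword": 0.7, "name": 0.3},
        "author": {"capword": 0.2, "name": 0.8}}
path = viterbi(["capword", "capword", "name"], states, start, trans, emit)
# path is ["title", "title", "author"]
```

In a real extractor the states would be the metadata fields and the observations would be word or line features, with all probabilities estimated from labeled training documents.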
Machine Learning: Support Vector Machines
• Binary classifier (classifies data into two classes)
– represents data with pre-defined features
– finds the hyperplane with the largest margin separating the two classes of samples
– classifies new data based on which side of the hyperplane they fall
(Figure: an SVM classifying lines into two classes, title vs. not title, using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line; red dots are titles, blue dots are not. The separating hyperplane and its margin are shown.)
SVM - Metadata Extraction
• Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
• Basic idea
– classes correspond to metadata elements
– extracting metadata from a document means classifying each line (or block) into the appropriate class
– for example: to extract the document title, classify each line to see whether it is part of the title or not
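Once trained, an SVM reduces to a linear decision function f(x) = w·x + b over the feature vector. A sketch using the figure's two features, font size and line number; a real SVM learns w and b from labeled samples, whereas the values here are hand-picked for illustration:

```python
# Hand-picked weights: large fonts push toward "title",
# lines far from the top of the page push toward "not title".
w = (1.0, -2.0)
b = -8.0

def is_title(font_size: float, line_number: int) -> bool:
    """Classify one line by which side of the hyperplane w.x + b = 0 it falls on."""
    score = w[0] * font_size + w[1] * line_number + b
    return score > 0

assert is_title(18, 1)        # an 18 pt line at the top of the page
assert not is_title(10, 12)   # a 10 pt line well into the body text
```

Training would replace the hand-picked (w, b) with the maximum-margin hyperplane found from the labeled title/not-title samples.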
Metadata Extraction: Rule-based
• Basic idea
– Use a set of rules, based on human observation, to define how to extract metadata
– For example, a rule may be "The first line is the title"
• Advantages
– Can be implemented straightforwardly
– No need for training
• Disadvantages
– Lacks adaptability (works only for similar documents)
– Difficult to handle a large number of features
– Difficult to tune the system when errors occur, because the rules are usually fixed
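The idea above in miniature; the two hard-coded rules mirror the slide's example (the "by" rule is a hypothetical addition for illustration), and their rigidity is exactly the adaptability problem the slide points out:

```python
def extract_by_rules(lines: list[str]) -> dict:
    """Tiny rule-based extractor with two fixed, hand-written rules."""
    metadata = {}
    if lines:
        # Rule 1: "The first line is the title."
        metadata["title"] = lines[0].strip()
    for line in lines[1:]:
        # Rule 2: a line starting with "by " names the creator.
        if line.lower().startswith("by "):
            metadata["creator"] = line[3:].strip()
            break
    return metadata

md = extract_by_rules(["Automated Metadata Extraction",
                       "by Kurt Maly",
                       "April 2006"])
```

Any document whose title spans two lines, or whose author line lacks the "by" prefix, defeats these rules, which is why the template-based approach scopes each rule set to one document class.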
Metadata Extraction - Rule-based
• Expert system approach
– Build a large rule base using standard languages such as Prolog
– Use an existing expert system engine (for example, SWI-Prolog)
• Advantages
– Can use an existing engine
• Disadvantages
– Building the rule base is time-consuming
(Diagram: a parser turns the document into facts; the expert system engine combines the facts with the knowledge base to produce the metadata)
Metadata Extraction Experience at ODU CS
• We have a knowledge database obtained from analyzing the Arc and DTIC collections
– Authors: 4 million strings from http://arc.cs.odu.edu
– Organizations: 79 from DTIC250, 200 from DTIC600
– Universities: 52 from DTIC250