Template Writers' Manual - Extracting Metadata And Structure

Non-Form Template Writers' Manual
May 2008
1. Introduction
This document describes the current state of the template language for processing of non-form documents in the Extract metadata extraction system. Readers unfamiliar with the
overall purpose and organization of this system are referred to [reference to ICADL
paper]. This document concentrates upon the portion of the overall system associated
with extraction of raw metadata from documents that have already been checked for
“report pages” (metadata-bearing forms) and found to lack any such.
Such documents are then passed on for attempted extraction of metadata according to
rules described in one or more executable templates, each template designed to describe a
distinct document layout.
The template language is derived from [ref to Tang’s thesis] and retains the overall
structure defined there, though many additions and modifications to the lower-level
details have been made since then.
At the time of extraction, documents (originally received as PDF files) have been passed
through OCR and eventually converted to a data structure called “CleanML”, essentially
a sequence of pages, each page represented as a sequence of lines of text. Each line is
marked with certain “features” describing the primary font employed within that line.
These features are:
• weight (bold or medium),
• slant (italic or normal),
• font size,
• allCaps (true or false),
• titleCase (true or false),
• start of paragraph (i.e., are there one or more empty lines preceding this one?)
The “template engine” attempts to interpret the instructions encoded in a template in
order to select, from the CleanML structure, text strings that represent meaningful
metadata.
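The CleanML structure described above can be pictured with a small data model. The following is only an illustrative sketch: the class and field names (Line, Page, Document, font_size, starts_paragraph, and so on) are assumptions made for this example and are not the names used by the actual CleanML format, and the sample text is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Line:
    """One line of OCR'ed text plus its line-level features (names are hypothetical)."""
    text: str
    bold: bool = False              # weight: bold (True) or medium (False)
    italic: bool = False            # slant: italic (True) or normal (False)
    font_size: float = 0.0          # in whatever units CleanML uses internally
    all_caps: bool = False
    title_case: bool = False
    starts_paragraph: bool = False  # preceded by one or more empty lines


@dataclass
class Page:
    lines: List[Line] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def all_lines(self) -> List[Line]:
        """Flatten the pages into one sequence of lines, in reading order."""
        return [line for page in self.pages for line in page.lines]


# Made-up sample content, for illustration only.
doc = Document(pages=[
    Page(lines=[
        Line("ARMY RESEARCH LABORATORY", bold=True, font_size=14.0, all_caps=True),
        Line("A Sample Report Title", font_size=12.0, title_case=True,
             starts_paragraph=True),
    ]),
])
print([line.text for line in doc.all_lines()])
```

The point of the sketch is that every rule in a template can only consult the text of a line and this handful of features.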
2. The Template Language
Each template contains a set of rules designed to extract metadata from a single class of
similar documents. Figure 1 shows a template example. Each desired metadata item is
described by a rule set designating the beginning and the end of the metadata. The rules
are limited to features detectable at line-level resolution. The names of the metadata
fields can vary from one organization's collection to another.
<structdef pagenumber="3" templateID="arl_1">
  <CorporateAuthor>
    <begin inclusive="current">
      <stringmatch case="no" loc="beginwith">Army Research</stringmatch>
    </begin>
    <end inclusive="before">
      <stringmatch case="no" loc="beginwith">ARL</stringmatch>
    </end>
  </CorporateAuthor>
</structdef>
Fig. 1. Non-form Template fragment
The template language is expressed in XML. This is not a fundamental limitation. Almost
all executable languages are processed by translation into a tree-structured description
(generally referred to as an “abstract syntax tree”). XML is a standard notation for
exchanging tree-structured data, and so it is convenient for a research/exploratory project
to introduce new languages directly in the syntax tree format, bypassing the otherwise
distracting process of providing higher-level translation. XML is not particularly easy to
read and write, and a later section of this document describes a preliminary approach to
providing a more writer-friendly way to generate templates.
In the remainder of this section we cover the XML elements that make up a template.
2.1 The <structdef>
The top-level element in any template is the <structdef>. This serves as a container for
one or more metadata field descriptions as described below. The structdef has two
attributes:
| Attribute | Required | Possible values | Default | Meaning |
|---|---|---|---|---|
| pagenumber | yes | Numeric page range: p or p1-p2 | | Pages of the original document examined by this template |
| templateID | yes | word | | Unique identifier for this template |
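As an illustration of the two accepted forms of the pagenumber attribute (p or p1-p2), here is a minimal parsing sketch. The function name parse_pagenumber and the choice to return an inclusive pair are assumptions of this example, not part of the template engine.

```python
from typing import Tuple


def parse_pagenumber(value: str) -> Tuple[int, int]:
    """Parse "p" or "p1-p2" into an inclusive (first_page, last_page) pair."""
    first, separator, last = value.strip().partition("-")
    if separator:                      # range form: p1-p2
        start, end = int(first), int(last)
    else:                              # single-page form: p
        start = end = int(first)
    if start > end:
        raise ValueError(f"page range runs backwards: {value!r}")
    return start, end


print(parse_pagenumber("3"))    # single page, as in Figure 1
print(parse_pagenumber("1-2"))  # page range
```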
2.2 Metadata Field Descriptors
Inside the <structdef> element are one or more metadata field descriptors. Each is named
for a metadata field (e.g., “CorporateAuthor” in Figure 1). The actual names depend on
the standards imposed by a particular document collection and vary from one collection
to another.
Each metadata field descriptor will contain two elements giving rules that describe the
beginning line where that metadata can be found and the final line for that metadata field
value. The actual value extracted for the field will be the text in all lines so identified.
Possible attributes (all optional) are:
| Attribute | Required | Possible values | Default | Meaning |
|---|---|---|---|---|
| min | No | Non-neg number | 1 | Minimum number of repetitions of this field that should be expected in the document |
| max | No | Non-neg number | 1 | Maximum number of repetitions of this field that should be expected in the document |
| ignore | No | “yes” or “no” | no | If “yes”, this field is used merely as a convenience to identify a position within the document. No metadata value will actually be extracted into the output. |
| require | No | “yes” or “no” | no | If “yes”, then failure to successfully locate and extract this field indicates that something is wrong (e.g., this template describes a different document layout than is actually present in this document). In such a case, execution of the template is halted with no output generated for any metadata fields. |
| filter | No | Regular expression | .* | Used to select a portion of the raw text in the indicated lines. If the regular expression contains no parentheses, then the portion of the text matching the entire regular expression is extracted. If the regular expression contains parentheses, then the portion of the text matching the parenthesized sub-expressions is extracted. |
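The behavior of the filter attribute can be sketched with Python's re module. This is an approximation under stated assumptions: the function name apply_filter is hypothetical, the real engine's regular expression dialect may differ from Python's, and the way several parenthesized sub-expressions are combined (joined with a single space here) and the result when nothing matches (None here) are guesses, since this manual does not specify them.

```python
import re
from typing import Optional


def apply_filter(raw_text: str, pattern: str = ".*") -> Optional[str]:
    """Select part of raw_text according to a filter regular expression."""
    match = re.search(pattern, raw_text)
    if match is None:
        return None  # assumption: the manual does not say what happens here
    if match.re.groups == 0:
        # No capturing parentheses: keep the text matched by the whole expression.
        return match.group(0)
    # Capturing parentheses present: keep only the captured pieces.
    # Assumption: pieces are joined with one space; unmatched groups are skipped.
    return " ".join(piece for piece in match.groups() if piece is not None)


print(apply_filter("Report Date: May 2008", r"Date:\s*(.*)"))  # May 2008
print(apply_filter("Pages: 42 total", r"\d+"))                 # 42
print(apply_filter("anything at all"))                          # anything at all
```

With the default filter .* the whole raw text of a single line is returned unchanged, which matches the role of a default that filters nothing.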
2.3 Begin & End Rules
Each metadata field descriptor will contain one <begin> element and one <end> element.
These describe the beginning and ending line of the text to be extracted for that metadata
field. These are the most complicated part of the template language, and the syntax and
structure of these rules have not evolved in an entirely consistent fashion over time.
All <begin> and <end> elements will contain some sort of line selector expression.
Sometimes this expression will be simply text. In other cases it is another XML element
(e.g., the <stringmatch> elements in Figure 1).
Attributes of the begin/end rules themselves are:
| Attribute | Required | Possible values | Default | Meaning |
|---|---|---|---|---|
| inclusive | No? | “before”, “after”, “current” | current? | Modifies the choice made by the basic line selector. If “before”, the selected line is actually the one before that indicated by the line selector. If “after”, the selected line is the one after that indicated by the line selector. If “current”, the selected line is the one indicated by the line selector. |
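The effect of the inclusive attribute on the extracted text can be sketched as an index adjustment. The function names, the use of 0-based line indices, and the clamping at the document boundaries are assumptions of this example; the sample lines are made up.

```python
from typing import List, Sequence

# Offset applied to the line index chosen by the line selector.
OFFSETS = {"before": -1, "current": 0, "after": 1}


def resolve_inclusive(selected: int, inclusive: str = "current") -> int:
    """Shift the selector's line index according to the inclusive attribute."""
    if inclusive not in OFFSETS:
        raise ValueError(f"unknown inclusive value: {inclusive!r}")
    return selected + OFFSETS[inclusive]


def extract_lines(lines: Sequence[str],
                  begin_index: int, begin_inclusive: str,
                  end_index: int, end_inclusive: str) -> List[str]:
    """Return all lines from the adjusted begin line through the adjusted end line."""
    # Assumption: indices falling outside the document are clamped.
    first = max(0, resolve_inclusive(begin_index, begin_inclusive))
    last = min(len(lines) - 1, resolve_inclusive(end_index, end_inclusive))
    return list(lines[first:last + 1])


# Made-up lines mirroring Figure 1: begin matches index 1 ("current"),
# end matches index 3 ("before"), so lines 1 and 2 are extracted.
sample = ["Technical Report", "Army Research Laboratory",
          "Sample Directorate", "ARL sample report number"]
print(extract_lines(sample, 1, "current", 3, "before"))
```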
2.4 Line Selector Expressions
Inside each <begin> or <end> rule is a line selector expression. These describe tests that
are applied to the lines of the CleanML representation of the document, until the
indicated test succeeds. By default, such tests are applied from the start of the document,
though some of the selectors may alter this choice.
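The general idea of applying a test to successive lines until it succeeds can be sketched as follows. The function find_line and its start parameter are hypothetical names used only for this illustration, and lines are represented as plain strings rather than CleanML lines.

```python
from typing import Callable, Optional, Sequence


def find_line(lines: Sequence[str],
              test: Callable[[str], bool],
              start: int = 0) -> Optional[int]:
    """Return the index of the first line at or after start that passes test."""
    for index in range(start, len(lines)):
        if test(lines[index]):
            return index
    return None  # no line satisfied the selector


def begins_with_army_research(line: str) -> bool:
    """A case-insensitive 'begins with' test, like the <begin> rule in Figure 1."""
    return line.lower().startswith("army research")


sample = ["Technical Report", "Army Research Laboratory", "ARL sample line"]
print(find_line(sample, begins_with_army_research))           # 1
print(find_line(sample, begins_with_army_research, start=2))  # None
```

By default the scan starts at the top of the document (start=0); a selector that alters the starting point would correspond to a different start value.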
Notation:
• mf: a metadata field name
• v1, v2, …: a vertical position on the page, expressed as a decimal number in the range 0.0-1.0 denoting the fraction of the page (by line count), with 0.0 being the top
• wc: an integer denoting a number of words
• s1, s2: sizes (of fonts), expressed in units internal to CleanML
| Selector | Meaning |
|---|---|
| mf | Permitted only in a <begin> rule. Selects the line chosen by the <end> rule of that metadata field. (Note: although it is possible to extract multiple instances of the same metadata field, e.g., multiple authors in a document, only uniquely occurring fields should be named in selectors.) |
| largersize | Get the current line (lines at the beginning with string length less than 3 are ignored) and find a line whose size is larger than that of the current line (lines with string length less than 10 are ignored). |
| sizechange(x) | Find a line whose font size is different from that of the previous line. To overcome OCR errors, a change with difference less than x is ignored. |
| sizepctchange(x) | Find a line whose font size is different from that of a previous line by more than x percent (0.0-1.00). |
| featurechange | Find a line whose features are different from those of the previous line. A feature change occurs when any of the following are true: font size is different; one is bold and the other is not bold; one is Allupcase and the other is not; one is leadingcase and the other is not. |
| largestsize(v1,v2,wc,mf) | Deprecated: Searches lines between positions v1 and v2 that are also after mf and that contain a minimum of wc words for the largest font size. Returns the first of those lines containing that font. |
| largeststrsize(v1,v2) | Searches for the largest font size among lines between positions v1 and v2 that meet the following criteria: its length is larger than 11; it has more than 1 word; average word length is between 4 and 13; percentage of letters is larger than 0.7; it does not contain any of the phrases “center”, “report”, “division”, “university”, “laboratory”, “institute”, “U. S. Army”, “approved for”, “United States Air Force”, or “United States Naval Academy”. |
| layoutchange | Find a line that differs from its preceding line in that one of the following is true: font size is different; one is bold and the other is not bold. |
| boldchange | Find a line that differs from its preceding line in that one is in bold and the other is not. |
| beginwithmonth | Find a line that begins with a month such as “Mar ”, “January”, etc. |
| dateformat(format) | Find a line that has a date with the specified format (currently only “month yyyy” and “dd month yyyy” are supported). |
| dateformat | Find a line that has a date with format “dd month yyyy”, “month dd, yyyy” or “month yyyy”, where “month” means a month string such as “Jan ”, “September”, etc. |
| nameformat | Find a line that is in a name format (such as F. Last, First Last, etc.). |
| !nameformat | Find a line that is not in a name format. |
| size=s1 | Find a line with font size s1. |
| size(s1,s2) | Return true if a line’s font size is between s1 and s2. |
| onesection | Permitted only in an <end> rule. Selects the same line as the <begin> rule. |
| subtitle | Find a line that consists of all upcase letters and that has fewer than 4 words, or a line in which the only uncapitalized words are “a”, “of”, “the”, “for”, “one”, “to”, or “in”. |
| title | Find a line that has 4 or more words and that either consists of all upcase letters or in which the only uncapitalized words are “a”, “of”, “the”, or “for”. |
| begin | The first line |
| end | The last line |
| stringmatch | Match a special string (see Section 2.5) |
| verticalSpace | Find a line preceded by an empty line. |
| firstpart | Deprecated: matches the current line (same as “begin” in a <begin> rule or “onesection” in an <end> rule) |
| lastpart | Line of the previous field’s end |
| keyword | Deprecated: Find a line starting with “keyword” (could be done with a stringmatch) |
| abstract | Deprecated: Find a line starting with “abstract” (could be done with a stringmatch) |
| paraChange | Deprecated: find a line that starts a paragraph and that has features (see featurechange, above) different from those of the prior line. |
| ParaEnd | Finds a line preceding a line that was indicated as the start of a paragraph (by OCR) |
| regexps(re) | Find a line matching a regular expression re |
| StyleChangeNextPart | Deprecated: used only by experimental looping code |
| creator | Deprecated: matches the current line (same as “onesection”) |
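Several of the selectors above compare each line with the line before it. The sketch below illustrates sizechange(x), featurechange, and layoutchange in that style. It is not the engine's code: the Line class, the field names, and the treatment of title_case as the counterpart of “leadingcase” are assumptions, and the sample lines are invented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Line:
    text: str
    font_size: float
    bold: bool = False
    all_caps: bool = False
    title_case: bool = False  # assumed to correspond to "leadingcase"


def size_differs(prev: Line, cur: Line, x: float = 0.0) -> bool:
    """sizechange(x): sizes differ, ignoring differences smaller than x."""
    diff = abs(cur.font_size - prev.font_size)
    return diff > 0 and diff >= x


def layout_differs(prev: Line, cur: Line) -> bool:
    """layoutchange: font size differs, or exactly one of the lines is bold."""
    return prev.font_size != cur.font_size or prev.bold != cur.bold


def feature_differs(prev: Line, cur: Line) -> bool:
    """featurechange: layoutchange plus the two capitalization features."""
    return (layout_differs(prev, cur)
            or prev.all_caps != cur.all_caps
            or prev.title_case != cur.title_case)


def find_change(lines: Sequence[Line],
                differs: Callable[[Line, Line], bool]) -> Optional[int]:
    """Index of the first line that differs from its predecessor, if any."""
    for i in range(1, len(lines)):
        if differs(lines[i - 1], lines[i]):
            return i
    return None


sized = [Line("A TITLE", 14.0), Line("SECOND TITLE LINE", 14.2), Line("Author", 10.0)]
styled = [Line("REPORT TITLE", 12.0, all_caps=True),
          Line("A Subtitle Line", 12.0, title_case=True),
          Line("Author Name", 12.0, bold=True, title_case=True)]

print(find_change(sized, lambda a, b: size_differs(a, b, 0.5)))  # 2
print(find_change(styled, feature_differs))                      # 1
print(find_change(styled, layout_differs))                       # 2
```

In the first example the small 14.0 to 14.2 jitter is ignored because it is below the tolerance x = 0.5, which is the stated purpose of that parameter.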
2.5 String matching
Unlike most line selectors, <stringmatch> is formed as an XML element rather than as
plain text within a <begin> or <end> rule. This is presumably because of the more
elaborate set of options available for this selector. The actual text to be matched is inside
the <stringmatch> element (e.g., “Army Research” in Figure 1).
| Attribute | Required | Possible values | Default | Meaning |
|---|---|---|---|---|
| case | no | “yes” or “no” | yes | yes: upper/lower case is significant. no: upper/lower case differences are ignored. |
| loc | yes | “beginwith”, “onesection”, “contain”, “endwith” | | Modifies how much of the text in a line must match the provided text. beginwith: the line must begin with the provided text. endwith: the line must end with the provided text. onesection: the entire line must match the provided text. contain: the provided text must occur somewhere within the line. |
| fuzzy | no | Non-negative integer | 0 | Match succeeds even if the line differs from the provided text by this number of single-character changes (Levenshtein edit distance) |
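The interaction of the case, loc, and fuzzy attributes might look like the following sketch. The Levenshtein routine is the standard dynamic-programming algorithm; everything else is an assumption of this example, in particular the function name string_match and the way fuzzy matching is combined with loc (a window of the line with the same length as the provided text is compared against that text). The manual does not say how the engine combines them.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_match(line: str, text: str, loc: str,
                 case: str = "yes", fuzzy: int = 0) -> bool:
    """Hypothetical model of <stringmatch>; fuzzy windowing is an assumption."""
    if case == "no":
        line, text = line.lower(), text.lower()
    n = len(text)
    if loc == "beginwith":
        candidates = [line[:n]]
    elif loc == "endwith":
        candidates = [line[-n:]] if n else [""]
    elif loc == "onesection":
        candidates = [line]
    elif loc == "contain":
        candidates = [line[i:i + n] for i in range(max(len(line) - n, 0) + 1)]
    else:
        raise ValueError(f"unknown loc value: {loc!r}")
    return any(levenshtein(candidate, text) <= fuzzy for candidate in candidates)


print(string_match("Army Research Laboratory", "army research", "beginwith", case="no"))
print(string_match("Arny Research Laboratory", "Army Research", "beginwith", fuzzy=1))
```

With fuzzy=0 this reduces to exact prefix, suffix, whole-line, or substring comparison; a fuzzy value of 1 tolerates a single OCR slip such as “Arny” for “Army”.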
3. “TemplateMaker” Tool
“TemplateMaker” is a GUI that assists in the manual template creation process. Its
execution script is installed when the system is installed. Because the names of the
metadata fields vary from one organization's collection to another, every organization is
provided with a “TemplateMaker” tool that differs only in its metadata field names. General
functionality and usage are the same for all the tools.
3.1. Usage of “TemplateMaker” Tool
Before running the tool, gather sample Omnipage XML files of the class for which you
want to create a template into one directory (place the directory where it is easily
accessible).
Step 1: Start creating a new template. Select File -> New and change the “TemplateID”
as required.
Step 2: Load the Omnipage XML files. Click Sample -> Open Samples, navigate to
the directory containing the class samples, and load the XML files. Once loaded, the
drop-down box is populated with the sample Omnipage XML file names. Select one of them
and the “cleanML” area is populated with the respective cleanML. You can make
different selections from the drop-down box.
Step 3: Write a rule for a metadata field. Click the “Meta Select” button, select the
required metadata field name from the drop-down box, and then click “Okay”. This will
place the selected metadata field in the template with empty “begin” and “end” rules. Make
sure that the field is placed in the right place, i.e., that the template is in proper XML
format. You can check the XML format by clicking the “Check Template XML” button. Any
number of metadata fields can be added depending on the requirements.
Step 4: Include the required features for the “begin” and “end” rules. Features like
“String Match” or any feature from the scroll box can be added depending on the
requirement.
Step 5: Check the result. Click the “Execute Extract” button to view the result of the
template in the area under the “Extract Result” tab on the right of the window.
Step 6: Save the template. When the template is working well for most of the samples, save
the template along with comments including the author, date of creation, collection, etc.
4. Testing
4.1 Regression Test
Regression testing is performed to see whether the behavior of the system has changed.
Regression tests combine inputs to the system with the outputs that a prior version of the
system was able to produce. Most regression tests will be cases where the system
produced correct output. We run these tests mainly to be sure that changes to the system
have not broken things that used to work. It’s not unusual, though, for regression tests to
include test cases where the system produced incorrect output. These tests allow us to see
if future fixes actually correct known bad behaviors.
Regression tests for Extract are run from the top project by giving Ant (build.xml) the
target “regression-test”. This target depends on the target “deploy”, so the installable
system is built first. This is then installed (within the main project’s target directory) and
the newly installed system is run on a collection of regression test suites. Each suite
consists of a number of input documents (in .pdf or already-OCR’ed .xml formats) and
the expected output for these documents. Regression test reports are written to the
directory target/regressiontest.
To create a new regression test suite, create a directory to hold the suite. This directory
should have two subdirectories, input and expected. Input documents for the test are put
into the input directory. The expected directory contains the expected outputs, generally
gathered from prior runs of the system, for those documents. These are kept in their usual
output subdirectory (e.g., resolved or untrusted). For each file in some subdirectory of
expected, the regression test will look to see if the software produced an identically
named “actual” output file in the corresponding output subdirectory. If so, the two are
compared (currently, only XML outputs can be handled). If the expected and actual
outputs differ, or if there is no actual output corresponding to an expected output, the test
fails.
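The pass/fail rule just described can be sketched as a directory comparison. This is an illustration only: the function name compare_suite is hypothetical, the actual outputs are assumed to sit in a directory tree that mirrors the expected one, and files are compared byte for byte, whereas the real test harness may compare XML content in some other way.

```python
import tempfile
from pathlib import Path
from typing import List


def compare_suite(expected_root: Path, actual_root: Path) -> List[str]:
    """List the failures for one suite; an empty list means the suite passes."""
    failures = []
    for expected_file in sorted(expected_root.rglob("*")):
        if not expected_file.is_file():
            continue
        relative = expected_file.relative_to(expected_root)
        actual_file = actual_root / relative
        if not actual_file.is_file():
            # No actual output corresponds to this expected output.
            failures.append(f"missing: {relative.as_posix()}")
        elif expected_file.read_bytes() != actual_file.read_bytes():
            # Assumption: byte-for-byte comparison stands in for the XML comparison.
            failures.append(f"differs: {relative.as_posix()}")
    return failures


# Tiny demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "expected" / "resolved").mkdir(parents=True)
    (root / "actual" / "resolved").mkdir(parents=True)
    (root / "expected" / "resolved" / "doc1.xml").write_text("<metadata/>")
    (root / "actual" / "resolved" / "doc1.xml").write_text("<metadata/>")
    (root / "expected" / "resolved" / "doc2.xml").write_text("<metadata/>")
    failures = compare_suite(root / "expected", root / "actual")

print(failures)  # ['missing: resolved/doc2.xml']
```

Note that the loop is driven by the expected files only, so extra actual outputs with no expected counterpart do not cause a failure, which matches the rule as stated above.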
Once the suite directories have been populated with the desired files, edit the main
project build.xml file, going to the rtest-execute target. Add a new call to the
runTestSuite macro, specifying the name of the new test suite, the path to the suite
directory, the name of the collection used for this test, and the number of input
documents in the suite.