Non-Form Template Writers' Manual
May 2008

1. Introduction

This document describes the current state of the template language for processing non-form documents in the Extract metadata extraction system. Readers unfamiliar with the overall purpose and organization of this system are referred to [reference to ICADL paper].

This document concentrates on the portion of the overall system that extracts raw metadata from documents that have already been checked for “report pages” (metadata-bearing forms) and found to lack any. Such documents are then passed on for attempted extraction of metadata according to rules described in one or more executable templates, each template designed to describe a distinct document layout. The template language is derived from [ref to Tang’s thesis] and retains the overall structure defined there, though many additions and modifications to the lower-level details have been made since then.

At the time of extraction, documents (originally received as PDF files) have been passed through OCR and eventually converted to a data structure called “CleanML”, essentially a sequence of pages, each page represented as a sequence of lines of text. Each line is marked with certain “features” describing the primary font employed within that line. These features are:

- weight (bold or medium)
- slant (italic or normal)
- font size
- allCaps (true or false)
- titleCase (true or false)
- start of paragraph (i.e., are there one or more empty lines preceding this one?)

The “template engine” attempts to interpret the instructions encoded in a template in order to select, from the CleanML structure, text strings that represent meaningful metadata.

2. The Template Language

Each template contains a set of rules designed to extract metadata from a single class of similar documents. Figure 1 shows a template example. Each desired metadata item is described by a rule set designating the beginning and the end of the metadata.
The rules are limited to features detectable at line-level resolution. The names of the metadata fields can vary from one organization's collection to another.

    <structdef pagenumber="3" templateID="arl_1">
      <CorporateAuthor>
        <begin inclusive="current">
          <stringmatch case="no" loc="beginwith">Army Research</stringmatch>
        </begin>
        <end inclusive="before">
          <stringmatch case="no" loc="beginwith">ARL</stringmatch>
        </end>

Fig. 1. Non-form template fragment

The template language is expressed in XML. This choice is not fundamental. Almost all executable languages are processed by translation into a tree-structured description (generally referred to as an “abstract syntax tree”). XML is a standard notation for exchanging tree-structured data, so it is convenient for a research/exploratory project to introduce new languages directly in the syntax tree format and bypass the otherwise distracting process of providing a higher-level translation. XML is not particularly easy to read and write, and a later section of this document describes a preliminary approach to providing a more writer-friendly way to generate templates.

In the remainder of this section we cover the XML elements that make up a template.

2.1 The <structdef>

The top-level element in any template is the <structdef>. This serves as a container for one or more metadata field descriptions as described below. The <structdef> has two attributes:

pagenumber
    Required: yes
    Possible values: numeric page range, p or p1-p2
    Default: (no default)
    Meaning: pages of the original document examined by this template

templateID
    Required: yes
    Possible values: word
    Default: (no default)
    Meaning: unique identifier for this template

2.2 Metadata Field Descriptors

Inside the <structdef> element are one or more metadata field descriptors. Each is named for a metadata field (e.g., “CorporateAuthor” in Figure 1). The actual names depend on the standards imposed by a particular document collection and vary from one collection to another.
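As a rough illustration of the structure described so far, the following Python sketch loads a template with the standard library and reports its <structdef> attributes and field names. It is not part of Extract: the helper names parse_page_range and summarize are hypothetical, and the Figure 1 fragment has been given closing tags here only so that it parses as a complete document.

```python
import xml.etree.ElementTree as ET

# Figure 1, with closing tags added so that the fragment is well-formed XML.
TEMPLATE = """<structdef pagenumber="3" templateID="arl_1">
  <CorporateAuthor>
    <begin inclusive="current">
      <stringmatch case="no" loc="beginwith">Army Research</stringmatch>
    </begin>
    <end inclusive="before">
      <stringmatch case="no" loc="beginwith">ARL</stringmatch>
    </end>
  </CorporateAuthor>
</structdef>"""


def parse_page_range(value):
    """Turn a pagenumber value of the form 'p' or 'p1-p2' into (first, last)."""
    first, _, last = value.partition("-")
    start = int(first)
    end = int(last) if last else start
    return start, end


def summarize(template_xml):
    """Return the template ID, page range, and field names of a template."""
    root = ET.fromstring(template_xml)
    if root.tag != "structdef":
        raise ValueError("top-level element must be <structdef>")
    return {
        "templateID": root.attrib["templateID"],            # required attribute
        "pages": parse_page_range(root.attrib["pagenumber"]),  # required attribute
        "fields": [child.tag for child in root],            # one entry per descriptor
    }


summary = summarize(TEMPLATE)
print(summary)
```

Because both attributes are required, the sketch reads them with plain dictionary indexing, so a template missing either one fails immediately with a KeyError.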
Each metadata field descriptor contains two elements giving rules that describe the beginning line where that metadata can be found and the final line for that metadata field value. The actual value extracted for the field is the text in all lines so identified. Possible attributes of a metadata field descriptor (all optional) are:

min
    Required: no
    Possible values: non-negative number
    Default: 1
    Meaning: minimum number of repetitions of this field that should be expected in the document

max
    Required: no
    Possible values: non-negative number
    Default: 1
    Meaning: maximum number of repetitions of this field that should be expected in the document

ignore
    Required: no
    Possible values: “yes” or “no”
    Default: no
    Meaning: if “yes”, this field is used merely as a convenience to identify a position within the document. No metadata value will actually be extracted into the output.

require
    Required: no
    Possible values: “yes” or “no”
    Default: no
    Meaning: if “yes”, then failure to locate and extract this field indicates that something is wrong (e.g., this template describes a different document layout than is actually present in this document). In such a case, execution of the template is halted with no output generated for any metadata fields.

filter
    Required: no
    Possible values: regular expression
    Default: .*
    Meaning: used to select a portion of the raw text in the indicated lines. If the regular expression contains no parentheses, then the portion of the text matching the entire regular expression is extracted. If the regular expression contains parentheses, then the portion of the text matching the parenthesized sub-expressions is extracted.

2.3 Begin & End Rules

Each metadata field descriptor contains one <begin> element and one <end> element. These describe the beginning and ending lines of the text to be extracted for that metadata field. These are the most complicated part of the template language, and the syntax and structure of these rules have not evolved in an entirely consistent fashion over time.

All <begin> and <end> elements contain some sort of line selector expression. Sometimes this expression is simply text.
In other cases it is another XML element (e.g., the <stringmatch> elements in Figure 1). Attributes of the begin/end rules themselves are:

inclusive
    Required: No?
    Possible values: “before”, “after”, “current”
    Default: current?
    Meaning: modifies the choice made by the basic line selector. If “before”, the selected line is actually the one before that indicated by the line selector. If “after”, the selected line is the one after that indicated by the line selector. If “current”, the selected line is the one indicated by the line selector.

2.4 Line Selector Expressions

Inside each <begin> or <end> rule is a line selector expression. These describe tests that are applied to the lines of the CleanML representation of the document until the indicated test succeeds. By default, such tests are applied from the start of the document, though some of the selectors may alter this choice.

Notation:

- mf: a metadata field name
- v1, v2, …: a vertical position on the page, expressed as a decimal number in the range 0.0-1.0 denoting the fraction of the page (by line count), with 0.0 being the top
- wc: an integer denoting a number of words
- s1, s2: sizes (of fonts), expressed in units internal to CleanML

Selectors and their meanings:

mf
    Permitted only in a <begin> rule. Selects the line chosen by the <end> rule of that metadata field. (Note: although it is possible to extract multiple instances of the same metadata field, e.g., multiple authors in a document, only uniquely occurring fields should be named in selectors.)

largersize
    Takes the current line (lines at the beginning with string length less than 3 are ignored) and finds a line whose font size is larger than that of the current line (lines with string length less than 10 are ignored).

sizechange(x)
    Find a line whose font size is different from that of the previous line. To overcome OCR errors, a change with difference less than x is ignored.

sizepctchange(x)
    Find a line whose font size is different from that of the previous line by more than x percent (0.0-1.00).

featurechange
    Find a line whose features are different from those of the previous line. A feature change occurs when any of the following are true:
    - Font size is different
    - One is bold and the other is not bold
    - One is Allupcase and the other is not
    - One is leadingcase and the other is not

largestsize(v1,v2,wc,mf)
    Deprecated: searches lines between positions v1 and v2 that are also after mf and that contain a minimum of wc words for the largest font size. Returns the first of those lines containing that font.

largeststrsize(v1,v2)
    Searches for the largest font size among lines between positions v1 and v2 that meet the following criteria:
    - Its length is larger than 11
    - It has more than 1 word
    - Average word length is between 4 and 13
    - Percentage of letters is larger than 0.7
    - It does not contain any of the phrases “center”, “report”, “division”, “university”, “laboratory”, “institute”, “U. S. Army”, “approved for”, “United States Air Force”, or “United States Naval Academy”

layoutchange
    Find a line that differs from its preceding line in that one of the following is true:
    - Font size is different
    - One is bold and the other is not bold

boldchange
    Find a line that differs from its preceding line in that one is in bold and the other is not.

beginwithmonth
    Find a line that begins with a month such as “Mar ”, “January”, etc.

dateformat(format)
    Find a line that has a date with the specified format (currently only “month yyyy” and “dd month yyyy” are supported).

dateformat
    Find a line that has a date with format “dd month yyyy”, “month dd, yyyy”, or “month yyyy”, where “month” means a month string such as “Jan ”, “September”, etc.

nameformat
    Find a line that is in a name format (such as F. Last, First Last, etc.).

!nameformat
    Find a line that is not in a name format.

size=s1
    Find a line with font size s1.

size(s1,s2)
    Return true if a line’s font size is between s1 and s2.

onesection
    Permitted only in an <end> rule. Selects the same line as the <begin> rule.

subtitle
    Find a line that consists of all upcase letters and that has fewer than 4 words, or a line in which the only uncapitalized words are “a”, “of”, “the”, “for”, “one”, “to”, or “in”.

title
    Find a line that has 4 or more words and that either consists of all upcase letters or in which the only uncapitalized words are “a”, “of”, “the”, or “for”.

begin
    The first line.

end
    The last line.

stringmatch
    Match a special string (see Section 2.5).

verticalSpace
    Find a line preceded by an empty line.

firstpart
    Deprecated: matches the current line (same as “begin” in a <begin> rule or “onesection” in an <end> rule).

lastpart
    Line of the previous field’s end.

keyword
    Deprecated: find a line starting with “keyword” (could be done with a stringmatch).

abstract
    Deprecated: find a line starting with “abstract” (could be done with a stringmatch).

paraChange
    Deprecated: find a line that starts a paragraph and that has features (see featurechange, above) different from those of the prior line.

ParaEnd
    Finds a line preceding a line that was indicated as the start of a paragraph (by OCR).

regexps(re)
    Find a line matching a regular expression re.

StyleChangeNextPart
    Deprecated: used only by experimental looping code.

creator
    Deprecated: matches the current line (same as “onesection”).

2.5 String matching

Unlike most line selectors, <stringmatch> is formed as an XML element rather than as plain text within a <begin> or <end> rule. This is presumably because of the more elaborate set of options available for this selector. The actual text to be matched is inside the <stringmatch> element (e.g., “Army Research” in Figure 1).
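To make the case, loc, and fuzzy attributes of <stringmatch> concrete, here is a minimal Python sketch of one way such a test could behave. It is an illustration only, not the Extract implementation: the function names are hypothetical, and the way fuzzy interacts with loc (comparing the provided text against a slice of the line of the same length) is an assumption, since this manual does not specify it.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_match(line, text, loc, case="yes", fuzzy=0):
    """Hypothetical stand-in for a <stringmatch> test on a single line."""
    if case == "no":
        line, text = line.lower(), text.lower()
    n = len(text)
    # ASSUMPTION: fuzzy matching compares the text against same-length
    # slices of the line; the real engine may define this differently.
    if loc == "beginwith":
        candidates = [line[:n]]
    elif loc == "endwith":
        candidates = [line[-n:]] if n else [""]
    elif loc == "onesection":
        candidates = [line]
    elif loc == "contain":
        candidates = [line[i:i + n] for i in range(max(1, len(line) - n + 1))]
    else:
        raise ValueError("unknown loc value: " + loc)
    return any(levenshtein(c, text) <= fuzzy for c in candidates)


# The <begin> rule of Figure 1: case="no", loc="beginwith".
print(string_match("ARMY RESEARCH LABORATORY", "Army Research", "beginwith", case="no"))
```

With fuzzy left at 0 the sketch reduces to exact prefix, suffix, whole-line, and substring tests; a nonzero value tolerates that many single-character differences, which is useful when OCR has misread a letter.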
The attributes of <stringmatch> are:

case
    Required: no
    Possible values: “yes” or “no”
    Default: yes
    Meaning: if “yes”, upper/lower case is significant. If “no”, upper/lower case differences are ignored.

loc
    Required: yes
    Possible values: “beginwith”, “onesection”, “contain”, “endwith”
    Default: (no default)
    Meaning: modifies how much of the text in a line must match the provided text:
    - beginwith: the line must begin with the provided text
    - endwith: the line must end with the provided text
    - onesection: the entire line must match the provided text
    - contain: the provided text must occur somewhere within the line

fuzzy
    Required: no
    Possible values: non-negative integer
    Default: 0
    Meaning: match succeeds even if the line differs from the provided text by this number of single-character changes (Levenshtein edit distance)

3. “TemplateMaker” Tool

“TemplateMaker” is a GUI that helps with manual template creation; the script that launches it is installed during installation. Because the names of the metadata fields vary from one organization's collection to another, every organization is provided with a “TemplateMaker” tool that differs only in the metadata field names. General functionality and usage are the same for all the tools.

3.1 Usage of “TemplateMaker” Tool

Before running the tool, gather sample Omnipage XML files of the class for which you want to create a template into one directory (place the directory where it is easily accessible).

Step 1: Start creating a new template. Select File -> New and change the “TemplateID” as required.

Step 2: Load the Omnipage XML files. Click Sample -> Open Samples, navigate to the directory containing the class samples, and load the XML files. Once they are loaded, the drop-down box is populated with the sample Omnipage XML file names. Select one of them and the “cleanML” area is populated with the corresponding CleanML. You can make different selections from the drop-down box.

Step 3: Write a rule for a metadata field. Click the “Meta Select” button, select the required metadata field name from the drop-down box, and then click “Okay”.
This places the selected metadata field in the template with empty “begin” and “end” rules. Make sure that the field is placed in the right position, i.e., that the template is still in proper XML format. You can check the XML format by clicking the “Check Template XML” button. Any number of metadata fields can be added, depending on the requirements.

Step 4: Include the required features for the “begin” and “end” rules. Features like “String Match” or any feature from the scroll box can be added, depending on the requirements.

Step 5: Check the result. Click the “Execute Extract” button to view the result of the template in the area under the “Extract Result” tab on the right of the window.

Step 6: Save the template. When the template works well for most of the samples, save it along with comments giving the author, date of creation, collection, etc.

4. Testing

4.1 Regression Test

Regression testing is performed to see whether the behavior of the system has changed. Regression tests combine inputs to the system with the outputs that a prior version of the system was able to produce. Most regression tests will be cases where the system produced correct output. We run these tests mainly to be sure that changes to the system have not broken things that used to work. It’s not unusual, though, for regression tests to include test cases where the system produced incorrect output. These tests allow us to see if future fixes actually correct known bad behaviors.

Regression tests for Extract are run from the top project by giving Ant (build.xml) the target “regression-test”. This target depends on the target “deploy”, so the installable system is built first. This is then installed (within the main project’s target directory) and the newly installed system is run on a collection of regression test suites. Each suite consists of a number of input documents (in .pdf or already-OCR’ed .xml formats) and the expected output for these documents.
Regression test reports are written to the directory target/regressiontest.

To create a new regression test suite, create a directory to hold the suite. This directory should have two subdirectories, input and expected. Input documents for the test are put into the input directory. The expected directory contains the expected outputs for those documents, generally gathered from prior runs of the system. These are kept in their usual output subdirectory (e.g., resolved or untrusted). For each file in some subdirectory of expected, the regression test looks to see if the software produced an identically named “actual” output file in the corresponding output subdirectory. If so, the two are compared (currently, only XML outputs can be handled). If the expected and actual outputs differ, or if there is no actual output corresponding to an expected output, the test fails.

Once the suite directories have been populated with the desired files, edit the main project build.xml file and go to the rtest-execute target. Add a new call to the runTestSuite macro, specifying the name of the new test suite, the path to the suite directory, the name of the collection used for this test, and the number of input documents in the suite.
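The expected-versus-actual check described above can be approximated outside of Ant with a short Python sketch. This is not the runTestSuite macro, whose comparison method this manual does not describe. The function names are hypothetical, and comparing canonicalized XML with whitespace-only text stripped is an assumption about what “differ” should mean. ET.canonicalize requires Python 3.8 or later.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def same_xml(expected_file, actual_file):
    """Compare two XML files after canonicalization (ASSUMED notion of equality)."""
    def canon(path):
        return ET.canonicalize(from_file=str(path), strip_text=True)
    return canon(expected_file) == canon(actual_file)


def check_suite(expected_dir, actual_dir):
    """Return a list of (relative path, reason) pairs for every failing file.

    Every XML file in a subdirectory of expected_dir (e.g., resolved or
    untrusted) must have an identically named, equivalent file in the
    corresponding subdirectory of actual_dir.
    """
    failures = []
    for expected_file in sorted(Path(expected_dir).glob("*/*.xml")):
        rel = expected_file.relative_to(expected_dir)
        actual_file = Path(actual_dir) / rel
        if not actual_file.is_file():
            failures.append((rel.as_posix(), "missing"))
        elif not same_xml(expected_file, actual_file):
            failures.append((rel.as_posix(), "differs"))
    return failures
```

An empty result from check_suite means the suite passes. Each entry otherwise names an expected output that either has no actual counterpart or differs from it, matching the two failure conditions described above.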