GPO Final Report Feasibility Study of Congressional Documents
Deliverable for
Project No.: 371361
Contract No.: SP4705-07-C-0012
Funding Agency: U.S. Government Printing Office through the Defense
Logistics Agency
Project Title: Automated Metadata Extraction Project for the U.S.
Government Printing Office (GPO) Phase IV
Budget Period: 09/30/07 - 09/29/09
Project Period: 09/30/07 - 09/29/09
1 INTRODUCTION
ODU is working with GPO on a metadata extraction project covering two collections: U.S. Environmental Protection Agency (EPA) documents and various congressional documents. This report summarizes our findings on the feasibility of automated metadata extraction for GPO collections. Our findings are based on the samples provided by GPO for analysis: 994 files from the EPA collection and 921 files from the congressional collection.
1.1 Process Overview
The overall process for extracting metadata from sample documents consists of the following steps:
1. Manual Classification: Manually classify the sample documents.
2. Writing Templates: Based on the manual classification, write a starting set of templates. In our template writing we focused on the metadata as outlined in the requirement documents provided by GPO.
3. System Testing: Test the system by extracting metadata from PDF documents and evaluate the output for correctness.
4. Refining Templates: Refine the templates based on the outcome of Step 3 and iterate Steps 3 and 4 until the extracted output reaches a satisfactory accuracy (targeting greater than 80%).
5. Production Run: Do a production run to generate the metadata and submit it to GPO for review.
6. Review Iteration: Based on GPO's reviews, repeat Steps 3 and 4.
1.2 Manual Classification
To classify the sample documents, we used a manual, multi-phase process based on visual similarities between documents.
The first phase is a coarse classification based on similarities in the Windows Explorer thumbnail view of the PDF documents. Documents with matching thumbnails (as judged by the operator's eye) are grouped together as preliminary classes. During the second phase, the documents in each class are opened in Acrobat Reader. Multiple documents are opened at the same time and compared on cover-page similarity, visual layout, and metadata field occurrence. Documents determined to be dissimilar to the bulk of their class are withdrawn from that class and reconsidered as examples of potential new classes. In the final phase, we placed screenshots of three samples of each class into a document that serves as a visual key to the classes. Each of the remaining documents was opened and compared against the visual key to determine its most likely class.
1.3 Writing Templates
We developed a template-writing tool based on the template language we designed for automated metadata extraction (see the figure below). Using this tool, we developed a variety of templates describing the distinct layouts of metadata text encountered in documents from the EPA and Congressional collections.
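This report does not reproduce the template language itself. Purely as a conceptual illustration, the Python sketch below shows the kind of information a single document-layout template encodes: which metadata fields to capture, roughly where on the cover page they tend to appear, and the textual cues that mark them. Every name and pattern in it is a hypothetical example and is not part of the actual ODU template language.

    # Conceptual illustration only: the actual ODU template language differs in
    # syntax and capability. This dict merely sketches what a layout template
    # records: fields to capture, approximate position, and textual cues.
    import re

    epa_report_cover_template = {
        "class": "epa-report-cover",   # hypothetical document-class name
        "page": 1,                     # metadata is read from the cover page
        "fields": {
            "report_number": {
                "region": "top-right",  # approximate position on the page
                "pattern": re.compile(r"EPA[- ]?\d{3}[-/][A-Z][-/]\d{2}[-/]\d{3}"),  # rough example
            },
            "title": {
                "region": "upper-center",
                "pattern": re.compile(r"^[A-Z][^\n]{10,200}$", re.MULTILINE),
            },
            "publication_date": {
                "region": "bottom-center",
                "pattern": re.compile(r"(January|February|March|April|May|June|July|"
                                      r"August|September|October|November|December)\s+\d{4}"),
            },
        },
    }

In the actual system, descriptions of this general kind are authored and maintained through the template-writing tool shown in the figure.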
1.4 System Testing
The figure below gives an overview of the software system used to extract the metadata. PDF documents from the EPA and Congressional collections were supplied to the Extract program along with the layout templates previously developed.
[Figure: overview of the extraction system. Selected pages of each PDF document are OCR'd (Omnipage), and the resulting text is passed, together with the document layout templates, to the ODU Extract program. Extract emits XML metadata in two streams: trusted metadata goes directly into the validated metadata set, while untrusted metadata is routed to human reviewers, whose corrected metadata then joins the validated set.]
In normal operation, the metadata generated by Extract would be diverted into one of two streams. Metadata judged by Extract to be trustworthy would be deposited into the final output directory. Untrusted metadata would be referred for human review and, if necessary, correction. Metadata can be marked as untrusted for a variety of reasons, including defects in the Extract software, incorrect templates, OCR errors, poor-quality documents, or simply documents whose content or layouts were statistically anomalous.
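As a minimal sketch of the routing just described, and assuming a hypothetical per-record confidence score, output directory names, and threshold (none of which are part of the actual Extract program), the logic reduces to:

    # Minimal sketch of the trusted/untrusted routing described above.
    # Directory names and the 0.9 threshold are hypothetical; the real Extract
    # program applies its own internal validation rules to make this decision.
    import shutil
    from pathlib import Path

    TRUSTED_DIR = Path("output/trusted")       # goes directly into the validated set
    UNTRUSTED_DIR = Path("output/for_review")  # queued for human review and correction

    def route_metadata(xml_file: Path, confidence: float, threshold: float = 0.9) -> Path:
        """Place an extracted-metadata XML file in the trusted or review stream."""
        destination = TRUSTED_DIR if confidence >= threshold else UNTRUSTED_DIR
        destination.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(xml_file, destination / xml_file.name))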
1.5 Refining Templates
Using the same tools as in the initial template writing, this stage focused on document layouts for which untrusted data was commonly obtained and, to a lesser degree, on layouts for which trusted metadata differed significantly from expectations.
1.6 Production Run
Using the refined templates, documents were run through the Extract system as diagrammed
above. This time, however, both untrusted and trusted metadata were collected for delivery to
GPO.
2 FINDINGS
Based on the process outlined in the Introduction section, we developed 108 templates for the EPA collection and 11 templates for the Congressional collection.
EPA: We extracted searchable metadata for approximately half of all documents, though in many cases the formatting and labeling did not meet GPO standards.
Congressional: Pending GPO evaluation, we believe that the quality of extraction here is quite good,
modulo a few OCR issues that, unfortunately, recur consistently in a number of similar documents. (As
noted later, we have some reason to believe that these problems could be largely resolved by direct PDF
to text conversion without OCR, a capability currently under development.) Discounting those, we
believe that we would have achieved accurate searchable metadata in 75% of the sample documents.
2.1 Problems experienced
1) An unexpectedly large number of templates was required for the EPA collection, and many documents still went unhandled. Such documents were typically “singletons” (one-of-a-kind documents with metadata laid out in a fashion unlike any other document in the sample).
• This may be attributed in part to the fact that the EPA collection is an aggregation both from the EPA itself and from many different organizations reporting to the EPA. The use of web crawling in creating the collection leaves serious doubt as to how large and diverse the eventual total collection might be and as to the degree to which the small sample is actually representative of that totality.
• Developing a template for a singleton document is generally not going to be cost effective. However, some singletons may be artifacts of the sample size. In a larger sample, collected over a longer period of time, some of these singletons would probably be matched by other documents.
• Regardless, the diversity of the EPA collection was surprising, even for a project such as ours that is based upon techniques for embracing diversity.
2) Unexpected numbers of OCR errors – we relied upon commercial OCR software and, to a large degree, our success is limited by the ability of such engines to supply us with clean text.
• EPA documents contained a great deal of critical text that was juxtaposed with or superimposed on graphic elements, including background pictures and logos.
• EPA documents contained several layouts in which metadata was arranged in multi-column headers. These proved particularly troublesome for the OCR engine in use.
• Congressional documents contained some repeated elements in fairly small fonts set directly above or below a horizontal line. Again, this seemed to cause an unexpected number of problems for the OCR engines we tried.
3) Labeling and formatting of output – we entered this project believing it to be largely an exercise in extraction of metadata, but found that a substantial portion of our effort was actually devoted to:
• Labeling the extracted metadata with the most appropriate MARC field or subfield. We were expected to make far more subtle distinctions with these collections than had been required for our prior work with DTIC or NASA, and found ourselves severely limited by our lack of expertise both in MARC classification in general and in the local culture of GPO.
• In many cases, data that we thought had been extracted correctly for the EPA collection was rejected because the GPO reviewers disagreed with our labeling. During the first round of review, there were some instances in which the two GPO reviewers disagreed with one another and/or with prior instructions we had been given; these disagreements highlight the inherent difficulty of this task.
• Formatting of output: application of the MARC rules for punctuation and especially capitalization calls for semantic judgments that are typically difficult for software and that had not been a focus of our prior efforts. Not having allocated resources or time for a full-fledged effort at developing an integrated capability for named entity recognition, we were forced to make do with a more ad hoc approach. Developing that approach took a fair amount of effort, and the results were not as successful as we would have liked.
4) Limited availability of historical metadata
• A major feature of our system is the use of internal validation rules to judge whether extracted text is likely to be good metadata. Many of these rules are probabilistic, based upon statistics gathered from metadata already existing for the collection (a simplified sketch of one such rule appears after this list).
• For the EPA collection, our sample metadata was limited to about 2,000 records. (By contrast, we had worked from a collection of 850,000 records for DTIC and 20,000 for NASA.) In addition, the EPA metadata was in many ways inconsistent with the labeling rules that evolved for our EPA demo. Finally, the EPA metadata sample turned out to be non-representative of the document sample from which we were doing extraction. An obvious example of this was the average percentage of title words that could be found in a conventional English dictionary: the documents we were working with turned out to have a far higher incidence of chemical names and technical terms than was consistent with the historical metadata. As a related secondary effect, the sample documents had a far higher rate of occurrence of very long titles than was observed in the historical metadata.
5) Evaluation Criteria
• We have noted in prior communications that we have never viewed our system as a replacement for human judgment, but as part of a system that attempts to automate what it can while directing human attention to what it cannot successfully automate. Our system’s output is not only extracted metadata, but also an assessment of its confidence in each metadata field. Towards that end, we had suggested an evaluation scheme that distinguished between four cases:
1: extracted metadata is acceptable and was rated resolved
2: extracted metadata is acceptable and was rated untrusted
3: extracted metadata is unacceptable and was rated resolved
4: extracted metadata is unacceptable and was rated untrusted
• We consider the software to be running properly in cases 1 and 4, though we understand that if the percentage of documents falling under case 4 is too large, the software's utility is decreased. Case 2 is a "soft" failure: it may distract the operator but poses no threat of unacceptable data being placed into the catalog. This case is of concern mainly if it becomes so common as to prove annoying or wasteful of the operator's time. Case 3 is a true “hard” failure.
• However, the requested deliverable format and, as far as we can tell, the subsequent evaluation by GPO discard the confidence rating and merely judge whether the metadata was acceptable.
• In truth, this may not have played a major role in the acceptability of our system in these trials (and the problems with historical metadata lowered the effectiveness of the confidence scoring in the EPA collection), but it is a factor that should be kept in mind in the future.
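As referenced in item 4 above, the following is a simplified sketch of one probabilistic validation rule. It compares the fraction of recognizable dictionary words in a candidate title against statistics drawn from historical metadata; the word list, the statistics, and the outlier cutoff are illustrative assumptions, not the actual rules implemented in Extract.

    # Illustrative sketch of a probabilistic validation rule of the kind
    # described in item 4: judge a candidate title by how its fraction of
    # dictionary words compares with the distribution seen in historical
    # metadata. The word list and cutoff below are assumptions only.
    import re
    import statistics

    COMMON_WORDS = {"the", "of", "and", "for", "in", "on", "report", "study",
                    "analysis", "water", "quality", "environmental", "assessment"}

    def dictionary_ratio(title: str, dictionary=COMMON_WORDS) -> float:
        words = re.findall(r"[a-z]+", title.lower())
        return sum(w in dictionary for w in words) / len(words) if words else 0.0

    def looks_like_title(candidate: str, historical_ratios: list, max_z: float = 2.0) -> bool:
        """Flag a candidate as untrusted if its dictionary-word ratio is a
        statistical outlier relative to ratios from historical metadata."""
        mean = statistics.mean(historical_ratios)
        stdev = statistics.stdev(historical_ratios) or 1e-6
        z = abs(dictionary_ratio(candidate) - mean) / stdev
        return z <= max_z

As the EPA experience shows, a rule of this kind is only as good as its historical sample: titles rich in chemical names and technical terms score as outliers against statistics drawn from more conventional records.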
2.2 Recommendations for future development
1) Coping with layout diversity: Investigate a possible extraction engine with a more robust set of primitive extraction rules than currently in use, with the aim of collapsing what are currently many distinct but related document-layout templates into a smaller number of more powerful templates. Investigate the possible role of named entity recognition techniques within the context of extraction.
2) OCR Errors: Replace the systematic use of OCR with direct extraction of text from PDF documents, when possible. (This is not possible for all documents; it depends upon the technique used to produce the PDF. Documents scanned into PDF can only be recovered via OCR, but the majority of documents produced with the current generation of word processors could, in theory, be handled without OCR.) We have some promising preliminary results with a PDF-to-text tool currently under development, suggesting that it would avoid many of the OCR pitfalls we encountered in this study; a simplified sketch of direct extraction appears after this list.
3) Labeling and Formatting of output
• To a large degree, the issues of labeling are organizational and communication-related. More consistent feedback between ODU designers and GPO staff needs to be organized in any future efforts.
• Formatting: undertake a formal review of the state of the art in software for named entity recognition and explore the integration of such capabilities into our system design.
4) Limited availability of historical metadata:
• We have posited a future capability for the statistical component to “learn” from its mistakes over time. In a production system, this could be quite valuable. At the same time, such a capability is almost impossible to demonstrate in a one-shot evaluation scheme.
5) Evaluation criteria
• Better communication on evaluation criteria is needed earlier in the process.
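As referenced in recommendation 2, the sketch below illustrates direct extraction of embedded text from a born-digital PDF, using the open-source pdfminer.six library as a stand-in; it is not the PDF-to-text tool under development at ODU. Scanned documents expose little or no embedded text, so the pipeline still needs an OCR fallback for them.

    # Sketch of direct PDF text extraction, using the open-source pdfminer.six
    # library as a stand-in; this is NOT the ODU PDF-to-text tool under
    # development. Scanned pages yield little or no embedded text, in which
    # case the pipeline must still fall back to OCR.
    from typing import Optional

    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    def first_page_text(pdf_path: str, min_chars: int = 40) -> Optional[str]:
        """Return the embedded text of page 1, or None if the page appears to
        be a scanned image and therefore still requires OCR."""
        text = extract_text(pdf_path, page_numbers=[0]).strip()
        return text if len(text) >= min_chars else None

    # Hypothetical usage: prefer the cleaner direct-extraction path when
    # embedded text is available, and fall back to OCR otherwise.
    # text = first_page_text("document.pdf")
    # if text is None:
    #     text = run_ocr("document.pdf")  # run_ocr is a hypothetical placeholder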
3 CONCLUSIONS AND FUTURE DEVELOPMENT
In an e-mail communication dated Aug. 23, 2009, GPO sent us a set of questions concerning our overall conclusions and possibilities for future development. We reproduce the questions below and present our answers.
* Final evaluation on the feasibility of ODU being able to successfully stand up this software for GPO
use based on the research. If yes, which types of documents would be most successful for this?
Our assessment is that the ODU metadata extraction software can work effectively for collections such as the Congressional collection. For collections such as the EPA collection, our software will still save considerable time in the GPO cataloguing process by processing about 50% of the documents at the trusted level.
* Based on the collections provided, if ODU believes the system could be tweaked to produce
acceptable records, GPO would like to know how many additional hours for tweaking are needed and
what are the estimated costs associated with this?
As pointed out in the recommendations section above, to raise the acceptable extraction rate significantly we can address a number of issues with the EPA collection by making major changes to the extraction engine and by supporting direct extraction of text from PDF documents. This is a significant effort: it would require an additional 9 to 12 months, with a cost estimate of $75K to $100K. For the Congressional collection, the only change we would recommend is full support for extracting text directly from PDF documents, an effort in the $35K range. To consolidate the EPA trusted acceptance rate at 50%, the most significant effort will lie in the library-related issues of 1) correctly distinguishing among similar fields such as titles, series, and some notes; and 2) rewriting text extracted from a document into AACR2-conformant format. That would be roughly a $50K effort.
* Estimated costs to analyze another collection(s) of documents that has minimal existing metadata
records. If GPO decides to continue to utilize the software that has been developed to date for this
project, what costs would be associated with using the software to generate metadata for other
document collections?
The cost of analyzing another collection will be around $35K to produce, test, and refine a template set.
* Estimated costs to analyze several additional GPO collections for feasibility of success. If GPO decides
to have ODU continue to modify the already developed software, what costs would be associated with
doing this? Would cost be based on document collection size or would another project cost be proposed?
To determine the feasibility of applying our software to a particular collection, we would analyze the characteristics of the collection (size, change rate, availability of existing metadata, etc.) and perform the manual classification on two different samples. Because we work from samples, this step does not depend on the size of the collection; we can therefore offer it as a fixed cost of $8K per collection studied.