GPO Final Report
Feasibility Study of Congressional Documents

Deliverable for Project No.: 371361
Contract No.: SP4705-07-C-0012
Funding Agency: U.S. Government Printing Office through the Defense Logistics Agency
Project Title: Automated Metadata Extraction Project for the U.S. Government Printing Office (GPO), Phase IV
Budget Period: 09/30/07 - 09/29/09
Project Period: 09/30/07 - 09/29/09

1 INTRODUCTION

ODU is working with GPO on a metadata extraction project covering, first, a collection of U.S. Environmental Protection Agency (EPA) documents and, second, various congressional documents. This report summarizes our findings on the feasibility of automated metadata extraction for GPO collections. Our findings are based on the sample of 994 files from the EPA collection and 921 files from the congressional collection provided by GPO for analysis.

1.1 Process Overview

The overall process for extracting metadata from the sample documents consists of the following steps:

1. Manual Classification: Manually classify the sample documents.
2. Writing Templates: Based on the manual classification, write a starting set of templates. In our template writing we focused on the metadata outlined in the requirement documents provided by GPO.
3. System Testing: Test the system by extracting metadata from PDF documents and evaluate the output for correctness.
4. Refining Templates: Refine the templates based on the outcome of Step 3, and iterate these steps until we reach a satisfactory accuracy in the extracted output (greater than 80%).
5. Production Run: Perform a production run to generate the metadata and submit it to GPO for review.
6. Based on the reviews, repeat Steps 3 and 4.

1.2 Manual Classification

To classify the sample documents, we used a manual, multiphase process based on visual similarities between documents. The first phase is a coarse classification based on similarities observed in the Windows Explorer thumbnail view of the PDF documents.
Documents with matching thumbnails (as judged by the operator's eye) are grouped together as a preliminary class. During the second phase, the documents in each class are opened in Acrobat Reader. Multiple documents are opened at the same time and compared on cover page similarity, visual layout, and metadata field occurrence. Documents determined to be dissimilar to the bulk of their class are withdrawn from that class and reconsidered as examples of potential new classes. In the next phase we placed screenshots of three samples of each class into a document to serve as a visual key to the classes. Each of the remaining documents was opened and compared to the visual key to determine its likely class.

1.3 Writing Templates

We have developed a template writing tool based on the template language we designed for automated metadata extraction; see the figure below. Using this tool, we developed a variety of templates describing the distinct layouts of metadata text encountered in documents from the EPA and congressional collections.

1.4 System Testing

The figure below gives an overview of the software system used to extract the metadata. PDF documents from the EPA and congressional collections were supplied to the Extract program along with the previously developed layout templates.

[Figure: system diagram. Selected pages of each PDF document are converted to XML via OCR (Omnipage) and fed, together with the document layout templates, to the ODU Extract program. Trusted metadata flows directly to the validated output; untrusted metadata is routed to human reviewers, whose corrected metadata joins the validated output.]

In normal operation, the metadata generated by Extract would be diverted into one of two streams. Metadata judged by Extract to be trustworthy would be deposited into the final output directory. Untrusted metadata would be referred for human review and, if necessary, correction.
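The two-stream routing described above can be sketched in a few lines. This is an illustrative sketch only, not the actual Extract implementation: the field names, the per-field confidence scores, and the single fixed threshold are all assumptions (the real system's trust judgment is based on richer validation rules).

```python
# Hypothetical sketch of routing extracted metadata into trusted and
# untrusted streams. Field names, scores, and the threshold are invented
# for illustration; they are not the real Extract system's internals.

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for this sketch

def route_record(record):
    """Return 'trusted' or 'untrusted' for one extracted metadata record.

    `record` maps field names to (value, confidence) pairs; a record is
    trusted only if every field clears the confidence threshold.
    """
    if all(conf >= CONFIDENCE_THRESHOLD for _, conf in record.values()):
        return "trusted"
    return "untrusted"

records = [
    {"title": ("Annual Water Quality Report", 0.95),
     "date": ("1998", 0.91)},
    {"title": ("Pesticide Fact Sheet", 0.97),
     "date": ("19S8", 0.42)},  # low confidence, e.g. an OCR-damaged date
]

streams = {"trusted": [], "untrusted": []}
for rec in records:
    streams[route_record(rec)].append(rec)

print(len(streams["trusted"]), len(streams["untrusted"]))  # 1 1
```

In this sketch the second record is referred for human review solely because of its low-confidence date field, mirroring the report's point that a single doubtful field is enough to divert a record to the review stream.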
Metadata can be marked as untrusted for a variety of reasons, including defects in the Extract software, incorrect templates, OCR errors, poor-quality documents, or simply documents whose content or layouts were statistically anomalous.

1.5 Refining Templates

Using the same tools as in the initial template writing, this stage focused on document layouts for which untrusted data was commonly obtained and, to a lesser degree, on layouts for which trusted metadata differed significantly from expectations.

1.6 Production Run

Using the refined templates, the documents were run through the Extract system as diagrammed above. This time, however, both untrusted and trusted metadata were collected for delivery to GPO.

2 FINDINGS

Following the process outlined in the Introduction, we developed 108 templates for the EPA collection and 11 templates for the congressional collection.

EPA: Searchable metadata was extracted for approximately half of all documents, though in many cases the formatting and labeling did not meet GPO standards.

Congressional: Pending GPO evaluation, we believe that the quality of extraction here is quite good, modulo a few OCR issues that, unfortunately, recur consistently across a number of similar documents. (As noted later, we have some reason to believe that these problems could be largely resolved by direct PDF-to-text conversion without OCR, a capability currently under development.) Discounting those, we believe that we would have achieved accurate searchable metadata for 75% of the sample documents.

2.1 Problems Experienced

1) An unexpectedly large number of templates was required for the EPA collection, and many documents were still left unhandled. Such documents were typically "singletons" (one-of-a-kind documents with metadata laid out in a fashion unlike any other document in the sample). This may be attributed in part to the fact that the EPA collection is an aggregation of documents both from the EPA itself and from many different organizations reporting to the EPA.
The use of web crawling in the creation of the collection leaves serious doubt as to how large and diverse the eventual total collection might be, and as to the degree to which this small sample is actually representative of that totality. Developing a template for a singleton document is generally not going to be cost effective. However, some singletons may be artifacts of the sample size; in a larger sample, collected over a longer period of time, some of these singletons would probably be matched by other documents. Regardless, the diversity of the EPA collection was surprising, even for a project such as ours that is based upon techniques for embracing diversity.

2) Unexpected numbers of OCR errors. We relied upon commercial OCR software and, to a large degree, our success is limited by the ability of such engines to supply us with clean text. EPA documents contained a great deal of critical text juxtaposed or superimposed with graphic elements, including background pictures and logos. EPA documents also contained several layouts in which metadata was arranged in multicolumn headers; these proved particularly troublesome for the OCR engine in use. Congressional documents contained some repeated elements in fairly small fonts set directly above or below a horizontal line. Again, this caused an unexpected number of problems for the OCR engines we tried.

3) Labeling and formatting of output. We entered this project believing it to be largely an exercise in the extraction of metadata, but found that a substantial portion of our effort was actually devoted to labeling the extracted metadata with the most appropriate MARC field or subfield. We were expected to make far more subtle distinctions with these collections than had been required for our prior work with DTIC or NASA, and found ourselves severely limited by our lack of expertise both in MARC classification in general and in the local culture of GPO.
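The labeling step discussed above can be illustrated as a mapping from internally extracted field names to MARC tags and subfields. The mapping below is a deliberately simplified sketch (the field names and the fallback rule are our own invention); the difficulty described in the text is precisely that choosing the right tag, e.g. title versus series versus note, often requires cataloguer judgment that a static table cannot capture.

```python
# Hypothetical sketch of labeling extracted fields with MARC tags/subfields.
# The internal field names and the default-to-note fallback are illustrative.

MARC_MAP = {
    "title":     ("245", "a"),  # title statement
    "subtitle":  ("245", "b"),  # remainder of title
    "author":    ("100", "a"),  # main entry, personal name
    "corporate": ("110", "a"),  # main entry, corporate name
    "series":    ("490", "a"),  # series statement
    "note":      ("500", "a"),  # general note
}

def to_marc(extracted):
    """Render extracted fields as simple 'TAG $x value' strings."""
    lines = []
    for field, value in extracted.items():
        # Unknown fields fall back to a general note in this sketch.
        tag, subfield = MARC_MAP.get(field, ("500", "a"))
        lines.append(f"{tag} ${subfield} {value}")
    return lines

print(to_marc({"title": "Pesticide Fact Sheet", "series": "EPA 738-F-96-005"}))
# ['245 $a Pesticide Fact Sheet', '490 $a EPA 738-F-96-005']
```

The reviewer disagreements described in the next paragraphs arise exactly where such a table is ambiguous: the same extracted string might legitimately be labeled a title, a series, or a note.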
In many cases, data that we thought had been extracted correctly from the EPA collection was rejected because the GPO reviewers disagreed with our labeling. During the first round of review, there were some instances in which the two GPO reviewers disagreed with one another and/or with prior instructions we had been given; such disagreements highlight the inherent difficulty of this task.

Formatting of output: applying the MARC rules for punctuation, and especially for capitalization, calls for semantic judgments that are typically difficult for software and that had not been a focus of our prior efforts. Not having allocated resources or time for a full-fledged effort at developing an integrated capability for named entity recognition, we were forced to make do with a more ad hoc approach. Developing that approach took a fair amount of effort, and the results were not as successful as we would have liked.

4) Limited availability of historical metadata. A major feature of our system is the use of internal validation rules to judge whether extracted text is likely to be good metadata. Many of these rules are probabilistic, based upon statistics gathered from metadata already existing for the collection. For the EPA collection, our sample metadata was limited to about 2,000 records. (By contrast, we had worked from a collection of 850,000 records for DTIC and 20,000 for NASA.) In addition, the EPA metadata was in many ways inconsistent with the labeling rules that evolved for our EPA demo. Finally, the EPA metadata sample turned out to be non-representative of the document sample from which we were extracting. An obvious example was the average percentage of title words that could be found in a conventional English dictionary: the documents we were working with turned out to have a far higher incidence of chemical names and technical terms than was consistent with the historical metadata.
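The dictionary-based validation statistic mentioned above can be sketched as follows. The tiny word list and the acceptance threshold are stand-ins: the real system derives its statistics from the historical metadata of the collection rather than from a fixed list.

```python
# Minimal sketch of one probabilistic validation rule: the fraction of
# title words found in an English dictionary. Word list and threshold
# are illustrative stand-ins, not the real system's learned statistics.

DICTIONARY = {"annual", "report", "water", "quality", "of", "the", "on"}
MIN_FRACTION = 0.5  # assumed lower bound learned from historical titles

def dictionary_fraction(title):
    """Fraction of the title's words that appear in the dictionary."""
    words = [w.lower().strip(".,;:") for w in title.split()]
    if not words:
        return 0.0
    return sum(w in DICTIONARY for w in words) / len(words)

def title_plausible(title):
    """Flag a candidate title as plausible if enough words are known."""
    return dictionary_fraction(title) >= MIN_FRACTION

print(title_plausible("Annual Report on Water Quality"))  # True
print(title_plausible("Qu4lity R3p0rt xY7"))              # False
```

The mismatch described in the text follows directly from a rule of this shape: titles full of chemical names and technical terms score a low dictionary fraction and are flagged as implausible even when correctly extracted, because the threshold was calibrated on non-representative historical metadata.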
As a related secondary effect, the sample documents tended to have a far higher rate of very long titles than was observed in the historical metadata.

5) Evaluation criteria. We have noted in prior communications that we have never viewed our system as a replacement for human judgment, but as part of a system that attempts to automate what it can while directing human attention to what it cannot successfully automate. Our system's output is not only extracted metadata but also an assessment of its confidence in each metadata field. Towards that end, we had suggested an evaluation scheme that distinguished between four cases:

1: extracted metadata is acceptable and was rated trusted
2: extracted metadata is acceptable and was rated untrusted
3: extracted metadata is unacceptable and was rated trusted
4: extracted metadata is unacceptable and was rated untrusted

We consider the software to be running properly in cases 1 and 4, though we understand that if the percentage of documents falling under case 4 is too large, the software's utility is decreased. Case 2 is a "soft" failure: it may distract the operator but poses no threat of unacceptable data being placed into the catalog. This case is of concern mainly if it becomes so common as to prove annoying or wasteful of the operator's time. Case 3 is a true "hard" failure.

However, the requested deliverable format and, as far as we can tell, the subsequent evaluation by GPO discard the confidence rating and merely judge whether the metadata was acceptable. In truth, this may not have played a major role in the acceptability of our system in these trials (and the problems with historical metadata lowered the effectiveness of the confidence scoring for the EPA collection), but it is a factor that should be kept in mind in the future.
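The four-way evaluation scheme above amounts to crossing two booleans: whether the extracted metadata is acceptable, and whether the system rated it trusted. A small sketch makes the tally explicit; the sample judgments below are invented for illustration.

```python
# Sketch of the four-case evaluation scheme: cross (acceptable?, trusted?).
# Cases 1 and 4 are correct system behavior; case 2 is a "soft" failure;
# case 3 is the true "hard" failure. Sample judgments are illustrative.

from collections import Counter

def case_number(acceptable, trusted):
    if acceptable and trusted:
        return 1  # good metadata, correctly trusted
    if acceptable and not trusted:
        return 2  # good metadata needlessly sent for review (soft failure)
    if not acceptable and trusted:
        return 3  # bad metadata passed as trusted (hard failure)
    return 4      # bad metadata correctly flagged for review

judgments = [(True, True), (True, True), (True, False),
             (False, True), (False, False)]

tally = Counter(case_number(a, t) for a, t in judgments)
print(dict(sorted(tally.items())))  # {1: 2, 2: 1, 3: 1, 4: 1}
```

An evaluation that records only acceptability collapses cases 1 with 2 and 3 with 4, which is exactly the information loss the report notes in the deliverable format requested by GPO.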
2.2 Recommendations for Future Development

1) Coping with layout diversity: Investigate a possible extraction engine with a more robust set of primitive extraction rules than those in current use, with the aim of collapsing what are currently many distinct but related templates for document layouts into a smaller number of more powerful templates. Investigate the possible role of named entity recognition techniques within the context of extraction.

2) OCR errors: Replace the systematic use of OCR with direct extraction of text from PDF documents, when possible. (This is not possible for all documents, but depends upon the technique used to produce the PDF. Documents scanned into PDF can only be recovered via OCR, but the majority of documents produced with the current generation of word processors could, in theory, be handled without OCR.) We have some promising preliminary results with a PDF-to-text tool currently under development, suggesting that it would avoid many of the OCR pitfalls we encountered in this study.

3) Labeling and formatting of output: To a large degree, the labeling issues are organizational and communication-related. More consistent feedback between the ODU designers and GPO staff needs to be organized in any future effort. Formatting: undertake a formal review of the state of the art in software for named entity recognition and explore the integration of such capabilities into our system design.

4) Limited availability of historical metadata: We have posited a future capability for the statistical component to "learn" from its mistakes over time. In a production system, this could be quite valuable. At the same time, such a capability is almost impossible to demonstrate in a one-shot evaluation scheme.

5) Evaluation criteria: Better communication on this is needed earlier in the process.

3 CONCLUSIONS AND FUTURE DEVELOPMENT

In an e-mail communication from Aug.
23, 2009, GPO communicated to us a set of questions concerning our overall conclusions and possibilities for future development. We reproduce the questions below and present our answers.

* Final evaluation on the feasibility of ODU being able to successfully stand up this software for GPO use, based on the research. If yes, which types of documents would be most successful for this?

Our assessment is that the ODU metadata extraction software can work effectively for collections such as the congressional collection. For collections such as EPA's, our software will still save considerable time in the GPO cataloguing process by handling about 50% of the documents at the trusted level.

* Based on the collections provided, if ODU believes the system could be tweaked to produce acceptable records, GPO would like to know how many additional hours of tweaking are needed and what the estimated costs associated with this are.

As pointed out in the Recommendations section above, to raise the acceptable extraction rate significantly for the EPA collection we can address a number of issues by making major changes to the extraction engine and by supporting direct extraction of the text from PDF documents. This is a significant effort, and would require an additional 9 to 12 months at an estimated cost of $75K to $100K. For the congressional collection, the only change we would recommend is full support for extracting text directly from PDF documents, an effort in the $35K range. To consolidate the EPA trusted acceptance rate at 50%, the most significant effort will lie in the library-related issues of 1) correctly distinguishing among similar fields such as titles, series, and some notes; and 2) rewriting text extracted from a document into AACR2-conformant format. That would be a $50K effort.

* Estimated costs to analyze another collection(s) of documents that has minimal existing metadata records.
If GPO decides to continue to utilize the software that has been developed to date for this project, what costs would be associated with using the software to generate metadata for other document collections?

The cost of analyzing another collection will be around $35K, covering the production of a template set and its testing and refinement.

* Estimated costs to analyze several additional GPO collections for feasibility of success. If GPO decides to have ODU continue to modify the already developed software, what costs would be associated with doing this? Would cost be based on document collection size, or would another project cost be proposed?

To determine the feasibility of applying our software to a particular collection, we would perform an analysis of the characteristics of the collection (size, change rate, availability of existing metadata, etc.) and perform the manual classification on two different samples. Because we use samples, this step does not depend on the size of the collection, and we can therefore make it a fixed cost of $8K per collection study.