word comparison across outputs

Supporting Document 1 German text interpreted as “unb auf ben Â©elnrgen be6 fublic{)en” by the OCR process. IA OCR OCR 2 Transcription 1 Transcription 2 1 unb und und und Ok 2 den ben den den Ok 3 Â©elnrgen Â©ebirgen Bebirgen Gebirgen X 4 be6 des de5 des Chk 5 fublic{)en fublichen Füdlichen Südlichen X 6 Â£)eittfc{)(anb6 Deutfchlanbs Deutfchlands Deutschlands X Different texts parsing the phrase “und auf den Gebirgen des südlichen Deutschlands” (“and on the mountains of southern Germany”) In some cases, the correct text can be inferred from the agreeing versions (examples 1 and 2 in the table above), others, it can be automatically reviewed or verified (example 4 above) but others (like examples 3, 5 and 6) may indicate the need to do a manual revision of the words by a Data Analyst or even the necessity to regenerate the whole OCR text for the page with proper parameters (like a specialized engine trained in the language). For each page, the density of non-alpha characters can indicate those cases where the OCR process was too erratic. An estimation of such index could help separate initially those pages that would require the generation of a new OCR text. Then different versions of text, coming either from OCR or transcription, can be compared to identify the differences. Well-known specialized algorithms (like BLAST) that take into account characters insertions and deletions can determine such differences and allow the extraction of those problematic zones that require correction. Even when computer automation can process large quantities of tasks in less time than most humans, the human eye can distinguish images as text relatively easily (citation needed) and could be more costeffective, and that is exactly what approaches like RECAPTCHA are based on. But reviewing all the content of a book or even transcribing every single page of a book could be time consuming. We intent to reduce the size of the problem by identifying only the correct spelling of those words that need verification (like example 4 above). In order to obtain the appropriate image of each word, we intend to use the files that come as byproduct of the scanning process carried by Internet Archive. When such by-products are not available because the text is produced by transcription, another process could be used to crowdsource the determination of such coordinates. Also, the repetitive task of identifying words could be enticed by applying a purposeful gaming technique where the user could be supporting the reviewing process either consciously or not.[cite Digital Koot]

word comparison across outputs

Related documents

Products

Support

word comparison across outputs

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib