Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection
Bruce Robertson, Mount Allison University

ἀλήθεια
truth
Ἀλήθεια
• ‘Breathing’ marks on vowels at beginning of a word
• Accents possible on all vowels

Diversity of Greek Fonts in 19th C.

Other Examples

Greek OCR With Gamera
• Dalitz and Brandt provide an experimental framework
  – I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
  – Based on families of fonts
  – Comparing strategies of composite characters, splitting, etc.
  – Must also train for Latin scripts used
• Not yet working on post-processing

Good Results

Systematic Approach to Automated Greek OCR
• Remove the curator from the loop
  – especially important for journals, monographs, etc.
  – Assign classifier by computational means
• Using:
  – Federico Boschetti’s ground-truth-less Greek text evaluator
  – Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network

Process
• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1

[Chart: Boschetti score (axis 0 to 0.6) by classifier. Labels: Teubner_Slim, Teubner_Simil, Teubner_Similar, Teubner_SansSer, Teubner_Latin, Super_Swirly2, Super_Swirly, Smyth, Oxford, Oribase_Test, Oribase_Font_2, Oribase_Font_1, Oribase_Font, New_Teubner, Loeb_Wholistic, Littre, Lexicon, Kurke, gamera-greekocr-training-loeb-separatistic-20, Etymologicum, Early_Teubner, Cambridge, Bude, Bekker, Aristides_Dindorf_1, Aristides_Dindorf, Alpha_Font, 16thcent, Google/ABBYY]

Line Splitting

Gamera’s Text Line Finding (bbox_merging)

Replaced with runlength_smearing

Two-step processing

Future Work
• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
  – Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here

Common Problems
• Assessments of pre-processing strategies and tools
• Schemas for page description

Thanks
• Colleagues in Dynamic Variorum Editions:
  – Greg Crane at Perseus / Tufts
  – Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially tech. support of Sergiy Khan
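
Sketch: assigning a classifier by computational means

The steps on the Process slide (sample 10 pages per text, run every classifier, score each result from 0 to 1) lead naturally to picking, for each text, the classifier with the best score. The Python below is a minimal illustration of that idea, not the project's code. The function names `sample_pages` and `best_classifier`, the use of the mean as the per-text summary, and the example scores are all assumptions made for illustration; the Boschetti score is treated as a number supplied by an external evaluator.

```python
import random
from statistics import mean


def sample_pages(page_ids, k=10, seed=0):
    """Draw a random sample of up to k pages from one text.

    A fixed seed keeps the sample reproducible (an assumption, not
    something the slides specify).
    """
    pages = list(page_ids)
    rng = random.Random(seed)
    return sorted(rng.sample(pages, min(k, len(pages))))


def best_classifier(page_scores):
    """Pick the classifier with the highest mean score for one text.

    page_scores maps a classifier name to the list of per-page scores
    (each between 0 and 1) obtained on the sampled pages.
    Returns a (classifier_name, mean_score) pair.
    """
    means = {name: mean(scores)
             for name, scores in page_scores.items() if scores}
    if not means:
        raise ValueError("no scores supplied")
    winner = max(means, key=means.get)
    return winner, means[winner]


if __name__ == "__main__":
    # Invented scores for one text, for illustration only.
    scores = {
        "Oxford": [0.50, 0.50, 0.50],
        "Bekker": [0.25, 0.25, 0.50],
    }
    print(sample_pages(range(1, 301)))  # ten page numbers from a 300-page text
    print(best_classifier(scores))
```

Averaging over the sampled pages is only one possible summary; a median or a minimum would penalise classifiers that fail badly on a few pages.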