OCR With Gamera and the Google/PERSEUS Greek and Latin

Ancient Greek OCR with
Gamera and the Google/Perseus
Greek and Latin Collection
Bruce Robertson, Mount Allison
University
ἀλήθεια
truth
Ἀλήθεια
• ‘Breathing’ marks on vowels at
beginning of a word
• Accents possible on all vowels
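To make the diacritic problem concrete, here is a minimal sketch (not part of the Gamera workflow described in these slides) that uses Python's standard `unicodedata` module to break a polytonic Greek word into base letters and the combining marks attached to them. The function names `split_diacritics` and `strip_diacritics` are illustrative assumptions.

```python
import unicodedata

def split_diacritics(word):
    """Return a list of (base_letter, [combining mark names]) pairs."""
    result = []
    # NFD decomposes precomposed letters into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if result:
                result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result

def strip_diacritics(word):
    """Drop every combining mark, keeping only the base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The word from the slide: alpha with smooth breathing, eta with acute accent.
aletheia = "\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"
for base, marks in split_diacritics(aletheia):
    print(base, marks)
```

Each vowel can carry a breathing, an accent, or both, so the set of distinct glyph shapes a classifier has to learn is much larger than the 24-letter alphabet suggests.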
Diversity of Greek Fonts in 19th
C.
Other Examples
Greek OCR With Gamera
• Dalitz and Brandt provide an experimental
framework
– I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple
classifiers
– Based on families of fonts
– Comparing strategies of composite characters,
splitting, etc.
– Must also train for Latin scripts used
• Not yet working on post-processing
Good Results
Systematic Approach to
Automated Greek OCR
• Remove the curator from the loop –
especially important for journals,
monographs, etc.
– Assign classifier by computational means
• Using:
– Federico Boschetti’s ground-truth-less
Greek text evaluator
– Atlantic Computational Excellence
Network, Atlantic Canada’s parallel
computing network
Process
• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages
were taken
• Each was processed with each of the 20
classifiers made this summer
• The results were evaluated and given a
‘Boschetti score’ from 0 to 1
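The selection process above can be sketched roughly as follows. This is a hedged illustration, not the code used in the project: `run_ocr` and `score` are hypothetical callables standing in for the Gamera OCR run and the ground-truth-less evaluator, and `best_classifier` is an assumed name.

```python
import random
from statistics import mean

def sample_pages(pages, k=10, seed=0):
    """Pick up to k pages at random (seeded so a run can be repeated)."""
    rng = random.Random(seed)
    return rng.sample(pages, min(k, len(pages)))

def best_classifier(pages, classifiers, run_ocr, score, k=10, seed=0):
    """Score every classifier on the same page sample and return the best.

    run_ocr(page, classifier_name) -> OCR output   (hypothetical stand-in)
    score(ocr_output) -> float between 0 and 1     (hypothetical stand-in)
    """
    sample = sample_pages(pages, k, seed)
    results = {}
    for name in classifiers:
        # Average the per-page scores for this classifier.
        results[name] = mean(score(run_ocr(page, name)) for page in sample)
    best = max(results, key=results.get)
    return best, results
```

Because every (text, classifier) pair is independent of the others, the work divides naturally across the nodes of a parallel computing network.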
[Chart: score per classifier, axis from 0 to 0.6. Classifier labels: Teubner_Slim, Teubner_Simil, Teubner_Similar, Teubner_SansSer, Teubner_Latin, Super_Swirly2, Super_Swirly, Smyth, Oxford, Oribase_Test, Oribase_Font_2, Oribase_Font_1, Oribase_Font, New_Teubner, Loeb_Wholistic, Littre, Lexicon, Kurke, gamera-greekocr-training-loeb-separatistic-20, Etymologicum, Early_Teubner, Cambridge, Bude, Bekker, Aristides_Dindorf_1, Aristides_Dindorf, Alpha_Font, 16thcent]
Google/ABBYY Line Splitting
Gamera’s Text Line
Finding (bbox_merging)
Replaced with
runlength_smearing
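For readers unfamiliar with run-length smearing, the general idea is to fill short white gaps between black pixels so that the characters on a line fuse into one connected blob that can be taken as a text line. The sketch below is a standalone illustration of horizontal smearing on a binary image stored as lists of 0/1 values; it does not call Gamera, and the names `smear_row`, `smear_rows` and `max_gap` are assumptions made for the example.

```python
def smear_row(row, max_gap):
    """Fill white runs of at most max_gap pixels that lie between black pixels."""
    out = list(row)
    last_black = None
    for i, px in enumerate(row):
        if px:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                # Short gap between two black pixels: paint it black.
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def smear_rows(image, max_gap):
    """Apply horizontal smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]

print(smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1], max_gap=2))
```

A full implementation would typically also smear vertically and combine the two results; the threshold decides how far apart glyphs may sit before they count as separate segments.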
Two-step processing
Future Work
• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages
of output?
• Align with Google’s output, and provide Google
with corrected Greek
• Implement line-splitting from other OCR
engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks
described here
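The idea of assigning a classifier from the Latin text of the first pages could be sketched like this. Only the ‘Oxford’, ‘Clarendon’ and ‘Oxonii’ keywords come from the slide; the `IMPRINT_HINTS` table, the function name `guess_classifier`, and the mapping of those keywords to the `Oxford` classifier are illustrative assumptions.

```python
# Keywords that might appear in a title page or imprint, keyed by the
# classifier they suggest. The mapping itself is an assumption.
IMPRINT_HINTS = {
    "Oxford": ("oxford", "clarendon", "oxonii"),
}

def guess_classifier(first_pages_text, hints=IMPRINT_HINTS):
    """Return the first classifier whose keywords occur in the text, else None."""
    text = first_pages_text.casefold()
    for classifier, keywords in hints.items():
        if any(keyword in text for keyword in keywords):
            return classifier
    return None

print(guess_classifier("OXONII e typographeo Clarendoniano"))
```

When no keyword matches, the function returns `None`, in which case a caller could fall back to scoring every classifier on a page sample.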
Common Problems
• Assessments of pre-processing strategies
and tools
• Schemas for page description
Thanks
• Colleagues in Dynamic Variorum
Editions:
– Greg Crane at Perseus / Tufts
– Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially tech. support of
Sergiy Khan