Uses for Optical Character Recognition (OCR) Output.

Data Discovery and Doer Happiness:
Uses for Optical Character Recognition (OCR) Output.
Presenter: Deborah Paul
Florida State University
Integrated Digitized Biocollections (iDigBio)
at Biodiversity Information Standards (TDWG) 2014 Conference
Elmia Congress Centre, Rydberg Hall, Jönköping, Sweden Oct 27th, 2014
Authors: Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, William Ulate, Reed Beaman
iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of
Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation.
Minimal Data Capture
• “filed as” name
• higher geography
• barcode
• image
• all sheets in folder get the same initial data
• only the barcode differs
Biological collection data capture: a rapid approach using curatorial data
Raw OCR output, warts and all, can be used to:
• enter records faster
• use the database entry ditto feature
• find duplicates quickly
• find the labels
• find the labels with lots of handwriting
• create your own record sets to transcribe by:
– collector
– country or county
– your Great Aunt Penelope
– taxon
– language
• create cogent sets to speed up validation and database updates
• make transcribers’ / validators’ jobs easier and more fun!
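As a rough illustration of two of the uses listed above (building record sets from search terms and finding likely duplicates), here is a minimal Python sketch. It is not taken from any iDigBio tool. The barcodes, the function names, the third label text, and the 0.9 similarity threshold are all assumptions made for the example. The first text comes from the Porsild label shown in this deck.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse everything except letters and digits to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_record_sets(ocr_by_barcode, search_terms):
    """Group barcodes by the search terms found in their raw OCR text."""
    sets = defaultdict(list)
    for barcode, text in ocr_by_barcode.items():
        clean = normalize(text)
        for term in search_terms:
            if normalize(term) in clean:
                sets[term].append(barcode)
    return dict(sets)


def likely_duplicates(ocr_by_barcode, threshold=0.9):
    """Return barcode pairs whose normalized OCR text is nearly identical."""
    items = sorted((b, normalize(t)) for b, t in ocr_by_barcode.items())
    pairs = []
    for i, (b1, t1) in enumerate(items):
        for b2, t2 in items[i + 1:]:
            if SequenceMatcher(None, t1, t2).ratio() >= threshold:
                pairs.append((b1, b2))
    return pairs


# Hypothetical barcodes. BC0002 simulates an OCR error ("de1ta");
# BC0003 is an invented label used only to show a second record set.
ocr = {
    "BC0001": "Collector, A. E. Porsild July 23-25, 1934 Arctic Coast west of Mackenzie River delta",
    "BC0002": "Collector, A. E. Porsild July 23-25, 1934 Arctic Coast west of Mackenzie River de1ta",
    "BC0003": "Flora of Florida. Leon County. Pine flatwoods.",
}
record_sets = build_record_sets(ocr, ["Porsild", "Leon County"])
duplicates = likely_duplicates(ocr)
print(record_sets)
print(duplicates)
```

The pairwise comparison is quadratic, so for 1000s of labels one would probably compare only within a record set (same collector, same county) rather than across the whole collection.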
Label
No. ....2L31.
National Herbarium of Canada
FLORA OF’T TERRITORIES
.
Hab. and Loc., Arctic Coast west of Mackenzie River
delta:
Between King Pt. and Kay Pt., 69° 12’ N., and 138° to
138° 30’ W.
..
Collector, A. E. Porsild July 23-25, 1934
Next imagine output from 1000s of labels or notebooks or text files!
Seeing the dark data…
Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013.
• It’s surprising what can be used to help filter specimens – the black art of search terms!
Some work from the iDigBio CITScribe Hackathon
Overall Word Cloud Workflow
[Workflow diagram. Boxes shown: Images; OCR Engines; OCR Output; Crowdsourcing (BVP); DwC Parsed Output; Web Service; Index (Solr); OCR confidence (n-gram); Cluster (carrot2); Histogram; Word Cloud. Tools labeled: Jason Davies, Google Charts, Facet Explorer.]
Google Charts: http://developers.google.com/chart/interactive/docs/gallery
N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
Jason Davies WC: http://www.jasondavies.com/wordcloud/
Apache Solr: http://lucene.apache.org/solr/
carrot2: http://project.carrot2.org/
Word Clouds with…
N-gram Scoring, Faceting, Solr + Carrot2
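To make the idea of n-gram scoring concrete, here is a minimal Python sketch of one simple way to estimate how "clean" a piece of OCR output is: the fraction of its character trigrams that also occur in trusted reference text. This is a stand-in written for illustration, not the method implemented in the OCR-Error-Estimation repository. The function names and the choice of trigrams are assumptions. The reference lines reuse text from the Porsild label in this deck.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_model(reference_texts, n=3):
    """Count the character n-grams seen in trusted (e.g. human-transcribed) text."""
    counts = Counter()
    for reference in reference_texts:
        counts.update(char_ngrams(reference, n))
    return counts


def ngram_score(text, model, n=3):
    """Fraction of the text's n-grams that also occur in the reference model (0.0 to 1.0)."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in model) / len(grams)


# Reference text: lines from the label example, standing in for a trusted corpus.
model = build_model([
    "National Herbarium of Canada",
    "Collector, A. E. Porsild July 23-25, 1934",
    "Arctic Coast west of Mackenzie River delta",
])
clean_score = ngram_score("National Herbarium of Canada", model)
noisy_score = ngram_score("No. ....2L31.", model)
print(clean_score, noisy_score)
```

A low score can flag labels that are probably handwritten or badly recognized, which is one way to route them to human transcribers first. A real reference model would need far more text than three lines.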
Imagine Integration with current software
Use for initial sort or validation
Managing your crowdsourcing data behind the scenes – OCR too!
Work on Automated Parsing Algorithms
The aOCR group is finishing up a study comparing parsing algorithm strategies against a known standard, to better define what is currently possible for automated parsing of OCR output to standard Darwin Core terms.
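For readers unfamiliar with what "parsing OCR output to Darwin Core terms" involves, here is a minimal rule-based sketch in Python applied to the Porsild label text shown earlier in this deck. It is only an illustration and is not one of the strategies compared in the aOCR study. The regular expressions and the function name are assumptions and would fail on many real labels. The dictionary keys (recordedBy, verbatimEventDate, year, verbatimLatitude) are Darwin Core term names.

```python
import re

# Assumed label pattern: "Collector, <initials> <Surname> <Month> <day[-day]>, <year>"
COLLECTOR_DATE = re.compile(
    r"Collector,?\s+(?P<name>(?:[A-Z]\.\s*)+[A-Z][a-z]+)\s+"
    r"(?P<date>[A-Z][a-z]+\.?\s+\d{1,2}(?:-\d{1,2})?,\s*(?P<year>\d{4}))"
)
# Assumed latitude pattern: degrees, minutes, hemisphere (e.g. 69° 12’ N)
LATITUDE = re.compile(r"\d{1,2}°\s*\d{1,2}[’']\s*[NS]")


def parse_label(ocr_text):
    """Map whatever the patterns find in raw OCR text to Darwin Core term names."""
    record = {}
    match = COLLECTOR_DATE.search(ocr_text)
    if match:
        record["recordedBy"] = match.group("name")
        record["verbatimEventDate"] = match.group("date")
        record["year"] = int(match.group("year"))
    latitude = LATITUDE.search(ocr_text)
    if latitude:
        record["verbatimLatitude"] = latitude.group(0)
    return record


label = """Hab. and Loc., Arctic Coast west of Mackenzie River
delta:
Between King Pt. and Kay Pt., 69° 12’ N., and 138° to
138° 30’ W.
..
Collector, A. E. Porsild July 23-25, 1934"""

record = parse_label(label)
print(record)
```

Fields that no pattern matches are simply left out of the record, so a validator can see what still needs human transcription.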
http://tinyurl.com/LichenRecords
Inside the 1899 Harriman Expedition
Workflow Modules and Sample Digitization Workflows with OCR integrated
• The iDigBio DROID and aOCR groups produced a step-by-step series of tasks for implementing OCR in a digitization workflow.
• Project-specific workflows are available from RBGE, NYBG, SALIX2, ASU Herbarium, ScioTR, TTD-TCN, …
• Yours?
OCR use, Voice Recognition, User Interface Optimization, Image Analysis,…
• aOCR WG and Synthesys3
• user-interface interest group
• exemplar ML and NLP workflows
• combining OCR with Voice recognition software in Symbiota (Macroalgal TCN)
• Automated image analysis
• combining touch-screen technology into the digitization workflow (ScioTR)($6.99)
Tack så mycket! (Thank you very much!)
MaCC TCN
Andrea Matsunaga, Researcher, iDigBio
Miao Chen, Indiana University, Data to Insight Center
Jason Best, Botanical Research Institute of Texas
Sylvia Orli, IT Head, Smithsonian Botany Department
William Ulate, Technical Director, BHL
Reed Beaman, Informatics Specialist, iDigBio
Elspeth Haston et al., Royal Botanic Garden Edinburgh (RBGE)
Stephen Gottschalk, New York Botanical Garden (NYBG)
iDigBio Augmenting Optical Character Recognition WG
SALIX2
Smithsonian
Find out more at iDigBio
facebook.com/iDigBio
twitter.com/iDigBio
www.idigbio.org
vimeo.com/idigbio
idigbio.org/rss-feed.xml
webcal://www.idigbio.org/events-calendar/export.ics