Data Discovery and Doer Happiness: Uses for Optical Character Recognition (OCR) Output

Presenter: Deborah Paul, Florida State University, Integrated Digitized Biocollections (iDigBio)
Biodiversity Information Standards (TDWG) 2014 Conference
Elmia Congress Centre, Rydberg Hall, Jönköping, Sweden
Oct 27th, 2014
Authors: Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, William Ulate, Reed Beaman

iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Minimal Data Capture
• “filed as” name
• higher geography
• barcode
• image
• all sheets in folder get the same initial data
• only the barcode differs

Biological collection data capture: a rapid approach using curatorial data

Raw OCR output, warts and all, can be used to:
• enter records faster
• use the database entry ditto feature
• find duplicates quickly
• find the labels
• find the labels with lots of handwriting
• create your own record sets to transcribe by:
  – collector
  – country or county
  – your Great Aunt Penelope
  – taxon
  – language
• create cogent sets to speed up validation and database updates
• make transcribers’ / validators’ jobs easier and more fun!

Sample raw OCR output from a single label:
Label No. ....2L31. National Herbarium of Canada FLORA OF’T TERRITORIES . Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W. .. Collector, A. E. Porsild July 23-25, 1934

Next imagine output from 1000s of labels or notebooks or text files!

Seeing the dark data…

Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013.
• It’s surprising what can be used to help filter specimens – the black art of search terms!
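
The idea of building record sets from raw OCR output with search terms can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used by any of the projects named in this talk: the barcodes, the OCR strings, the set names, and the function `build_record_sets` are all hypothetical, and the regular expressions are one simple way to tolerate common OCR confusions such as "l" read as "1".

```python
import re
from collections import defaultdict

# Hypothetical raw OCR output keyed by barcode (illustrative text only).
ocr_by_barcode = {
    "BC0001": "National Herbarium of Canada Collector, A. E. Porsild July 23-25, 1934",
    "BC0002": "Flora of Florida Leon County Collector: J. Doe",
    "BC0003": "Arctic Coast west of Mackenzie River delta Co11ector, A. E. Porsi1d",
}

# Hypothetical search terms: each record-set name maps to a regex that
# tolerates a few OCR confusions (for example "i", "l" and "1").
record_set_patterns = {
    "collector: Porsild": r"pors[il1]+d",
    "county: Leon": r"\bleon\s+co(unty|\.)",
}


def build_record_sets(ocr_texts, patterns):
    """Group barcodes into record sets whose pattern matches the OCR text."""
    sets = defaultdict(list)
    for barcode, text in ocr_texts.items():
        for name, pattern in patterns.items():
            if re.search(pattern, text, flags=re.IGNORECASE):
                sets[name].append(barcode)
    return dict(sets)


record_sets = build_record_sets(ocr_by_barcode, record_set_patterns)
for name, barcodes in record_sets.items():
    print(name, barcodes)
```

Each resulting set could then be handed to a transcriber or validator as one batch, so that similar labels are worked through together. A label may land in more than one set, which is intentional in this sketch.
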

Some work from the iDigBio CITScribe Hackathon

[Diagram: Overall Word Cloud Workflow. Components shown: Images; OCR Engines; Crowdsourcing (BVP); OCR Output; DwC Parsed Output; Index (Solr); Web Service; OCR confidence (n-gram); Cluster (carrot2); Histogram; Word Cloud (Jason Davies); Google Charts; Facet Explorer]

Google Charts: http://developers.google.com/chart/interactive/docs/gallery
N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
Jason Davies WC: http://www.jasondavies.com/wordcloud/
Apache Solr: http://lucene.apache.org/solr/
carrot2: http://project.carrot2.org/

Word Clouds with… N-gram Scoring, Faceting, Solr + Carrot2

Imagine Integration with current software
Use for initial sort or validation

Managing your crowdsourcing data behind the scenes – OCR too!

Work on Automated Parsing Algorithms
The aOCR group is finishing up a study that compares parsing algorithm strategies against a known standard, to better define what is currently possible for automated parsing of OCR output to standard Darwin Core terms.

http://tinyurl.com/LichenRecords

Inside the 1899 Harriman Expedition

Workflow Modules and Sample Digitization Workflows with OCR integrated
• The iDigBio DROID and aOCR groups produced a step-by-step series of tasks for implementing OCR in a digitization workflow.
• Project-specific workflows are available from RBGE, NYBG, SALIX2, ASU Herbarium, ScioTR, TTD-TCN, …
• Yours?

OCR use, Voice Recognition, User Interface Optimization, Image Analysis,…
• aOCR WG and Synthesys3 user-interface interest group
• exemplar ML and NLP workflows
• combining OCR with Voice recognition software in Symbiota (Macroalgal TCN)
• Automated image analysis
• combining touch-screen technology into the digitization workflow (ScioTR)($6.99)

Tack så mycket! (Thank you very much!)
MaCC TCN
Andrea Matsunaga, Researcher, iDigBio
Miao Chen, Indiana University, Data to Insight Center
Jason Best, Botanical Research Institute of Texas
Sylvia Orli, IT Head, Smithsonian Botany Department
William Ulate, Technical Director, BHL
Reed Beaman, Informatics Specialist, iDigBio
Elspeth Haston et al., Royal Botanic Garden Edinburgh (RBGE)
Stephen Gottschalk, New York Botanical Garden (NYBG)
iDigBio Augmenting Optical Character Recognition WG
SALIX2
Smithsonian

Find out more at iDigBio
facebook.com/iDigBio
twitter.com/iDigBio
www.idigbio.org
vimeo.com/idigbio
idigbio.org/rss-feed.xml
webcal://www.idigbio.org/events-calendar/export.ics