The use of OCR in the digitisation of herbarium specimens

Robyn E Drinkwater, Robert Cubey & Elspeth Haston
What is happening in digitisation?
• … and these minimal data records are going to
need data added to them.
What are the options when using
optical character recognition (OCR)?
• Parse OCR text directly into the database fields
• Use OCR data to prepare the specimens for
manual / semi automated data entry
• A digitisation project has been running at the
Royal Botanic Garden Edinburgh (RBGE) to digitise all
the specimens from SW Asia and the Middle East.
• Minimal data had been captured originally*
– Filing name
– Geographical filing region
– Barcode
• We have been routinely processing all our
specimen images through ABBYY OCR software.
* E Haston, R Cubey, DJ Harris (2011). Data concepts and their relevance for data capture in large scale digitisation of
biological collections. International Journal of Humanities and Arts Computing 6 (1-2), 111-119.
Exploring the data…
Step One
• We used the OCR output text to pull out over
7,000 specimen images and associated data
records
• These were then prepared into batches:
– some random
– some sorted by collector and / or country
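The batching in Step One can be sketched as follows. This is a minimal illustration only, not the workflow used at RBGE: the record layout, the barcodes, the collector names and the plain substring search are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical records: a barcode plus the raw OCR text of the specimen label.
RECORDS = [
    {"barcode": "X0001", "ocr_text": "Flora of Iran. Leg. J. Smith 102. Rocky slope."},
    {"barcode": "X0002", "ocr_text": "TURKEY: Prov. Ankara. Coll. A. Jones"},
    {"barcode": "X0003", "ocr_text": "Iran, Elburz Mts. J. SMITH s.n."},
    {"barcode": "X0004", "ocr_text": "Herbarium label illegible"},
]


def find_term(ocr_text, terms):
    """Return the first search term found in the OCR text (case-insensitive), else None."""
    lowered = ocr_text.lower()
    for term in terms:
        if term.lower() in lowered:
            return term
    return None


def sort_into_batches(records, collectors, countries):
    """Group barcodes by the (collector, country) pair detected in the OCR text."""
    batches = defaultdict(list)
    for rec in records:
        key = (find_term(rec["ocr_text"], collectors),
               find_term(rec["ocr_text"], countries))
        batches[key].append(rec["barcode"])
    return dict(batches)


def random_batches(records, batch_size, seed=0):
    """Shuffle barcodes and cut them into fixed-size batches (the unsorted control)."""
    barcodes = [rec["barcode"] for rec in records]
    random.Random(seed).shuffle(barcodes)
    return [barcodes[i:i + batch_size] for i in range(0, len(barcodes), batch_size)]


sorted_batches = sort_into_batches(RECORDS, ["Smith", "Jones"], ["Iran", "Turkey"])
control_batches = random_batches(RECORDS, batch_size=2)
```

Records where no term is found end up under the key `(None, None)`, so they can still be handed out as an unsorted batch rather than being lost.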
Step Two
• A team of six digitisers at RBGE completed a
series of trials
• They used two different protocols for data entry
– complete records
– partial records (including collector and geographical
information but not habitat and description)
• In total 7,200 specimens were processed
Results…
• Specimens sorted using data from the OCR output
were quicker to digitise than unsorted, random
specimens
• Of the methods tested here, the most efficient
combined a partial data entry protocol with
specimens that had been filtered by Collector
and Country
The human factor…
Thinking about the ease of entering the data for each test, rate
them on their relative ease of use
[Chart: ratings from 1 (easiest) to 5 (hardest) on a 0% to 100% scale for each test: Random 1, Collector, Country, Collector & Country, Collector & Country, Random 2]
The human factor…
• Digitisation staff preferred working with
sorted specimens
• They also preferred working with
physical specimens rather than images
Some more thoughts…
• Using OCR output to sort specimens is easier to
apply than parsing data from the OCR output
• It can be used in conjunction with other tools
later in the digitisation process since these
other processes will almost certainly be more
efficient with sorted batches of specimens
• Other tasks can also be built on top of this, e.g.
condition assessment, quality control (QC), etc.
• It’s surprising what can
be used to help filter
specimens – the black
art of search terms!
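One reason search terms can be a black art is that OCR output often misreads characters, so an exact search misses specimens. A minimal sketch of a tolerant search, using Python's standard `difflib` module, is given below. The function name, the sample label text and the 0.8 similarity cutoff are assumptions for illustration, not part of the RBGE workflow.

```python
import difflib
import re


def fuzzy_contains(ocr_text, term, cutoff=0.8):
    """Return True if any word in the OCR text is a close match for the search term.

    Similarity is difflib's ratio (0 to 1); the cutoff is an arbitrary
    starting point and would need tuning against real OCR output.
    """
    words = re.findall(r"\w+", ocr_text.lower())
    matches = difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
    return bool(matches)


# "Turkev" stands in for an OCR misreading of "Turkey" (hypothetical label text).
found = fuzzy_contains("Flora of Turkev, Prov. Ankara", "Turkey")
not_found = fuzzy_contains("Herbarium sheet, no locality", "Turkey")
```

A lower cutoff catches more misreadings but also pulls in more false matches, so any batch built this way still needs a quick visual check.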
Acknowledgments
• The digitisation team at RBGE: Nicky Sharp, David
Braidwood, Muhammad Ghazali, Lorna Glancy,
Dorota Jaworska, Esther Nieto.
• The Andrew W Mellon Foundation
• Dr Antje Ahrends (RBGE) & Dr Chris Glaseby
(BIOSS) for statistical advice