Calbug: A case study of digitization challenges for entomology

Calbug: a case study of digitization challenges for entomology collections
Joan Ball, Joyce Gross, Traci Gryzmala, Gordon Nishida, Peter Oboyski, Rosemary Gillespie, George Roderick, Kipling Will
Photo by: Marek Jakubowski
Background
Workflow & Challenges
Progress
Future Direction
What is CalBug?
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
LA County Museum
Santa Barbara Museum of Natural History
Goals
1.) Digitize and georeference 1.2 million specimens from eight California institutions, spanning 110 years of specimen collecting
2.) Analyze spatial and temporal changes in distributions due to land use change, invasive species, habitat fragmentation, and climate change
Stratified data capture: All specimens of selected taxa
Stratified data capture: All specimens of species found in field stations
UC Natural Reserve System
Digital Data:
• Images and Field Notes
• Species Checklists
• Historical Climate Records
• Climate Sensor Networks
Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses
Imaging
Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
Challenges:
- Labels are small and stacked beneath specimen
- Specimen handling is inefficient, process extremely time consuming
Current Imaging Rate: 60 specimens per hour per person
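Step 4 asks the imager to name and save each label image by hand. A minimal sketch of how that naming could be derived from the catalog number instead; the `EMEC123456_labels.jpg` pattern, the function name, and the `view` suffix are hypothetical, not the project's actual scheme:

```python
import re


def label_image_name(catalog_number: str, view: str = "labels", ext: str = ".jpg") -> str:
    """Build an image file name from a catalog number (hypothetical naming scheme)."""
    # Drop spaces and punctuation so "emec 123456" and "EMEC-123456" name the same file.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", catalog_number).upper()
    if not cleaned:
        raise ValueError("catalog number is empty")
    return f"{cleaned}_{view}{ext}"


print(label_image_name("emec 123456"))
```

Deriving the name from the catalog number label added in step 3 keeps image and database record linked without a separate lookup table.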
Data Entry
Workflow
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
Crowd Sourcing:
- Interactive website
- Volunteers enter data 3X
- Evaluate multiple entries for consistency
- Museum staff: focus on imaging, QAQC, public relations
Develop dictionaries of common abbreviations and California localities; pick lists and controlled fields to reduce error…
OCR:
- “Smart” parsing program: assign data elements to database fields based on context and dictionary terms
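The slide says each label is entered three times by volunteers and the entries are then evaluated for consistency, without saying how. One simple way to do that is a majority vote on a lightly normalized value. This is an illustrative sketch only; the function names and the two-of-three threshold are assumptions:

```python
from collections import Counter
from typing import List, Optional


def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial differences do not count as disagreement."""
    return " ".join(value.split()).lower()


def consensus(entries: List[str], min_agree: int = 2) -> Optional[str]:
    """Return the entry that at least `min_agree` volunteers agree on, else None."""
    if not entries:
        return None
    counts = Counter(normalize(e) for e in entries)
    winner, votes = counts.most_common(1)[0]
    if votes < min_agree:
        return None  # no agreement: flag the field for staff review
    # Return the first original spelling that matches the winning normalized form.
    return next(e for e in entries if normalize(e) == winner)


print(consensus(["Berkeley, CA", "berkeley,  CA", "Berkley, CA"]))
```

Fields that return `None` would go to museum staff for QAQC rather than being accepted automatically.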
Data quality, access & analysis
Workflow
6. Error Checking
7. Georeference locality
8. Upload data to cache
Error Checking:
- Sort by locality and date to identify typographic errors, and by record number to find carry-over errors.
- Compare 10% of records with label images.
Georeferencing & Mapping:
- Biogeomancer
- Estimate coordinates and error radius based on standardized protocols
Data Cache
Example: Analyzing data
- Dragonfly specimens throughout CA over 100 years
- Combine with: observation data, 1914 survey, current field studies
- Changes in biodiversity, species composition, and distribution
- Metrics of climate and land use change
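The error-checking steps on this slide (sorting by locality and date, sorting by record number, and comparing 10% of records with their label images) can be sketched as follows. The record fields `record_number`, `locality`, and `date` are hypothetical names, and dates are assumed to be ISO strings so that they sort chronologically:

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def sort_by_locality_and_date(records: List[Record]) -> List[Record]:
    """Group near-identical localities together so typographic variants stand out."""
    return sorted(records, key=lambda r: (r["locality"], r["date"]))


def sort_by_record_number(records: List[Record]) -> List[Record]:
    """Restore entry order so values carried over from the previous record are visible."""
    return sorted(records, key=lambda r: int(r["record_number"]))


def sample_for_image_check(records: List[Record], fraction: float = 0.10, seed: int = 0) -> List[Record]:
    """Pick a reproducible random sample of records to compare against label images."""
    if not records:
        return []
    k = max(1, round(len(records) * fraction))
    return random.Random(seed).sample(records, k)


records = [
    {"record_number": str(i), "locality": loc, "date": "1950-06-01"}
    for i, loc in enumerate(["Berkeley", "Berkley", "Davis", "Berkeley"] * 5, start=1)
]
print(len(sample_for_image_check(records)))
```

A fixed seed makes the sample repeatable, so a second reviewer can check the same records.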
9. Temporospatial analyses
Source: Cal-Adapt and the Public Interest Energy Research program, California Energy Commission
Publicly available data layers
Temperature (max, min, mean)
Species Distributions
Past, Present, Projected Future
Past, Present, Projected Future
Precipitation
Land Use
Past, Present, Projected Future
Private, Public, and Protected
Land Cover
Soils
Topography
Hydrology
Ongoing Research Projects
• In support of taxonomy and undergraduate research
~23,000 georeferenced specimens in the EMEC database from the California Floristic Province.
[Figure: number of specimens by year; annotations: WW2, J. Powell, reconfiguration]
Progress Made – Essig Museum
Data Entered:
EMEC total 122,000
- 42,000 since 1 Sept 2010
- 55,289 CA specimens
- 65,000 georeferenced
Images Taken:
44,200 images
Progress Made – Collaborators
CDFA: 14,000 Sphecidae, pests
Bohart: 25,000 Sphecidae
SBMNH: 140,000 Coleoptera (museum records and literature)
Riverside: 26,500 bees
CAS: 15,000 Neuroptera
Photo from CA Beetle Project site
Photo by: Texas A&M University
Photo by: Robin Coville
Timeline
Start of Calbug
Year 1: 240,000 Specimens Digitized
Year 2: Image and Digitize 320,000 Specimens; QAQC
Year 3: Image and Digitize 320,000 Specimens; QAQC
Year 4: Image and Digitize 320,000 Specimens; QAQC
Year 5: Georeferencing
Finish
Analysis of data: Arthropod response to global change
Imaging Goal – Next 3 years:
320,000 images per year
6,500 images per week (48 weeks)
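The timeline figures can be checked against each other and against the current imaging rate of 60 specimens per hour per person. A short worked calculation, assuming one image per specimen (the slides do not state this):

```python
YEAR_1 = 240_000
PER_YEAR = 320_000          # years 2-4
WEEKS_PER_YEAR = 48
STATED_WEEKLY_GOAL = 6_500
RATE_PER_PERSON_HOUR = 60   # current imaging rate from the Imaging slide

# Four years of digitizing add up to the 1.2 million specimen goal.
total = YEAR_1 + 3 * PER_YEAR

# 320,000 per year over 48 weeks is about 6,667 per week,
# slightly above the 6,500 per week stated on the slide.
exact_weekly = PER_YEAR / WEEKS_PER_YEAR

# Person-hours of imaging needed each week at the current rate.
person_hours_per_week = STATED_WEEKLY_GOAL / RATE_PER_PERSON_HOUR

print(total, round(exact_weekly), round(person_hours_per_week, 1))
```

At the current rate, the weekly goal amounts to roughly 108 person-hours of imaging, which is the motivation for the faster assembly-line process proposed under Future Directions.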
Future Directions – Simplify and disperse the workflow
1. Select taxa for databasing
3. Arrange labels to view all text, add catalog # label
4 (Modified). Run sheets of specimens through imaging station
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses
Remove sorting step (2), and museum staff data entry (5a)
Speed up image capture through assembly line process (4)
- Set up stations for specific handling tasks
- Automate file naming and saving
Develop dictionaries of localities and common abbreviations to reduce error and speed data entry
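A rough sketch of how dictionaries of abbreviations and localities might be applied to transcribed label text. The dictionary contents, the locality pick list, and both function names are invented for illustration; the fuzzy matching uses `difflib.get_close_matches` from the Python standard library and is one possible approach, not the project's parser:

```python
import difflib
import re
from typing import List, Optional

# Hypothetical dictionary entries; a real list would be built from the labels themselves.
ABBREVIATIONS = {"co.": "County", "mi.": "miles", "mt.": "Mount", "cyn.": "Canyon"}
LOCALITIES = ["Berkeley", "Davis", "Riverside"]


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations (a word followed by a period) with their full form."""
    def replace(match: "re.Match[str]") -> str:
        token = match.group(0)
        return ABBREVIATIONS.get(token.lower(), token)
    return re.sub(r"[A-Za-z]+\.", replace, text)


def suggest_locality(name: str, pick_list: List[str] = LOCALITIES) -> Optional[str]:
    """Return the closest pick-list locality, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, pick_list, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(expand_abbreviations("Alameda Co., 3 mi. E Berkeley"))
print(suggest_locality("Berkley"))
```

Offering the suggestion as a pick-list choice, rather than silently correcting the entry, leaves the final decision with the person checking the record.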
Looking ahead
• Data from many millions of additional specimens will remain to be captured
• “Brute force” entry needs to be coupled with any technological advances that we can harness
• Intermediate products are necessary
Acknowledgements
All participating organizations
National Science Foundation
John Wieczorek, Michelle Koo, Carol Spencer
Berkeley Natural History Museums Consortium
Biodiversity Sciences Technology (BSCIT)
Citizen Science Alliance
>20 Undergraduates