CalBug: a case study of digitization challenges for entomology collections
Joan Ball, Joyce Gross, Traci Grzymala, Gordon Nishida, Peter Oboyski, Rosemary Gillespie, George Roderick, Kipling Will
Photo by: Marek Jakubowski

Background
Workflow & Challenges
Progress
Future Direction

What is CalBug?
• Essig Museum of Entomology
• California Academy of Sciences
• California State Collection of Arthropods
• Bohart Museum, UC Davis
• Entomology Research Museum, UC Riverside
• San Diego Natural History Museum
• LA County Museum
• Santa Barbara Museum of Natural History

Goals
1. Digitize and georeference 1.2 million specimens from eight California institutions, spanning 110 years of specimen collecting
2. Analyze spatial and temporal changes in distributions due to land use change, invasive species, habitat fragmentation, and climate change

Stratified data capture: All specimens of selected taxa

Stratified data capture: All specimens of species found in field stations
UC Natural Reserve System
Digital Data:
• Images and Field Notes
• Species Checklists
• Historical Climate Records
• Climate Sensor Networks

Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses

Imaging Workflow
1. Select taxa for databasing
2. Sort specimens by location & date
3. Arrange labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
Challenges:
- Labels are small and stacked beneath the specimen
- Specimen handling is inefficient and the process is extremely time consuming
Current Imaging Rate: 60 specimens per hour per person

Data Entry Workflow
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
Crowd Sourcing:
- Interactive website
- Volunteers enter data 3X
- Evaluate multiple entries for consistency
- Museum staff: focus on imaging, QA/QC, public relations
5c. Optical Character Recognition & data parsing
- Develop dictionaries of common abbreviations and California localities: pick lists and controlled fields to reduce error…
- OCR "Smart" parsing program: assign data elements to database fields based on context and dictionary terms

Workflow: Data quality, access & analysis
6. Error Checking
- Sort by locality and date to identify typographic errors, and by record number to find carry-over errors
- Compare 10% of records with label images
7. Georeference locality
Georeferencing & Mapping: BioGeomancer. Estimate coordinates and error radius based on standardized protocols
8. Upload data to cache
Data Cache
9. Temporospatial analyses
Example: Analyzing data
Dragonfly specimens throughout CA over 100 years
- Combine with: observation data, 1914 survey, current field studies
- Changes in biodiversity, species composition, and distribution
- Metrics of climate and land use change

Publicly available data layers
Source: Cal-Adapt and the Public Interest Energy Research program, California Energy Commission
• Temperature (max, min, mean): Past, Present, Projected Future
• Precipitation: Past, Present, Projected Future
• Species Distributions: Past, Present, Projected Future
• Land Use: Private, Public, and Protected
• Land Cover
• Soils
• Topography
• Hydrology

Ongoing Research Projects
• In support of taxonomy and undergraduate research
• ~23,000 georeferenced specimens in the EMEC database from the California Floristic Province
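The triple-entry crowd-sourcing step (5b) in the data entry workflow above asks volunteers to enter each label three times and then evaluates the entries for consistency. The slides do not say how that evaluation is done, so what follows is only a minimal sketch under stated assumptions: a simple per-field majority vote, with the field names, the `reconcile` function, and the sample transcriptions all hypothetical.

```python
from collections import Counter


def normalize(value):
    """Collapse whitespace and case so trivial differences do not count as disagreement."""
    return " ".join(value.split()).lower()


def reconcile(entries, min_agree=2):
    """Compare independent transcriptions of one label image, field by field.

    entries: list of dicts mapping field name -> transcribed text.
    Returns (agreed, flagged): agreed values, and fields that need staff review.
    Majority voting is an assumption here, not the project's documented method.
    """
    agreed, flagged = {}, []
    for field in sorted({name for entry in entries for name in entry}):
        first_spelling = {}   # normalized text -> first raw spelling seen
        votes = Counter()     # normalized text -> number of volunteers who typed it
        for entry in entries:
            raw = entry.get(field, "")
            key = normalize(raw)
            first_spelling.setdefault(key, raw)
            votes[key] += 1
        winner, count = votes.most_common(1)[0]
        if winner and count >= min_agree:
            agreed[field] = first_spelling[winner]
        else:
            flagged.append(field)
    return agreed, flagged


# Hypothetical transcriptions of the same label by three volunteers.
entries = [
    {"locality": "Berkeley, Alameda Co.", "date": "2 June 1914", "collector": "A. Smith"},
    {"locality": "berkeley,  Alameda Co.", "date": "2 June 1914", "collector": "A. Smyth"},
    {"locality": "Berkley, Alameda Co.", "date": "2 June 1941", "collector": "H. Smith"},
]
agreed, flagged = reconcile(entries)
print(agreed)
print(flagged)
```

Fields where at least two of the three entries match are accepted, and the rest are flagged, which would let museum staff spend QA/QC time only on labels where the volunteers disagree.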
[Figure: georeferenced EMEC specimens from the California Floristic Province; axes: years vs. #specimens; annotations: J. Powell, reconfiguration, WW2]

Progress Made: Essig Museum
Data Entered: EMEC total 122,000
- 42,000 since 1 Sept 2010
- 55,289 CA specimens
- 65,000 georeferenced
Images Taken: 44,200 images

Progress Made: Collaborators
• CDFA: 14,000 Sphecidae, pests
• Bohart: 25,000 Sphecidae
• SBMNH: 140,000 Coleoptera (museum records and literature)
• Riverside: 26,500 bees
• CAS: 15,000 Neuroptera
Photo from CA Beetle Project site
Photo by: Texas A&M University
Photo by: Robin Coville

Timeline
Start of CalBug
Year 1: 240,000 Specimens Digitized
Year 2: Image and Digitize 320,000 Specimens; QA/QC
Year 3: Image and Digitize 320,000 Specimens; QA/QC
Year 4: Image and Digitize 320,000 Specimens; QA/QC
Year 5: Georeferencing
Finish
Analysis of data: Arthropod response to global change
Imaging Goal, Next 3 years: 320,000 images per year, 6,500 images per week (48 weeks)

Future Directions: Simplify and disperse the workflow
1. Select taxa for databasing
3. Arrange labels to view all text, add catalog # label
4 (Modified). Run sheets of specimens through imaging station
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses
Changes:
- Remove the sorting step (2) and museum staff data entry (5a)
- Speed up image capture through an assembly line process (4): set up stations for specific handling tasks, and automate file naming and saving
- Develop dictionaries of localities and common abbreviations to reduce error and speed data entry

Looking ahead
• Data from many millions of additional specimens will remain to be captured
• "Brute force" entry needs to be coupled with any technological advances that we can harness
• Intermediate products are necessary

Acknowledgements
All participating organizations
National Science Foundation
John Wieczorek, Michelle Koo, Carol Spencer
Berkeley Natural History Museums Consortium
Biodiversity Sciences Technology (BSCIT)
Citizen Science Alliance
>20 Undergraduates
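As a back-of-the-envelope check on the imaging goal in the timeline, the short sketch below redoes the arithmetic using only figures quoted on the slides (320,000 images per year, 48 weeks, 60 specimens per hour per person, 240,000 specimens in Year 1). The function and variable names are illustrative, and it assumes the current per-person imaging rate stays unchanged, which the proposed assembly-line workflow is meant to improve.

```python
IMAGES_PER_YEAR = 320_000      # imaging goal for each of the next 3 years
WORK_WEEKS_PER_YEAR = 48       # working weeks per year used on the slide
RATE_PER_PERSON_HOUR = 60      # current imaging rate (specimens per hour per person)
YEAR_1_DIGITIZED = 240_000     # specimens digitized in Year 1


def weekly_targets(images_per_year, weeks, rate_per_person_hour):
    """Return (images per week, person-hours of imaging per week)."""
    images_per_week = images_per_year / weeks
    person_hours_per_week = images_per_week / rate_per_person_hour
    return images_per_week, person_hours_per_week


images_per_week, person_hours = weekly_targets(
    IMAGES_PER_YEAR, WORK_WEEKS_PER_YEAR, RATE_PER_PERSON_HOUR
)
project_total = YEAR_1_DIGITIZED + 3 * IMAGES_PER_YEAR

print(f"{images_per_week:.0f} images per week")
print(f"{person_hours:.0f} person-hours of imaging per week at the current rate")
print(f"{project_total:,} specimens over Years 1-4")
```

The weekly figure comes out near 6,667, slightly above the 6,500 quoted on the slide, and the four-year total matches the 1.2 million specimen goal. At the current rate, that is roughly 111 person-hours of imaging every week, which is the motivation for simplifying the handling steps.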