Improving access, usability and enriching data on 385 million natural history specimens
Laurence LIVERMORE 1, John TWEDDLE 1 & Rob CUBEY 2
1 Natural History Museum, London; 2 Royal Botanic Garden Edinburgh
NBN Crowdsourcing Data Capture Summit
25 September 2015
• Intro
– SYNTHESYS Project
– Crowdsourcing research & key findings
– Why build a new platform?
• Platform functionality – What will it do?
• Strategy & relevance to other organisations
• Future & concluding remarks
Overall aim:
“to create an integrated European infrastructure for
researchers in the natural sciences”
• EU FP7 framework project
• 18 Partners
• 3 core strands of work:
Transnational Access improves accessibility of natural history collections through funded physical access to collections / expertise and facilities.
Joint Research Activities improve access to data stored digitally within NH collections by extracting and enhancing data from digitised collections
Network Activities deliver collection management policies, best practice models, unified standards and protocols for new and emerging collections.
• Automated data collection from digital images
• New methods for 3D digitisation of NH collections
• Access and management of an integrated European digital collection (with NA2)
DNA sequencing viability
• Crowdsourcing metadata enrichment of digital images
Quantitative colour analysis
• Led by: RBGE (lead), NHM, MfN
• Applied human intelligence is still required for label transcription
• Some of the issues that are very challenging to solve computationally are:
– Diversity and irregularity of labels e.g. shape, size, contents
– Recognising and mapping of label data to atomised fields is complex
– Label data can be duplicated
– Label data can be irrelevant or contradictory
– Mixture of handwritten and printed text
• Crowdsourcing landscape changed since planning (2011-2012)
• Many platforms (recently) launched!
• SYNTHESYS partners developing/using platforms
• Growing understanding of best practices
(Ellwood et al., 2015; doi: 10.1093/biosci/biv005)
• General research report (sent to all survey participants)
– Platform comparisons
– Case studies
– Motivation, participation
– Organisational investment
• Functional requirement survey & platform assessments
Feature        ALA       h@h       LH          NfN         SDV: TC
Data Entry     single    single    multi       multi       single
Review         Y         Y         N           N           Y
Open source    Y         N         N           Y           ?
Mobile         Partial   N         N           N           N
PM + Admin     Y         N         ?           N           Y
Georef tool    Y         N         N           N           N
Projects       232       18**      30          4           139
Community      835       419       200+        6,721       340+
Contributions  128,135   145,574   1,365,200   1,025,033   ?
Plat. Age      4 years   7 years   3 years     2 years     2 years
Statistics gathered on or around 01/08/2014
Platform age is rounded up
• Led by Tim Conyers and Robert Prys-Jones
• Bird register project – initial test project for NfN
• 2,950 pages
• 315,785 transcriptions
• 75% of transcriptions by 1 volunteer!
• Project page: http://www.notesfromnature.org/#/archives/ornithological
• Contributor stats: http://data.nhm.ac.uk/dataset/notes-fromnature/resource/7f8fc5f5-90ae-4959-b286-9cb7951f2875?view_id=ce329dfd-99cb-4223-b615-ce95d6c707c7
• Led by Sarah Phillips
• British herbarium sheet transcription
• 13,000 transcriptions (2012-2014)
• Established community generated high quality data – even from handwriting interpretation
• Led by John Tweddle & Mark Spencer (+ AMC Team)
• Combining contemporary recording with historical datasets
• 1,000 participants, 30,000 classifications, 1,800 field records
• 200 new orchid locations (incl. for threatened spp.)
• New recorders, new activity for existing enthusiasts
• Preliminary analysis already found flowering dates are 10 days earlier for 2 orchid species
• Report by Santos et al comparing NfN vs internal transcription
• “Super” volunteer – more accurate and effective
• Registered users transcribed more than anonymous volunteers
• Anonymous/unregistered volunteers have higher error rates
                      Records   Errors   Error %
In situ temp. staff   10,677    26       0.24
In situ students      3,700     22       0.59
NfN registered        80,019    2,184    2.73
NfN anonymous         13,673    1,768    12.93
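The error percentages above are errors divided by records. A minimal sketch that recomputes them from the listed counts follows; the names `GROUPS`, `error_percent` and `error_table` are illustrative and not taken from the Santos et al report.

```python
# Records and errors per transcriber group, copied from the comparison above.
GROUPS = {
    "In situ temp. staff": (10_677, 26),
    "In situ students": (3_700, 22),
    "NfN registered": (80_019, 2_184),
    "NfN anonymous": (13_673, 1_768),
}


def error_percent(records: int, errors: int) -> float:
    """Errors as a percentage of records, rounded to two decimal places."""
    return round(100 * errors / records, 2)


def error_table(groups=GROUPS) -> dict:
    """Map each group name to its error percentage."""
    return {name: error_percent(r, e) for name, (r, e) in groups.items()}


if __name__ == "__main__":
    for name, pct in error_table().items():
        print(f"{name:<22}{pct:>6.2f}")
```

Recomputing the column this way is a quick check that the reported percentages match the raw counts, and the same function could be applied to per-volunteer counts if those were available.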
• Strongly recommend review-based transcription & multi-stage QC
• Need to offer better training to volunteers (but when?)
• Mechanisms to review incomplete submissions (either human or technical error)
• Highlighted benefits of analysing data – some errors and platform issues could have been fixed earlier…
CS isn’t free and participation isn’t a given!
• Understanding why volunteers participate in crowdsourcing endeavours and how to support, maintain and reward their involvement is central to success
• Narrative, tasks, supporting resources & feedback all affect participation
• Social aspects of crowdsourcing are critical and should not be ignored
• Motivations of participants vary and can be hard to determine
• Increasing number of studies, but biased coverage
• Enthusiasm and interest in project topic
• Desire to record, find and discover
• Learning and development of new skills
• Contribution to the greater good (society/science)
• Sense of purpose and belonging to a community (social)
• On-going, rapid feedback and thanks
• Evidence that the data are being used
• Social interaction and community
• Personal learning and progression
• Recognition and reputational gain (incl. super-contributors)
• Awards, games, badges, leaderboard (work for some people, not others)
• Projects need to be personally and socially relevant to succeed
• Motivations of participants often quite different to those of project designer
• One size rarely fits all - danger of making assumptions
• Key to success is working with and understanding target participants – and adapting
• Clear project rationale with both cultural and scientific benefits
• Projects should be actively promoted and monitored
• Scientists should be visible and engaged with volunteers
• Develop best practice for motivating and retaining volunteers
(self-establishing community structure and forum, good science, tasks of interest, different rewards etc)
• Platform should use existing data standards – reduce bottleneck for collections management ingestion
• Resulting data should be freely available – projects do not end when all tasks are complete!
• Communication, outreach and support (e.g. dedicated staff time to develop and provide feedback to an external community, internal project manager and scientists)
• Strategic project selection (e.g. strong narrative, potential scientific outputs, public appeal, well-structured tasks of known complexity)
• Preparation of underlying data (e.g. data for autocomplete fields such as collector names or localities)
• Post-processing of data and subsequent import into institutional collections management system
• (?) Technical infrastructure (e.g. software, hardware and developers)
• Surveyed 14 EU partners
• Captured functional requirements
• Prioritised using MoSCoW method
• Requirements written up as user stories after identifying key user roles
MoSCoW Method
Must Have
Should Have
Could Have
Won’t Have
“As a Community Manager I want to be able to queue projects so when one project gets completed a new one goes live so Volunteers always have content”
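As a rough illustration of MoSCoW prioritisation, the sketch below orders a handful of requirements so that Must Have items come first. The four example items and their priorities are taken from the functional requirements in this deck; the `Priority` enum and `prioritise` function are hypothetical names, not part of any platform.

```python
from enum import IntEnum


class Priority(IntEnum):
    """MoSCoW categories; a lower value means a higher priority."""
    MUST = 1
    SHOULD = 2
    COULD = 3
    WONT = 4


# Example requirements with the priorities assigned in the survey.
requirements = [
    ("Ability to queue projects", Priority.COULD),
    ("Review-based transcription", Priority.MUST),
    ("Localisation support", Priority.SHOULD),
    ("User guides and help", Priority.MUST),
]


def prioritise(items):
    """Stable sort by priority: Must, then Should, Could, Won't."""
    return sorted(items, key=lambda item: item[1])


if __name__ == "__main__":
    for name, priority in prioritise(requirements):
        print(f"{priority.name:<7}{name}")
```

Because Python's sort is stable, items within the same category keep the order in which they were captured.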
• Platform as a service
• Strong management functionality
• Organisational control
• API (micro services) to allow embedding in mobile and institutional websites
• Key functionality (for example)
– Review-based transcription
– Full task archiving
– Multilingual support
– Georeferencing & mapping support
• Smithsonian Institution’s Transcription Centre
– Strong collaboration potential/expertise
– Met many functional requirements
– Open source & Drupal-based
– Highly customisable (in-house and externally)
– Significant NHM developer experience
• Still encourage partners to use other systems
– ALA, Les Herbonautes, Panoptes
– Differing functionality & specialisms
– NHM still intends to work with Zooniverse
• Technical analysis of major platforms
• Functional requirements document
• Finalise technical specification
• Hire developer(s)…
• Joint development and design work (NHM, Smithsonian, Simbiotica)
• User acceptance testing
• Launch in August 2016!
Q3 2015
Core Platform development deliverables/milestones refinement
Developer recruitment
List of potential launch projects
Q4 2015 Q1 2016 Q2 2016 Q3 2016
Internal UAT - volunteers/staff
Consortium testing
Confirm launch projects
Seek additional funding
Draft designs implemented
Workflow refinement
Public UAT/soft launch
Finalise launch functionality
Prepare launch projects
Future project reserve list
Post-launch functionality
Final designs implemented
Hard launch [31 Aug 2016]
Promotion
Report on usage and statistics
• Developer recruitment
• Challenging financial climate
• Multiple partners/stakeholders
• CMS integration – currently a massive bottleneck for all our digital projects
• A stronger online presence/brand
• Increased rate of collections digitisation (100k+/day?), hence access to data
• Higher scientific output
• An effective way of engaging (dispersed) members of the public
• Deeper and more meaningful engagement with our collections
• Platform model would work for institutes of all sizes
• Established scalable platform model
• Reduces technical overheads
• Modular structure allows customisation
• Open international collaboration (e.g. iDigBio/Smithsonian)
• Resulting data will be available for research (Data Portal)
• Directly doing research through crowdsourcing
• Deeper engagement with volunteers (visiteering)
• Tracking our data, benefits, impact and repatriation
• Dual approach for transcription – combine with OCR and intelligent sorting
• Beyond transcription…
• We need more data to do better crowdsourcing:
– Raw (unreviewed) transcription data
– Volunteer demographics
– Motivation for initial and sustained user engagement
– Experimental data on optimal UI configurations
• Produce more education and outreach materials to complement public engagement
• Recruiting & keeping developers is a challenge!
• Collaboration & partnerships are good but often result in compromises! (open source + modular helps but is £££)
• “Free” platforms still require community management to get best results
Anecdotal information, raw or processed transcription data welcome
• SYNTHESYS : JRA Objective 3 & NA3 Groups
• Smithsonian Institution: Meghan Ferriter & Michael Schall
• Other Contributors: Simon Chagnoux, Libby Ellwood, Paul Flemons, Tom Humphrey and Deborah Paul
• NHM: Celena Bretton, Tim Conyers, Lucy Robinson, Ben Scott, Vince Smith, Ali Thomas
• Ellwood, E.R., B. Dunckel, P. Flemons, R. Guralnick, G. Nelson, G. Newman, S. Newman, D. Paul, G. Riccardi, N. Rios, K. C. Seltmann and A. R. Mast. (2015). Accelerating digitization of biodiversity research specimens through online public participation. BioScience. doi: 10.1093/biosci/biv005
• It’s a support tool but also a service
• Some of the existing platforms will undoubtedly work for you
• We will have a developer in post sometime in October
• High level ideas for the NHM
• Other museums are not just about Natural History - we have other needs, so good to get feedback
• Project has a lot of stakeholders in SYNTHESYS but we are also aiming at other institutions in the UK and Europe
• Want to develop our role as a virtual hub for Citizen Science
• Want to use this session partially for requirements gathering
Very preliminary analysis!
Median flowering dates for Early-purple and Green-winged orchids are 10 days earlier cf. museum data (1830-1970)
Functional requirements: Must Have
• User guides and help
• Review-based transcription
• Support for relevant data standards
• Project descriptions
• Mechanism to report issues with projects or tasks
• Summarise active projects and their progress
• Different privileges within site
• Support for exporting data for clean-up or analysis using external services
• Templates for project creation
• Linked project-level documentation
• All projects and tasks should be archived on the site
• Zoomify-style interface
• Ability to import lists for controlled validation
• Ability to map and export data in different formats
• Hover-over help
• Tools for analysing and assessing the quality of user contributions
• Interactive examples/tutorials
• Ability to edit "live" projects
• Ability to create custom data export templates
• Standard field types and basic validation
• Permission-based administration
• Project progress bars
• Top users
• Support for maps to display georeferenced data
Functional requirements: Should Have
• Custom fields for data entry
• Ability for users to filter projects and tasks within projects based on their areas of interest
• A georeferencing tool that allows users to generate coordinates from locality information
• An annotation tool that includes determinations to capture data from more expert users
• New/featured project section
• Ability for users to ask general questions about projects
• User notifications
• Ability for users to request help from a dedicated community member or project experts
• Dynamic lists
• Ability to host and run multiple crowdsourcing projects at one time
• Links to content and project outputs
• Control hub for users
• Multikeying/multi-pass transcription
• News feed to display updates
• Reporting tools
• Localisation support
• Ability to contact all project volunteers
• Ability for users to submit or query records for discussion
• Simple content management
• Links to information to help with tasks (e.g. BHL, taxonomic catalogues, community created content)
• Potential to develop mobile/tablet based apps using API
• Flexible theming
• A modular structure to support different task types
• Support for organisations/institutes to use single sign on technology for internal users
• Built-in read/write API that is used by platform as primary means for delivering and creating content (e.g. dogfooding paradigm)
• Support for Google Analytics
• Support for public/community responses to tasks and discussions
• Simple site-wide user statistics
Functional requirements: Could Have
• Embedded videos
• Simple (non-HTML) interface for editing project information
• Ability to serve OCR text to users for correction
• Support for external users to use social media logins
• Ability to embed and display content from the platform on other websites
• Potential to integrate handwriting recognition in the platform
• Project blog
• Ability to queue projects
• Ability for users to share links to transcriptions/tasks to social media networks
• Links to information for discovery/educational purposes (e.g. EOL, Wikipedia, National Portals)
• Support for users to create their own resources to support a project
• Support for anonymous (unregistered user) contributions
• Support for markup (formatting in data entry fields)