Developing taxonomic names services to enhance findability
David Remsen
ECAT Programme Officer
October 22, 2007
WWW.GBIF.ORG
Overview of GBIF and data portal
Informatics challenges relating to taxon data
What we are doing about it
Wider implications of our efforts
…to make the world’s biodiversity data freely and universally available via the Internet
What is biodiversity?
GBIF follows the CBD's broad recognition of three levels of biological diversity:
• Molecules / genes
• Species
• Ecosystems / ecology
http://data.gbif.org/
Core data types on GBIF network
Taxon names
Taxon occurrence information
specimen records from natural history collections
observational records
Fields used in indexing records
Mandatory
Scientific name
Institutional code
Collection code
Catalogue number
Highly desirable
Geospatial location
Collection date
Higher taxon info
Date last modified
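The field list above can be sketched as a small record type with a completeness check. This is only an illustration: the attribute names, the `missing_mandatory` helper and the sample values (including the institutional code "XYZ") are assumptions made for the example, not a GBIF schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical attribute names; the slide lists concepts, not a schema.
MANDATORY = (
    "scientific_name",
    "institutional_code",
    "collection_code",
    "catalogue_number",
)


@dataclass
class OccurrenceRecord:
    # Mandatory for indexing
    scientific_name: Optional[str] = None
    institutional_code: Optional[str] = None
    collection_code: Optional[str] = None
    catalogue_number: Optional[str] = None
    # Highly desirable
    geospatial_location: Optional[Tuple[float, float]] = None
    collection_date: Optional[str] = None
    higher_taxon_info: Optional[str] = None
    date_last_modified: Optional[str] = None


def missing_mandatory(record: OccurrenceRecord) -> list:
    """Return the names of mandatory fields that are empty or unset."""
    return [name for name in MANDATORY if not getattr(record, name)]


# Placeholder values, for illustration only.
record = OccurrenceRecord(scientific_name="Papilio machaon",
                          institutional_code="XYZ")
print(missing_mandatory(record))
```

A record with an empty list of missing fields would be indexable under this sketch; the highly desirable fields are left optional on purpose.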
Users
List species recorded in Costa Rica
Find all occurrences of Papilio machaon
Find type specimen for Coffea odorata
Find occurrences of Primates from Madagascar
Find occurrences from Antananarivo Province
[Diagram: network architecture with the components Portal, Registry, Index, Mirror and provider Databases, the latter accessed through DiGIR and TAPIR]
http://data.gbif.org/
Species: Achillea millefolium
Kingdom: Animalia
Country: Madagascar
Dataset: Continuous Plankton Recorder Database
http://data.gbif.org/
GBIF Data Portal Web Services
taxon data: http://data.gbif.org/ws/rest/taxon
occurrence record data: http://data.gbif.org/ws/rest/occurrence
occurrence density data: http://data.gbif.org/ws/rest/density
dataset metadata: http://data.gbif.org/ws/rest/resource
data provider metadata: http://data.gbif.org/ws/rest/provider
data network metadata: http://data.gbif.org/ws/rest/network
http://data.gbif.org/ws/rest/occurrence/list/?taxonConceptKey=14724348&format=darwin
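A request like the one above can be assembled programmatically. The sketch below only builds the URL and does not contact the service; the function name `occurrence_list_url` is an assumption for illustration, while the endpoint and the `taxonConceptKey` and `format` parameters are taken from the example URL.

```python
from urllib.parse import urlencode

# Endpoint copied from the example request on the slide.
OCCURRENCE_LIST = "http://data.gbif.org/ws/rest/occurrence/list/"


def occurrence_list_url(taxon_concept_key: int, fmt: str = "darwin") -> str:
    """Build an occurrence-list request URL for one taxon concept key."""
    query = urlencode({"taxonConceptKey": taxon_concept_key, "format": fmt})
    return f"{OCCURRENCE_LIST}?{query}"


url = occurrence_list_url(14724348)
print(url)
```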
iSpecies
Hundreds of Institutional providers
Thousands of Resources
Millions of Records
Collective & Integrated Access
Wide Taxonomic, Temporal and Geographic Scope
Free and Open Access to all
Go forth and integrate!
Parallel: GenBank
Everything I just said meets the names problem in biology
All accumulated information of a species is tied to a scientific name, a name that serves as a link between what has been learned in the past and what we today add to the body of knowledge.
- Grimaldi & Engel, 2005, Evolution of the Insects
Access to data is via limited points of entry
Biology has a “names problem.”
This names problem impacts these data access entry points.
Exacerbated by:
Wide taxonomic, temporal scope
Federated origins of data
Synonymy
A single concept may reference multiple names
Equivalent
Inclusive
Homography (Homonymy)
A single name may refer to multiple concepts
Definition
A single name may refer to multiple KINDS of concepts
A lexical “concept”
A set of character strings
A nomenclatural concept
A Code-regulated fact
A taxonomic concept
A Hypothesis or opinion
Synonyms
Homonyms
All of these are important to distinguish
Lexical/Orthographic
Informed by: Nomenclators, Taxonomies, Algorithm
Nomenclatural
Informed by: Nomenclators, Monographs (with interpretation)
Taxonomic
Informed by: Monographs, Floras, Faunas, derived checklists
Different classes of equivalence are addressed by different resources
Lexical synonym: A single concept may reference multiple names
ILPIN
IPNI
MOBOT
Gerardia paupercula (Gray) Britt. var borealis (Pennell) Deam
Gerardia paupercula var borealis (Pennell) Deam
Gerardia paupercula Britt. var borealis Deam
Informed by: Nomenclators, Taxonomies, Algorithm
Identifies the preferred lexigraphy of the name
Automates the grouping of lexical variation
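One way such grouping might be automated is to reduce each name string to a canonical form without authorship and group on that. The sketch below is a deliberately crude heuristic, not the method used by any of the resources named above: it assumes that author strings are either parenthesised or start with a capital letter, which fails for authors such as "de Candolle". The function names are hypothetical.

```python
import re
from collections import defaultdict


def canonical_form(name: str) -> str:
    """Strip authorship: drop parenthesised text and capitalised tokens
    after the genus, keeping lower-case epithets and rank markers."""
    without_parens = re.sub(r"\([^)]*\)", " ", name)
    tokens = without_parens.split()
    if not tokens:
        return ""
    kept = [tokens[0]]
    kept += [t.rstrip(".") for t in tokens[1:] if t[0].islower()]
    return " ".join(kept)


def group_variants(names):
    """Group name strings that share the same canonical form."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_form(name)].append(name)
    return dict(groups)


variants = [
    "Gerardia paupercula (Gray) Britt. var borealis (Pennell) Deam",
    "Gerardia paupercula var borealis (Pennell) Deam",
    "Gerardia paupercula Britt. var borealis Deam",
]
groups = group_variants(variants)
print(groups)
```

All three strings from the slide fall into a single group keyed by "Gerardia paupercula var borealis".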
Orthographic synonym: A single concept may reference multiple names
Loligo pealeii Loligo pealii
Loligo pealei Loligo plei
Informed by: Nomenclators, Taxonomies, Algorithm
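The algorithmic part of orthographic matching can be illustrated with a plain edit-distance comparison. This is a sketch of one generic technique (Levenshtein distance), not a description of any specific GBIF algorithm; treating "Loligo pealeii" as the reference spelling and using a threshold of 2 are assumptions for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


reference = "Loligo pealeii"  # assumed reference spelling
candidates = ["Loligo pealii", "Loligo pealei", "Loligo plei"]
distances = {name: edit_distance(reference, name) for name in candidates}
close = [name for name, d in distances.items() if d <= 2]
print(distances)
print(close)
```

With this threshold the first two spellings are grouped with the reference and "Loligo plei" is not. Whether a given string is a misspelling or a separate name cannot be settled by distance alone, which is why nomenclators and taxonomies are listed alongside algorithms.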
Vernacular synonym: A single concept may reference multiple names
Nomenclatural synonym: A single concept may reference multiple names
ILPIN Gerardia paupercula (Gray) Britt. var borealis (Pennell) Deam
IPNI
MOBOT
Gerardia paupercula var borealis (Pennell) Deam
Gerardia paupercula Britt. var borealis Deam
MOBOT
IPNI
Agalinis paupercula (Gray) Britton var. borealis Pennell (Zenkert 1934)
OHIO DNR Agalinis paupercula (Gray) Britt. var. borealis Pennell
Agalinis paupercula Britton var. borealis Pennell
ITIS Agalinis paupercula var. borealis Pennell
Informed by: Nomenclators and generally NOT by taxonomy
Taxonomic synonym: A single concept may reference multiple names (or it may not)
Informed by: Taxonomic Sources
Synthesized synonymy: A bit of everything
Informed by: Algorithm, Nomenclators, Taxonomic Sources
Another example
Aedes calopus | Stegomyia Aegypti | Culex aegypti
Synonymy: Inclusive
Classifications
Catalogue of Life Integrated Classification
Annotated Checklist of the Neuroptera - Mansell 2006
NCBI Taxonomy
Cladograms, Phylograms
Phylogenetic representations
Regional lists
Cetacea of the Hebrides
Flora of China
Thematic Lists
2006 IUCN RedList of Threatened and Endangered Species
WoRMS/OBIS Marine Taxa
100 of the World’s Worst Invasive Alien Species (in GISIN)
Implications for data retrieval
Frost 2005 AMNH
• Notopthalmus viridescens
• Triturus viridescens
• Notopthalmus viridescens
• Notophthalmus viridescens
• Notophthalma viridescens
• Diemyctylus viridescens
• Triton viridescens
• Molge viridescens
• Diemyctylus minatus viridescens
• Triturus viridescens dorsalis
• Diemyctylus viridescens dorsalis
• Notophthalmus viridescens dorsalis
• … 24 others
Dolbe 2004
• Notopthalmus viridescens viridescens
• Triturus viridescens
• Notopthalmus viridescens
• Notophthalmus viridescens
• Notophthalma viridescens
• Diemyctylus viridescens
• Triton viridescens
• Molge viridescens
• Notophthalmus viridescens dorsalis
• Triturus viridescens dorsalis
• Diemyctylus viridescens dorsalis
• Notophthalmus viridescens louisianensis
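The retrieval consequence of the two lists can be made concrete by comparing them as sets: a search expanded with one source's synonymy will use a different set of name strings than a search expanded with the other. The sketch below uses only the names shown above; the "24 others" in the first list are not reproduced, so the "only in the second list" result is provisional. Variable names are hypothetical.

```python
# Names as listed on the slide (spellings kept as given).
frost_2005 = {
    "Notopthalmus viridescens",
    "Triturus viridescens",
    "Notophthalmus viridescens",
    "Notophthalma viridescens",
    "Diemyctylus viridescens",
    "Triton viridescens",
    "Molge viridescens",
    "Diemyctylus minatus viridescens",
    "Triturus viridescens dorsalis",
    "Diemyctylus viridescens dorsalis",
    "Notophthalmus viridescens dorsalis",
}
dolbe_2004 = {
    "Notopthalmus viridescens viridescens",
    "Triturus viridescens",
    "Notopthalmus viridescens",
    "Notophthalmus viridescens",
    "Notophthalma viridescens",
    "Diemyctylus viridescens",
    "Triton viridescens",
    "Molge viridescens",
    "Notophthalmus viridescens dorsalis",
    "Triturus viridescens dorsalis",
    "Diemyctylus viridescens dorsalis",
    "Notophthalmus viridescens louisianensis",
}

shared = frost_2005 & dolbe_2004
only_first = frost_2005 - dolbe_2004
only_second = dolbe_2004 - frost_2005  # provisional: first list is truncated

print(len(shared), sorted(only_first), sorted(only_second))
```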
Homography (Homonymy)
A single name may refer to multiple concepts
Homographs
Virginia (the state) & Virginia Baird & Girard 1853 (the genus)
Tumor (cancer) & Tumor Huang in Huang Dawei 1990
Informed by: Algorithms/Lexicons (word sense disambiguation)
Homonym
Agathis montana (the conifer) & Agathis montana (the wasp)
Wagneria Meladze 1967 & Wagneria Heilprin 1887 & 12 other Wagneria
Informed by: Nomenclators and Taxonomists
Nomenclators establish the factual basis of homonyms and partial disambiguation method
Taxonomy provides a disambiguation method
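Disambiguation by taxonomy can be sketched as a lookup that narrows a homonym by its higher classification. The tiny in-memory catalogue, the `NameUsage` type and the `resolve` function are hypothetical and stand in for whatever nomenclator or checklist service would supply this information in practice.

```python
from typing import List, NamedTuple, Optional


class NameUsage(NamedTuple):
    name: str
    kingdom: str
    note: str


# Hypothetical catalogue holding the homonym pair from the slide.
CATALOGUE = [
    NameUsage("Agathis montana", "Plantae", "conifer"),
    NameUsage("Agathis montana", "Animalia", "wasp"),
]


def resolve(name: str, kingdom: Optional[str] = None) -> List[NameUsage]:
    """Return all usages of a name, optionally narrowed by kingdom."""
    hits = [usage for usage in CATALOGUE if usage.name == name]
    if kingdom is not None:
        hits = [usage for usage in hits if usage.kingdom == kingdom]
    return hits


ambiguous = resolve("Agathis montana")
plant_only = resolve("Agathis montana", kingdom="Plantae")
print(len(ambiguous), plant_only)
```

Without the classification context the name returns two usages; with it, one. A record that carries higher taxon information can therefore be matched to the intended concept.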
Taxon Concept (Polysemes)
Gorilla gorilla Wilson and Reeder 1992 vs Gorilla gorilla Groves 2003
Informed by: Taxonomic opinion via monographs, floras, faunas, derived lists
The names problem is inherent to all taxon data
We need a Global Taxonomic Resource
Needs to treat all names
Support multiple taxonomic opinions
Depends on many different data sources
The informatics whole is greater than the sum of its content parts
Can only work in a federated environment
Requires communal data exchange standards and communications protocols
Current GBIF Taxonomic Infrastructure (ECAT)
Catalogue of Life
International Plant Names Index (IPNI)
Index Fungorum
Is not enough
EXPAND to Global Taxonomic Infrastructure
Mobilize wide array of “checklist resources”
Promote the use of nomenclatural GUIDs in all taxonomic checklists
Enable synthesis of resources
Enable informatics web services
Address Synonymy
Wider access to, and explicit classing of synonyms
Access to multiple lexical grouping algorithms
Access to, and support of, development of nomenclators
Promote the use of nomenclatural GUIDs in all taxonomic checklists
More Taxonomic, Regional, Thematic checklists
Comprehensive Vernacular Names catalogue
Address Homography, Polysemy
Rapid cataloguing of homography
Access to multiple lexical grouping algorithms
Catalogue and classify all genera
Development of multiple disambiguation methods
Standardized representation of taxon concepts
Development of taxon concept comparators
Explicit assertions of concept relations
As a consumer of taxonomic data resources
As a consumer of name services
As a provider of taxonomic metadata
Increased interoperability
Web site: www.gbif.org
Data portal: www.gbif.net
GBIF Secretariat
Universitetsparken 15
2100 Copenhagen
Denmark
E-mail: dremsen@gbif.org
Phone: +45 3532 1470
Fax: +45 3532 1480
GBIF Secretariat building, supported by a grant from the Aage V. Jensens Fonde