Döring DarwinCoreArchives CLB

Darwin Core Archives
Checklist Archives
Checklist Extensions
Archive Tools
Checklist Bank
Markus Döring & David Remsen, GBIF 2010
Checklist Scope
Darwin Core

Ratified in 2009



Set of terms



Significant additions/refinements
Ongoing process
http://rs.tdwg.org/dwc/terms/index.htm
Not tied to technology
Use Text Guidelines for DwC-A

http://rs.tdwg.org/dwc/terms/guides/text/index.htm
Darwin Core Archives for interoperability

Simplicity




Flexible




Complete datasets, compressed
Allow for rich dataset metadata
Minimal requirement: a single CSV file with a header row
1:many extensions
Schema descriptor meta.xml
Property mapping to column or global value
GNA exchange format



Standard extensions
Taxonomic core conventions
Controlled vocabularies
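
The meta.xml descriptor maps each property either to a column index or to a global default value. As a minimal sketch, the Python below reads such a mapping from an illustrative descriptor. The element and attribute names follow the Darwin Core text guidelines; the file names, the column layout and the helper `read_core_mapping` are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Darwin Core text guidelines descriptor.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Illustrative descriptor: file names and column layout are made up.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" encoding="UTF-8"
        fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" ignoreHeaderLines="1">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode" default="ICZN"/>
  </core>
</archive>"""


def read_core_mapping(xml_text):
    """Return the core file location, id column, column mappings and global defaults."""
    root = ET.fromstring(xml_text)
    core = root.find("dwc:core", NS)
    columns, defaults = {}, {}
    for field in core.findall("dwc:field", NS):
        term = field.get("term")
        if field.get("index") is not None:
            columns[term] = int(field.get("index"))   # property mapped to a column
        if field.get("default") is not None:
            defaults[term] = field.get("default")     # property mapped to a global value
    return {
        "row_type": core.get("rowType"),
        "location": core.findtext("dwc:files/dwc:location", namespaces=NS),
        "id_index": int(core.find("dwc:id", NS).get("index")),
        "columns": columns,
        "defaults": defaults,
    }


mapping = read_core_mapping(META_XML)
```

Here `mapping["columns"]` holds the per-column properties and `mapping["defaults"]` the values that apply to every record in the file.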
Best Practices

Include dataset metadata file or URL




inside <archive metadata=“...”>
GBIF recognises EML files
For simplicity, a Dublin Core XML file is sufficient
Data file format




UTF8
tab or csv files
header row
NULL as empty string
not “\N” or “NULL”
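
Database exports often write NULL as “\N” or “NULL”. The sketch below shows one possible way to normalise such markers to empty strings while reading a tab-delimited file with a header row. The function `read_rows` and the sample data are hypothetical.

```python
import csv
import io

# Spellings of NULL commonly produced by database exports.
NULL_MARKERS = {"\\N", "NULL"}


def clean(value):
    """Replace a NULL marker with the empty string; leave other values untouched."""
    if isinstance(value, str) and value in NULL_MARKERS:
        return ""
    return value


def read_rows(handle, delimiter="\t"):
    """Yield one dict per data row, keyed by the header row."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        yield {key: clean(value) for key, value in row.items()}


# For a real file use: open(path, encoding="utf-8", newline="")
sample = io.StringIO(
    "taxonID\tscientificName\tparentNameUsageID\n"
    "1\tPlantae\t\\N\n"
    "2\tPoaceae\t1\n"
)
rows = list(read_rows(sample))
```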
Dwc:Taxon – Identifier

Relational data, Record ID




Asserting that taxa have a shared concept
ScientificNameID


= TaxonID for checklist archives
= OccurrenceID for occurrence archives
TaxonConceptID


the primary key that other id terms relate to
Links out to an optional name identifier, ideally a GUID
Identifiers are plain strings and can be in any format
Literal terms, e.g. parentNameUsage



Every Dwc ID term has a literal counterpart
Redundant if ID terms are used
To be avoided for relations, because literals are ambiguous, e.g. homonyms
Dwc:Taxon - Classification


Classification only for accepted taxa, not synonyms
parentNameUsageID




Denormalised (prefer the use of parentNameUsageID)



Kingdom,Phylum,Class,Order,Family,Genus,Subgenus
No explicit records required for higher taxa
TaxonRank


Allows for arbitrary ranks and levels
Beware infinite loops
Root with parentID=NULL or parentID=recordID
String, but recommended vocabulary
http://rs.gbif.org/vocabulary/gbif/rank.xml
Examples http://code.google.com/p/gbif-ecat/wiki/publishingClassifications
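
Because parentNameUsageID allows arbitrary levels, a consumer has to guard against loops when walking up the hierarchy. The following is a minimal sketch, assuming the core records were loaded into a dict that maps each record id to its parent id. The function name `classification` and the sample ids are hypothetical.

```python
def classification(record_id, parent_of):
    """Return the chain of ids from record_id up to its root.

    A root has an empty parent id or points to itself.
    A parent id with no record of its own is kept as the last element.
    """
    chain, seen = [], set()
    current = record_id
    while current:
        if current in seen:
            raise ValueError(f"cycle in classification at {current!r}")
        seen.add(current)
        chain.append(current)
        parent = parent_of.get(current)
        if parent == current:  # root pointing to itself
            break
        current = parent
    return chain


parent_of = {"1": "", "2": "1", "3": "2", "4": "4", "5": "6", "6": "5"}
lineage = classification("3", parent_of)
```

Records "5" and "6" point at each other, so `classification("5", parent_of)` raises a `ValueError` instead of looping forever.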
Dwc:Taxon - Synonyms

Synonyms are records in the core file


acceptedNameUsageID




Synonyms point to the accepted/valid name usage
Accepted names have NULL or point to themselves
Pro parte synonyms list all accepted IDs, concatenated with the | symbol
taxonomicStatus



But classification should be ignored
Accepted, (hetero-/homotypic) synonym, misapplied
See http://rs.gbif.org/vocabulary/gbif/taxonomic_status.xml
nameAccordingTo

sec. / sensu part of taxon concepts
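
The acceptedNameUsageID conventions above can be read with a few lines of code. This is only a sketch under the stated rules (empty or self-pointing for accepted names, | as separator for pro parte synonyms). The function `accepted_ids` and the sample records are hypothetical.

```python
def accepted_ids(record):
    """Return the accepted taxon ids a usage points to; [] if it is accepted itself."""
    own_id = record["taxonID"]
    raw = record.get("acceptedNameUsageID") or ""
    # Pro parte synonyms carry several ids separated by the | symbol.
    ids = [part.strip() for part in raw.split("|") if part.strip()]
    # An accepted name may point to itself; that is not a synonym relation.
    return [i for i in ids if i != own_id]


accepted = accepted_ids({"taxonID": "1", "acceptedNameUsageID": "1"})
synonym = accepted_ids({"taxonID": "2", "acceptedNameUsageID": "1"})
pro_parte = accepted_ids({"taxonID": "3", "acceptedNameUsageID": "10|11"})
```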
Dwc:Taxon – Nomenclature

scientificName


full name with authorship
genus, subgenus, specificEpithet, verbatimTaxonRank,
infraspecificEpithet, scientificNameAuthorship

namePublishedIn
nomenclaturalStatus

nomenclaturalCode



http://rs.gbif.org/vocabulary/gbif/nomenclatural_code.xml
originalNameUsageID

Basionym: pointer to the usage that first established the name
Darwin Core Extensions
Dwc Extensions - Basics

One to many relation, schema descriptor meta.xml

id column required to join extensions
rowType specifies the class of records / extension

Property mapping to column or global value


List of allowed properties with




Definition, examples, further link
Mandate Vocabulary
Basic data types: string, integer, decimal, boolean, date, dateTime
Centrally hosted at http://rs.gbif.org


Staging environment
Production is manually moderated, but open to community
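
The one-to-many relation between core and extension records comes down to a join on the id column. As a rough sketch, assuming both files were already read into lists of dicts, the join could look like this. The key names `id` and `coreid`, the function `join_extension` and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def join_extension(core_rows, extension_rows, core_id="id", ext_id="coreid"):
    """Yield (core_row, [extension_rows]) pairs, joined on the id column."""
    by_core = defaultdict(list)
    for row in extension_rows:
        by_core[row[ext_id]].append(row)
    for core in core_rows:
        # Core records without extension records get an empty list.
        yield core, by_core.get(core[core_id], [])


taxa = [
    {"id": "1", "scientificName": "Abies alba Mill."},
    {"id": "2", "scientificName": "Picea abies (L.) H.Karst."},
]
vernaculars = [
    {"coreid": "1", "vernacularName": "Silver fir"},
    {"coreid": "1", "vernacularName": "Weißtanne"},
]
joined = list(join_extension(taxa, vernaculars))
```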
Dwc:Taxon Extensions

Frozen soon for GNA “Simple Exchange Format”
http://rs.gbif.org/extension/gbif/1.0/





Vernaculars
Distribution
Bibliography
Alternative ids & links: webpage, LSID, DOI, JSON, etc.
Candidates for further extensions





species info
images
nomenclatural acts & name relations
concept relations
type specimen
Darwin Core Tools
Publishing support
DwC-A Reader Java library


Provides iterators across star schema
Dwc terms and GNA extension terms as enumerations
Validator
Status: Under Evaluation
http://tools.gbif.org/dwca-validator/
Integrated Publishing Toolkit

Compose EML Metadata

Connect to database

Upload Data

Transform to DWCA

Publish via GBIF
Status: Stable release – end 2010
http://ipt.gbif.org
Guidelines and Best Practices
DB Admin skills
Database export
No tools required
Successful pilots
• Ireland
• NBN UK
• Norway
• Avian Knowledge Network
• IPNI
• IRMNG
Status: Drafts for November campaign (see roadmap)
Authoring Descriptor XML
Metafile
Status: Ready for Review
http://tools.gbif.org/dwca-assistant/
Excel Spreadsheet Templates
Status: Ready for Review/Testing
Spreadsheet Processor
Status: Ready for Review
http://tools.gbif.org/spreadsheet-processor/
Checklist Bank
Indexing checklists
GBIF Checklist Bank

Rich index to checklists and their content



All of Dwc Taxon and GNA Simple Format extensions:
Vernacular names, Identifier & Links, Distribution, References
~35 million name usages, 90 datasets
+ 8500 derived from occurrence index
Checklists

DwC-A created by





Publisher
Adapters (CoL, ITIS, NCBI, USDA, GRIN, TreeOfLife)
Manual transformation, static
No versioning
4 main types: taxonomic, nomenclatural, occurrences, thematic
Name Usages

Checklists are made up of name usages
A plain name string, optionally with:






Classification
Taxonomic status, e.g. synonym, misapplied name
Original name, i.e. basionym
According to, i.e. taxon concept
Nomenclatural status
Original publication
Lexical Grouping

Name strings are parsed and grouped




Correct & incorrect spellings
Homonyms in several groups
Semiautomatic process
largely based on canonical name, year and higher classification
Allows for


Fuzzy matching
Checklist crosswalk
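
To illustrate the idea of lexical grouping, the sketch below groups name strings by a crude canonical form plus kingdom and uses the standard library for fuzzy matching. It is not the Checklist Bank algorithm: the parsing rule in `crude_canonical` is a deliberately naive stand-in for a real name parser, and all function names and sample data are hypothetical.

```python
import difflib
from collections import defaultdict


def crude_canonical(name):
    """Keep the first word plus following all-lowercase alphabetic words.

    Naive: fails on hyphenated epithets, rank markers and lowercase author particles.
    """
    tokens = name.split()
    kept = tokens[:1]
    for token in tokens[1:]:
        if not (token.isalpha() and token.islower()):
            break
        kept.append(token)
    return " ".join(kept).lower()


def group_usages(usages):
    """Group name strings by (canonical form, kingdom) so homonyms stay apart."""
    groups = defaultdict(list)
    for usage in usages:
        key = (crude_canonical(usage["scientificName"]), usage.get("kingdom", ""))
        groups[key].append(usage["scientificName"])
    return dict(groups)


usages = [
    {"scientificName": "Abies alba Mill.", "kingdom": "Plantae"},
    {"scientificName": "Abies alba Miller, 1768", "kingdom": "Plantae"},
    {"scientificName": "Ficus L.", "kingdom": "Plantae"},
    {"scientificName": "Ficus Röding, 1798", "kingdom": "Animalia"},
]
groups = group_usages(usages)

# Fuzzy matching of a misspelt canonical name against the known ones.
known = sorted({canonical for canonical, _ in groups})
matches = difflib.get_close_matches("abies abla", known, n=1, cutoff=0.8)
```

The two spellings of the Abies alba authorship fall into one group, while the plant and animal genus Ficus end up in separate groups.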
Nomenclatural Grouping

Grouping homotypic names



Original name relation
Homotypic synonyms
Not yet available
Checklist Bank Portal


Preliminary until the new GBIF portal is complete
Browse & Search

Statistics
Links to source pages

Flickr Images

Checklist Bank Webservices


Common API to all resources
RESTful JSON services

search names, usages, checklists
navigate classification

http://ecat-dev.gbif.org/api/clb

Importing Darwin Core


Highly relational data
Challenges faced

Syntactically damaged sources
Wrong mappings, charsets, non-escaped line breaks or field delimiters
Data Quality
Broken referential integrity
Non names, e.g. “Unallocated Family”
No standard vocabularies for ranks, status, etc
Name strings have several publishing options
ScientificName, Authorship, Genus + epithets + rank
Classification has several publishing options
Normalised (parentUsage / parentUsageID) or flat via Linnean Ranks
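
When a source publishes its classification flat via Linnean rank columns, an importer has to derive the parent relations itself. The following is a minimal sketch of that step, assuming rows are dicts keyed by lowercase rank names. The function `parents_from_flat` and the sample rows are hypothetical, and keying higher taxa by rank and name alone would conflate homonyms, which is one reason the id-based form is preferable.

```python
# Flat rank columns, ordered from highest to lowest.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "subgenus"]


def parents_from_flat(rows):
    """Map each (rank, name) higher taxon to its (rank, name) parent; roots map to None."""
    parent_of = {}
    for row in rows:
        parent = None
        for rank in RANKS:
            name = row.get(rank, "")
            if not name:
                continue  # rank not given for this record
            taxon = (rank, name)
            parent_of.setdefault(taxon, parent)
            parent = taxon
    return parent_of


rows = [
    {"kingdom": "Plantae", "family": "Pinaceae", "genus": "Abies"},
    {"kingdom": "Plantae", "family": "Pinaceae", "genus": "Picea"},
]
higher_taxa = parents_from_flat(rows)
```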
GBIF Nub

Synthetic “union taxonomy”, checklist #1
Lexical group = nub name usage

Classification based on prioritized checklists



Align to 8 CoL kingdoms
Fixed accepted ranks:


Linnean + subfamily, subgenus, section, subspecies, variety, form
Other ranks become “Intermediate rank” synonyms

Homotypic synonyms only

Work in progress!
Personal Name Lists

User accounts with personal name lists



Add classifications, status, distribution, vernaculars, etc
from one or more indexed checklists
Also on the fly via webservices


Name string + kingdom/nom code
but only for already indexed name strings
In development …
Download