Darwin Core Archives
• Checklist Archives
• Checklist Extensions
• Archive Tools
• Checklist Bank
Markus Döring & David Remsen, GBIF 2010

Checklist Scope

Darwin Core
• Ratified in 2009
• Set of terms
• Significant additions/refinements
• Ongoing process
• http://rs.tdwg.org/dwc/terms/index.htm
• Not tied to technology
• Use Text Guidelines for DwC-A: http://rs.tdwg.org/dwc/terms/guides/text/index.htm

Darwin Core Archives for interoperability
• Simplicity
• Flexible
• Complete datasets, compressed
• Allow for rich dataset metadata
• Single CSV with header is the minimal requirement
• 1:many extensions
• Schema descriptor meta.xml
  • Property mapping to column or global value
• GNA exchange format
  • Standard extensions
  • Taxonomic core conventions
  • Controlled vocabularies

Best Practices
• Include dataset metadata file or URL inside <archive metadata="...">
  • GBIF recognises an EML file
  • For simplicity, a Dublin Core XML file will do
• Data file format
  • UTF-8
  • Tab or CSV files
  • Header row
  • NULL as empty string, not "\N" or "NULL"

Dwc:Taxon - Identifier
• Relational data, record ID
  • = TaxonID for checklist archives
  • = OccurrenceID for occurrence archives
  • The primary key that other ID terms relate to
• ScientificNameID
  • Link out to some optional name identifier, GUID really
• TaxonConceptID
  • Asserting that taxa have a shared concept
• Identifiers are plain strings, can be any format
• Literal terms, e.g. parentNameUsage
  • All Dwc ID terms have such a literal friend
  • Redundant if ID terms are used
  • To be avoided for relations, e.g. homonyms

Dwc:Taxon - Classification
• Classification only for accepted taxa, not synonyms
• parentNameUsageID
  • Allows for arbitrary ranks and levels
  • Beware infinite loops
  • Root with parentID=NULL or parentID=recordID
• Denormalised (prefer the use of parentNameUsageID)
  • Kingdom, Phylum, Class, Order, Family, Genus, Subgenus
  • No explicit records required for higher taxa
• TaxonRank
  • String, but recommended vocabulary: http://rs.gbif.org/vocabulary/gbif/rank.xml
• Examples: http://code.google.com/p/gbif-ecat/wiki/publishingClassifications

Dwc:Taxon - Synonyms
• Synonyms are records in the core file
  • But their classification should be ignored
• acceptedNameUsageID
  • Synonyms point to the accepted/valid name usage
  • Accepted names have NULL or point to themselves
  • Pro parte synonyms concatenate all accepted IDs with the | symbol
• taxonomicStatus
  • Accepted, (hetero-/homotypic) synonym, misapplied
  • See http://rs.gbif.org/vocabulary/gbif/taxonomic_status.xml
• nameAccordingTo
  • sec. / sensu part of taxon concepts

Dwc:Taxon - Nomenclature
• scientificName
  • Full name with authorship
• genus, subgenus, specificEpithet, verbatimTaxonRank, infraspecificEpithet, scientificNameAuthorship
• namePublishedIn
• nomenclaturalStatus
• nomenclaturalCode
  • http://rs.gbif.org/vocabulary/gbif/nomenclatural_code.xml
• originalNameUsageID
  • Basionym, pointer to the usage that first established the name

Darwin Core Extensions

Dwc Extensions - Basics
• One-to-many relation, schema descriptor meta.xml
  • id column required to join extensions
  • rowType specifies the class of records / extension
  • Property mapping to column or global value
• List of allowed properties with
  • Definition, examples, further link
  • Mandate
  • Vocabulary
  • Basic data types: string, integer, decimal, boolean, date, dateTime
• Centrally hosted at http://rs.gbif.org
  • Staging environment
  • Production is manually moderated, but open to community

Dwc:Taxon Extensions
• Frozen soon for GNA "Simple Exchange Format"
• http://rs.gbif.org/extension/gbif/1.0/
  • Vernaculars
  • Distribution
  • Bibliography
  • Alternative IDs & links: webpage, LSID, DOI, JSON, etc.
• Candidates for further extensions
  • Species info
  • Images
  • Nomenclatural acts & name relations
  • Concept relations
  • Type specimen

Darwin Core Tools

Publishing support

DwC-A Reader
• Java library
• Provides iterators across star schema
• Dwc terms and GNA extension terms as enumerations

Validator
• Status: Under Evaluation
• http://tools.gbif.org/dwca-validator/

Integrated Publishing Toolkit
• Compose EML metadata
• Connect to database
• Upload data
• Transform to DwC-A
• Publish via GBIF
• Status: Stable release, end 2010
• http://ipt.gbif.org

Guidelines and Best Practices
• DB admin skills
• Database export
• No tools required
• Successful pilots
  • Ireland
  • NBN UK
  • Norway
  • Avian Knowledge Network
  • IPNI
  • IRMNG
• Status: Drafts for November campaign (see roadmap)

Authoring
• Descriptor XML Metafile
  • Status: Ready for Review
  • http://tools.gbif.org/dwca-assistant/
• Excel Spreadsheet Templates
  • Status: Ready for Review/Testing
• Spreadsheet Processor
  • Status: Ready for Review
  • http://tools.gbif.org/spreadsheet-processor/

Checklist Bank

Indexing checklists

GBIF Checklist Bank
• Rich index to checklists and their content
• All of Dwc Taxon and GNA Simple Format extensions: Vernacular names, Identifier & Links, Distribution, References
• ~35 million name usages, 90 datasets + 8500 derived from occurrence index

Checklists
• DwC-A created by publisher
• Adapters (CoL, ITIS, NCBI, USDA, GRIN, TreeOfLife)
• Manual transformation, static
• No versioning
• 4 main types: taxonomic, nomenclatural, occurrences, thematic

Name Usages
• Checklists are made up of name usages
• A plain name string with optionally:
  • Classification
  • Taxonomic status, e.g. synonym, misapplied name
  • Original name, i.e. basionym
  • According to, i.e. taxon concept
  • Nomenclatural status
  • Original publication

Lexical Grouping
• Name strings are parsed and grouped
  • Correct & incorrect spellings
  • Homonyms in several groups
• Semiautomatic process largely based on canonical name, year and higher classification
• Allows for
  • Fuzzy matching
  • Checklist crosswalk

Nomenclatural Grouping
• Grouping homotypic names
  • Original name relation
  • Homotypic synonyms
• Not yet available

Checklist Bank Portal
• Preliminary until new GBIF portal complete
• Browse & search
• Statistics
• Links to source pages
• Flickr images

Checklist Bank Webservices
• Common API to all resources
• RESTful JSON services
  • Search names, usages, checklists
  • Navigate classification
• http://ecat-dev.gbif.org/api/clb

Importing Darwin Core: challenges faced
• Highly relational data
• Syntactically damaged sources
  • Wrong mappings, charsets, unescaped line breaks or field delimiters
• Data quality
  • Broken referential integrity
  • Non-names, e.g. "Unallocated Family"
  • No standard vocabularies for ranks, status, etc.
• Name strings have several publishing options
  • ScientificName, Authorship, Genus + epithets + rank
• Classification has several publishing options
  • Normalised (parentUsage / parentUsageID) or flat via Linnean ranks

GBIF Nub
• Synthetic "union taxonomy", checklist #1
• Lexical group = nub name usage
• Classification based on prioritized checklists
  • Align to 8 CoL kingdoms
• Fixed accepted ranks: Linnean + subfamily, subgenus, section, subspecies, variety, form
  • Other ranks become "Intermediate rank" synonyms
• Homotypic synonyms only
• Work in progress!

Personal Name Lists
• User accounts with personal name lists
• Add classifications, status, distribution, vernaculars, etc. from one or more indexed checklists
• Also on the fly via webservices
  • Name string + kingdom/nom code
  • But only for already indexed name strings
• In development …
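
Sketch: reading a core file through meta.xml

The archive slides describe a schema descriptor (meta.xml) that maps each property either to a column or to a global value, with a header row in the data file and NULL written as an empty string. The Python sketch below illustrates that idea under stated assumptions: the descriptor layout follows the Darwin Core Text Guidelines as we understand them, the file name taxa.txt, the sample rows, and the function name read_core are all hypothetical, and the data is held in strings so the example runs without a real archive.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the text-guide descriptor (assumed from the DwC Text Guidelines).
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Illustrative descriptor: three mapped columns plus one global value.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode" default="ICZN"/>
  </core>
</archive>"""

# Illustrative tab-delimited core file with a header row; empty string = NULL.
TAXA_TXT = (
    "taxonID\tscientificName\tparentNameUsageID\n"
    "1\tAnimalia\t\n"
    "2\tPuma concolor (Linnaeus, 1771)\t1\n"
)


def read_core(meta_xml, data_text):
    """Return core records as dicts keyed by the short term name."""
    core = ET.fromstring(meta_xml).find("dwca:core", NS)
    # The descriptor writes the tab as the two characters backslash + t.
    delimiter = core.get("fieldsTerminatedBy", ",").replace("\\t", "\t")
    skip = int(core.get("ignoreHeaderLines", "0"))
    id_index = int(core.find("dwca:id", NS).get("index"))
    fields = core.findall("dwca:field", NS)

    records = []
    rows = csv.reader(io.StringIO(data_text), delimiter=delimiter)
    for line_no, row in enumerate(rows):
        if line_no < skip or not row:
            continue
        record = {"id": row[id_index]}
        for field in fields:
            name = field.get("term").rsplit("/", 1)[-1]
            index = field.get("index")
            value = row[int(index)] if index is not None else ""
            # Empty string means NULL; fall back to the global default, if any.
            record[name] = value or field.get("default")
        records.append(record)
    return records


records = read_core(META_XML, TAXA_TXT)
```

An extension file would be read the same way and joined to the core on its id column, which is the one-to-many "star schema" mentioned on the extension slides.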
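
Sketch: walking parentNameUsageID safely

The classification slide warns about infinite loops and states that a root has parentID=NULL or parentID=recordID; the import slide lists broken referential integrity as a common data problem. A minimal Python sketch of a parent-chain walk that respects those rules could look like this. The function name classification and the sample records are hypothetical, and records are assumed to be plain dicts keyed by Darwin Core term names.

```python
def classification(records, record_id):
    """Return the names from the root down to record_id via parentNameUsageID."""
    by_id = {r["taxonID"]: r for r in records}
    path, seen = [], set()
    current = record_id
    while current:
        if current in seen:
            # The chain came back to a record already visited.
            raise ValueError(f"loop in parentNameUsageID chain at {current}")
        seen.add(current)
        record = by_id.get(current)
        if record is None:
            # Broken referential integrity: the parent ID points nowhere.
            raise KeyError(f"no record with taxonID {current}")
        path.append(record["scientificName"])
        parent = record.get("parentNameUsageID")
        # A root has an empty parent or points to itself.
        current = None if parent in (None, "", current) else parent
    return list(reversed(path))


# Illustrative data only.
TAXA = [
    {"taxonID": "1", "scientificName": "Animalia", "parentNameUsageID": None},
    {"taxonID": "2", "scientificName": "Felidae", "parentNameUsageID": "1"},
    {"taxonID": "3", "scientificName": "Puma concolor", "parentNameUsageID": "2"},
]

lineage = classification(TAXA, "3")  # ['Animalia', 'Felidae', 'Puma concolor']
```

Tracking visited IDs in a set keeps the check cheap and stops at the first repeated record instead of relying on a depth limit.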
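
Sketch: resolving acceptedNameUsageID

The synonym slide says that synonyms point to the accepted name usage, that accepted names have NULL or point to themselves, and that pro parte synonyms concatenate all accepted IDs with the | symbol. The Python sketch below shows one way to interpret a record under those conventions. The helper names accepted_ids and is_accepted and the sample records are hypothetical; the pro parte IDs are invented purely for illustration.

```python
def accepted_ids(record):
    """Return the taxonIDs of the accepted usage(s) this record resolves to."""
    raw = record.get("acceptedNameUsageID") or ""
    # Pro parte synonyms list several accepted IDs separated by "|".
    ids = [part.strip() for part in raw.split("|") if part.strip()]
    # NULL (empty) means the record itself is the accepted usage.
    return ids or [record["taxonID"]]


def is_accepted(record):
    """True if the record has no accepted ID or points to itself."""
    return accepted_ids(record) == [record["taxonID"]]


# Illustrative records only.
accepted = {"taxonID": "2", "scientificName": "Puma concolor (Linnaeus, 1771)",
            "acceptedNameUsageID": ""}
synonym = {"taxonID": "3", "scientificName": "Felis concolor Linnaeus, 1771",
           "acceptedNameUsageID": "2"}
pro_parte = {"taxonID": "9", "scientificName": "Aus bus",
             "acceptedNameUsageID": "10|11"}
```

Here accepted_ids(pro_parte) yields both targets, so a consumer can attach the synonym to each accepted usage while ignoring any classification carried on the synonym record itself.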