LSIDs and RDF in TDWG
Roger Hyam, TDWG, RBGE
Donald Hobern, GBIF
June 7-9, 2006 - Edinburgh, UK

Paradigm
• Starting assumption is that standards are about sharing data.
• Sharing data also implies sharing data through time.
[Diagram label: Archive]

What is Shared?
• Sharing raw literals isn’t much use.
• They need to be gathered together into ‘semantic’ units or objects.
[Diagram: the raw literals "Bellis", "perennis" and "1234" gathered into the object TaxonName:1234 (Bellis perennis)]

Semantics of Objects
• Objects need to be based on some shared semantics.
• There needs to be somewhere to look up what they mean – an ontology.
[Diagram: the object TaxonName: Bellis perennis linked to an Ontology]

Identity of Objects
• How do I refer to this object?
• Who should I credit?
• Who should I send corrections to?
• Is it the same record as I already have or is it a new one?
• What is the official version of this data? Has someone altered it before I received it?

TDWG TAG-1 Meeting
• There was consensus on:
– Architecture is concerned with shared data
– Biodiversity data will be modeled as a graph of identifiable objects
– The semantics of these objects will be encoded in a series of shared ontologies
– Ontologies will be related to each other on the basis of shared Base and Core ontologies as a minimum
• Discussion continues on how this is done

Implications
• We need an ontology to define and relate the objects we exchange.
• Ontology governance/management is paramount.
• We need a system of GUIDs to identify the objects.
• We need a roadmap for the protocols to exchange these objects.

Structure of the Ontology
[Diagram: layered ontology]
• Base Ontology: BaseThing, BaseActor
• Core Ontology: CoreTaxonName, CoreInstitution
• Domain Ontology: TaxonName, Herbarium, NomenclaturalType, NomenclaturalNote
• Application Ontologies: ABCD, DarwinCore, ???

Ontology Governance
• Allow people to create Domain subontologies easily – prevent alienation.
• Each ontology construct (concept) has a status.
• Status is increased by passing through explicit gates defined by actual usage.
[Diagram: status levels Experimental → Shared → Recommended]

What about RDF?
• The need to share identifiable objects has been established without reference to a technology.
• We are interested in objects, not triples.
• Typical use case involves a client consuming semantically heterogeneous data from multiple sources.
• Semantic Web technologies would be ideal – but aren’t part of the TDWG culture and there are ‘unbelievers’.

Current ‘Standards’
• DarwinCore & DiGIR
– Based on Z39.50
– HTTP based XML message / response
– Simple ‘flat’ application schemas (RDF-like)
• ABCD & BioCASe
– Based on DarwinCore & DiGIR
– Complex document structure
• TAPIR
– Unification of BioCASe and DiGIR
• No RDF, Objects or GUIDs here yet!

Combining Data
• GBIF data portal is the only ‘application’ that does data integration between these formats.
• There is no standard way to include XML fragments from other XSDs, other than xs:any.
• There is overlap between the different schemas and no easy way to merge them.

What about LSIDs?
• GUID-1 meeting considered several GUID technologies, including LSID, DOI and Handle.
• Life Science Identifiers are being assessed.
– I3C & OMG URNs
– urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
– getData()
– getMetadata()

LSID Permanence
• LSIDs should not be recycled, i.e. used for more than one object.
• LSIDs should always resolve, but it is OK for them to resolve to a ‘gone’ error (HTTP 404 Not Found or 410 Gone).
• No central authority to control these things.
• Even DOIs go away if there isn’t institutional backing!

LSIDs for Everything?
• Are there some things for which LSIDs are inappropriate?
– <logo rdf:resource="urn:lsid:example.com:branding:logo.gif" />
– xsi:schemaLocation="urn:lsid:example.com:xsd:taxon.xsd"
– xmlns:tn="urn:lsid:example.com:ontology:taxon/"
• There are definitely places where we will use something else.
• Other people will use their own identifiers, e.g. DOI, Handle etc.

So what’s cooking?
[Diagram: the ingredients in the pot]
• XSD Based Conceptual Schemas
• Recognised Need For GUIDs
• Different GUID Technologies
• A TDWG Ontology
• XML Based Exchange Protocols
• Emergent Semantic Web
• OGC Standards (GML)
• Other!
• 200+ Data Providers
• 50+ Million Anonymous ‘Records’
• BioMOBY
• Clients?

Possible Roadmap
• Build the ontology as a focus for semantics.
• Resolution and Harvest protocols should be relatively easy to plug into or wrap round existing service providers, so approach these first.
• Search/Query is more problematic: BioCASe, DiGIR, TAPIR, SPARQL, other?

Thank You
• Gordon and Betty Moore Foundation
• Global Biodiversity Information Facility
• NESC
• TDWG Members
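
Appendix: Anatomy of an LSID
The example LSID on the ‘What about LSIDs?’ slide, urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434, follows the pattern urn:lsid:authority:namespace:object, with an optional trailing revision. A minimal Python sketch of splitting such an identifier into its parts is given here. The names LSID and parse_lsid are illustrative assumptions and do not come from any LSID library; the sketch only checks the shape of the string and does not resolve anything (no getData() or getMetadata() calls).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Parts of an LSID (hypothetical helper type, for illustration only)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts.

    Raises ValueError if the string does not have that shape.
    """
    parts = text.strip().split(":")
    # Expect urn, lsid, authority, namespace, object and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn:lsid' prefix is compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    # Authority, namespace and object must all be present.
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
```

Keeping the parts separate makes it easy to see who issued an identifier (the authority) and what kind of object it names (the namespace) before any resolution protocol is chosen.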