An LSID resolver for specimens and a digression into issues raised by the use of GUIDs
Steve Perry (smperry@ku.edu)
© 2006 University of Kansas
LSID Resolver for Specimens, GUID-2

Part 1: Building an LSID resolver for specimens
©2006 KU BRC 16-Apr-20

How it Works
[Diagram: request and response flow through the resolver]
• A getMetadata() request arrives at the LSID Authority (set up through a config file)
• The LSID Authority hands the request to DiGIR2LSIDMetadataService
• DiGIR2LSIDMetadataService sends a SPARQL describe query to the SPARQL Service of DiGIR2
• The RDF response from the SPARQL Service is returned as the getMetadata() response

Details of Prototype Implementation
• Classes of Data
  – Specimens
• Metadata Representation
  – RDF in DarwinCore-inspired RDF-Schema
• Data Representation
  – N/A
• Experience with Stack
  – IBM Java toolkit
  – Great documentation (developerWorks article and Javadoc)
  – Very easy to implement and test (4 hours)
• Concerns
  – Integration of LSID client into existing software
  – SOAP not friendly to non-professional programmers

Conclusion: Resolution Is Easy
Other issues to resolve:
  – Developing ontologies
  – Mapping databases into RDF
  – Finding data to link to
  – Repatriating links into existing databases
  – Versioning
  – Duplicate detection
  – Long term archival storage and access
  – Data aggregation and caching
  – Querying across data from multiple providers
  – Annotating someone else's data without causing contradictions
  – Trust

Part 2: A digression into issues raised by the use of GUIDs

DiGIR2 :: A Semantic Web Publishing System
[Diagram: a DiGIR2 Server containing a Synchronizer (fed by a Data Source), a Triple Store, and Public Web Services: Harvest Service, SPARQL Service, LSID Authority]
• Not a protocol, a general-purpose RDF data provider
• Synchronizer converts source data into RDF, which is stored in a triple store
• Multiple services, including SPARQL and OAI-PMH, allow access to RDF data
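The resolver flow from Part 1 (a getMetadata() request turned into a SPARQL describe query against the DiGIR2 SPARQL Service) can be sketched in a few lines. This is only an illustration in Python, not the prototype's Java code: the function names `parse_lsid` and `build_describe_query` are hypothetical, and the LSID pattern is a loose approximation of the `urn:lsid:authority:namespace:object[:revision]` form.

```python
import re

# Loose, illustrative pattern for urn:lsid:<authority>:<namespace>:<object>[:<revision>].
# It is an assumption made for this sketch, not a full LSID grammar.
LSID_PATTERN = re.compile(
    r"^urn:lsid:([^:\s<>]+):([^:\s<>]+):([^:\s<>]+)(?::([^:\s<>]+))?$",
    re.IGNORECASE,
)


def parse_lsid(lsid):
    """Split an LSID into its parts, or raise ValueError if it does not match."""
    match = LSID_PATTERN.match(lsid)
    if match is None:
        raise ValueError(f"not an LSID: {lsid!r}")
    authority, namespace, object_id, revision = match.groups()
    return {
        "authority": authority,
        "namespace": namespace,
        "object": object_id,
        "revision": revision,  # None when the LSID carries no revision
    }


def build_describe_query(lsid):
    """Build the SPARQL DESCRIBE query a metadata service could send for this LSID."""
    parse_lsid(lsid)  # validate before embedding the LSID in a query
    return f"DESCRIBE <{lsid}>"


if __name__ == "__main__":
    print(build_describe_query("urn:lsid:nhm.ku.edu:Herps:32"))
```

In the architecture on the slides, the query string would be sent to the SPARQL Service and the RDF that comes back would be returned as the getMetadata() response; the transport is left out here.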
Synchronizer
• Synchronizes the triple store with the database
• Builds RDF using:
  – a data source
  – a data model (RDF-Schema, OWL ontology)
  – a mapping program
• Can perform transformations while mapping
• Can perform resource description tracking and versioning
• Standardizes mapping for better support of thematic networks

Synchronizer :: Mapping and Transformation
[Diagram: a source record is mapped to RDF by three function trees]
• Source record: catalogNumber = 32, latitude_deg = 30, latitude_min = 9, latitude_sec = 42, prep = E
• primaryResourceMapFunction: concatenate("urn:lsid:nhm.ku.edu:Herps:", getVar(catalogNumber))
• decimalLatitude: add(getVar(latitude_deg), divide(getVar(latitude_min), 60), divide(getVar(latitude_sec), 3600))
• preparation: if(equalsIgnoreCase(getVar(prep), "E"), "ETOH", "Skeleton")

Synchronizer :: Versioning and Tracking
Synchronizer output:
  urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
  urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "30.145"^^xsd:double
  urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
Contents of Triple Store:
  urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
  urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "270.234"^^xsd:double
  urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
A new version?
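The three function trees on the Mapping and Transformation slide can be read as ordinary functions over a source record. The sketch below is an interpretation of that diagram in Python, not the Synchronizer's actual mapping language; the function names are hypothetical, and the latitude conversion ignores hemisphere signs. For the sample record this arithmetic gives roughly 30.1617 rather than the 30.145 shown in the versioning diagrams, so the slide values are presumably illustrative.

```python
LSID_PREFIX = "urn:lsid:nhm.ku.edu:Herps:"  # prefix shown on the slide


def primary_resource(record):
    """concatenate(prefix, getVar(catalogNumber))"""
    return LSID_PREFIX + str(record["catalogNumber"])


def decimal_latitude(record):
    """add(deg, divide(min, 60), divide(sec, 3600)); hemisphere is not handled."""
    return (
        record["latitude_deg"]
        + record["latitude_min"] / 60
        + record["latitude_sec"] / 3600
    )


def preparation(record):
    """if(equalsIgnoreCase(getVar(prep), "E"), "ETOH", "Skeleton")"""
    return "ETOH" if record["prep"].lower() == "e" else "Skeleton"


def map_record(record):
    """Turn one source row into (subject, predicate, object) triples."""
    subject = primary_resource(record)
    return [
        (subject, "rdf:type", "dc:DarwinCoreSpecimen"),
        (subject, "dc:decimalLatitude", round(decimal_latitude(record), 6)),
        (subject, "dc:preparation", preparation(record)),
    ]


if __name__ == "__main__":
    row = {
        "catalogNumber": 32,
        "latitude_deg": 30,
        "latitude_min": 9,
        "latitude_sec": 42,
        "prep": "E",
    }
    for triple in map_record(row):
        print(triple)
```

Keeping each mapping as a small pure function mirrors the slide's point that transformations happen during mapping and that a shared mapping vocabulary can be standardized across providers.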
Synchronizer :: Versioning and Tracking
What to do with new versions of resource descriptions?
• First, track them: record outside of the RDF subsystem that a resource has been CRUD'd at a particular date and time
• After that, there are several ways to handle versioning:
  – No versioning
  – Non-persistent versioning
  – Persistent versioning
• Each of these affects how clients do searches and how descriptions should be cached and stored remotely

Versioning Schemes :: No versioning
• New version replaces old
• No new GUID assigned
• Simplest scheme
• Lose ability to retrieve old versions
• Must have application-level rules to find and remove effective duplicates
Contents of Triple Store:
  urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
  urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "30.145"^^xsd:double
  urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
[The diagram also shows the previous description of urn:lsid:nhm.ku.edu:Herps:32, with dc:decimalLatitude "270.234"^^xsd:double and dc:preparation "ETOH"^^xsd:String.]

Versioning Schemes :: Non-persistent versioning
• New GUID assigned
• Contents of old description removed
• New and old descriptions related to each other by predicates
• Do not have problems of old versions matching in cache search
• Given old, can find new (inefficient)
• Cannot retrieve old data
Contents of Triple Store:
  urn:lsid:nhm.ku.edu:Herps:76 rdf:type dc:DarwinCoreSpecimen
  urn:lsid:nhm.ku.edu:Herps:76 dc:decimalLatitude "30.145"^^xsd:double
  urn:lsid:nhm.ku.edu:Herps:76 dc:preparation "ETOH"^^xsd:String
  urn:lsid:nhm.ku.edu:Herps:76 pub:replaces urn:lsid:nhm.ku.edu:Herps:32
  urn:lsid:nhm.ku.edu:Herps:32 pub:replacedBy urn:lsid:nhm.ku.edu:Herps:76

Versioning Schemes :: Persistent versioning
• New GUID assigned
• Old description maintained
• New and old descriptions related to each other by predicates
• Old versions can end up in the triple store together with new ones
• Given old, can find new (inefficient)
• Can retrieve old
• Lots of triples!
Contents of Triple Store:
  urn:lsid:nhm.ku.edu:Herps:76 rdf:type dc:DarwinCoreSpecimen
  urn:lsid:nhm.ku.edu:Herps:76 dc:decimalLatitude "30.145"^^xsd:double
  urn:lsid:nhm.ku.edu:Herps:76 dc:preparation "ETOH"^^xsd:String
  urn:lsid:nhm.ku.edu:Herps:76 pub:replaces urn:lsid:nhm.ku.edu:Herps:32
  urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
  urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "270.234"^^xsd:double
  urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
  urn:lsid:nhm.ku.edu:Herps:32 pub:replacedBy urn:lsid:nhm.ku.edu:Herps:76

Versioning :: Mixed versioning
• At GUID-1, it was stated that different types of information require different versioning policies
• If implemented, this results in a mix of versioning schemes in the global graph
• Mixed versioning shifts the burden from providers that don't version to clients (caches, portals, etc.), which have to figure out whether they are getting only current versions or a mix of new and old (effective duplicates)

Versioning :: Some thoughts on identity
• Do GUIDs name things or identify the descriptions of things?
• Non-versioned changes to metadata always change the semantic meaning of the description (regardless of whether or not identity is changed)
• To paraphrase Heraclitus, “Different waters flow in the same river”
• When you decide that a change in a description does not require a change in version, you constrain the use of your data (you're interested in the river, I'm interested in the water)
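The three versioning schemes can be compared side by side with a toy triple store. The sketch below is an assumption-laden illustration in Python: the store is just a dictionary from GUID to a predicate/value dictionary, `apply_update` is a hypothetical name, and only the predicates `pub:replaces` and `pub:replacedBy` come from the slides.

```python
import copy

SCHEMES = ("none", "non_persistent", "persistent")


def apply_update(store, guid, new_description, scheme, new_guid=None):
    """Return a copy of the store after a new version of `guid` arrives.

    store: {guid: {predicate: value}}, a stand-in for a real triple store.
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    updated = copy.deepcopy(store)

    if scheme == "none":
        # New version replaces old; no new GUID; the old values are lost.
        updated[guid] = dict(new_description)
        return updated

    if new_guid is None:
        raise ValueError("this scheme assigns a new GUID to the new version")

    # Both remaining schemes store the new description under a new GUID
    # and link the two descriptions with predicates.
    updated[new_guid] = dict(new_description)
    updated[new_guid]["pub:replaces"] = guid

    if scheme == "non_persistent":
        # Old contents removed; only the forward pointer survives.
        updated[guid] = {"pub:replacedBy": new_guid}
    else:
        # Persistent: old description kept, plus the forward pointer.
        updated[guid]["pub:replacedBy"] = new_guid
    return updated


if __name__ == "__main__":
    old_id = "urn:lsid:nhm.ku.edu:Herps:32"
    new_id = "urn:lsid:nhm.ku.edu:Herps:76"
    store = {old_id: {"dc:decimalLatitude": 270.234, "dc:preparation": "ETOH"}}
    new_version = {"dc:decimalLatitude": 30.145, "dc:preparation": "ETOH"}
    for scheme in SCHEMES:
        print(scheme, apply_update(store, old_id, new_version, scheme, new_id))
```

Running the three schemes on the same input makes the trade-off on the slides visible: only the persistent scheme still holds the old latitude, and it is also the one that stores the most triples.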
Caching
• Lots of use cases for caching:
  – Aggregation for inference
  – Aggregation as solution to distributed query problem
  – Quality of service (response time)
  – Redundancy
• Caches should clearly communicate to clients whether the cache holds multiple historical versions of the same description so clients can avoid retrieving effective duplicates
• To support caching, data providers should support a harvesting mechanism

Incremental Harvesting
• Incremental harvesting is more efficient than bulk harvesting because it sends only recent changes
• “Give me all metadata changes since X”
• To support incremental harvesting we need to track type and date of changes (regardless of the versioning policy)
• This adds another set of requirements on data providers
• OAI Protocol for Metadata Harvesting (OAI-PMH)

The Open World
[Diagram, built up over several slides: three organizations each make an assertion about the same GUID, and the assertions are merged into one graph]
• Organization A: urn:lsid:A:ns:1 color "red"
• Organization B: urn:lsid:A:ns:1 size "large"
• Organization C: urn:lsid:A:ns:1 color "blue"
• Merged Graph: urn:lsid:A:ns:1 color "red"; size "large"; color "blue"
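The Open World diagram can be reproduced with plain sets of triples. This Python sketch merges the three organizations' assertions and then looks for subjects that end up with more than one value for a property. Treating `color` as single-valued is an assumption of the example: RDF itself allows both values, and the contradiction appears only once an application expects one color per resource. The function names are hypothetical.

```python
from collections import defaultdict


def merge_graphs(*graphs):
    """Union of all triples, which is what merging RDF graphs amounts to here."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def find_conflicts(triples, single_valued):
    """Return {(subject, predicate): values} where a single-valued predicate has several values."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in single_valued:
            values[(subject, predicate)].add(obj)
    return {key: found for key, found in values.items() if len(found) > 1}


if __name__ == "__main__":
    org_a = {("urn:lsid:A:ns:1", "color", "red")}
    org_b = {("urn:lsid:A:ns:1", "size", "large")}
    org_c = {("urn:lsid:A:ns:1", "color", "blue")}

    merged = merge_graphs(org_a, org_b, org_c)
    # Assumption: the application treats "color" as having one value per resource.
    print(find_conflicts(merged, single_valued={"color"}))
```

Organization B's statement merges cleanly because it adds a new property; Organization C's statement is the one that collides with what the authority A already said.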
The Open World
Two solutions to this problem
• Close the world
  – Ignore assertions about GUIDs that don't originate from the authority
• Narrow the world
  – Only allow certain assertions about GUIDs that don't originate from the authority
  – Accept/reject foreign authority notifications
• Treat everything as an assertion and record who makes it and what they intend by it
  – Named graphs and semantic web publishing warrants

Provenance, Attribution, and Trust
• Assign GUIDs to resources
• Assign GUIDs to the graphs that contain concise bounded descriptions, resulting in named “description” graphs
• For each description graph, create another named graph that contains information about the assertions made in it
• Second named graph is a “warrant” graph
• Warrant graph contains meta-metadata: an instance of a Warrant class with attributes such as assertedBy
• Carroll and Bizer presented “Semantic Web Publishing using Named Graphs” at the ISWC 2004 Trust Workshop

Issues with LSIDs and RDF
  – Developing ontologies
  – Mapping databases into RDF
  – Finding data to link to
  – Repatriating links into existing databases
  – Versioning
  – Duplicate detection
  – Long term archival storage and access
  – Data aggregation and caching
  – Querying across data from multiple providers
  – Annotating someone else's data without causing contradictions
  – Trust
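To make the description-graph and warrant-graph idea from the Provenance, Attribution, and Trust slide concrete, here is a small Python sketch using quads of the form (graph, subject, predicate, object). All identifiers prefixed with `ex:` and the graph naming scheme are hypothetical, chosen for illustration rather than taken from Carroll and Bizer's vocabulary. The `close_the_world` function shows one way a client could apply the "close the world" policy once every graph says who asserted it.

```python
def lsid_authority(lsid):
    """Authority part of urn:lsid:<authority>:<namespace>:<object>."""
    return lsid.split(":")[2]


def publish(graph_id, asserted_by, triples):
    """Emit a named description graph plus a warrant graph recording who asserted it."""
    warrant_graph = graph_id + ":warrant"   # hypothetical naming scheme
    warrant = warrant_graph + ":1"
    quads = [(graph_id, s, p, o) for (s, p, o) in triples]
    quads += [
        (warrant_graph, warrant, "rdf:type", "ex:Warrant"),
        (warrant_graph, warrant, "ex:assertedBy", asserted_by),
        (warrant_graph, warrant, "ex:aboutGraph", graph_id),
    ]
    return quads


def graph_owners(quads):
    """Map each description graph to the party named in its warrant."""
    about, who = {}, {}
    for _graph, subject, predicate, obj in quads:
        if predicate == "ex:aboutGraph":
            about[subject] = obj
        elif predicate == "ex:assertedBy":
            who[subject] = obj
    return {about[w]: who[w] for w in about if w in who}


def close_the_world(quads):
    """Keep only triples asserted by the authority that issued the subject's LSID."""
    owners = graph_owners(quads)
    return [
        (s, p, o)
        for (g, s, p, o) in quads
        if g in owners and s.startswith("urn:lsid:") and owners[g] == lsid_authority(s)
    ]


if __name__ == "__main__":
    quads = publish("urn:lsid:A:graphs:1", "A", [("urn:lsid:A:ns:1", "color", "red")])
    quads += publish("urn:lsid:C:graphs:1", "C", [("urn:lsid:A:ns:1", "color", "blue")])
    print(close_the_world(quads))
```

Because the warrants are kept alongside the data rather than discarded, the same quads could also support the narrower policies on the slide, for example accepting foreign assertions only for selected predicates.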