An LSID resolver for specimens and a digression into issues raised by the use of GUIDs
Steve Perry ([email protected])
© 2006 University of Kansas
LSID Resolver for Specimens
GUID-2
Part 1
Building an LSID
resolver for specimens
©2006 KU BRC
16-Apr-20
How it Works
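[Diagram: a getMetadata() request reaches the LSID Authority, which is set up through a config file to use DiGIR2LSIDMetadataService. The service sends a SPARQL describe query to the DiGIR2 SPARQL Service, and the RDF response is passed back as the getMetadata() response.]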
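The query step in the diagram can be sketched in Python. This is a minimal sketch, assuming the metadata service simply wraps the requested LSID in a SPARQL DESCRIBE query; the function name is hypothetical and the actual DiGIR2LSIDMetadataService may do more.

```python
def build_describe_query(lsid: str) -> str:
    """Return a SPARQL DESCRIBE query for the resource named by `lsid`.

    Hypothetical helper, not part of DiGIR2 or the IBM LSID toolkit.
    """
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    # Characters that may not appear inside a SPARQL IRI reference <...>.
    if any(ch in lsid for ch in '<>"{}|^`\\ '):
        raise ValueError(f"illegal character in LSID: {lsid!r}")
    return f"DESCRIBE <{lsid}>"


query = build_describe_query("urn:lsid:nhm.ku.edu:Herps:32")
```

The resulting string would be sent to the SPARQL service, and the RDF it returns would become the body of the getMetadata() response.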
Details of Prototype Implementation
• Classes of Data
– Specimens
• Metadata Representation
– RDF in a DarwinCore-inspired RDF-Schema
• Data Representation
– N/A
• Experience with Stack
– IBM Java toolkit
– Great documentation (developerWorks article and Javadoc)
– Very easy to implement and test (4 hours)
• Concerns
– Integration of LSID client into existing software
– SOAP not friendly to non-professional programmers
Conclusion : Resolution Is Easy
Other issues to resolve:
– Developing ontologies
– Mapping databases into RDF
– Finding data to link to
– Repatriating links into existing databases
– Versioning
– Duplicate detection
– Long term archival storage and access
– Data aggregation and caching
– Querying across data from multiple providers
– Annotating someone else’s data without causing contradictions
– Trust
Part 2
A digression into issues
raised by the use of GUIDs
DiGIR2 :: A Semantic Web Publishing System
[Diagram: the DiGIR2 Server. A Synchronizer reads the Data Source and fills the Triple Store; the Public Web Services (Harvest Service, SPARQL Service, LSID Authority) expose its contents.]
• Not a protocol, a general-purpose RDF data provider
• Synchronizer converts source data into RDF which is stored in a triple store
• Multiple services including SPARQL and OAI-PMH allow access to RDF data
Synchronizer
• Synchronizes the triple store with the database
• Builds RDF using:
– a data source
– a data model (RDF-Schema, OWL ontology)
– a mapping program
• Can perform transformations while mapping
• Can perform resource description tracking and versioning
• Standardizes mapping for better support of thematic
networks
Synchronizer :: Mapping and Transformation
[Diagram: mapping function trees applied to a sample source record]
Sample record: catalogNumber = 32, latitude_deg = 30, latitude_min = 9, latitude_sec = 42, prep = E
primaryResourceMapFunction: concatenate("urn:lsid:nhm.ku.edu:Herps:", getVar(catalogNumber))
decimalLatitude: add(getVar(latitude_deg), divide(getVar(latitude_min), 60), divide(getVar(latitude_sec), 3600))
preparation: if(equalsIgnoreCase(getVar(prep), "E"), "ETOH", "Skeleton")
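The mapping functions named on this slide (concatenate, getVar, add, divide, equalsIgnoreCase, if) can be sketched as plain Python functions. This is an illustrative sketch, not the Synchronizer's mapping language: the function names are hypothetical, and the pairing of sample values with column names is an assumption read off the diagram labels.

```python
def get_var(record, name):
    """Look up one column of a source record (the slide's getVar)."""
    return record[name]


def primary_resource(record):
    # concatenate(prefix, getVar(catalogNumber)) builds the specimen's LSID.
    return "urn:lsid:nhm.ku.edu:Herps:" + str(get_var(record, "catalogNumber"))


def decimal_latitude(record):
    # add(deg, divide(min, 60), divide(sec, 3600))
    return (get_var(record, "latitude_deg")
            + get_var(record, "latitude_min") / 60
            + get_var(record, "latitude_sec") / 3600)


def preparation(record):
    # if(equalsIgnoreCase(getVar(prep), "E"), "ETOH", "Skeleton")
    return "ETOH" if str(get_var(record, "prep")).lower() == "e" else "Skeleton"


def map_record(record):
    """Turn one source record into (subject, predicate, object) triples."""
    subject = primary_resource(record)
    return [
        (subject, "rdf:type", "dc:DarwinCoreSpecimen"),
        (subject, "dc:decimalLatitude", decimal_latitude(record)),
        (subject, "dc:preparation", preparation(record)),
    ]


# Sample values as assumed from the diagram labels.
SAMPLE = {"catalogNumber": 32, "latitude_deg": 30, "latitude_min": 9,
          "latitude_sec": 42, "prep": "E"}
triples = map_record(SAMPLE)
```

Keeping each mapping as a small pure function makes it easy to transform values while mapping, which is the point the slide makes.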
Synchronizer :: Versioning and Tracking
Synchronizer output:
urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "30.145"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
Contents of Triple Store:
urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "270.234"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
A new version?
Synchronizer :: Versioning and Tracking
What to do with new versions of resource descriptions?
• First, track them. Record outside of the RDF subsystem that a resource has been created, updated, or deleted (CRUD’d) at a particular date and time
• After that, there are several ways to handle versioning
– No versioning
– Non-persistent versioning
– Persistent versioning
• Each of these affects how clients do searches and how descriptions should be cached and stored remotely.
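The tracking step can be sketched in Python as a comparison between what the Synchronizer produces and what the triple store already holds. This is a minimal sketch with hypothetical names, assuming descriptions are sets of (subject, predicate, object) tuples and that the change log is a plain list kept outside the RDF store.

```python
from datetime import datetime, timezone


def detect_change(stored, incoming):
    """Classify the difference between a stored and an incoming description.

    Returns "create", "update", "delete", or None when nothing changed.
    """
    stored, incoming = set(stored), set(incoming)
    if stored == incoming:
        return None
    if not stored:
        return "create"
    if not incoming:
        return "delete"
    return "update"


def track(log, guid, stored, incoming, when=None):
    """Append (guid, kind, timestamp) to the log if the description changed."""
    kind = detect_change(stored, incoming)
    if kind is not None:
        log.append((guid, kind, when or datetime.now(timezone.utc)))
    return kind


GUID = "urn:lsid:nhm.ku.edu:Herps:32"
STORED = {(GUID, "dc:decimalLatitude", 270.234), (GUID, "dc:preparation", "ETOH")}
INCOMING = {(GUID, "dc:decimalLatitude", 30.145), (GUID, "dc:preparation", "ETOH")}
```

The log records only that a change happened and when; what to do with the old description is left to the versioning scheme.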
Versioning Schemes :: No versioning
• New version replaces old
• No new GUID assigned
• Simplest scheme
• Lose ability to retrieve old versions
• Must have application-level rules to find and remove effective duplicates
Contents of Triple Store (the description under the same GUID is overwritten):
urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "30.145"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
replaces:
urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "270.234"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
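The "no versioning" scheme amounts to overwriting a description in place. A minimal Python sketch, assuming the triple store is a set of (subject, predicate, object) tuples; the function name is hypothetical.

```python
def replace_description(store, guid, new_triples):
    """Drop every triple whose subject is `guid`, then add the new ones.

    The GUID stays the same and the old description is gone for good.
    """
    kept = {t for t in store if t[0] != guid}
    return kept | set(new_triples)


GUID = "urn:lsid:nhm.ku.edu:Herps:32"
store = {
    (GUID, "rdf:type", "dc:DarwinCoreSpecimen"),
    (GUID, "dc:decimalLatitude", 270.234),
    (GUID, "dc:preparation", "ETOH"),
}
store = replace_description(store, GUID, {
    (GUID, "rdf:type", "dc:DarwinCoreSpecimen"),
    (GUID, "dc:decimalLatitude", 30.145),
    (GUID, "dc:preparation", "ETOH"),
})
```

A remote cache that harvested the old description still holds it under the same GUID, which is why the slide calls for application-level rules to find effective duplicates.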
Versioning Schemes :: Non-persistent versioning
• New GUID assigned
• Contents of old description removed
• New and old descriptions related to each other by predicates
• Do not have problems of old versions matching in cache search
• Given old, can find new (inefficient)
• Cannot retrieve old data
Contents of Triple Store:
urn:lsid:nhm.ku.edu:Herps:76 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:76 dc:decimalLatitude "30.145"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:76 dc:preparation "ETOH"^^xsd:String
urn:lsid:nhm.ku.edu:Herps:76 pub:replaces urn:lsid:nhm.ku.edu:Herps:32
urn:lsid:nhm.ku.edu:Herps:32 pub:replacedBy urn:lsid:nhm.ku.edu:Herps:76
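Non-persistent versioning can be sketched in Python as "remove the old contents, add the new description under a new GUID, and leave link triples behind". This is a sketch under the assumption that the store is a set of (subject, predicate, object) tuples and that the caller supplies the new GUID; the function name is hypothetical.

```python
def supersede(store, old_guid, new_guid, new_triples):
    """Replace the old description with a new one under a new GUID.

    Only the pub:replaces / pub:replacedBy links still mention the old GUID.
    """
    kept = {t for t in store if t[0] != old_guid}
    kept |= set(new_triples)
    kept.add((new_guid, "pub:replaces", old_guid))
    kept.add((old_guid, "pub:replacedBy", new_guid))
    return kept


OLD = "urn:lsid:nhm.ku.edu:Herps:32"
NEW = "urn:lsid:nhm.ku.edu:Herps:76"
store = {
    (OLD, "rdf:type", "dc:DarwinCoreSpecimen"),
    (OLD, "dc:decimalLatitude", 270.234),
    (OLD, "dc:preparation", "ETOH"),
}
store = supersede(store, OLD, NEW, {
    (NEW, "rdf:type", "dc:DarwinCoreSpecimen"),
    (NEW, "dc:decimalLatitude", 30.145),
    (NEW, "dc:preparation", "ETOH"),
})
```

After the call, a search on latitude can only match the new GUID, and resolving the old GUID yields nothing but a pointer to its replacement.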
Versioning Schemes :: Persistent versioning
• New GUID assigned
• Old description maintained
• New and old descriptions related to each other by predicates
• Old and new versions can end up in a triple store together
• Given old, can find new (inefficient)
• Can retrieve old
• Lots of triples!
Contents of Triple Store:
urn:lsid:nhm.ku.edu:Herps:76 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:76 dc:decimalLatitude "30.145"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:76 dc:preparation "ETOH"^^xsd:String
urn:lsid:nhm.ku.edu:Herps:76 pub:replaces urn:lsid:nhm.ku.edu:Herps:32
urn:lsid:nhm.ku.edu:Herps:32 rdf:type dc:DarwinCoreSpecimen
urn:lsid:nhm.ku.edu:Herps:32 dc:decimalLatitude "270.234"^^xsd:double
urn:lsid:nhm.ku.edu:Herps:32 dc:preparation "ETOH"^^xsd:String
urn:lsid:nhm.ku.edu:Herps:32 pub:replacedBy urn:lsid:nhm.ku.edu:Herps:76
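Persistent versioning keeps every description and relies on link predicates to get from an old version to the current one. A minimal Python sketch, assuming a set of (subject, predicate, object) tuples and a pub:replacedBy link from each old version to its successor; both function names are hypothetical.

```python
def add_version(store, old_guid, new_guid, new_triples):
    """Keep the old description, add the new one, and link the two."""
    return set(store) | set(new_triples) | {
        (new_guid, "pub:replaces", old_guid),
        (old_guid, "pub:replacedBy", new_guid),
    }


def current_version(store, guid):
    """Follow pub:replacedBy links from `guid` to the newest version."""
    replaced_by = {s: o for s, p, o in store if p == "pub:replacedBy"}
    seen = set()
    # One hop per intermediate version; `seen` guards against link cycles.
    while guid in replaced_by and guid not in seen:
        seen.add(guid)
        guid = replaced_by[guid]
    return guid


OLD = "urn:lsid:nhm.ku.edu:Herps:32"
NEW = "urn:lsid:nhm.ku.edu:Herps:76"
store = add_version(
    {(OLD, "dc:decimalLatitude", 270.234)},
    OLD, NEW,
    {(NEW, "dc:decimalLatitude", 30.145)},
)
```

Walking the chain one hop at a time is what makes "given old, find new" inefficient, and keeping both descriptions is where the extra triples come from.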
Versioning :: Mixed versioning
• At GUID-1, it was stated that different types of information require different versioning policies.
• If implemented, this results in a mix of versioning schemes in the global graph
• Mixed versioning shifts the burden from providers that don’t version to clients (caches, portals, etc.), which have to figure out whether they are getting only current versions or a mix of new and old (effective duplicates)
Versioning :: Some thoughts on identity
• Do GUIDs name things or identify the descriptions of things?
• Non-versioned changes to metadata always change the semantic meaning of the description (regardless of whether or not identity is changed)
• To paraphrase Heraclitus, “Different waters flow in the same river”
• When deciding that a change in a description does not require a change in version, you’re constraining use of your data (you’re interested in the river, I’m interested in the water).
Caching
• Lots of use cases for caching
– Aggregation for inference
– Aggregation as solution to distributed query problem
– Quality of service (response time)
– Redundancy
• Caches should clearly communicate to clients whether the cache holds multiple historical versions of the same description so clients can avoid retrieving effective duplicates
• To support caching, data providers should support a harvesting mechanism
Incremental Harvesting
• Incremental harvesting is more efficient than bulk harvesting because it sends only recent changes
• “Give me all metadata changes since X”
• To support incremental harvesting we need to track type and date of changes (regardless of the versioning policy)
• This adds another set of requirements for data providers
• OAI Protocol for Metadata Harvesting (OAI-PMH)
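The "give me all metadata changes since X" request can be sketched in Python against a change log of (GUID, change type, timestamp) entries. This is an illustration of the idea only, not of OAI-PMH itself; the function name is hypothetical and the log entries and dates are made-up sample data.

```python
from datetime import datetime, timezone


def changes_since(log, since):
    """Return {guid: (kind, when)} with the latest change per GUID after `since`."""
    latest = {}
    for guid, kind, when in sorted(log, key=lambda entry: entry[2]):
        if when > since:
            latest[guid] = (kind, when)  # later entries overwrite earlier ones
    return latest


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up log entries for illustration.
LOG = [
    ("urn:lsid:nhm.ku.edu:Herps:12", "delete", utc(2006, 1, 5)),
    ("urn:lsid:nhm.ku.edu:Herps:32", "create", utc(2006, 1, 10)),
    ("urn:lsid:nhm.ku.edu:Herps:32", "update", utc(2006, 2, 15)),
    ("urn:lsid:nhm.ku.edu:Herps:76", "create", utc(2006, 2, 20)),
]
recent = changes_since(LOG, utc(2006, 2, 1))
```

A harvester would remember the time of its last successful harvest and pass it as `since` on the next run, so only the two February changes are transferred here.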
The Open World
Organization A asserts: urn:lsid:A:ns:1 color "red"
Organization B asserts: urn:lsid:A:ns:1 size "large"
Organization C asserts: urn:lsid:A:ns:1 color "blue"
Merged Graph:
urn:lsid:A:ns:1 color "red"
urn:lsid:A:ns:1 size "large"
urn:lsid:A:ns:1 color "blue"
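The merge shown here is a set union of triples, and the trouble appears once a property that is meant to have one value ends up with two. A minimal Python sketch; the function names are hypothetical, and treating "color" as single-valued is an assumption of the example, since RDF itself does not regard two values as a contradiction.

```python
from collections import defaultdict


def merge(*graphs):
    """Merge graphs by taking the union of their triples."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def conflicts(graph, single_valued):
    """Return {(subject, predicate): values} where a single-valued
    predicate has more than one value for the same subject."""
    values = defaultdict(set)
    for s, p, o in graph:
        if p in single_valued:
            values[(s, p)].add(o)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


ORG_A = {("urn:lsid:A:ns:1", "color", "red")}
ORG_B = {("urn:lsid:A:ns:1", "size", "large")}
ORG_C = {("urn:lsid:A:ns:1", "color", "blue")}

merged = merge(ORG_A, ORG_B, ORG_C)
clashes = conflicts(merged, single_valued={"color"})
```

Organization B's statement merges cleanly, while Organization C's gives the same GUID two colors and shows up in `clashes`.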
The Open World
Two solutions to this problem
• Close the world
– Ignore assertions about GUIDs that don’t originate from the authority
• Narrow the world
– Only allow certain assertions about GUIDs that don’t originate from
the authority
– Accept/reject foreign authority notifications
• Treat everything as an assertion and record who makes it
and what they intend by it
– Named graphs and semantic web publishing warrants
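The first two options can be sketched in Python as filters over assertions tagged with their source. This is a sketch under stated assumptions: each assertion is a (source, subject, predicate, object) tuple, the source is identified by the same string as the authority field of the LSID (urn:lsid:authority:namespace:object), and all function names are hypothetical.

```python
def lsid_authority(lsid):
    """Return the authority field of an LSID."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2]


def from_authority(assertion):
    source, subject = assertion[0], assertion[1]
    return source.lower() == lsid_authority(subject).lower()


def close_the_world(assertions):
    """Keep only assertions made by the authority that issued the GUID."""
    return [a for a in assertions if from_authority(a)]


def narrow_the_world(assertions, allowed_foreign_predicates):
    """Also keep foreign assertions that use an explicitly allowed predicate."""
    return [a for a in assertions
            if from_authority(a) or a[2] in allowed_foreign_predicates]


ASSERTIONS = [
    ("A", "urn:lsid:A:ns:1", "color", "red"),
    ("B", "urn:lsid:A:ns:1", "size", "large"),
    ("C", "urn:lsid:A:ns:1", "color", "blue"),
]
closed = close_the_world(ASSERTIONS)
narrowed = narrow_the_world(ASSERTIONS, allowed_foreign_predicates={"size"})
```

Closing the world keeps only Organization A's statement; narrowing it with "size" allowed also keeps Organization B's, while Organization C's conflicting color is dropped in both cases.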
Provenance, Attribution, and Trust
• Assign GUIDs to resources
• Assign GUIDs to the graphs that contain concise bounded descriptions, resulting in named “description” graphs
• For each description graph, create another named graph that contains information about the assertions made in it
• Second named graph is a “warrant” graph
• Warrant graph contains meta-metadata: an instance of a Warrant class with attributes such as assertedBy
• Carroll and Bizer presented “Semantic Web Publishing using Named Graphs” at the ISWC 2004 Trust Workshop
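The description-graph and warrant-graph pairing can be sketched in Python with plain data structures. This is a loose illustration, not the vocabulary from Carroll and Bizer's paper: the graph GUIDs are made up, the Warrant class carries only the assertedBy attribute mentioned above, and the lookup function is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Warrant:
    graph: str        # GUID of the description graph the warrant is about
    asserted_by: str  # who stands behind the assertions in that graph


# Named description graphs: graph GUID -> set of (subject, predicate, object).
DESCRIPTIONS = {
    "urn:lsid:A:graphs:1": {("urn:lsid:A:ns:1", "color", "red")},
    "urn:lsid:C:graphs:1": {("urn:lsid:A:ns:1", "color", "blue")},
}
WARRANTS = [
    Warrant("urn:lsid:A:graphs:1", "Organization A"),
    Warrant("urn:lsid:C:graphs:1", "Organization C"),
]


def who_says(descriptions, warrants, subject, predicate):
    """List (value, asserted_by) pairs for one property of one resource."""
    asserted_by = {w.graph: w.asserted_by for w in warrants}
    found = []
    for graph_id, triples in descriptions.items():
        for s, p, o in triples:
            if s == subject and p == predicate:
                found.append((o, asserted_by.get(graph_id)))
    return sorted(found, key=lambda pair: str(pair[0]))


claims = who_says(DESCRIPTIONS, WARRANTS, "urn:lsid:A:ns:1", "color")
```

Both colors are kept, each attributed to the organization that asserted it, so a client can decide whom to trust instead of facing an unexplained contradiction.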
Issues with LSIDs and RDF
– Developing ontologies
– Mapping databases into RDF
– Finding data to link to
– Repatriating links into existing databases
– Versioning
– Duplicate detection
– Long term archival storage and access
– Data aggregation and caching
– Querying across data from multiple providers
– Annotating someone else’s data without causing contradictions
– Trust