
The Benefits of Using PIDs, EUDAT 1st Conference

The Benefits of Using PIDs
October 2012
EUDAT 1ST Conference
Barcelona
Larry Lannom
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
http://www.handle.net/
Hourglass Model: Internet
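[Figure: hourglass diagram. Layers, top to bottom:]
• email, WWW, phone
• SMTP, HTTP, RTP…
• TCP, UDP…
• IP
• ethernet, PPP…
• CSMA, async, sonet…
• copper, fiber, radio
[Side labels, top to bottom: Value Added Services; Internet Protocol Suite; Network Technology]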
Hourglass Model: Information Management on Networks
[Figure: hourglass diagram. Recoverable labels, top to bottom:]
• Value Added Services: Custom Clients, Analysis Apps, Citation, Plug-Ins
• Persistent Identifiers: Persistent Reference, Resolution System, Typing
• PID
• Digital Objects
• Data Sources: Data Sets, RDBMS, Files, Local Storage, Cloud, Computed
PIDs: Advantages
• Persistent Identity via Indirection
– Static references into fluid systems over time
• Data on networks moves
• Ownership/responsibility change
• Formats change
– Embedded Ids
• For data object in hand – current state data
– Updates
– New related entities
– Networks of Persistent Links
• Data / metadata links
• Provenance chains
• Inheritance across a broad set of entities
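The indirection idea above can be illustrated with a minimal sketch. The `Registry` class, the identifier `123/456`, and the example.org-style host names are hypothetical and stand in for a real resolution system; the point is only that the cited string stays fixed while the data it resolves to is updated.

```python
class Registry:
    """Toy PID registry: a stable identifier maps to mutable current-state data."""

    def __init__(self):
        self._records = {}

    def register(self, pid, location):
        # Refuse to reuse an identifier, so a PID never changes what it names.
        if pid in self._records:
            raise ValueError(f"{pid} is already registered")
        self._records[pid] = location

    def update(self, pid, new_location):
        # Only the current-state data changes; the identifier string does not.
        if pid not in self._records:
            raise KeyError(pid)
        self._records[pid] = new_location

    def resolve(self, pid):
        return self._records[pid]


registry = Registry()
registry.register("123/456", "http://old-host.example/data/456")

citation = "123/456"  # the static reference embedded in a paper or data set

# The data moves; the citation is untouched and still resolves.
registry.update("123/456", "http://new-host.example/archive/456")
current_location = registry.resolve(citation)
```

The cost side of this design is visible too: someone has to call `update` when the data moves, which is the sustained effort the deck raises under disadvantages.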
PIDs: Disadvantages
• Extra level of effort / cost on creation
– Analysis – what to identify / granularity
– Coordination across organizations
– Maintain resolution system
• Persistence requires sustained effort
– Organizational discipline
– Technology necessary but not sufficient
• Analyze cost/benefit ratio
– Don’t start unless it’s worthwhile
– Is your data worth it?
Requirements: Identifier String
• Not based on any changeable attributes of the entity
– Location
– Ownership
– Any other attribute that may change w/o changing identity
• Opaque, preferably a ‘dumb number’
– A well-known pattern invites assumptions that may be misleading
– Meaningful semantics invite IP wars, language problems
• Unique
– Avoid collisions, referential uncertainty
• Nice to have
– Human-readable
– Cut-able, paste-able
– Fits common systems, e.g., URI specification
• All of the above contributes to persistence
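One way to meet the "opaque" and "unique" requirements is to mint the local part of the identifier from a random value rather than from any attribute of the object. This is a sketch only: the function name `mint_identifier` and the prefix `123` are hypothetical, and a random UUID is just one possible source of a 'dumb number'.

```python
import uuid


def mint_identifier(prefix):
    """Return an opaque identifier under `prefix`.

    The suffix is a random 128-bit value in hex, so it carries no location,
    ownership, or other changeable attribute of the object it will name.
    """
    return f"{prefix}/{uuid.uuid4().hex}"


# Mint a batch and collect them in a set; duplicates would shrink the set.
minted = {mint_identifier("123") for _ in range(1000)}
```

The result is cut-and-paste friendly but not especially human-readable, which reflects the trade-off between the hard requirements and the "nice to have" items on the slide.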
Requirements: Identifier Resolution System
• Reliable
– Redundant, no single points of failure
– Fast enough to not appear broken
• Scalable
– Higher loads managed with more computers
• Flexible
– Adapt to changing computing environments
– Useful to new applications
• Secure
– Resolution must be trusted
– Administration must be trusted
• Transparent
– Part of the plumbing – users don’t need to understand it
• Persistence, again
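The "redundant, no single points of failure" requirement can be sketched from the client side as trying replicas in turn. Everything here is illustrative: the replica functions are hypothetical stand-ins for network calls to mirrored resolution servers, and this is not how any particular resolver client is implemented.

```python
def resolve_with_failover(identifier, replicas):
    """Ask each replica in turn and return the first successful answer.

    Only connection-level failures (OSError and subclasses) trigger failover;
    other errors, such as an unknown identifier, propagate unchanged.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(identifier)
        except OSError as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed for {identifier}")


def broken_replica(identifier):
    # Simulates a mirror that is down.
    raise ConnectionError("replica unreachable")


def healthy_replica(identifier):
    # Simulates a mirror holding the same data as every other mirror.
    return {"123/456": "http://repository.com/getobject?id=123/456"}[identifier]


url = resolve_with_failover("123/456", [broken_replica, healthy_replica])
```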
Handle System
• Provides basic identifier resolution system for Internet
• Go from object name to current state data
• Name can persist over changes in location and other attributes
• Logically a single system, but physically and organizationally
distributed and highly scalable
• Enables association of one or more typed values, e.g., IP address,
public key, URL, with each id
• Optimized for speed and reliability
• Secure resolution with its own PKI as an option
• Open, well-defined protocol and data model
• Provides infrastructure for application domains
• OPTIMIZED TO DO ONE THING ONLY: resolves Handles to
type/value pairs – all other functions reside in the applications
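The "resolve a handle to type/value pairs" model can be pictured with a small in-memory sketch. The record contents, index numbers, and the `resolve` signature below are hypothetical illustrations of the data model described above, not the Handle System protocol or its client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandleValue:
    index: int   # position of the value within the handle record
    type: str    # e.g. "URL" or "EMAIL"
    data: str


# Hypothetical record: one handle with two typed values.
RECORDS = {
    "123/456": [
        HandleValue(1, "URL", "http://repository.com/getobject?id=123/456"),
        HandleValue(2, "EMAIL", "curator@repository.com"),
    ]
}


def resolve(handle, wanted_types=None):
    """Return all values for `handle`, or only those whose type is wanted."""
    values = RECORDS[handle]
    if wanted_types is None:
        return list(values)
    return [value for value in values if value.type in wanted_types]


urls = resolve("123/456", {"URL"})
everything = resolve("123/456")
```

What a client does with the returned values (redirect, verify a key, display metadata) is left to the application, in line with the last bullet above.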
Handle System Usage Examples
• Library of Congress
• IDF (International DOI Foundation)
– CrossRef (scholarly journal consortium, representing >2K publishers & societies)
– DataCite (consortium of 20 members from 12 countries, started by TIB)
– EIDR (Entertainment Identifier Registry)
– mEDRA (Multilingual European DOI Registration Agency)
– R.R. Bowker (bibliographic data – US ISBN)
– Office of Publications of the European Community (OPOCE)
– Institute of Scientific and Technical Information of China (ISTIC)
– Airiti, Inc. (Taiwan)
– Japan Link Center
• DSpace (>1000 institutions)
• OECD (tables and graphs)
• Australian National Data Service (ANDS)
• EPIC (European Persistent Identifier Consortium)
• EUDAT (Collaborative Data Infrastructure project in Europe)
Handle System Usage (September 2012)
• Assigned Prefixes
– DOI – 213, 403
– Other – 1,900
• Handles
– DOI – > 62 M
– Other – additional millions (total per prefix known only to the prefix manager)
• Handle Services
– Global
• Six service sites (three CNRI, one CrossRef, one CNNIC, one
GWDG)
– Locals
• >1500 registered LHS’s
• Traffic
– Global: 75 –100 million per month
– CNRI-run proxy servers: 50 – 100 million per month
Multiple Resolution
• Structured alternatives, e.g., multiple locations, in a single handle value
• Include selection criteria in that same value
• Handle client application, e.g., proxy server, performs evaluation
• Type = 10320/loc; value =
<locations chooseby="locatt, country, weighted">
<location id="0" href="http://abc…" country="gb" weight="0" />
<location id="1" href="http://def…" weight="1" />
<location id="2" href="http://xyz…" weight="1" />
</locations>
• If the user is in the UK they are redirected to http://abc…; if not, then to either http://def… or http://xyz… at random, 50/50
• Currently deployed in CNRI-run proxies and also available in the open source proxy code
• Approach extensible for future selection methods, e.g., chooseby language or other value known to the proxy
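The selection behaviour described on this slide can be sketched as follows. This is an assumed reading of the 'country' and 'weighted' criteria based only on the example above; the 'locatt' criterion is left out, the function names are hypothetical, and the real proxy code may differ. The truncated hrefs are kept exactly as elided on the slide.

```python
import random

# Locations from the slide's example.
LOCATIONS = [
    {"id": "0", "href": "http://abc…", "country": "gb", "weight": 0.0},
    {"id": "1", "href": "http://def…", "weight": 1.0},
    {"id": "2", "href": "http://xyz…", "weight": 1.0},
]


def filter_by_country(candidates, client_country):
    """Keep locations matching the client's country, else country-neutral ones."""
    if client_country is not None:
        matches = [loc for loc in candidates if loc.get("country") == client_country]
        if matches:
            return matches
    neutral = [loc for loc in candidates if "country" not in loc]
    return neutral or candidates


def pick_weighted(candidates, rng):
    """Pick one candidate at random, in proportion to its weight."""
    weights = [loc.get("weight", 1.0) for loc in candidates]
    if sum(weights) <= 0:
        # Assumption: if every remaining weight is zero, pick uniformly.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


def choose_location(locations, client_country, rng=random):
    candidates = filter_by_country(locations, client_country)
    return pick_weighted(candidates, rng)


uk_choice = choose_location(LOCATIONS, "gb")
other_choice = choose_location(LOCATIONS, "us", random.Random(0))
```

A UK client always lands on the `gb` location even though its weight is 0, because the country filter has already narrowed the candidates to one before weights are considered.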
Multiple Resolution "Chooseby"
10.1525/bio.2009.59.5.9
URL
http://caliber.ucpress.net/doi/abs/10.1525/bio.2009.59.5.9
HS_ADMIN
handle=0.na/10.1525; index=200;
[delete hdl,add val,read val,modify val,del admin,add admin,list]
10320/loc
<locations chooseby="locatt, country, weighted">
<location id="1" cr_type="MR-LIST" href="http://mr.crossref.org/iPage?doi=10.1525%2Fbio.2009.59.5.9" weight="1" />
<location id="2" cr_src="unca" label="SECONDARY_BIOONE" cr_type="MR-LIST" href="http://www.bioone.org/doi/full/10.1525/bio.2009.59.5.9" weight="0" />
</locations>
The evaluation falls through the first two criteria, so the proxy uses 'weighted' as the selection criterion.
The first location (http://mr.crossref.org) wins with a weight of 1; the second has a weight of 0.
That location goes to a script on the CrossRef site that builds the page a user sees when resolving the DOI
name as http://dx.doi.org/10.1525/bio.2009.59.5.9. The page is built to include the original URL value
plus the 10320/loc data plus some additional information held by CrossRef.
Multiple Resolution "Chooseby"
The page displayed includes both the original URL and the added BioOne link:
TYPE = URL
VALUE = http://caliber.ucpress.net/doi/abs/10.1525/bio.2009.59.5.9
TYPE = 10320/loc
VALUE = http://www.bioone.org/doi/full/10.1525/bio.2009.59.5.9
Template Handles
• An unlimited number of handles are computed on the fly from a
single registered template
• Re-write rules and delimiter can be defined at the prefix level, e.g.,
use ‘-’ as delimiter and re-write any URL values, e.g., for any handle
under the prefix 123
• Any handle under that prefix can be divided into base and
extension, e.g., 123/456-abc has a base of 123/456 and an
extension of abc. The base is registered as a normal handle.
• The data at 123/456 will then be combined with the extension
string (abc) using the re-write rule
• Resolve “123/456-abc” and get back
http://repository.com/getobject?id=123/456&part=abc
• Resolve “123/456-def” and get back
http://repository.com/getobject?id=123/456&part=def
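The base/extension mechanism can be sketched in a few lines. This assumes that the base handle 123/456 stores the URL `http://repository.com/getobject?id=123/456` and that the re-write rule appends `&part=<extension>`; both are inferred from the two example results above. The function and variable names are hypothetical, and a real template would express the rule in the server's own configuration rather than in Python.

```python
# Registered base handles and their stored URL values (assumed from the example).
BASE_RECORDS = {"123/456": "http://repository.com/getobject?id=123/456"}

DELIMITER = "-"  # defined at the prefix level in the slide's example


def resolve_template(handle):
    """Split a handle into base and extension, then compute the URL."""
    base, separator, extension = handle.partition(DELIMITER)
    url = BASE_RECORDS[base]  # the base is a normally registered handle
    if not separator:
        return url            # no extension: plain resolution of the base
    # Assumed re-write rule: append the extension as a 'part' parameter.
    return f"{url}&part={extension}"


abc_url = resolve_template("123/456-abc")
def_url = resolve_template("123/456-def")
```

Note that any extension string produces an answer here, whether or not such a part exists in the repository.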
Template Handles
• Directly results from modularity of the current implementation
• Backend handle storage is pluggable
• A new storage module allows handles to be computed
• The rest of the handle resolution mechanisms are unchanged,
only the storage module was enhanced
• Any exception handles can be individually registered to over-ride
the template
• Re-write rules at the base level will over-ride the prefix level rules
• Re-write rules use Java regular expression language
• Templates allow handle strings to remain static in reference form
while millions of resolution values can be changed at a single
stroke
• Downside - All handles of the correct form become resolvable,
even if they don’t refer to real data
Offline Signatures
• Handle values can be signed with "offline"
private keys that need not exist on any
Internet-connected machine.
• This additional layer of verification has been
applied to all entries in the Global Handle
Registry.
• Any party that has the authority to create
handle records can use this capability to sign
their handle records.
• There is a simple (but flexible) API for building
handle value digests and signing those digests.
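The digest half of this idea can be sketched with the standard library. This is not the Handle System API or its digest format: the field encoding, the function names, and the sample values are assumptions made for illustration, and the signing step is omitted because it needs an asymmetric-key library and an offline private key.

```python
import hashlib


def value_digest(index, value_type, data):
    """SHA-256 over one handle value; fields are length-prefixed to avoid ambiguity."""
    hasher = hashlib.sha256()
    for field in (str(index), value_type, data):
        encoded = field.encode("utf-8")
        hasher.update(len(encoded).to_bytes(4, "big"))
        hasher.update(encoded)
    return hasher.hexdigest()


def record_digests(values):
    """Map each value index to its digest; this mapping is what would be signed."""
    return {index: value_digest(index, value_type, data)
            for index, value_type, data in values}


def changed_indexes(values, trusted_digests):
    """Return indexes whose current digest differs from the trusted one."""
    current = record_digests(values)
    indexes = set(current) | set(trusted_digests)
    return sorted(i for i in indexes if current.get(i) != trusted_digests.get(i))


# Hypothetical record as (index, type, data) tuples.
values = [
    (1, "URL", "http://repository.com/getobject?id=123/456"),
    (2, "EMAIL", "curator@repository.com"),
]
trusted = record_digests(values)

# Someone alters the URL on the online server; the digest no longer matches.
tampered = [(1, "URL", "http://elsewhere.example/"), values[1]]
```

A verifier that trusts the signature over `trusted` can then detect the altered value without the private key ever having been on a networked machine.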
Handle (DOI) Resolution: Redirect to URL
10.1126/science.169.3946.635
http://dx.doi.org/10.1126/science.169.3946.635
Handle (DOI) Resolution: Open Linked Data
$ curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1126/science.169.3946.635
{
"volume" : "169",
"issue" : "3946",
"DOI" : "10.1126/science.169.3946.635",
"URL" : "http://dx.doi.org/10.1126/science.169.3946.635",
"title" : "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
"container-title" : "Science",
"publisher" : "American Association for the Advancement of Science AAAS (Science)",
"issued" : { "date-parts" : [ [ 1970,8,14 ] ] },
"author" : [ { "family" : "Frank", "given" : "H. S."} ],
"editor" : [],
"page" : "635-641",
"type" : "article-journal"
}
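The same content-negotiated request can be prepared from a program, and the returned metadata used directly. The sketch below builds the request with the Accept header from the curl command but does not send it; the sample record is a subset of the response shown on the slide, and the helper names and citation layout are hypothetical.

```python
import json
from urllib.request import Request

ACCEPT = ("application/rdf+xml;q=0.5, "
          "application/vnd.citationstyles.csl+json;q=1.0")


def build_request(doi):
    """Prepare (but do not send) a content-negotiated request for a DOI."""
    return Request("http://dx.doi.org/" + doi, headers={"Accept": ACCEPT})


def short_citation(csl):
    """Format a one-line citation from a citeproc JSON record."""
    author = csl["author"][0]
    year = csl["issued"]["date-parts"][0][0]
    return "{}, {} ({}). {}. {} {}({}), {}.".format(
        author["family"], author["given"], year, csl["title"],
        csl["container-title"], csl["volume"], csl["issue"], csl["page"])


# Subset of the response shown on the slide.
SAMPLE = json.loads("""
{ "volume": "169", "issue": "3946",
  "title": "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
  "container-title": "Science",
  "issued": { "date-parts": [ [ 1970, 8, 14 ] ] },
  "author": [ { "family": "Frank", "given": "H. S." } ],
  "page": "635-641" }
""")

request = build_request("10.1126/science.169.3946.635")
citation = short_citation(SAMPLE)
```

Passing `request` to `urllib.request.urlopen` would perform the actual lookup; that step is left out so the sketch runs without network access.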