The Benefits of Using PIDs
October 2012
EUDAT 1ST Conference, Barcelona
Larry Lannom
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
http://www.handle.net/

Hourglass Model: Internet
[Figure: hourglass diagram. Labels recovered from the figure: email, WWW, phone; SMTP, HTTP, RTP…; TCP, UDP…; IP; ethernet, PPP…; CSMA, async, sonet…; copper, fiber, radio. Side labels: Value Added Services, Internet Protocol Suite, Network Technology.]

Hourglass Model: Information Management on Networks
[Figure: hourglass diagram. Labels recovered from the figure: Custom Clients, Analysis Apps, Citation, Plug-Ins; Resolution System, Typing; PID; Digital Objects, Data Sets; RDBMS, Files, Local Storage, Cloud, Computed. Side labels: Persistent Reference, Value Added Services, Persistent Identifiers, Data Sources.]

PIDs: Advantages
• Persistent Identity via Indirection
  – Static references into fluid systems over time
    • Data on networks moves
    • Ownership/responsibility change
    • Formats change
  – Embedded Ids
    • For data object in hand
      – Current state data
      – Updates
      – New related entities
  – Networks of Persistent Links
    • Data / metadata links
    • Provenance chains
    • Inheritance across a broad set of entities

PIDs: Disadvantages
• Extra level of effort / cost on creation
  – Analysis: what to identify / granularity
  – Coordination across organizations
  – Maintain resolution system
• Persistence requires sustained effort
  – Organizational discipline
  – Technology necessary but not sufficient
• Analyze cost/benefit ratio
  – Don’t start unless it’s worthwhile
  – Is your data worth it?

Requirements: Identifier String
• Not based on any changeable attributes of the entity
  – Location
  – Ownership
  – Any other attribute that may change w/o changing identity
• Opaque, preferably a ‘dumb number’
  – A well known pattern invites assumptions that may be misleading
  – Meaningful semantics invite IP wars, language problems
• Unique
  – Avoid collisions, referential uncertainty
• Nice to have
  – Human-readable
  – Cut-able, paste-able
  – Fits common systems, e.g., URI specification
• All of the above contributes to persistence

Requirements: Identifier Resolution System
• Reliable
  – Redundant, no single points of failure
  – Fast enough to not appear broken
• Scalable
  – Higher loads managed with more computers
• Flexible
  – Adapt to changing computing environments
  – Useful to new applications
• Secure
  – Resolution must be trusted
  – Administration must be trusted
• Transparent
  – Part of the plumbing: users don’t need to understand it
• Persistence, again

Handle System
• Provides basic identifier resolution system for Internet
• Go from object name to current state data
• Name can persist over changes in location and other attributes
• Logically a single system, but physically and organizationally distributed and highly scalable
• Enables association of one or more typed values, e.g., IP address, public key, URL, with each id
• Optimized for speed and reliability
• Secure resolution with its own PKI as an option
• Open, well-defined protocol and data model
• Provides infrastructure for application domains
• OPTIMIZED TO DO ONE THING ONLY: resolves Handles to type/value pairs; all other functions reside in the applications

Handle System Usage Examples
• Library of Congress
• IDF (International DOI Foundation)
  – CrossRef (scholarly journal consortium, representing >2K publishers & societies)
  – DataCite (consortium of 20 members from 12 countries, started by TIB)
  – EIDR (Entertainment Identifier Registry)
  – mEDRA (Multilingual European DOI Registration Agency)
  – R.R. Bowker (bibliographic data, US ISBN)
  – Office of Publications of the European Community (OPOCE)
  – Institute of Scientific and Technical Information of China (ISTIC)
  – Airiti, Inc. (Taiwan)
  – Japan Link Center
• DSpace (>1000 institutions)
• OECD (tables and graphs)
• Australian National Data Service (ANDS)
• EPIC (European Persistent Identifier Consortium)
• EUDAT (Collaborative Data Infrastructure project in Europe)

Handle System Usage (September 2012)
• Assigned Prefixes
  – DOI: 213, 403
  – Other: 1,900
• Handles
  – DOI: > 62 M
  – Other: additional millions (total per prefix known only to prefix manager)
• Handle Services
  – Global
    • Six service sites (three CNRI, one CrossRef, one CNNIC, one GWDG)
  – Locals
    • >1500 registered LHS’s
• Traffic
  – Global: 75–100 million per month
  – CNRI-run proxy servers: 50–100 million per month

Multiple Resolution
• Structured alternatives, e.g., multiple locations, in a single handle value
• Include selection criteria in that same value
• Handle client application, e.g., proxy server, performs evaluation
• Type = 10320/loc; value =
    <locations chooseby="locatt, country, weighted">
      <location id="0" href="http://abc…" country="gb" weight="0" />
      <location id="1" href="http://def…" weight="1" />
      <location id="2" href="http://xyz…" weight="1" />
    </locations>
• If the user is in the UK they are redirected to http://abc…; if not, then to either http://def... or http://xyz...
  at random, 50/50
• Currently deployed in CNRI-run proxies and also available in the open source proxy code
• Approach extensible for future selection methods, e.g., chooseby language or other value known to the proxy
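
The chooseby evaluation described for 10320/loc values can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proxy's actual code: the function name choose_location, the example.org URLs, and the treatment of locatt as an attribute/value pair requested by the client are all hypothetical. It tries each criterion in the order listed in chooseby and falls through to the next one when a criterion does not apply.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the 10320/loc value on the slide.
EXAMPLE = """
<locations chooseby="locatt, country, weighted">
  <location id="0" href="http://gb.example.org/" country="gb" weight="0" />
  <location id="1" href="http://mirror1.example.org/" weight="1" />
  <location id="2" href="http://mirror2.example.org/" weight="1" />
</locations>
"""


def choose_location(xml_value, client_country=None, locatt=None, rng=random):
    """Return one href, trying each chooseby criterion in order.

    locatt is an assumed (attribute, value) pair requested by the client,
    e.g. ("id", "2"); client_country is an assumed two-letter country code.
    """
    root = ET.fromstring(xml_value)
    locations = root.findall("location")
    criteria = [c.strip() for c in root.get("chooseby", "").split(",") if c.strip()]

    for criterion in criteria:
        matches = []
        if criterion == "locatt" and locatt is not None:
            key, wanted = locatt
            matches = [loc for loc in locations if loc.get(key) == wanted]
        elif criterion == "country" and client_country is not None:
            matches = [
                loc for loc in locations
                if (loc.get("country") or "").lower() == client_country.lower()
            ]
        elif criterion == "weighted":
            weights = [float(loc.get("weight", "1")) for loc in locations]
            if sum(weights) > 0:
                # Weighted random pick; a weight of 0 is never selected here.
                return rng.choices(locations, weights=weights, k=1)[0].get("href")
        if matches:
            return matches[0].get("href")
    return None  # no criterion applied


if __name__ == "__main__":
    print(choose_location(EXAMPLE, client_country="gb"))  # the UK-specific location
    print(choose_location(EXAMPLE, client_country="us"))  # one of the two mirrors
```

With equal weights of 1 on the two mirrors and 0 on the UK location, a non-UK client is sent to either mirror with equal probability, which matches the 50/50 behaviour described on the slide.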

Multiple Resolution "Chooseby"
10.1525/bio.2009.59.5.9
  URL       http://caliber.ucpress.net/doi/abs/10.1525/bio.2009.59.5.9
  HS_ADMIN  handle=0.na/10.1525; index=200; [delete hdl,add val,read val,modify val,del admin,add admin,list]
  10320/loc
    <locations chooseby="locatt, country, weighted">
      <location id="1" cr_type="MR-LIST" href="http://mr.crossref.org/iPage?doi=10.1525%2Fbio.2009.59.5.9" weight="1" />
      <location id="2" cr_src="unca" label="SECONDARY_BIOONE" cr_type="MR-LIST" href="http://www.bioone.org/doi/full/10.1525/bio.2009.59.5.9" weight="0" />
    </locations>

The evaluation falls through the first two criteria and the proxy uses 'weighted' as the selection criterion. The first location (http://mr.crossref.org) wins with a weight of 1. That location goes to a script on the CrossRef site that builds the page a user sees when resolving the DOI name as http://dx.doi.org/10.1525/bio.2009.59.5.9. The page is built to include the original URL value plus the 10320/loc data plus some additional information held by CrossRef.

Multiple Resolution "Chooseby"
The page displayed includes both the original URL and the added BioOne link:
  TYPE = URL        VALUE = http://caliber.ucpress.net/doi/abs/10.1525/bio.2009.59.5.9
  TYPE = 10320/loc  VALUE = http://www.bioone.org/doi/full/10.1525/bio.2009.59.5.9

Template Handles
• An unlimited number of handles are computed on the fly from a single registered template
• Re-write rules and a delimiter can be defined at the prefix level, e.g., for any handle under the prefix 123, use ‘-’ as the delimiter and re-write any URL values
• Any handle under that prefix can be divided into base and extension, e.g., 123/456-abc has a base of 123/456 and an extension of abc. The base is registered as a normal handle.
• The data at 123/456 will then be combined with the extension string (abc) using the re-write rule
• Resolve “123/456-abc” and get back http://repository.com/getobject?id=123/456&part=abc
• Resolve “123/456-def” and get back http://repository.com/getobject?id=123/456&part=def

Template Handles
• Directly results from modularity of the current implementation
• Backend handle storage is pluggable
• A new storage module allows handles to be computed
• The rest of the handle resolution mechanisms are unchanged; only the storage module was enhanced
• Any exception handles can be individually registered to over-ride the template
• Re-write rules at the base level will over-ride the prefix level rules
• Re-write rules use the Java regular expression language
• Templates allow handle strings to remain static in reference form while millions of resolution values can be changed at a single stroke
• Downside: all handles of the correct form become resolvable, even if they don’t refer to real data

Offline Signatures
• Handle values can be signed with "offline" private keys that need not exist on any Internet-connected machine.
• This additional layer of verification has been applied to all entries in the Global Handle Registry.
• Any party that has the authority to create handle records can use this capability to sign their handle records.
• There is a simple (but flexible) API for building handle value digests and signing those digests.
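
Returning to the Template Handles slides, the base/extension split and re-write step can be sketched as below. This is a toy model under stated assumptions, not the Handle System's template syntax or storage module: the REGISTERED dictionary, the resolve function, the '&part=' rule, and the "123/456-special" exception entry are hypothetical, and Python's re module stands in for the Java regular expressions the slides mention.

```python
import re

# Hypothetical registry: the base handle is registered as a normal handle,
# and one exception handle is registered individually to over-ride the template.
REGISTERED = {
    "123/456": "http://repository.com/getobject?id=123/456",
    "123/456-special": "http://repository.com/special",
}

DELIMITER = "-"  # assumed to be defined at the prefix level

# Assumed re-write rule: take the whole stored URL and append the extension.
REWRITE_PATTERN = re.compile(r"^(.*)$")


def resolve(handle):
    """Resolve a handle, computing template handles on the fly."""
    # Individually registered handles (including exceptions) win.
    if handle in REGISTERED:
        return REGISTERED[handle]

    # Otherwise split into base and extension at the first delimiter.
    base, sep, extension = handle.partition(DELIMITER)
    if not sep or base not in REGISTERED:
        return None

    # Combine the data stored at the base with the extension string.
    return REWRITE_PATTERN.sub(
        lambda m: m.group(1) + "&part=" + extension,
        REGISTERED[base],
        count=1,
    )


if __name__ == "__main__":
    print(resolve("123/456-abc"))
    print(resolve("123/456-def"))
    print(resolve("123/456-no-such-part"))  # still resolves: the stated downside
```

Changing the single value stored at 123/456 changes what every computed handle under it resolves to, which is the "single stroke" benefit; the last call shows the downside, since any well-formed extension resolves whether or not real data exists behind it.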

Handle (DOI) Resolution: Redirect to URL
10.1126/science.169.3946.635
http://dx.doi.org/10.1126/science.169.3946.635

Handle (DOI) Resolution: Open Linked Data
$ curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1126/science.169.3946.635
{
  "volume" : "169",
  "issue" : "3946",
  "DOI" : "10.1126/science.169.3946.635",
  "URL" : "http://dx.doi.org/10.1126/science.169.3946.635",
  "title" : "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
  "container-title" : "Science",
  "publisher" : "American Association for the Advancement of Science AAAS (Science)",
  "issued" : { "date-parts" : [ [ 1970,8,14 ] ] },
  "author" : [ { "family" : "Frank", "given" : "H. S."} ],
  "editor" : [],
  "page" : "635-641",
  "type" : "article-journal"
}
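
The same content-negotiated request can be built from a program. The sketch below uses only the Python standard library; the helper names build_request and summarize and the citation layout are hypothetical, and the sample record is a trimmed copy of the JSON shown on the slide. No network call is made unless the commented-out lines are enabled.

```python
import json
import urllib.request

# Same Accept header as the curl example: prefer CSL JSON, accept RDF/XML.
ACCEPT = "application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0"

# Trimmed copy of the record shown on the slide, used for offline parsing.
SAMPLE = """
{
  "volume": "169", "issue": "3946",
  "DOI": "10.1126/science.169.3946.635",
  "title": "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
  "container-title": "Science",
  "issued": {"date-parts": [[1970, 8, 14]]},
  "author": [{"family": "Frank", "given": "H. S."}],
  "page": "635-641", "type": "article-journal"
}
"""


def build_request(doi, resolver="http://dx.doi.org/"):
    """Build (but do not send) a request asking the resolver for metadata."""
    return urllib.request.Request(resolver + doi, headers={"Accept": ACCEPT})


def summarize(csl_text):
    """Turn a CSL JSON record into a one-line citation (layout is illustrative)."""
    record = json.loads(csl_text)
    year = record["issued"]["date-parts"][0][0]
    authors = ", ".join(f'{a["given"]} {a["family"]}' for a in record.get("author", []))
    return (
        f'{authors} ({year}). {record["title"]}. '
        f'{record["container-title"]} {record["volume"]}({record["issue"]}), {record["page"]}.'
    )


if __name__ == "__main__":
    request = build_request("10.1126/science.169.3946.635")
    print(request.full_url, request.get_header("Accept"))
    print(summarize(SAMPLE))
    # To query the live resolver (urlopen follows redirects, like curl -L):
    # with urllib.request.urlopen(request) as response:
    #     print(summarize(response.read().decode("utf-8")))
```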