New Digital Library Possibilities Using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

Michael L. Nelson
Old Dominion University
Norfolk, Virginia, USA
mln@cs.odu.edu
http://www.cs.odu.edu/~mln/icsep/

International Conference on Scientific Electronic Publishing in Developing Countries
Valparaiso, Chile
October 2, 2002

Several Slides Also from Van de Sompel & Warner

Random Thoughts
1. Thanks to the Organizing Committee for inviting me
2. I wish I had paid attention in my high school Spanish classes…
3. Publishers & Editors: if you want increased coverage, exposure and readership, you must “do” OAI…

Outline
• OAI-PMH history and technical highlights
  – a full technical review is out of the scope of this presentation
• Example data provider use
• Example service provider uses
• Implications for authors and editors
• Looking to the future

Open Archives Initiative
• The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
• Archive defined as a “collection of stuff”, not the archivist’s definition of “archive”. “Repository” is used in most OAI documents.
• OAI is happening at break-neck speed...

The Rise and Fall of Distributed Searching
• wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  – Davis & Lagoze, JASIS 51(3), pp. 273-80
  – Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable, but only for small values of N
• NCSTRL: N > 100; bad
• NTRS/NIX: N <= 20; ok (but could be better)

The Rise and Fall of Distributed Searching
• Other problems of distributed searching (from STARTS)
  – source-metadata problem
    • how do you know which nodes to search?
  – query-language problem
    • syntax varies and drifts over time between the various nodes
  – rank-merging problem
    • how do you meaningfully merge multiple result sets?
• Temptations:
  – centralize all functions
    • “everything will be done at X”
  – standardize on a single product
    • “everyone will use system Y”

Santa Fe Convention [02/2000]
• goal: optimize discovery of e-prints
  http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
• input:
  – the UPS prototype
    http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
  – RePEc / SODA “data provider / service provider model”
  – Dienst protocol
  – deliberations at Santa Fe meeting [10/99]

Data and Service Providers
• Data Providers
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Service Providers
  – harvest metadata from providers
  – implement user interface to data
• Self-describing archives
  – Much of the learning about the constituent UPS archives occurred out of band…
Even if these are done by the same DL, these are distinct roles

Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – data remains at remote repositories

[Diagram: a user searches for “cfd applications” against a local copy of metadata; all searching, browsing, etc. is performed on the metadata here. Metadata is harvested offline from each node; individual nodes can still support direct user interaction.]
each node independently maintained

OAI-PMH v.1.0 [01/2001]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months

            Santa Fe convention   OAI-PMH v.1.0/1.1         OAI-PMH v.2.0
nature      experimental          experimental              stable
verbs       Dienst                OAI-PMH                   OAI-PMH
requests    HTTP GET/POST         HTTP GET/POST             HTTP GET/POST
responses   XML                   XML                       XML
transport   HTTP                  HTTP                      HTTP
metadata    OAMS                  unqualified Dublin Core   unqualified Dublin Core
about       eprints               document like objects     resources
model       metadata harvesting   metadata harvesting       metadata harvesting

OAI-PMH 2.0
• Good news: OAI-PMH is still Six Verbs + Dublin Core
• Incremental improvements
  – single XML schema
  – ambiguities removed
  – more expressive options
  – cleaner separation of roles & responsibilities
• Bad news: not backwards compatible with 1.1

Dublin Core
• Dublin Core Metadata Initiative
  – http://www.dublincore.org/
  – from 1994-1995, recognizing the need for simple, interoperable metadata for resource discovery
  – good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
  – 15 elements (qualifiers possible)
    Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights

OAI Mechanics
• Request is encoded in HTTP
• Response is encoded in XML
• XML Schemas for the responses are defined in the OAI-PMH document

Overview of OAI-PMH Verbs
Verb                  Function
metadata about the repository:
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
harvesting verbs:
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

protocol vs periphery
• clear distinction between
protocol and periphery
• fixed protocol document
• extensible implementation guidelines:
  – e.g. sample metadata formats, description containers, about containers
• allows for OAI guidelines and community guidelines

OAI-PMH vs HTTP
• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
  – all OK at HTTP level? => 200 OK
  – something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
• HTTP codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events

resource – item – record
[Diagram: a resource; its item, “all available metadata about David”; and the item's records: Dublin Core metadata, MARC metadata, SPECTRUM metadata]
• item = identifier
• record = identifier + metadata format + datestamp
• set-membership is item-level property

other general changes
• better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting
• oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core
• usage of must, must not etc. as in RFC2119
• wording on response compression

other general changes
• all protocol responses can be validated with a single XML Schema
  – easier for data providers
  – no redundancy in type definitions
  – SOAP-ready
  – clean for error handling

response no errors
note: no http encoding of the OAI-PMH request
  <?xml version="1.0" encoding="UTF-8"?>
  <OAI-PMH>
    <responseDate>2002-02-08T08:55:46Z</responseDate>
    <request verb="GetRecord"… …>http://arXiv.org/oai2</request>
    <GetRecord>
      <record>
        <header>
          <identifier>oai:arXiv:cs/0112017</identifier>
          <datestamp>2001-12-14</datestamp>
          <setSpec>cs</setSpec>
          <setSpec>math</setSpec>
        </header>
        <metadata>
          …..
        </metadata>
      </record>
    </GetRecord>
  </OAI-PMH>

response with error
  <?xml version="1.0" encoding="UTF-8"?>
  <OAI-PMH>
    <responseDate>2002-02-08T08:55:46Z</responseDate>
    <request>http://arXiv.org/oai2</request>
    <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
  </OAI-PMH>
with errors, only the correct attributes are echoed in <request>

resumptionToken
scenario: harvesting 2770 records in 3 separate “chunks” of up to 1000 records (repository backed by an RDBMS)
  harvester request:     ListRecords
  repository response:   Records 1-1000, resumptionToken=AXad31
  harvester request:     ListRecords, resumptionToken=AXad31
  repository response:   Records 1001-2000, resumptionToken=pQ22-x
  harvester request:     ListRecords, resumptionToken=pQ22-x
  repository response:   Records 2001-2770

resumptionToken
• idempotency of resumptionToken: return same incomplete list when rT is reissued
  – while no changes occur in the repo: strict
  – while changes occur in the repo: all items with unchanged datestamp
• new, optional attributes for the resumptionToken:
  – expirationDate
  – completeListSize
  – cursor

harvesting granularity
• harvesting granularity
  – mandatory support of YYYY-MM-DD
  – optional support of YYYY-MM-DDThh:mm:ssZ
  – other granularities considered, but ultimately rejected
• granularity of from and until must be the same

Identify
• Identify more expressive
  <Identify>
    <repositoryName>Library of Congress 1</repositoryName>
    <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail>
    <adminEmail>rgillian@visi.net</adminEmail>
    <deletedRecord>transient</deletedRecord>
    <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
    <compression>deflate</compression>

header
• header contains set membership of item
  <record>
    <header>
      <identifier>oai:arXiv:cs/0112017</identifier>
      <datestamp>2001-12-14</datestamp>
      <setSpec>cs</setSpec>
      <setSpec>math</setSpec>
    </header>
    <metadata>
      …..
    </metadata>
  </record>
eliminates the need for the “double harvest” 1.x required to get all records and all set information

ListIdentifiers
• ListIdentifiers returns headers
  <?xml version="1.0" encoding="UTF-8"?>
  <OAI-PMH>
    <responseDate>2002-02-08T08:55:46Z</responseDate>
    <request verb="…" …>http://arXiv.org/oai2</request>
    <ListIdentifiers>
      <header>
        <identifier>oai:arXiv:hep-th/9801001</identifier>
        <datestamp>1999-02-23</datestamp>
        <setSpec>physic:hep</setSpec>
      </header>
      <header>
        <identifier>oai:arXiv:hep-th/9801002</identifier>
        <datestamp>1999-03-20</datestamp>
        <setSpec>physic:hep</setSpec>
        <setSpec>physic:exp</setSpec>
      </header>
      ……

provenance
• introduction of provenance container to facilitate tracing of harvesting history
  <about>
    <provenance>
      <originDescription>
        <baseURL>http://an.oa.org</baseURL>
        <identifier>oai:r1:plog/9801001</identifier>
        <datestamp>2001-08-13T13:00:02Z</datestamp>
        <metadataPrefix>oai_dc</metadataPrefix>
        <harvestDate>2001-08-15T12:01:30Z</harvestDate>
        <originDescription>
          … … …
        </originDescription>
      </originDescription>
    </provenance>
  </about>

friends
• introduction of friends container to facilitate discovery of repositories
  <description>
    <friends>
      <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
      <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
      <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
      <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
    </friends>
  </description>

NASA <friends> example (1)
• A lightweight, DP-centric method to communicate the existence of “others”
  http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify
  ..
  <description>
    <friends ..namespace stuff..>
      <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
      <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
      <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
      <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
    </friends>
  </description>
  ..
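
A minimal sketch of how a harvester might use the friends container to discover further repositories. The function names, the trimmed sample response, and the http://a.example/oai starting URL are hypothetical; the friends namespace URI is an assumption filled in for the elided “namespace stuff” on the slide; and the fetch step is injected as a function so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the friends container (elided on the slide).
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"

# Trimmed, hypothetical Identify response; a real one carries more elements.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Hypothetical sample repository</repositoryName>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
        <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>"""


def extract_friends(identify_xml):
    """Return the baseURLs listed in any friends container of an Identify response."""
    root = ET.fromstring(identify_xml)
    tag = "{" + FRIENDS_NS + "}"
    path = ".//" + tag + "friends/" + tag + "baseURL"
    return [el.text.strip() for el in root.findall(path) if el.text]


def discover(start_url, fetch_identify):
    """Breadth-first walk over friends links, visiting each base URL once.

    fetch_identify(base_url) is a caller-supplied function that returns the
    Identify response as a string, or None if the repository is unreachable.
    """
    seen = []
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        xml_text = fetch_identify(url)
        if xml_text is None:
            continue
        queue.extend(extract_friends(xml_text))
    return seen


# Usage with a fake "network": only the starting repository answers.
fake_network = {"http://a.example/oai": SAMPLE_IDENTIFY}
found = discover("http://a.example/oai", fake_network.get)
```

In a real harvester, fetch_identify would issue an HTTP GET for base URL + "?verb=Identify"; keeping it as a parameter keeps the discovery logic separate from transport concerns.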
NASA <friends> example (2)
[Diagram: a harvester sends Identify to http://techreports.larc.nasa.gov/ltrs/oai2.0/ and receives <friends>…</friends>, which points it to:]
  http://naca.larc.nasa.gov/oai2.0/
  http://ston.jsc.nasa.gov/collections/TRS/oai/
  http://ntrs.nasa.gov/oai2.0/
  http://horus.riacs.edu/perl/oai/

branding
• introduction of branding container for DPs to suggest rendering & association hints
  <branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/ http://www.openarchives.org/OAI/2.0/branding.xsd">
    <collectionIcon>
      <url>http://my.site/icon.png</url>
      <link>http://my.site/homepage.html</link>
      <title>MySite(tm)</title>
      <width>88</width>
      <height>31</height>
    </collectionIcon>
    <metadataRendering metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/" mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
    <metadataRendering metadataNamespace="http://another.place/MARC" mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
  </branding>

oai-identifier
• revision of oai-identifier
  <description>
    <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
      <scheme>oai</scheme>
      <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
      <delimiter>:</delimiter>
      <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
    </oai-identifier>
  </description>
domain based repository names

did not make it into OAI-PMH v.2.0
• SOAP implementation
• Result set filtering
• Multiple / “best” metadata
• GetRecord -> GetRecords
• Machine readable rights management
• XML format for “mini-archives”

So What Does OAI-PMH Mean for Your Digital Library?
• Resources on DL projects are typically spent in 2 areas:
  – creating & maintaining the collection
    • data provider
  – developing access services for the collection (searching, browsing, etc.)
    • service provider
• OAI-PMH allows for specialization based on resources / interest

NACA Report 1345 as seen through its native DL
http://naca.larc.nasa.gov/

NACA Report 1345 as seen through MAGiC
http://www.magic.ac.uk/

NACA Report 1345 as seen through Scirus (Elsevier)
http://www.scirus.com/

NACA Report 1345 as seen through my.OAI (FS Consulting)
http://www.myoai.com/

Scientific Communication
• With only some exceptions, which interface is used for discovery is not as important as the fact that discovery occurred in the first place…
  – “control” of the discovered objects is not “lost” by data providers
    • however, higher level mirroring services can be built on top of OAI (cf. NACA & ARC mirroring between NASA LaRC and MAGiC)
• The real power of OAI-PMH derives as much from what it does not do as what it actually does

What Does OAI-PMH Mean for Authors?
• On the surface, absolutely nothing!
  – the ideal OAI deployment should be absolutely invisible to normal DL operations
  – uninterested users should not even notice or care
• Indirectly, they should enjoy the benefits of the critical mass of current and developing DL tools & systems
  – personal, institutional data providers
  – proliferation of targeted, value-added service providers

What Does OAI-PMH Mean For Editors?
• Absolutely everything…
• The decoupling of SPs and DPs will have significant and profound implications for scientific and technical information exchange
  – OAI-PMH is actually just one component in a larger engineering effort for scholarly communication (e.g. OpenURL)
• Service and resource integration will be the focus of journals, professional societies, universities, etc.
  – OAI-PMH will be as basic and core a technology for scientific publishing as HTTP & XML

Field of Dreams
• It should be easy to be a data provider, even if it makes more work for the service provider.
  – if enough data providers exist, the service providers will come (DPs >> SPs)
• Open-source / freely available tools
  – “drop-in” data providers:
    • industrial strength: http://www.eprints.org/
    • personal size: http://kepler.cs.odu.edu/
  – tools to make your existing DL a data provider:
    • http://www.openarchives.org/tools/tools.htm
    • also: OAI-implementers mailing list / mail archive!
  – service providers:
    • Arc: http://sourceforge.net/projects/oaiarc/

OAI Observation: Front-End Only
• No input/registry mechanism
  – OAI harvesting protocol is always a front-end for something else
    • filesystem, Dienst, RDBMS, LDAP, etc.
  – convenient for pre-existing DLs, but does not address “new” DLs
    • e.g., “we want to do OAI”
• Bounds the scope of OAI
  – responsibilities and domain of OAI are still being discussed
  – tension between functionality and simplicity

OAI Observation: No T&C
• Possible to use multiple OAI servers in a DMZ-like configuration…
[Diagram: OAI requests from arbitrary hosts go to a Public OAI Server; OAI requests from trusted hosts go to a Private OAI Server; both draw on the source database]
could even use a separate copy of the database…

OAI Observation: No T&C
• Possible to use OAI harvesting protocol in closed, restricted systems
[Diagram: four DLs labeled OAI 1, OAI 2, OAI 3, OAI 4; all OAI requests originate from these 4 DLs]

Metadata
• Q: “Which format should I use?”
• A: any/all of them…
  – lowest common denominator: unqualified Dublin Core
  – Again, little known about actual behavior
    • will DC actually be useful? or too lossy?
    • will communities create/adopt specific formats?
    • will native (presumably richer) formats be harvested? “The Return of MARC” ?!
we very much want this to happen...
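
The “too lossy?” question can be made concrete with a small sketch. Everything here is hypothetical: the native field names, the crosswalk table, and the sample record are invented for illustration and do not describe any particular repository's metadata.

```python
# Hypothetical crosswalk from a richer native record to the 15 unqualified
# Dublin Core elements. Field names on the left are invented for this sketch.
CROSSWALK = {
    "title": "title",
    "authors": "creator",
    "keywords": "subject",
    "abstract": "description",
    "issuing_org": "publisher",
    "date_published": "date",
    "date_received": "date",
    "report_number": "identifier",
    "url": "identifier",
}


def to_unqualified_dc(native):
    """Map a native record to unqualified DC.

    Returns (dc, dropped): dc maps each DC element to a list of values;
    dropped lists native fields that have no DC counterpart in CROSSWALK.
    """
    dc, dropped = {}, []
    for field, value in native.items():
        element = CROSSWALK.get(field)
        if element is None:
            dropped.append(field)
            continue
        values = value if isinstance(value, list) else [value]
        dc.setdefault(element, []).extend(str(v) for v in values)
    return dc, dropped


# Hypothetical native record.
NATIVE = {
    "title": "A hypothetical technical report",
    "authors": ["Author One", "Author Two"],
    "date_published": "1958-01-01",
    "date_received": "1957-06-01",
    "report_number": "TR-0000",
    "facility": "Hypothetical wind tunnel",
}

dc, dropped = to_unqualified_dc(NATIVE)
```

Two kinds of loss show up in this sketch: the facility field has no DC element and is dropped, and the two dates both land in date with no indication of which is which. Whether that matters depends on the service being built, which is why richer community formats can be offered alongside oai_dc.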
The Future: Community Building
• Ultimately, protocols and metadata formats are not what makes a difference
• Rather, what matters is the critical mass afforded by a common set of utilities (cf. HTTP, Dublin Core, XML)
• The best current example: The Open Language Archives Community
  – http://www.language-archives.org
• OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends

http://www.openarchives.org
openarchives@openarchives.org

Backup Slides

Detailed Review of the OAI-PMH 2.0 Verbs

Identify
• 1.1
  – Arguments: none
  – Errors: none
• 2.0
  – Arguments: none
  – Errors: badArgument

ListMetadataFormats
• 1.1
  – Arguments: identifier (OPTIONAL)
  – Errors: id does not exist
• 2.0
  – Arguments: identifier (OPTIONAL)
  – Errors: badArgument, noMetadataFormats, idDoesNotExist

ListSets
• 1.1
  – Arguments: resumptionToken (EXCLUSIVE)
  – Errors: no set hierarchy
• 2.0
  – Arguments: resumptionToken (EXCLUSIVE)
  – Errors: badArgument, badResumptionToken, noSetHierarchy

ListIdentifiers
• 1.1
  – Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE)
  – Errors: no records match
• 2.0
  – Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED)
  – Errors: badArgument, cannotDisseminateFormat, badResumptionToken, noSetHierarchy, noRecordsMatch

ListRecords
• 1.1
  – Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED)
  – Errors: no records match, metadata format cannot be disseminated
• 2.0
  – Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED)
  – Errors: noRecordsMatch, cannotDisseminateFormat, badResumptionToken, noSetHierarchy, badArgument

GetRecord
• 1.1
  – Arguments: identifier (REQUIRED), metadataPrefix (REQUIRED)
  – Errors: id does not exist, metadata format cannot be disseminated
• 2.0
  – Arguments: identifier (REQUIRED), metadataPrefix (REQUIRED)
  – Errors: badArgument, cannotDisseminateFormat, idDoesNotExist

Argument Summary
Verb                 metadataPrefix  from      until     set       resumptionToken  identifier
Identify             -               -         -         -         -                -
ListMetadataFormats  -               -         -         -         -                optional
ListSets             -               -         -         -         exclusive        -
ListIdentifiers      required        optional  optional  optional  exclusive        -
ListRecords          required        optional  optional  optional  exclusive        -
GetRecord            required        -         -         -         -                required

Error Summary
Verb                 Errors
Identify             BA
ListMetadataFormats  BA, IDDNE, NMF
ListSets             BA, BRT, NSH
ListIdentifiers      BA, BRT, CDF, NRM, NSH
ListRecords          BA, BRT, CDF, NRM, NSH
GetRecord            BA, CDF, IDDNE
BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification
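
A minimal sketch of how the argument summary could drive request checking on either side of the protocol. The names VERB_ARGS, check_request and build_url are hypothetical; the rules table is transcribed from the 2.0 argument lists above; the from/until granularity test is a crude length comparison (an assumption of this sketch, not how the specification words it); and real repositories must also detect repeated arguments and malformed values, which this sketch does not.

```python
from urllib.parse import urlencode

# Argument rules per verb, transcribed from the 2.0 argument lists.
_LIST_ARGS = {
    "metadataPrefix": "required",
    "from": "optional",
    "until": "optional",
    "set": "optional",
    "resumptionToken": "exclusive",
}
VERB_ARGS = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": "optional"},
    "ListSets": {"resumptionToken": "exclusive"},
    "ListIdentifiers": dict(_LIST_ARGS),
    "ListRecords": dict(_LIST_ARGS),
    "GetRecord": {"identifier": "required", "metadataPrefix": "required"},
}


def check_request(params):
    """Return None if the request looks well formed, else an OAI-PMH error code."""
    verb = params.get("verb")
    if verb not in VERB_ARGS:
        return "badVerb"  # any input not matching the 6 defined verbs
    rules = VERB_ARGS[verb]
    args = {k: v for k, v in params.items() if k != "verb"}
    if any(name not in rules for name in args):
        return "badArgument"  # argument not allowed for this verb
    if any(rules[name] == "exclusive" for name in args):
        # An exclusive argument must be the only argument besides the verb.
        return None if len(args) == 1 else "badArgument"
    if any(rule == "required" and name not in args for name, rule in rules.items()):
        return "badArgument"  # missing required argument
    # Crude granularity check: YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ differ in length.
    if "from" in args and "until" in args and len(args["from"]) != len(args["until"]):
        return "badArgument"
    return None


def build_url(base_url, params):
    """Encode an OAI-PMH request as an HTTP GET URL."""
    return base_url + "?" + urlencode(params)


# Usage: validate first, then build the URL a harvester would fetch.
request = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "from": "2002-01-01"}
error = check_request(request)
url = build_url("http://arXiv.org/oai2", request) if error is None else None
```

Keeping the rules in one table means the same data can serve a harvester (to avoid sending bad requests) and a repository (to decide when to answer with badVerb or badArgument).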