Implementation of Digital Libraries

Michael L. Nelson
Old Dominion University
mln@cs.odu.edu
http://www.cs.odu.edu/~mln/

Congreso Internacional de Información en Salud
Lima, Peru
May 28, 2004

Acknowledgements
• ODU: K. Maly, M. Zubair, J. Bollen
• LANL: R. Luce, X. Liu
• NASA: G. Roncaglia, J. Rocker, C. Mackey
• Cornell: C. Lagoze, S. Warner
• MAGiC (UK): Paul Needham
• and, of course, Herbert Van de Sompel (LANL)
  – the OpenURL slides are nicked from his presentations

Outline
• A bit of history
• Core technologies & Issues
  – OAI-PMH
    • deep web
  – OpenURL
  – Handles / DOIs
  – Object Models
  (covered only briefly)
• Example implementations
• Download and go…

OAI-PMH

Background
• I met Herbert Van de Sompel in April 1999...
  – we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
  – We wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc.
    • most digital libraries (DLs) had grown up along single disciplines or institutions
      – little to no interoperability; isolated DL “gardens”
  – Universal Preprint Service
    • Demonstrated at Santa Fe NM, October 21-22, 1999
      – http://web.archive.org/web/*/http://ups.cs.odu.edu/
    • D-Lib Magazine, 6(2) 2000 (2 articles)
      – http://www.dlib.org/dlib/february00/02contents.html
  – UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

Result… OAI
• The OAI was the result of the demonstration and discussion during the Santa Fe meeting
  – OAI = a bunch of people, a religion, a cult, etc.
  – OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and maintained by the OAI
• Initial focus was on federating collections of scholarly e-print materials…
• …however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol
• Note:
  – OAI-PMH is only about metadata -- not full text!
    • but what is metadata vs. full-text?
  – OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
    • read: commercial publishers have an interest in OAI-PMH too...

OAI-PMH Mechanics
• Request is encoded in http
• Response is encoded in XML
• XML Schemas for the responses are defined in the OAI-PMH document

Overview of OAI-PMH Verbs

  Verb                    Function
  archival metadata:
    Identify              description of archive
    ListMetadataFormats   metadata formats supported by archive
    ListSets              sets defined by archive
  harvesting verbs:
    ListIdentifiers       OAI unique ids contained in archive
    ListRecords           listing of N records
    GetRecord             listing of a single record

most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

OAI-PMH Data Model
[figure: resource, item, records]
• resource
• item = identifier
  – all available metadata about David
  – set-membership is item-level property
• records: Dublin Core metadata, MARC metadata, SPECTRUM metadata
  – record = identifier + metadata format + datestamp

Data Providers / Service Providers
[figure: data providers (repositories); service providers (harvesters)]

Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
[figure: data providers (repositories); aggregator; service providers (harvesters)]

Aggregators
• Frequently interchangeable terms:
  – aggregators: likely to be community / institutionally focused
  – caches: stores a copy, less likely to be community-oriented
  – proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
    • Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
• To learn more about aggregators, caches & proxies:
  – http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
  – http://www.cs.odu.edu/~mln/jcdl03/

Example Aggregators
• Arc - http://arc.cs.odu.edu/
  – first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001
    • http://www.dlib.org/dlib/april01/liu/04liu.html
• Celestial - http://celestial.eprints.org/
  – among other services, it
    provides a history of harvests (successful vs. errors)
    • http://celestial.eprints.org/cgi-bin/status

OAI-PMH 2.0 Registration
150+ repositories registered
??? unregistered repositories
unregistered because:
• testing / development
• not for public harvesting
• public, but “low-profile”
• never got around to it…
• ???
DP:SP ~= 5:1
Data Providers: http://www.openarchives.org/Register/BrowseSites.pl
Service Providers: http://www.openarchives.org/service/listproviders.html

Registration is Nice… …But Not Required
• OAI-PMH is (becoming) the “http” for digital libraries
  – there is no central registry of http servers
    • remember the NCSA “What’s New” page? (ca. 1994)
• There will never be “registration support” in OAI-PMH
  – registries are a type of service provider, built on top of OAI-PMH
  – registration will be an integral part of community building
  – friends…

NASA <friends> example
[figure: harvester; Identify; <friends>…</friends>]
http://techreports.larc.nasa.gov/ltrs/oai2.0/
http://naca.larc.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://ntrs.nasa.gov/oai2.0/
http://horus.riacs.edu/perl/oai/

NACA Technical Report Server
• publicly available
  – began in 1996
  – details in NASA TM-1999209127
• scanned reports from 1917-1958
  – NACA = predecessor to NASA
• contents mirrored with the MAGiC project
  – a UK-based grey-literature preservation project
  – OAI-PMH used to mirror contents
http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/

NACA Report 1345 as seen through its native DL
http://naca.larc.nasa.gov/

NACA Report 1345 as seen through MAGiC
http://www.magic.ac.uk/

NACA Report 1345 as seen through Scirus (Elsevier)
http://www.scirus.com/

NACA Report 1345 as seen through my.OAI (FS Consulting)
http://www.myoai.com/

NASA Technical Report Server
• replacement for the previous distributed searching version of NTRS
  – MySQL
  – Va Tech harvester
  – modified “bucket”
  – details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (March 2003)
• a service provider &
  aggregator
  – same OAI baseURL as used for interactive searching
http://ntrs.nasa.gov/

NASA Technical Report Server
• advanced, fielded search
• explicit query routing
  – 12 NASA repositories
  – 4 non-NASA repositories
    • turned “off” by default
• >600k abstracts; >300k full-text

Service Providers
• It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol
  – easy to be a DP -> many DPs -> SPs eventually emerge
  – hard to be a DP -> SPs starve
  – currently 5x more DPs than SPs
• SPs are beginning to offer increasingly sophisticated services
  – competitive market originally envisioned for SPs is emerging

Community Building
www.ndltd.org
Universidad Nacional Mayor de San Marcos
Colegio America
Pontificia Universidad Catolica del Peru
Universidad Nacional Federico Villarreal
Universidad Nacional de Trujillo
Universidad de Lima
Universidad Nacional Jorge Basadre Grohmann
Colegio Universitario Andino
Universidad del Pacifico
Universidad Peruana de Ciencias Aplicadas

OAI-PMH & The Deep Web

Exposing Repository Contents
• DP9: Webcrawler access to OAI-PMH repositories
  – http://dlib.cs.odu.edu/dp9/
  – JCDL 02 http://www.cs.odu.edu/~liu_x/dp9/dp9.pdf
• An Apache module for OAI-PMH
  – http://www.modoai.org/
• Extensible Repository Resource Locators (ERRoLs) for OAI Identifiers
  – http://www.oclc.org/research/projects/oairesolver/default.htm

Race for This New Market…
• Yahoo! & University of Michigan
  – http://www.umich.edu/news/index.html?Releases/2004/Mar04/r031004
• Google & CrossRef
  – http://www.nature.com/nature/focus/accessdebate/17.html

OpenURL
slides from Herbert Van de Sompel, LANL

Origins & Motivation
The Context: Library Automation Environment anno 1998
• distributed information environment
• local & remote A&I databases
• rapidly growing e-journal collection
• need to interlink the available information
The Problem:
• links are delivered by info providers
• links are not sensitive to user’s context
  – appropriate copy problem
• links dependent on business agreements between information vendors
  – links don’t cover the complete collection

Origins & Motivation
The REAL Problem:
• libraries have no say in linking
• libraries are losing core part of the “organizing information” task
• expensive collection is not used optimally
• users are not well served

Origins & Motivation
The Solution: In information services:
• DO NOT provide a link which is an actual service related to a referenced item (e.g. a link from a record in an A&I database to the corresponding full-text)
• BUT rather provide
  – a link that transports metadata about the referenced item (the OpenURL) to
  – others that are better placed to provide service links (a linking server operated by the library)

non-OpenURL linking
[figure: link source (reference) in one resource; resolution of metadata into link; link to referenced work at the link destination in another resource]

OpenURL linking
[figure: link source (reference); provision of OpenURL; transportation of metadata & identifiers (user-specific) to the OpenURL linking server; resolution of metadata & identifiers into services; links to several link destinations]

default links:
• restricted in nature
• action-radius restricted by business agreements
• not context-sensitive
[figure: metadata plane with default links among resource1, resource2, resource3]

[figure: extended services plane (service component1, service component2); metadata plane (default links among resource1, resource2, resource3)]

NISO OpenURL Standardization
Charge
• Use existing “OpenURL Framework” as starting point
  – notion of context-sensitive services
  – notion of transporting “contextual” metadata packages to obtain context-sensitive services
• Define syntax and transport-method for “contextual” metadata packages
• Ensure extensibility:
  – must support future applications
  – must support other information communities
=> Generalize and Standardize

NISO OpenURL Standardization
Charge
Therefore, to be addressed were:
• OpenURL Framework beyond scholarly resources
• “contextual” metadata packages
• Syntax for “contextual” metadata packages
• Transport of “contextual” metadata packages

OpenURL Status
• (Nearly) a NISO standard
  – check for details:
    • http://library.caltech.edu/openurl/

Naming: Handles & DOIs

Naming
• Fundamental to other technologies (OAI-PMH, OpenURL, etc.)
• Options
  – URNs
  – Persistent URLs (PURLs)
    • http://purl.org/
  – Handles
    • http://www.handle.net/
  – Digital Object Identifiers
    • http://www.doi.org/
  – ARK
    • http://www.cdlib.org/inside/diglib/ark/

“Inverted Archives”
• Unit of discourse is no longer an archive or service, but a DOI which has services linked from it
  – cf.:
    • UPS demonstration prototype
    • “Smart Objects, Dumb Archives” (SODA) model

Example
http://dx.doi.org/10.1145/374308.374342

Object Models

Popular Object Models
• METS
  – used in DSpace, Fedora
  – http://www.loc.gov/standards/mets/
• MPEG-21 DIDL
  – http://xml.coverpages.org/mpeg21-didl.html
  – used in LANL DLs
    • http://www.dlib.org/dlib/november03/bekaert/11bekaert.html
    • http://www.dlib.org/dlib/february04/bekaert/02bekaert.html
    • http://lib-www.lanl.gov/~herbertv/papers/jcdl2004-submitteddraft.pdf

Object Models & OAI-PMH
[figure: resource; item oai:foo.edu:1234; records; METS]
Move from simple metadata files “pointing” to resources…
…to records as “modeled representations” of resources

Download and Go!

Where Do You Want to Build?
[figure: user; service provider (CDSware); data providers (EPrints.org, CDSware, ...); local context-sensitive services]

Fedora
• joint project between Cornell & UVa
  – funded by the Mellon Foundation
• a repository management system
  – focuses on complex digital objects and their behaviors
• more info:
  – http://www.fedora.info/
  – D-Lib Magazine, 9(4)
    • http://www.dlib.org/dlib/april03/staples/04staples.html

DSpace
• MIT + HP Labs
• constructed to capture all the output of MIT’s faculty
• now generalized to the DSpace Federation
  – 8 top universities in the US & Canada
• More info:
  – http://www.dspace.org/
  – http://sourceforge.net/projects/dspace/
  – D-Lib Magazine 9(1)
    • http://www.dlib.org/dlib/january03/smith/01smith.html

EPrints.org
• developed at Southampton University
  – part of larger suite of institutional/author self-archiving tools and services
    • e.g.: citebase; paracite
• widely adopted -- 100+ sites
  – http://software.eprints.org/#ep2
• more info
  – http://www.eprints.org/
  – http://www.arl.org/sparc/core/index.asp?page=g20#6

CDSware
• developed at CERN
• data provider & service provider
• large-scale use @ CERN (> 600k records)
  – in use at a few non-CERN sites
• free & paid support models
• more info
  – http://cdsware.cern.ch/

Kepler
• P2P publishing for academia
  – community servers for coordination, management
  – archivelets for individual laptops, PCs
• more info:
  – http://kepler.cs.odu.edu/
  – D-Lib Magazine 7(4)
    • http://www.dlib.org/dlib/april01/maly/04maly.html

OpenResolver
• developed by UKOLN
  – open source
• OpenURL 0.1 format resolver
  – NISO 1.0 format???
• more info:
  – Ariadne, 28
    • http://www.ariadne.ac.uk/issue28/resolver/
    • ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/
    • http://www.ukoln.ac.uk/distributed-systems/openurl/

Conclusions

Why The OAI-PMH is NOT Important
• Users don’t care
• OAI-PMH is middleware
  – if done right, the uninterested user should never have to know
• Using OAI-PMH does not ensure a good SP
• OAI-PMH is (or is becoming) HTTP for DLs
  – few people get excited about http now
• http & OAI-PMH are core technologies whose presence is now assumed

Digital Library Technologies
• http
• XML
• OAI-PMH
• OpenURL
?

Other Uses For the OAI-PMH
• Assumptions:
  – Traditional DLs / SPs will continue on their present path of increasing sophistication
    • citation indexing, search results visualization, personalization, recommendations, subject-based filtering, etc.
  – growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state
  – Future opportunities are possible by creatively interpreting the OAI-PMH data model
    • See Van de Sompel, Young & Hickey, D-Lib Magazine July 2003, http://www.dlib.org/dlib/july03/young/07young.html
    • Nelson, 2nd OAI Workshop, http://agenda.cern.ch/askArchive.php?base=agenda&categ=a02333&id=a02333s5t8/transparencies

OpenURL Framework evolution
Draft (Van de Sompel, Beit-Arie, Hochstenbach), 05/2001:
A spec based on HTTP GET to transport metadata about
• a scholarly referent &
• the context in which the referent is referenced
NISO Draft Standard, 04/2003:
A framework Standard that enables different Communities to:
• describe a referent
• describe the context in which the referent is referenced
• transport these descriptions

The Future: Community Building
• Ultimately, protocols and metadata formats are not what makes a difference
• Rather, the critical mass afforded by a common set of utilities (cf.
  http, Dublin Core, XML)
• The best current example: The Open Language Archives Community
  – http://www.language-archives.org/
• OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends

Further Reading
• Gerry McKiernan, Library Hi-Tech News
  – http://www.public.iastate.edu/~gerrymck/OAI-SP-I.pdf
  – http://www.public.iastate.edu/~gerrymck/OAI-SP-II.pdf
  – http://www.public.iastate.edu/~gerrymck/OAI-SP-III.pdf
• Open Archives Forum OAI-PMH Tutorial
  – http://www.oaforum.org/tutorial/
• “A Survey of Digital Library Aggregation Services”
  – http://www.diglib.org/pubs/brogan/
• Open Access News
  – http://www.earlham.edu/~peters/fos/fosblog.html
• Guide To Institutional Repository Software
  – http://www.soros.org/openaccess/software/

Great Stuff I Did Not Cover…
• OAI-PMH
  – Static Repositories
    • http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm
  – OAI-Rights
    • http://www.openarchives.org/documents/OAIRightsWhitePaper.html
    • http://www.openarchives.org/news/oairightspress030929.html
• Digital Preservation
  – http://www.digitalpreservation.gov/
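
Backup sketch: OAI-PMH request / response
The mechanics covered earlier (a request encoded in http, a response encoded in XML, and a resumption token for flow control) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the base URL http://foo.edu/oai is hypothetical (chosen to match the oai:foo.edu:1234 identifier used on the object model slide), the sample response is hand-written rather than captured from a real repository, and the function names are this sketch's own, not part of any OAI toolkit.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request(base_url, verb, **args):
    """Encode an OAI-PMH request as an HTTP GET URL.

    "from" is a Python keyword, so callers pass from_ instead
    (a convention of this sketch only).
    """
    params = {"verb": verb}
    for key, value in args.items():
        if value is not None:
            params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, datestamp), ...], resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.iter(OAI_NS + "header"):
        records.append((header.findtext(OAI_NS + "identifier"),
                        header.findtext(OAI_NS + "datestamp")))
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    # An absent or empty resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return records, token


# Hand-written sample response (hypothetical repository, metadata omitted).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-05-28T12:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://foo.edu/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:foo.edu:1234</identifier>
        <datestamp>2004-05-01</datestamp>
      </header>
      <metadata/>
    </record>
    <resumptionToken>token-001</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    url = build_request("http://foo.edu/oai", "ListRecords",
                        metadataPrefix="oai_dc", from_="2004-01-01")
    print(url)
    records, token = parse_list_records(SAMPLE)
    print(records, token)
    if token:
        # Follow-up requests carry only the verb and the resumption token.
        print(build_request("http://foo.edu/oai", "ListRecords",
                            resumptionToken=token))
```

A harvester would repeat the last step, fetching each follow-up URL and parsing the result, until a response arrives without a resumption token.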