U.S. Government Use of the OAI-PMH

Michael L. Nelson
Old Dominion University
Norfolk, Virginia, USA
mln@cs.odu.edu
http://www.cs.odu.edu/~mln/

Indo-US Workshop on Open Digital Libraries and Interoperability
Arlington, VA - June 23-25, 2003

Acknowledgements
• ODU: K. Maly, M. Zubair, J. Bollen, X. Liu
• LANL: R. Luce, X. Liu
• NASA: G. Roncaglia, J. Rocker
• MAGiC (UK): P. Needham

Outline
• Review:
  – OAI-PMH
  – data provider / service provider model
    • including “aggregators”
• Role of registration for repositories
• NASA projects
• OSTI demo project
• Technical Report Interchange (TRI)
  – NASA, DOE, DOD

Disclaimer: Scientific and Technical Information (STI)
• This talk will cover US Government focused / sponsored STI only
• This talk will not cover American Memory
  – a cultural history project from the Library of Congress (LoC)
    • http://memory.loc.gov/
  – the LoC played a significant role in the definition and early adoption of the OAI-PMH

Acronym Review
NASA
  LaRC = Langley Research Center
  CASI (Center for AeroSpace Information) http://www.sti.nasa.gov/
Department of Energy
  LANL = Los Alamos National Laboratory
  Sandia = Sandia National Laboratory
  OSTI (Office of Scientific and Technical Information) http://www.osti.gov/
Department of Defense
  AFRL = Air Force Research Laboratory
  DTIC (Defense Technical Information Center) http://www.dtic.mil/

The Rise and Fall of Distributed Searching
• wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  – Davis & Lagoze, JASIS 51(3), pp. 273-80
  – Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable, but only for small values of N
  • NCSTRL: N > 100; bad
  • NTRS/NIX: N <= 20; ok (but could be better)

resource – item – record
[Figure: a resource, the item holding “all available metadata about David”, and the item's records in Dublin Core, MARC and SPECTRUM metadata]
• item = identifier
• record = identifier + metadata format + datestamp
• set-membership is item-level property

Overview of OAI-PMH Verbs
  Verb                  Function
  metadata about the repository:
  Identify              description of repository
  ListMetadataFormats   metadata formats supported by repository
  ListSets              sets defined by repository
  harvesting verbs:
  ListIdentifiers       OAI unique ids contained in repository
  ListRecords           listing of N records
  GetRecord             listing of a single record
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

Data Providers / Service Providers
[Figure: data providers (repositories) and service providers (harvesters)]

Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
[Figure: data providers (repositories), aggregator, service providers (harvesters)]

Aggregators
• Frequently interchangeable terms:
  – aggregators: likely to be community / institutionally focused
  – caches: stores a copy, less likely to be community-oriented
  – proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
    • Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
• To learn more about aggregators, caches & proxies:
  – http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
  – http://www.cs.odu.edu/~mln/jcdl03/

Example Aggregators
• Arc - http://arc.cs.odu.edu/
  – first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001
    • http://www.dlib.org/dlib/april01/liu/04liu.html
• Celestial - http://celestial.eprints.org/
  – among other services, it provides a history of harvests (successful vs. errors)
    • http://celestial.eprints.org/cgi-bin/status

OAI-PMH 2.0 Registration
• 75 repositories registered
• ??? unregistered repositories
unregistered because:
• testing / development
• not for public harvesting
• public, but “low-profile”
• never got around to it…
• ???
Data Providers: http://www.openarchives.org/Register/BrowseSites.pl
Service Providers: http://www.openarchives.org/service/listproviders.html
DP:SP ~= 5:1

Registration is Nice… …But Not Required
• OAI-PMH is (becoming) the “http” for digital libraries
  – there is no central registry of http servers
    • remember the NCSA “What’s New” page? (ca. 1994)
• There will never be “registration support” in OAI-PMH
  – registries are a type of service provider, built on top of OAI-PMH
  – registration will be an integral part of community building
  – friends…

<friends>
• A lightweight, optional, DP-centric method to communicate the existence of “others”

http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify

  ..
  <description>
    <friends ..namespace stuff..>
      <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
      <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
      <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
      <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
    </friends>
  </description>
  ..

NASA <friends> example
[Figure: a harvester issues Identify and receives <friends>…</friends>; repositories shown:]
  http://techreports.larc.nasa.gov/ltrs/oai2.0/
  http://naca.larc.nasa.gov/oai2.0/
  http://ston.jsc.nasa.gov/collections/TRS/oai/
  http://ntrs.nasa.gov/oai2.0/
  http://horus.riacs.edu/perl/oai/

Use of <friends>
[Slide from S. Warner, Cornell University]

Langley Technical Report Server
• publicly available
  – began as an anonymous ftp server in 1992; http access in 1993
  – model for other technical report servers at other NASA centers
    • details in NASA TM-109162
• mostly LaTeX, MS Word, other systems
  – some scanned reports
http://techreports.larc.nasa.gov/ltrs/
http://techreports.larc.nasa.gov/ltrs/oai2.0/

NACA Technical Report Server
• publicly available
  – began in 1996
  – details in NASA TM-1999-209127
• scanned reports from 1917-1958
  – NACA = predecessor to NASA
• contents mirrored with the MAGiC project
  – a UK-based grey-literature preservation project
  – OAI-PMH used to mirror contents
http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/

NACA Report 1345 as seen through its native DL
http://naca.larc.nasa.gov/

NACA Report 1345 as seen through MAGiC
http://www.magic.ac.uk/

NACA Report 1345 as seen through Scirus (Elsevier)
http://www.scirus.com/

NACA Report 1345 as seen through OAIster
http://oaister.umdl.umich.edu/

NACA Report 1345 as seen through my.OAI (FS Consulting)
http://www.myoai.com/

NTRS OAI Architecture
[Figure: a user searches NTRS for “cfd applications”; NTRS holds a local copy of metadata from the nodes LTRS, ATRS, GTRS, ... CASITRS]
• all searching, browsing, etc. performed on the NTRS local copy of metadata
• metadata harvested offline, through OAI interface
• individual nodes can still support direct user interaction
• content (reports) remain archived at the local sites
• each node independently maintained

NASA Technical Report Server
• publicly available
• replacement for the former distributed searching version of NTRS
  – MySQL
  – Va Tech harvester
  – modified “bucket”
  – details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (July 2003)
• a service provider & aggregator
  – same OAI-PMH baseURL as used for interactive searching
http://ntrs.nasa.gov/

NASA Technical Report Server
• advanced, fielded search
• explicit query routing
  – 12 NASA repositories
  – 4 non-NASA repositories
    • turned “off” by default
[Screenshot labels: non-NASA repositories; > 0.5M records]

NASA DLs in the Larger STI Realm
[Figure: NTRS (over LTRS, ATRS, … CASITRS) linked to other DLs: Publishers, Universities, International, DOD, DOE, ...; this could be a fully connected graph]
• NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.
• NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.
• We hope to influence the direction of the science.gov effort to use OAI-PMH

OSTI Energy Citations Database
• OAI-PMH support just recently added (Feb 2003)
  – not yet officially announced or registered
  – 20k records, 8k fulltext
• other OSTI collections planned
http://www.osti.gov/energycitations/

Technical Report Interchange
• Goal: share technical reports between 4 US government labs without creating new digital libraries for users to learn!
  – NASA Langley Research Center
  – Air Force Research Laboratory
  – Los Alamos National Laboratory (DOE)
  – Sandia National Laboratory (DOE)
• Solution: use cooperating OAI-PMH caches at each site to
  – export local contents
  – ingest remote contents

TRI Production System - Status
[Figure: LaRC TRI System, LANL TRI System, Sandia TRI System, AFRL TRI System and ODU TRI System (Listener), with records coming in from other TRI systems and records going out to other TRI systems; legend: In Production, Proposed]
Slide from M. Zubair, ODU

Mappings in TRI
  Laboratory   Native Metadata Format   Native Source Commercial DL System   Native Destination Commercial DL System
  LaRC         MARC                     BASIS+                               (TBD)
  LANL         MARC + local fields      Geac ADVANCE                         Science Server
  AFRL         COSATI                   Sirsi STILAS                         Sirsi STILAS
  Sandia       MARC                     Horizon                              Verity
Details in Liu, et al. ECDL 2002; the above table is also taken from the same paper

A Single TRI Module
[Figure: a single TRI module. Labels:]
  Common modules in all three DLs: OAI Harvester; Control Scheduler; Local DB (Remote Data in DC format, Local Data in DC format)
  Specific module for each DL: Local DL Manager; Input Directory; Output Directory
  Data flow labels:
  – Connect to remote DL by OAI protocol
  – Read new data from remote DL
  – Write new data published in local DL
  – Write remote data to local format
  – Read local data and convert to DC format
Slide from M. Zubair, ODU

The Future: Community Building
• Ultimately, protocols and metadata formats are not what makes a difference
• Rather, the critical mass afforded by a common set of utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language Archives Community
  – http://www.language-archives.org/
• OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends

STI Communities
• Government produced/sponsored STI
  • http://ntrs.nasa.gov/
  • http://www.osti.gov/energycitations/
  • http://dlib.cs.odu.edu/tri/
• Academia
  – self-archiving vs. institutional archives
    • http://www.soros.org/openaccess/
    • http://www.ecs.soton.ac.uk/~harnad/Tp/resolution.htm
• Commercial publishers
  – e.g. BioMed Central
    • http://www.biomedcentral.com/
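
Sketch: forming OAI-PMH requests and following <friends>

The six verbs and the <friends> container reviewed earlier can be sketched in a few lines. This is a minimal illustration, assuming Python 3 and only its standard library. The helper names (build_request, friends_of), the from_ keyword spelling, and the SAMPLE_IDENTIFY document are assumptions made for this example; the sample response is hand-written from baseURLs shown on the <friends> slide and is not captured server output. Nothing here contacts a network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# The six OAI-PMH 2.0 verbs listed in the verb overview.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request(base_url, verb, **args):
    """Return an OAI-PMH GET URL for one verb and its arguments (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in args.items():
        # "from" is a Python keyword, so callers pass from_ and we strip the underscore.
        params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def friends_of(identify_xml):
    """List the baseURLs found inside any <friends> container of an Identify response."""
    root = ET.fromstring(identify_xml)
    urls = []
    for elem in root.iter():
        if local_name(elem.tag) == "friends":
            urls.extend(child.text.strip() for child in elem
                        if local_name(child.tag) == "baseURL" and child.text)
    return urls


# Hand-written, abbreviated Identify response (an assumption for illustration).
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <baseURL>http://techreports.larc.nasa.gov/ltrs/oai2.0/</baseURL>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
        <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "http://techreports.larc.nasa.gov/ltrs/oai2.0/"
    print(build_request(base, "Identify"))
    print(build_request(base, "ListRecords",
                        metadataPrefix="oai_dc", from_="2003-01-01"))
    # A harvester could discover further repositories by asking each friend to Identify.
    for url in friends_of(SAMPLE_IDENTIFY):
        print(build_request(url, "Identify"))
```

Matching elements by local name keeps the sketch independent of the exact namespace declarations a repository emits; a production harvester would more likely validate against the published schemas and would also handle resumption tokens, which are omitted here.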