
U.S. Government Use
of the OAI-PMH
Michael L. Nelson
Old Dominion University
Norfolk Virginia, USA
mln@cs.odu.edu
http://www.cs.odu.edu/~mln/
Indo-US Workshop on
Open Digital Libraries and Interoperability
Arlington, VA - June 23-25, 2003
Acknowledgements
• ODU: K. Maly, M. Zubair, J. Bollen, X. Liu
• LANL: R. Luce, X. Liu
• NASA: G. Roncaglia, J. Rocker
• MAGiC (UK): P. Needham
Outline
• Review:
– OAI-PMH
– data provider / service provider model
• including “aggregators”
• Role of registration for repositories
• NASA projects
• OSTI demo project
• Technical Report Interchange (TRI)
– NASA, DOE, DOD
Disclaimer:
Scientific and Technical Information (STI)
• This talk will cover US Government
focused / sponsored STI only
• This talk will not cover American Memory
– a cultural history project from the Library of
Congress (LoC)
• http://memory.loc.gov/
– the LoC played a significant role in the
definition and early adoption of the OAI-PMH
Acronym Review
NASA
– LaRC = Langley Research Center
– CASI = Center for AeroSpace Information (http://www.sti.nasa.gov/)
Department of Energy
– LANL = Los Alamos National Laboratory
– Sandia = Sandia National Laboratory
– OSTI = Office of Scientific and Technical Information (http://www.osti.gov/)
Department of Defense
– AFRL = Air Force Research Laboratory
– DTIC = Defense Technical Information Center (http://www.dtic.mil/)
The Rise and Fall of
Distributed Searching
• wholesale distributed searching, popular at
the time, is attractive in theory but
troublesome in practice
– Davis & Lagoze, JASIS 51(3), pp. 273-80
– Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still
viable, but only for small values of N
• NCSTRL: N > 100; bad
• NTRS/NIX: N<=20; ok (but could be better)
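One way to see why the value of N matters is a back-of-the-envelope availability model. The sketch below assumes, purely for illustration, that every node answers a query independently with the same probability p, so the chance that all N nodes answer is p^N. The function name and the value 0.98 are hypothetical; they are not measurements from NCSTRL or NTRS.

```python
def all_nodes_respond(p: float, n: int) -> float:
    """Probability that all n nodes respond, assuming each responds
    independently with probability p (an illustrative assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


# Hypothetical per-node availability of 0.98, for two federation sizes.
for n in (20, 100):
    print(n, round(all_nodes_respond(0.98, n), 2))
```

Under this assumed model a complete answer becomes much less likely as N grows from 20 to 100, which is consistent with the slide's point that distributed searching only works for small N.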
resource – item – record
[Figure: a resource ("David"); the item holding all available metadata about David; and the records disseminated from that item in Dublin Core, MARC, and SPECTRUM metadata formats]
• item = identifier
• record = identifier + metadata format + datestamp
• set-membership is item-level property
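The item / record distinction can be sketched as a small data model: one identifier names the item, and each record adds a metadata format and a datestamp. The class name, the identifiers, and the datestamps below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    identifier: str       # names the item
    metadata_prefix: str  # metadata format, e.g. "oai_dc" or "marc"
    datestamp: str        # date the record was created or last changed


def records_for_item(records, identifier):
    """Return every record disseminated from the item with this identifier."""
    return [r for r in records if r.identifier == identifier]


# Hypothetical repository contents: one item in two formats, another in one.
records = [
    Record("oai:example.org:david", "oai_dc", "2003-06-01"),
    Record("oai:example.org:david", "marc", "2003-06-01"),
    Record("oai:example.org:other", "oai_dc", "2003-05-20"),
]
```

Calling `records_for_item(records, "oai:example.org:david")` returns two records that share an identifier but differ in metadata format.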
Overview of OAI-PMH Verbs
Verb: Function
metadata about the repository
– Identify: description of repository
– ListMetadataFormats: metadata formats supported by repository
– ListSets: sets defined by repository
harvesting verbs
– ListIdentifiers: OAI unique ids contained in repository
– ListRecords: listing of N records
– GetRecord: listing of a single record
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
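A minimal sketch of how a harvester might issue these verbs and follow resumption tokens. The function names and the injected `fetch` callable are assumptions made for this example (the callable takes a URL and returns the response XML as a string); error responses and retries are not handled.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request_url(base_url, verb, **args):
    """Build an OAI-PMH GET request; arguments whose value is None are skipped."""
    params = {"verb": verb}
    params.update({k: v for k, v in args.items() if v is not None})
    return base_url + "?" + urlencode(params)


def list_identifiers(base_url, fetch, metadata_prefix="oai_dc",
                     set_spec=None, from_=None):
    """Yield identifiers, following resumptionTokens until the list is complete."""
    url = build_request_url(base_url, "ListIdentifiers",
                            metadataPrefix=metadata_prefix,
                            set=set_spec, **{"from": from_})
    while url:
        root = ET.fromstring(fetch(url))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = (root.findtext(f".//{OAI_NS}resumptionToken") or "").strip()
        # A missing or empty token means the list is complete. When a token
        # is present, it is sent as the only argument besides the verb.
        url = (build_request_url(base_url, "ListIdentifiers",
                                 resumptionToken=token) if token else None)
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; in practice it could wrap `urllib.request.urlopen`.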
Data Providers / Service Providers
[Figure: data providers (repositories) are harvested by service providers (harvesters)]
Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
[Figure: data providers (repositories) → aggregator → service providers (harvesters)]
Aggregators
• Frequently interchangeable terms:
– aggregators: likely to be community / institutionally focused
– caches: store a copy, less likely to be community-oriented
– proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
• Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
• To learn more about aggregators, caches &
proxies:
– http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
– http://www.cs.odu.edu/~mln/jcdl03/
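The aggregator role (harvest from several data providers, then act as a data provider yourself) can be sketched in memory as follows. This is an illustrative sketch, not the design of Arc or any other system; the class and method names are hypothetical, and datestamps are assumed to be ISO 8601 strings of equal granularity so that string comparison orders them correctly.

```python
class Aggregator:
    """Harvest records from several sources and re-expose them as one repository."""

    def __init__(self):
        # (identifier, metadata_prefix) -> (datestamp, metadata, source)
        self._records = {}

    def ingest(self, source, records):
        """Store harvested records, keeping the newest datestamp per key."""
        for identifier, prefix, datestamp, metadata in records:
            key = (identifier, prefix)
            current = self._records.get(key)
            if current is None or datestamp > current[0]:
                self._records[key] = (datestamp, metadata, source)

    def list_records(self, prefix, from_=None):
        """Data-provider side: records in one format, optionally changed since from_."""
        out = []
        for (identifier, p), (datestamp, metadata, source) in sorted(self._records.items()):
            if p == prefix and (from_ is None or datestamp >= from_):
                out.append((identifier, datestamp, metadata, source))
        return out
```

Keeping the source alongside each record is one way to preserve provenance when records pass through more than one harvesting hop.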
Example Aggregators
• Arc - http://arc.cs.odu.edu/
– first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001
• http://www.dlib.org/dlib/april01/liu/04liu.html
• Celestial - http://celestial.eprints.org/
– among other services, it provides a history of
harvests (successful vs. errors)
• http://celestial.eprints.org/cgi-bin/status
OAI-PMH 2.0 Registration
• 75 repositories registered
• ??? unregistered repositories
unregistered because:
• testing / development
• not for public harvesting
• public, but “low-profile”
• never got around to it…
• ???
Data Providers: http://www.openarchives.org/Register/BrowseSites.pl
Service Providers: http://www.openarchives.org/service/listproviders.html
DP:SP ~= 5:1
Registration is Nice…
…But Not Required
• OAI-PMH is (becoming) the “http” for digital
libraries
– there is no central registry of http servers
• remember the NCSA “What’s New” page? (ca. 1994)
• There will never be “registration support” in OAI-PMH
– registries are a type of service provider, built on top of
OAI-PMH
– registration will be an integral part of community
building
– friends…
<friends>
• A lightweight, optional, DP-centric
method to communicate the existence of
“others”
http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify
..
<description>
<friends ..namespace stuff..>
<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
<baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
</friends>
</description>
..
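A harvester could pull the baseURLs out of a <friends> block like the one above with a few lines of XML parsing. This is a minimal sketch: the function name is hypothetical, and the namespace URI is the one I understand the OAI-PMH implementation guidelines to use for the friends container, so treat it as an assumption and check it against the repository's actual response.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the <friends> container.
FRIENDS_NS = "{http://www.openarchives.org/OAI/2.0/friends/}"


def extract_friends(identify_xml):
    """Return the baseURLs listed inside <friends> in an Identify response."""
    root = ET.fromstring(identify_xml)
    return [el.text.strip()
            for el in root.iter(FRIENDS_NS + "baseURL")
            if el.text and el.text.strip()]
```

Matching on the friends namespace keeps the repository's own `baseURL` element, which lives in the OAI-PMH namespace, out of the result.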
NASA <friends> example
[Figure: a harvester sends Identify to http://techreports.larc.nasa.gov/ltrs/oai2.0/ and receives a <friends>…</friends> block pointing to:]
• http://naca.larc.nasa.gov/oai2.0/
• http://ston.jsc.nasa.gov/collections/TRS/oai/
• http://ntrs.nasa.gov/oai2.0/
• http://horus.riacs.edu/perl/oai/
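Starting from one repository, a harvester can follow <friends> links to discover others without a central registry. The sketch below is a plain breadth-first walk; `get_friends` is a hypothetical callable that returns the friend baseURLs for a given baseURL (for example by issuing Identify and parsing the response). Comparing URLs with the trailing slash removed is an assumption made here so that the same repository written with and without the slash is visited only once.

```python
from collections import deque


def discover(seed, get_friends):
    """Breadth-first walk of <friends> links; returns baseURLs in discovery order."""
    def key(url):
        return url.rstrip("/")  # assumed normalization for duplicate detection

    found = [seed]
    known = {key(seed)}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for friend in get_friends(url):
            if key(friend) not in known:
                known.add(key(friend))
                found.append(friend)
                queue.append(friend)
    return found
```

Because visited URLs are remembered, mutual friendships between two repositories do not cause the walk to loop.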
Use of <friends>
Slide from S. Warner, Cornell University
Langley Technical Report Server
• publicly available
– began as an anonymous ftp
server in 1992; http access
in 1993
– model for other technical
report servers at other
NASA centers
• details in NASA TM-109162
• mostly LaTeX, MS Word,
other systems
– some scanned reports
http://techreports.larc.nasa.gov/ltrs/
http://techreports.larc.nasa.gov/ltrs/oai2.0/
NACA Technical Report Server
• publicly available
– began in 1996
– details in NASA TM-1999-209127
• scanned reports from 1917-1958
– NACA = predecessor to NASA
• contents mirrored with the MAGiC project
– a UK-based grey-literature preservation project
– OAI-PMH used to mirror contents
http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/
NACA Report 1345
as seen through its native DL
http://naca.larc.nasa.gov/
NACA Report 1345
as seen through MAGiC
http://www.magic.ac.uk/
NACA Report 1345
as seen through Scirus
(Elsevier)
http://www.scirus.com/
NACA Report 1345
as seen through OAIster
http://oaister.umdl.umich.edu/
NACA Report 1345
as seen through my.OAI
(FS Consulting)
http://www.myoai.com/
NTRS OAI Architecture
[Figure: a user searches NTRS (e.g. for “cfd applications”); NTRS holds a local copy of metadata harvested from the nodes LTRS, ATRS, GTRS, ..., CASITRS]
• all searching, browsing, etc. performed on the local copy of metadata at NTRS
• metadata harvested offline, through OAI interface
• individual nodes can still support direct user interaction
• each node independently maintained
• content (reports) remain archived at the local sites
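The key idea in this architecture is that a query never leaves the central site: it runs against metadata that was harvested earlier. A toy sketch of that idea follows, using an in-memory word index. The function names, the record layout, and the identifiers are hypothetical; the production NTRS described in this talk uses MySQL rather than anything like this.

```python
def build_index(records):
    """Map each lowercased word in title/description to the identifiers containing it.

    `records` is {identifier: {"title": ..., "description": ...}}, i.e. a
    local copy of previously harvested metadata.
    """
    index = {}
    for identifier, metadata in records.items():
        text = " ".join(metadata.get(field, "") for field in ("title", "description"))
        for word in text.lower().split():
            index.setdefault(word, set()).add(identifier)
    return index


def search(index, query):
    """Return identifiers whose metadata contains every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

Since the index is built offline, a node being unreachable at query time affects only the freshness of its metadata, not whether the search completes.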
NASA Technical Report Server
• publicly available
• replacement for the former distributed searching version of NTRS
– MySQL
– Va Tech harvester
– modified “bucket”
– details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (July 2003)
• a service provider & aggregator
– same OAI-PMH baseURL as used for interactive searching
http://ntrs.nasa.gov/
NASA Technical Report Server
• advanced, fielded search
• explicit query routing
– 12 NASA repositories
– 4 non-NASA repositories
• turned “off” by default
• > 0.5M records
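Explicit query routing with non-NASA repositories "off" by default can be sketched as a small selection function. The function name, the repository names, and the `nasa` flag are hypothetical placeholders; the slide does not say how NTRS implements routing.

```python
def route_query(repositories, selected=None):
    """Return the names of the repositories a query should be sent to.

    `repositories` maps a name to {"nasa": bool}. With no explicit selection,
    only NASA repositories are searched; an explicit selection is used as given.
    """
    if selected is None:
        return [name for name, info in repositories.items() if info["nasa"]]
    unknown = [name for name in selected if name not in repositories]
    if unknown:
        raise KeyError(f"unknown repositories: {unknown}")
    return list(selected)


# Hypothetical configuration: two NASA nodes and one non-NASA repository.
repositories = {
    "LTRS": {"nasa": True},
    "NACA": {"nasa": True},
    "non-nasa-1": {"nasa": False},
}
```

With this configuration, `route_query(repositories)` selects only the NASA nodes, while `route_query(repositories, ["LTRS", "non-nasa-1"])` honors the user's explicit choice.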
NASA DLs in the Larger STI Realm
[Figure: NTRS (aggregating LTRS, ATRS, …, CASITRS) linked to other DLs: Publishers, Universities, International, DOD, DOE, ...]
• this could be a fully connected graph
• NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.
• NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.
• We hope to influence the direction of the science.gov effort to use OAI-PMH
OSTI Energy Citations Database
• OAI-PMH support just
recently added (Feb
2003)
– not yet officially
announced or
registered
– 20k records, 8k fulltext
• other OSTI collections
planned
http://www.osti.gov/energycitations/
Technical Report Interchange
• Goal: share technical reports between 4 US government labs without creating new digital libraries for users to learn!
– NASA Langley Research Center
– Air Force Research Laboratory
– Los Alamos National Laboratory (DOE)
– Sandia National Laboratory (DOE)
• Solution: use cooperating OAI-PMH caches at each site to
– export local contents
– ingest remote contents
TRI Production System - Status
[Figure: TRI systems at LaRC, LANL, Sandia, AFRL, and ODU (Listener), each marked “In Production” or “Proposed”; arrows show records coming in from other TRI systems and records going out to other TRI systems]
Slide from M. Zubair, ODU
Mappings in TRI
Laboratory | Native Metadata Format | Native Source Commercial DL System | Native Destination Commercial DL System
LaRC | MARC | BASIS+ | (TBD)
LANL | MARC + local fields | Geac ADVANCE | Science Server
AFRL | COSATI | Sirsi STILAS | Sirsi STILAS
Sandia | MARC | Horizon | Verity
Details in Liu, et al. ECDL 2002; the above table also taken from the same paper
A Single TRI Module
[Figure: architecture of a single TRI module]
Common Modules in all three DLs
• OAI Harvester Control
– connect to remote DL by OAI protocol
– read new data from remote DL
– write new data published in local DL
• Scheduler
• Local DB
– remote data in DC format
– local data in DC format
Specific module for each DL
• Local DL Manager
– write remote data to local format (Input Directory)
– read local data and convert to DC format (Output Directory)
Slide from M. Zubair, ODU
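The DL-specific step, converting native metadata to Dublin Core (DC) for exchange, can be illustrated with a small field mapping. This is not the mapping used in TRI: the table below covers only a handful of standard MARC tags, the input is simplified to {tag: [values]} rather than real MARC records, and the function name is hypothetical.

```python
# Illustrative subset of a MARC-to-Dublin-Core crosswalk (an assumption for
# this sketch, not the TRI mapping).
MARC_TO_DC = {
    "245": "title",        # title statement
    "100": "creator",      # main entry, personal name
    "700": "creator",      # added entry, personal name
    "650": "subject",      # subject added entry, topical term
    "520": "description",  # summary
}


def marc_to_dc(marc_fields):
    """Convert {tag: [values]} to {dc_element: [values]}, dropping unmapped tags."""
    dc = {}
    for tag, values in marc_fields.items():
        element = MARC_TO_DC.get(tag)
        if element is not None:
            dc.setdefault(element, []).extend(values)
    return dc
```

Unmapped tags are silently dropped here, which shows why a conversion through DC is lossy for formats such as "MARC + local fields".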
The Future: Community Building
• Ultimately, protocols and metadata formats are not what
makes a difference
• Rather, the critical mass afforded by a common set of
utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language Archives
Community
– http://www.language-archives.org/
• OAI-PMH provides the basis for communication between
strangers, but allows even richer communication between
friends
STI Communities
• Government produced/sponsored STI
• http://ntrs.nasa.gov/
• http://www.osti.gov/energycitations/
• http://dlib.cs.odu.edu/tri/
• Academia
– self-archiving vs. institutional archives
• http://www.soros.org/openaccess/
• http://www.ecs.soton.ac.uk/~harnad/Tp/resolution.htm
• Commercial publishers
– e.g. BioMed Central
• http://www.biomedcentral.com/