Implementation of Digital Libraries

Michael L. Nelson
Old Dominion University
mln@cs.odu.edu
http://www.cs.odu.edu/~mln/
Congreso Internacional de Información en Salud
Lima, Peru
May 28, 2004
Acknowledgements
• ODU: K. Maly, M. Zubair, J. Bollen
• LANL: R. Luce, X. Liu
• NASA: G. Roncaglia, J. Rocker, C. Mackey
• Cornell: C. Lagoze, S. Warner
• MAGiC (UK): Paul Needham
• and, of course, Herbert Van de Sompel (LANL)
– the OpenURL slides are nicked from his presentations
Outline
• A bit of history
• Core technologies & Issues
– OAI-PMH
• deep web
– OpenURL
– Handles / DOIs
– Object Models
covered only briefly
• Example implementations
• Download and go…
OAI-PMH
Background
• I met Herbert Van de Sompel in April 1999...
– we spoke of a demonstration project he had in mind, for which he had
received sponsorship from Paul Ginsparg and Rick Luce
– We wanted to demonstrate a multi-disciplinary DL that leveraged
the large number of high quality, yet often isolated, tech report
servers, e-print servers, etc.
• most digital libraries (DLs) had grown up along single disciplines or
institutions
– little to no interoperability; isolated DL “gardens”
– Universal Preprint Service
• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://web.archive.org/web/*/http://ups.cs.odu.edu/
• D-Lib Magazine, 6(2) 2000 (2 articles)
– http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI)
http://www.openarchives.org/
Result… OAI
• The OAI was the result of the demonstration and discussion
during the Santa Fe meeting
– OAI = a bunch of people, a religion, a cult, etc.
– OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol
created and maintained by the OAI
• Initial focus was on federating collections of scholarly e-print
materials…
• …however, interest grew and the scope and application of OAI-PMH
expanded to become a generic bulk metadata transport protocol
• Note:
– OAI-PMH is only about metadata -- not full text!
• but what is metadata vs. full-text?
– OAI is neutral with respect to the nature of the metadata or the
resources the metadata describes
• read: commercial publishers have an interest in OAI-PMH too...
OAI-PMH Mechanics
• Request is encoded in HTTP
• Response is encoded in XML
• XML Schemas for the responses are defined in the OAI-PMH document
Overview of OAI-PMH Verbs

Group               Verb                  Function
archival metadata   Identify              description of archive
                    ListMetadataFormats   metadata formats supported by archive
                    ListSets              sets defined by archive
harvesting verbs    ListIdentifiers       OAI unique ids contained in archive
                    ListRecords           listing of N records
                    GetRecord             listing of a single record

most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
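The verbs and arguments listed here can be sketched in a few lines of Python. This is a minimal illustration, not a full harvester: the base URL `repository.example.org`, the helper names, and the hand-written sample response are all assumptions made for the example. It shows a request encoded as an HTTP GET URL and a `resumptionToken` pulled from an XML response for flow control.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical repository address, used only for illustration.
BASE_URL = "http://repository.example.org/oai2.0/"


def build_request(base_url, verb, args=None):
    """Encode an OAI-PMH request as an HTTP GET URL."""
    params = {"verb": verb}
    params.update(args or {})
    return base_url + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    # An absent or empty resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hand-written, abridged sample response (not captured from a real server).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-05-28T12:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://repository.example.org/oai2.0/</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2004-01-15</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2004-02-20</datestamp>
    </header>
    <resumptionToken>token-page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""

first_url = build_request(
    BASE_URL, "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2004-01-01"},
)
identifiers, token = parse_list_identifiers(SAMPLE_RESPONSE)
# A follow-up request carries only the verb and the resumption token.
next_url = build_request(BASE_URL, "ListIdentifiers", {"resumptionToken": token})

print(first_url)
print(identifiers, token)
print(next_url)
```

A real harvester would fetch each URL over HTTP and repeat until no token is returned; the sketch leaves the network out so it can run anywhere.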
OAI-PMH Data Model
[Figure: a resource (David); an item holding all available metadata about David; and the item's records in Dublin Core, MARC, and SPECTRUM metadata formats.]
• item = identifier
• record = identifier + metadata format + datestamp
• set-membership is an item-level property
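One way to read the data model is as two small data structures. The sketch below is an interpretation, not code from any OAI-PMH toolkit: the class names, the identifier `oai:museum.example.org:david`, the set name, and the `marc` prefix are hypothetical. It keeps set membership on the item and keys each record by metadata format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata itself)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """item = identifier; set membership is a property of the item, not of a record."""
    identifier: str
    sets: set = field(default_factory=set)
    records: dict = field(default_factory=dict)

    def add_record(self, metadata_prefix, datestamp, metadata):
        # One record per metadata format, all sharing the item's identifier.
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


# Hypothetical item describing the resource from the slide.
david = Item("oai:museum.example.org:david", sets={"sculpture"})
david.add_record("oai_dc", "2004-05-01", "<dc>...</dc>")
david.add_record("marc", "2004-05-03", "<marc>...</marc>")

print(sorted(david.records))
print(david.records["oai_dc"].identifier)
```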
Data Providers / Service Providers
data providers
(repositories)
service providers
(harvesters)
Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
data providers
(repositories)
aggregator
service providers
(harvesters)
Aggregators
• Frequently interchangeable terms:
– aggregators: likely to be community / institutionally focused
– caches: stores a copy, less likely to be community-oriented
– proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
• Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
• To learn more about aggregators, caches & proxies:
– http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
– http://www.cs.odu.edu/~mln/jcdl03/
Example Aggregators
• Arc - http://arc.cs.odu.edu/
– first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001
• http://www.dlib.org/dlib/april01/liu/04liu.html
• Celestial - http://celestial.eprints.org/
– among other services, it provides a history of
harvests (successful vs. errors)
• http://celestial.eprints.org/cgi-bin/status
OAI-PMH 2.0 Registration
• 150+ repositories registered
• ??? unregistered repositories
• unregistered because:
– testing / development
– not for public harvesting
– public, but “low-profile”
– never got around to it…
– ???
• DP:SP ~= 5:1
Data Providers: http://www.openarchives.org/Register/BrowseSites.pl
Service Providers: http://www.openarchives.org/service/listproviders.html
Registration is Nice…
…But Not Required
• OAI-PMH is (becoming) the “http” for digital libraries
– there is no central registry of http servers
• remember the NCSA “What’s New” page? (ca. 1994)
• There will never be “registration support” in OAI-PMH
– registries are a type of service provider, built on top of OAI-PMH
– registration will be an integral part of community building
– friends…
NASA <friends> example
harvester
Identify
<friends>…</friends>
http://techreports.larc.nasa.gov/ltrs/oai2.0/
http://naca.larc.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://ntrs.nasa.gov/oai2.0/
http://horus.riacs.edu/perl/oai/
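A harvester can discover further repositories by reading the `<friends>` container in an Identify response. The sketch below assumes the friends namespace from the OAI implementation guidelines (worth checking against the current schema) and uses a hand-written, abridged response. Which repository lists which friends here is an assumption for illustration, not a record of the actual NASA configuration.

```python
import xml.etree.ElementTree as ET

# Namespace of the optional <friends> description container
# (as given in the OAI-PMH implementation guidelines; verify before relying on it).
FRIENDS_NS = "{http://www.openarchives.org/OAI/2.0/friends/}"


def extract_friends(identify_xml):
    """Return the base URLs listed in the <friends> container of an Identify response."""
    root = ET.fromstring(identify_xml)
    # Only baseURL elements in the friends namespace are collected, so the
    # repository's own baseURL (in the OAI-PMH namespace) is skipped.
    return [el.text.strip() for el in root.iter(FRIENDS_NS + "baseURL") if el.text]


# Hand-written, abridged Identify response for illustration only.
SAMPLE_IDENTIFY = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <baseURL>http://techreports.larc.nasa.gov/ltrs/oai2.0/</baseURL>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://naca.larc.nasa.gov/oai2.0/</baseURL>
        <baseURL>http://ntrs.nasa.gov/oai2.0/</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>
"""

friends = extract_friends(SAMPLE_IDENTIFY)
print(friends)
```

Following each friend's own Identify response in turn lets a harvester walk the community without any central registry.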
NACA Technical Report Server
http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/
• publicly available
– began in 1996
– details in NASA TM-1999-209127
• scanned reports from 1917-1958
– NACA = predecessor to NASA
• contents mirrored with the MAGiC project
– a UK-based grey-literature preservation project
– OAI-PMH used to mirror contents
NACA Report 1345
as seen through its native DL
http://naca.larc.nasa.gov/
NACA Report 1345
as seen through MAGiC
http://www.magic.ac.uk/
NACA Report 1345
as seen through Scirus
(Elsevier)
http://www.scirus.com/
NACA Report 1345
as seen through my.OAI
(FS Consulting)
http://www.myoai.com/
NASA Technical Report Server
http://ntrs.nasa.gov/
• replacement for the previous distributed searching version of NTRS
– MySQL
– Va Tech harvester
– modified “bucket”
– details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (March 2003)
• a service provider & aggregator
– same OAI baseURL as used for interactive searching
NASA Technical Report
Server
• advanced, fielded
search
• explicit query routing
– 12 NASA repositories
– 4 non-NASA
repositories
• turned “off” by
default
• >600k abstracts;
>300k full-text
Service Providers
• It is clear that SPs are proliferating, despite
(because of?) the inherent bias toward DPs in the
protocol
– easy to be a DP -> many DPs -> SPs eventually emerge
– hard to be a DP -> SPs starve
– currently 5x more DPs than SPs
• SPs are beginning to offer increasingly
sophisticated services
– competitive market originally envisioned for SPs is
emerging
Community Building
www.ndltd.org
Universidad Nacional Mayor de San Marcos
Colegio America
Pontificia Universidad Catolica del Peru
Universidad Nacional Federico Villarreal
Universidad Nacional de Trujillo
Universidad de Lima
Universidad Nacional Jorge Basadre Grohmann
Colegio Universitario Andino
Universidad del Pacifico
Universidad Peruana de Ciencias Aplicadas
OAI-PMH & The Deep Web
Exposing Repository Contents
• DP9: Webcrawler access to OAI-PMH
repositories
• http://dlib.cs.odu.edu/dp9/
• JCDL 02 http://www.cs.odu.edu/~liu_x/dp9/dp9.pdf
• An Apache module for OAI-PMH
– http://www.modoai.org/
• Extensible Repository Resource Locators
(ERRoLs) for OAI Identifiers
– http://www.oclc.org/research/projects/oairesolver/default.htm
Race for This New Market…
• Yahoo! & University of Michigan
– http://www.umich.edu/news/index.html?Releases/2004/Mar04/r031004
• Google & CrossRef
– http://www.nature.com/nature/focus/accessdebate/17.html
OpenURL
slides from Herbert Van de Sompel, LANL
Origins & Motivation
The Context: Library Automation Environment anno 1998
• distributed information environment
• local & remote A&I databases
• rapidly growing e-journal collection
• need to interlink the available information
The Problem:
• links are delivered by info providers
• links are not sensitive to user’s context
• appropriate copy problem
• links dependent on business agreements between
information vendors
• links don’t cover the complete collection
The REAL Problem:
• libraries have no say in linking
• libraries are losing core part of the “organizing
information” task
• expensive collection is not used optimally
• users are not well served
Origins & Motivation
The Solution:
In information services:
• DO NOT provide a link which is an actual service
related to a referenced item (e.g. a link from a record
in an A&I database to the corresponding full-text)
• BUT rather provide
• a link that transports metadata about the
referenced item (the OpenURL)
to
• others that are better placed to provide service
links (a linking server operated by the library)
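A link that transports metadata rather than pointing at a fixed target can be sketched as follows. This is a hedged illustration in the style of the OpenURL 0.1 draft (keys such as `genre`, `title`, `volume`, `issue`, `date`, and `sid`). The two resolver addresses and the `sid` value are hypothetical, and the referent is simply the D-Lib Magazine issue cited earlier in these slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def make_openurl(resolver_base, referent, sid=None):
    """Build an HTTP GET link that carries metadata about a referenced item."""
    params = {}
    if sid:
        # Identifies the information service that generated the link.
        params["sid"] = sid
    params.update(referent)
    return resolver_base + "?" + urlencode(params)


# Metadata about the referenced item.
referent = {
    "genre": "article",
    "title": "D-Lib Magazine",
    "volume": "6",
    "issue": "2",
    "date": "2000",
}

# Hypothetical linking servers run by two different libraries.
link_a = make_openurl("http://resolver.library-a.example.edu/menu", referent, sid="example:demo")
link_b = make_openurl("http://resolver.library-b.example.edu/menu", referent, sid="example:demo")

print(link_a)
print(parse_qs(urlsplit(link_b).query))
```

The same metadata goes to whichever linking server serves the user, so each library decides which service links to offer for the item.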
non-OpenURL linking
[Figure: a reference in the link source resource is resolved from metadata into a single link to the referenced work in the link destination resource.]
OpenURL linking
[Figure: the link source provides an OpenURL for a reference; the OpenURL transports metadata & identifiers to a user-specific linking server, which resolves the metadata & identifiers into services with links to several link destinations.]
default links:
• restricted in nature
• action-radius restricted by business agreements
• not context-sensitive
[Figure: resource1, resource2 and resource3 connected by default links on the metadata plane, with service component1 and service component2 on an extended services plane. Credit: Herbert Van de Sompel.]
NISO OpenURL Standardization Charge
• Use existing “OpenURL Framework” as starting point
• notion of context-sensitive services
• notion of transporting “contextual” metadata packages
to obtain context-sensitive services
• Define syntax and transport-method for “contextual”
metadata packages
• Ensure extensibility:
• must support future applications
• must support other information communities
=> Generalize and Standardize
NISO OpenURL Standardization Charge
Therefore, to be addressed were:
• OpenURL Framework beyond scholarly resources
• “contextual” metadata packages
• Syntax for “contextual” metadata packages
• Transport of “contextual” metadata packages
OpenURL Status
• (Nearly) a NISO standard
– check for details:
• http://library.caltech.edu/openurl/
Naming: Handles & DOIs
Naming
• Fundamental to other technologies (OAI-PMH, OpenURL, etc.)
• Options
– URNs
– Persistent URLs (PURLs)
• http://purl.org/
– Handles
• http://www.handle.net/
– Digital Object Identifiers
• http://www.doi.org/
– ARK
• http://www.cdlib.org/inside/diglib/ark/
“Inverted Archives”
• Unit of discourse is no longer an
archive or service, but a DOI which
has services linked from it
– cf.:
• UPS demonstration prototype
• “Smart Objects, Dumb Archives” (SODA)
model
Example
http://dx.doi.org/10.1145/374308.374342
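The example link above is a DOI appended to a resolver address. A minimal sketch of that mapping follows; the function name and the basic shape check are assumptions for illustration, and real DOI syntax rules are richer than this test.

```python
from urllib.parse import quote


def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Turn a DOI (prefix/suffix) into a resolvable HTTP link."""
    prefix, slash, suffix = doi.partition("/")
    # Very rough shape check: DOI prefixes start with "10." and a suffix must follow.
    if not (prefix.startswith("10.") and slash and suffix):
        raise ValueError("not a DOI: " + repr(doi))
    # Percent-encode characters that are unsafe in a URL path; "/" is kept.
    return resolver + quote(doi)


url = doi_to_url("10.1145/374308.374342")
print(url)
```

Because the resolver, not the link, knows where the object currently lives, the printed URL can stay stable even when the hosting service changes.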
Object Models
Popular Object Models
• METS
– used in DSpace, Fedora
– http://www.loc.gov/standards/mets/
• MPEG-21 DIDL
– http://xml.coverpages.org/mpeg21-didl.html
– used in LANL DLs
• http://www.dlib.org/dlib/november03/bekaert/11bekaert.html
• http://www.dlib.org/dlib/february04/bekaert/02bekaert.html
• http://lib-www.lanl.gov/~herbertv/papers/jcdl2004-submitteddraft.pdf
Object Models & OAI-PMH
[Figure: a resource, its item (oai:foo.edu:1234), and the item's records, with METS as a record format.]
Move from simple metadata files “pointing” to resources…
…to records as “modeled representations” of resources
Download and Go!
Where Do You Want to Build?
[Figure: architecture diagram showing a user, a service provider, multiple data providers, and local context-sensitive services, annotated with CDSware and EPrints.org among others.]
Fedora
• joint project between Cornell & UVa
– funded by the Mellon Foundation
• a repository management system
– focuses on complex digital objects and their
behaviors
• more info:
– http://www.fedora.info/
– D-Lib Magazine, 9(4)
• http://www.dlib.org/dlib/april03/staples/04staples.html
DSpace
• MIT + HP Labs
• constructed to capture all the output of
MIT’s faculty
• now generalized to the DSpace Federation
– 8 top universities in the US & Canada
• More info:
– http://www.dspace.org/
– http://sourceforge.net/projects/dspace/
– D-Lib Magazine 9(1)
• http://www.dlib.org/dlib/january03/smith/01smith.html
EPrints.org
• developed at Southampton University
– part of a larger suite of institutional/author self-archiving tools and services
• e.g.: citebase; paracite
• widely adopted -- 100+ sites
– http://software.eprints.org/#ep2
• more info
– http://www.eprints.org/
– http://www.arl.org/sparc/core/index.asp?page=g20#6
CDSware
• developed at CERN
• data provider & service provider
• large-scale use @ CERN (> 600k records)
– in use at a few non-CERN sites
• free & paid support models
• more info
– http://cdsware.cern.ch/
Kepler
• P2P publishing for academia
– community servers for coordination,
management
– archivelets for individual laptops, PCs
• more info:
– http://kepler.cs.odu.edu/
– D-Lib Magazine 7(4)
• http://www.dlib.org/dlib/april01/maly/04maly.html
OpenResolver
• developed by UKOLN
– open source
• OpenURL 0.1 format resolver
– NISO 1.0 format???
• more info:
– Ariadne, 28
• http://www.ariadne.ac.uk/issue28/resolver/
• ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/
• http://www.ukoln.ac.uk/distributed-systems/openurl/
Conclusions
Why The OAI-PMH
is NOT Important
• Users don’t care
• OAI-PMH is middleware
– if done right, the uninterested user should never have to
know
• Using OAI-PMH does not ensure a good SP
• OAI-PMH is (or is becoming) HTTP for DLs
– few people get excited about http now
• http & OAI-PMH are core technologies whose
presence is now assumed
Digital Library Technologies
• http
• XML
• OAI-PMH
• OpenURL ?
Other Uses For the OAI-PMH
• Assumptions:
– Traditional DLs / SPs will continue on their present path of
increasing sophistication
• citation indexing, search results viz, personalization, recommendations,
subject-based filtering, etc.
– growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any scenario that needs to
update / synchronize distributed state
– Future opportunities are possible by creatively interpreting the
OAI-PMH data model
• See Van de Sompel, Young & Hickey, D-Lib Magazine July 2003,
http://www.dlib.org/dlib/july03/young/07young.html
• Nelson, 2nd OAI Workshop,
http://agenda.cern.ch/askArchive.php?base=agenda&categ=a02333&id=a02333s5t8/transparencies
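The premise that OAI-PMH suits any update / synchronize scenario rests on datestamp-based selective harvesting. The sketch below imitates that idea entirely in memory: `list_records` is a stand-in for a ListRecords request with a `from` argument, and the record contents are invented for illustration. It ignores deleted records and sets.

```python
def list_records(repository, from_=None):
    """Stand-in for ListRecords: return records with datestamp >= from_ (inclusive)."""
    return [r for r in repository if from_ is None or r["datestamp"] >= from_]


def synchronize(repository, local_copy, last_harvest=None):
    """Pull records changed since last_harvest into local_copy; return the newest datestamp seen."""
    newest = last_harvest
    for record in list_records(repository, from_=last_harvest):
        local_copy[record["identifier"]] = record
        # ISO 8601 dates of equal granularity compare correctly as strings.
        if newest is None or record["datestamp"] > newest:
            newest = record["datestamp"]
    return newest


# Invented remote state, for illustration only.
repository = [
    {"identifier": "oai:example.org:1", "datestamp": "2004-01-15", "title": "A"},
    {"identifier": "oai:example.org:2", "datestamp": "2004-02-20", "title": "B"},
]
local_copy = {}

first_sync = synchronize(repository, local_copy)

# The remote side revises one record; only its datestamp tells the harvester.
repository[0] = {"identifier": "oai:example.org:1", "datestamp": "2004-05-01", "title": "A (revised)"}
second_sync = synchronize(repository, local_copy, last_harvest=first_sync)

print(first_sync, second_sync)
print(local_copy["oai:example.org:1"]["title"])
```

Because the `from` bound is inclusive, a record stamped exactly at the previous harvest point is fetched again; that is harmless here since the local copy is keyed by identifier.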
OpenURL Framework evolution
A spec based on HTTP GET to transport metadata about
• a scholarly referent &
• the context in which the referent is referenced
Draft Van de Sompel, Beit-Arie, Hochstenbach 05/2001
A framework Standard that enables different Communities
to:
• describe a referent
• describe the context in which the referent is referenced
• transport these descriptions
NISO Draft Standard 04/2003
The Future: Community Building
• Ultimately, protocols and metadata formats are
not what makes a difference
• Rather, the critical mass afforded by a common
set of utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language
Archives Community
– http://www.language-archives.org/
• OAI-PMH provides the basis for communication
between strangers, but allows even richer
communication between friends
Further Reading
• Gerry McKiernan, Library Hi-Tech News
– http://www.public.iastate.edu/~gerrymck/OAI-SP-I.pdf
– http://www.public.iastate.edu/~gerrymck/OAI-SP-II.pdf
– http://www.public.iastate.edu/~gerrymck/OAI-SP-III.pdf
• Open Archives Forum OAI-PMH Tutorial
– http://www.oaforum.org/tutorial/
• “A Survey of Digital Library Aggregation
Services”
– http://www.diglib.org/pubs/brogan/
• Open Access News
– http://www.earlham.edu/~peters/fos/fosblog.html
• Guide To Institutional Repository Software
– http://www.soros.org/openaccess/software/
Great Stuff I Did Not Cover…
• OAI-PMH
– Static Repositories
• http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm
– OAI-Rights
• http://www.openarchives.org/documents/OAIRightsWhitePaper.html
• http://www.openarchives.org/news/oairightspress030929.html
• Digital Preservation
– http://www.digitalpreservation.gov/