OAI and Metadata
Harvesting
Mukesh Pund
Scientist,
NISCAIR
New Delhi
Acknowledgements


While preparing this presentation, I used material on
OAI-PMH from several sources by other authors.
I gratefully acknowledge these sources.
Digital Repositories:
Current Situation





Mushrooming number and variety of distributed
digital repositories (archives, digital libraries)
Use of a variety of hardware, software, and database
solutions
Use of different search and retrieval interfaces
Most of the content is not indexed by web
search engines
Content resides in backend databases – not
picked up by web search engines
Problems faced by Users



How to identify and retrieve relevant information
from different repositories?
Visiting and searching individual repositories is
very expensive
Key Requirement: How do we support cross
searching?
Current Solutions


Federated/ distributed searching
– Z39.50 Information Retrieval protocol
Metadata harvesting
– OAI-PMH protocol
Federated/ distributed searching




Protocol: "Information Retrieval (Z39.50): Application
Service Definition and Protocol Specification", (ISO/ ANSI
standard) (v1-1991, v2-1992, v3-1995)
Client-Server model (TCP/IP Service)
Process:
– Client (“Origin”) sends queries, formatted according to
Z39.50, to the repository server (“Target”).
– Server translates each query into its local query format,
searches the database, and sends the results to the client,
formatted according to Z39.50
– Client translates the results and presents them to the user
Client can send the same query to any number of
Z39.50-compliant servers
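The fan-out pattern described above can be sketched as follows. This is only a simulation under stated assumptions: it does not speak Z39.50, and the `TARGETS` dictionary and the `search_target` and `federated_search` functions are hypothetical stand-ins for remote servers and a real client library.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "targets"; a real target would be a remote Z39.50 server.
TARGETS = {
    "library-a": ["Metadata Basics", "Digital Libraries"],
    "library-b": ["Harvesting Metadata", "Cataloguing Rules"],
}


def search_target(name, query):
    """Stand-in for one live query/response round trip to a single target."""
    titles = TARGETS[name]
    return [(name, title) for title in titles if query.lower() in title.lower()]


def federated_search(query):
    """Send the same query to every target concurrently and merge the results."""
    names = sorted(TARGETS)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map keeps the results in the same order as `names`.
        per_target = pool.map(lambda name: search_target(name, query), names)
        return [hit for hits in per_target for hit in hits]


results = federated_search("metadata")
```

Every user query costs one live round trip per target, which is why this approach becomes harder to run as the number of nodes grows.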
Z39.50 protocol …



Example implementation: Distributed searching
of library catalogues/ bibliographic databases
Problem - performance
– Implementation not easy
– Does not scale well (if nodes > 100)
– Network bandwidth
– Z39.50 implementation needed at the client (“Origin”) end
Z39.50 resources:
http://lcweb.loc.gov/z3950/agency/ (Z39.50
International Maintenance Agency, Library of
Congress)
OAI-PMH Vs. Z39.50


OAI-PMH: Indexed search, much like general web search
engines. Requires both service providers and data
providers
Z39.50: Concurrent (live) search of each target. No service
providers, only data providers
OAI-PMH


Open Archives Initiative Protocol for
Metadata Harvesting
Protocol Version 2.0 of 2002-06-14
http://www.openarchives.org
Open Archives Initiative (OAI)
The protocol is openly
documented, and metadata
is “exposed” to at least some
peer group (note: rights
management can still apply!)
Archive defined as a
dynamic “collection of
stuff” -- not the archivist’s
definition of “archive”.
“Repository” used in most
OAI documents.
OAI is happening
at break-neck speed...
Metadata Harvesting



Move away from distributed searching (e.g.,
Z39.50)
Extract metadata from various sources
Build services on local copies of metadata
– Resources remain at remote repositories
[Figure: metadata harvesting model. Metadata is harvested offline from each node into a local copy of metadata; all searching, browsing, etc. is performed on that local copy. Each node is independently maintained, and individual nodes can still support direct user interaction.]
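The harvesting model can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `LocalMetadataStore` class and made-up identifiers and titles; it shows only that searching touches the local copy and never the remote repositories.

```python
from dataclasses import dataclass, field


@dataclass
class LocalMetadataStore:
    """A service provider's local copy of harvested metadata (hypothetical)."""

    records: dict = field(default_factory=dict)  # identifier -> metadata dict

    def harvest(self, repository_records):
        """Merge metadata collected offline from one repository."""
        for identifier, metadata in repository_records.items():
            self.records[identifier] = metadata

    def search(self, term):
        """Search only the local copy; no live request is sent to any node."""
        term = term.lower()
        return sorted(
            identifier
            for identifier, metadata in self.records.items()
            if term in metadata.get("title", "").lower()
        )


store = LocalMetadataStore()
store.harvest({"oai:repo-a.example:1": {"title": "Metadata Harvesting Basics"}})
store.harvest({
    "oai:repo-b.example:7": {"title": "Digital Libraries"},
    "oai:repo-b.example:9": {"title": "Harvesting at Scale"},
})
hits = store.search("harvesting")
```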
Data and Service Providers



Data Provider
– Creators and keepers of the metadata as well as
repositories of resources
– Give free access to metadata (not necessarily
free access to full texts / resources)
Service Provider
– Harvest and store metadata (no live requests!)
– May select certain subsets from Data Providers
(set hierarchy, date stamp) for selective
harvesting
– May enrich metadata
– Offer (value-added) service on the basis of the
metadata
One ‘service’ can play both roles (Aggregators)
Multiple Data and Service
Providers
[Figure: data providers are harvested via OAI-PMH by service providers, either directly or through an aggregator that sits between data providers and service providers.]
OAI-PMH v.2.0 [06/2002]









Low-barrier interoperability specification
Metadata harvesting model: data provider /
service provider
Metadata about resources
Autonomous protocol
Not a search protocol!
HTTP based
XML responses
Unqualified Dublin Core
Stable: backward compatible
OAI Data Model:
Resources / Items / Records
[Figure: OAI data model, using the Mona Lisa as the example resource]
resource: the Mona Lisa
item = identifier; holds all available metadata about the Mona Lisa
records: Dublin Core metadata, MARC metadata, SPECTRUM metadata
record = identifier + metadata format + datestamp
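The item/record relationship can be sketched as a small data structure. The class names, the identifier `oai:museum.example:mona-lisa`, the datestamps and the metadata strings are all hypothetical; the sketch shows only that one item yields one record per metadata format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata)."""

    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """An item is addressed by its identifier and can yield several records."""

    identifier: str
    records: dict = field(default_factory=dict)  # metadata prefix -> Record

    def add_record(self, metadata_prefix, datestamp, metadata):
        self.records[metadata_prefix] = Record(
            self.identifier, metadata_prefix, datestamp, metadata
        )

    def get_record(self, metadata_prefix):
        """Return the record for one format, as GetRecord would for one item."""
        return self.records[metadata_prefix]


mona_lisa = Item("oai:museum.example:mona-lisa")
mona_lisa.add_record("oai_dc", "2002-06-14", "<dc:title>Mona Lisa</dc:title>")
mona_lisa.add_record("marc", "2002-06-15", "245 $a Mona Lisa")
dc_record = mona_lisa.get_record("oai_dc")
```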
Harvesting: How it works
Six OAI “Verbs”
Identify
ListMetadataFormats
ListSets
ListIdentifiers
ListRecords
GetRecord
[Figure: the harvester (service provider) sends an HTTP request carrying an OAI verb to the repository (metadata provider), which replies with an HTTP response containing valid XML]
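Each request is an ordinary HTTP request whose query string carries the verb and its arguments. A minimal sketch of building such a URL, assuming a hypothetical helper named `build_request_url` and the example baseURL used elsewhere in these slides:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The six verbs defined by OAI-PMH 2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request_url(base_url, verb, **arguments):
    """Return the URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **arguments})


BASE_URL = "http://192.168.0.12/dspace-oai/request"  # example address from the slides
identify_url = build_request_url(BASE_URL, "Identify")
get_record_url = build_request_url(
    BASE_URL, "GetRecord",
    identifier="oai:192.168.0.12:123456789/3", metadataPrefix="oai_dc",
)
# Decoding the query string recovers the original argument values.
decoded = parse_qs(urlsplit(get_record_url).query)
```

Note that `urlencode` percent-encodes the `:` and `/` characters in the identifier, so the generated URL looks different from the hand-written examples but decodes to the same arguments.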
Harvester


A harvester is a client application that issues OAI-PMH requests.
A harvester is operated by a service provider as a
means of collecting metadata from repositories.
Repository


A repository is a network-accessible server that can
process OAI-PMH requests.
A repository is managed by a data provider to
expose metadata to harvesters.
Resource

A resource is the object or "stuff" that metadata is
"about". The nature of a resource, whether it is
physical or digital, or whether it is stored in the
repository or is a constituent of another database,
is outside the scope of the OAI-PMH
Item


An item is a constituent of a repository from which
metadata about a resource can be disseminated.
That metadata may be disseminated on-the-fly
from the associated resource, cross-walked from
some canonical form, actually stored in the
repository, etc.
Record


A record is metadata in a specific metadata format.
A record is returned as an XML-encoded byte
stream in response to a protocol request to
disseminate a specific metadata format from a
constituent item.
Unique Identifier


A unique identifier unambiguously identifies an item
within a repository
The unique identifier is used in OAI-PMH requests
for extracting metadata from the item.
cont…
Unique Identifier


The unique identifier must conform to URI (Uniform
Resource Identifier) syntax
Repositories may implement the oai-identifier format
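The oai-identifier format has three colon-separated parts: the scheme `oai`, a namespace identifier, and a local identifier. A minimal sketch of splitting one, assuming a hypothetical helper `parse_oai_identifier`; it only checks the three-part shape and does not validate the namespace part.

```python
def parse_oai_identifier(identifier):
    """Split 'oai:<namespace>:<local id>' into its three parts."""
    # Split at most twice so colons inside the local identifier are preserved.
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError(f"not an oai-identifier: {identifier!r}")
    scheme, namespace, local_id = parts
    return {"scheme": scheme, "namespace": namespace, "local_id": local_id}


# Identifier taken from the GetRecord example in these slides.
parsed = parse_oai_identifier("oai:192.168.0.12:123456789/3")
```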
Role of Identifier



Unique identifiers play two roles in the protocol:
Response: Identifiers are returned by both the
ListIdentifiers and ListRecords requests.
Request: An identifier, in combination with a
metadataPrefix, is used in the GetRecord request
as a means of requesting a record in a specific
metadata format from an item
OAI-PMH Verbs






Identify
ListSets
ListMetadataFormats
ListIdentifiers
GetRecord
ListRecords
Identify

Returns general information about the repository:
– Archive and its policies
– Datestamp
– Granularity
Ex:
http://192.168.0.12/dspace-oai/request?verb=Identify
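A minimal sketch of reading an Identify response with Python's standard library. The XML below is a hypothetical, abbreviated response written for illustration (repository name, e-mail address and dates are made up); `parse_identify` is likewise an illustrative name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, abbreviated Identify response; real values depend on the repository.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-08-24T10:00:00Z</responseDate>
  <request verb="Identify">http://192.168.0.12/dspace-oai/request</request>
  <Identify>
    <repositoryName>Example DSpace Repository</repositoryName>
    <baseURL>http://192.168.0.12/dspace-oai/request</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2002-12-01T00:00:00Z</earliestDatestamp>
    <deletedRecord>persistent</deletedRecord>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Pick out the fields a harvester usually needs before harvesting."""
    identify = ET.fromstring(xml_text).find("oai:Identify", OAI_NS)
    wanted = ("repositoryName", "baseURL", "protocolVersion",
              "earliestDatestamp", "deletedRecord", "granularity")
    return {name: identify.findtext(f"oai:{name}", namespaces=OAI_NS)
            for name in wanted}


info = parse_identify(SAMPLE_IDENTIFY)
```

The `granularity` value tells the harvester whether dates in later requests may include a time of day or must be plain `YYYY-MM-DD`.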

ListSets



Provide a listing of sets in which records may be
organized (may be hierarchical, overlapping, or flat)
Example:
http://192.168.0.12/dspace-oai/request?verb=ListSets
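A minimal sketch of reading a ListSets response. In OAI-PMH a hierarchical set is written with colon-separated parts in its setSpec (for example `physics:optics`). The sample XML, the set names and the `parse_sets` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListSets response mixing a flat DSpace-style set and a hierarchy.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>hdl_1849_2</setSpec><setName>Example Collection</setName></set>
    <set><setSpec>physics</setSpec><setName>Physics</setName></set>
    <set><setSpec>physics:optics</setSpec><setName>Optics</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return each set with its spec, name and parent spec (None if top level)."""
    root = ET.fromstring(xml_text)
    sets = []
    for node in root.iterfind("oai:ListSets/oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", namespaces=OAI_NS)
        name = node.findtext("oai:setName", namespaces=OAI_NS)
        parent = spec.rsplit(":", 1)[0] if ":" in spec else None
        sets.append({"spec": spec, "name": name, "parent": parent})
    return sets


sets = parse_sets(SAMPLE_LIST_SETS)
```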
ListMetadataFormats



Lists metadata formats supported by the archive as
well as their schema locations and namespaces
Example:
http://192.168.0.12/dspace-oai/request?verb=ListMetadataFormats
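A minimal sketch of reading a ListMetadataFormats response. The sample XML is written for illustration and lists only `oai_dc`, the unqualified Dublin Core format that every OAI-PMH repository must support; `parse_metadata_formats` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative response listing a single format.
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def parse_metadata_formats(xml_text):
    """Map each metadataPrefix to its schema location and namespace."""
    root = ET.fromstring(xml_text)
    formats = {}
    for node in root.iterfind("oai:ListMetadataFormats/oai:metadataFormat", OAI_NS):
        prefix = node.findtext("oai:metadataPrefix", namespaces=OAI_NS)
        formats[prefix] = {
            "schema": node.findtext("oai:schema", namespaces=OAI_NS),
            "namespace": node.findtext("oai:metadataNamespace", namespaces=OAI_NS),
        }
    return formats


formats = parse_metadata_formats(SAMPLE_FORMATS)
```

The prefixes returned here are the values a harvester may pass as `metadataPrefix` in later requests.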
ListIdentifiers


Lists the headers of all items that match the
specified parameters
Example:
http://192.168.0.12/dspace-oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc
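A header carries the identifier, a datestamp, zero or more setSpec values, and an optional `status="deleted"` attribute. A minimal sketch of reading them, using a hypothetical sample response and a hypothetical `parse_headers` helper:

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListIdentifiers response with one live and one deleted item.
SAMPLE_LIST_IDENTIFIERS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:192.168.0.12:123456789/3</identifier>
      <datestamp>2003-01-15T09:30:00Z</datestamp>
      <setSpec>hdl_1849_2</setSpec>
    </header>
    <header status="deleted">
      <identifier>oai:192.168.0.12:123456789/4</identifier>
      <datestamp>2003-02-01T12:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def parse_headers(xml_text):
    """Return one dict per header: identifier, datestamp, sets, deleted flag."""
    root = ET.fromstring(xml_text)
    return [
        {
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
            "deleted": header.get("status") == "deleted",
        }
        for header in root.iterfind("oai:ListIdentifiers/oai:header", OAI_NS)
    ]


headers = parse_headers(SAMPLE_LIST_IDENTIFIERS)
```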
GetRecord



Returns the metadata for a single item in the form of an
OAI record
Example:
http://192.168.0.12/oai/request?verb=GetRecord&identifier=oai:192.168.0.12:123456789/3&metadataPrefix=oai_dc
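A minimal sketch of extracting the Dublin Core fields from a GetRecord response. The sample XML, its title and its creators are hypothetical; `parse_get_record` is an illustrative name. Dublin Core elements are repeatable, so each field is collected as a list.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical GetRecord response for the identifier used in the example URL.
SAMPLE_GET_RECORD = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:192.168.0.12:123456789/3</identifier>
        <datestamp>2003-01-15T09:30:00Z</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Title</dc:title>
          <dc:creator>First Author</dc:creator>
          <dc:creator>Second Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_get_record(xml_text):
    """Return the header fields plus a dict of Dublin Core element lists."""
    record = ET.fromstring(xml_text).find("oai:GetRecord/oai:record", NS)
    header = record.find("oai:header", NS)
    fields = {}
    for element in record.find("oai:metadata/oai_dc:dc", NS):
        name = element.tag.split("}", 1)[1]  # drop the "{namespace}" part
        fields.setdefault(name, []).append(element.text)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "metadata": fields,
    }


record = parse_get_record(SAMPLE_GET_RECORD)
```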
08/24/07
ListRecords


Retrieves metadata records for multiple items
http://192.168.0.12/dspace-oai/request?verb=ListRecords&metadataPrefix=oai_dc
ListIdentifiers


To list only the identifiers of items changed since a given date:
http://192.168.0.12/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
Selective Harvesting





By date
&from=2002-12-01 OR
&from=2002-12-01&until=2003-12-01
By set (collection in DSpace)
&set=hdl_1849_2
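The date and set arguments above can be combined in one request. A minimal sketch of assembling them, assuming day-level granularity (`YYYY-MM-DD`) and a hypothetical helper named `selective_arguments`:

```python
from datetime import date
from urllib.parse import urlencode


def selective_arguments(metadata_prefix, from_date=None, until_date=None,
                        set_spec=None):
    """Build ListRecords arguments for harvesting by date range and/or set."""
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        date.fromisoformat(from_date)  # raises ValueError on a malformed date
        arguments["from"] = from_date
    if until_date:
        date.fromisoformat(until_date)
        arguments["until"] = until_date
    if from_date and until_date and from_date > until_date:
        raise ValueError("'from' must not be later than 'until'")
    if set_spec:
        arguments["set"] = set_spec
    return arguments


query = urlencode(selective_arguments(
    "oai_dc", from_date="2002-12-01", until_date="2003-12-01",
    set_spec="hdl_1849_2",
))
```

The resulting query string is appended to the repository's baseURL after a `?`.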
Useful Sites



OAI-PMH Official Site:
– http://www.openarchives.org/
Testing your OAI-PMH compatibility
– http://oai.dlib.vt.edu/cgi-bin/Explorer/2.01.45/testoai
Registering your Digital Repository
– http://www.openarchives.org/data/registerasprovider.html
OAI Service Provider Software
(Harvesters)



PKP Harvester:
– University of British Columbia, Canada
– http://www.pkp.ubc.ca/pkp-harvester/
DLESE
– Digital Library for Earth System Education
– http://sourceforge.net/projects/dlese-oai/
ARC
– Old Dominion University, Virginia
– http://arc.cs.odu.edu/
OAI Data Provider Software


OAICat
– OCLC
– http://www.oclc.org/research/software/oai/cat.htm
DLESE
– Digital Library for Earth System Education
– http://sourceforge.net/projects/dlese-oai/
What do baseURLs look like?

DSpace repositories
– NSDL repository: http://202.54.99.9/dspace
– Its OAI-PMH baseURL: http://202.54.99.9/dspace-oai/request
OAI Tools

http://www.openarchives.org/tools/tools.html
Thank
You