mod_oai: Metadata Harvesting for Everyone

Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango
{mln,aelango}@cs.odu.edu
{herbertv,liu_x}@lanl.gov
DLF 2004 Fall Forum
Baltimore MD
October 25-27, 2004
mod_oai is sponsored by the Andrew W. Mellon Foundation
Outline
• mod_oai
– crawling vs. harvesting
– complex objects & OAI-PMH
– how mod_oai works
– scenarios
– demos
• More information
– http://www.modoai.org/
– http://www.openarchives.org/
Inefficient Web Crawlers
what documents have been
modified since 2003-11-15?
www.getty.edu
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
…
doc100; last mod 2003-09-113
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
A More Efficient Way…
what documents have been
modified since 2003-11-15?
www.getty.edu
with OAI-PMH
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
…
doc100; last mod 2003-09-113
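The difference between the two slides can be sketched in a few lines of Python. This is only an illustration of date-based selection, not mod_oai code: the dictionary holds the doc1 through doc4 dates shown above, and the function name `modified_since` is made up for this example. A crawler has to check every document to answer the question, while a server that knows its own datestamps can answer it in one step.

```python
from datetime import date

# Last-modified dates copied from the slide (doc1 through doc4 only).
last_modified = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc3": date(2003, 11, 29),
    "doc4": date(2002, 10, 3),
}


def modified_since(docs, since):
    """Return the names of documents last modified on or after `since`."""
    return sorted(name for name, stamp in docs.items() if stamp >= since)


# "What documents have been modified since 2003-11-15?"
changed = modified_since(last_modified, date(2003, 11, 15))
print(changed)
```

Of the four documents listed, only doc3 qualifies, so a harvester would need to fetch just that one.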
mod_oai
• Goal: integrate OAI-PMH functionality into
the web server itself…
• mod_oai: an Apache 2.0 module to
automatically answer OAI-PMH requests
for an http server
– written in C
– respects values in .htaccess, httpd.conf
• Result: web harvesting with OAI-PMH
semantics (e.g., from, until, sets)
• www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
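A request like the one above is an ordinary HTTP GET with OAI-PMH arguments in the query string. The sketch below assembles such a URL with Python's standard library; the helper name `oai_request_url` is hypothetical, and www.foo.edu is the placeholder host from the slide. Note that OAI-PMH requires the `verb` argument.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments):
    """Build an OAI-PMH request URL from a baseURL, a verb, and its arguments."""
    query = {"verb": verb}
    query.update(arguments)
    # Keep ':' unescaped so a set such as video:mpeg stays readable.
    return f"{base_url}?{urlencode(query, safe=':')}"


url = oai_request_url(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "video:mpeg"},
)
print(url)
```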
OAI-PMH data model
[Figure: OAI-PMH data model]
• resource
• item: OAI-PMH identifier = entry point to all records pertaining to the resource
• records: metadata pertaining to the resource / modeled representation of the resource
– Dublin Core metadata (simple model)
– MARCXML metadata (more expressive model)
– MPEG-21 DIDL (complex model)
– METS (complex model)
OAI-PMH and complex models
• OAI-PMH record == modeled representation of the resource
• Can be selectively harvested via OAI-PMH by datestamp and set
• Resource can be:
– simple object (1 file)
– compound object (multiple files)
• OAI-PMH records can contain:
– Typical metadata
– Actual resource(s)
• By-Value – base64 encoded
• By-Reference – http address of resource
• both
– Identifiers of metadata and resource(s), unambiguously mapped to the identified data
– A variety of secondary information
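The by-value and by-reference options above can be illustrated with a small Python sketch. The dictionary layout and the function names `by_value` and `by_reference` are assumptions made for this example; they are not the MPEG-21 DIDL schema, which expresses the same choice in XML.

```python
import base64


def by_value(content: bytes) -> dict:
    """Embed the resource itself, base64 encoded so it is safe inside XML."""
    return {
        "mode": "by-value",
        "encoding": "base64",
        "data": base64.b64encode(content).decode("ascii"),
    }


def by_reference(url: str) -> dict:
    """Point at the resource by its http address instead of embedding it."""
    return {"mode": "by-reference", "ref": url}


payload = b"<html>hello</html>"
embedded = by_value(payload)
linked = by_reference("http://www.foo.edu/index.html")

# A harvester recovers the original bytes from the by-value form.
restored = base64.b64decode(embedded["data"])
print(restored == payload, linked["ref"])
```

By-value makes the record self-contained at the cost of size (base64 adds roughly one third); by-reference keeps the record small but requires a second HTTP request.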
Complex Objects & OAI-PMH
• LANL Repository
– OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
• APS/LANL/LoC Mirroring
– OAI-PMH transfer of APS content represented in application-neutral format (DIDLs)
• LANL DSpace Plug-in
– Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure
How mod_oai works
• Install on an Apache 2.0 server
– compile & edit httpd.conf
http://www.foo.edu/
now has an OAI-PMH baseURL of:
http://www.foo.edu/modoai
OAI-PMH characteristics: Typical Repository
OAI-PMH Entity |                | value          | description
Resource       |                | URL            | PDF, PS, XML, HTML or other file
Item           | identifier     | OAI Identifier | DNS-based name of metadata about resource
Item           | set membership | LCSH           | Library of Congress Subject Heading
Record         | metadataPrefix | oai_dc         | bibliographic metadata in Dublin Core
Record         | datestamp      | 2004-10-18     | modification date of DC record
Record         | metadataPrefix | oai_marc       | bibliographic metadata in MARC
Record         | datestamp      | 2004-07-31     | modification date of MARC record
OAI-PMH Data Model in mod_oai
[Figure: OAI-PMH data model as interpreted by mod_oai]
• resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
• item: OAI Identifier == URL of Resource; Set membership == MIME type
• records (DC, HTTP, DIDL Modeled Representations):
– Dublin Core metadata
– HTTP headers
– DIDL: base64 or urls + HTTP headers
OAI-PMH characteristics: mod_oai
OAI-PMH Entity |                | value       | description
Resource       |                | URL         | HTML, GIF, PDF or other web file
Item           | identifier     | URL         | same URL as the resource
Item           | set membership | MIME type   | MIME type of the resource
Record         | metadataPrefix | http_header | the http headers that would have been returned via HTTP GET/HEAD
Record         | datestamp      | 2004-07-31  | modification date of resource
Record         | metadataPrefix | oai_dc      | a subset of http_header in DC
Record         | datestamp      | 2004-07-31  | modification date of resource
Record         | metadataPrefix | oai_didl    | MPEG-21 DIDL: base64 encoded resource + http_header metadata
Record         | datestamp      | 2004-07-31  | modification date of resource
OAI-PMH Concepts
concept         | mod_oai interpretation
OAI Identifier  | URL of resource
set             | MIME type of resource
datestamp       | change time of resource
deleted records | “no” deleted records
http_header
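The mapping in this table can be sketched for a single file on disk. This is a Python illustration under stated assumptions, not the module's C implementation: the function name `describe_resource` is hypothetical, the file's modification time stands in for the datestamp, and the set is written with a colon (as in the `set=video:mpeg` example earlier) on the assumption that the "/" of the MIME type is replaced.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def describe_resource(base_url, doc_root, path):
    """Derive identifier, set, and datestamp for one file under doc_root."""
    relative = os.path.relpath(path, doc_root).replace(os.sep, "/")
    mime_type, _ = mimetypes.guess_type(path)
    mime_type = mime_type or "application/octet-stream"
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return {
        "identifier": f"{base_url}/{relative}",   # OAI Identifier == URL
        "set": mime_type.replace("/", ":"),       # set == MIME type
        "datestamp": modified.strftime("%Y-%m-%d"),
    }


# Demonstration with a throwaway document root and a fixed timestamp.
with tempfile.TemporaryDirectory() as doc_root:
    path = os.path.join(doc_root, "index.html")
    with open(path, "w") as handle:
        handle.write("<html></html>")
    stamp = datetime(2004, 7, 31, 12, tzinfo=timezone.utc).timestamp()
    os.utime(path, (stamp, stamp))
    item = describe_resource("http://www.foo.edu", doc_root, path)

print(item)
```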
Use Cases
• Regular Web Crawling
– use ListIdentifiers to discover URLs
– add new URLs to the list of URLs to be
crawled
• Harvesting Resources w/ OAI-PMH
– use ListRecords to extract the entire
resource as an MPEG-21 DIDL AIP
Regular Crawling: ListIdentifiers
harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
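The crawling scenario above amounts to reading the identifiers out of a ListIdentifiers response and queueing them for ordinary HTTP GETs. The Python sketch below parses a hand-written sample response; the sample URLs, dates, and setSpec values are invented for illustration, and only the element names and namespace come from the OAI-PMH 2.0 schema. No network access is performed.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample; a real response would come from the baseURL.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-10-25T12:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc"
           from="2004-09-15">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/talk.pdf</identifier>
      <datestamp>2004-10-01</datestamp>
      <setSpec>application:pdf</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_urls(response_xml):
    """Return the identifier (here, the URL) of every header in the response."""
    root = ET.fromstring(response_xml)
    return [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]


crawl_queue = updated_urls(SAMPLE_RESPONSE)
print(crawl_queue)
```

Because mod_oai uses the resource URL as the OAI identifier, the extracted identifiers can go straight into a crawler's list of URLs to fetch.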
Resource Harvesting: ListRecords
harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)
Demo
• Repository Explorer
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
• Direct URLs
– http://whiskey.cs.odu.edu/modoai?verb=Identify
– http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
– http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
– http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
– http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl
Datestamps and Etags
L. Clausen, “Concerning Etags and Datestamps”,
4th International Web Archiving Workshop, ECDL 2004
http://www.netarchive.dk/website/publications/Etags-2004.pdf
• Procedure
– 16 harvests over 1 month of 465,374 .dk domains
– 5,543,470 possible downloads
– 5,182,034 successful downloads
– 599,143 changes
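The counts above are easier to compare as rates. The short Python calculation below uses only the three figures from the slide; treating the changes as a fraction of successful downloads is an assumption of this sketch, since the slide does not state the denominator.

```python
possible = 5_543_470    # possible downloads
successful = 5_182_034  # successful downloads
changes = 599_143       # changes observed

success_rate = successful / possible
# Assumption: changes are counted among the successful downloads.
change_rate = changes / successful

print(f"successful downloads: {success_rate:.1%}")
print(f"downloads that found a change: {change_rate:.1%}")
```

Under that assumption, most downloads find nothing new, which is the motivation for asking the server what changed before fetching.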
Datestamp and Etag Example
Errors in Datestamps and Etags Indicating Change
                | Etags  | Datestamps
missed change   | 0.087% | 0.30%
redundant crawl | 32%    | 10.7%
40.1% of pages without Etags
0.07% of pages without Datestamps
mod_oai…
• is:
– a simple way to more efficiently harvest web pages
– a possible impact on robots.txt
– fully OAI-PMH compliant
• works with existing harvesters
• is not:
– yet suitable for dynamic files
– a replacement for
• DSpace
• Fedora
• eprints.org
• other digital libraries / repositories / cms