A New Model for Web Resource Harvesting Michael L. Nelson

advertisement
A New Model for Web Resource Harvesting
Michael L. Nelson
Old Dominion University
joint work with:
Her
Herbert Van de Sompel
Xiaoming Liu
Carl Lagoze
Simeon Warner
Terry Harrison
This work supported in part by the Andrew Mellon Foundation & Library of Congress
My Research Interests
•
Digital Objects and Repositories
o
o
•
Digital Preservation
o
o
•
Interaction between them: roles, responsibilities, architecture, scalability
Bumper sticker: Free the Object from the Tyranny of the Repository
Shared infrastructure models, automation, large-scale best effort
strategies
Bumper sticker: We Need Fewer Heroes
User / System Co-Evolution
o
o
Discerning intent from access large-scale patterns
Bumper sticker: Freedom From Choice
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
WWW and DL: Separated at Birth
WWW
The Good: XML, BitTorrent, Web Services
The Bad: RSS
The Ugly: Semantic Web
WWW
DL
DL
The Good: OAIS, DOI, OAI-PMH
The Bad: Dublin Core
The Ugly: SRU/W
Today
1994
The problem is not that the WWW doesn’t work; it clearly does.
The problem is that our expectations have been lowered.
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
more on DL/WWW, from the NSF Post-DL Workshop:
http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html
http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html
Web Robots
what documents have been
modified since 2003-11-15 ?
what is this file?
what are its relationships to other files?
how often does it change?
www.getty.edu
doc1; last mod
2003-03-12
doc2; last mod
2002-07-19
…
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
doc100; last mod
2003-09-11
A More Efficient Way
<co>
<metadata/>
<link/>
<link/>
<change/>
…
</co>
what documents have been
modified since 2003-11-15 ?
www.getty.edu
with mod_oai
doc1; last mod
2003-03-12
doc2; last mod
2002-07-19
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
…
doc100; last mod
2003-09-11
Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
A Very Brief OAI-PMH History
•
Universal Preprint Service
o
o
A cross-archive DL that that provides services on a collection of metadata
harvested from multiple archives
- not distributed searching
Demonstrated at Santa Fe NM, October 21-22, 1999
- D-Lib Magazine, 6(2) 2000 (2 articles)
– http://www.dlib.org/dlib/february00/02contents.html
o
•
UPS was soon renamed the Open Archives Initiative (OAI)
http://www.openarchives.org/
The OAI has authored, among other things, the Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH)
o
in use around the world, 600+ known instances
- registration not required; many unknown instances
– http://gita.grainger.uiuc.edu/registry/
– http://celestial.eprints.org/cgi-bin/status
o
used by Google ca. late 2004
- http://www.nla.gov.au/digicoll/oai/
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Data Providers /
Repositories
“A repository is a network accessible server that
can process the 6 OAI-PMH requests …
A repository is managed by a data provider to
expose metadata to harvesters.”
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Service Providers /
Harvesters
“A harvester is a client application that
issues OAI-PMH requests. A harvester is
operated by a service provider as a means
of collecting metadata from repositories.”
Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
data providers
(repositories)
aggregator
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
service providers
(harvesters)
OAI-PMH data model
resource
OAI-PMH sets
OAI-PMH identifier
OAI-PMH identifier
metadataPrefix
datestamp
entry point to all records pertaining to the resource
Dublin Core
metadata
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
MARCXML
metadata
item
records
metadata pertaining
to the resource
Overview of OAI-PMH Verbs
Verb
metadata
about the
repository
harvesting
verbs
Function
Identify
description of repository
ListMetadataFormats
metadata formats supported by repository
ListSets
sets defined by repository
ListIdentifiers
OAI unique ids contained in repository
ListRecords
listing of N records
GetRecord
listing of a single record
most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Resource Harvesting: Use cases
• Discovery: use content itself in the creation of services
o
o
o
search engines that make full-text searchable
citation indexing systems that extract references from the full-text content
browsing interfaces that include thumbnail versions of high-quality images
from cultural heritage collections
• Preservation:
o
o
periodically transfer digital content from a data repository to one or more
trusted digital repositories
trusted digital repositories need a mechanism to automatically synchronize
with the originating data repository
Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Existing OAI-PMH based approaches
Typical scenario:
1. An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH
repository.
2. The harvester analyzes each Dublin Core record, extracting dc.identifier
information in order to determine the network location of the described
resource.
3. A separate process, out-of-band from the OAI-PMH, collects the
described resource from its network location.
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Existing OAI-PMH based approaches : Issue 1
 Locating the resource based on information provided in dc.identifier


dc.identifier used to convey a variety of identifier: (simultaneously) URL
DOI, bibliographic citation, … Not expressive enough to distinguish
between identifier, locator.
 Several dereferencing attempts required
URI provided in dc.identifier is commonly that of a bibliographic “splash
page”
 How to know it is a bibliographic “splash page”, not the resource?
 If it is a bibliographic “splash page”, where is the resource?
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Existing OAI-PMH based approaches : Issue 2
 Using the OAI-PMH datestamp of the Dublin Core record to trigger
incremental harvesting:

Datestamp of DC record does not necessarily change when resource
changes
no DC datestamp change
no resource update
resource update
DC datestamp change
OK
unnecessary
resource download
missed
resource update
OK
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Existing OAI-PMH based approaches : Conventions
 Cannot really address issue 2 (datestamps) with metadata
conventions
 Issue 1 (identifier & locator of the resource) is currently addressed
with a range of conventions

First dc.identifier is locator of the resource
 what if the resource is not digital?

Use of dc.format and/or dc.relation to convey locator
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Existing OAI-PMH based approaches : Conventions
<oai_dc:dc>
<dc:title>A Simple Parallel-Plate Resonator Technique for Microwave.
Characterization of Thin Resistive Films</dc:title>
<dc:creator>Vorobiev, A.</dc:creator>
<dc:subject>ING-INF/01 Elettronica</dc:subject>
<dc:description>A parallel-plate resonator method is proposed for
non-destructive characterisation of resistive films used in
microwave integrated circuits. A slot made in one ... </dc:description>
<dc:publisher>Microwave engineering Europe</dc:publisher>
<dc:date>2002</dc:date>
<dc:type>Documento relativo ad una Conferenza o altro Evento</dc:type>
<dc:type>PeerReviewed</dc:type>
<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:format>pdf
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf
</dc:format>
</oai_dc:dc>
splash page
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
locator of resource
Existing OAI-PMH based approaches : Conventions
…
<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:relation>
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf
</dc:relation>
…
splash page
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
locator of resource
Existing OAI-PMH based approaches : Conventions
…
<dc:identifier> http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:relation>
http://resolver.unibo.it/00000014/
</dc:relation>
<dc:relation>
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf
</dc:relation>
…
splash page
locator of resource
splash page
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Existing OAI-PMH based approaches : Other attempts
 dc.identifier leads to splash page & splash page contains special
purpose XHTML link to resource(s)


What if there is no splash page?
How does a harvester recognize this situation?
 OA-X: protocol extension



OK in local context
Strategic problem to generalize
How to consolidate with OAI-PMH data model
 Qualified Dublin Core


Could bring expressiveness to distinguish between locator & identifier
But what about the datestamp issue?
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Proposed OAI-PMH based approach
 Use metadata formats that were specifically created for
representation of digital objects:


Complex Object Formats as OAI-PMH metadata formats
MPEG-21 DIDL, METS, ..
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
OAI-PMH data model
resource
OAI-PMH identifier
= entry point to all records pertaining to the resource
metadata pertaining
to the resource
Dublin Core
metadata
MARCXML
metadata
MPEG-21
DIDL
METS
simple
more
expressive
highly
expressive
highly
expressive
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
item
records
Complex Object Formats : characteristics
• Representation of a digital object by means of a wrapper XML
document.
• Represented resource can be:
o
o
simple digital object (consisting of a single datastream)
compound digital object (consisting of multiple datastreams)
• Unambiguous approach to convey identifiers of the digital object
and its constituent datastreams.
• Include datastream:
o
o
o
By-Value: embedding of base64-encoded datastream
By-Reference: embedding network location of the datastream
not mutually exclusive; equivalent
• Include a variety of secondary information
o
o
o
By-Value
By-Reference
Descriptive metadata, rights information, technical metadata, …
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
<didl:DIDL>
<didl:Item>
<didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8">
<dii:Identifier>
http://amsacta.cib.unibo.it/archive/00000014/
</dii:Identifier>
</didl:Statement></didl:Descriptor>
<didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8">
<oai_dc:dc>
<dc:title>A Simple Parallel-Plate Resonator Technique for
Microwave. Characterization of Thin Resistive Films
</dc:title>
<dc:creator>Vorobiev, A.</dc:creator>
<dc:identifier>
http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:format>application/pdf</dc:format>
…
</oai_dc:dc>
</didl:Statement></didl:Descriptor>
<didl:Component>
<didl:Resource mimeType="application/pdf"
ref="http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf"/>
</didl:Component>
</didl:Item>
</didl:DIDL>
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Complex Object Formats & OAI-PMH
• Resource represented via XML wrapper => OAI-PMH
<metadata>
• Uniform solution for simple & compound objects
• Unambiguous expression of locator of datastream
• Disambiguation between locators & identifiers
• OAI-PMH datestamp changes whenever the resource
(datastreams & secondary information) changes
• OAI-PMH semantics apply: “about” containers, set membership
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
OAI-PMH based approach using Complex Object Format
Typical scenario:
1. An OAI-PMH harvester checks for support of a locally understood
complex object format using the ListMetadataFormats verb
2. The harvester harvests the complex object metadata. Semantics of the
OAI-PMH datestamp guarantee that new and modified resources are
detected.
3. A parser at the end of the harvesting application analyzes each harvested
complex object record:
- The parser extracts the bitstreams that were delivered By-Value.
- The parser extracts the unambiguous references to the network
location of bitstreams delivered By-Reference.
4. A separate process, out-of-band from the OAI-PMH, collects the
bitstreams delivered By-Reference from the extracted network locations.
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Complex Object Formats & OAI-PMH : issues
•
•
•
•
•
Which Complex Object Format(s)
How to Profile Complex Object Format(s) for OAI-PMH Harvesting
Large records
Making resources re-harvestable
Because the resource is represented as <metadata>, can rights
pertaining to the resource be expressed according to the “rights for
metadata” OAI-rights guideline?
• Tools:
o
o
Software library to write compliant complex objects
Integration of this library with repository systems (Fedora, DSpace,
eprints.org, ….)
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
mod_oai approach
•
•
Goal: integrate OAI-PMH functionality into the web server itself…
mod_oai: an Apache 2.0 module to automatically answer OAI-PMH
requests for an http server
o
o
•
•
written in C
respects values in .htaccess, httpd.conf
compile mod_oai on http://www.foo.edu/
baseURL is now http://www.foo.edu/modoai
o
Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
- http://www.foo.edu/modoai?
verb=ListIdentifiers &
metdataPrefix=oai_dc &
from=2004-09-15 &
set=mime:video:mpeg
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
OAI-PMH data model in mod_oai
resource
OAI-PMH sets
MIME type
metadata pertaining
to the resource
http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
OAI-PMH identifier
= entry point to all records pertaining to the resource
Dublin Core
metadata
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
HTTP header
metadata
MPEG-21
DIDL
item
records
OAI-PMH concepts : typical repository
OAI-PMH Entity
Resource
value
URL
description
PDF, PS, XML, HTML or other file
Item
identifier
OAI Identifier
DNS-based name of metadata about resource
set membership
LCSH
Library of Congress Subject Heading
metadataPrefix
oai_dc
bibliographic metadata in Dublin Core
Record
datestamp
2004-10-18
modification date of DC record
Record
metadataPrefix
datestamp
oai_marc
2004-07-31
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
bibliographic metadata in MARC
modification date of MARC record
OAI-PMH concepts : mod_oai
OAI-PMH Entity
Resource
value
description
URL
HTML, GIF, PDF or other web file
URL
same URL as the resource
set membership
MIME type
MIME type of the resource
metadataPrefix
http_header
the http headers that would have been
returned via HTTP GET/HEAD
datestamp
2004-07-31
modification date of resource
oai_dc
a subset of http_header in DC
2004-07-31
modification date of resource
Item
identifier
Record
Record
metadataPrefix
datestamp
Record
metadataPrefix
datestamp
oai_didl
2004-07-31
MPEG-21 DIDL: base64 encoded resource +
http_header metadata
modification date of resource
Resource Discovery: ListIdentifiers
harvester
• issues a ListIdentifiers,
• finds URLs of updated
resources
• does HTTP GETs updates
only
• can get URLs of
resources with specified
MIME types
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Preservation: ListRecords
harvester
• issues a ListRecords,
• Gets updates as MPEG21 DIDL documents
(HTTP headers, resource
By Value or By
Reference)
• can get resources with
specified MIME types
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
performance of mod_oai and wget
on www.cs.odu.edu
# of f il es in baseli ne
# of f il es in upda te
(25%)
index .html
as seed
709
114
wge t
"find . -type f"
as seed
5739
1318
mod_oai
files
5268
1335
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
for more detail: “mod_oai: An Apache Module for Metadata Harvesting “
http://arxiv.org/abs/cs.DL/0503069
Outline
(0) The Problem
(1) OAI-PMH Mechanics
(2) OAI-PMH for Resource Harvesting
(3) mod_oai
(4) Future Research
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Issues and Future Work
•
For a given server, there are a set of URLs, U, and a set of files F
o
o
•
Neither function is 1-1 nor onto
o
•
Apache maps U  F
mod_oai maps F  U
We can easily check if a single u maps to F, but given F we cannot (easily)
generate U
Short-term issues:
o
dynamic files
- exporting unprocessed server-side files would be a security hole
o
IndexIgnore
- httpd will “hide” valid URLs
o
File permissions
- httpd will advertise files it cannot read
•
Long-term issues
o
Alias, Location
- files can be covered up by the httpd
o
UserDir
- interactions between the httpd and the filesystem
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
IndexIgnore & File Permissions
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs
http://server/A
http://server/B
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf %
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Looking Further Down the Road for mod_oai
•
“Reverse” the method of URL discovery
o
o
cannot look to the files;
listen to incoming requests and build a list of valid URLs
- could be seeded with files at start
- also the method for handling server processed files / URLs
•
Plug-ins for descriptive metadata
o
o
o
•
DC tags in HTML
MS Office formats, PDF
Tags from JPEG, TIFF, MP3, etc.
Additional metadata in the DIDL
o
o
technical metadata from JHOVE
estimated change rate
- cf. Cho & Garcia-Molina, ACM TOIT 28(4)
•
http log access as separate metadata formats
- cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Expanding OAI-PMH / Complex Object Access
•
OAI-PMH / CO access for:
o
o
o
blogs
message boards
native file systems
- e.g. Mac OS X “Spotlight”
•
More aggressive use of OAI-PMH / CO for preservation
o
o
recently funded NSF DIGARCH program
use for preservation:
- Usenet
- Email
- Multicasting
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
OAI-PMH + Complex Objects:
A New Model for Web Resource Harvesting
•
Better web harvesting can be achieved through:
o
o
•
Use cases:
o
o
•
Preservation (ListRecords)
Web crawling (ListIdentifiers)
mod_oai: reference implementation
o
o
o
•
OAI-PMH: structured access to updates
Complex object formats: modeled representation of digital objects
Better performance than wget
static files only; dynamic files in the future
not a replacement for DSpace, Fedora, eprints.org, etc.
More info:
o
o
http://www.modoai.org/
http://whiskey.cs.odu.edu/
A New Model for Web Resource Harvesting
Texas A&M University, April 25, 2005
Download