A New Model for Web Resource Harvesting Michael L. Nelson Old Dominion University joint work with: Her Herbert Van de Sompel Xiaoming Liu Carl Lagoze Simeon Warner Terry Harrison This work supported in part by the Andrew Mellon Foundation & Library of Congress My Research Interests • Digital Objects and Repositories o o • Digital Preservation o o • Interaction between them: roles, responsibilities, architecture, scalability Bumper sticker: Free the Object from the Tyranny of the Repository Shared infrastructure models, automation, large-scale best effort strategies Bumper sticker: We Need Fewer Heroes User / System Co-Evolution o o Discerning intent from access large-scale patterns Bumper sticker: Freedom From Choice A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Outline (0) The Problem (1) OAI-PMH Mechanics (2) OAI-PMH for Resource Harvesting (3) mod_oai (4) Future Research A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 WWW and DL: Separated at Birth WWW The Good: XML, BitTorrent, Web Services The Bad: RSS The Ugly: Semantic Web WWW DL DL The Good: OAIS, DOI, OAI-PMH The Bad: Dublin Core The Ugly: SRU/W Today 1994 The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered. A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 more on DL/WWW, from the NSF Post-DL Workshop: http://www.sis.pitt.edu/~dlwkshop/paper_sompel.html http://www.sis.pitt.edu/~dlwkshop/paper_lagoze.html Web Robots what documents have been modified since 2003-11-15 ? what is this file? what are its relationships to other files? how often does it change? www.getty.edu doc1; last mod 2003-03-12 doc2; last mod 2002-07-19 … A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG doc100; last mod 2003-09-11 A More Efficient Way <co> <metadata/> <link/> <link/> <change/> … </co> what documents have been modified since 2003-11-15 ? www.getty.edu with mod_oai doc1; last mod 2003-03-12 doc2; last mod 2002-07-19 A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 … doc100; last mod 2003-09-11 Outline (0) The Problem (1) OAI-PMH Mechanics (2) OAI-PMH for Resource Harvesting (3) mod_oai (4) Future Research A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 A Very Brief OAI-PMH History • Universal Preprint Service o o A cross-archive DL that that provides services on a collection of metadata harvested from multiple archives - not distributed searching Demonstrated at Santa Fe NM, October 21-22, 1999 - D-Lib Magazine, 6(2) 2000 (2 articles) – http://www.dlib.org/dlib/february00/02contents.html o • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/ The OAI has authored, among other things, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) o in use around the world, 600+ known instances - registration not required; many unknown instances – http://gita.grainger.uiuc.edu/registry/ – http://celestial.eprints.org/cgi-bin/status o used by Google ca. late 2004 - http://www.nla.gov.au/digicoll/oai/ A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Data Providers / Repositories “A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.” A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Service Providers / Harvesters “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.” Aggregators aggregators allow for: • scalability for OAI-PMH • load balancing • community building • discovery data providers (repositories) aggregator A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 service providers (harvesters) OAI-PMH data model resource OAI-PMH sets OAI-PMH identifier OAI-PMH identifier metadataPrefix datestamp entry point to all records pertaining to the resource Dublin Core metadata A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 MARCXML metadata item records metadata pertaining to the resource Overview of OAI-PMH Verbs Verb metadata about the repository harvesting verbs Function Identify description of repository ListMetadataFormats metadata formats supported by repository ListSets sets defined by repository ListIdentifiers OAI unique ids contained in repository ListRecords listing of N records GetRecord listing of a single record most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control) A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Outline (0) The Problem (1) OAI-PMH Mechanics (2) OAI-PMH for Resource Harvesting (3) mod_oai (4) Future Research A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Resource Harvesting: Use cases • Discovery: use content itself in the creation of services o o o search engines that make full-text searchable citation indexing systems that extract references from the full-text content browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections • Preservation: o o periodically transfer digital content from a data repository to one or more trusted digital repositories trusted digital repositories need a mechanism to automatically synchronize with the originating data repository Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Existing OAI-PMH based approaches Typical scenario: 1. An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository. 2. The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource. 3. A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location. A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Existing OAI-PMH based approaches : Issue 1 Locating the resource based on information provided in dc.identifier dc.identifier used to convey a variety of identifier: (simultaneously) URL DOI, bibliographic citation, … Not expressive enough to distinguish between identifier, locator. Several dereferencing attempts required URI provided in dc.identifier is commonly that of a bibliographic “splash page” How to know it is a bibliographic “splash page”, not the resource? If it is a bibliographic “splash page”, where is the resource? A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Existing OAI-PMH based approaches : Issue 2 Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting: Datestamp of DC record does not necessarily change when resource changes no DC datestamp change no resource update resource update DC datestamp change OK unnecessary resource download missed resource update OK A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Existing OAI-PMH based approaches : Conventions Cannot really address issue 2 (datestamps) with metadata conventions Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions First dc.identifier is locator of the resource what if the resource is not digital? Use of dc.format and/or dc.relation to convey locator A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Existing OAI-PMH based approaches : Conventions <oai_dc:dc> <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films</dc:title> <dc:creator>Vorobiev, A.</dc:creator> <dc:subject>ING-INF/01 Elettronica</dc:subject> <dc:description>A parallel-plate resonator method is proposed for non-destructive characterisation of resistive films used in microwave integrated circuits. A slot made in one ... </dc:description> <dc:publisher>Microwave engineering Europe</dc:publisher> <dc:date>2002</dc:date> <dc:type>Documento relativo ad una Conferenza o altro Evento</dc:type> <dc:type>PeerReviewed</dc:type> <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier> <dc:format>pdf http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf </dc:format> </oai_dc:dc> splash page A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 locator of resource Existing OAI-PMH based approaches : Conventions … <dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier> <dc:relation> http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf </dc:relation> … splash page A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 locator of resource Existing OAI-PMH based approaches : Conventions … <dc:identifier> http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier> <dc:relation> http://resolver.unibo.it/00000014/ </dc:relation> <dc:relation> http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf </dc:relation> … splash page locator of resource splash page A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Existing OAI-PMH based approaches : Other attempts dc.identifier leads to splash page & splash page contains special purpose XHTML link to resource(s) What if there is no splash page? How does a harvester recognize this situation? OA-X: protocol extension OK in local context Strategic problem to generalize How to consolidate with OAI-PMH data model Qualified Dublin Core Could bring expressiveness to distinguish between locator & identifier But what about the datestamp issue? A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Proposed OAI-PMH based approach Use metadata formats that were specifically created for representation of digital objects: Complex Object Formats as OAI-PMH metadata formats MPEG-21 DIDL, METS, .. A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 OAI-PMH data model resource OAI-PMH identifier = entry point to all records pertaining to the resource metadata pertaining to the resource Dublin Core metadata MARCXML metadata MPEG-21 DIDL METS simple more expressive highly expressive highly expressive A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 item records Complex Object Formats : characteristics • Representation of a digital object by means of a wrapper XML document. • Represented resource can be: o o simple digital object (consisting of a single datastream) compound digital object (consisting of multiple datastreams) • Unambiguous approach to convey identifiers of the digital object and its constituent datastreams. • Include datastream: o o o By-Value: embedding of base64-encoded datastream By-Reference: embedding network location of the datastream not mutually exclusive; equivalent • Include a variety of secondary information o o o By-Value By-Reference Descriptive metadata, rights information, technical metadata, … A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 <didl:DIDL> <didl:Item> <didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8"> <dii:Identifier> http://amsacta.cib.unibo.it/archive/00000014/ </dii:Identifier> </didl:Statement></didl:Descriptor> <didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8"> <oai_dc:dc> <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave. Characterization of Thin Resistive Films </dc:title> <dc:creator>Vorobiev, A.</dc:creator> <dc:identifier> http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier> <dc:format>application/pdf</dc:format> … </oai_dc:dc> </didl:Statement></didl:Descriptor> <didl:Component> <didl:Resource mimeType="application/pdf" ref="http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf"/> </didl:Component> </didl:Item> </didl:DIDL> A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Complex Object Formats & OAI-PMH • Resource represented via XML wrapper => OAI-PMH <metadata> • Uniform solution for simple & compound objects • Unambiguous expression of locator of datastream • Disambiguation between locators & identifiers • OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes • OAI-PMH semantics apply: “about” containers, set membership A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 OAI-PMH based approach using Complex Object Format Typical scenario: 1. An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb 2. The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected. 3. A parser at the end of the harvesting application analyzes each harvested complex object record: - The parser extracts the bitstreams that were delivered By-Value. - The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference. 4. A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations. A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Complex Object Formats & OAI-PMH : issues • • • • • Which Complex Object Format(s) How to Profile Complex Object Format(s) for OAI-PMH Harvesting Large records Making resources re-harvestable Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline? • Tools: o o Software library to write compliant complex objects Integration of this library with repository systems (Fedora, DSpace, eprints.org, ….) A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Outline (0) The Problem (1) OAI-PMH Mechanics (2) OAI-PMH for Resource Harvesting (3) mod_oai (4) Future Research A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 mod_oai approach • • Goal: integrate OAI-PMH functionality into the web server itself… mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server o o • • written in C respects values in .htaccess, httpd.conf compile mod_oai on http://www.foo.edu/ baseURL is now http://www.foo.edu/modoai o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets) - http://www.foo.edu/modoai? verb=ListIdentifiers & metdataPrefix=oai_dc & from=2004-09-15 & set=mime:video:mpeg A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 OAI-PMH data model in mod_oai resource OAI-PMH sets MIME type metadata pertaining to the resource http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf OAI-PMH identifier = entry point to all records pertaining to the resource Dublin Core metadata A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 HTTP header metadata MPEG-21 DIDL item records OAI-PMH concepts : typical repository OAI-PMH Entity Resource value URL description PDF, PS, XML, HTML or other file Item identifier OAI Identifier DNS-based name of metadata about resource set membership LCSH Library of Congress Subject Heading metadataPrefix oai_dc bibliographic metadata in Dublin Core Record datestamp 2004-10-18 modification date of DC record Record metadataPrefix datestamp oai_marc 2004-07-31 A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 bibliographic metadata in MARC modification date of MARC record OAI-PMH concepts : mod_oai OAI-PMH Entity Resource value description URL HTML, GIF, PDF or other web file URL same URL as the resource set membership MIME type MIME type of the resource metadataPrefix http_header the http headers that would have been returned via HTTP GET/HEAD datestamp 2004-07-31 modification date of resource oai_dc a subset of http_header in DC 2004-07-31 modification date of resource Item identifier Record Record metadataPrefix datestamp Record metadataPrefix datestamp oai_didl 2004-07-31 MPEG-21 DIDL: base64 encoded resource + http_header metadata modification date of resource Resource Discovery: ListIdentifiers harvester • issues a ListIdentifiers, • finds URLs of updated resources • does HTTP GETs updates only • can get URLs of resources with specified MIME types A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Preservation: ListRecords harvester • issues a ListRecords, • Gets updates as MPEG21 DIDL documents (HTTP headers, resource By Value or By Reference) • can get resources with specified MIME types A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 performance of mod_oai and wget on www.cs.odu.edu # of f il es in baseli ne # of f il es in upda te (25%) index .html as seed 709 114 wge t "find . -type f" as seed 5739 1318 mod_oai files 5268 1335 A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 for more detail: “mod_oai: An Apache Module for Metadata Harvesting “ http://arxiv.org/abs/cs.DL/0503069 Outline (0) The Problem (1) OAI-PMH Mechanics (2) OAI-PMH for Resource Harvesting (3) mod_oai (4) Future Research A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Issues and Future Work • For a given server, there are a set of URLs, U, and a set of files F o o • Neither function is 1-1 nor onto o • Apache maps U F mod_oai maps F U We can easily check if a single u maps to F, but given F we cannot (easily) generate U Short-term issues: o dynamic files - exporting unprocessed server-side files would be a security hole o IndexIgnore - httpd will “hide” valid URLs o File permissions - httpd will advertise files it cannot read • Long-term issues o Alias, Location - files can be covered up by the httpd o UserDir - interactions between the httpd and the filesystem A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 IndexIgnore & File Permissions A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Alias: Covering Up Files httpd.conf: Alias /A /usr/local/web/htdocs/B Alias /B /usr/local/web/htdocs/A the files “A” and “B” will be different from the URLs http://server/A http://server/B A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 UserDir: “Just in Time” mounting of directories whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home liu_x/ mln/ whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso /home/tharriso/ whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home liu_x/ mln/ tharriso/ whiskey.cs.odu.edu:/ftp/WWW/conf % A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Looking Further Down the Road for mod_oai • “Reverse” the method of URL discovery o o cannot look to the files; listen to incoming requests and build a list of valid URLs - could be seeded with files at start - also the method for handling server processed files / URLs • Plug-ins for descriptive metadata o o o • DC tags in HTML MS Office formats, PDF Tags from JPEG, TIFF, MP3, etc. Additional metadata in the DIDL o o technical metadata from JHOVE estimated change rate - cf. Cho & Garcia-Molina, ACM TOIT 28(4) • http log access as separate metadata formats - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8) A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 Expanding OAI-PMH / Complex Object Access • OAI-PMH / CO access for: o o o blogs message boards native file systems - e.g. Mac OS X “Spotlight” • More aggressive use of OAI-PMH / CO for preservation o o recently funded NSF DIGARCH program use for preservation: - Usenet - Email - Multicasting A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005 OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting • Better web harvesting can be achieved through: o o • Use cases: o o • Preservation (ListRecords) Web crawling (ListIdentifiers) mod_oai: reference implementation o o o • OAI-PMH: structured access to updates Complex object formats: modeled representation of digital objects Better performance than wget static files only; dynamic files in the future not a replacement for DSpace, Fedora, eprints.org, etc. More info: o o http://www.modoai.org/ http://whiskey.cs.odu.edu/ A New Model for Web Resource Harvesting Texas A&M University, April 25, 2005