mod_oai: Metadata Harvesting for Everyone
Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango
{mln,aelango}@cs.odu.edu {herbertv,liu_x}@lanl.gov
DLF 2004 Fall Forum, Baltimore MD, October 25-27, 2004
mod_oai is sponsored by the Andrew Mellon Foundation

Outline
• mod_oai
  – crawling vs. harvesting
  – complex objects & OAI-PMH
  – how mod_oai works
  – scenarios
  – demos
• More information
  – http://www.modoai.org/
  – http://www.openarchives.org/

Inefficient Web Crawlers
what documents have been modified since 2003-11-15?
www.getty.edu
  doc1; last mod 2003-03-12
  doc2; last mod 2002-07-19
  doc3; last mod 2003-11-29
  doc4; last mod 2002-10-03
  …
  doc100; last mod 2003-09-113
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG

A More Efficient Way…
what documents have been modified since 2003-11-15?
www.getty.edu with OAI-PMH
  doc1; last mod 2003-03-12
  doc2; last mod 2002-07-19
  doc3; last mod 2003-11-29
  doc4; last mod 2002-10-03
  …
  doc100; last mod 2003-09-113

mod_oai
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
  – written in C
  – respects values in .htaccess, httpd.conf
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
• www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg

OAI-PMH data model
• resource
• item: OAI-PMH identifier = entry point to all records pertaining to the resource
• records: metadata pertaining to the resource, i.e., a modeled representation of the resource
  – Dublin Core metadata (simple model)
  – MPEG-21 DIDL (complex model)
  – METS (complex model)
  – MARCXML metadata (more expressive model)

OAI-PMH and complex models
• OAI-PMH record == modeled representation of the resource
• Can be selectively harvested via OAI-PMH ~ datestamp, set
• Resource can be:
  – simple object (1 file)
  – compound object (multiple files)
• OAI-PMH records can contain:
  – Typical metadata
  – Actual resource(s)
    • By-Value – base64 encoded
    • By-Reference – http address of resource
    • both
  – Identifiers of metadata and resource(s), unambiguously mapped to the identified data
  – A variety of secondary information

Complex Objects & OAI-PMH
• LANL Repository
  – OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
• APS/LANL/LoC Mirroring
  – OAI-PMH transfer of APS content represented in application neutral format (DIDLs)
• LANL DSpace Plug-in
  – Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure

How mod_oai works
• Install on an Apache 2.0 server
  – compile & edit httpd.conf
http://www.foo.edu/ now has an OAI-PMH baseURL of: http://www.foo.edu/modoai

OAI-PMH characteristics: Typical Repository

OAI-PMH Entity                  value           description
Resource                        URL             PDF, PS, XML, HTML or other file
Item            identifier      OAI Identifier  DNS-based name of metadata about resource
Item            set membership  LCSH            Library of Congress Subject Heading
Record          metadataPrefix  oai_dc          bibliographic metadata in Dublin Core
Record          datestamp       2004-10-18      modification date of DC record
Record          metadataPrefix  oai_marc        bibliographic metadata in MARC
Record          datestamp       2004-07-31      modification date of MARC record

OAI-PMH Data Model in mod_oai
• resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
• item
  – OAI Identifier == URL of Resource
  – Set membership == MIME type
• records: DC, HTTP, DIDL Modeled Representations
  – Dublin Core metadata
  – HTTP headers
  – DIDL: base64 or urls + HTTP headers

OAI-PMH characteristics: mod_oai

OAI-PMH Entity                  value        description
Resource                        URL          HTML, GIF, PDF or other web file
Item            identifier      URL          same URL as the resource
Item            set membership  MIME type    MIME type of the resource
Record          metadataPrefix  http_header  the http headers that would have been returned via HTTP GET/HEAD
Record          datestamp       2004-07-31   modification date of resource
Record          metadataPrefix  oai_dc       a subset of http_header in DC
Record          datestamp       2004-07-31   modification date of resource
Record          metadataPrefix  oai_didl     MPEG-21 DIDL: base64 encoded resource + http_header metadata
Record          datestamp       2004-07-31   modification date of resource

OAI-PMH Concepts

concept          mod_oai interpretation
OAI Identifier   URL of resource
set              MIME type of resource
datestamp        change time of resource
deleted records  “no” deleted records
                 http_header

Use Cases
• Regular Web Crawling
  – use ListIdentifiers to discover URLs
  – add new URLs to the list of URLs to be crawled
• Harvesting Resources w/ OAI-PMH
  – use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP

Regular Crawling: ListIdentifiers
harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates

Resource Harvesting: ListRecords
harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)

Demo
• Repository Explorer
  – http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
  – http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
• Direct URLs
  – http://whiskey.cs.odu.edu/modoai?verb=Identify
  – http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
  – http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
  – http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
  – http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl

Datestamps and Etags
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004
http://www.netarchive.dk/website/publications/Etags-2004.pdf
• Procedure
  – 16 harvests over 1 month of 465,374 .dk domains
  – 5,543,470 possible downloads
  – 5,182,034 successful downloads
  – 599,143 changes

Datestamp and Etag Example
(figure)

Errors in Datestamps and Etags Indicating Change

                 Etags   Datestamps
missed change    0.087%  0.30%
redundant crawl  32%     10.7%

• 40.1% of pages without Etags
• 0.07% of pages without Datestamps
Source: L. Clausen, ECDL 2004 (full citation on the Datestamps and Etags slide)

mod_oai…
• is:
  – a simple way to more efficiently harvest web pages
  – a possible impact on robots.txt
  – fully OAI-PMH compliant
    • works with existing harvesters
• is not:
  – yet suitable for dynamic files
  – a replacement for
    • DSpace
    • Fedora
    • eprints.org
    • other digital libraries / repositories / CMS
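
Sketch: the "Regular Crawling: ListIdentifiers" use case
The crawling scenario from the Use Cases slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mod_oai implementation or an official harvester: the function names `list_identifiers_url` and `updated_urls` are hypothetical, the baseURL is the www.foo.edu example from the slides, and `SAMPLE_RESPONSE` is a hand-written stand-in for a server reply so the example runs without network access. The only parts taken from the OAI-PMH specification are the request arguments (verb, metadataPrefix, from, set) and the OAI-PMH 2.0 XML namespace.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 response namespace (from the OAI-PMH specification).
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, set_spec=None):
    """Build a selective-harvest ListIdentifiers request URL (hypothetical helper)."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # only items changed on or after this date
    if set_spec:
        params["set"] = set_spec     # under mod_oai, sets are MIME types
    return base_url + "?" + urlencode(params)


def updated_urls(response_xml):
    """Return the identifiers in a ListIdentifiers response.

    Under mod_oai the OAI identifier is the URL of the resource, so the
    result is the list of URLs a crawler would then fetch with HTTP GET.
    """
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        if identifier:
            urls.append(identifier.strip())
    return urls


# Illustrative response only; not captured from a real mod_oai server.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/clips/intro.mpg</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(list_identifiers_url("http://www.foo.edu/modoai",
                               from_date="2004-09-15", set_spec="video:mpeg"))
    for url in updated_urls(SAMPLE_RESPONSE):
        print("would GET:", url)
```

A real harvester would fetch the request URL, follow any resumptionToken in the response, and then issue HTTP GETs only for the returned URLs; those steps are left out here to keep the sketch self-contained.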