Repository Synchronization Using NNTP and SMTP
Michael L. Nelson, Joan A. Smith, Martin Klein
Old Dominion University
Norfolk VA
www.cs.odu.edu/~{mln,jsmit,mklein}
DLF Spring 2006
Austin TX
April 10-12, 2006

Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Alternate Models of Preservation
• Lazy Preservation
  – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
  – Find a “good enough” replacement web page
• Web Server Based Preservation
  – Use Apache modules to create archival-ready resources
• Shared Infrastructure Preservation
  – Push your content to sites that might preserve it
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

Shared, Existing Infrastructure
• Can we (re)use existing installed network infrastructure for preservation purposes?

Who has the Bigger Fortress?

Experiment & Simulation
• Inject the contents of an OAI-PMH repository directly into:
  – Email (SMTP)
  – Usenet News (NNTP)
• Instrument existing email, news servers
• Use mod_oai (www.modoai.org) to do resource harvesting
  – complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML”
  – results are generalizable to any repository system
• Analyze testbed, simulate very large collections

Test Repository
• Website with 72 files
  – HTML, PDF, PNG, JPEG, GIF
  – 1 KB - 1.5 MB
• Used a script to harvest the MPEG-21 DIDLs, and then:
  – attach to outbound email mesgs
  – post to a moderated newsgroup (repository.odu.test1)

Email

Adding Email Attachments / Headers
[Figure labels: outgoing mail; incoming mail]

Email Headers
[Figure labels: OAI-PMH & HTTP headers; original email mesg; base64 encoded DIDL]

SMTP Overhead
• ~ 1 sec penalty per mesg
• diminishing returns for skipping mesgs

Email Traffic @ mail.cs.odu.edu
• 30 days of traffic
  – 505,987 mesgs
  – 4081 unique hosts
  – daily
    • mean: 16,866
    • std dev: 5147
• P(x) = a x^{-b}; we measured b ≈ 1.6

News

News Posting
[Figure labels: OAI-PMH & HTTP headers; base64 encoded DIDL]

News Overhead

News Policies

Simulation Parameters
• Repository
  – 100,000 items
  – 1 MB/item
  – 100 daily additions
  – 400 daily updates
• Time
  – 2000 days (5.5 years)
• Email
  – granularity=1
  – follows ODU power law example
• News
  – servers hold contents for 30 days

NNTP Results

Email Results (Without Memory)

Email Results (With Memory)

Discussion
• We’ve examined the worst case scenario
  – large, active repository
  – sending contents by-value
• Optimizations / Alternatives
  – smaller, less dynamic repositories
  – sending contents by-reference
  – use for repository discovery, not for content interchange
    • instead of sending “GetRecord” results, send “Identify” results and let interested parties return to your site with proper harvesters

Summary
• Shared, existing infrastructure can be used to push content to unknown preservation partners
  – exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
• While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
  – Gmail, Google Groups, etc. will always have more disks than you…
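
Appendix sketch: Adding Email Attachments / Headers

The email injection step ("Adding Email Attachments / Headers" and "Email Headers") can be sketched roughly as below. This is a minimal illustration using Python's standard `email` package, not the script used in the experiment. The header names `X-OAI-PMH-Identifier` and `X-OAI-PMH-Datestamp`, the attachment filename, the addresses, and the placeholder DIDL payload are all assumptions made for the example; the slides do not specify them.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical header names; the slides do not give the actual ones.
OAI_ID_HEADER = "X-OAI-PMH-Identifier"
OAI_DATE_HEADER = "X-OAI-PMH-Datestamp"


def add_didl_to_message(msg: EmailMessage, didl_xml: bytes,
                        identifier: str, datestamp: str) -> EmailMessage:
    """Sending side: piggyback one harvested record on an outgoing message."""
    msg[OAI_ID_HEADER] = identifier
    msg[OAI_DATE_HEADER] = datestamp
    # add_attachment base64-encodes bytes payloads by default.
    msg.add_attachment(didl_xml, maintype="application", subtype="xml",
                       filename="record.didl.xml")
    return msg


def extract_didls(raw: bytes) -> list[tuple[str, bytes]]:
    """Receiving side: recover (identifier, DIDL bytes) pairs from a message."""
    parsed = message_from_bytes(raw, policy=policy.default)
    identifier = str(parsed[OAI_ID_HEADER])
    return [(identifier, part.get_content())
            for part in parsed.iter_attachments()
            if part.get_content_type() == "application/xml"]


# An ordinary outgoing message (all values are placeholders).
original = EmailMessage()
original["From"] = "sender@example.org"
original["To"] = "recipient@example.org"
original["Subject"] = "Lunch?"
original.set_content("Ordinary message body.\n")

# Placeholder payload standing in for a harvested MPEG-21 DIDL document.
didl = b"<DIDL><Item/></DIDL>"

wire = add_didl_to_message(
    original, didl, "oai:example.org:item-1", "2006-04-10").as_bytes()
records = extract_didls(wire)
```

The original message body is left untouched; the record travels as an extra MIME part, so a recipient that does not care about the repository content sees a normal message with one attachment.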