Repository Synchronization Using NNTP and SMTP

Michael L. Nelson, Joan A. Smith, Martin Klein
Old Dominion University
Norfolk VA
www.cs.odu.edu/~{mln,jsmit,mklein}
DLF Spring 2006
Austin TX
April 10-12, 2006
Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Alternate Models
of Preservation
• Lazy Preservation
– Let Google, IA et al. preserve your website
• Just-In-Time Preservation
– Find a “good enough” replacement web page
• Web Server Based Preservation
– Use Apache modules to create archival-ready resources
• Shared Infrastructure Preservation
– Push your content to sites that might preserve it
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm
Shared, Existing Infrastructure
• Can we (re)use existing installed network
infrastructure for preservation purposes?
Who has the Bigger Fortress?
Experiment & Simulation
• Inject the contents of an OAI-PMH repository
directly into:
– Email (SMTP)
– Usenet News (NNTP)
• Instrument existing email, news servers
• Use mod_oai (www.modoai.org) to do resource
harvesting
– complex object formats (e.g. MPEG-21 DIDL) used to
encode the resources as “lumps of XML”
– results are generalizable to any repository system
• Analyze testbed, simulate very large collections
Test Repository
• Website with 72 files
– HTML, PDF, PNG, JPEG, GIF
– 1KB - 1.5 MB
• Used a script to harvest the MPEG-21
DIDLs, and then:
– attach to outbound email messages
– post to a moderated newsgroup
(repository.odu.test1)
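The two injection paths above can be sketched with Python's standard `email` package. This is a minimal illustration, not the script used in the experiment: the `X-OAI-PMH-*` header names, the file name, the moderator address, and the stand-in DIDL bytes are all assumptions made for the example.

```python
from email.message import EmailMessage


def attach_didl(msg: EmailMessage, didl_xml: bytes,
                identifier: str, datestamp: str) -> EmailMessage:
    """Attach a harvested DIDL to an existing outbound message.

    The X-OAI-PMH-* header names are hypothetical.
    """
    msg["X-OAI-PMH-Identifier"] = identifier
    msg["X-OAI-PMH-Datestamp"] = datestamp
    # Bytes attachments are base64-encoded by default.
    msg.add_attachment(didl_xml, maintype="application", subtype="xml",
                       filename="resource.didl.xml")
    return msg


def build_news_post(didl_xml: bytes, newsgroup: str,
                    identifier: str, sender: str,
                    moderator: str) -> EmailMessage:
    """Build an article whose body is the base64-encoded DIDL.

    A moderated group only accepts articles carrying an Approved header;
    the moderator address here is a placeholder.
    """
    post = EmailMessage()
    post["From"] = sender
    post["Newsgroups"] = newsgroup
    post["Subject"] = identifier
    post["Approved"] = moderator
    post["X-OAI-PMH-Identifier"] = identifier
    post.set_content(didl_xml, maintype="application", subtype="xml",
                     cte="base64")
    return post
```

Actually sending the message (SMTP) or posting the article (NNTP) is left out; the sketch only shows how the DIDL travels as an encoded "lump of XML" next to identifying headers.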
Email
Adding Email Attachments / Headers
[Figure: outgoing mail and incoming mail paths]
Email Headers
[Figure: annotated message showing the OAI-PMH & HTTP headers, the original email message, and the base64 encoded DIDL]
SMTP Overhead
~ 1 sec penalty per message
diminishing returns
for skipping messages
Email Traffic @ mail.cs.odu.edu
• 30 days of traffic
– 505,987 messages
– 4,081 unique hosts
– daily
• mean: 16,866
• std dev: 5,147
P(x) = a x^{-b}
we measured b ≈ 1.6
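Reading the formula above as a power law, P(x) = a x^{-b}, a simulation needs a way to draw values with exponent b ≈ 1.6. The sketch below uses inverse-transform sampling from a continuous power law and recovers the exponent with the standard maximum-likelihood estimator. What x measures is not recoverable from the slide text, so the lower cutoff `x_min = 1` and the continuous form are assumptions of this example, not the authors' fitting procedure.

```python
import math
import random


def sample_power_law(n: int, b: float = 1.6, x_min: float = 1.0,
                     seed: int = 0) -> list[float]:
    """Draw n values with density proportional to x**(-b) for x >= x_min.

    Inverts the CDF F(x) = 1 - (x / x_min)**(-(b - 1)); requires b > 1.
    """
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (b - 1.0))
            for _ in range(n)]


def estimate_exponent(xs: list[float], x_min: float = 1.0) -> float:
    """Maximum-likelihood estimate of b for a continuous power law."""
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)
```

With b this close to 1 the tail is very heavy: a handful of draws dominate the total, which matches the intuition that a few hosts receive most of the traffic.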
News
News Posting
[Figure: annotated news article showing the OAI-PMH & HTTP headers and the base64 encoded DIDL]
News Overhead
News Policies
Simulation Parameters
• Repository
– 100,000 items
– 1MB/item
– 100 daily additions
– 400 daily updates
• Time
– 2000 days (5.5 years)
• Email
– granularity=1
– follows ODU power
law example
• News
– servers hold contents
for 30 days
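To give a feel for the scale these parameters imply, the arithmetic can be sketched as follows. This is back-of-the-envelope bookkeeping, not the authors' simulator: it assumes that 100,000 is the starting size, that additions and updates are each sent once by value, and that a news server keeps nothing beyond the 30-day window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoParams:
    initial_items: int = 100_000      # assumed to be the starting size
    item_mb: float = 1.0
    daily_additions: int = 100
    daily_updates: int = 400
    days: int = 2000
    news_retention_days: int = 30


def items_on_day(p: RepoParams, day: int) -> int:
    """Repository size after `day` days of additions."""
    return p.initial_items + p.daily_additions * day


def daily_change_mb(p: RepoParams) -> float:
    """Volume pushed per day if every addition and update is sent by value."""
    return (p.daily_additions + p.daily_updates) * p.item_mb


def news_window_postings(p: RepoParams) -> int:
    """Postings a server retains at steady state if only changes are posted."""
    return p.news_retention_days * (p.daily_additions + p.daily_updates)
```

Under these assumptions the repository grows to 300,000 items by day 2000, changes amount to 500 MB per day, and a 30-day news window holds 15,000 postings, at most 5% of the final repository. Posting only the changes would therefore never leave a complete copy on a news server.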
NNTP Results
Email Results
(Without Memory)
Email Results
(With Memory)
Discussion
• We’ve examined the worst-case scenario
– large, active repository
– sending contents by-value
• Optimizations / Alternatives
– smaller, less dynamic repositories
– sending contents by-reference
– use for repository discovery, not for content interchange
• instead of sending “GetRecord” results, send “Identify” results
and let interested parties return to your site with proper
harvesters
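The difference between the two options above is which OAI-PMH request's response gets pushed out. A minimal sketch of building both request URLs follows; the base URL is a placeholder, and the `oai_didl` metadata prefix and the use of the resource URL as the identifier are assumptions about a mod_oai style setup.

```python
from urllib.parse import urlencode


def identify_url(base_url: str) -> str:
    """Identify takes no arguments; its response describes the repository."""
    return f"{base_url}?{urlencode({'verb': 'Identify'})}"


def get_record_url(base_url: str, identifier: str,
                   metadata_prefix: str = "oai_didl") -> str:
    """GetRecord requires an identifier and a metadataPrefix.

    With a complex object format, the response carries the resource by value.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"
```

An Identify response is small and the same for every message, so pushing it advertises the repository without the per-item cost of a GetRecord response.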
Summary
• Shared, existing infrastructure can be used to push
content to unknown preservation partners
– exploiting not just hardware infrastructure, but human
communication patterns for resource discovery as well
• While not possessing ideal DL/Archival
capabilities, these methods are congruent with
standard web practices
– Gmail, Google Groups, etc. will always have more
disks than you…