Self-Preserving Digital Objects

Michael L. Nelson & Terry L. Harrison
Old Dominion University
{mln, tharriso}@cs.odu.edu
http://www.cs.odu.edu/~mln/

Alliance for Innovation in Science and Technology Information
Fourth Annual AISTI Mini-Conference
"Phase Shifting for Digital Libraries"
Santa Fe, New Mexico
Sept 15-16, 2003

Outline
• History
• Preservation
• Archives vs. Objects
• Smart Objects & Dumb Archives
• Self-Preserving Objects

My DL History
• 1992 - work begins on the first-generation Langley Technical Report Server (LTRS)
• 1993 - WWW version of LTRS
  • http://techreports.larc.nasa.gov/ltrs/
  • work w/ ODU on WATERS
• 1994 - NASA Technical Report Server (NTRS)
  • distributed searching of many “LTRS-like” servers (20 separate nodes, all NASA centers)
  • http://techreports.larc.nasa.gov/cgi-bin/NTRS
• 1996 - NACA Technical Report Server (NACATRS)
  • http://naca.larc.nasa.gov/
• 1996 - Joint research in DLs with ODU begins
• 1997 - NCSTRL+ (clustering, buckets)
• 1999 - OAI-PMH development begins
• 2001 - Arc, DP9, Archon, Kepler, etc.
• 2002 - OAI-PMH version of the NTRS
  • http://ntrs.nasa.gov/

History
• ca. 1994 - 1995: a LaRC researcher, upon seeing LTRS, remarked: “all of these reports are nice, but what we really want is the data...”
• ca.
1995 - present: many reports in LTRS start to include data files, appendices, software and other information types
• NACATRS: the scanned nature of the reports implies that 1 report = N files, where N >= (pages * 3) + 2

NASA STI
• Formal publications cover a decreasing percentage of NASA’s STI output
  – most DLs focus only on formal publications
• Informal STI is maintained only by a network of collegial distribution
  – aging and shrinking workforce weakens this network
• Customers want much more than formal publications
  – rather than stretch the meaning of “report” or “document”, define a new object for DL transactions

STI Observations
• Media formats are instantiations of a more general class of information
• Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors
• “Separate but equal” DLs considered harmful
  – customer should not have to re-integrate what should never have been de-integrated...
  – institutional knowledge being lost because we don’t have a publishing vector established

Pyramid of Scientific and Technical Information (STI)
Information is created in a variety of formats. Formal publications, the focus of most DL projects, are supported by a pyramid of informal information.
[Figure: pyramid; labels: Journal Articles, Conference Papers, Technical Reports, software, raw data, notes, video / images; axis: time]

Information Lost Over Time
[Figure 7: STI Lost in Project / Archival / Reuse Process; labels: manuscript, library, software, ftp site, Project, User, raw data, thrown away, images, filing cabinet, New Project]

Content is King
The information content is more important than the systems used for its storage, management and retrieval
Objects should not be “locked” in specific DLs or archives

Prelude to OAI…
• I met Herbert Van de Sompel in April 1999...
  – we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
  – we wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc.
• most digital libraries (DLs) had grown up along single disciplines or institutions
  – little to no interoperability; isolated DL “gardens”

Universal Preprint Service
• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
  – Nelson: NCSTRL+; a modified version of Dienst
    • support for “clustering”
    • support for “buckets”
  – Krichel: ReDIF metadata format
  – Van de Sompel: SFX Linking
• Demonstrated at Santa Fe NM, October 21-22, 1999
  – http://web.archive.org/web/*/http://ups.cs.odu.edu/
  – D-Lib Magazine, 6(2) 2000 (2 articles)
    • http://www.dlib.org/dlib/february00/02contents.html
  – UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

UPS Participants
Archive / DL                        Records in DL   Buckets in UPS   Buckets Linked to Full Content
arXiv (www.arxiv.org)               128943          85204            85204
CogPrints (cogprints.soton.ac.uk)   743             742              659
NACA (naca.larc.nasa.gov)           3036            3036             3036
NCSTRL (www.ncstrl.org)             29680           25184            9084
NDLTD (www.ndltd.org)               1590            1590             951
RePEc (netec.mcc.ac.uk)             71359           71359            13582
Totals                              235361          187115           112516
(totals ca. July 1999)

Buckets: Information Surrogates in UPS
• Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only
• Metadata was collected into “buckets”, with pointers back to the data files (still at the original sites)

Value Added Services Attached to the Buckets
SFX Reference Linking Service, developed at Univ of Ghent, Belgium.
- provides a layer of indirection between reference services available at a local site and the object itself
SFX “buttons” are attached to the buckets themselves
- communication occurs between SFX server and the bucket
Adding other services to the buckets is easy...

Data and Service Providers
• Data Providers
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Service Providers
  – harvest metadata from providers
  – implement user interface to data
• Self-describing archives
  – Much of the learning about the constituent UPS archives occurred out of band…
Even if these are done by the same DL, these are distinct roles

Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – data remains at remote repositories
[Figure: user searches for “cfd applications” against a local copy of metadata (all searching, browsing, etc. performed on the metadata here); metadata harvested offline from each node; each node independently maintained; individual nodes can still support direct user interaction]

Result… OAI
• The OAI was the result of the demonstration and discussion during the Santa Fe meeting
  – OAI = a bunch of people, a religion, a cult, etc.
  – OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and maintained by the OAI
• Initial focus was on federating collections of scholarly e-print materials…
• …however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol
• Note:
  – OAI-PMH is only about metadata -- not full text!
    • but what is metadata vs. full-text?
  – OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
    • read: commercial publishers have an interest in OAI-PMH too...
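
The harvesting model on the slides above (a service provider pulls metadata offline, then searches its local copy) can be sketched roughly as follows. This is only an illustration, assuming OAI-PMH 2.0 with the oai_dc metadata format. The base URL http://example.org/oai, the two sample records, and the function names list_records_url, harvest, and search are all hypothetical; the response is hand-written and parsed locally so that no network access is needed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written stand-in for a ListRecords response (hypothetical records).
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:report-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>CFD Applications in Wing Design</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:report-2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Wind Tunnel Calibration Notes</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a service provider would request to harvest records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def harvest(response_xml):
    """Extract {identifier: title} from a ListRecords response (the local copy)."""
    root = ET.fromstring(response_xml)
    index = {}
    for record in root.findall(".//oai:record", NS):
        identifier = record.find("oai:header/oai:identifier", NS).text
        title = record.find(".//dc:title", NS).text
        index[identifier] = title
    return index


def search(index, query):
    """Search only the harvested metadata; the data stays at the remote repository."""
    needle = query.lower()
    return [ident for ident, title in index.items() if needle in title.lower()]


local_index = harvest(SAMPLE_RESPONSE)
hits = search(local_index, "cfd applications")
```

A real harvester would fetch the URL from list_records_url over HTTP and keep requesting while the response carries a resumption token; both are left out here to keep the sketch short.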

A Look Back at UPS
• Primary outcome of the meeting was the OAI & OAI-PMH
• Krichel: ReDIF metadata:
  – still in use & being developed
• Van de Sompel: SFX
  – OpenURL (NISO Standard)
  – SFX is a commercial OpenURL resolver marketed by Ex Libris
• Nelson:
  – NCSTRL+ begat Arc (arc.cs.odu.edu) and others
  – Buckets?
[Slide callouts: Componentized Digital Libraries, SRW, RSS, ..., !?]

Preservation
• RLG Report: Preserving Digital Information: Final Report and Recommendations
  – http://www.rlg.org/ArchTF/
  – refreshing - moving to new media
    • considered (comparatively) easy
  – migrating - transitioning to new systems, formats, idioms
    • considered hard

Really Long Term Preservation
• Migration is very hard, to be sure
  – but given sufficient demand, this can be accomplished
  – cf. early 1980s game emulation:
    • http://www.intellivisionlives.com/
    • http://stella.atari.org/
• Refreshing may actually be harder…
  – or at least intrinsically bound to the migration problem
    • http://web.archive.org/web/20011127114113/http://www.aisti.org/
    • http://web.archive.org/web/19971210220634/http://libwww.lanl.gov/

Preservation Metrics So Far
• Nelson & Allen
  – 3% decay of objects in DLs
    • http://www.dlib.org/dlib/january02/nelson/01nelson.html
• Lawrence, et al.
  – 3% decay of URLs included in technical papers
    • http://www.neci.nec.com/~lawrence/papers/persistencecomputer01/bib.html
• Koehler
  – ~33% of URLs “unstable” or “partially unstable”
    • http://InformationR.net/ir/4-4/paper60.html
• Kahle
  – average URL lasts 44 days
    • http://www.hackvan.com/pub/stig/articles/trustedsystems/0397kahle.html

Case Study: ICASE
• Institute for Computer Applications in Science and Engineering
  – independent research institute affiliated with NASA Langley Research Center
    • www.icase.edu
  – years of operation: 1972-2002
  – combined with other LaRC institutes, rolled into the National Institute of Aerospace (NIA)
• ICASE Report Series
  – pre-prints/e-prints of all ICASE affiliated authors
    • also issued as NASA Contractor Reports
  – Dienst was used for report management & workflow
    • Harrison, Zubair & Nelson, JCDL 03, Dienst <-> OAI-PMH gateway

NIA Transition
• At first, all files at www.icase.edu were lost
• then, the site was brought back online
• but how well do DLs survive bulk-transfer?

Whither the ICASE Digital Library?
it appears to be reinstated… but not completely…

How Long is Forever?
• Average human life span (from: http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html)
  – female: 78
  – male: 77
• Average Fortune 500 company lifespan (from: http://www.businessweek.com/chapter/degeus.htm)
  – 40 - 50 years
• Universities?
• U.S. Government agency or institution?
  – what about individual labs?
    • NASA Zero Base Review
    • U.S. Military BRAC

Self-Preservation
• Objects should be prepared to outlive the people & institutions that are charged with their well-being
• Many areas of risk:
  – company, agency, university, etc. ceases to exist
  – funding cut
  – person dies
  – disaster (hurricane, earthquake, etc.)
  – malicious attack

P2P Model
• Applicable for scientific and technical information?
  – Napster, Gnutella, etc. rely on the repetitive nature of popular culture media (songs, movies, etc.)
to ensure the availability of items
  – a “bubble” of recent and popular interest
    • this assumption is probably not valid in STI DLs
  – cf. popularity(HBO) >> popularity(AMC)

Smart Objects, Dumb Archives
[Figure labels: Buckets, DA, Fedora?, METS?, Guildford Protocol, ???, OAI-PMH]

“Key Concepts in the Architecture of the Digital Library”
• next 9 slides taken from Bill Arms’s seminal article in the inaugural issue of D-Lib Magazine:
  – http://www.dlib.org/dlib/July95/07arms.html

The technical framework exists within a legal and social framework
• DLs no longer represent systems specific to academics or information specialists
  – content influences how the DL is used
• architecture must allow the implementation of various policies

Understanding of digital library concepts is hampered by terminology
• “common English” != “professional English”
  – multiple professional jargons too
• What do these words mean to you?
  – copy
  – publish
  – content
  – document
  – work

The underlying architecture should be separate from the content stored in the library
• general purpose functions and content-specific functions should be separated
• TL analogy:
  – the more specific the bookshelf is to holding actual books, the harder it is to repurpose the bookshelf in the future

Names and identifiers are the basic building block for the digital library
• names != addresses
• in any DL architecture diagram, (almost) anything that can be drawn can be named
• consider the impact that handles/DOIs have had on the publishing/DL community

Digital library objects are more than collections of bits
• objects = metadata + data
  – “but what is metadata?”
    • don’t ask hard questions
figure 2 in http://www.dlib.org/dlib/July95/07arms.html

The digital library object that is used is different from the stored object
• what you store is not necessarily what you get
  – storage and dissemination are separate events, and can represent separate formats
    • also, potentially separate from the application-specific format

Users want intellectual works,
not digital objects
• The DL architect’s needs should not inconvenience the users’ needs
• recombination of objects
  – what is an object in your world view?
figure 4 in http://www.dlib.org/dlib/July95/07arms.html

Repositories must look after the information they hold
• “Repository Access Protocol”
  – Kahn Wilensky Framework
    • http://www.cnri.reston.va.us/home/cstr/arch/k-w.html
figure 3 in http://www.dlib.org/dlib/July95/07arms.html

Objects vs. Archives
• This is the tenet that I question…
• Most DL objects still bound to the applications that generate or render the objects

Design Goals
• Aggregation
  – DLs should be shielded from the transient nature of file formats
  – Prevent information hemorrhaging by archiving all data types
• Intelligence
  – Aggregation (above) implies code, why stop at passive objects? Make objects smart...
  – Bucket-bucket & bucket-tool intelligence

Design Goals
• Self-Sufficiency
  – Maximum autonomy & survivability: fully self-sufficient buckets
  – Option to internally store all needed materials
• Mobility
  – Why should an information object be stuck in one place?
  – Mobility for replication, workflow, data collection

Design Goals
• Heterogeneity
  – One size does not fit all...
  – Different buckets for different applications, sites, disciplines, etc.
• Archive Independence
  – Focus is on information, not yet another DL “system”
    • does not require an archive to function
  – “Work with everything; break nothing”

Smart Objects
• aggregate:
  – metadata
  – data
  – methods to operate on the metadata/data
    • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=getMetadata&type=all
    • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=listMethods
    • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=listPreference
    • (cheat) http://www.cs.odu.edu/~mln/teaching/cs595-f03/bucket/bucket.xml
• assumptions
  – Perl
  – http server
[Slide callout: METS in the future?]

Internal Structure
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls
bucket/  CVS/  index.cgi*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/
bucket.xml*  content/  CVS/  lib/  logs/  methods/
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/content/
~syllabus.txt           ~week1~readings.html    ~week5~readings.html
~week10~readings.html   ~week1~week-01.ppt      ~week6~readings.html
~week11~readings.html   ~week2~readings.html    ~week7~readings.html
~week12~readings.html   ~week2~week-02.ppt      ~week8~readings.html
~week13~readings.html   ~week3~assignment1.ppt  ~week9~readings.html
~week14~readings.html   ~week3~readings.html
~week15~readings.html   ~week3~week-03.ppt
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/lib
CVS/  EZXML.pm  mime.e  style.css
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/logs/
access.log  CVS/  mylog.log
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/methods/
addElement.pl*     getElement.pl*   listMethods.pl*     setPreference.pl*
CVS/               get_log.pl*      listPreference.pl*
deleteElement.pl*  getlog.pl*       log.pl*
display.pl*        getMetadata.pl*  setMetadata.pl*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 %

Examples
• 1.6.X bucket
  – http://ntrs.nasa.gov/
  – http://www.cs.odu.edu/~mln/phd/
• 2.0 buckets
  – http://www.cs.odu.edu/~mln/teaching/cs595-f03/
  – http://dochter.seven.research.odu.edu:3257/~aravind/test2/b36/

Self-Preservation
• Objectives:
  – knowledge of the system state not required
    • i.e.
-- you don’t need to keep track of where everything is…
  – the knowledge required for each object should be minimal
    • actually, the required number of “friends” should be finite, even in very large systems

Friends and Family
• Friends
  – connections to “other” buckets
• Family
  – connections to replications of you

Scenario: 3 buckets / 2 pals each; we want to add new_guy (D)
Step                                                 A pals   B pals   C pals    D pals
Initial state                                        b,c      a,c      b,a       (none)
Tool calls: C.insert(D, "start")                     b,c      a,c      b,a       (none)
D is added to C's pal list (C pal_list overstuffed)  b,c      a,c      b,a,(d)   (none)
Return handshake: D.insert(C, "finish")              b,c      a,c      b,a,(d)   c
C "refits" pal list …                                b,c      a,c      b,a,(d)   c
Refit step 1: C.pop_1st_pal not known by (D)         b,c      a,c      b,a,d     c
Refit step 2: B.pop_pal(C)                           b,c      a,c      a,d       c
Refit step 2: B.insert(D, "start")                   b,c      a,d      a,d       c
Refit step 3: D.insert(B, "finish")                  b,c      a,d      a,d       c,b
Final state                                          b,c      a,d      a,d       c,b

[Figures: network graphs for 10 Buckets, 4 Friends (Steps 2 through 10); 20 Buckets, 4 Friends; 100 Buckets, 10 Friends]

Building the Network
Bucket: this_node_name; max_friend size; list_of_pals;

insert ( new_guy, string handshake )
// Adds new_guy to this bucket's pal list
// handshake = "start" or "finish"
{
  if ( I know(new_guy) ) { return; }
  else {
    put new_guy at end of my pal list;
    if ( handshake == "start" ) { new_guy.insert(this_node_name, "finish"); }
    if ( my pal list is now overstuffed ) { refit(); }
  }
  return list_of_pals;
}

refit ()
// To keep pal_list from being overstuffed
{
  read in new_guy's pal list;
  pop_1st_pal_list();          // I remove 1st pal "Y" from my list that's
                               // not present in "new_guy's" pal list
  Y.pop_from_list(Me);         // Have "Y" pop "Me"
  Y.insert(new_guy, "start");  // Y adds new_guy to his list
                               // this will call new_guy to add "Y" as well
}

Communications Cost: Building the Network
• Total communications cost to build the network: b^2 - f - (b-f)^2
  • b = # of buckets
  • f = # of friends

Communications Cost: Building the Network
[Figure]

Communications Cost: Traversing the Network
• Flood algorithm: b(f-1) - f + 2
• Spanning Tree: b-1
• Upper bound on the diameter of the network: (b-f)/2 + 1
  – (typically much less)

Network Resiliency
• The network can survive at least f-1 node (bucket) or edge (communications) failures and still remain fully connected

Cf. Other P2P Projects
• Gnutella
  – also O(N^2) to build the network
    • currently don’t know the exact message cost
• Chord, Tapestry, etc.
  – content addressable networking
    • hash function to map keys to locations
  – orthogonal to buckets

Chatting
• the stored objects are inactive until invoked
  – if no one communicates with the object, it never wakes up, can never perform self-tests, etc.
• solution:
  – circulate a number of tokens through the network to ensure that everyone is woken up
  – buckets can perform a number of administrative tasks at these times
• Core to solving the migration issue

Communications Tokens
[Figure]

Flocking…
• Craig Reynolds, “Flocks, Herds, and Schools: A Distributed Behavioral Model”, SIGGRAPH 87
• Observations:
  – flocks, schools, herds, etc. exhibit many desirable properties:
    • scale-free
      – neighbors matter, not total size of flock
    • no upper bound
      – flocks are never “full”
  – flocks, etc.
can be modeled with simple rules:
    • Collision Avoidance: avoid collisions with nearby flockmates
    • Velocity Matching: attempt to match velocity with nearby flockmates
    • Flock Centering: attempt to stay close to nearby flockmates

Flocking for DLs
Rule: Collision Avoidance
  Flocking Boids: avoid collisions with nearby flockmates
  Flocking Buckets: not overwriting one's own copies nor the copies of other buckets (i.e., namespace collision avoidance)
Rule: Velocity Matching
  Flocking Boids: attempt to match velocity with nearby flockmates
  Flocking Buckets: deleting copies of oneself to provide “space” for late arrivals in a storage location
Rule: Flock Centering
  Flocking Boids: attempt to stay close to nearby flockmates
  Flocking Buckets: following others to available storage locations

Flocking (9,4)
[Figure; callouts: “new repository available”, “new repository available”]

Flocking (10,4)
[Figure]

Future Work
• Friends
  – optimizing the connections while sending the communication token
    • convert to small world graph over time
  – repair faults in the network
• Family
  – types
    • active
    • passive
  – provenance / authenticity

Other Applications for Smart Objects
• communication pulses will share the location of new services
  – format conversion (migration)
  – new repository locations (refreshing)
  – submit logs, alerts, other messages to people, services, etc.
• self-arranging displays

Self-Arranging Displays For Buckets
• premise: to have the links in the object reflect the community’s preferences
  – real-time computation; no log file processing
  – Bollen & Nelson, “Adaptive Networks of Smart Objects”
  – http://www.cs.odu.edu/~mln/pubs/bollenj_adaptive.pdf

Hebbian Learning
http://b1?method=display&referer=b1&redirect=http://b2?method=display%26referer=http://b1
http://b2?method=display&referer=b2&redirect=http://b1?method=display%26redirect=http://b3?method=display%26referer=http://b2

Initial Experiment
• Elango, Bollen & Nelson, "Dynamic Linking of Smart Digital Objects Based on User Navigation Patterns"
  – http://www.cs.odu.edu/~aelango/html/adaptive.pdf
  – Take top 50 all-time pop music bands
    • from Spin Magazine’s top 50 bands of all time
  – From each band, take 2 “related” bands
    • according to allmusic.com
  – Create network of 150 buckets with band info (metadata from allmusic.com)
  – Randomize the network
    • each band points to 3 other randomly selected bands
  – Get people to traverse the network…

Sample Screenshot
[Figure]

Sample Results From the Initial Node: Public Enemy
[Figure]

Related Work
Project: Kahn-Wilensky Framework Digital Objects (including Warwick Framework & FEDORA)
  Comments: KWF DOs never implemented; unsure of WF. FEDORA is CORBA-based. FEDORA is actively being developed, but we’re unaware of a “demo” at this time.
Project: Multivalent Documents
  Comments: “Semantic layers” – overlays lenses on the document (ala translations, annotations, geospatial data).
Project: OpenDoc, OLE, DCOM, etc.
  Comments: Extending document functionality through application embedding.
Project: Metaphoria
  Comments: Aggregating information sources; separating content from presentation.
Project: VERS Encapsulated Objects
  Comments: self-documenting XML encoded data files for long term storage of Australian government documents
Project: Aurora
  Comments: Encapsulation of content, metadata and services. CORBA-based implementation (?).
Project: E-commerce (cryptolopes & DigiBox)
  Comments: “Super-distribution” focused… also encryption and anonymity
Project: Filesystems & Formats (ELFS, HDF, netCDF, etc.)
  Comments: Experimental filesystems and self-describing data formats; focus on high performance computing applications
see www.fedora.info for latest update
see http://techreports.larc.nasa.gov/ltrs/PDF/2001/tm/NASA-2001-tm211426.pdf

Risks
• Why have these projects met with limited success or are only used in niche applications?
  – it is one thing to add a layer to your DL, but changing the structure of your first-class objects incurs a level of short-term risk
  – however, even the most well-thought out componentized DL is subject to long-term risks
    • cf. ICASE DL

Conclusions
• Smart objects are an idea whose time has come
  – natural progression of DL R&D
• Smart objects will play a fundamental role in digital preservation
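
Appendix sketch: the insert/refit pseudocode from the "Building the Network" slide might be rendered in Python roughly as below. This is one reading of the pseudocode rather than the authors' implementation. The class name Bucket, the helper run_slide_scenario, and the choice to pass new_guy into refit explicitly are assumptions, as is the rule that refit never evicts new_guy itself. The sketch is only exercised on the 3-bucket, 2-pal scenario from the slides; behavior on larger networks is not examined here.

```python
class Bucket:
    """A smart object that tracks a bounded list of pals (assumed structure)."""

    def __init__(self, name, max_friends):
        self.name = name
        self.max_friends = max_friends
        self.pals = []  # ordered, as in the slides' pal lists

    def insert(self, new_guy, handshake):
        """Add new_guy to this bucket's pal list; handshake is "start" or "finish"."""
        if new_guy in self.pals:
            return self.pals
        self.pals.append(new_guy)
        if handshake == "start":
            # Return handshake so the link is recorded on both ends.
            new_guy.insert(self, "finish")
        if len(self.pals) > self.max_friends:
            self.refit(new_guy)
        return self.pals

    def refit(self, new_guy):
        """Shed the first pal that new_guy does not know, and introduce them."""
        for pal in self.pals:
            # Assumption: never evict new_guy itself.
            if pal is not new_guy and pal not in new_guy.pals:
                self.pals.remove(pal)
                if self in pal.pals:
                    pal.pals.remove(self)     # have "Y" pop "Me"
                pal.insert(new_guy, "start")  # "Y" and new_guy become pals
                return


def run_slide_scenario():
    """Replay the slides' example: A, B, C with 2 pals each, then add D via C."""
    a, b, c, d = (Bucket(n, max_friends=2) for n in "abcd")
    a.pals, b.pals, c.pals = [b, c], [a, c], [b, a]
    c.insert(d, "start")
    return {x.name: [p.name for p in x.pals] for x in (a, b, c, d)}


final_state = run_slide_scenario()
```

On the slide scenario this ends with A keeping b,c, with B and C each holding a,d, and with D holding c,b, which matches the final frame of the animation.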