Self-Preserving Digital Objects
Michael L. Nelson & Terry L. Harrison
Old Dominion University
{mln, tharriso}@cs.odu.edu
http://www.cs.odu.edu/~mln/
Alliance for Innovation in Science and Technology Information
Fourth Annual AISTI Mini-Conference "Phase Shifting for Digital Libraries"
Santa Fe, New Mexico
Sept 15-16, 2003
Outline
• History
• Preservation
• Archives vs. Objects
• Smart Objects & Dumb Archives
• Self-Preserving Objects
My DL History
• 1992 - work begins on the first-generation Langley Technical Report Server
(LTRS)
• 1993 - WWW version of LTRS
• http://techreports.larc.nasa.gov/ltrs/
• work w/ ODU on WATERS
• 1994 - NASA Technical Report Server (NTRS)
• distributed searching of many “LTRS-like” servers (20 separate nodes, all
NASA centers)
• http://techreports.larc.nasa.gov/cgi-bin/NTRS
• 1996 - NACA Technical Report Server (NACATRS)
• http://naca.larc.nasa.gov/
• 1996 - Joint research in DLs with ODU begins
• 1997 - NCSTRL+ (clustering, buckets)
• 1999 - OAI-PMH development begins
• 2001 - Arc, DP9, Archon, Kepler, etc.
• 2002 - OAI-PMH version of the NTRS
• http://ntrs.nasa.gov/
History
• ca. 1994 - 1995: a LaRC researcher, upon seeing
LTRS, remarked:
“all of these reports are nice, but what we really want is
the data...”
• ca. 1995 - present: many reports in LTRS start to
include data files, appendices, software and other
information types
• NACATRS: the scanned nature of the reports
implies that 1 report = N files
N >= (pages * 3) + 2
NASA STI
• Formal publications cover a decreasing
percentage of NASA’s STI output
– most DLs focus only on formal publications
• Informal STI is maintained only by a
network of collegial distribution
– aging and shrinking workforce weakens this network
• Customers want much more than formal
publication
– rather than stretch the meaning of “report” or
“document”, define a new object for DL transactions
STI Observations
• Media formats are instantiations of a more general
class of information
• Most DLs are uni-format, following the obsolete
media boundaries of their non-digital predecessors
• “Separate but equal” DLs considered harmful
– customer should not have to re-integrate what should never
have been de-integrated...
– institutional knowledge being lost because we don’t have a
publishing vector established
Pyramid of Scientific and
Technical Information (STI)
Information is created in a variety of formats. Formal publications, the focus of
most DL projects, are supported by a pyramid of informal information.
(Figure: pyramid with Journal Articles and Conference Papers at the top, Technical Reports below them, and software, raw data, notes, and video / images at the base; an axis is labeled “time”.)
Information Lost Over Time
Figure 7: STI Lost in Project / Archival / Reuse Process
(Figure labels: Project; manuscript, library; software, ftp site; raw data, thrown away; images, filing cabinet; User; New Project.)
Content is King
The information content is
more important than the
systems used for its storage,
management and retrieval
Objects should not be “locked”
in specific DLs or archives
Prelude to OAI…
• I met Herbert Van de Sompel in April 1999...
– we spoke of a demonstration project he had in mind, for
which he had received sponsorship from Paul Ginsparg and
Rick Luce
– We wanted to demonstrate a multi-disciplinary DL that
leveraged the large number of high quality, yet often
isolated, tech report servers, e-print servers, etc.
• most digital libraries (DLs) had grown up along single disciplines
or institutions
– little to no interoperability; isolated DL “gardens”
Universal Preprint Service
• A cross-archive DL that provides services on a collection of metadata
harvested from multiple archives
– Nelson: NCSTRL+; a modified version of Dienst
• support for “clustering”
• support for “buckets”
– Krichel: ReDIF metadata format
– Van de Sompel: SFX Linking
• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://web.archive.org/web/*/http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles)
• http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI)
http://www.openarchives.org/
UPS Participants (totals ca. July 1999)
Archive / DL                      Records in DL   Buckets in UPS   Buckets Linked to Full Content
arXiv (www.arxiv.org)                    128943            85204                            85204
CogPrints (cogprints.soton.ac.uk)           743              742                              659
NACA (naca.larc.nasa.gov)                  3036             3036                             3036
NCSTRL (www.ncstrl.org)                   29680            25184                             9084
NDLTD (www.ndltd.org)                      1590             1590                              951
RePEc (netec.mcc.ac.uk)                   71359            71359                            13582
Totals:                                  235361           187115                           112516
Buckets: Information
Surrogates in UPS
• Limitations on intellectual property,
file size, transmission time, system
load, etc. caused us to focus on
metadata only
• Metadata was collected into
“buckets”, with pointers back to the
data files (still at the original sites)
Value Added
Services Attached
to the Buckets
SFX Reference Linking
Service, developed at
Univ of Ghent, Belgium.
- provides a layer of
indirection between
reference services
available at a local site
and the object itself
SFX “buttons” are attached
to the buckets themselves
- communication occurs
between SFX server and
the bucket
Adding other services to the
buckets is easy...
Data and Service Providers
• Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
• also provide non-technical context for sharing information
• Service Providers
– harvest metadata from providers
– implement user interface to data
• Self-describing archives
– Much of the learning about the constituent UPS
archives occurred out of band…
(Note: even if these are done by the same DL, these are distinct roles.)
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
– data remains at remote repositories
(Figure: a user searches for “cfd applications”; all searching, browsing, etc. is performed on the local copy of the metadata; metadata is harvested offline from each node; each node is independently maintained, and individual nodes can still support direct user interaction.)
Result… OAI
• The OAI was the result of the demonstration and discussion during the Santa Fe
meeting
– OAI = a bunch of people, a religion, a cult, etc.
– OAI Protocol for Metadata Harvesting (OAI-PMH) = the protocol created and maintained
by the OAI
• Initial focus was on federating collections of scholarly e-print materials…
• …however, interest grew and the scope and application of OAI-PMH expanded to
become a generic bulk metadata transport protocol
• Note:
– OAI-PMH is only about metadata -- not full text!
• but what is metadata vs. full-text?
– OAI is neutral with respect to the nature of the metadata or the resources the metadata
describes
• read: commercial publishers have an interest in OAI-PMH too...
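A harvesting request in OAI-PMH is an ordinary HTTP request with a `verb` argument. The sketch below only builds the request URLs and performs no network access. The base URL `http://example.org/oai` is a placeholder, and the helper name `list_records_url` is made up for illustration; the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the `oai_dc` prefix come from the OAI-PMH specification.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper, no network I/O)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        query = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(query)

first_request = list_records_url("http://example.org/oai")
next_request = list_records_url("http://example.org/oai", resumption_token="abc123")
```

A harvester would repeat the second form until the repository stops returning a resumption token, then build its services on the local copy of the metadata.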
A Look Back at UPS
• Primary outcome of the meeting was the OAI &
OAI-PMH
• Krichel: ReDIF metadata:
– still in use & being developed
• Van de Sompel: SFX
– OpenURL (NISO Standard)
– SFX is a commercial OpenURL resolver marketed by
Ex Libris
• Nelson:
– NCSTRL+ begat Arc (arc.cs.odu.edu) and others
– Buckets?
Componentized Digital Libraries
SRW
RSS
...
!?
Preservation
• RLG Report: Preserving Digital Information:
Final Report and Recommendations
– http://www.rlg.org/ArchTF/
– refreshing - moving to new media
• considered (comparatively) easy
– migrating - transitioning to new systems, formats,
idioms
• considered hard
Really Long Term Preservation
• Migration is very hard, to be sure
– but given sufficient demand, this can be accomplished
– cf. early 1980s game emulation:
• http://www.intellivisionlives.com/
• http://stella.atari.org/
• Refreshing may actually be harder…
– or at least intrinsically bound to the migration problem
• http://web.archive.org/web/20011127114113/http://www.aisti.org/
• http://web.archive.org/web/19971210220634/http://libwww.lanl.gov/
Preservation Metrics So Far
• Nelson & Allen
– 3% decay of objects in DLs
• http://www.dlib.org/dlib/january02/nelson/01nelson.html
• Lawrence, et al.
– 3% decay of URLs included in technical papers
• http://www.neci.nec.com/~lawrence/papers/persistencecomputer01/bib.html
• Koehler
– ~ 33% of URLs “unstable” or “partially unstable”
• http://InformationR.net/ir/4-4/paper60.html
• Kahle
– average URL lasts 44 days
• http://www.hackvan.com/pub/stig/articles/trustedsystems/0397kahle.html
Case Study: ICASE
• Institute for Computer Applications in Science and
Engineering
– independent research institute affiliated with NASA
Langley Research Center
• www.icase.edu
– years of operation: 1972-2002
– combined with other LaRC institutes, rolled into the
National Institute for Aerospace (NIA)
• ICASE Report Series
– pre-prints/e-prints of all ICASE affiliated authors
• also issued as NASA Contractor Reports
– Dienst was used for report management & workflow
• Harrison, Zubair & Nelson, JCDL 03, Dienst <-> OAI-PMH
gateway
NIA Transition
• At first, all files at www.icase.edu were lost
• then, the site was brought back online
• but how well do DLs survive bulk-transfer?
Whither the ICASE Digital Library?
it appears to be reinstated…
but not completely…
How Long is Forever?
• Average human life span
(from: http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html)
– female: 78
– male: 77
• Average Fortune 500 company lifespan
(from: http://www.businessweek.com/chapter/degeus.htm)
– 40 - 50 years
• Universities?
• U.S. Government agency or institution?
– what about individual labs?
• NASA Zero Base Review
• U.S. Military BRAC
Self-Preservation
• Objects should be prepared to outlive the people &
institutions that are charged with their well-being
• Many areas of risk:
– company, agency, university, etc. ceases to exist
– funding cut
– person dies
– disaster (hurricane, earthquake, etc.)
– malicious attack
P2P Model
• Applicable for scientific and technical
information?
– Napster, Gnutella, etc. rely on the repetitive nature of
popular culture media (songs, movies, etc.) to ensure the
availability of items
– a “bubble” of recent and popular interest
• this assumption is probably not valid in STI DLs
– cf. popularity(HBO) >> popularity(AMC)
Smart Objects, Dumb Archives
Buckets
DA
Fedora? METS?
Guildford Protocol
???
OAI-PMH
“Key Concepts in the Architecture
of the Digital Library”
• next 9 slides taken from Bill Arms’s seminal
article in the inaugural issue of D-Lib
Magazine:
– http://www.dlib.org/dlib/July95/07arms.html
The technical framework exists
within a legal and social framework
• DLs no longer represent systems specific to
academics or information specialists
– content influences how the DL is used
• architecture must allow the implementation
of various policies
Understanding of digital library
concepts is hampered by terminology
• “common English” != “professional English”
– multiple professional jargons too
• What do these words mean to you?
– copy
– publish
– content
– document
– work
The underlying architecture
should be separate from the
content stored in the library
• general purpose functions and content-specific functions should be separated
• TL analogy:
– the more specific the bookshelf is to holding
actual books, the harder it is to repurpose the
bookshelf in the future
Names and identifiers are the basic
building block for the digital library
• names != addresses
• in any DL architecture diagram, (almost)
anything that can be drawn can be named
• consider the impact that handles/DOIs have
had on the publishing/DL community
Digital library objects are more
than collections of bits
• objects = metadata + data
– “but what is metadata?”
• don’t ask hard questions
figure 2 in http://www.dlib.org/dlib/July95/07arms.html
The digital library object that is used
is different from the stored object
• what you store is not necessarily what you
get
– storage and dissemination are separate events,
and can represent separate formats
• also, potentially separate from the application-specific format
Users want intellectual works,
not digital objects
• The DL architect’s needs should not
inconvenience the users’ needs
• recombination of objects
– what is an object in your world view?
figure 4 in http://www.dlib.org/dlib/July95/07arms.html
Repositories must look after the
information they hold
• “Repository Access Protocol”
– Kahn Wilensky Framework
• http://www.cnri.reston.va.us/home/cstr/arch/k-w.html
figure 3 in http://www.dlib.org/dlib/July95/07arms.html
Objects vs. Archives
• This is the tenet that I question…
• Most DL objects still bound to the
applications that generate or render the
objects
Design Goals
• Aggregation
– DLs should be shielded from the transient
nature of file formats
– Prevent information hemorrhaging by archiving
all data types
• Intelligence
– Aggregation (above) implies code, why stop at
passive objects? Make objects smart...
– Bucket-bucket & bucket-tool intelligence
Design Goals
• Self-Sufficiency
– Maximum autonomy & survivability: fully self-sufficient buckets
– Option to internally store all needed materials
• Mobility
– Why should an information object be stuck in
one place?
– Mobility for replication, workflow, data
collection
Design Goals
• Heterogeneity
– One size does not fit all...
– Different buckets for different applications, sites,
disciplines, etc.
• Archive Independence
– Focus is on information, not yet another DL “system”
• does not require an archive to function
– “Work with everything; break nothing”
Smart Objects
• aggregate:
– metadata
– data
– methods to operate on the metadata/data
• http://www.cs.odu.edu/~mln/teaching/cs595f03/?method=getMetadata&type=all
(METS in the future?)
• http://www.cs.odu.edu/~mln/teaching/cs595f03/?method=listMethods
• http://www.cs.odu.edu/~mln/teaching/cs595f03/?method=listPreference
• (cheat) http://www.cs.odu.edu/~mln/teaching/cs595f03/bucket/bucket.xml
• assumptions
– Perl
– http server
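The example URLs above all follow one pattern: the bucket's address plus a `method` argument and optional method-specific arguments. A minimal sketch of that pattern follows, assuming a placeholder bucket address `http://example.org/bucket/`; the helper name `bucket_url` is made up for illustration and is not part of the bucket code itself.

```python
from urllib.parse import urlencode

def bucket_url(bucket_base, method, **arguments):
    """Compose a bucket method call as a URL (hypothetical helper)."""
    query = {"method": method}
    query.update(arguments)  # e.g. type="all" for getMetadata
    return bucket_base + "?" + urlencode(query)

metadata_url = bucket_url("http://example.org/bucket/", "getMetadata", type="all")
methods_url = bucket_url("http://example.org/bucket/", "listMethods")
```

Because every operation is just an HTTP request against the object, any tool that can fetch a URL can talk to a bucket.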
Internal Structure
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls
bucket/ CVS/ index.cgi*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/
bucket.xml* content/ CVS/ lib/ logs/ methods/
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/content/
~syllabus.txt          ~week1~readings.html    ~week5~readings.html
~week10~readings.html  ~week1~week-01.ppt      ~week6~readings.html
~week11~readings.html  ~week2~readings.html    ~week7~readings.html
~week12~readings.html  ~week2~week-02.ppt      ~week8~readings.html
~week13~readings.html  ~week3~assignment1.ppt  ~week9~readings.html
~week14~readings.html  ~week3~readings.html
~week15~readings.html  ~week3~week-03.ppt
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/lib
CVS/ EZXML.pm mime.e style.css
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/logs/
access.log CVS/ mylog.log
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/methods/
addElement.pl*     getElement.pl*   listMethods.pl*     setPreference.pl*
CVS/               get_log.pl*      listPreference.pl*
deleteElement.pl*  getlog.pl*       log.pl*
display.pl*        getMetadata.pl*  setMetadata.pl*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 %
Examples
• 1.6.X bucket
– http://ntrs.nasa.gov/
– http://www.cs.odu.edu/~mln/phd/
• 2.0 buckets
– http://www.cs.odu.edu/~mln/teaching/cs595-f03/
– http://dochter.seven.research.odu.edu:3257/~aravind/test2/b36/
Self-Preservation
• Objectives:
– knowledge of the system state not required
• i.e. -- you don’t need to keep track of where
everything is…
– the knowledge required for each object should
be minimal
• actually, the required number of “friends” should be
finite, even in very large systems
Friends and Family
• Friends
– connections to “other” buckets
• Family
– connections to replications of you
Scenario: 3 buckets / 2 pals each
A Pals: b,c
B Pals: a,c
C Pals: b,a
We want to add new_guy (D)
A Pals: b,c
B Pals: a,c
C Pals: b,a
D Pals: (none)
Tool calls: C.insert(D,"start")
A Pals: b,c
B Pals: a,c
C Pals: b,a
D Pals: (none)
D is added to C's pal list
A Pals: b,c
B Pals: a,c
C Pals: b,a,(d) (C pal_list is overstuffed)
D Pals: (none)
Return handshake: D.insert(C,"finish")
A Pals: b,c
B Pals: a,c
C Pals: b,a,(d) (C pal_list is overstuffed)
D Pals: c
C "refits" pal list …
A Pals: b,c
B Pals: a,c
C Pals: b,a,(d) (C pal_list is overstuffed)
D Pals: c
Refit step 1: C.pop_1st_pal not known by (D)
A Pals: b,c
B Pals: a,c
C Pals: b,a,d (now C pal_list is overstuffed)
D Pals: c
Refit step 2: B.pop_pal( C )
A Pals: b,c
B Pals: a,c
C Pals: a,d
D Pals: c
Refit step 2: B.insert( D, "start" )
A Pals: b,c
B Pals: a,d
C Pals: a,d
D Pals: c
Refit step 3: D.insert( B, "finish" )
A Pals: b,c
B Pals: a,d
C Pals: a,d
D Pals: c,b
A Pals: b,c
Pals: a,d B
C Pals: a,d
D
Pals: c,b
10 Buckets, 4 Friends: Steps 2 through 10 (one figure per step)
20 Buckets, 4 Friends
100 Buckets, 10 Friends
Building the Network
Bucket:
this_node_name;
max_friend_size;
list_of_pals;
insert ( new_guy, string handshake )
// Adds new_guy to this bucket's pal list
// handshake = "start" or "finish"
{
if ( I know new_guy ) { return list_of_pals; }
else {
put new_guy at end of my pal list;
if ( handshake == "start" )
{ new_guy.insert(this_node_name, "finish"); }
if ( my pal list is now overstuffed )
{ refit(new_guy); }
}
return list_of_pals;
}
refit ( new_guy )
// To keep pal_list from being overstuffed
{
read in new_guy's pal list;
Y = pop_1st_pal_list();
// I remove 1st pal "Y" from my list that's
// not present in "new_guy's" pal list
Y.pop_from_list(Me);
// Have "Y" pop "Me"
Y.insert(new_guy, "start");
// Y adds new_guy to his list
// this will call new_guy to add "Y" as well
}
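The pseudocode above can be exercised with a small in-memory simulation. This is a sketch under stated assumptions, not the bucket implementation: buckets are plain Python objects, the shared `network` dictionary stands in for whatever lets one bucket reach another by name, and the names `Bucket`, `network`, and `pals` are chosen here for illustration. It replays the slide scenario of adding D to a three-bucket network with two pals each.

```python
class Bucket:
    def __init__(self, name, max_friends, network):
        self.name = name
        self.max_friends = max_friends
        self.pals = []
        self.network = network  # assumed registry: bucket name -> Bucket
        network[name] = self

    def insert(self, new_guy, handshake):
        """Add new_guy to this bucket's pal list ("start" or "finish")."""
        if new_guy == self.name or new_guy in self.pals:
            return self.pals
        self.pals.append(new_guy)
        if handshake == "start":
            # Return handshake: the new bucket records us as a pal too.
            self.network[new_guy].insert(self.name, "finish")
        if len(self.pals) > self.max_friends:
            self.refit(new_guy)
        return self.pals

    def refit(self, new_guy):
        """Hand one old pal over to new_guy so the pal list fits again."""
        new_guys_pals = self.network[new_guy].pals
        for candidate in self.pals:
            if candidate != new_guy and candidate not in new_guys_pals:
                break
        else:
            return  # no pal can be handed over; stay overstuffed
        self.pals.remove(candidate)
        other = self.network[candidate]
        if self.name in other.pals:
            other.pals.remove(self.name)  # "Y" pops "Me"
        other.insert(new_guy, "start")    # "Y" and new_guy befriend each other

network = {}
a, b, c, d = (Bucket(n, 2, network) for n in "ABCD")
a.pals, b.pals, c.pals = ["B", "C"], ["A", "C"], ["B", "A"]
c.insert("D", "start")  # the "tool" call from the slides
```

After the call, the pal lists match the final slide: A keeps b,c, while B and C each hold a,d and D holds c,b.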
Communications Cost:
Building the Network
• Total communications cost to build the
network
b^2 - f - (b-f)^2
• b = # of buckets
• f = # of friends
Communications Cost:
Building the Network
Communications Cost:
Traversing the Network
• Flood algorithm:
b(f-1) - f + 2
• Spanning Tree:
b-1
• Upper bound on the diameter of the network:
(b-f) /2 +1
– (typically much less)
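The cost expressions on these slides are easy to evaluate for a concrete network. The sketch below assumes the build cost reads b^2 - f - (b-f)^2, that is, that the trailing "2"s in the extracted slide text were superscripts; the function names are made up for illustration.

```python
def build_cost(b, f):
    """Total messages to build the network, assuming b^2 - f - (b-f)^2."""
    return b**2 - f - (b - f)**2

def flood_cost(b, f):
    """Messages for a flood traversal: b(f-1) - f + 2."""
    return b * (f - 1) - f + 2

def spanning_tree_cost(b):
    """Messages for a spanning-tree traversal: b - 1."""
    return b - 1

def diameter_upper_bound(b, f):
    """Upper bound on the network diameter: (b-f)/2 + 1."""
    return (b - f) / 2 + 1

# The 10-bucket, 4-friend network used in the step-by-step figures.
costs = (build_cost(10, 4), flood_cost(10, 4),
         spanning_tree_cost(10), diameter_upper_bound(10, 4))
```

For b = 10 and f = 4 this gives a build cost of 60 messages, 28 for a flood, 9 for a spanning tree, and a diameter bound of 4.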
Network Resiliency
• The network can
survive at least f-1
node (bucket) or edge
(communications)
failures and still
remain fully connected
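The resiliency claim can be checked by brute force on a small network. The sketch below uses the four-bucket, two-friend network from the insertion walkthrough and tests node failures only, not edge failures; `is_connected` and `survives_node_failures` are names chosen for illustration.

```python
from itertools import combinations

def is_connected(pals, removed=frozenset()):
    """True if the buckets that are still alive can all reach one another."""
    alive = [name for name in pals if name not in removed]
    if not alive:
        return True
    seen = {alive[0]}
    stack = [alive[0]]
    while stack:
        node = stack.pop()
        for pal in pals[node]:
            if pal not in removed and pal not in seen:
                seen.add(pal)
                stack.append(pal)
    return len(seen) == len(alive)

def survives_node_failures(pals, k):
    """True if every possible set of k failed buckets leaves the rest connected."""
    return all(is_connected(pals, frozenset(failed))
               for failed in combinations(pals, k))

# Final state of the walkthrough: 4 buckets, f = 2 friends each.
pals = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["C", "B"]}
one_failure_ok = survives_node_failures(pals, 1)
two_failures_ok = survives_node_failures(pals, 2)
```

With f = 2 the network tolerates any single bucket failure (f-1 = 1), while losing both A and D separates B from C.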
Cf. Other P2P Projects
• Gnutella
– also O(N^2) to build the network
• currently don’t know the exact message cost
• Chord, Tapestry, etc.
– content addressable networking
• hash function to map keys to locations
– orthogonal to buckets
Chatting
• the stored objects are inactive until invoked
– if no one communicates with the object, it never wakes
up, can never perform self-tests, etc.
• solution:
– circulate a number of tokens through the network to
ensure that everyone is woken up
– buckets can perform a number of administrative tasks at
these times
• Core to solving the migration issue
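One way to picture the wake-up mechanism is a token that travels along pal links and triggers a housekeeping callback at each bucket it reaches. The sketch below is an assumption about how such routing could work, using a simple breadth-first order over the four-bucket example network; the slides do not specify the routing, and `circulate_token` and `on_wake` are hypothetical names.

```python
from collections import deque

def circulate_token(pals, start, on_wake):
    """Visit every bucket reachable from start once, calling on_wake at each."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        on_wake(name)  # e.g. self-test, check for new services or repositories
        visited.append(name)
        for pal in pals[name]:
            if pal not in seen:
                seen.add(pal)
                queue.append(pal)
    return visited

pals = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["C", "B"]}
woken = []
order = circulate_token(pals, "A", woken.append)
```

Each bucket only needs its own pal list to forward the token, which fits the objective that no bucket keeps track of the whole system.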
Communications Tokens
Flocking…
• Craig Reynolds, “Flocks, Herds, and Schools: A
Distributed Behavioral Model”, SIGGRAPH 87
• Observations:
– flocks, schools, herds, etc. exhibit many desirable properties:
• scale-free
– neighbors matter, not total size of flock
• no upper bound
– flocks are never “full”
– flocks, etc. can be modeled with simple rules:
• Collision Avoidance: avoid collisions with nearby flockmates
• Velocity Matching: attempt to match velocity with nearby flockmates
• Flock Centering: attempt to stay close to nearby flockmates
Flocking for DLs
• Collision Avoidance
– Flocking Boids: avoid collisions with nearby flockmates
– Flocking Buckets: not overwriting one's own copies nor the copies of other buckets (i.e., namespace collision avoidance)
• Velocity Matching
– Flocking Boids: attempt to match velocity with nearby flockmates
– Flocking Buckets: deleting copies of oneself to provide “space” for late arrivals in a storage location
• Flock Centering
– Flocking Boids: attempt to stay close to nearby flockmates
– Flocking Buckets: following others to available storage locations
Flocking (9,4)
“new repository available”
“new repository available”
Flocking (10,4)
Future Work
• Friends
– optimizing the connections while sending the
communication token
• convert to small world graph over time
– repair faults in the network
• Family
– types
• active
• passive
– provenance / authenticity
Other Applications for
Smart Objects
• communication pulses will share the
location of new services
– format conversion (migration)
– new repository locations (refreshing)
– submit logs, alerts, other messages to people,
services, etc.
• self-arranging displays
Self-Arranging Displays
For Buckets
• premise: to have the links in the object
reflect the community’s preferences
– real-time computation; no log file processing
– Bollen & Nelson, “Adaptive Networks of Smart
Objects”
– http://www.cs.odu.edu/~mln/pubs/bollenj_adaptive.pdf
Hebbian Learning
http://b1?method=display&referer=b1&redirect=http://b2?method\
=display\%26referer=http://b1
http://b2?method=display&referer=b2&redirect=http://b1?method\
=display\%26redirect=http://b3?method=display\%26referer=http://b2
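The redirect URLs above carry a `referer` argument, so a bucket can tell which bucket the reader came from. A Hebbian-style update can then be sketched as: each observed traversal strengthens the link between the two buckets, and links are displayed strongest first. This is a simplified illustration and not the algorithm from the Bollen & Nelson paper; the class `LinkWeights`, the fixed increment `rate`, and the tie-breaking by name are all assumptions.

```python
from collections import defaultdict

class LinkWeights:
    """Per-link weights updated at request time, with no log file processing."""

    def __init__(self, rate=1.0):
        self.rate = rate                   # assumed fixed reinforcement step
        self.weights = defaultdict(float)  # (source, target) -> weight

    def record_traversal(self, referer, target):
        """Strengthen the link referer -> target after one observed traversal."""
        self.weights[(referer, target)] += self.rate

    def ranked_links(self, source):
        """Targets linked from source, strongest first (ties broken by name)."""
        links = [(t, w) for (s, t), w in self.weights.items() if s == source]
        return [t for t, w in sorted(links, key=lambda pair: (-pair[1], pair[0]))]

links = LinkWeights()
links.record_traversal("b1", "b2")
links.record_traversal("b1", "b3")
links.record_traversal("b1", "b2")
ranking = links.ranked_links("b1")
```

Here b2 outranks b3 for readers starting at b1 because that path was followed twice.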
Initial Experiment
• Elango, Bollen & Nelson, "Dynamic Linking of
Smart Digital Objects Based on User
Navigation Patterns"
– http://www.cs.odu.edu/~aelango/html/adaptive.pdf
– Take top 50 all-time pop music bands
• from Spin Magazine’s top 50 bands of all time
– From each band, take 2 “related” bands
• according to allmusic.com
– Create network of 150 buckets with band info
(metadata from allmusic.com)
– Randomize the network
• each band points to 3 other randomly selected bands
– Get people to traverse the network…
Sample Screenshot
Sample Results
From the Initial Node:
Public Enemy
Related Work
Project: Comments
• Kahn-Wilensky Framework Digital Objects (including Warwick Framework & FEDORA)
– KWF DOs never implemented; unsure of WF. FEDORA is CORBA-based. FEDORA is actively being developed, but we’re unaware of a “demo” at this time.
• Multivalent Documents
– “Semantic layers” – overlays lenses on the document (ala translations, annotations, geospatial data).
• OpenDoc, OLE, DCOM, etc.
– Extending document functionality through application embedding.
• Metaphoria
– Aggregating information sources; separating content from presentation.
• VERS Encapsulated Objects
– self-documenting XML encoded data files for long term storage of Australian government documents
• Aurora
– Encapsulation of content, metadata and services. CORBA-based implementation (?).
• E-commerce (cryptolopes & DigiBox)
– “Super-distribution” focused… also encryption and anonymity
• Filesystems & Formats (ELFS, HDF, netCDF, etc.)
– Experimental filesystems and self-describing data formats; focus on high performance computing applications
see www.fedora.info for latest update
see http://techreports.larc.nasa.gov/ltrs/PDF/2001/tm/NASA-2001-tm211426.pdf
Risks
• Why have these projects met with limited
success or been used only in niche
applications?
– it is one thing to add a layer to your DL, but
changing the structure of your first-class
objects incurs a level of short-term risk
– however, even the most well-thought out
componentized DL is subject to long-term risks
• cf. ICASE DL
Conclusions
• Smart objects are an idea whose time has come
– natural progression of DL R&D
• Smart objects will play a fundamental role in
digital preservation