Keeping Digital Documents Usable: Managing Data Formats John Mark Ockerbloom

Carnegie Mellon University
April 27, 1999
What I’ll be talking about
• A data model and architecture that supports
definition, use, and conversion of an
arbitrarily large number of data formats
• How this model helps digital libraries and
archives keep electronic documents
accessible and usable over the long term
The problem of electronic
document preservation
• We can digitize lots of information, but it can
become inaccessible very quickly
– (150-year-old book vs. 5-year-old 5 1/4” floppy)
• Electronic preservation problems differ
sharply from print preservation problems
• Preserving the bits is easy: just replicate
them
– (Remove the hardware dependency first if you can)
– Internet allows very wide replication, avoiding single-archive failures
• The problem is understanding the bits, so
that you can continue to use them...
Data format mismatch
• In a large, diverse, digital archive, information
comes in a variety of formats
• Most clients only understand a few formats
• They therefore cannot effectively use many
materials
– data may be in incomprehensible form
– data may be in form not easily worked with
• Particularly problematic:
– formats that have complex (but useful) structure
– legacy data and programs (obsolete format assumptions)
• In a long-lived library, most information IS
“legacy”
Standards are a partial solution
• Standards allow common understandings…
– Data: SGML/XML, Word processor formats, HTML, PDF,
Quark, specialized scientific formats, page image
formats….
– Metadata: USMARC, Dublin Core, RDF...
• …But no one standard fits all
– different uses may require different data choices
– “lowest common denominator” often not good enough
• And standards change over time
– needs and applications change (sometimes quickly)
– standardization process lags
– even established standards become obsolete
» Who supports EBCDIC now?
» Who will support 1999 standards in 2049?
Technologies for usefully
preserving unfamiliar formats
• Emulation
– don’t change the data; maintain programs to deal with it
– Essentially data abstraction, since the “emulation” just
needs to provide same functionality, and may be
implemented very differently from original
– But: May be costly to maintain infrastructure; may
unnecessarily lock user into old interaction styles
• Migration
– Periodically convert data to more “up-to-date” formats;
then use your everyday programs on it
– But: How do you control information loss?
An archived document
Its raw electronic form
From: Sherry T Haddock <shaddock@csr.uta.edu>
To: caeti@nosc.mil
Subject: CAETI Community Meeting Info
Date: Thu, 15 Feb 1996 17:12:52 -0600 (CST)
Mime-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="608184028-521714262-824425972=:20798"
Cc: Sherry T Haddock <shaddock@csr.uta.edu>
This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
Send mail to mime@docserver.cac.washington.edu for more info.
--608184028-521714262-824425972=:20798
Content-Type: TEXT/PLAIN; charset=US-ASCII
Here are maps detailing the March CAETI Community Meeting Location. ...
Thanks again,
Sherry <shaddock@csr.uta.edu>
--608184028-521714262-824425972=:20798
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.SUN.3.90.960215171252.20798B@csr.uta.edu>
Content-Description:
KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjAp
DQoNCjojODBLQ0E0VCklZUtGISI2NiUzYzgmIjgtYCMzIiQpIU4hLSlEMyVt
ZC1tNGkrJ2EnWiUhTiIhbCEhLSFyW20NCg0KKiEhQiFOIVgiISohJCEzIzMj
IiEhISFtIU4hLSIhKiEkcltxMyFgIzMjMnEzcnJxM1hJaHJOIS1BISohJCFg
Iw0KDQozIWAzIU4hLSYhKiEkIkojMyFgRiFOIS0pISohJCMzIzMhYFMhTiEt
LCEqISQkISMzIWBkIU4hLTEhKiEkcltxDQoNCjMhcmxyTiEtNCEqISQlSiMz
IWEtIU4hLTghKiEkJjMjMyFhQiFOITJxcmohJHJbcTNycnEzVCEiNSEqIXE
(Emailed,
MIME-attached,
base64,
binhexed,
Powerpoint 3)
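The outer layers of this message can be peeled off with ordinary tools. Below is a minimal sketch using Python's standard email package. It assumes a shortened copy of the message (only the first base64 line of the attachment is kept), and it stops at the BinHex layer; decoding BinHex and rendering the Powerpoint file are not attempted here.

```python
import email

# Shortened copy of the archived message: headers, the text part, and only
# the first base64 line of the attachment (an assumption for brevity).
RAW = (
    'From: Sherry T Haddock <shaddock@csr.uta.edu>\n'
    'To: caeti@nosc.mil\n'
    'Subject: CAETI Community Meeting Info\n'
    'Mime-Version: 1.0\n'
    'Content-Type: MULTIPART/MIXED; '
    'BOUNDARY="608184028-521714262-824425972=:20798"\n'
    '\n'
    '--608184028-521714262-824425972=:20798\n'
    'Content-Type: TEXT/PLAIN; charset=US-ASCII\n'
    '\n'
    'Here are maps detailing the March CAETI Community Meeting Location. ...\n'
    '--608184028-521714262-824425972=:20798\n'
    'Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"\n'
    'Content-Transfer-Encoding: BASE64\n'
    '\n'
    'KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjAp\n'
    '--608184028-521714262-824425972=:20798--\n'
)

# Layer 1: the mail message itself (RFC 822 headers plus a body).
msg = email.message_from_string(RAW)

# Layer 2: the MIME multipart structure (a text part and an attachment).
parts = msg.get_payload()
attachment = parts[1]
attachment_name = attachment.get_param("name")

# Layer 3: the base64 transfer encoding of the attachment.
binhex_bytes = attachment.get_payload(decode=True)

# Layers 4 and 5 (BinHex, then Powerpoint 3) would need further decoders.
print(attachment_name, binhex_bytes[:45])
```

Each step needs knowledge of one more format, which is the point of the slide: the bits survive, but every layer must still be understood.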
TOM Basic Data Concepts
• An object is a (non-mutable) typed value
– objects can come from anywhere, be passed about (cf.
MIME)
• A type specifies abstractly what can be done
with an object
– includes attributes, methods, with slots for specs...
• An encoding is a mapping from one type to
another type that represents the original type
– usually maps to a simpler type
– “lowest-level” type: byte sequence
• A format is a representation of a type as a
byte sequence.
– As a type plus a sequence of encodings
– Conversions can be defined between formats
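These four concepts can be sketched as plain data structures. The following is an illustrative model only, not TOM's actual implementation; the class names, field names, and the TEXT/UTF8 example are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Type:
    """Abstract description of what can be done with an object."""
    name: str
    attributes: Tuple[str, ...] = ()
    methods: Tuple[str, ...] = ()


# The lowest-level type: every format must bottom out here.
BYTES = Type("byte sequence")


@dataclass(frozen=True)
class Encoding:
    """A mapping from one type to another type that represents it."""
    name: str
    source: Type
    target: Type
    encode: Callable[[object], object]


@dataclass(frozen=True)
class Format:
    """A type plus a sequence of encodings ending in a byte sequence."""
    of_type: Type
    encodings: Tuple[Encoding, ...]

    def is_well_formed(self) -> bool:
        # Each encoding must start where the previous one ended.
        current = self.of_type
        for enc in self.encodings:
            if enc.source != current:
                return False
            current = enc.target
        return current == BYTES

    def to_bytes(self, value: object) -> object:
        # Apply the encodings in order to reach the byte representation.
        for enc in self.encodings:
            value = enc.encode(value)
        return value


@dataclass(frozen=True)
class TomObject:
    """An immutable typed value."""
    of_type: Type
    value: object


# Hypothetical example: a text type with a single UTF-8 encoding.
TEXT = Type("text", attributes=("length",))
UTF8 = Encoding("utf-8", TEXT, BYTES, lambda s: s.encode("utf-8"))
TEXT_UTF8 = Format(TEXT, (UTF8,))
```

A conversion between formats would then be a function from the byte representation of one Format to that of another.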
Example: A simple mail
message as a TOM object
• Value is the content of the message
• Type is simple “mail message” type
– attributes: sender, recipients, header, body...
– methods: get_attachment (num)...
• This “mail message” type has encodings:
– “standard” encoding is RFC822 encoding (as ASCII byte
sequence)
• Format is just a type with encodings:
– “mail message” type in “standard” encoding
– could also have further encodings (e.g. mail header)
• TOM ships objects around with format tags
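As a rough illustration, the sketch below pairs a byte sequence with a format tag and uses the tag to pick a decoder that yields a typed mail object. The tag string "mail message/standard", the DECODERS table, and the MailMessage class are hypothetical names, not TOM's own; the RFC822 parsing is delegated to Python's standard email package, and get_attachment is omitted because the example message is a simple one.

```python
from dataclasses import dataclass
from email import message_from_bytes
from email.utils import getaddresses
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class MailMessage:
    """Hypothetical 'mail message' type exposing a few attributes."""
    sender: str
    recipients: Tuple[str, ...]
    header: Tuple[Tuple[str, str], ...]
    body: str


def decode_rfc822(data: bytes) -> MailMessage:
    """Decode the 'standard' (RFC822) encoding into a typed object."""
    msg = message_from_bytes(data)
    to_fields = msg.get_all("To", [])
    recipients = tuple(addr for _, addr in getaddresses(to_fields))
    return MailMessage(
        sender=msg.get("From", ""),
        recipients=recipients,
        header=tuple(msg.items()),
        body=msg.get_payload(),
    )


# Format tag -> decoder. The tag naming scheme is an assumption.
DECODERS: Dict[str, Callable[[bytes], MailMessage]] = {
    "mail message/standard": decode_rfc822,
}


def open_tagged(tag: str, data: bytes) -> MailMessage:
    """Use the format tag shipped with the bytes to recover the object."""
    return DECODERS[tag](data)


raw = (b"From: a@example.org\n"
       b"To: b@example.org, c@example.org\n"
       b"Subject: hi\n"
       b"\n"
       b"Hello\n")
mail = open_tagged("mail message/standard", raw)
```

Because the tag travels with the bytes, a receiver that has never seen the object before still knows which type and encodings to look up.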
Part of the TOM type hierarchy
[Diagram: type hierarchy with nodes Object, Reference, URL, Powerpoint, Package, Binhex, Communication, Mail message]
Subtypes are substitutable for supertypes (cf. Liskov & Wing)
Substitutable types?
• What are they?
– In a “substitutable” subtyping model, objects of type T
behave exactly like objects of T’s supertypes when used
through supertype interfaces
– Conceptually, there’s an “abstraction mapping” where
each object in a subtype has a corresponding object in
the supertype that can “substitute” for it
• Why are they important?
– They allow unfamiliar types to be used through familiar
supertype interfaces, with information and behavior
guaranteed to be consistent with the supertype.
– (Most other OO systems don’t require this)
– The guarantee of the information preserved by a type is
also useful in characterizing conversions
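A small sketch of the idea, under stated assumptions: Package is modeled as an abstract class with count, element, and describe methods (count is added for the sketch), MailMessage is treated as one of its subtypes, and abstraction() plays the role of the abstraction mapping. None of these names are taken from TOM itself.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class Package(ABC):
    """Supertype interface: a container of elements."""

    @abstractmethod
    def count(self) -> int: ...

    @abstractmethod
    def element(self, num: int) -> bytes: ...

    def describe(self) -> str:
        # Defined once here, so every subtype answers consistently.
        return f"package with {self.count()} element(s)"


class PlainPackage(Package):
    """The 'substitute' object living in the supertype."""

    def __init__(self, elements: Iterable[bytes]):
        self._elements = tuple(elements)

    def count(self) -> int:
        return len(self._elements)

    def element(self, num: int) -> bytes:
        return self._elements[num]


class MailMessage(Package):
    """Subtype: adds a sender, but still behaves as a Package."""

    def __init__(self, sender: str, attachments: Iterable[bytes]):
        self.sender = sender
        self._attachments = tuple(attachments)

    def count(self) -> int:
        return len(self._attachments)

    def element(self, num: int) -> bytes:
        return self._attachments[num]


def abstraction(obj: Package) -> PlainPackage:
    """Map a subtype object to its corresponding supertype object."""
    return PlainPackage(obj.element(i) for i in range(obj.count()))


def list_contents(pkg: Package) -> List[bytes]:
    """A client that knows only the Package interface."""
    return [pkg.element(i) for i in range(pkg.count())]


mail = MailMessage("a@example.org", [b"map"])
```

The client function gives the same answers for the mail message and for its substitute, which is the consistency guarantee the slide describes.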
What can you do with an object
in an unfamiliar format?
• If you know its type, you can use it through
its interface (attributes, methods)
– e.g. getting “sender”, “recipients” attributes of mail
• If you know one of its encodings, you can use
it through its encoded type
– e.g. displaying the text encoding of the mail
• If you don’t know its type, but know one of its
supertypes, use its supertype’s interface
– e.g. using “element”, “describe” methods of package
• There may be a conversion to a known format
– e.g. Powerpoint slide format to GIF image format
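The four options can be read as a fallback chain. The sketch below tries them in the order listed; that ordering, the ObjectInfo record, and the strategy labels are assumptions for illustration, as is treating Communication and Package as supertypes of the mail message type.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class ObjectInfo:
    """What a client learns about an incoming object (hypothetical)."""
    format_name: str
    type_name: str
    encoded_types: List[str]   # types reached through its encodings
    supertypes: List[str]      # nearest supertypes first


def choose_strategy(
    info: ObjectInfo,
    known_types: Set[str],
    known_formats: Set[str],
    conversions: Dict[str, List[str]],
) -> Tuple[str, str]:
    """Pick the first usable way of working with the object."""
    # 1. The client knows the type: use its interface directly.
    if info.type_name in known_types:
        return ("use type interface", info.type_name)
    # 2. The client knows a type the object is encoded as.
    for encoded in info.encoded_types:
        if encoded in known_types:
            return ("use encoded type", encoded)
    # 3. The client knows a supertype: use the supertype's interface.
    for supertype in info.supertypes:
        if supertype in known_types:
            return ("use supertype interface", supertype)
    # 4. A conversion leads to a format the client does know.
    for target in conversions.get(info.format_name, []):
        if target in known_formats:
            return ("convert", target)
    return ("unusable", "")


mail_info = ObjectInfo(
    format_name="mail message/standard",
    type_name="Mail message",
    encoded_types=["text"],
    supertypes=["Communication", "Package", "Object"],
)
slide_info = ObjectInfo("Powerpoint slide", "Powerpoint", [], [])
```

A client that knows only "Package" would use the mail message through the Package interface, while a client that knows only the GIF image format would ask for a conversion of the slide.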
The architecture supporting TOM
(simplified)
[Diagram: several Clients, Type Brokers, and Servers]
• Clients get info on formats, request operations (e.g. conversions)
• Brokers maintain info on formats, invoke servers for operations
• Servers implement operations
• Clients can also register new formats, operations, server information...
• Brokers can trade info, consult other brokers
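The division of labor among clients, brokers, and servers might be sketched in-process as follows. This is a toy model, not the TOM broker software: the Broker class, its method names, and the use of plain Python callables as "servers" are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A "server" here is just a callable implementing one conversion.
Conversion = Callable[[bytes], bytes]


class Broker:
    """Keeps a registry of operations and can consult peer brokers."""

    def __init__(self, peers: Optional[List["Broker"]] = None):
        self._servers: Dict[Tuple[str, str], Conversion] = {}
        self.peers: List["Broker"] = list(peers or [])

    def register(self, source: str, target: str, server: Conversion) -> None:
        """Clients (or anyone) can register a new operation."""
        self._servers[(source, target)] = server

    def find(self, source: str, target: str,
             _seen: Optional[Set[int]] = None) -> Optional[Conversion]:
        """Look locally first, then ask peers, avoiding cycles."""
        seen = _seen if _seen is not None else set()
        if id(self) in seen:
            return None
        seen.add(id(self))
        server = self._servers.get((source, target))
        if server is not None:
            return server
        for peer in self.peers:
            server = peer.find(source, target, seen)
            if server is not None:
                return server
        return None

    def convert(self, source: str, target: str, data: bytes) -> bytes:
        """Invoke whichever server implements the requested operation."""
        server = self.find(source, target)
        if server is None:
            raise LookupError(f"no conversion from {source} to {target}")
        return server(data)


# A remote broker knows a conversion the local one does not.
remote = Broker()
remote.register("text/lower", "text/upper", bytes.upper)
local = Broker(peers=[remote])
remote.peers.append(local)  # brokers may refer to each other
result = local.convert("text/lower", "text/upper", b"abc")
```

The client talks only to its local broker; the broker finds a server elsewhere, which is how expertise can be shared across sites.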
TOM in action
• Brokers, apps deployed at Carnegie Mellon
– TOM Conversion Service, TOM Frame Service
– Uses off-the-shelf technology (converters from the Net,
Web browsers)
– Try them yourself! Demos at http://tom.cs.cmu.edu/
• It should work even better when scaled up
– Broker software to be released later in 1999
– Anyone can add new types, methods, formats, services
– The more sites get involved, the more services will be
available, and the more expertise brokers will have
Brokers enable smart
conversions
If formats are nodes, and conversions are directed edges...
This doesn’t scale
This may lose too much
information
This gives greatest
flexibility
But how do we find best conversion path?
• Breadth-first search is a good starting point
• We can label edges with conversion characteristics
• Example: Conversions can be said to respect a type
– (preserve the information given in that type)
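A minimal sketch of this search: formats are nodes, conversions are directed edges, and each edge is labeled with the set of types it respects. The graph contents and format names below are invented for illustration; only the breadth-first strategy and the idea of edge labels come from the slide.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# source format -> list of (target format, types the conversion respects)
Graph = Dict[str, List[Tuple[str, Set[str]]]]


def find_path(edges: Graph, source: str, target: str,
              must_respect: FrozenSet[str] = frozenset()
              ) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of conversions.

    Only edges whose label includes every type in must_respect are
    followed, so every step of the returned path respects those types.
    """
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for nxt, respected in edges.get(fmt, []):
            if nxt not in seen and must_respect <= respected:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None


# Hypothetical conversion graph.
EDGES: Graph = {
    "mail/rfc822": [
        ("text/plain", {"Object"}),
        ("mail/mbox", {"Object", "Package", "Communication"}),
    ],
    "mail/mbox": [("html", {"Object", "Communication"})],
    "text/plain": [("html", {"Object"})],
}

any_path = find_path(EDGES, "mail/rfc822", "html")
safe_path = find_path(EDGES, "mail/rfc822", "html",
                      frozenset({"Communication"}))
no_path = find_path(EDGES, "mail/rfc822", "html", frozenset({"Package"}))
```

Requiring that Communication be respected rules out the route through plain text, and requiring Package leaves no route at all in this small graph.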
Respectful conversions?
• A conversion c respects type T if for all inputs
i in c’s domain, i and c(i) are indistinguishable
when viewed through the interface of type T
– strong form: abstraction mapping to T of i and c(i) map to
the same “substitute” object in T
– weak form: there exists no method m of type T for which
m(i) can return a value that m(c(i)) cannot return, other
parameters and context being equal. (And vice versa.
With a similar rule for attributes of T.)
• Why is this important?
– It says that a conversion preserves all the information
defined in type T
– It lets one specify which information must be preserved
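The definition suggests a simple spot check: view each sample input and its converted output through the interface of T and compare. The sketch below does this for deterministic attribute views only, so it can refute a claim of respect on the given samples but cannot prove it for all inputs. The dictionary-based mail objects, the COMMUNICATION and PACKAGE interfaces, and strip_attachments are assumptions for the example.

```python
from typing import Callable, Dict, Iterable

Mail = Dict[str, object]
Interface = Dict[str, Callable[[Mail], object]]


def respects_on_samples(conversion: Callable[[Mail], Mail],
                        interface: Interface,
                        samples: Iterable[Mail]) -> bool:
    """Return False if any sample is distinguishable from its conversion
    when viewed through the given interface; True otherwise."""
    for item in samples:
        converted = conversion(item)
        for view in interface.values():
            if view(item) != view(converted):
                return False
    return True


# Hypothetical interfaces for two supertypes of a mail message.
COMMUNICATION: Interface = {
    "sender": lambda m: m["sender"],
    "recipients": lambda m: tuple(m["recipients"]),
}
PACKAGE: Interface = {
    "elements": lambda m: tuple(m["attachments"]),
}


def strip_attachments(mail: Mail) -> Mail:
    """A lossy conversion: keeps addressing, drops attachments."""
    return {**mail, "attachments": []}


samples = [{
    "sender": "a@example.org",
    "recipients": ["b@example.org"],
    "attachments": [b"map"],
}]

keeps_communication = respects_on_samples(strip_attachments,
                                          COMMUNICATION, samples)
keeps_package = respects_on_samples(strip_attachments, PACKAGE, samples)
```

On these samples the conversion is indistinguishable through Communication but not through Package, which is the kind of label a broker could attach to a conversion edge.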
Choices in converting our mail
message
[Diagram: type hierarchy with nodes Object, Reference, URL, Powerpoint, Package, Binhex, Communication, Mail message]
Respecting Communication: Sender, recipient, etc., guaranteed preserved
Respecting Package: Attachment structure guaranteed preserved
Want different guarantees? You can always declare a new supertype
What’s good about TOM’s
design?
• It’s simple (and therefore flexible):
– Minimal, basic, well understood standards
• It’s accommodating:
– Describes past, present and future data formats with
good breadth and depth of expressiveness
– It can be composed with a wide variety of programs and
databases (including the Web, off-the-shelf programs)
– Benefits start with very low investment, then increase
• It’s scalable (largely by taking advantage of
distributed, interactive nature of Net):
– Anyone can define new formats and services
– Brokers coordinate contributions from Net community,
allowing efficient sharing of work
Sharing the work:
Key for successful digital
libraries, archives
• Internet gives new opportunities for collaboration,
resource sharing, e.g.
– Dealing with diverse data formats (TOM broker architecture)
– Coordinated cataloging, acquisition (some now, can be done
more openly and broadly)
– Sharing crucial metadata (Catalog of Copyright Entries)
• Many long-term payoffs for openly sharing collections
– Not just the benefits of public access for citizenry, but also...
– Replication minimizes risks of information loss
– Value-added services (e.g. indexing, xrefs, search engines)
– Allows whole community to adapt, migrate, augment
information for new needs and situations
• “Giving it away” lets you amortize cost of improvement
& maintenance over wide constituency