Keeping Digital Documents Usable: Managing Data Formats
John Mark Ockerbloom
Carnegie Mellon University
April 27, 1999

What I’ll be talking about
• A data model and architecture that supports definition, use, and conversion of an arbitrarily large number of data formats
• How this model helps digital libraries and archives keep electronic documents accessible and usable over the long term

The problem of electronic document preservation
• We can digitize lots of information, but it can become inaccessible very quickly
  – (150 year old book vs. 5 year old 5 1/4” floppy)
• Electronic preservation problems differ sharply from print preservation problems
• Preserving the bits is easy: just replicate them
  – (Remove the hardware dependency first if you can)
  – Internet allows very wide replication, avoiding single-archive failures
• The problem is understanding the bits, so that you can continue to use them...

Data format mismatch
• In a large, diverse, digital archive, information comes in a variety of formats
• Most clients only understand a few formats
• They therefore cannot effectively use many materials
  – data may be in incomprehensible form
  – data may be in form not easily worked with
• Particularly problematic:
  – formats that have complex (but useful) structure
  – legacy data and programs (obsolete format assumptions)
• In a long-lived library, most information IS “legacy”

Standards are a partial solution
• Standards allow common understandings…
  – Data: SGML/XML, Word processor formats, HTML, PDF, Quark, specialized scientific formats, page image formats…
  – Metadata: USMARC, Dublin Core, RDF...
• …But no one standard fits all
  – different uses may require different data choices
  – “lowest common denominator” often not good enough
• And standards change over time
  – needs and applications change (sometimes quickly)
  – standardization process lags
  – even established standards become obsolete
    » Who supports EBCDIC now?
    » Who will support 1999 standards in 2049?

Technologies for usefully preserving unfamiliar formats
• Emulation
  – don’t change the data; maintain programs to deal with it
  – Essentially data abstraction, since the “emulation” just needs to provide same functionality, and may be implemented very differently from original
  – But: May be costly to maintain infrastructure; may unnecessarily lock user into old interaction styles
• Migration
  – Periodically convert data to more “up-to-date” formats; then use your everyday programs on it
  – But: How do you control information loss?

An archived document

Its raw electronic form
    From: Sherry T Haddock <shaddock@csr.uta.edu>
    To: caeti@nosc.mil
    Subject: CAETI Community Meeting Info
    Date: Thu, 15 Feb 1996 17:12:52 -0600 (CST)
    Mime-Version: 1.0
    Content-Type: MULTIPART/MIXED; BOUNDARY="608184028-521714262-824425972=:20798"
    Cc: Sherry T Haddock <shaddock@csr.uta.edu>

    This message is in MIME format. The first part should be readable text,
    while the remaining parts are likely unreadable without MIME-aware tools.
    Send mail to mime@docserver.cac.washington.edu for more info.

    --608184028-521714262-824425972=:20798
    Content-Type: TEXT/PLAIN; charset=US-ASCII

    Here are maps detailing the March CAETI Community Meeting Location.
    ...
    Thanks again, Sherry <shaddock@csr.uta.edu>

    --608184028-521714262-824425972=:20798
    Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"
    Content-Transfer-Encoding: BASE64
    Content-ID: <Pine.SUN.3.90.960215171252.20798B@csr.uta.edu>
    Content-Description:

    KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjAp
    DQoNCjojODBLQ0E0VCklZUtGISI2NiUzYzgmIjgtYCMzIiQpIU4hLSlEMyVt
    ZC1tNGkrJ2EnWiUhTiIhbCEhLSFyW20NCg0KKiEhQiFOIVgiISohJCEzIzMj
    IiEhISFtIU4hLSIhKiEkcltxMyFgIzMjMnEzcnJxM1hJaHJOIS1BISohJCFg
    Iw0KDQozIWAzIU4hLSYhKiEkIkojMyFgRiFOIS0pISohJCMzIzMhYFMhTiEt
    LCEqISQkISMzIWBkIU4hLTEhKiEkcltxDQoNCjMhcmxyTiEtNCEqISQlSiMz
    IWEtIU4hLTghKiEkJjMjMyFhQiFOITJxcmohJHJbcTNycnEzVCEiNSEqIXE

(Emailed, MIME-attached, base64, binhexed, Powerpoint 3)

TOM Basic Data Concepts
• An object is a (non-mutable) typed value
  – objects can come from anywhere, be passed about (cf. MIME)
• A type specifies abstractly what can be done with an object
  – includes attributes, methods, with slots for specs...
• An encoding is a mapping from one type to another type that represents the original type
  – usually maps to a simpler type
  – “lowest-level” type: byte sequence
• A format is a representation of a type as a byte sequence.
  – As a type plus a sequence of encodings
  – Conversions can be defined between formats

Example: A simple mail message as a TOM object
• Value is the content of the message
• Type is simple “mail message” type
  – attributes: sender, recipients, header, body...
  – methods: get_attachment (num)...
• This “mail message” type has encodings:
  – “standard” encoding is RFC822 encoding (as ASCII byte sequence)
• Format is just a type with encodings:
  – “mail message” type in “standard” encoding
  – could also have further encodings (e.g. mail header)
• TOM ships objects around with format tags

Part of the TOM type hierarchy
[Figure: type hierarchy diagram with nodes Object, Reference, URL, Powerpoint, Package, Binhex, Communication, Mail message]
Subtypes are substitutable for supertypes (cf. Liskov & Wing)

Substitutable types?
• What are they?
  – In a “substitutable” subtyping model, objects of type T behave exactly like objects of T’s supertypes when used through supertype interfaces
  – Conceptually, there’s an “abstraction mapping” where each object in a subtype has a corresponding object in the supertype that can “substitute” for it
• Why are they important?
  – They allow unfamiliar types to be used through familiar supertype interfaces, with information and behavior guaranteed to be consistent with the supertype.
  – (Most other OO systems don’t require this)
  – The guarantee of the information preserved by a type is also useful in characterizing conversions

What can you do with an object in an unfamiliar format?
• If you know its type, you can use it through its interface (attributes, methods)
  – e.g. getting “sender”, “recipients” attributes of mail
• If you know one of its encodings, you can use it through its encoded type
  – e.g. displaying the text encoding of the mail
• If you don’t know its type, but know one of its supertypes, use its supertype’s interface
  – e.g. using “element”, “describe” methods of package
• There may be a conversion to a known format
  – e.g. Powerpoint slide format to GIF image format

The architecture supporting TOM (simplified)
[Diagram: clients, type brokers, and servers]
• Clients get info on formats, request operations (e.g. conversions)
• Brokers maintain info on formats, invoke servers for operations
• Servers implement operations
• Clients can also register new formats, operations, server information...
• Brokers can trade info, consult other brokers

TOM in action
• Brokers, apps deployed at Carnegie Mellon
  – TOM Conversion Service, TOM Frame Service
  – Uses off-the-shelf technology (converters from the Net, Web browsers)
  – Try them yourself!
    Demos at http://tom.cs.cmu.edu/
• It should work even better when scaled up
  – Broker software to be released later in 1999
  – Anyone can add new types, methods, formats, services
  – The more sites get involved, the more services will be available, and the more expertise brokers will have

Brokers enable smart conversions
If formats are nodes, and conversions are directed edges...
[Conversion-graph diagrams, with captions:]
• This doesn’t scale
• This may lose too much information
• This gives greatest flexibility

But how do we find best conversion path?
• Breadth-first search is a good starting point
• We can label edges with conversion characteristics
• Example: Conversions can be said to respect a type
  – (preserve the information given in that type)

Respectful conversions?
• A conversion c respects type T if for all inputs i in c’s domain, i and c(i) are indistinguishable when viewed through the interface of type T
  – strong form: abstraction mapping to T of i and c(i) map to the same “substitute” object in T
  – weak form: there exists no method m of type T for which m(i) can return a value that m(c(i)) cannot return, other parameters and context being equal. (And vice versa. With a similar rule for attributes of T.)
• Why is this important?
  – It says that a conversion preserves all the information defined in type T
  – It lets one specify which information must be preserved

Choices in converting our mail message
[Figure: type hierarchy diagram with nodes Object, Reference, URL, Powerpoint, Package, Binhex, Communication, Mail message]
• Respecting Communication: Sender, recipient, etc., guaranteed preserved
• Respecting Package: Attachment structure guaranteed preserved
• Want different guarantees? You can always declare a new supertype

What’s good about TOM’s design?
• It’s simple (and therefore flexible):
  – Minimal, basic, well understood standards
• It’s accommodating:
  – Describes past, present and future data formats with good breadth and depth of expressiveness
  – It can be composed with a wide variety of programs and databases (including the Web, off-the-shelf programs)
  – Benefits start with very low investment, then increase
• It’s scalable (largely by taking advantage of distributed, interactive nature of Net):
  – Anyone can define new formats and services
  – Brokers coordinate contributions from Net community, allowing efficient sharing of work

Sharing the work: Key for successful digital libraries, archives
• Internet gives new opportunities for collaboration, resource sharing, e.g.
  – Dealing with diverse data formats (TOM broker architecture)
  – Coordinated cataloging, acquisition (some now, can be done more openly and broadly)
  – Sharing crucial metadata (Catalog of Copyright Entries)
• Many long-term payoffs for openly sharing collections
  – Not just the benefits of public access for citizenry, but also...
  – Replication minimizes risks of information loss
  – Value-added services (e.g. indexing, xrefs, search engines)
  – Allows whole community to adapt, migrate, augment information for new needs and situations
• “Giving it away” lets you amortize cost of improvement & maintenance over wide constituency
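
Sketch: searching for a respectful conversion path
The conversion slides treat formats as nodes and conversions as directed edges, suggest breadth-first search as a starting point, and label edges with the types each conversion respects. The Python below is a minimal illustration of that idea, not the TOM broker's actual code. The format names, the type names ("Image", "Slides"), the graph contents, and the function find_conversion_path are all hypothetical. It also assumes that a chain of conversions respects a type whenever every conversion in the chain does.

```python
from collections import deque

# Hypothetical conversion graph: format -> [(target format, types respected)].
# All names and labels here are invented for illustration.
CONVERSIONS = {
    "ppt3": [("gif", {"Image"}), ("ppt4", {"Image", "Slides"})],
    "ppt4": [("html", {"Image", "Slides"}), ("pdf", {"Image"})],
    "gif": [("html", {"Image"})],
    "pdf": [("html", {"Image"})],
}


def find_conversion_path(graph, source, target, must_respect=frozenset()):
    """Breadth-first search for a shortest chain of conversions.

    Only edges whose label includes every type in must_respect are followed.
    Returns the list of formats visited, or None if no such chain exists.
    """
    if source == target:
        return [source]
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        for next_format, respected in graph.get(path[-1], []):
            # Skip formats already reached and conversions that may lose
            # information the caller asked to keep.
            if next_format in visited or not must_respect <= respected:
                continue
            if next_format == target:
                return path + [next_format]
            visited.add(next_format)
            queue.append(path + [next_format])
    return None


# With no guarantee requested, any shortest chain is acceptable.
any_path = find_conversion_path(CONVERSIONS, "ppt3", "html")
# Requiring the hypothetical "Slides" type rules out the route through GIF.
slides_path = find_conversion_path(CONVERSIONS, "ppt3", "html", {"Slides"})
# No chain to GIF keeps slide structure in this made-up graph.
no_path = find_conversion_path(CONVERSIONS, "ppt3", "gif", {"Slides"})
```

Filtering edges during the search, rather than checking a finished path afterwards, means a broker in this sketch never explores chains that already break the requested guarantee.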