Building the digital library community
John Mark Ockerbloom
Carnegie Mellon University
February 8, 1999

The library community
• Centuries of people who have created the works we can read and research in the library
• The modern-day community of library maintainers and users
• More than just a collection of information and people
  – information is organized in a usable manner
  – people are trained professionals, supported by their universities, governments
• More than just one library

The digital revolution
• Population of the Internet (both people and data) is growing exponentially
  – much of the information library users want is now digital
• Particular appeals:
  – Near-instant, cheap access to information
  – Breadth (and sometimes depth) of content
  – Easy to store, serve large quantities of information
  – Just about anyone can easily publish content, as well as “consume” it
• Although…
  – it’s often hard to find the high-quality information you really need (obscure, or not put on-line)
  – little coordination of information
  – distractions abound (noise mixed in with signal)

A vision: Revolutions in learning, teaching, creating, collaborating
• Very large libraries (tens of millions of “volumes”) available to scholars, students, general public
• Distributed: enables distance learning, collaboration
• Accessible quickly, at low cost, through a variety of methods
• Easily searchable, citable, extendable, preservable
• New kinds of resources available (e.g. conceptual networks, intelligent tutors)

A nightmare?
• Thousands of uncoordinated, isolated systems?
• Prohibitive cost structures?
• Hard to use systems, lacking the affordances of previous media?
• Data and metadata that become unusable within 10-20 years -- or less?
  – “404 Not found”
  – Vendor becomes unreachable… content becomes unreadable!

An opportunity
• Libraries can take the lead in forming communities for digital information
  – have expertise in managing information and organizing it for users
  – acquire much of the content
• University libraries can play special role
  – have large collections, variety of expertise
  – University DLs both serve the university and promote it
  – they need to be both conservative and innovative
• Challenges are both technical and social
  – library science, computer science, HCI, sociology/anthropology...
  – organization and support (politics, economics…)

Some design challenges in digital libraries
• Acquisition
• Cataloging/Searching
  – can be done at much finer grain
  – new ways of searching for things
  – new kinds of metadata may be important
• Access control
• Presentation / Interface
• Preservation and Maintenance

A key design principle: Sharing the work
• Useful even at small scale
  – Coordinated cataloging: The On-Line Books Page
    » 8400 listings, 1M hits/month (60% nongraphical)
  – Sharing crucial metadata: Catalog of Copyright Entries
  – Coordinated acquisition: Catholic Encyclopedia
  – Inter-project dialogue: Book People mailing list
• Larger libraries, projects can enhance each other’s collections at larger scale

A specific problem: Data format mismatch
• Much of the information in a digital library is from outside sources, in variety of formats
• Most clients only understand a few formats
• They therefore cannot effectively use many materials
  – data may be in incomprehensible form
  – data may be in form not easily worked with
• Particularly problematic:
  – formats that have complex (but useful) structure
  – legacy data and programs (obsolete format assumptions)
• Most of the information in large libraries is “legacy”; long lifespan essential!

Standards are a partial solution
• “The wonderful thing about standards is that there are so many to choose from”
  – Data: SGML/XML, Word processor formats, HTML, PDF, Quark, specialized scientific formats, page image formats….
  – Metadata: USMARC, Dublin Core, RDF...
• Standards allow common understandings...
• …But no one standard fits all
  – different sources may make different data choices
  – lowest common denominator often not good enough
  – needs, applications, standards change (sometimes quickly)

TOM: A data model for mediating among diverse data formats
• Allows unfamiliar formats to be
  – operated on via outside services that understand the format
  – related to familiar formats
  – converted into usable formats
    » for the needs of a particular application or user
    » for migration from obsolete technology to new technology
• Works for data that is:
  – accessible as a (typed or typable) sequence of bytes, or
  – accessible through a well-defined, working protocol

TOM lets you get this...

…from this
From: Sherry T Haddock <shaddock@csr.uta.edu>
To: caeti@nosc.mil
Subject: CAETI Community Meeting Info
Date: Thu, 15 Feb 1996 17:12:52 -0600 (CST)
Mime-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="608184028-521714262-824425972=:20798"
Cc: Sherry T Haddock <shaddock@csr.uta.edu>

This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info.

--608184028-521714262-824425972=:20798
Content-Type: TEXT/PLAIN; charset=US-ASCII

Here are maps detailing the March CAETI Community Meeting Location.
...
Thanks again,
Sherry <shaddock@csr.uta.edu>

--608184028-521714262-824425972=:20798
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.SUN.3.90.960215171252.20798B@csr.uta.edu>
Content-Description:

KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjAp
DQoNCjojODBLQ0E0VCklZUtGISI2NiUzYzgmIjgtYCMzIiQpIU4hLSlEMyVt
ZC1tNGkrJ2EnWiUhTiIhbCEhLSFyW20NCg0KKiEhQiFOIVgiISohJCEzIzMj
IiEhISFtIU4hLSIhKiEkcltxMyFgIzMjMnEzcnJxM1hJaHJOIS1BISohJCFg
Iw0KDQozIWAzIU4hLSYhKiEkIkojMyFgRiFOIS0pISohJCMzIzMhYFMhTiEt
LCEqISQkISMzIWBkIU4hLTEhKiEkcltxDQoNCjMhcmxyTiEtNCEqISQlSiMz
IWEtIU4hLTghKiEkJjMjMyFhQiFOITJxcmohJHJbcTNycnEzVCEiNSEqIXE

(Emailed, MIME-attached, base64, binhexed, Powerpoint 3)

TOM: Key ideas
1. Formats can be described by
  – what information they contain
  – how they represent it
  – how they relate to other formats
  Object-oriented models capture these aspects
  So: I use an object-oriented metadata schema to describe data formats
2. Much useful format info, services, distributed throughout the Net.
  Mediators give uniform access to diverse knowledge bases, services
  So: I use a network of mediators (type brokers) to assist with unfamiliar formats

The architecture supporting TOM (simplified)
[Diagram: Clients, Type Brokers, Servers]
• Clients get info on formats, request operations (e.g. conversions)
• Brokers maintain info on formats, invoke servers for operations
• Servers implement operations
• Clients can also register new formats, operations, server information...
• Brokers can trade info, consult other brokers

What’s good about this design?
• It’s simple (and therefore flexible):
  – Minimal, basic, well understood standards
• It’s accommodating:
  – Describes past, present and future data formats with good breadth and depth of expressiveness
  – It can be composed with a wide variety of programs and databases (including the Web, off-the-shelf programs)
  – Benefits start with very low investment, then increase
• It’s scalable (largely by taking advantage of distributed, interactive nature of Net):
  – Anyone can define new formats and services
  – Brokers coordinate contributions from Net community

A cooperative library network (simplified)
[Diagram: Clients, Library Brokers, Servers]
• Clients get info on materials, request services (search, convert...)
• Brokers maintain info on services, invoke servers for operations
• Servers provide materials, services
• Clients can also register new materials, metadata, services...
• Brokers can trade info, consult other brokers

The importance of open content
• Sharing content and/or metadata gives each library boost from others
• Enables distributed indexing and cross-referencing
  – Alta Vista et al. made possible by open content
• Enables replication, minimizing risk of information loss
• More flexibility in adapting and migrating information to new situations
• Users can improve, augment resources and feed them back to libraries

Summary
• The best large-scale digital libraries are built around community
  – technically: distributed, cooperative infrastructure (e.g.
    broker architecture of TOM) developed by experts in multiple domains
  – socially: cooperation between disciplines, organizations; designs that meet the needs of various constituencies
• University libraries like Penn’s can take a leading role in creating DL community
  – Have much of the collections, experts
  – Can provide testbeds, help their users, gain visibility
• Potential for revolutionary benefits, with the right designs

To find out more:
• TOM home page:
  – Conversion service, other demos, technical details, thesis document
    http://tom.cs.cmu.edu/
• The On-Line Books Page:
  – Catalog, Book People archives, copyright entries, selected resources on on-line texts and libraries
    http://www.cs.cmu.edu/books.html
• Personal home page: http://www.cs.cmu.edu/~spok/
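
A sketch: chaining conversions through a type broker

The broker idea from the TOM architecture slides (clients register formats and operations, brokers find and invoke them) can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not TOM's actual interface: the class `TypeBroker`, its methods `register`, `find_chain` and `convert`, and the format names "base64", "bytes" and "text" are all hypothetical names chosen for illustration. The broker here runs in one process and uses a plain breadth-first search to find the shortest chain of registered converters.

```python
import base64
from collections import deque


class TypeBroker:
    """Hypothetical in-process broker; not the real TOM interface."""

    def __init__(self):
        # Maps a source format name to a list of (target format, converter).
        self._converters = {}

    def register(self, source, target, func):
        """Record that func converts data from source format to target format."""
        self._converters.setdefault(source, []).append((target, func))

    def find_chain(self, source, target):
        """Breadth-first search for the shortest converter chain, or None."""
        queue = deque([(source, [])])
        seen = {source}
        while queue:
            fmt, chain = queue.popleft()
            if fmt == target:
                return chain
            for next_fmt, func in self._converters.get(fmt, []):
                if next_fmt not in seen:
                    seen.add(next_fmt)
                    queue.append((next_fmt, chain + [func]))
        return None

    def convert(self, data, source, target):
        """Apply each converter in the chain in order."""
        chain = self.find_chain(source, target)
        if chain is None:
            raise LookupError(f"no conversion path from {source} to {target}")
        for func in chain:
            data = func(data)
        return data


# Illustrative registrations (format names are assumptions of this sketch).
broker = TypeBroker()
broker.register("base64", "bytes", base64.b64decode)
broker.register("bytes", "text", lambda raw: raw.decode("ascii"))

print(broker.convert(b"aGVsbG8=", "base64", "text"))
```

A real broker network would add what this sketch leaves out: brokers consulting each other, servers running the operations remotely, and richer format descriptions than a bare name.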