Building the digital library community John Mark Ockerbloom Carnegie Mellon University

Building the digital library
community
John Mark Ockerbloom
Carnegie Mellon University
February 8, 1999
The library community
• Centuries of people who have created the
works we can read and research in the library
• The modern-day community of library
maintainers and users
• More than just a collection of information and
people
– information is organized in a usable manner
– people are trained professionals, supported by their
universities, governments
• More than just one library
The digital revolution
• Population of the Internet (both people and
data) is growing exponentially
– much of the information library users want is now digital
• Particular appeals:
– Near-instant, cheap access to information
– Breadth (and sometimes depth) of content
– Easy to store, serve large quantities of information
– Just about anyone can easily publish content, as well as
“consume” it
• Although…
– it’s often hard to find the high-quality information you
really need (obscure, or not put on-line)
– little coordination of information
– distractions abound (noise mixed in with signal)
A vision:
Revolutions in learning, teaching,
creating, collaborating
• Very large libraries (tens of millions of
“volumes”) available to scholars, students,
general public
• Distributed: enables distance learning,
collaboration
• Accessible quickly, at low cost, through a
variety of methods
• Easily searchable, citable, extendable,
preservable
• New kinds of resources available (e.g.
conceptual networks, intelligent tutors)
A nightmare?
• Thousands of uncoordinated, isolated
systems?
• Prohibitive cost structures?
• Hard to use systems, lacking the affordances
of previous media?
• Data and metadata that become unusable
within 10-20 years -- or less?
– “404 Not found”
– Vendor becomes unreachable… content becomes
unreadable!
An opportunity
• Libraries can take the lead in forming
communities for digital information
– have expertise in managing information and organizing it
for users
– acquire much of the content
• University libraries can play a special role
– have large collections, variety of expertise
– University DLs both serve the university and promote it
– they need to be both conservative and innovative
• Challenges are both technical and social
– library science, computer science, HCI,
sociology/anthropology...
– organization and support (politics, economics…)
Some design challenges in
digital libraries
• Acquisition
• Cataloging/Searching
– can be done at much finer grain
– new ways of searching for things
– new kinds of metadata may be important
• Access control
• Presentation / Interface
• Preservation and Maintenance
A key design principle:
Sharing the work
• Useful even at small scale
– Coordinated cataloging: The On-Line Books Page
» 8400 listings, 1M hits/month (60% nongraphical)
– Sharing crucial metadata: Catalog of Copyright Entries
– Coordinated acquisition: Catholic Encyclopedia
– Inter-project dialogue: Book People mailing list
• Larger libraries, projects can enhance each
other’s collections at larger scale
A specific problem:
Data format mismatch
• Much of the information in a digital library is
from outside sources, in a variety of formats
• Most clients only understand a few formats
• They therefore cannot effectively use many
materials
– data may be in incomprehensible form
– data may be in form not easily worked with
• Particularly problematic:
– formats that have complex (but useful) structure
– legacy data and programs (obsolete format assumptions)
• Most of the information in large libraries is
“legacy”; long lifespan essential!
Standards are a partial solution
• “The wonderful thing about standards is that
there are so many to choose from”
– Data: SGML/XML, Word processor formats, HTML, PDF,
Quark, specialized scientific formats, page image
formats….
– Metadata: USMARC, Dublin Core, RDF...
• Standards allow common understandings...
• …But no one standard fits all
– different sources may make different data choices
– lowest common denominator often not good enough
– needs, applications, standards change (sometimes
quickly)
TOM: A data model for
mediating among diverse data
formats
• Allows unfamiliar formats to be
– operated on via outside services that understand the
format
– related to familiar formats
– converted into usable formats
» for the needs of a particular application or user
» for migration from obsolete technology to new
technology
• Works for data that is:
– accessible as a (typed or typable) sequence of bytes, or
– accessible through a well-defined, working protocol
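Converting an unfamiliar format into a usable one can be pictured as finding a chain of known conversions. The following is a minimal sketch of that idea, not TOM's actual algorithm: the converter table, the format names, and the function `find_conversion_path` are all hypothetical, and a plain breadth-first search stands in for whatever strategy a real broker would use.

```python
from collections import deque

# Hypothetical converter registry: (source format, target format) pairs.
# The format names are illustrative only; they are not taken from TOM.
CONVERTERS = {
    ("base64", "binhex"),
    ("binhex", "powerpoint3"),
    ("powerpoint3", "html"),
    ("word", "html"),
}


def find_conversion_path(source, target, converters=CONVERTERS):
    """Return the shortest list of formats from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Try every registered conversion that starts at the current format.
        for src, dst in sorted(converters):
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None


print(find_conversion_path("base64", "html"))
```

Under these assumptions, a client holding base64 data and wanting HTML would be told to go through the intermediate formats in order; a request with no known chain returns `None`.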
TOM lets you get this...
…from this
From: Sherry T Haddock <shaddock@csr.uta.edu>
To: caeti@nosc.mil
Subject: CAETI Community Meeting Info
Date: Thu, 15 Feb 1996 17:12:52 -0600 (CST)
Mime-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="608184028-521714262-824425972=:20798"
Cc: Sherry T Haddock <shaddock@csr.uta.edu>
This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
Send mail to mime@docserver.cac.washington.edu for more info.
--608184028-521714262-824425972=:20798
Content-Type: TEXT/PLAIN; charset=US-ASCII
Here are maps detailing the March CAETI Community Meeting Location. ...
Thanks again,
Sherry <shaddock@csr.uta.edu>
--608184028-521714262-824425972=:20798
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.SUN.3.90.960215171252.20798B@csr.uta.edu>
Content-Description:
KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjAp
DQoNCjojODBLQ0E0VCklZUtGISI2NiUzYzgmIjgtYCMzIiQpIU4hLSlEMyVt
ZC1tNGkrJ2EnWiUhTiIhbCEhLSFyW20NCg0KKiEhQiFOIVgiISohJCEzIzMj
IiEhISFtIU4hLSIhKiEkcltxMyFgIzMjMnEzcnJxM1hJaHJOIS1BISohJCFg
Iw0KDQozIWAzIU4hLSYhKiEkIkojMyFgRiFOIS0pISohJCMzIzMhYFMhTiEt
LCEqISQkISMzIWBkIU4hLTEhKiEkcltxDQoNCjMhcmxyTiEtNCEqISQlSiMz
IWEtIU4hLTghKiEkJjMjMyFhQiFOITJxcmohJHJbcTNycnEzVCEiNSEqIXE
(Emailed, MIME-attached, base64, binhexed, Powerpoint 3)
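The first two layers of this message can be peeled off with ordinary tools. The sketch below assumes Python's standard `email` module and rebuilds a cut-down version of the message: headers are trimmed, and only the first base64 line from the slide is included, because the rest of the attachment is truncated in the source. It stops at the BinHex layer and does not attempt to recover the PowerPoint file.

```python
from email import message_from_string

BOUNDARY = "608184028-521714262-824425972=:20798"

# Cut-down copy of the message on the slide; only the first line of the
# base64 attachment body is reproduced here.
RAW = (
    "Mime-Version: 1.0\n"
    f'Content-Type: MULTIPART/MIXED; BOUNDARY="{BOUNDARY}"\n'
    "\n"
    "This message is in MIME format.\n"
    f"--{BOUNDARY}\n"
    "Content-Type: TEXT/PLAIN; charset=US-ASCII\n"
    "\n"
    "Here are maps detailing the March CAETI Community Meeting Location.\n"
    f"--{BOUNDARY}\n"
    'Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"\n'
    "Content-Transfer-Encoding: BASE64\n"
    "\n"
    "KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjAp\n"
    f"--{BOUNDARY}--\n"
)

msg = message_from_string(RAW)
parts = msg.get_payload()          # layer 1: split the MIME multipart body
attachment = parts[1]
name = attachment.get_param("name")
decoded = attachment.get_payload(decode=True)  # layer 2: undo base64

print(name)
print(decoded.decode("ascii"))
```

The decoded line is the standard BinHex 4.0 banner, "(This file must be converted with BinHex 4.0)", which tells a client (or a broker acting for it) what the next layer is, even though the MIME header labels the part as plain text.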
TOM: Key ideas
1. Formats can be described by
– what information they contain
– how they represent it
– how they relate to other formats
Object-oriented models capture these aspects
So: I use an object-oriented metadata schema to
describe data formats
2. Much useful format info, services, distributed
throughout the Net.
Mediators give uniform access to diverse
knowledge bases, services
So: I use a network of mediators (type brokers) to
assist with unfamiliar formats
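The first idea, describing a format by what it contains, how it represents it, and how it relates to other formats, maps naturally onto a small class hierarchy. The sketch below is an illustration only: the class `FormatType`, its fields, and the three sample formats are hypothetical and are not TOM's actual schema.

```python
from dataclasses import dataclass


@dataclass
class FormatType:
    """Hypothetical description of one data format."""
    name: str
    attributes: tuple = ()     # what information the format contains
    encoding: str = "bytes"    # how it represents that information
    supertypes: tuple = ()     # how it relates to other formats

    def is_a(self, other, registry):
        """True if this format is `other` or a transitive subtype of it."""
        if self.name == other:
            return True
        return any(registry[s].is_a(other, registry) for s in self.supertypes)


# Illustrative registry; the entries are assumptions made for this example.
registry = {
    f.name: f
    for f in [
        FormatType("document", attributes=("text",)),
        FormatType("markup", attributes=("text", "structure"),
                   encoding="tagged text", supertypes=("document",)),
        FormatType("html", attributes=("text", "structure", "links"),
                   encoding="tagged text", supertypes=("markup",)),
    ]
}

print(registry["html"].is_a("document", registry))
```

With a relation like `is_a`, a client that only understands the general "document" type could still accept a more specific format it has never seen, as long as some registry records the relationship.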
The architecture supporting TOM
(simplified)
[Diagram: Clients, Type Brokers, and Servers]
• Clients get info on formats, request operations
(e.g. conversions)
• Brokers maintain info on formats, invoke
servers for operations
• Servers implement operations
• Clients can also register new formats, operations,
server information...
• Brokers can trade info, consult other brokers
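The roles in this architecture can be sketched in a few lines. This is a hedged illustration, not TOM's implementation: the class `TypeBroker` and its methods are hypothetical, servers are modeled as plain Python callables, and "consulting other brokers" is reduced to a recursive lookup that avoids cycles.

```python
import base64


class TypeBroker:
    """Hypothetical broker: tracks conversions and the servers that do them."""

    def __init__(self, name):
        self.name = name
        self.servers = {}  # (source, target) -> callable implementing it
        self.peers = []    # other brokers this one may consult

    def register(self, source, target, server):
        """Clients can register a new operation and the server behind it."""
        self.servers[(source, target)] = server

    def lookup(self, source, target, _visited=None):
        """Find a server locally, else ask peer brokers (avoiding cycles)."""
        visited = _visited if _visited is not None else set()
        visited.add(self.name)
        if (source, target) in self.servers:
            return self.servers[(source, target)]
        for peer in self.peers:
            if peer.name not in visited:
                found = peer.lookup(source, target, visited)
                if found is not None:
                    return found
        return None

    def convert(self, data, source, target):
        """Invoke whichever server implements the requested conversion."""
        server = self.lookup(source, target)
        if server is None:
            raise LookupError(f"no known conversion from {source} to {target}")
        return server(data)


local = TypeBroker("local")
remote = TypeBroker("remote")
local.peers.append(remote)
remote.peers.append(local)
remote.register("base64", "bytes", base64.b64decode)

# The client only talks to its local broker; the remote one supplies the server.
print(local.convert("aGVsbG8=", "base64", "bytes"))
```

The point of the design is visible even at this scale: the client never needs to know which broker holds the knowledge about a format, only how to ask.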
What’s good about this design?
• It’s simple (and therefore flexible):
– Minimal, basic, well understood standards
• It’s accommodating:
– Describes past, present and future data formats with
good breadth and depth of expressiveness
– It can be composed with a wide variety of programs and
databases (including the Web, off-the-shelf programs)
– Benefits start with very low investment, then increase
• It’s scalable (largely by taking advantage of
distributed, interactive nature of Net):
– Anyone can define new formats and services
– Brokers coordinate contributions from Net community
A cooperative library network
(simplified)
[Diagram: Clients, Library Brokers, and Servers]
• Clients get info on materials, request services
(search, convert...)
• Brokers maintain info on services, invoke
servers for operations
• Servers provide materials, services
• Clients can also register new materials, metadata,
services...
• Brokers can trade info, consult other brokers
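Applied to libraries, the same broker pattern gives a simple form of federated search. The sketch below is an assumption-laden illustration: `LibraryBroker`, its catalog structure, the second sample title, and the `example.org` locations are all hypothetical.

```python
class LibraryBroker:
    """Hypothetical broker holding a small catalog and links to peer brokers."""

    def __init__(self, name, catalog=None):
        self.name = name
        self.catalog = dict(catalog or {})  # title -> location
        self.peers = []

    def register(self, title, location):
        """Clients can register new materials with the broker."""
        self.catalog[title] = location

    def search(self, term, _visited=None):
        """Search the local catalog, then merge in results from peers."""
        visited = _visited if _visited is not None else set()
        visited.add(self.name)
        hits = {t: loc for t, loc in self.catalog.items()
                if term.lower() in t.lower()}
        for peer in self.peers:
            if peer.name not in visited:
                for title, loc in peer.search(term, visited).items():
                    hits.setdefault(title, loc)  # keep the nearest copy
        return hits


# Placeholder catalogs; titles other than the first and all URLs are made up.
a = LibraryBroker("a", {"Catholic Encyclopedia": "https://a.example.org/ce"})
b = LibraryBroker("b", {"Catholic Encyclopedia": "https://b.example.org/ce",
                        "Encyclopedia of Examples": "https://b.example.org/ee"})
a.peers.append(b)
b.peers.append(a)

print(sorted(a.search("encyclopedia")))
```

Each library contributes only its own catalog, yet a user of either one sees the combined holdings, which is the "boost from others" that shared metadata makes possible.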
The importance of open content
• Sharing content and/or metadata gives each
library a boost from others
• Enables distributed indexing and cross-referencing
– Alta Vista et al. made possible by open content
• Enables replication, minimizing risk of
information loss
• More flexibility in adapting and migrating
information to new situations
• Users can improve, augment resources and
feed them back to libraries
Summary
• The best large-scale digital libraries are built
around community
– technically: distributed, cooperative infrastructure (e.g.
broker architecture of TOM) developed by experts in
multiple domains
– socially: cooperation between disciplines, organizations;
designs that meet the needs of various constituencies
• University libraries like Penn’s can take a
leading role in creating DL community
– Have much of the collections, experts
– Can provide testbeds, help their users, gain visibility
• Potential for revolutionary benefits, with the
right designs
To find out more:
• TOM home page:
– Conversion service, other demos, technical details,
thesis document
http://tom.cs.cmu.edu/
• The On-Line Books Page:
– Catalog, Book People archives, copyright entries,
selected resources on on-line texts and libraries
http://www.cs.cmu.edu/books.html
• Personal home page:
http://www.cs.cmu.edu/~spok/