Why Databases? Gio Wiederhold September 2001 CS545 intro

CS545 intro
Why Databases?
September 2001
Gio Wiederhold
Stanford University
www-db.stanford.edu/people/gio.html
Sep-01
CS545 Intro
1
Abstract
The distinction between storing data in files and in databases is that databases are intended to be
used by multiple programs and types of users.
Databases have been available in various forms since 1958.
The major paper defining database functionality in a formal sense is due to Ted Codd, of
IBM, published in 1970.
Information is created by applying knowledge (encoded as programs or rules) to collected
data and messages received.
Data and computation resources are provided by a variety of suppliers, public and
private. The number of potential suppliers and their autonomy also create information
overload.
To cope with these issues novel intermediate services are needed, opening up new
opportunities. Many traditional relationships among consumers and vendors will
change.
The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics
of diverse sources are captured by their ontologies, the collection of terms and their
relationships as used in the domain of discourse for the source. When sources are to
be related we rely on their ontologies to make the linkages. Creating a sound
algebra encompassing the required operations allows manipulation and composition of
the interoperation process.
Outline
• Motivation and Functions needed
• Early Inventions
• Architecture
• Formal basis
• Breadth of applicability
• Unsolved problems
• Research Directions
Files versus Databases
Files: provide input and output for a program (transient)
• Devices: paper tape (ASCII), cards, magnetic tapes
• Examples:
1. FORTRAN: tapes 1-5 input, 5 standard in (80-column cards);
tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols);
still visible in files, IBM VM OS
2. UNIX: standard in > standard out
3. Data-processing: in > program > out = in > program > out = in > program > out ...
Databases: storage (persistent, reliable, random access)
• Enabled by disk technology, starting in 1960 (5MB)
• Many users, i.e., many (small) programs
• Example:
1. BOMP – Bill-of-materials (inventory), airline seats, processing
Files
• Files: a means for programs to store data for later use
– The initial program determines
1. what data are being stored (all? – memory dump [LISP])
2. how it is being stored – structure and format
3. when it is being stored and available
– Successor programs must follow these decisions
• often the successor program is another invocation of the
initial program
• Problems
– One program requires a different structure than another: BOMP
– Data must be available rapidly, incrementally:
• class assignments
• seat reservations
• library checkout
– Programs must be available continuously, depend on data
Databases
• Data are intended to be used by many programs
– Often small – transactions
– Various subsets of all the relevant data
– Structural transformations: Bill-of-Materials programs:
• Input program: records parts being delivered (Supplier :> parts)
• Output program: records parts being consumed (Products :> parts)
• Inventory: Suppliers, Products :> parts
BoMPs are common
Data sources | Stuff | Data consumption
• Supplier | Parts | Product-Assemblies
• Clinical-labs | Observations | Patient-Records
• Employees | Salary & Tasks | Productivity
• Accidents | Reports | Failure-Analysis
• Flights | Seats | Passengers
• Classes | Grades | Student-Performance
• ...
Two directions / hierarchies needed for data access
Solutions?
Design Problem & Solutions
Conceptual model
• Supplier program:
– Use a hierarchy: supplier :> parts supplied (1:n)
• Consumer program:
– Use a hierarchy: consumer :> parts used (1:m)
Actual solution in memory: matrix
– if it exceeds memory, then either supplier or consumer part accesses become costly
Actual solution beyond memory:
1. redundant transformed data
2. pointer and index structures
[Figure: matrix relating suppliers s1, s2, s3, ..., sn (columns) and consumers c1, c2, c3, ..., cm (rows) through parts P]
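The two hierarchies and the "redundant transformed data" option can be sketched in a few lines of Python. This is only an illustration under assumed names: the supplier, consumer, and part labels and the helpers `build_hierarchy`, `invert`, and `suppliers_for_product` are hypothetical, not taken from any BOMP system.

```python
from collections import defaultdict

# Hypothetical base facts: which supplier delivers which part,
# and which consumer (product) uses which part.
supplies = [("s1", "bolt"), ("s1", "nut"), ("s2", "bolt"), ("s3", "gear")]
uses = [("c1", "bolt"), ("c1", "gear"), ("c2", "nut"), ("c2", "bolt")]


def build_hierarchy(pairs):
    """Group (owner, part) pairs into a 1:n hierarchy owner -> [parts]."""
    tree = defaultdict(list)
    for owner, part in pairs:
        tree[owner].append(part)
    return dict(tree)


def invert(pairs):
    """Build the opposite access path: part -> [owners]."""
    index = defaultdict(list)
    for owner, part in pairs:
        index[part].append(owner)
    return dict(index)


by_supplier = build_hierarchy(supplies)  # hierarchy for the supplier program
by_consumer = build_hierarchy(uses)      # hierarchy for the consumer program
suppliers_of = invert(supplies)          # redundant, transformed copy of the same facts


def suppliers_for_product(product):
    """Cross both hierarchies: which suppliers feed a given product?"""
    found = set()
    for part in by_consumer.get(product, []):
        found.update(suppliers_of.get(part, []))
    return sorted(found)
```

Each access direction gets its own structure, so every delivery or consumption record must be written to more than one place. That is the update cost the slide trades against fast access.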
Factors influencing design
• Size --- memories are getting bigger, problems too
• Density of matrix:
– suppliers supply only some parts, overlapping
– products consume only some parts, overlapping
• Performance requirements:
– supplier response can be less critical
– airline seats made available versus seats being sold
– laboratory data obtained versus patient records needed
• Usage patterns:
– batches versus single item accesses
– linked according to yet other criteria:
DBMSs
Database Management Systems
• Collection of the software needed to manage databases
• Components:
– Storage management – intertwined with the operating system
– Query and update processor – uses the schema
– Schema interpreter and compiler
– Transaction management and concurrency control/protection –
also jointly with OS
– Logger for backup
– Recovery programs
• Large, complex, not all features always needed
• Many fewer vendors now than 10 years ago
Inventions – 1 - Data Description
• Schemas [McGee, 1958] program independence
= A symbolic description of each column, to be interpreted by
update and retrieval programs as well as users
– Allows programs to use subsets
– Allows columns to be added without affecting current programs
• Compilation of Schemas [1975]
= avoids interpretation cost
– requires keeping track of last update for auto-recompile
• Views [Chamberlin et al., 1976] Bounded schemas
= Database administrator defines schema subset for user roles
– Can be compiled for fast execution
– Must be recompiled when base schema or view is changed.
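Program independence and views can be tried out with Python's built-in `sqlite3` module. SQLite stands in here for any schema-driven DBMS; the table `parts`, the view `part_names`, and the function `stock_report` are hypothetical names chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")
conn.executemany("INSERT INTO parts VALUES (?, ?, ?)",
                 [(1, "bolt", 40), (2, "nut", 75)])


def stock_report(db):
    """A program that names only the subset of columns it needs."""
    return db.execute("SELECT name, qty FROM parts ORDER BY part_no").fetchall()


before = stock_report(conn)

# A view: a bounded schema defined for one user role.
conn.execute("CREATE VIEW part_names AS SELECT part_no, name FROM parts")

# The base schema grows by one column; the program and the view
# refer to columns by name, so neither has to change.
conn.execute("ALTER TABLE parts ADD COLUMN supplier TEXT")

after = stock_report(conn)
view_rows = conn.execute("SELECT * FROM part_names ORDER BY part_no").fetchall()
```

The existing program returns the same rows before and after the column is added, because it is bound to the symbolic column names rather than to record positions.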
Inventions – 2 – access trees
• Indexes [Landauer 1963] balanced trees
= Efficient ancillary access path
– Requires updating to stay current
• Multiple Indexes [Davis & Lin 1965] multi-attribute-based access
= Multiple ancillary access paths
– Allows access by multiple paths
– Requires much updating to stay current
• B-trees [Bayer, 1972] Index Updateability
= Index blocks are kept only 50%-100% full for mostly fast
update
– Improves performance greatly for indexes
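The idea of an ancillary access path that must be updated to stay current can be sketched with Python's `bisect` module. A sorted list stands in for the index here; it is not a balanced tree or a B-tree, and the names `records`, `name_index`, `insert`, and `lookup` are assumptions made for this example.

```python
import bisect

records = []     # the data file: rows kept in arrival order
name_index = []  # ancillary access path: sorted (key, record position) pairs


def insert(record):
    """Store a record and update the index so that it stays current."""
    position = len(records)
    records.append(record)
    bisect.insort(name_index, (record["name"], position))


def lookup(name):
    """Find all records with the given name by binary search on the index."""
    # Positions are never negative, so (name, -1) sorts before every entry for name.
    start = bisect.bisect_left(name_index, (name, -1))
    matches = []
    for key, position in name_index[start:]:
        if key != name:
            break
        matches.append(records[position])
    return matches


for row in [{"name": "nut", "qty": 75}, {"name": "bolt", "qty": 40},
            {"name": "gear", "qty": 5}, {"name": "bolt", "qty": 12}]:
    insert(row)
```

Lookup needs only a logarithmic number of comparisons, but each insertion into the sorted list shifts later entries. Keeping index blocks partly empty, as B-trees do, is one way to keep most such updates cheap.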
Inventions – 3 - structures
• Hierarchical Structures [IMS, 1963] Dense data structures
= Trees mapped to sequential structures for fast access to sparse data
– Fast access when many related values are needed
– Costly to update, often done periodically
– Must be combined with trees for multiple-access paths
• Triple storage [Feldman, 1969] Arbitrary structures
= All data represented by object-attribute-value entries
– High cost when many related values are needed
Note that these two conflict – in today's database
implementations performance has won out over flexibility
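A minimal sketch of object-attribute-value storage, assuming a plain in-memory list and hypothetical helper names (`add`, `value_of`, `record`), shows both the flexibility and the cost mentioned above.

```python
triples = []  # every fact is one (object, attribute, value) entry


def add(obj, attribute, value):
    """Record one fact; no schema has to be declared first."""
    triples.append((obj, attribute, value))


def value_of(obj, attribute):
    """Fetch one value; returns None if the fact is not stored."""
    for o, a, v in triples:
        if o == obj and a == attribute:
            return v
    return None


def record(obj):
    """Reassemble all attributes of one object from its separate entries."""
    return {a: v for o, a, v in triples if o == obj}


add("part-17", "name", "bolt")
add("part-17", "qty", 40)
add("part-17", "supplier", "s1")
add("part-17", "finish", "zinc")  # a new attribute needs no schema change
add("part-18", "name", "nut")
```

Adding the `finish` attribute took no structural change, which is the flexibility. Reassembling one complete record touches one entry per attribute, whereas a hierarchical or tabular layout would keep those related values together.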
Inventions – 4 – model foodfight
• Relational Model [Codd 1970]
= tabular model, with an algebraic set of operations, normalization
– Formalization enabled understanding, dissemination
– No inter-relation semantics, specified when query is made
– Later constraints were added, implicitly defining keys, connections
• Hierarchical - (also applied to one view of BOMPs)
= describe hierarchical connections among data records, no algebra
– An attempt to describe earlier, simple implementations in model terms
• Network – generalization of BOMP
= describe structure, procedural navigation in near-arbitrarily linked data
– Strong inter-record connections, needed for locating data
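The algebraic flavor of the relational model can be illustrated with three operations over lists of dictionaries. This is a teaching sketch, not how a DBMS implements them; the relations `supplies` and `uses` and the function names are assumptions made for this example.

```python
def select(relation, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Keep only the named attributes and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        reduced = tuple((a, row[a]) for a in attributes)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(reduced))
    return result


def natural_join(left, right):
    """Combine rows that agree on all attribute names they share."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in shared):
                result.append({**l_row, **r_row})
    return result


supplies = [{"supplier": "s1", "part": "bolt"},
            {"supplier": "s2", "part": "bolt"},
            {"supplier": "s3", "part": "gear"}]
uses = [{"product": "c1", "part": "bolt"},
        {"product": "c1", "part": "gear"},
        {"product": "c2", "part": "bolt"}]

# The connection between the two relations is stated in the query itself.
suppliers_of_c1 = project(
    select(natural_join(supplies, uses), lambda row: row["product"] == "c1"),
    ["supplier"])
```

Nothing in the stored relations says that `supplies` and `uses` belong together. The join on `part` is specified only when the query is made, in contrast to the stored links of the hierarchical and network models.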
Why did the relational model win?
• Relational Model DBMSes: Sequel → QUEL, SQL
– Formality – allowed essential optimization algorithms
– Restrictions – as normalization, provide guidance
– Teachability – exposed principles:
• can't teach only from examples
– DBMS independence – safety blanket for mission-critical users
• But implementations added features
• Use least common set of features?
– Hard to enforce once a system has been bought
• Few suppliers remain {ORACLE, IBM, MS, mySQL}
• ER model [Chen, 1976]
= Focuses on design, can be mapped to multiple implementations
– Few tools for direct translation
– Poor maintenance of model, ignored when DBs are expanded
Databases and the Web
• HTML presentation: HyperText Markup Language
= Data are transformed for human consumption, external refs
– Often hierarchical – object-oriented view
– If there was a schema, it is now hidden
• XML presentation
= Schema data is embedded
– Much flexibility
– Much more space when entries are small
– Requires an interpreter for viewing, such as XSLT
• RDF: Resource Description Framework
= Triple representation: object-attribute-value
– Great flexibility
– Uncertain implementation
Information overload, Data starvation
• More databases
– public & corporate
• Faster communication
– digital
– packeting: TCP-IP, ATM
• World-wide connectivity
– Internet & Intranets
– world-wide web
• Disintermediation
– ubiquitous publishing
Change in Supply vs Demand
What information consumes is rather
obvious, it consumes the attention of its
recipients.
Hence a wealth of information creates a
poverty of attention, and a need to
allocate that attention efficiently among
the overabundance of information
sources that might consume it.
[Herbert Simon]
Making data relevant
• Data reduction
• Data abstraction
– Level changing
– Summarization
– Exception search
– Level change to integrate with other data sources
• Follow Customer Model:
hierarchical, divide-and-conquer,
a common paradigm
Data and Knowledge
[Figure: a Data Loop and a Knowledge Loop; labels: Storage, Education, Selection, Abstraction, Integration, Recording, Summarization, Experience, Decision-making, State changes, Action]
Information is created at the confluence of
• data -- the state
• knowledge -- the ability to select and project the state into the future
Transforming Data to Information
• Application Layer: users at workstations
• Mediation Layer: value-added services
• Foundation Layer: data and simulation resources
Functions inside Mediation
[Figure: functions inside a mediator: articulation, summarize, transform, selection; applied to heterogeneous resources]
Function of Mediation
Apply domain-specific specialist knowledge to add value
• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation
→ INFORMATION
Environmental Restoration at INEL: undoing 50 years of messes
[Figure: many projects and many sources connected through wrappers, a mediator, and other mediators over CORBA; links labeled OEM and QEM; languages shown: MSL [Stanford], OQL [ODMG], MQL [ISX]; other labels: ERIS, IEDMS]
ISX - Stanford Univ., Lockheed Martin, Idaho National Engineering Laboratory, June 1998
From Schemas to Ontologies
Ontologies allow communication among partners
in enterprises (rarely in machine-readable form)
Relationships determine meaning - parent, school, company
Databases use ontologies during design
in their E-R diagrams (implicitly) and to
represent the leaf nodes in their schemas.
Variable and Class names in Software
Knowledge-bases use term ontologies (often
explicitly), add class definitions (to hold instances),
constraints, and operations among the terms.
Ontology: components
We represent the contents and structure of
a language by its ontology:
• a set of well-defined terms,
which delimit the domain of discourse
• relationships among those terms,
chosen from a limited set
a formalizable subset of expert knowledge
Heterogeneity among Domains
If interoperation involves distinct
domains, mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontologies
Unsolved problem in Interoperation
Common assumption in assembling and integrating
distributed information resources
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a
globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved
2. inefficient -- languages are efficient in local contexts
Large Ontologies: good or bad?
• Have all the knowledge together
+ simple for customers of KBs
– hard for owners of KBs, must synchronize with many others
– in the limit -- everybody must be globally consistent
• Large KB will cover multiple / all domains
– created by a committee -- slow
– maintained by a committee -- costly to impossible
• Differences in level of abstraction -- efficiency
– homeowner: nail
– carpenter: sinker, brad, boxnail, . . .
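The nail example can be made concrete with a small mapping between two vocabularies. This is a hedged sketch: the table `CARPENTER_TO_HOMEOWNER` and the functions `to_homeowner` and `summarize` are hypothetical, and a real articulation between ontologies would need far richer rules than a lookup table.

```python
# Hypothetical articulation rules: terms from the carpenter's ontology that
# the homeowner's ontology covers with a single, coarser term.
CARPENTER_TO_HOMEOWNER = {"sinker": "nail", "brad": "nail", "boxnail": "nail"}


def to_homeowner(term):
    """Translate a carpenter's term; terms without a rule pass through unchanged."""
    return CARPENTER_TO_HOMEOWNER.get(term, term)


def summarize(stock):
    """Roll a carpenter's inventory up to the homeowner's level of abstraction."""
    totals = {}
    for term, count in stock.items():
        coarse = to_homeowner(term)
        totals[coarse] = totals.get(coarse, 0) + count
    return totals
```

For example, `summarize({"sinker": 200, "brad": 50, "hammer": 1})` gives `{"nail": 250, "hammer": 1}`. The mapping is many-to-one, so it cannot be inverted: the homeowner's single term is efficient in that context but loses distinctions the carpenter needs.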
Evolution of mediation
[Figure: layered architecture with applications (A1-A6), integrators (I1, I2), mediators (M1, M2), a network, wrappers (W1-W3), and data sources (D1-D6); stages labeled a. through e.]
Definition*
A mediator is a software module that exploits
encoded knowledge about certain sets or
subsets of data to create information for a
higher layer of applications.
It should be small and simple, so that it can be
maintained by one expert or, at most, a small
and coherent group of experts.
* Wiederhold: IEEE Computer March 1992
Interfaces
Human ↔ Computer: {x-widgets, HTML}
Application ↔ Mediator: {OQL, KQML, ...}
Mediator ↔ Data sources: {SQL, TQL, XML, … }
Data ↔ real world: {sensors, clerks, … }
An Integration Architecture
[Figure: a client application is served by a mediator that builds portfolios for each company; the mediator obtains stock market prices through a wrapper over the ticker tape and business reports through a wrapper over Dialog]
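The portfolio architecture can be sketched as a mediator sitting on two wrappers. Everything here is an assumption made for illustration: the class names, their methods, and the company-to-symbol table are hypothetical, and the wrappers hold canned data instead of reaching a ticker feed or a report service.

```python
class TickerWrapper:
    """Hypothetical wrapper over a stock-price source, keyed by ticker symbol."""

    def __init__(self, quotes):
        self._quotes = quotes

    def price(self, symbol):
        return self._quotes[symbol]


class ReportWrapper:
    """Hypothetical wrapper over a business-report source, keyed by company name."""

    def __init__(self, reports):
        self._reports = reports

    def headlines(self, company):
        return self._reports.get(company, [])


class PortfolioMediator:
    """Combines both sources into one portfolio entry per company."""

    def __init__(self, ticker, reports, symbol_of):
        self._ticker = ticker
        self._reports = reports
        self._symbol_of = symbol_of  # encoded knowledge: company name -> symbol

    def portfolio(self, holdings):
        """holdings maps a company name to the number of shares held."""
        entries = []
        for company, shares in sorted(holdings.items()):
            price = self._ticker.price(self._symbol_of[company])
            entries.append({"company": company,
                            "value": shares * price,
                            "headlines": self._reports.headlines(company)})
        return entries


ticker = TickerWrapper({"ACME": 12.5, "GLBX": 40.0})
reports = ReportWrapper({"Acme Corp": ["Q2 earnings up"]})
mediator = PortfolioMediator(ticker, reports,
                             {"Acme Corp": "ACME", "Globex Inc": "GLBX"})
result = mediator.portfolio({"Globex Inc": 10, "Acme Corp": 4})
```

The application only calls `portfolio`. The mismatch between sources (one keyed by symbol, the other by company name) is resolved inside the mediator, which is where the domain knowledge is kept.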
Status of Mediation Technology
Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update
Future
• Generated from models
• Domain expert maintains models
• Specification determines functions
• Resource changes trigger regeneration
A mediator is not static software:
Knowledge ages
[Figure: a mediator, consisting of software and people (models, programs, rules, caches, . . .), sits between an application interface and resource interfaces; it is exposed to changes of user needs, domain changes, and resource changes. Roles shown: owner / creator, maintainer, lessor / seller, advertiser]
Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously
maintainable
Empowerment
* based on experience with software
Roles
Computer Scientists
• Provide tools
– adaptation
– integration
– matching
– composing
• Assess Standards
• Assure scalability
Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback
• Give provider feedback
Mediation Research Topics
• Mediator management and maintenance
• Representation of knowledge and customer models
• Balancing dynamic and warehouse solutions
• Formalization of semantic heterogeneities
– many levels and types
– roles for wrappers vs. mediators vs. applications
– scalability by partitioning -- make it simple!
– Domain Ontologies --- tools, validation, . . .
• Effect of object paradigm and method-based access
• Service and business models
• New types of information systems
Long Range Science Vision
[Figure: three fields contribute Integration Methods toward an Integration Science: Databases (access, storage, algebras), Systems Engineering (analysis, documentation, costing), and Artificial Intelligence (knowledge mgmt, domain expertise, uncertainty); GIS also appears as a label]
Fat versus thin mediators
Service scope:
• Too thin: insufficient added value
• Too fat: hard to compose
Domain scope:
• Too narrow: few customers
• Too broad: hard to maintain, needs a committee
Just right
Maintenance is good for you?
[Figure: lifetime in years (0-13) and relative annual maintenance cost (10%-100%) for automobile, hardware, and software; depreciation = 1 / lifetime]
Client-Server Architecture
[Figure: client systems connected directly to data and simulation resources, with one change marked X]
• Fast build of clients by resource reuse
• Changes (X) are difficult, can affect many clients
Systems with Mediators
[Figure, Gio Wiederhold, 1995: three layers: Applications, Mediators, Data Resources]
Growth through Reuse
[Figure, Gio Wiederhold, 1995: a new application is added on top of prior and revised mediators, which draw on extended data resources]
Linear O(n) Cost of Growth (now O(n^2))
• Data changes only affect some
mediators; only in their domain
• Mediators can
1. supply old information to n-1
prior applications
2. provide better information to the
new application
3. be partially or completely reused
• New applications, using the new
data, can be developed and
inserted dynamically
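The difference between linear and quadratic growth can be made tangible with a toy cost model. The unit costs are assumptions made for this sketch, not figures from the lecture: without mediators, adding the k-th application is taken to cost one unit for the new application plus one unit of rework for each of the k-1 prior applications; with mediators, it costs one unit for the application plus a fixed `mediator_revision` cost, and prior applications are left alone.

```python
def cumulative_cost_direct(n):
    """Total cost of growing to n applications when each addition
    also forces rework in every prior application (assumed model)."""
    return sum(1 + (k - 1) for k in range(1, n + 1))  # equals n * (n + 1) / 2


def cumulative_cost_mediated(n, mediator_revision=1):
    """Total cost when each addition only touches the new application
    and a bounded amount of mediator revision (assumed model)."""
    return n * (1 + mediator_revision)


growth_table = [(n, cumulative_cost_direct(n), cumulative_cost_mediated(n))
                for n in (1, 10, 100)]
```

Under these assumptions the direct architecture reaches 5050 units at 100 applications while the mediated one reaches 200. The exact numbers mean little; the point is that one total grows as O(n^2) and the other as O(n).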
Assigning maintenance responsibility
Sources
a. Source data quality – supplier database, files, or web pages
b. Interface to the source – wrapper, supplier or vendor for supplier
Services
c. Source selection – expert specialist in mediator
d. Source quality assessment – customer input to mediator
e. Semantic interoperation – specialist group providing input to the mediator
f. Consistency and metadata information – mediator service operation or warehouse
Customers
g. Informal, pragmatic integration – client services with customer input
h. User presentation formats – client services with customer input