IDBD: An Interactive Design Tool for CODASYL-DBTG

advertisement
IDBD
AN INTERACTIVE DESIGN TOOL
FOR CODASYL-DBTG-TYPEDATA BASES
Janis
Roland Dahl'
A. Bubenko jr*
The Systems Development
Laboratory
1.
The Interactive
Data Base Designer
(IDBD)
assumes as input a conceptual description
of
data to be stored in the data base (in
terms
of a binary data model) and an expected workload in terms of navigations
in the conceptual
model.
Extensive checking of input is
performed.
The designer has the possibility
to restrict
the solution
space of the design
algorithm
by prescribing
implementation
strategies for parts of the binary model.
2. University
Stockholm,
of Stockholm,
Sweden
Introdmction
Before we can design a data base schema, compatible
with some existing
Data Base Management System, we have to determine
what kind
of data it
should contain and what kind of
updates,
work-load,
in terms of queries,
and deletes it must be able to haninserts
dle.
examination
of
In order
to permit
alternative
solutions
the requirements
must
be stated in as implementation
independent
possible.
By 'implementation
terms
as
independent'
we mean that
there
have been
made no decisions
on how to group data items
in records, which access techniques
to use
and how to navigate in a structure
of records
and sets.
ABSTRACT
1. Chalmers University
of Technology,
41296 Gothenburg, Sweden
(SYSLAB)3
Designing a data base is thus only
tively
small)
part of a systems
process.
It
is preceded
by a
activities
the purpose of which is
corporate information
needs and
the requirements
of an information
be developed.
S-
a (reladevelopment
number of
to analyze
to specify
system to
The Interactive
Data Base Designer
(IDBD)
presented in this paper has been developed to
be compatible with two kinds of system design
(philosophies)
approaches.
S-10691
8. This work has been supported by the
National
Swedish Board for Technical
Development
the analytical
approach,
The first
kind,
proceeds through development phases like
- goal and problem analysis
- activity
analysis
etc.
set
of
and arrives
at a comprehensive
This set also
specifications.
requirements
includes a conceptual
information
model of
relevant
parts of the enterprise
and a set of
information
requirements
[Bub-801.
The conceptual information
rode1 (CIM) describes and
defines relevant
phenomena (entities,
relaassumptions,
inference
rules
events,
tions,
of Discourse
(UoD).
etc.) of the Universe
The CIM models the UoD in an extended time
Proceedings of the Eighth International
on Very Large Data Bases
Conference
108
Mexico City, September,
1982
perspective
in order
and constraints.
to capture
dynamic rules
+ the tool
existing
The next step in this approach
is to 'restrict'
the CIM (from a time perspective
point
of view) and to decide
what information
to
in the data base and how to conceptustore
in
ally navigate in this set of information
to satisfy
stated information
requireorder
ments (see Gus-82 for a comprehensive exposition of this problem).
If the information
to be stored in the data
base is defined
in terms of a binary data
model and the conceptual
navigations
are
specified
assuming such a model then this is
the required input to IDBD.
a
of
is
in
be
data
of the follow-
- description
of the conceptual data model
- its
data item types
(representing
entity
types) and relations
the
the
- a description
model defined
of the workload
of
in terms of run-units
the
- a description
be considered
of certain
parameters
by the design algorithm
to
- a set of design directives
restricting
algorithm
space.
- comprehensive
checking
of
the
consistency
of input data is performed (we
have empirically
found that it is difficult
to supply a tool of this kind with
correct input the first
time; a waste of
and computer
resources
is the
time
result of optimizing
incorrect
input)
to the design
solution
its
A consistency
check is performed on the model
Also an
and the work-load
descriptions.
analysis
is performed to estimate the number
of references
to the data items and the relations when navigating
in the model.
The input,
interaction
will
be illustrated
case, the enterprise
assumed.
- the designer
has a
possibility
to
prescribe
certain
implementation
alternatives for parts of the model (or
the
whole model).
This has the following
advantages
and processing of IDBD
by a small practical
GROSS. The following
iS
GROSSis a local supplier,
i.e.
it
supplies
parts
to customers located in the same city.
Parts are distributed
by cars and one cargo
is called a delivery.
One delivery
can conEvery day
1 to 20.
tain several orders,
there
are several deliveries,
1 to 20. Customers send their orders to GROSS. An order
includes
1 to 25 parttypes.
When an order
arrives,
its day for delivery
is determined. '
the designer can test
his/hers
own
solution
alternatives
which may be
'natural'
or which have other,
preferred,
non-quantifyable
properties
Proceedings of the Eighth International
on Very Large Data Bases
Input
The input to the IDBD consists
ing types of data:
The DBTG-schema design algorithm
of IDBD has
the binary data model and a set of implementation strategies
in common with
design-aids
developed
at the University
of Michigan
[Mit-75,
Ber-77B, Pur-791.
It differs,
however, from them in several important
respects
solutions
an
This paper describes
and explains
IDBD in
Secterms of running a small sample case.
tion 2 describes the input to IDBD - the conceptual binary data model and how to describe
in the
the work-load in terms of navigating
model.
User interaction,
checking of input
and how to supply IDBD with design directives
is presented in section 3. The design algorithm and the results
of performing
a design
run are given in section 4.
2.
unnormal or *non-sensebe avoided
augment
- the description
of the work-load
practically
realistic
as navigation
the conceptual
binary
model can
defined.
It is, of course, also
advantageous
to use
the experimental
approach as a complement to
the purely analytical
one in order
to avoid
guesswork concerning the requirements
and the
work-load.
which gives
to monitor
to
+ the solution
space can be drastically
reduced
thereby
making
IDBD
realistic
tool
also
for
design
large, complex data base schemata
The other approach to data base design is the
experimental
one.
In this
case we assume
that a 'fast prototype'
is developed
by the
use of the CS4 system [Ber-77A].
CS4 employs
a binary data model and is thus compatible
with IDBD. Experimental
use of the prototype
system can provide
us with
statistics
of
navigation
types and frequencies.
- the tool is interactive
designer
a possibility
design process
can be used
DB-schema
can
Conference
109
Mexico
City,
September,
1982
The following
information
requirements
terms of queries
- are defined
in the
ing design stages.
1.
For a particular
day, all
the
customers
which
are to he supplied,
and for each
customer:
name and address.
2.
For a particular
parttype,
which
it
is
included
delivery
for each order.
.7 .
For a particular
parttype
and for
t icular
customer,
the
total
amount for the part in order.
4.
assumed
are
model
entity
The
types
represented
in
this
case
by a set of data
In fact,
as “partinfo”
in
the
item types.
item type can also
data
structure,
a data
represent
a group of data item types.
- in
preced-
For all orders,
their
amounts.
Assume also that the
have been defined:
print
Data item types are
ITEM (in
capital
input
and thereafter,
per each data item:
the orders
in
and the day of
all
parttypes
following
1.
Deletion
2.
Insertion
of a delivery.
3.
Insertion
of a new customer.
4.
Updating
a pardollar
with
of the conceptual
We assume that
the
tional
data structure
basis of an analysis
conceptual
information
tion requirements.
data
item
size of
acters
the data
item
in number
the data
item
cardinality
of
a security
code
In our case
as follows:
attributes.
constitutes
the basis for the conceptual
(binary)
data model and its work-load.
Description
the
of char-
(optional)
If different
data items cannot be placed
in
for
instance
because of
the
same record,
data,
constraints
or distributed
security
types can be specified
then
the
data
item
If the
with different
security
code numbers.
security
code is not specified,
it is put to
zero.
This
2.1
name of
transactions
of a delivery.
of part
described
with
the
word
letters)
on one ‘line of
on separate
lines,
one
ITEM
delivno
delivday
orderno
ord-part
partno
partinfo
amount
custno
custname
custadr
data model
following
binary
relahas been created
on the
of the
enterprise,
the
model and the informa-
5
6
6
0
8
92
8
5
30
30
the data
item
types
are
specified
150
30
300
1500
2000
2000
1500
1000
1000
1000
Rinary
relations
in
the
data
model
are
described
with the word RELATION on one input
line
followed
by one line
per
relation
containing:
dday-dno
cua t-ord
- name of
the
- name of
relation
the data
origins
- name of the data
tion is directed
ord-o-p
- the number
- minimum
item2
part-o-p
Proceedings of the Eighth International
on Very Large Data Bases
relation
Conference
110
item
(iteml)
from
item to which
(item2)
of instances
of
the
which
the
the
rela-
relation
number
of item1
related
to
one
- maximum number
item2
of item1
related
to
one
- minimum
item1
of item2
related
to
one
number
Mexico City, September, 1982
- maximum number of item2 related
item1
The relations
follows:
RELATION
dday-dno
dno-ord
cust-ord
c-cname
c-cadr
ord-o-p
o-p-amnt
part-o-p
p-pinfo
in our case
are
to
specified
one
dclivday
i.--,--
as
idday-dno
..!I.---,
delivday
delivno
custno
custno
custno
orderno
ord-part
partno
partno
delivno
orderno
orderno
custname
custadr
ord-part
amount
ord-part
partinfo
150111
300 1 1
300 1 1
1000 1 1
1000 1 1
1500 1 1
1500 1 1
1500 1 1
2000 1 1
1
0
1
1
1
1
0
1
delivno
'
*._-_ _.-_-.'
20
20
50
1
1
25
1
200
1
jdno-ord
,..A-,
orderno
+-
cust-ord
For the specified
input data for
a relation
and the cardinality
of the participated
data
items, IDBD determines the average number of
iteml(item2)
related
to item2(iteml),
the
type of the relation
(l:l,
l:M, M:l, M:N) and
what type of mapping, total (T) and partial
(P), that the domain and range of the relation
participates
in.
This will be further
discussed in section 3.1.
2.2
Description
I--
(custno
5
of the Pork-load
This access path could
language as follows:
The work-load is generated by the information
requirements
(queries)
and the transactions.
They imply a need to navigate in the binary
relational
structure.
In order
defined,
to show how navigations
two examples are given.
can
;
be
described
in
natural
F’O~ a particular
delivday
get all
delivno
related
to it via the relation
dday-dno
for all
these delivno
get all orderno
which are related
to all
these delivno
via
the relation
dno-ord
for all
these orderno
get the custno
related
to these
orderno
via the
relation
cust-ord
for the custno,
get
custname
related
to it
via the relation
c-cname,
custadr
related
to it
via the relation
c-cadr
be
Example 1 For query 1 in our example the following access path is defined:
The navigation
starts always at
where the entry-point,
i.e.
item, and the operation
for it
In example I the starting
item
the level
1,
the starting
Is described.
is dellvday.
After delivday
has been accessed;alI
delivno
This
related to delivday are to be accessed.
We call
is done at the next higher
level.
the
item delivday
qualifier
of delivno,.
because delivno
related
to
delivday
is
requested.
In the access path, each time when
the latest
accessed Item Is used as a qualifier for next item wanted, a next higher level
is specified.
When the same item is used as qualifier
for
several required
items, that can be specified
at the same level.
Custname and custadr have
the same qualifier
custno and therefore
they
are specified
at the same level.
Proceedings of the Eighth International
on Very Large Data Bases
Conference
111
Mexico City, September,
1982
Example 2 In query 3 we want to get the
dollar
amount for the parts
in order for a particular
parttype
and for a particular
customer.
RUN-UNIT
RU custinf
30
1
FIND UNIQ delivday
RU amount 30
1
FIND UNIQ partno
2
2
GET ALL delivno
GET ALL ord-part
3
3
GET ALL orderno
GET orderno
4
4
GET custno
5
GET custname
GET custadr
GET custno
2
3
SELECT ord-part
3
GET amount
IDBD uses
description
ord-o-p
Run-units
are
UNIT followed
Every run-unit
taining
:
4
- the word
part-o-p
- name of
CL3
partno
for
word
RDNthemselves.
line
con-
RU
the
run-unit
of
the
run-unit
After
the head line
the run-unit
is described
in a hierarchical
way with level
numbers much
like
a COBOL data declaration.
On each
line
there is either
a level
description
or a data
operation
description.
The level
description
lines
are numbered
The first
line after
the
1 and upwards.
line
is always a level
description
line
level
number 1.
We access a unique
partno,
thereafter
all
ord-part
related
to it.
At this
stage we do
not know for which of the ord-part
we want to
get the dollar
amount, because we do not know
to which
customer
ord-part
is
related.
Therefore,
we continue
by accessing
orderno
for each of ord-part
and then access the customer
related
to
each
order.
Now we know
which orders
are related
to the required
customer.
Now we need
to know the ord-part
related
to those
orders.
We have
already
accessed
ord-part
and therefore
we do not
need to access
them again.
By a SELECT
operation
we can select
the ord-part
related
to the required
customer
without
additional
accesses.
Each level
- the
description
level
line
from
head
with
contains:
number
- optionally
a cardinality
Normally
the cardinality
for a level
is
one.
In
that case there is no need to specify
the
cardinality
number.
Otherwise,
a real number
both
< 1 (a probability)
and > 1 (a freThis
cardinality
quency)
can be specified.
multiplies
with
the cardinality
on the next
lower level
to give the
cardinality
on the
actual
level.
If a cardinality
is specified
cardinalon level
1, it multiplies
with the
ity of the run-unit.
Going backwards
in the
access
path
can be
done
in
two different
ways.
If the item to
be accessed
has the
same qualifier
as the
item
at
the
next
higher
level,
then that
higher
level
is specified.
If one wants
to
skip one or more higher
levels
without
making
any access to the database,
it can be done by
using the special
operation
SELECT.
On each level
there can be specified
zero,
of data operation
occurrences
one or more
Every data operation
descripdescriptions.
tion is
defined
on one line
and contains:
- data operation
verb
example shows the way query
1
be expressed
in terms of a run-
unit.
- name of
- optionally
Proceedings of the Eighth International
on Very Large Data Bases
rules
described
with
the
by the run-units
starts
with a head
- cardinality
We can start
the
navigation
either
with
access
to
partno
or
to
custno.
Here we
choose to start
with partno.
The following
and 3 can
the following
syntactical
of the run-units.
Conference
112
the
required
a relation
data
item
name
Mexico City, September,
1982
Usually there is no need to specify
a relation
name.
Only if there are more than one
relation
type between two data item types, it
is necessary
to specify the relation
name.
Otherwise the IDBD finds
the relation
type
itself.
The data item at the previous level
At level
1, where
is called the qualifier.
there
is no lower
level,
the access is
directly
to the data item type and not via
a
relation
as at levels > 1.
The different
data operation
FIND
verbs
are:
INSERT is analogous
to DELETE.
By MODIFY one or all data
the qualifier
are modified.
items
related
to
By LINK one or all data items are linked
to
This means that the relation
the qualifier.
instance(s)
is(are)
stored in the data base.
By UNLINK one or all
data items are unlinked.
At level 1 only the following
verbs can be specified:
TI 11
data
operation
WIQ
FIND
GET
A final
example shows
run-unit
describing
the
tion nr 2 for inserting
the inserted
delivno
delivday
is not already
base with a probability
SELECT
I 1[ 1
LINK
ALL
UNLINK
where { ) means one of the elements inside (
1 and [ lmeans that the element inside [ I
is optional.
FIND defines an entry point for the run-unit.
FIND UNIQ implies some sort of direct
access
to the data item, while FIND SEQ implies
a
sequential
browse through all data items of
the mentioned data item type.
data items which are
By GET one or all
related
to the qualifier
are accessed-
the definition
of a
work-load of transacFor
a new delivery.
it is assumed that its
stored
in the data
of 20%.
RU ins-supp
30
1
INSERT UNIQ delivno
2 0.2
INSERT delivday
2 0.8
LINK delivday
2
INSERT ALL orderno
.l
LINK custno
INSERT ALL ord-part
4
INSERT amount
LINK partno
3.
3.1
User interaction
Checking
and anal~ais
of input
data
By SELECT no access is made, because already
This noraccessed data items are selected.
mally also means a branch from a higher
to
some lower numbered level.
After the input phase and before the optimithere is an interactive
phase
zation
phase,
certain
consistency
checks are
where also
made.
By DELETE one or all data items related
to
DELETE causes
the qualifier
are deleted.
also deletion
(i.e.
UNLINK) of the relation
which has the deleted data item as
instance,
its destination.
First of all the IDBD asks for values
These are
tain constants.
Proceedings of the Eighth International
on Very Large Data Bases
Conference
of cer-
Mexico City, September,
1982
- maximum record
acters
length
in number
- counter
length
in number
of characters
- pointer
length
in number
of
- maximum secondary
storage
characters
that is allowed
in
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
of char-
characters
number
of
- CALC storage
factor,
a factor
greater
or
how much more
equal
to one, which tells
secondary
storage
than
nominal
is
required
for hashed storage
- CALC access factor,
a factor
greater
or
how many more
equal
to one, which tells
accesses
than one that is
required
due
to synonyms in hashed storage.
From the values
of the first
three
of
these
constants
and from the earlier
description
of
data items,
relations
and run-units,
the program decides
which
consistency
constraints
must hold.
IDBD finds
out for each
relation
which
implementation
alternatives
(IAs) are
possible
, which are -wed
and which
IAs
are
impossible
(OFF).
The lists
of IAs for
these three cases can be displayed
for
the
DBA. For the possible
and unwanted cases the
DBA can interactively
change IAs.
There
is
also
a fourth
case
that
the DBA can use,
namely to specify
that one special
IA is
to
be used (ON).
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
Fixed
Fixed
Variable
Variable
Fixed
Fixed
Variable
Variable
Chain,
Chain,
Chain,
Chain,
Chain,
Chain,
Chain,
Chain,
Pointer
Pointer
Pointer
Pointer
Dummy
duplication
duplication
reversed
duplication
duplication
reversed
aggregation
aggregation
reversed
aggregation
aggregation
reversed
next pointer
next pointer
reversed
next + owner pointer
next + owner pointer
next + prior
pointer
next + prior
pointer
next + owner + prior
next + owner + prior
array
array reversed
array + owner pointer
array + owner pointer
record
For illustration,
the implementation
tives
15 and 17 are shown below.
reversed
reversed
pointer
pointer
rev.
reversed
alterna-
:---6-------m
The reason for changing
the lists
of IAs
can
be that an IA is already
fixed
or that the IA
gives an unnormal
solution.
(An example
of
an unnormal
solution
would be the case,
where
“article-number”
is suggested
to
be aggregated
under “number-in-stock”
instead
of the
other way around.)
If
the
DBA reduces
the
solution
space
for the different
relations,
the execution
time
during
the
optimization
phase is also substantially
reduced.
This is
necessary
for a large
data base.
After
the user has finished
the change of the
(see
section
3.2) IDBD takes the best
IAs,
relation,
e.g.
it
every
remaining
IAs for
the
set
of IAs, which are either
ON,
takes
in
that
order.
The
possible
or unwanted,
alternatives
are multinumber
of
possible
plied
and the total
number of combinations
of
displayed
for the user.
DBTG-structures
are
The user has now the possibility
to
further
The reason for
reduce
the
solution
space.
this
is that the computing
time can be very
long,
if there are many DBTG-structure
alterIt is also possible
to
natives
to consider.
limit
the
processing
time
to
some value.
the
optimizing
phase
is
that
time
After
interrupted
and the next phase in the program
continues.
There are 21 different
implementation
alternatives,
numbered 1-21 (see also [Ber-77Bl).
The IAs
are
displayed
with
their
number.
Here is a short description
of each IA:
Proceedings of the Eighth International
on Very Large Data Bases
Conference
114
Mexico City, September,
1982
For the different
valid
DBTG-structures
IDBD
calculates
an estimate
of
the
number
of
accesses
to the data base for every run-unit.
The need for secondary
storage
is also calculated.
The best solutions
are presented
to the user.
“The best solutions”
are those with the least
secondary
number of accesses
for
a certain
storage
size.
The solutions
are presented
in
the order of increasing
number
of accesses.
storage
At
the
secondary
the
same time
requirements
are
decreasing
in
order
to
(The
belong
to
the
set of best solutions.
first
10 solutions
are always
added
to
the
set .)
For different
some messages
DBTG-structures
such as
- an entry
point
- there
no access
is
for
IDBD also
an item
path
- there is
no SYSTEM
access)
to an item
Explanation:
- type
M:l,
gives
user interaction
section.
- min/av/max21
the reverse
to an item
is
of
data
Data items
name
size
delivno
5
delivday
6
orderno
6
ord-part
0
examplified
partno
partinfo
amount
C”6 tno
custname
custadr
a
9.2
a
5
30
30
in
150
30
300
1500
2000
2000
1500
1000
1000
1000
dday-dno
dno-ord
cust-ord
c-cname
c-cadr
ord-o-p
o-p-amnt
part-o-p
p-pinfo
be
illusIDBD has
the
input
particular
seq.ref.no
secur.
Explanation
120
30
1630
20
l:M,
(P)
in
the
but
concerns
reference
frequencies
reference
item2
delivday
delivno
custno
custno
custno
orderno
ord-part
partno
partno
to
relations:
count
ref.no
fwd
number
%
150 1.2
delivno
540 4.2
orderno
orderno
320 2.5
custname
320 2.5
custadr
2700 21.1
ord-part
2790 21.8
amount
98 0.8
ord-part
3000 23.5
partinfo
ref.no bwd
number
X
141 1.1
75 0.6
443 3.5
98
0.8
2100 16.4
:
shows the number of
required
from
item1
to item2 in the
(forwards).
Every
update
(DELETE, INSERT and MODIFY) is
as 2 references.
- rel.no
bwd shows the number of
required
references
from
item2
to item1 in the
relation
(backwards).
These numbers give the designer
some indication
of which relations
are critical
for the
efficiency
of the data
base
system.
Those
relations
can
then
be analyzed
in greater
detail.
means the number of references
needed
which
are
accesses),
to the data item type
sequenof
- seq .ref .no means the number
which are needed to the
tial
accesses,
data item type.
These numbers give the data base
tor
or designer
some indication
data items there will
be a need
access and/or
sequential
access.
Proceedings of the Eighth International
on Very Large Data Bases
is analogous,
relation
- rel.no
fwd
references
relation
operation
calculated
300
Explanation:
- DA-ref.no
(logical
directly
of
Relations
with
rel .name item1
items:
card. DA-ref.no
(l:l,
(sequential
The analysis
of input
data
will
trated
by the following
examples.
given the following
output
from
data
checking
procedures
for this
case :
Analysis
of mapping
- min/av/maxC!
shows the minimum,
average
and maximum number of item1 related
to
i tern2
Analysis
A typical
the next
type
- T/P stands for total
(T) and partial
mappings
between
the data items
domain and range of the relation
is missing
entry
is the
M:N)
Conference
The analysis
each
of
run-units
shows
for
run-unit
the following
type of output
(examplified
for run-units
“custinf”,
“amount”
and
“ins-supp”)
.
administraof to which
for
direct
115
Mexico City, September,
1982
Look at/change
Run-units
ru-name cardinality
level-no
cardinality
operation
item1
Instructions?
item2
relation
,
I
FIND UNIQ delivday
2
GET ALL
delivno
delivday
dday-dno
3
GET ALL
orderno
delivno
dno-ord
4
GET
custno
orderno
cust-ord
rev.
5
GET
custname custno
c-cname
GET
custadr
custno
c-cadr
.
1
LINE
INSERT ALL
4
INSERT
LINK
delivno
dday-dno
delivno
dday-dno
delivno
dno-ord
custno
ord-part
orderno
orderno
amount
partno
ord-part
ord-part
The different
ON
POSSIBLE =
UNWANTEDOFF
-
dday-dno
ON=
POSSIBLE= 1
UNWNTED=
UNWANTED=
rev.
o-p-amnt
part-o-p
or
alt.
2
3
relations
specified
5
7 11 15 19
9 13 17
OFF=
4 6
change?y
change=15 o
more changes?n
dday-dno
ON=
15
POSSIBLE= 1 2
rev.
cust-ord
ord-o-p
consistency
constraints
are:
the relation
shall have this impl.
possible
impl. alternatives
not desirable
impl. alternatives
these impl. alt.
are not permitted
Do you want all
displayed?
(a/s)a
8 10 12 14 16 18 20 21
3
5
71119
9 13 17
OFF=
4 6 8 10 12 14 16 18 20 21
dno-ord
ON=
POSSIBLEJ- 1 2 3 5 7 11 15 19
UNWANTED= 9 13 17
OFF=
4 6 8 10 12 14'16 18 20 21
rev.
rev.
p-pinfo
ON-
The "Run-units"
listing
is merely
just
a
structured
reprint
of the input data with the
relations
expanded and a note "rev."
if
the
relation
is used in backward direction,
i.e.
from item2 to iteml.
Design
(y/n)y
You can change among the ON, POSSIBLE and
UNWANTEDimplementation
alternatives
by writing the implementation
alternative
number and
the letter
0, P or IJ respectively
on one line
Explanation:
3.2
(y/n)Y
You will
see all or specified
relations
with
their relation
name and what consistency
constraints
that hold for the different
implementation alternatives.
amount
30.00
1
FIND UNIQ partno
2
GET ALL
ord-part
partno
part-o-p
3
GET
orderno
ord-part
ord-o-p
rev.
4
GET
custno
orderno
cust-ord
rev.
2
3.00
SELECT
ord-part
3
GET
amount
ord-part
o-p-amnt
30.00
ins-supp
1
INSERT UNIQ delivno
2
0.20
INSERT
delivday
2
0.80
LINK
delivday
2
INSERT ALL orderno
alternatives?
reversed
30.00
custinf
implementation
POSSIBLE= 5
UNWANTED= 6 9 10 11 12
OFF1 2 3 4 7 8 13 14 15 16 17 18 19 20 21
change?n
Note: For all l:M-relations
except
for
dnoord the suggested IA is in this example put
chain with both owner and prior
to 15, e.g.
For all l:l-relations
the only pospointer.
sible IA is left with
number 5, e.g fixed
aggregation.
directfree
Supplying design directives
to the IDBD for
choosing among implementation
alternatives
is
best illustrated
by showing part of a typical
interactive
user-IDBD session.
The following
is an example
interaction
with the IDBD:
of
a
IDBD will
listing:
Conference
with
the
following
There are 8 different
DBTG-structures
to examine.
Do you want to reduce the solution
space further7n
Do you want to limit
the CPU-time when optimizing?n
user
Submit values to the following
constants
maximum record length=512
counter length=2
pointer length=4
max. secondary storage=100000000
CALC storage factorll.2
CALC access factorl.5
Proceedings of the Eighth International
on Very Large Data Bases
then continue
Note: Normally IDBD will
save the best
10
solutions
(with
least number of accesses to
In this case there are only
the data base).
of which 6 of these are
8 possible
solutions,
accepted as correct
DBTG-structures.
116
Mexico City, September, 1982
4.
The Design-aid
4.1
The algorithm
preferable(possible)
and unwanted IAs.
In
this
way the tool
helps
the designer to
choose good data structures.
If the designer
lets
IDBD to decide, IDBD has then a smaller
number of IAs to consider.
The solution
space is thereby considerably
reduced.
4.1.1
Interactivity
The
three
different
design
tools
developed at the University
of
Michigan [Mit-75,
Ber-77B, Pur-79 ] all
work
in a similar
way. The design aids are typically batch programs.
From a specified
input
the tools
produce,
sfter
an optimization
phase, efficient
data structures.
Some problems with this type of tools are that
- they sometimes
tions
produce
- they restrict
themselves
solution
space
unnormal
to
a
- maximum record
length
which
are
are
is not violated
- if the data item types in the relation
have different
security
codes, duplication and aggregation
are ruled out
solu-
- for a
partial
there
related
sible
all B
reduced
- the designer has no possibility
of
ing his/her
own data structures,
may be more natural
or which may
other
(non-quantifiable)
desirable
perties
- they take,
in spite
of
algorithms,
optimization
to run on a computer
for
data bases.
Some of the validity
constraints,
checked before the interaction,
testwhich
have
pro-
relation,
where there
exists
a
mapping from A to B, such that
exist B data items
that
are not
to any A data item, it is imposto aggregate B under A such that
data items are represented.
Suppose there
is a l:M-type
of relation
between A and B data items so that one A data
item is related
to zero, one or more B data
The validity
constraints,
which are
items.
are analochecked in this case, are (there
gous rules for l:lM:l- and M:N-relations)
sophisticated
too long time
normal
sized
- it
To overcome these problems the design aid has
The data base designer
to be interactive.
has then the possibility
to manipulate
with
the solution
space so that these problems do
not have to arise.
is impossible
to aggregate
- a chain or pointer array
in reversed direction
- there
A under B
can not be used
is no need for dummy records
4.1.2
Validity
constraints
Certain validity
in order to
constraints
must be satisfied
arrive at a valid DBTG data structure.
- duplicating
of A under B is done
fixed length, not variable
length
Some of these rules cannot be checked until
a
with
records
and sets is
DBTG structure
The validity
constraints,
which are
created.
tested, are
- if
references
in the run-units
only
in the direction
of the relatraverse
tion, e.g. from A to B, there is no need
for owner pointers
or for duplication
of
A under B
record
its
lengths
are within
permitted
a record type has no repeating
a level > 1
lim-
- if there is only traversing
in the opposite
direction,
duplication
or aggregation of B under A and chain
or pointer
are made
array
without
owner pointer
unwanted
groups on
a data item type is not aggregated
than once
more
that a set has not the same record
both as an owner and as a member.
me
- if there is traversing
in both
tnions,
we need owner pointers,
chain or pointer
array
without
pointer are made unwanted.
can, howof the validity
constraints
be checked before the interaction
with
ever,
This
is
the data base designer takes place.
which IAs are valid or
done by determining
Unlike
the
not valid for
every relation.
Michigan
design
tools,
IDBD does not just
distinguish
between valid and not valid
IAs
into
IAs
valid
categorizes
also
but
Proceedings of the Eighth International
on Very Large Data Bases
Conference
with
direce-g*
owner
4.1.3
Cost caIcul.ation
The storage cost
is
calculated
as the sum of the lengths of all
records, pointers
and wasted areas for hashed
wasted areas are
These
record
types.
"CALC
estimated by the aid of the parameter
storage factor".
117
Mexico City, September,
1982
There are a few assumptions
about the
record
type
implementation
and the DBTG data base
management system.
If a record
type is
only
with
direct
then
it
is
accessed
access,
assumed that the record
type will
be hashed
If
a record
type is
(CALC in
Codasyl).
record
accessed
sequentially,
then the
type
is
made a member
in
a Codasyl SYSTEM set
(Singular
set).
If
there
is
a need
for
direct
access to more than one data item type
in a record
type,
then it is created
a seconThis
index
is
assumed to be
dary
index.
implemented
by a pointer
array.
The access cost is calculated
for
unit.
We differentiate
between
of access costs.
These
are
the
number
each
three
Run-unit
verb
GET FIRST
MODIFY FIRST
INSERT FIRST
TABLE 1.
runtypes
4.2
of
alternative
total
storage
- CALC accesses,
hashed
records.
plied
with the
factor”
total
number
set
which
are
accesses
to
The number is multiparameter
“CALC access
are
the
the
The number of accesses
calculated
is an estithe
number of physical
accesses.
mation
of
Records,
which are stored
in the
same block
or already
stored
in
primary
storage
in
buffers,
are not considered.
There
To get an estimate
of the number of accesses,
IDBD takes care of the hierarchical
structure
in the run-units
with
different
number
of
data
items
required
on each level.
Average
IDBD also considers
values
are used.
- what kind of data entry
to the record
type
- the combination
run-unit
.
there
is
of IAs and verbs
are
defined
in
the
of
pointer
of
the
number
As an example
consider
the verbs GET
accesses
calculated,
FIRST, MODIFY FIRST and INSERT FIRST
for
an
implementation
alternative
with a chain with
prior
pointers,
for instance
IA=lS.
Table
1
shows the figures.
Proceedings of the Eighth international
on Very Large Data Bases
Conference
cer-
contains
informa-
structures
types
access
cost
of accesses
cost
for
and
each run-unit.
This
solution,
information
is,
for
each
displayed
as follows
(examplified
by the
solution
alternative
with the
lowest
access
cost).
accesses
- the cases where the data item types
stored
in the same record
type
for
The results
Each solution
tion about
which
accesses
In the case of INSERT FIRST
both
the
owner
and the previous
first
member have to be read
and written
and the new first
member shall
be
written,
e.g. 5 accesses.
record
accesses,
pointers.
1
2
5
Number of pointer
tain verbs
scan
where
a
- sequential
accesses,
through
the records
for a certain
record
These
accesses
can be
type is made.
types,
if the
cheaper
than
the
other
records
are clustered
into blocks.
- pointer
through
Number of pointer
accesses
118
are
6 record
types
Record type no
0 delivday
Record length:
30 records
1 (delivday)
has
6 char.
CALC access
6 char.
Record type no
0 delivno
Record length :
2 (delivno
150 records
) has
5 char.
CALC access
5 char.
Record type no
0 orderno
Record length:
300 records
3 (orderno
) has
6 char.
seq. access
6 char.
Record type no
0 custno
1 custname
1 custadr
Record length:
4 (custno
) has
1000 records
5 char. CALC access
30 char.
30 char.
65 char.
Record type no
0 ord-part
1 amount
Record length:
5 (ord-part)
0 char.
8 char.
8 char.
Record type no
0 partno
1 partinfo
Record length:
6 (partno
) has
2000 records
8 char. CALC access
92 char.
100 char.
has
1500 records
Mexico City, September,
1982
There are
5 set types
Table 2 shows the storage
and access
costs
for the 6 alternatives
in this sample case.
150 instances
Set type no 1 (dday-dno) has
30 records
Owner = record type no 1
150 records
Member = record type no 2
The set is implemented with DBTG-set
with owner pointer and prior pointer
300 instances
Set type no 2 (dno-ord ) has
150 records
Owner = record type no 2
300 records
Member = record type no 3
The set is implemented with DBTG-set
with owner pointer
Solution
number
Access
cost
Storage
cost
No. of record
types
No. of set
types
1
2
3
4
14045
14165
14315
76512
100025
100025
391836
393036
420636
390336
391356
410436
6
6"
5
5
6
6
6
4"
4
TABLE 2.
delivday
1g
delivno
cost:
Ron-unit
ord-day
has the access
seq. access
0
CALC access
1
pointer
access3
cost:
Run-unit
amount
has the access
0
=
seq. access
CALC access
1
pointer
access2
cost:
Run-unit orders
has the access
seq. access
= 300
CALC access
0
pointer
access- 3000
cost:
1 times
Run-unit del-supp has the accesr
seq. access
=
0
CALC access
=
1
pointer
access77
cost:
30 times
Run-unit
ins-supp has the access cost:
0
seq. access
CALC access
=
1
pointer
access=
106
Run-unit
ins-cust
has the access cost:
se*. access
0
CALC access
1
pointer
access3
30 times
Run-unit upd-part
has the access
0
seq. access
CALC access
1
pointer
access1
The total
number of accesses-
Proceedings of the Eighth International
on Very Large Data Bases
orderno
T
ord-part
amount
e
partno
partinfo
the data base is 391836 char.
Run-unit custinf
has the access
=
0
seq. access
CALC access
=
1
pointer
access=
25
R
-
1500 instances
Set type no 5 (part-o-p)
has
2000 records
Owner = record type no 6
1500 records
Member = record type no 5
The set is implemented with DBTG-set
with
owner pointer and prior pointer
for
custno
custadr
1500 instances
Set type no 4 (ord-o-p
) has
300 records
Owner = record type no 3
1500 records
Member = record type no 5
The set is implemented with DBTG-set
with owner pointer and prior pointer
cost
30 times
R = DBTC-set implemented
with owner pointer
= record
30 times
-
14045
1
with
chain
100 times
El
cost:
costs
The schemata for solution
alternatives
1 and
4 are graphically
illustrated
in figures
1
and 2.
300 instances
has
Set type no 3 (cust-ord)
1000 records
Owner = record type no 4
300 records
Member = record type no 3
The set is implemented with DBTG-set
with
owner
pointer and prior pointer
The storage
Access and storage
- set
Figure
1.
type
type
Solution
= seq. access
+p
= direct
alternative'no.
access
1
LO times
1500 times
I
Conference
119
Mexico City, September,
1982
1
i
;
2.
alternative
delivno
that to
in the
records
access
The too1
IDBD is running on VAX computers
The program
operating
system.
standard UNIX call:
idbd c-i infile.
1-r rfilel
i-s
[-u sifilel
(-a outfild
sofilel
with
call
[-p
UNIX
is a
-s
The name of the input
data file,
which
contains the data items, the
relations
and the run-units.
outfile
The name
lengthy
filename
nal itself.
produces
parmfile
answering
questions
Instead
of
about the values for the different
parameters,
these can be stored
in
The name of that file is
a file.
given with this parameter.
rfile
If there is just one DBTG structure
to analyze,
Put the IAs for the
relations
1, 2 etc. in a file
and
name that file with this parameter
at the program call.
of the file,
where the
listings
are stored.
The
/dev/tty
means the termiThe filename /dev/null
no output file.
Proceedings of the Eighth International
on Very Large Data Bases
Mscur3sioll
IDBD has the flexibility
to optimize
the
design
from a predefined,
desired
scope.
Tests show that it is a valuable
property
in
order
to make a tool useful in a variety
of
practical
design situations.
parmfilel
It is possible to name the different
files
at
Otherwise
the program call with parameters.
the names of the files will be asked for
by
The different
the IDBD during
execution.
files
are
infile
TO
Discussions
with practitioners
in the field
one is not always
has
disclosed
that
interested
in 'optimizing'
the total
DBschema. For many reasons (reliability,
security,
modifyability,
comprehensibility
etc.)
the designers may wish to implement parts of
the data base in a specific
way and leave the
rest
of it (if any) to the 'optimizer'.
The
difficulty
of predicting
future workloads may
- in the strict
sense
also make optimization
- more an intellectual
exercise than a realAn optimal solution
alternaistic
approach.
have a merit.
It
tive may, nevertheless,
against which other,
provides
a 'base-line'
solutions
can be measured in
more 'natural',
terms of access and storage costs.
no. 4
In solution
alternative
no. 4 is
duplicated
under orderno.
This means
get an orderno from a given
delivno
records
with
ordernotdelivno,
those
have to be scanned.
That is why the
cost in this case raises tremendously*
4.3
continue
processing
after
an
interrupt
with
-s aofile
give the
name of the saved
fi'le
with
this
parameter.
5.
:partno
partinfo
L---d
Solution
sifile
amount
r----
Figure
It is possible
to interrupt
the
program after
the input phase in
order to examine the output
from
the
input
checking
procedures.
Give in that case this
parameter.
The data is saved in the file
sofile.
custadr
h---‘-7
Y
sofile
Conference
120
[Ber-77A]
Berild S., Nachmens S.: A Tool for
Data Base Design by Infological
Simulation,
Proc. 3rd Int.
Conf.
on VLDB, Tokyo, 1977.
[Ber-77B?
A Methodology
for
Berelian
E.:
Base Design in a Paging
Data
The University
of
Environment,
Engineering
Systems
Michigan,
Laboratory,
Ann Arbor, USA, Ph.D.
Thesis, 1977.
[Bub-801
Bubenko
jr
J.A.:
Information
Modeling in the Context of Systems
Lavington
Development,
in S.H.
(Ed.) "Information
Processing SO”,
North Holland,
1980, pp. 395-411.
1 Gus-821
Gustafsson
Bubenko jr
Approach to
Modeling,
Conference
Information
Karlsson
T.,
M.R.,
J.A.:
A Declarative
Conceptual Information
IFIP
WG8.1 Working
"Comparative Review of
System Design Metho-
Mexico City, September,
1982
dologies",
[Mit-751
[Pur-79
North Holland,
1982.
Mitoma M-F., Irani K.B.: Automatic
Database Schema Design and Optimization,
Proc. of 1st Int. Conf. on
VLDB, 1975.
]
Purkayastha S.: Design of DBMSprocessable
Logical
Database
Structures,
The
University
of
Michigan,
Ann Arbor,
USA, Ph.D.
Thesis, 1979.
Proceedings of the Eighth International
on Very Large Data Bases
Conference
121
Mexico City, September,
1982
Download