at the 1975 5 P. Sl1oan'-

advertisement
DEZISION RULES FOR THE AUTOMATED GENERATION
OF SrORAGE STRATEGIES IN
MANAGEMENT SYSTEMS
DATA
by
GRANT N SMITH
S.B., MIT
(1974)
SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF
SCIENCE
at the
MASS&AHUSETTS
INSTITUTE OF
TECHNOLOGY
June, 1975
Signature
of
Author
..........
.....
..........
Alfred P. Sl1oan'- chool otjAnageggp,
Certified by..-..
..
-
.
..
...
.
/7
May 9,
.
1
5
, ,........
Thesis Supervisor
.o .0
*.*
. *
. .... *
.....
by . . .s..
.....
.......
Chairman, Departmental Committee on Graduate Students
Accepted
RC-HIVES
JUN 13 1975
16RARIUS
Abstract
PAGE 2
DECISION RULES FOR THE AUTOMATED GENERATION
OF STORAGE STRATEGIES IN DATA MANAGEMENT
SYSTEMS.
by
GRANT N SMITH
Submitted to the Alfred P. Sloan School of Management on May
9, 1975 in partial fulfillment of the requirements for the
degree of Master of Science.
Current methods of determining storage strategies (both
logical and physical) rely usually on (1) expert opinion,
and (2) the experience of the designers. There has been
some work in the area
of automated design, but the
approaches taken to date generally apply only at generation
time, thus leaving the resulting design in effect for the
rest of the life of the system. Should usage of the system
change over time, as experience shows that it will, large
inefficiencies may result owing to the original choice of
storage strategy.
The work presented here attempts to introduce dynamic
decisions regarding storage strategies that will be invoked
(1) on a regular basis, and (2) when system performance
degrades below an unacceptable
level. These decisions
involve both the structure of the data base (such as which
fields are to be in which files), as well as indexing, data
encoding, factoring and virtualizing decisions. Decision
rules are described which achieve this result.
Also described is a procedure whereby any given request will
be most efficiently satisfied, making use of the current
structure of the data base, indexes, etc.
Finally, the set of decision variables required to drive the
above decision subsystems is specified in detail.
Thesis Supervisor: Stuart E. Madnick
Title:
Assistant Professor of Management Science
PAGE 3
Acknowledgements
I
wish
to
invaluable
advice, comments
only the writing
in which I
Professor
thank
for
his
criticism throughout
not
Stuart
and
madnick
E.
of this thesis, but through
all the years
be associated with
have been so fortunate as to
him.
In everyone's life, there is
that
is Professor
one eternal optimist. In mine,
Donovan. He
John
has
been a
powerful
motivating force throighout this work.
Last, but by no means least, thanks are due Chip Ziering for
many hours of fruitfull
-
often heated -
discussion.
rable of Contents
Introduction......
. .
a...**..*
*
....
PAGE 4
.* ...
The Relational Model of Data............
....... Page
6
.......
15
Page
Shared Data Bases and User Flexibility..
......
Decision Variables and Truth Functions..
.....
The Structural Decision Subsystem.......
....... Page
The Request Decision Sub system..
Page 37
*.Page
53
68
Page 108
...............
Scenario for application of
Decision Rules...
conclusion.. ..
. .........
References.
. . . . ..
. ..
..
..
..
....
........
..
Bibliography............
Appendix 1..........
..
............
00
*
.........
Page 126
........
Page 145
........
Page
..........
Page 152
151
.Page
156
Appendix 2..........
.Page
167
Appendix 3......
.Page
177
...
.
*
..
eO..e.
..
*
S
55
555550*05
SL
SS
*
..
*
.
.
S S SS
S S
*
*
5555*B*O
55
*
55
*
0
L "L
9 *L
S S.
S 55
0555
*050
0S05
OS
S 505000
055050055
C~ L
L
5550
*
535
550505555
55
6ZL1
5005505
Z. L
*
055
5.5555555
Li *fI
*05055505
*
8 ti
Se.
SeS
E IE
5.55
55.5
Eti
*
55
*O.S**5
SSSsSS
*05S5S~**.
555
t OE
ti~
550
.3.55.5
91t
.SSSO
55
*55.
a
*
*
&SS
B
&S
*SS
0
0
5
055
0
0.
.
.
.
.
.
.
5555
.
.
35
.
.
5
00
,
L eJ
0a
go Is-
dsajnbT~
S Z!)Yd
CHAPTER I
PAGE 6
Introduction.
For
programming
applications
primarily
stake
at
was
if
applications programs
the
(as
structure
the
structure
that the
idea being
been
the
avoidance
expounded.
from the logical
mapping- function. Then,
system
the
to
machine
data structure),
the physical
(or
if
into some
be known)
of
underlying data
structure available
data
was
What
rewriting
of
whenever the
and
came to
applications program
oriented data
has
Involved was a mapping
base was changed.
data
data-independent
of
concept
the
nos-
years
some
would take
care of
this
were any
change in
the
there
physical structure, the mipping function would be changed so
that
change
structure
logical
the same
still
would
programs,
and, in
programs,
or
a
be
person generating
Note that
to mean either a
not necessary for our purposes
two classes,
since as far as
to
user (be
fact, any
logical data structure).
the term user
presented
as existed
any
it in
requests
this
before
applications
of
the form
that
against
throughout we shall use
program or a person.
It
is
to distinguish between these
the database is
concerned,
all
PAGE 7
ZHAPTER I
requests look alike.
a division of responsibility,
Arising from this approach is
The logical structure of the data is
and thus of expertise.
in
the
responsibility of
of
the domain
the
user, while
mapping function previously also
physical structure and tha
in the domain of the aser, have now been removed. This is in
fact a desirable featire as the user may concentrate efforts
on
technicalities of
in the
bogged down
rather
problems
applications-oriented
than
becoming
data
establishing a
base.
However,
that is
way things turned out. There
not guite the
were several attempts to design
systems which would perform
the mapping function ini handle
the physical structures for
the
user, given
the
was that the
turned out
the
physical
user
had
Furthermore,
no
to be
(and
structure
quite
thus
differentiation
generally delineated
the way
mapping capabilities tended
rather simplistic in Zoncept and
that
But
logical structure.
and so
execution,
of
the user
to be
with the result
knowledgable
the
it
mapping
about
function).
responsibility
(or user
the
was
group) now
took on the responsibility of both the logical- and physical
PAGE 8
CHAPTER I
structures. True, for any one logical structure the user now
had a choice of physical structure coming from a wider range
previously have allowed, but
than personal experieice might
was a blessing
whether that
horror remains
or a disguised
unclear.
Other promises made -
and not kept - about data independence
mapping function. Theoretically, a
again revolve around the
structure should be
given physical
several logical
mapped into
able to be
This facility
structures, and vice-versa.
has not been realized to any notable extent.
Furthermore,
the
namely the isolation
data structure,
Rarely, if
ever, was
of
data
of users from changes
has not
once established. It
one physical
purpose
primary
yet realized
the physical
and no one
in the physical
its full
potential.
data structure
was a Herculean task
structure,
independence,
altered
to implement any
was about to go
in and
mapping
would
tamper once it was working.
Any
one
physical-to-lagical
generally be performel
structure
only once, and decisions
as to what
CHAPTER I
it
should be
decisions were, and still are, made
a fixed
with
the data
base. These
by people.
Much of the
future uses of
to the
time,
point in
were male at one
perception as
PAGE 9
knowledge on which these decisions
were based was knowledge
gained from experience, and so was
more akin to an art than
a
science.
non-trivial
some
However,
of
subset
such
decisions are indeed logical and rational, and so subject to
some measure of automition.
It
would
made to take
of some of
advantage of the structured nature
On the :ontrary,
these decisions.
there have
task, and
to this
efforts addressed
has been
no attempt
to claim that
be inaccurate
been several
these efforts
can be
divided (perhaps unfairly) into two major groups:
simulation-oriented
generation to
are notably
decisions used prior
to system
structuring decisions.
These
static, one-time decisions made
at the
aid in
discretion of some person, and requiring substantial
human intervention. The results of decisions made at
that
point were
life of the
the effort
to be
lata base.
influential throughout
the
credit is
due
However,
to formalize some
much
major aspects
of the
decision.
dynamic
rules used continually throughout
the life
CHAPTER
the
of
to
base
data
I
reguire major
Whether any
gained
is
arose the dilemma of
to draw
intended
the years
to
provile
logical-to-physical
(albeit
working
a
system.
over
purport
results was
on these
with
tamper
inefficiently)
This work
that they
ongoing basis.
the system on an
iction was taken
to
in
However, the
upon.
questionable. 3nce again there
whether
and
monitoring effort
these efforts was in
important aspect of
10
human intervention
their interpretation and acting
attempted to track
usage
system
monitor
results of this
performance. The
would, again,
PAGE
in
dealing
data
structure
invaluable insights
on the
some
and
independence,
to
mapping, and
as
systems
with such
propose
a
methodology for achieving some of the promises made earlier.
It
is
important
to emphasize
is
that this
methodology
a
pretend to be all-encompassing. The
since no one work could
approach here will be to:
1)
describe a
system
independence 3asel on
in which
there
is
true
a physical-to-logical
data
mapping
capability,
2)
enhance
some
of
this system
the
better
with the
ability to
formulated
decision
perform
tasks,
PAGE 11
CHAPTER I
the
including
reconfiguration
dynamic
structures
without
structures.
Attention
initial
of system
monitoring
structuring
of
of
also be
will
decisions made
the
data
physical
the
alteration
and
use
the
logical
paid
to
at
the
definition
time.
3)
further
enhance
the
system
capabilities that are oriented
with
decision
toward the efficient
satisfying of requests against the data base.
The work here revolves around
the relational model of data.
be a dismissal of all other
This should not be construed to
models (such as the network model) as inferior. The author's
familiarity with the relational model and the existence of a
well-defined set of theoretical rules that can be applied in
the model were the motivating
It should also
factors behind this decision.
be pointel out that the
herein used has embellishments
relational model as
and alterations derived from
various personal experiences and sources of the author.
responsibilities for
any errors and inconsistencies
model employed here should not
the well-known names
in the
necessarily be attributed to
behind the relational model;
well be the fault of the author.
The
they may
ZHAPTER I
PAGE 12
Structure of Thesis.
Chapter II will introiuce the relational model as needed for
our
purposes,
and
point
out
the
differences,
where
applicable, between this molel and that found in most of the
literature.
Chapter III will address itself
to the methodology employed
for achieving data independence.
of decision variables maintained
Chapter IV presents a list
by the
system.
Since there is
information about
system usage and performance
consistent set
required to
regarding physical restructuring,
support dynamic decisions
a
of statistical
a long list
of rules
has been
developed for
naming
these decision variables.
Chapter
V
responsible
will
address
for
initial
itself
to
the
specification,
decision
and
physical data
dynamic reconfiguration
of the
the Structural Decisign
Substga (or SDS),
and
rules
subsequent
structures chapter VI
CHAPTER I
concern
will
decisions
those
optimally satisfying requests
PAGE
13
dynamically
about
made against the data
base -
made
the Reguest Decision Subsistem (or RDS).
scenario, and those decision
Chapter VII presents i typical
VI will be applied to the
rules developed in Chapters V and
demonstrate the
scenario to
effectiveness of
the decision
rules.
Chapter VIII
possibilities
further
explored,
as well
as
that
should,
perhaps
can, and
ways to
as to
with some remarks
concludes the thesis
expand
be
rules
the decision
utilizing a similar methololgy to that employed here.
Again,
pointel
must be
it
and
are
intended
clearly
to
that the
V and VI are
developed in Chapters
certainly dependent
out
on the
not
demonstrate
situation specific (and
implementation of
universally
rules
decision
Chapter III)
applicable.
a methodology
and
They
there
is
are
no
intention of developing a comprehensive and universal set of
rules.
ZHAPTER I
Finally,
some
familiirity with BNF (Backus-Normal
assumed throughtout.
PAGE 14
Form) is
Good introductory sources are (1,2).
CHAPTER II
PAGE
15
The Relational |odel af Data.
Probably
the
major
model
relational
stumbling
is
the
block
in
introducing
The
terminology.
the
concepts
underlying this approiCh are familiar to us all.
consider a regular report,
one
time or
another.
convenient format
spell out the
for each
or table that we have all seen at
In
Figure 2.1
for representing such data.
category.
example, in
'dept#',
has
Figure
'description',
difficulty in
rows and
that the
Note
2.1 we
a table;
a
The columns
provide a value
categories of data; the rows
well be interchanged without loss
For
is such
columns might
or alteration of meaning.
see
the columns
etc. And there
interpreting the
are 7
labelled
rows.
No-one
information in
Figure
2.1, and this is essentially the relational model.
By convention in
columns, and
the relational model,
put the
data in
the rows
we
always label the
(ie: horizontally)
just as is the case in Figure 2.1 . Furthermore, the columns
are called domains,
names.
ani the
column headings are thus domain
This arises from the mathematical concept of a domain
CHAPTER II
PAGE
Plant: White Plains, New York.
Period ending: Aug 31.
Summari of -oerations
(in 000's)
Difference
Dppt# Descrittion La b3 r Expense
Actual Budet
(Actual-Budget)
- 639
6464
7103
2990
1
Spray
-1152
12829
13981
5915
Coating
2
+ 400
2590
998
2190
3
Filing
-1336
5243
3907
1637
Sanding
6
+ 525
11275 10750
7
5915
Buffing
- 152
8998
8846
4788
Assemble
10
and Pack
TOTAL
22243
45911
48265
Fi-aur_ 2. 1
-
2354
16
CHAPTER
as being a
II
PAGE 17
collection of objects (or numbers,
information-carrying
item).
When
or any other
we choose
a value
from
that collection we are choosing an item from that domain.
Notice that
each row is created
of the
from each
called an entry.
six domains.
then the departments
for
example,
information is
Now,
of the entries
might just
We
the top of the
we shuffle
table, and
In fact,
the rows; it
the rows in random order
in
a
telephone
as
may be
(as it would
but
directory)
no
lost by a random ordering of the entries.
if we were to interchange domains 1 and 2 of Figure 2.1
(ie: 'Dept#'
But
and 'description')
changed the domain
provided we
well.
table is
in decreasing 'dept#' order.
no information if
inconvenient to have
be,
important.
the 'total' entry at
easily put
we lose
is not
single item
in the
Each row
Notice also that the order
the table
(rows) in
by choosing a
notice that the order
one entry must be
the table
is
there
would be
names (column
of the domains
the same as that in all
to remain meaningful.
no problem
headings) as
within any
other entries if
Thus, the order
of the
CHAPTER II
PAGE 18
domains is important, while that of the rows is not.
PrimaEXr-Ks.
the table far
entry in
only one
can be
that there
may observe
2.1 we
In Figure
and the
value of 'Depti',
any one
same applies for 'description', while there is no reason for
this to be the case in any of the other columns. In fact, in
column
of
is
information alone which department it
it
is
both departments,
given a value
then there
is
there is
no
for 'Dept#',
information relevent
to that row. Eg:
say that
problem.)
'Dept#'
table; ie: for any value of
But,
ambiguity about any
given Dept# =
candidate £Rigary
a
'Dept#' is
(If
meant.
values in the entry.
can uniquely determina all other
we
that is
no
that
from
determine
not
we can
2.1,
Figure
Thus,
the 'labor'
that it is in
value '5915' and told
given the
twice.
'5915' appears
domain the value
the 'labor'
there is
t§! for
2,
we
Thus,
the
only one entry
in the table.
In
the
event that
combination
there
of domains
is
must
property; namely, given a set
of
domains, the
entry
no
be
then
some
exhibit
this
such domain,
found that
of values for that comination
zontaining
those values
in
those
CHAPTER II
domains is
A
PAGE 19
uniquely determinel.
table need
hive
not
a primary
key,
but
it
is
often
advantageous from the standpoint of efficiency to do so. (In
model propsed by
the relational
that are
two entries
may contain
no relation
Codd, et. al.
identical, and
always exists some primiry key, even
so there
if it is a combination
of all domains. This is not the case here, as can be seen in
the definition of the 'Join' operator in Appendix 2.)
Normalization.
Looking again
the
table
at Figure 2.1,
is
some idditional
'Plant', and the
are
two
We see
'Period' covered.
the
unler
such
information,
heading
of
as
the
also that there
'Expense';
viz.
and 'Budget'.
'Actual'
Since
columns
printed above
we notice that
the relational
tables,
world as
the
a set
of
we must find some way to include that information in
the table itself. As it
the informarion in
heading.
model views
now stands, it
is
not really part of
the table; rather it is a
Considering the
fact
that
form of table
the 'Plant'
and
the
CHAPTER II
of the table, we may assume
'Period' are printed at the top
and we further
of some iiportance,
is
that it
PAGE 20
assume that
there are other plants and other periods.
up a distinct table for each
one course of action is to set
plant/period combination, each having an identical format to
result in
that of Figure 2.1 . This would
distinct tables, and
identical, yet
actioa
suggests
itself: set
plant/period combinations,
a
up
a large number of
so a second
single table
and somehow
course of
for
all
distinguish entries
and period. This can be
as belonging to some specific plant
done by simply adding two domains to Figure 2.1: 'Plant' and
'Period'.
Furthermore,
notion
of
'Budget'.
we must
of incorporating
way
find some
domains
'Expense'
into the
two
so,
we might
merely
To do
'Actual expense', and 'Buiget Expense'.
'Actual'
rename the
the
and
domains
The table now is as
appears in Figure 2.2 .
Notice, however,
based on the
that the table
notion 3f 'Expense';
contains two
we have
domains each
just renamed the
V.
-~
Summary of Operations
(In
Plant
Period Dept* DescErig2tion Labor
W Plns
10/31
1
W Plns
10/31
10/31
10/31
10/31
2
10/31
10
W Plns
W Plns
W Plns
V Plns
3
6
7
Spray
Coating
Filing
Sanding
Buffing
Assemble
2990
5915
998
1637
5915
4788
000's)
Budget
Actual
Expense
Expense
7103
6464
13981
12829
2190
2590
5243
3907
10750
11275
8846
8998
Difference
(Actual-Budget)
-
639
-1152
+
400
-1136
+ 525
-
152
and Pack
W Plns
10/31
TOTAL
22243
45911
Figure 2. 2
-2354
CHAPTER II
two domains
appearing
single
as in
Figure 2.2.
in either
column
domain:
the
prefixing 'Actual'
specify the role
In
'Expense'
case, the
fact,
domain.
and 'Budget' to
domain is
table,
must
be qualified by a
to provide
The
reason
the domain name
used more
suzh
values
chosen from
of each of these domains in
if a
failure
this
are, in
general,
it
PAGE 22
was to
in any
role name. If
in
for
the table. In
than once
role names
a
there
one
is
that event,
a
then
ambiguity results.
Use of a role name is
is used
not limited to cases in which a domain
more than on:e
in the
same table, and
any domain
name may be qualified by a role name.
Figure 2.2 is a version of the table which has unique domain
(or rather role) names, and is set
in
contains all information
some of it
in the form of
normalized table.
In
of taking information
the
plant
and perioi
the table itself
table headings.
general,
of
Figure
we take the
as opposed to
This is
called a
normalizing a table consists
that applies to all
integral part of the entries
More specifically,
up in such a way that it
2.1)
entries (such as
and making
it
an
themselves (as in Figure 2.2).
primary key of tables higher
CHAPTER
II
up in the hierarchy, ini make it
the
lower
table. An
PAGE 23
part of the primary key of
this
clarify
help to
will
example
point.
The example
Figure 2.3(a)
appearing in
is
one
domains,
view.
2.3(a).
Notice
in the
Notice that some domains
table) are not really
'employee'
ames of other tables.
but the
Figure 2.3(c)
that might be
form of a set of tables
formed to store this logical
(such as 'children'
logical
exist in a corporate data base.
view of the data that might
Figure 2.3(b)
shows the
is the normalized
set of tables arising from
that Figure 2.3(c)
was derived
from Figure
2.3(b) by the following steps:
for each domain
fact a
name in a table (say A)
table name
table A,
and make
(say B)
it part
that is in
take the primary
of the
key of
primary key
of
table B.
remove the table name (B) from table A.
This is the process of normalization, and, in the relational
model, 111 tables
must be normalized (ie:
must not contain
CHAPTER
II
PAGE 24
EMPLOYEE (E MP. NAME, AGE)
CHILDREN (CHILD. NAME, AGE)
JOBHIST (JOB.DATE,TITLE)
SALARY (SAL. DATE,SALARY)
1)
2)
3)
4)
EMPLOYEE (EMP. NAgAGEJOBHISTCHILDREN)
JOBHIST (JOB.DAT,TITLE,SALARY)
SALARY(SAL.DAfTSALARY)
CHILDREN (CHILD. VAgAGE)
laL
1) EMPLOYEE (EMP. NAMAGE)
2)
3)
JOBHIST (EMP. NAME, OB.DATgTIT LE)
S ALARY (E MP. NA ME, JB.ATE, _AL.DATE, SALAR Y)
4)
CHILDREN(EM_.NAEg,ZHILDNAEi,
AGE)
Fiqure 2.3
(Primary keys underlined)
CHAPTER II
PAGE 25
domain names that are in fact table names).
Why 'relational'
model? What we have been calling tables are
call 'relations' in the relational
an arbitrary name.
entry is
table.
model. This is more than
Remember that we described
formed by selecting a value from each domain in
In mathematical
terminology,
subset of all combinations of values,
of
above how an
the domains.
The
name
used for
these
the
entries are
a
or a cartesian product
such
a
subset is
a
'relation'.
More formally:
The cartesian product of A and B (written A X B) is a set of
ordered pairs, each first element of the pair coming from A,
and each second element from B.
A X B =
Ie:
J(a,b):a(A, b<Bj
('4' means 'is a member of')
We can easily obtain an ordered
n-tuple (where n>2)
by this
method:
D1 X D2 X...X Dn
A
relation will
=
normally
a(dl,d2,...dn): di Di,
be written
as
i=1,...nj
a relation
name
PAGE 26
CHAPTER II
list of domain names.
followed by an ordered, parenthesized
For example:
le: R1(D1,D2,D3).
The
(3) for
referred to
reader is
Employee(name,emp#,dept#).
further discussion
of
relations and normalization.
Second Normal Form ani Functional Dependence.
The process
of normalization
(namely, the
described above
removal of all domain names that are in fact relation names)
is adequate for most situations in which the user is careful
asigned to the various relations
to ensure that the domains
are in
fact assigned to
aided by the
the 'correct# relations.
process of diagramming the data
in Figure 2.3(a).
However,
there are times
This is
base as shown
when what seems
quite logical will, in fact, give rise to problems.
Consider Figure
'Dept#' the
or in
value of 'description' is
other words, 'description' is
on 'Dept#'.
well;
Notice that for
2.1 ,
Clearly,
namely,
any given
uniquely determined;
functionally dependent
in this case, the reverse
'Dept#'
is
value of
functionally
is true as
dependent
on
CHAPTER II
PAGE 27
'description', but this need not be the case.
Now, if for any reason we were no longer interested in Dept#
therefor struck
2 and
lose
fact
the
that
the second row
2
Dept#
from Figure
2.1, we
ie:
the
exist anywhere else.
One
is
'coating';
relationship (2,coating)
loes not
way to avoid this is to
establish a new relation containing
only
the
functionally
domains
dependent
(Dept#,
description). We may now strike either of these domains from
the relation in Figure 2.1 without loss of information.
are said to be
These relations
in second normal
form; Ie:
Domains not functionally lependent on each candidate key are
stored in a separate relation. Figure 2.1 would thus contain
'Dept#'
as a
domain, but
relation now contains 'Dept#'
not
'description', and
another
and 'description'.
Third Normal Form and transitive dependence.
Third normalized
relations are second
normalized relations
in which there exist no transitive dependencies.
If B is
functionally dependent on A, and
C is functionally
CHAPTER II
dependent on B,
PAGE 28
then by the algebraic
is also dependent
on A. But in
a
transitivity laws, C
somwhat different manner,
since it is also dependent on B which is dependent on A.
this case,
This is
we say
that C is
true in any cise
transitively dependent
where
In
on A.
the application of algebraic
transitivity yields an additional
functional dependency, as
it did in the above case.
Relations in third normal form would not contain any domains
that
were dependent
other domain
on any
which is
itself
functionally dependent on some domain in the relation.
For the case above,
where C
is
we would establish a separate
transitively dependent on A,
relation containing domains B
and C, and remove C from the relation containing A.
It thus appears preferable to
retain all relations in third
normal form for the reasons outlined above.
The
reader is
referred
to (4) for
a more
treatment of second- and third normal forms.
Transferability of Role Names.
comprehensive
CHAPTER II
PAGE 29
Consider the existence of two relations:
wife.socsec)
marriage(husband.soc_sec,
does
Since 'marriage'
'marriage'.
twice,
name is
role
a
and
'husband.socsec'
the name
list
request to
with socsec
list
Those
supplied
and age of
the wife of
phrased
might be
are:
consider
Now
and
wife.name
that
'wife.socsecI.
arbitrary retrieval language)
domain
contains
mandatory.
617-03-2911. This
and so
domain 'socsect
contains the
'person'
Notice that
and
age),
person (soc-sec, name,
a
a person
(in
some
as follows:
husband.socsec
for
wife.age
'617-03-2911' ;
Notice
that
referenced (viz:
contains
'person' relation
the
'name' and
'age') but
the
not the
domains
role name
qualifier 'wife'. Intuitively, however, it is clear that the
request is
t3 satisfy the
information needed
present, but
not in any way that the system can utilize.
The way that the system mikes use of implicit information of
the type
name qualifier
entry
is by transferring
in the example above
'wife'
designated
'husband.socsec
to the 'person'
by
617-03-2911'
relation only for that
relationship
the
and
the role
the
between
corresponding
CHAPTER
'wife.soc-sec'.
'wife'
loas
name qualifier in the 'person'
PAGE 30
II
not become
a permanent
role
relation.
Set Theoretic Notatil, Definitions and Examples.
In
chapter I
was mentioned
the fact
collection of theoretical rules exists
operate on relations (regardless of
particular
rules. This
normalizel form).
is perhaps
most from those presented
This
where the
that a
well-defined
which may be used to
whether they are in any
section outlines
model used
elsewhere(3,4).
here differs
Differences will
be pointed out at appropriate points in the discussion.
The following operations will be defined:
these
PAGE 31
CHAPTER II
syabl
2peralion
adi**
Diadic/fj
Union
U
Diadic
Intersection
N
Diadic
Difference
-
Diadic
Cartesian Produzt
X
Diadic
Projection
P
Monadic
Join
*
Diadic
Diadic
Composition
Permutation
M
Monadic
Compaction
C
Monadic
Restriction
R
Both
Division
/
Diadic
These
operations
are
briefly
described
here,
and
are
formally defined and examples given in Appendix 2.
** Diadic operators operate on
two relations (they may both
be the same relation); monadic operators operate on a single
relation.
CHAPTER II
PAGE 32
Notation.
R<i> is
the name of the i th relation
* means 'is a member of'
J....j
implies
a list,
or set
of the
items between
the
'Il's.
c(i) is
the cardinality
n(i) is
the degree (number of domains)
(number of entries) in
d(i,j) is
the j th domain of R<i>,
v(m)(ij)
is
t(i)
is
ie:
0 is
in R<i>
j=1,..n(i)
the m th value of d(ij), m=1,..c(i)
an n(i)-tuple in R<i>
(v (a) (i,1)
t(i)
a
L(jaI)
R<i>
is
, v(a) (ij2),..v(a)
(i . n(i))
1,...c(i
the length of list
a
the null set - ie: R<i>=0 implies c(i)=O
asp means a is
a subset of b (a=b is
legal)
acb means a is a prgaer subset of b (a b)
Va means for all values of a
This notation will be used
throughout the remainder of this
CHAPTER II
PAGE 33
thesis.
Explanation of gperatars.
Union
The union
of two sets consists
of a set that
contains all
entries that appear in eitter of the two sets.
Intersection
The intersection
of two
sets is a
set that
contains only
entries that appear in both of the two sets.
Difference
The
two
difference of
entries
that appear
other. Eg: If the
in
sets is
one
a
of the
set
that contains
sets, but
not in
two sets were A and B, then 'A
all
the
- B' is a
set of all entries that appear in A, but not in B.
Cartesian Product
This is
as defined on Page 25
Projection
The projection of a relation is
a procedure whereby some of
the domains in the relation are removed.
Join
A join of two relations is the process whereby two relations
may be
combined into a
single relation containing
domains of the two being joined.
all the
CHAPTER
II
PAGE 34
Composit ion
This is
the same
as the
join, except
which the relations are joined is
that the
removed.
domain on
This means that a
composition is in fact, a projection of a join.
Permutation
A
permutation appliel
to
re-ordering the domains in
a
relation consists
of
merely
the relation.
compaction.
The compaction operator
entries
from
is used for deleting
a relation.
conjunction with
It
is
all redundent
used most
commonly
the projection operator, which
in
may result
in redundent entries.
Restriction
The
restriction operator
is used
for selective
retrieval
from a relation.
Division
Division is the inverse of the cartesian product.
Introduction to XRN.
This section is intenled to be
a very brief introduction to
the pertinent points about XRM.
XRM is a particular implementation of an n-ary relation data
management
model
designed
Center, San Jose (5).
and
built
by
IBM
Scientific
It operates basically as follows.
CHAPTER II
PAGE 35
XRM can handle two types of information:
, character string data, and
(32 bit) numeric data
. fullword
There
are
correspondingly two
relations; one
that handles
major
subcategories
that handles character strings,
n-tuples of
numeric data.
type (character or numeri. n-tuple)
of
and another
Any one
relation
can only handle data of
that type.
Each entity in the system
(character string, or n-tuple) is
automatically assigned an IRM ID
base.
Given
that
ID,
efficiently retrieved
obtains
its
entity, and
ID by
string
the
entity
by XRM.
And,
applying
a
then performing the
character strings,
are hashed;
when entered into the data
some number
in the
can
given
rapidly
and
the entity,
XRM
to
that
be
hashing function
retrieval. In the
of the
case of
first bytes
case of
of the
numeric n-tuples,
all
primary key domains are hashed.
All IDs in
XRM are fullword integers.
In numeric relations (n-tuples), any domain can be inverted.
CHAPTER II
This is
Once
equivalent to
such
domain,
an inversion
that
domain. If
contain
More
inversions in
the given
said
about
that domain.
a value
for
ID's of n-tuples
value
in
that
in the
the
inverted
a linear search
would be
the
implementation
of
Chapter V.
For our purposes,
will
information,
is
index for
given
no inversion existed,
necessary.
concepts
exists,
will rapidly find all
XRN
relation
building an
PAGE 36
this introduction will suffice.
be
explained
the reader is
as
needed.
referred to (5).
Additional
For
further
CHAPTER III
PAGE 37
Shared Data Bases and User Flexibinlity.
based on the relational
This chapter presents a methodology
between the
logical- and
intentions of data independence is
to allow the
achieving independence
model for
physical data structures.
One of the
(it)
user to view the struCture existing in the data he
in a
way most convenient
for a specific
uses
application. This
provided with the facility to
means that the user should be
define any relation containing any domains in any order, and
be able
it as
to use
such. Notice,
some
however,
of the
issues raised by permitting this flexibility.
The
most
problem
glaring
arises
divergence from the concept of
bases. The benefits
been
adequately
proposing the
covered
result
specific applications.
the
many and have
Now
elsewhere,
every user
we
are
a powerful
easy definition of relations for
What does this
Each
of
shared (or centralized) data
facility for allowing
concept?
a
of shared data banks are
tool that allows rapid and
data base
as
user now
do to the centralized
wants (and
is able
to
CHAPTER
have) different
III
relations for his application(s),
basically gaining effiziency and
of
generality.
maintain his
PAGE 38
Each
user
own data
which is
convenience at the expense
must,
needed to
furthermore,
collect
support his
application,
and
rather than delegate that function to a central authority.
The traditional
maintainence
methal of centralizing data
has
administrator
been
the
appointment
whose responsibility
central shared data base,
and
to that data base. Zhinge to
time
consuming,
convenience was
and
so
collection and
of
it is
a
data
to maintain
base
the
ensure that all users conform
the data base is expensive and
generally
sacrificed in favor
to
be
of a
avoided.
User
centralized data
base.
There is no need for sacrifice on either the user's part, or
the
data
base
concept of a
administrator's part.
1:n mapping of physical
demonstrates its
user a specific
value.
There
is
no
This
is
where
to logical structures
reason for
mapping from the single
denying a
physical data base
into a specific logical relation for some application.
presents no
problem if the
the
logical relation that
This
the user
wishes to define on the physical data base is some subset of
CHAPTER III
PAGE 39
the domains existing in that physical data base. But what if
the logical relation
requires a mapping onto
does not yet exist in
the physical
a domain that
data base, and is
yet to
be created?
one
relation
domain.
existing
the
base
data
the physical
in
to include
relevent
new
the
the principle
one could invoke
Alternatively
of a
now from the logical to the physical relations,
1:n mapping,
the
containing
relation
new
a
create
and
redefine
to
is
possibility
required
information.
We
expressed the
have thus
logical to
physical relations;
n:m mapping
for a
need
ie:
map onto several physical relations,
from
relation can
a logical
and a physical relation
can be mapped into several logical relations.
Before
a
proposing
mappings,
let
efficiency.
In
us
a
methodology
iddress
very
very
large
for
briefly
data
constantly uses the same, small subset
relation pays
a high price
implementing
the
base,
a
issue
user
n:m
of
that
of data in a logical
in performing the
mapping each
CHAPTER III
PAGE 40
time. Some exception should be made in such a case whereby a
physical relation
data,
and
existing
relations. This
different to
still
is established containing that
alongside
should not
the user;
appear
to be
the
however be
the logical
the
same
subset of
original
made to
physical
appear any
relation defined
and contain
mast
fully
updated
of relations
that we
information.
We turn now to a methodology.
Methodoloay.
There are
basically three categories
have expressed a desire for in the above discussion:
physical relitions
in
the physical
(centralized)
data base,
logical
(user defined)
relations,
and
special physical efficiency-oriented relations.
The terminology to be used here
to be consistent
literature):
is
as follows
(and intended
with the current terminology
found in the
CHAPTER III
.
-
relations
real
physically in
user-defined
-
mapped
are
exist
that
relations
the data base
* virtual relations
which
those
PAGE 41
by
the
(logical) relations
the
onto
system
real
relations
. derived
relations
are subsets
base.
data
-
relations that
real (physical)
3f the real relations
They exist
primarily
constituting the
for
efficiency
reasons.
We thus have a basic system as shown in Figure 3.1
.
Notice
that the elements of the system shown in Figure 3.1 interact
in a
Figure 3.1
reformatted to yield
can be easily
The same is true of all
a hierarchy.
precisely, they form
specific way; more
Figure 3.2.
other figures in this chapter: they
can be expressed in an hierarchical relationship.
Notice also that
this system has not
eliminated either the
data base administrator or the need for some person (perhaps
again the
data base administrator)
to specify
the initial
real relations. The features of the system thus far are:
.
the ability
system
to
maintained
define
real
virtual relations
relations,
and
on
the
have
the
CHAPTER
III
Figure 3.1
PAGE 42
CHAPTER III
FiAure 3.2
PAGE 43
CHAPTER III
system
perform
mapping
PAGE 44
functions
virtual relation may in fact
More than
relation).
(note
that
a
be identical to a real
one virtual
relation may
be
defined on any one real relation.
. the
the
ability to decide
data base
derived
(the decision being
alzinistrator)
relation for
reasons
made by
to
create a
of
efficiency in
(real)
a
particular application
.
the
ability
for a
virtual
relation
to
contain
domains from more than one real relation.
As can be seen, the enhancements are concerned only with thesystem mapping
functions.
Thus,
users may
define virtual
relations consisting of domains in any (combination of) real
relations, but
may n2t
define additional
domains.
It
is
also important to point out the following:
. primary keys
in virtual relations exist
in the eye
of the user only; they do not necessarily correspond
to primary keys in real relations.
. the
are
set-theoretiz operators
defined in
all
by
that are
required
creation of virtual relations,
are a function only of
the user
Chapter II
for
the
as virtual relations
existing relations. The data
base adminstritor, however, who needs the capability
CHAPTER III
to
in
perhaps
facilities;
additional
form of
the
a
not available to the
command,
DEFINE... or .REATE...
needs
domains
and/or
relations
real
define
PAGE 45
user.
Assuming a
in some real
is
relation somewhere in the data
will
adding
Thus
exist.
real
relations) involves the user interacting
adminstrator.
basically
Zonsist
(or
domains
real
with the data base
data base administrator in
The actions of the
situation would
this
base and there
the data base this new
a decision required as to where in
domain
as real domains
will have to exist
additional real domaims
domains, these
new real
to define
user wishes
following
of the
steps:
1)
Determine
If
sa,
merely
existing real
the user
simply tell
is
he
for some
iifferent name
utilizing a
domaia.
whether
user
the
from
to define
a
synonym equating the two names.
2) If (1)
is
determine
not the Case,
in
which
apply some set
real
relation
of rules to
the
domain(s)
belong(s).
3) Add the domain to that real relation, thus making it
available
relation.
for
use
(Note
in
that
any
this
user-defined
step
may
virtual
require
CHAPTER III
restructuring some real
PAGE 46
Alternatively,
relation.
a
new real relation could be established consisting of
the
new domain,
and the
primary key
relation that should contain that
reguired
decision
each time
must
the new
be
made
real
domain. A join is
domain
by
of the
the
is used.
This
data
base
administrator.)
Step (1) above appears to require some human intervention on
the part of a person such as the data base adminstrator, who
is
familiar
relations.
with the
But major
indeed be formalized,
Provided there
real
global system
relations
portions
and the
of steps
(2)
existing real
and (3) can
and automated.
are same guidelines
(eg:
they
must
third-normal form - see Chapter II),
for the
all
be
then step
maintaining of
maintained
(2)
in
above can
be performed by the system.
In a
similar way, by supplying
expected use to
some information as
to the
be made of this new domain,
the system can
to include this new
real domain in
determine precisely haw
CHAPTER III
the data base -
decision
ie: parform step
analogous
is directly
PAGE 47
Notice also that this
(3).
that required
to
in
the
creation of a derived relation.
0
Perhaps it
would appear that all
automation of the major share of
steps (2)
applying step (3)
the capability of
accomplished by the
and (3)
above is
administrative overhead. But consider
the reduction of some
data
that is
base adminstrator
does not
dynamically,
have
which the
(except perhaps
at
predetermined, discrete time intervals). This means that the
real
can be
relations
so
structured
current system usage and requirements.
of the n:m
mapping capability of the
as to
reflect
the
Furthermore, because
system, these changes
in the real relations - be it mere addition of a real domain
or relation, or a restructuring of existing real relations in any way to the virtual
are not visible
user.
We
may now modify Figure
there is some
relations of the
3.1 to show the
system funation controlling the
the real relations in
Decision Subsystem
data base; namely,
the
appears in Figure 3.3 . If
the Structural
Figure 3.3 were reformatted into
the SDS, which must be available to
the real relation hanilers,
of the hierarchy.
structure of
The modified version of Figure 3.1
(gQ).
an hierarchical diagram,
fact that
would become the innermost level
CHAPTER III
f ilure 3. 3
PAGE 48
CHAPTER III
We
have discussed
methodology for
around the
far in
thus
PAGE 49
a
a
non-technical manner
functions revolving
automating some of the
maintainence of the
real relations of
the data
a structure for the
real relations of
the data
base.
Now, given
base (as specified by the SDS),
and given also the possible
determined by the SDS)
existence of derived relations (also
it becomes clear that there may well be more than one way to
users are
derivation
mapped onto
All requests from
agiinst the data base.
satisfy a request
against virtual relations (or
of virtual
relations),
more real,
one or
some set-theoretic
from above,
which
or derived
are
relations. Once
again some decisions are required in the mapping function to
determine:
. whether
the request
(and legally,
- ie:
is valid
from an access
can logically
control point of view)
be satisfied given the virtual relations involved,
. how the request can be satisfied,
and
. how best to satisfy the request.
The subsystem that
controls these decisions is
Decision Subsystem
(RD2).
The RDS
the Request
is responsible
also for
CHAPTER III
PAGE 50
determining how well it is doing
in terms of efficiency. if
the
performance
RDS
(perhaps
decides
as a
thit
result
trigger the SDS
in
as
shown
of changing
system
an attempt to restore
acceptable level. We
RDS,
system
in
is
usage) it
3.4
.
corresponding hierarchical diagram is
will
performance to an
thus modify Figure 3.3
Figure
degrading
Its
to include the
position
in
the
self-evident from this
figure.
The
discussion
non-technical in
this
in
chapter
nature in
global functions and
an attampt
the techniques employed
virtual relations are
purposely
in
been
to demonstrate
interactions within the system
subsystems - the
major decision
has
SDS and
the
of the
RDS. Furthermore,
implementing both the
of no consequence to
real- and
the discussion,
and have no impact on the methodology proposed.
that the
Finally, notice
ilentical
in
their
requests against
The
requests
set-theoretic
virtual-,
real- and
conceptual
underpinnings,
either are made
used throughout
operations
or derived
on
virtual relations
in a
will
be
relations,
relations.
(These
are
thus
and
consistent fashion.
in the
be
format
they
of
real-,
operations are
as
CHAPTER
III
PAGE 51
D PLIONS
m
Fi.qure 3. 4
on
a" a -m a*N
CHAPTER
defined in
Chapter II.)
that a user
mapping
III
PAGE 52
This does
not of
necessity imply
will employ set-theoretic requests
from
a
higher-level
request
directly; a
language
to
a
set-theoretic algebra is a well-understood, and conceptually
simple operation.
Figure 3.4
(See
(6))
might involve
Thus our hierarchical
an additional
user and the virtual relations;
view of
layer between
namely,
the
a request language -
to - set theoretic operation mapping facility.
What we have presentel in this
chapter is a methodology for
achieving true data independence and providing the user with
powerful
relations.
facilities
At
centralized data
by two decision
the
for
defining
same time,
base concept.
application-specific
however,
we
The methodology
subsystems which assume some
preserve
is
the
enhanced
of the system
structuring responsibility.
We
proceed now
subsystems.
to a
detailed inspection
of the
decision
CHAPTER IV
PAGE 53
Decision Variables and Truth Functions.
This chapter describes the naming
conventions to be used in
subsequent sections for the naming of decision variables.
Since
there
variables,
for
is
it
a
was
naming
them.
(Backus-Normal
rather
large
set
of
decided to establish a
This
Form)
method
is
these
decision
consistent method
presented here
format, along
with
the
in
BNF
appropriate
explanations.
Note
that
all
'$<number>'.
generate
decision
This signifies
that name.
variable
the BNF
A reference
names
rule
begin
with
number used
section containing
a
to
these
numbered rules appears as Appendix 3.
<relation id> is an XRM-assigned
internal ID; <domain #> is
the position of a domain within a tuple.
Rule# Rule
CHAPTER IV
1
PAGE 54
<qualifier>: := $1<relation
<category> <qualifier
type> <unit> <request>
type>
<join
info>
<options>
These are variables
domains
in
the
containing statictics about the
capicity
of qualifiers
selection criteria that appear in
in
the
use of
list
a request.
<relation type>::= <virtual>I<real>J<derived>
<virtual> ::= v
<real> ::= r
<derived> ::= d
<unit>: :=<doma in> I<relation>
<domain> ::= d(<relation id>,<domain#>)
<relation> ::= r(<relation id>)
<request> ::= <retrieve>l<update>l<insert>l<delete>
<retrieve> ::= r
<update> ::= u
<insert> ::= i
<delete> ::= d
<category>::=<siaple>l<compound> l<non-specific>
<simple> ::= s
<compound> ::= c
<non-specific> : := n
<qualifier
type>::=<equality>g<nonequality>
<unspecified>
<equality> ::= e
of
CHAPTER IV
PAGE 55
<noneguality> ::= n
<unspecified>
u
<join info>::=<join>J<nojoin>
<join> ::= j
<nojoin> ::= n
<options>: :=<null>I<index> I<no index>
<null> ::=
<index>
i(trss)
::=
(trss='total
resloved set size')
<no index> ::= n
ExamPge:
$lrd(i,j)rsen
j-th
times the
number of
is the
domain of real relation i is used as a simple (ie: the only)
qualifier
equality
in
a
retrieval
constraint. No
request, and
join
was
was
needed to
used
as
satisfy
an
the
request.
The
<relation
type>,
<unit>
and
should
<request>
be
self-evident. For <category>, if there is only one domain in
the list of selection criteria,
then the <category> is 's'.
In the event that there are several domains in the qualifier
the
<category>
is
'c,
and
retrieval from the relation, the
non-specific).
if
there
is
a
sequential
<category> will be 'n' (or
CHAPTER IV
For <qualifier type>,
PAGE 56
a constraint in
a qualifier can be of
essentially two types:
. an equality constraint,
such as 'age=26',
. an inequality constraint, such as 'income > 20,000'.
If
there is
'n')
no constriaint
(as is
the case when <category> is
then the <gualifier type> is
<join info> will be a
- or unspecified.
'u'
in the event that there was a join
'j'
required in the reslowing of the request,
and it
will be 'n'
if no join was necessiry.
<options> will be null in the
If
<category> is
'c',
event that <category> is
however,
whether some
other domain in
inversion in
the real relation.
'i'
-
or
size of
when
then
the
<options> will
list
of qualifiers
If so, then
'index', and the system will also
the resolved set (the
those
domains
restriction).
domains in
If
with
there was no other
are
show
had n
<options> is
store the total
set of entries
indexes
's'.
used
domain in
that results
first
the
in
list
the selection zriteria, then <options> is 'n'.
a
of
CHAPTER IV
PAGE 57
Rule# Rule
2
<retrieved
<relation
object>::=
type>
<unit>
<request> <abject> <join info>
This is a
set of variables that will
contain statistics as
to the use of domains as the objects of a request.
<relation type>::=<real>j<virtual>l<derived>
<real> ::= r
<virtual> :=v
<derived> ::= d
<unit>::=<domain>J<relation>
<domain>::= d(<relation id>,<domain #>)
<relation> ::= r(<relation id>)
<request>::=<retrieve>i<update>I<insert>I<delete>
<retrieve> ::= r
<update> ::= u
<insert> ::= i
<delete> ::= d
<object>::=
<simple>
I
<compund>
I <entry>
(<aggregate>) <object>
<simple> ::= s
<compound> ::= c
<entry> ::= e
<aggregate>::=SUMIAVIMAXIMINICOUNTIUNIQUE
<join info>::=<join>I<nojoin>
-<join> ::= j
CHAPTER IV
PAGE 58
<nojoin> ::= n
Exampie
1) $2rr(i)rej is the number
is
of times a whole entry
real relation i,
retrieved from
when
a join
was necessary to resolve the request.
2)
$2vd(i,j)r(AV)sn is
the number of times that the
average of the values in oly
's')
j
domiin
retrieved; no
of
(since <object> is
virtual
relation
i
is
joins were needed to
resolve the
rule is similar to <category>
of rule #1.
request.
<object> in this
The value of <object>
will be s if this is
the only domain
specified for retrieval (or update) in the request. If there
are
several
domains
specified,
<object> may also be in
then
<object>
is
'c'.
<aggregate> if the individual items
were not required, but some aggregation of them was.
Rule# Rule
3
This
<joins>::= $3 <relation type> <domain> <domain>
type
of
variable
involvement of relations
of times a
maintains
statistics
in joins. It specifies
relation was joined to some other
specific domain.
about
the
the number
relation by a
CHAPTER IV
PAGE 59
<relation type> ::= <virtual>t<real>t<derived>
<virtual> ::= Y
<real> ::= r
<derived> ::= d
<domain>::= d(<relation id>,<domain #>)
relation i
the number
is
$3rd(i,j)d(k,m)
Example:
j
was joined by domain
times
of
that
to domain m of relation
k.
Rule# Rule
4
<system data>::=$4<system variable>
These variables
store information
about system
parameters
and costs.
<system
variable>::=<block
factor>
blocking
size> I
<cost-per-I/O>
I
factor>
I
<index
I
blocking
<relation
<cost/byte/day>
<overhead per call to XRM> I <time period>
<block size>::= p
<index blocking factor>
::= bfx(<domain>)
<domain>::=<relation id>,<domain #>
<relation blo.king factor> ::= bfe(<relation id>)
<cost-per-I/O>::= io
<cost/byte/day>: := sc
<overhead per call to XRM>::= opc
CHAPTER IV
PAGE 60
<time period>::= t
In XRM,
bfx(i,j) is
Vk. So,
for our
constant
purposes we
'bfx' and 'bfe'.
Vi,j,
and bfe(k)
can refer
is constant
to them
The <time period> 't' will
simply as
be the length
of time since the last restructuring of the data base by the
SDS. All
SDS decisions
are based on
last restructuring occurred, and so
the period
since the
decisions will be based
on this time period.
Rule# Rule
<relation data>::=$5<relation variable>
5
These
variables are
used
to
store information
regarding
relations.
<relation
variable>::=<degree><type>
I
<cardinality><type>
<degree>::= #d(<relation id>)
<cardinality>::= cy(<relation id>)
<type>: :=<real>I<virtual>I<derived>
<real>::= r
<virtual>::= v
<derived>::= d (<method of derivation>)
<degree>
is
th e
number
of
domains
in
the
relation;
CHAPTER IV
PAGE 61
<cardinality> is the number of entries in the relation.
Rule# Rule
<domain data>::=$6(<domain name>)<domain variable>
6
are usel
These variables
statistics
for maintaining
about
domains.
<domain variable>::= <# unique values>
<# unique values> ::= q
Eramp1e $6(state)q is the number
in domain 'state'.
be found
relation in
Notice that for
$6(i)g is the same
are numeric,
which
of unique values that will
domains that
as the cardinality
domain i appears.
of the
character strings,
For
may be anything from 1 to the cardinality of the relation
it
in which it
appears.
Rule# Rule
7
<user information>::=$7<user variable>
<user variable>: :=<response time weight factor>
<response time weight factor>::=r
The
<response
preference for
time
weight
factor>
how the response time
is
a
is to be
user-supplied
weighted in
CHAPTER IV
structuring
decisions.
inclusive. For
will be
purposes of
0.5, which is
the variable may
'$7r'
it
to
all
of '$4sc'
in
a
value
from
this thesis,
be taken into account
'$4io'
value.
thru
1
of $7r
However,
by merely appending
and '$4opc'
and by appending
0
the value
essentially a null
cases where
decision rules,
is
PAGE 62
'(1-$7r)'
appear
in
the
to all instances
the decision rules.
Truth functions.
In addition to
set
truth
of
conditions.
The
the statistical variables above,
functions
used
names of these
to
test
is
'0'.
tests
For example,
for
negativity,
if
T(i)
then
specific
truth functions
all begin
'1' if
applying
true; otherwise the value
were a
if
a
for
with: '$8'. The value of a truth function is
the function to a spezifiz case is
these is
i<0,
truth function
T(i)=1,
that
otherwise
T(i)=0.
The truth functions employed here are presented below.
that the <type> of a relation
(Note
(real, derived or virtual) is
not important in applying truth functions.)
CHAPTER IV
j
$8d(ij)
domain
$8i(j,k)
domain k in relation
PAGE 63
appears in relation i
j
is inverted. (For virtual
relations, $8i(j,k)=0 always.)
$8p(i,j)
domain
domains of
primary key
one of the
j is
relation i.
relation i is
$8x(i)r
relation i
$8x(i)d(<method>)
<method> is
is a
derived relation,
If
of derivation.
the method
a
involve
not
did
a real relation.
and
<method>
restriction,
then
<method>::=<nill>.
$8n(i,j)
domain j of relation i
must be proviled for this
is
le: a value
mandatory.
domain before an entry in
relation i will be made.
Note
$8n(i,j)=1
*j
(Primary key
where $8p(ij)=1.
domains are mandatory.)
domain
$8u(j)
$8r(i,j)
j
contains unique values
same as $8n(i,j) except that
Also notice
name.
$8d(i,j) Thus
that $8r(ij)
this is
a
(eg: socsec_#)
it refers to a role
is
a subset
truth function
of
that tests
whether a role name is in relation i.
$8_(j)<data type><starage strategy>
<data type>::=<character> I
<bit>
<character>::=
c
<fixed> I <float> I
<vector> I
CHAPTER IV
PAGE 64
<fixed>::= x
<float>::= f
<vector>::= t(<size>)
<bit>::= b
<storage
I <real
strategy>::=<virtual>
I
encoded>
<real
unencoded>
<virtual>::= v
<real encoded>::= e
<real unencoded>::= u
This set
domain
is
j.
to test
functions is
of truth
For exmaple,
the data
type of
$8_(name)ce=1 then domain 'name'
if
an encoded character string.
is a
$8f(ImI,lnI)
of
the
truth function that tests
whether each
are
functionally
in
domiins
list
Inl
dependent on the whole list ii.
Note 1)
List Imi is
members
which
of
not a
list of all
list
ini
are
domains on
functionally
dependent. Each n'4lnj may be functionally dependent
on some 1xlx/Im
2)
If
also.
In I=O
(ie:
is
empty)
then
$8f(ImlIni) =0.
$8m(lpi,iq)
1i
is a function that tests whether lists IpI and
are
autually
dependent.
Ie:
PAGE 65
CHAPTER IV
and also
$8f(Ip ,PqP)=$8f(IqjqpI)=1,
$8m(lpI,IqI)
implies $8m(IgI,Ip1).
Transitivity also holds: $8m(Jpgjgj)=$8m(Iqi,IsJ)=1
implies that S8m(ptIsI)=1.
which tests whether q
$8c(IpI,g) (<function>) is a function
(note
is
that q
p' then q is computationally
defined as Iq=6.3 *
p. (<function>)
on
dependent
a truth function set
is
'1' if
computation
the
is
required to derive q from the list
$8od(k)
domain q
Ipl. For example, if
dependent on domains
is
computationally
is
a list)
not
of domains Ipf.
up for a request.
It is
domain k appears as one of the object domains
in the request.
is a truth function used in requests. It is '1' if
$8eq(k)
domain
k appears
qualifier with
a
as
<qualifier
type> 'e'.
is similar
$8nq(k)
to $8eg(k) except that
the <qualifier
type> is not 'e'.
Note that
binary
truth functions may
relations
function)
where
(depending
existence
signifies 'true', or '1'.
be implemented as
on
of an
the
entry
unary and
particular
in
the
truth
relation
CHAPTER IV
The complete list
of iecision
PAGE 66
variables and truth functions
that are used in this collection of decision rules is
listed
in Appendix 3.
In
addition, the
following notation
will
be employed
in
subsequent chapters:
. an
'*'
appearing in any decision
variable name means
the sum of all the possible replacements of the **.
For example:
$lvd(i,j)*sej
$lvd(ij)rsej
=
+
$1vd(i,j)dsej
+
$lvd(i,j)usej
($lvd(i,j)isej is
not used.)
Or:
$lrd(i,*)rsej = SUM($lrd(i,k)rsej)
. a
''
in
a name
means the
replacements for the
for k=1,...$5#d(i)
product of
all possible
'%'.
For example:
$lvd(i,j)rseg = $1v(i,j)rsen.$1vd(i,j)rsej
. a list
between two * I'
(vertical bars)
means the sum
of that list.
For example:
$lvd(ij)Id,ulsej
=
$1vd(i,j)dsej + $lvd(i,j)usej
CHAPTER IV
The reader
is
advised to become
PAGE 67
familiar with the
7 rules
and the various truth functions to avoid continual reference
to Appendix 3 and thus to expedite reading.
PAGE 68
ZHAPTER V
The Structural Decision Subsystem
SDS .
This chapter presents a detailed exposition of the SDS.
The SDS is
charged with the responsibility for:
. maintaining the database in third normal form,
.
structuring the
real relations
current system usage is
in such
a way
that
most efficiently serviced,
. modifying any system descriptor
tables to reflect any
change in the structure of the real relations.
Since
we
are
not
implentation here, we
system
descriptor
concerned
specifically
will not address the
tables.
This is,
with
any
modification of
nevertheless,
a
SDS
function.
As outlined in
is a
Chapter III, the creation
privelaged operation.
indefinite number
of virtual
While any
of real relations
user may
relations, the
define an
disorderly or
random definition of real relations would ultimately destroy
the centralized nature, and cohesiveness of the data base.
CHAPTER V
The SDS is
PAGE 69
concerned only with real
since virtual relations may,
by
relation restructuring
definition,
by the user that defined them. However,
virtual relation use for purposes
only be altered
the SDS must examine
of making decisions about
derived relations.
There
are two
separate
points at
which
the
SDS can
be
invoked:
. at definition time,
and
. during the life of the system - or dynamically.
In
case,
either
namely,
to
the
'best'
function
structure the
simply one
variables.
of the
is
database for
identical;
its
source
of values
expected
of SDS use
for the
decision
At definition time, values for decision variables
of truth functions) are
(and the definition
supplied
the SDS
between these two ocasions
use. The distinction
is
of
to
the
SDS.
continuous monitoring
use, which become
the time the SDS is
At
any
time
thereafter
provides accurate
the values for the
invoked.
exogenous,
records of
and
however,
actual
decision variables at
ZHAPTER V
It is important to note that the
no
way dependent
values. Given a
an the
PAGE 70
operation of the SDS is in
source of
set of values,
the decision
the SDS
variable
can operate.
Thus,
the stage of system life (definition, or subsequent thereto)
does
not
predetermine
any
particular
operations
to
be
performed by the SDS that are not required at other stages.
The SDS presented here is
cost
centered.
Ie: it
attempts at
all times to minimize cost as opposed, for example, response
time.
However,
in light of the
fact that response
often the most crucial factor for many users,
<response time
weight factor> provided
time is
there may be a
by the
user ($7r).
If none is supplied, the lefalut is 0.5 (which is in effect,
null). All decision rules here assume that $7r=0.5
We proceed now to the SDS proper.
5.1
Maintaining thir
normal form.
The algorithms within the SDS for maintaining real relations
in third normal form are driven
of the variety presented in
by a set of truth functions
Chapter IV.
CHAPTER V
Every domain mast appear either
domain
in some
truth
PAGE 71
as a functionally dependent
function,
or
some
domain on
which
others are functionally. dependent.
It
the
is
responsibility
co-operation
these truth
with the
data base
functions. The
system can be)
data,
and
the
user
system is not
the
In
order to
interrelations, and
the data
will employ network-like
It
(and in
in
define
fact no
without substantial
do so,
the user must
peculiarities
base administrator
education in this function.
(perhaps
administrator) to
able to third normalize
user-provided information.
understand
of
is
of
the
responsible
for
is envisioned that the user
diagrams to aid in
this task (See
the scenario of Chapter VII).
Note that the user is
not
asked to provide third normalized
relation definitibns per se, but rather the information that
will enable the
This
sstem to define third normalized relations.
distinction
(possible) dynamic
real domain is
to
is
important
nature of the
if
considers
real relations. If
be added to the data base,
required as to which real relation it
knowledge that the
one
a new
a decision is
belongs in.
data base administrator has
the
Given the
of the data
CHAPTER V
base, he is
he
would
the clear candidate for making the decision, and
simply
relation.
Notice
data, base
relefine
and
that this is
administrator,
restructure
the
alternatives
the
to
affected
onl.Y course open
whereas the system,
adequate information would be in
consider
PAGE 72
to the
provided with
the position to dinamically
the
and
full,
expensive,
restructuring of the affected relation.
It
is
thus
to
deemed preferable
SDS to
allow the
and to
information,
order that dynamic molifications
The
algorithm for
Suffice it
Appendix 1.
provide
necessary
third normalize
in
to relations be efficient.
is
third normalization
detailed
given a
here that,
to say
the
in
set of
functional- and mutual dependencies, the system can generate
a database
third normal form.
(real relations) in
computational
dependencies
are
not
Note that
considered
when
third-normalizing.
Now,
assume that some new real domain
database.
for the
set
By having the
new domain,
3f real
normal form is
is
to be added to the
functional- and mutual dependencies
the system
relatioas, this
can determine where
in the
belongs if
third
new domain
to be preserved.
Furthermore,
it
is
able to
CHAPTER V
determine the most
PAGE 73
efficient method of including
it in the
data base.
We
proceed now
assumption
to
that
lecisions made
third
by
normalization
the
SDS under
the
has
occurred,
and
resulted in a set of real relations of the following type:
<name>(<list of domains>) (<list of candidate keys>)
The primary
key will
fewest domains.
specified
become the
<candidate key>
This maximizes the
for
all
primary
key
with the
chances of having values
domains
in
selection
criteria.
For example:
RR1(A,B,C,D,E) ((AB), (C,DE))
The primary key that tould be chosen here is
5.2
(A,B)
Structuring Decisions.
These
decisions
implementation
are
of
the
basically those
that
relations specified
determine
by
the
the
third
normalization process. These decisions are:
1) Encoding or virtualizing of domains
2)
Indexing decisions (the creation of inversions)
3) Factoring decisions '
4)
Decisions to join permanently
into a single relation
CHAPTER V
PAGE 74
any two relations that have the same primary key.
One
of
the
structuring
decisions
subsequently dismissei was that of
third normal form)
considered,
and
replacing a relation (in
by two or more of
its
projections.
The
factors that are involved in any such decision are:
a) the
cost of
transporting little-used
relation to primary memory each
relation
is
ased
(A case
domains of
a
time any part of the
for
splitting
up
the
relation)
b)
the
overhead
relation,
involved
in
and duplicate
maintaining
copies of
an
extra
some domains
(A
case aginst splitting)
c) The overhead iavolved in
one
of the
performing a join each time
domains split
off is
required for
any
reason (A case against splitting the relation)
d)
Pjggible
reduced
storage
(A
marginal
case
for
splitting the relation up)
However,
since the co3t of
(once the entry has
primary memory
is
required
anyway)
is
costs of
value. There
are occasions
may
be
been located,
and an I/O
that it
be clearly
so minimal
dominated by
projection
transporting unneeded domains to
(b)
smaller
and
(c).
in
which
than
(d)
will
is
an
uncertain
the cardinality
that
of
the
of a
original
CHAPTER V
relation,
but this is
never certain.
As such, it appears that the
by two or more of its
PAGE 75
decision to replace a relation
projections will never be made,
and so
was not included in the SDS.
5.3
Encoding and virtualizing decisions.
5.3.1 Virtualizing De-isions
A virtual domain
is
one that is
not
stored physically,
but
rather is computed each time it is required from the domains
on which it is computationally
updating a
virtual domain
virtual nature of the domain
dependent. Also, notice that
is
not
is,
a legal
operation.
The
by definition, not visible
to the user.
A
domain
is
a
canlidate
computationally dependent on
is a
for
virtualization
a set of other
it
is
domains; Ie: p
if $8c(IaI,p)=1.
Any of
functionally dependent (ie:
j where
candidate for virtualization
the domains on which it
if
j<lpi) may also be virtual, but
there must be a restriction
CHAPTER V
PAGE 76
to prevent circular camputational dependencies.
If
Namely:
$8c(IrI,P)=$8c(iyji)=1 where a(IrI then:
$8c(jxj.b)/1
Vb4*yi,
Vii
where p4Ixt
The decision rule is:
For a domain that is
currently virtual,
(cost(making domain
real) + cost(use
If:
real) +
if domain is
cost(maintaining domain if real))
<
(cost(using the domain if
virtual))
then make it real. Otherwise leave it as virtual.
Note that the cost of maintaining a virtual domain is 0.
the event that a
domains
modified,
on
domain is
which
the
it
domiin
clearly not the case if
is
real,
then each time
computationally
itself must
the domain is
be
any of the
dependent
modified.
This
virtual.
Similarly, if the domain is currently real, if:
(cost(virtualizing) + cost(use if
(cost(use if
domain virtual))
domain real) + cost(maintaining if
<
real))
then virtualize the domain; otherwise leave it as real.
Separating out the various costs mentioned above, we get:
5*
3
1
tJ.l
Cost (making domain real)
In
is
is
ZHAPTER V
in
the relation(s)
of
serial processing
involves a
This
PAGE 77
on whibh it is
computationally dependent
exist, and computing the value of
the virtual domain. It is
which the domains
to the
appended
then
database
new
in the
relation,
form. Thus
in
the
are basically
two
written out
and
there
steps:
locate the domains on whizh it is computationally dependent,
and compute and store the value.
Assuming that the domain in
question
domain d,
is
there are
two possibilities when $80(tpI,d)=1 :
Vj4IpI and $8x(i)=1 for that i, or
a) $8d(i,j)=1
b) $8d(i,j)/1 for some j<IpI, and a single i
In
no joins
case (a),
domains on
which d is
all in relation i.
of
ca'se (b),
Case
(a)
however,
is
there will be at least one join
the general
algorithm
capable
of
fact,
case.
determining the
would
algorithm is
Decision Subsystem
form of
If
(RDS)
case (b),
there were
cost
domains in list Ipi for
then case (a)
such an
and very possibly
of Ipi,
really a special
in fact
(case(b))
the
and computing
relation i
which is,
retrieval for all
they are
Computing the value of d consists simply
necessary to retrieve all members
several.
the
when retrieving
computationally dependent;
tuple from
retrieving a
value. In
are necessary
of
a
some
serial
the general case
be automatically included.
also required
in determining
by the
In
Request
the cheapest way of
CHAPTER V
resloving a
how to
request.
The concepts involved
optimally retrieve all
the desired function.
PAGE 78
domains required
to perform
This algorithm is thus common to both
this situation, and the RDS operations.
detailed in
are identical:
Chapter IV.
For our
As such it has been
purposes,
it
is
enough to
note that the invocation of the algorithm, given the list of
domains
Ipi, will result in
way of
performing the
method the 'final'
request.
will call
algorithm, and
will be the 'cost (final) '.
+ ($5cy(i)r/4bfe) .$4io
assuming that the
the cost
virtual domain,
is
real domain d is
the
cost of
+ $5cy(i)r.$4opc
inserted
plus the cost of
writing it
throughout all future lecision rules.
the fact that relations are blocked,
retrieve (or
necessarily
insert, update
result
in
in
computing the
i. The component ($5cy(i)r/$4bfe).$4io
to
the cheapest
the computation of cost(making domain real) becomes:
cost(final)
Ie:
We
method determined by the
the cost of performing it
And so,
a cost estimate for the cheapest
a
value of
out in
the
relation
will become familiar
It takes into account
and that not each call
or delete)
real
relation i.
I/0.
an entry
will
The
cost
component'$5cy(i)r.$4opc' covers the cost of the overhead in
CHAPTER V
each call to XRM.
In this way,
PAGE 79
we take account of the fact
that each call involves some expense, but not necessarily an
I/0. This will be found in most subsequent decision rules.
5.3.1.2
This
is
storage,
Cost(use if domain real)
basically
comprised of
plus the cost of retrieval
the domain is
the
cost
of
additional
(or other operations) if
real.
cast (additional storage) = 4.$5cy(i) .$4sc.$4t
since each domain in IRM is
At this
point, the system
a fullword domain.
would make a
decision regarding
whether this domain should have an index (see 2 below) - ie:
would determine whether $8i(ij)=1.
If
$8i(i,j)=1 then:
cost(use if
domain reil) = (
+
$lrd(i,j)**e**.($4io
+ $4opc)
($lrd(i,j)*sn*+$lrd(i,j)*cn*n)
(($5cy (i)r/$4bfe).$4io
$5cy(i) r.$4opc)
+
(trss/$lrd(i,j)*cn*i(trss))
($4io + $4opc)
)
+
CHAPTER V
If
PAGE 80
$8i(i,j)=0 then:
cost(use if
domain rel) =
((trss/$1rd(i,j)*cn*i(trss))
($4io + $4opc)
($1rd(i,j)*s**+$lrd(ij)*c**n).
+
(($5cy (i) r/$4bfe) .$4io
$$cy(i)r.$4opc)
Thus, in
is an index, any
the event that there
case where
domain j is used as a qualifier in a <qualifier type> of 'e'
selection
index.
criterion, it
simply a
is
For <qualifier type> of 'n',
the
resolved
selection
set, after
using
the index is of no use,
indexed,
was
criteria
that
needed. If some other domain
and some serial search will be
in
of using
case
index,
that
only
then
need
be
the
serially
searched.
If there
cases, except
those where
other lomais in the request with
an index, a
index, then all
is no
there is some
serial serach is
5.3.1.3
The
required.
cost (maintaining domain if
cost of
maintaining the
real)
domain if
it
is
real is
an
CHAPTER V
PAGE 8 1
additional update each time any of
the domains on which the
new real domain is z-omputationally
dependent and in another
is
domain on
which j
same relation,
in any
updated
relation,
then
is
way.
dependent is
computationally
there is
because if
This is
no additional I/0,
in
a
the
or call to
XRM.
For domain j of relation i:
cost = ( $2rd(ik1)u**.($4io + $4opc)
)
Vm*IpI where $8c(IpIj)=1 and $8d(i,m)=0.
5.3.1.4
There is
cost(virtualizing)
a choice
deleted from
being virtual,
as to whether
the relition or whether
cost(virtualizing)
If
it
is
it
and not physically removed.
not physically removed,
is
the domain
just
is
If
physically
marked as
the domain is
then:
= 0.
physically removed,
cost(virtualizing) = 2.(
then:
($5cy(i)r/$4bfe).$4io
+ $5cy(i)r.$4opc
)
for i where $8d(i,j)=1.
le: the
process of
removal involves
serially reading
and
ZHAPTER V
then writing
(with the
PAGE 82
domain removed)
each entry
in the
relation.
remove the domain or not
The decision whether to physically
is:
If
deletion) <
cost(physical
wasted)
cost(storage
then
physically delete the domain.
cost (physical deletion) is
as above.
cost(storage wasted) = 4.$5cy(i) r.$4sc.$4t
Thus
the
a
two-tierred
virtual domains are illegal;
inserts and
cost(virtuailizing)
decision
is
decision rule.
5.3.1.5
cost(use if
(Note: updates of
domain virtual)
deletes are unnecessary. Thus only retrievals and use of the
domain as a qualifier are permissible for virtual domains.)
The
cost(use if
doaiin virtual)
is broken
types of use:
. use as a qualifier
. the object of a retrieval request.
5.3.1.5.1
cost(use as a qualifier)
down into
two
CHAPTER V
value of
processing and
a serial
This involves
PAGE 83
computation of
and checking
for all entries,
the virtual domain
the
that value against tha criteria specified in the qualifier.
The 'cost(final)'
same as
is the
that described
in 1.1.1
above.
For cases where there was some other domain in the selection
criteria
that
the
was indexed,
is
searched
serially
size
of
the set
(on
to
be
average)
(trss/$lrd(i,j)*c**i(trss)).
cost(use as
a qualifier) =
(
cost(final).($1rd(i,j)*s** +
$1rd(i, j)*c**n)
$1rd(i, j)*c**i(trss) .cost (final')
)
where cost(final') is
all instances of
except that
the same as cost(final),
$5cy(i)r in the algorithm
are replaced by
'trss'.
5.3.1.5.2
cost (retrieval)
cost(retrieval) = (
$2rd(i,j)r**.cost(final)
)
This ends the discussion of
PAGE 84
V
CHAPTER
virtualizing decisions.
We move
on now to encoding decisions.
Encodinqg Decisions
5.3.2
Encoding decisions
are made only
for domains which
have a
data type of 'character';
le: where $8_(j)cu=1
For
purposes of
coding scheme.
voluminous,
This was
as the
The scheme
character
that will
2).
unless the
is
number of
(ie: $6(j)g is
one
becoming to
coding schemes
the purpose here
be employed here
strings
which case that
if the
to avoid
possible
Furthermore,
consider only
is
is not
but rather to present an approach.
! (log($6(j)g)/log
integer,
will
done simply
number of
potentially infinite.
to be complete,
thesis, we
this
as
bit
('1'
means here
is
the
strings
the
expression evaluates to an
the value used.)
unique values in
constant)
The encoding decision becomes:
This may
the domain
encoding of
of
next
length
highest
integer,
in
only be done
is constant
CHAPTER V
If the domain (say d) is
PAGE
85
not currently encoded, then encode
it if:
( cost(encoding) + cost(use if encoded) ) <
( cost(extra storage) + cost(use if unencoded) )
Similarly, if it is carrently encoded, then decode if:
(
+
cost(decoding)
cost(extra
storage)
+
cost(use
if
decoded) )
< ( cost (use if encoded) )
Breaking down these casts into the individual components:
5.3.2.1
The
cost
cost(encoding)
of
relations in
processing of all
replacing the id
of the required
domain d
encoding
consists
of
which domain
d appears, and
of the character string with
Additionally,
length.
serial
the
there is
a bit string
the cost of
building, and maintaining the encoding relation.
cost(encoding)
=
(
2.SUM(($5cy(i)r/$4bfe).$4io
+
$5cy(i)r.$43pc )
+ ( ($6(d)q/$4bfe).$4io
Vi
such
that $8d(i,I)=1
ie:
+ $6(d)q.$4opc )
all
relations in
)
which
d
CHAPTER V
PAGE 86
appears.
cost(decoding)
5.3.2.2
and storing the actual values
The cost of decoding a domain
rather
than the
values is
identical
cost(encoding)
see 1.2.1
encaded
to that
of
encoding.
Ie:
cost(decoding)
=
5.3.2.3
cost(use if
encoded)
The way
an encoded
domain is
used is
to employ
value as a primary key for the encoding relation.
the code
This means
that each time the domain is the object of a retrieval or an
update, or each
be
an additional
time it is used as a
1/3
and an
retrieve either the code or
qualifier, there will
additional
call
to
the value (depending on whether
it is being used as a qualifier or is the object).
cost(use if
to XRM
Thus:
encoded) = 2.(Slrd(i,d)***** + $2rd(i,d)ir,ul**)
CHAPTER V
*
PAGE 87
($4io + $4opc)
5. 3. 2. 4
cost (use if
anencoded)
of the domain
Since use
if
encoded involves
an additional
I/O and call to XRM each time the domain is used, it follows
that the
use of the
1/2 of
unencoded should be
lomain if
Thus:
that if encoded; the additional retrieval is avoided.
cast (use
if
uneacoded)
$2rd(i,d)Ir,ul**)
5.3.2.5
The cost
.
=
(
$1rd(id)*****
($4io + $4opc)
cost(extra storage)
of the
additional storage
unencoded values will be the
required if
the domain is
difference between the storage
unencoded, and that
the domain is
store the
required to
required if
encoded.
The cost of storage if unencoded will be a fullword for each
entry in each relation in
cost(storage
if
where $8d(i,d)=1
which the domain appears.
unenzodel)
Ie:
=4.$4sc.$4t.SUM($5cy(i)r)
Vi
CHAPTER V
PAGE 88
The cost of storage if the domain is encoded will be:
relation - about 100 bytes
. the overhead for the extra
in XRM
. two fullwords per entry
first
being the code,
in the encoding relation; the
and the second the actual value,
and
. the sum of all the space in each relation in
which the
domain appears.
Ie:
cost(storage if
encoded)
=
(
(SUM(!(log
8.$6(d)g + 100)
.
$6(d)q/log 2)
+
($4sc.$4t)
Vi such that $8d(i,d)=1.
The additional storage is thus
the difference between these
two. Note that the additional storage may be negative in the
event that
there is
only a
small set of values,
given the
overhead. This would not alter the decision rule in any way,
as it
would way
should happen.
in favor
of not
encoding,
which
is
what
CHAPTER V
Thus:
PAGE 89
cost(extra storage) =
(
4.SUM($5cy(i)r
(log $6(diq/log 2)
(SUM(!
-
+ 8.$6(d)q
+ 100)
)
.$4sc.$4t
This
domains.
54
lecision
concludes the
regarding encoding
rules
of
We proceed now to indexing decisions.
Indexing Decisions.
In XRM, as described in Chapter II, any domain in the system
may have an inversion, or index, created for it.
When there
is an inversion on some domain, and that domain is used as a
qualifier
with <qualifier
extremely rapid.
themselves
type> is
taken
here
may be
qualifiers of type 'n'.
when the
not address such
we are interested
easily
or
in that domain
schemes that
some
for gualifiers
but we shall
For our purposes,
approach
There are
to indexes
'n',
then retrievals,
with the specified value
locating of tuples
is
type> 'e',
address
<qualifier
schemes here.
in an approach,
and the
to
include
extended
CHAPTER V
PAGE 90
In XRM indexes are built in a specific way. Specifically, an
index entry
'value/id'
is a
primary key.
Indexes are
Given a value,
it
is
pair,
where
the value
implemented as
is
the
binary relations.
used as the primary key
to locate the
id of a tuple containing that value in that domain. However,
the use of a value as
There is
a primary key has severe limitations.
no reason why
entries, and
in
a value
should not appear
fact, that is
the case of primary keys).
usually the case
There
in many
(except in
is thus some method needed
which allows for this factor.
The method employed in XRM is
entries that
have any
Thus,
while there
bytes)
for each
pair,
we
now
'id/pointer'
'id'
additional
have
by
ISAM,
decision rules,
index
with the start
'value'.
storage
specified domain.
be two
consisting
of the chain
There
required
is
index is
(8
'value/id'
consisting
of
a
having the
therefor,
per chain
over
one
the
Furthermore, while
several levels of indexing,
the XRM
fullwords
of a
entry
relation implementation.
many schemes have
found in
each
a
in the
ordinarily
index entry,
word of
strict binary
one value
would
pair,
replaced
to chain together all id's of
such as that
only one level
deep. The
however allow for a multi-level index.
CHAPTER V
PAGE 91
The decision rule for indexing is:
+
( cost(storage)
If
Zost(projected
use with
index)
+
cost(building index) )
< cost(projezted use without index)
then build an index.
already exists
Similarly, if
an index
eliminate the
'cost(building
index) '
for a
domain, then
part of
the decision
rule.
The projected use of the domain is based on that experienced
in
the preceding
in
assumption
continue
all
again be
$4t. There
rules
decision
these
which
unchanged,
assumption to make.
will
time period:
is,
in
fact,
In the event that use
invoked,
and will
proceed
is an
that
a
implicit
use
will
reasonable
changes, the SDS
under the
same
assumption.
We proceed now to breik down
the components of the decision
CHAPTER V
PAGE 92
rule.
S.4. 1 , cost (storage)
cost(storage) =
SUM(((#
entries
i)
at level
.
(space per
entry at level i))
+
(overheal for
level i)
),
i=1,...L
where L
is the number of levels of index.
As stated above, in XRM, L=1.
The following is also true of
XRM:
. overhead per inversion is
.
space per entry is
approximately 50 bytes
8 bytes,
plus 4 bytes per chain (see
above)
Thus, for XRH:
cost(storage) =
(50 + 8.$5cy(i)r + 4.$6(d)q).$4sc.$4t
for an index on domain d of relation i.
5.4.2 cost(projected ase with index)
This component
of the decision
down into three sub components.
. cost(retrievals)
rule can be
These are:
further broken
ZHAPTER V
PAGE 93
. cost(decoding index)
. cost (maintaining index)
5. 4. 2. 1
cost (retrievals)
is an index
If there
time that domain j
on domain j
In
of entries.
then
the index
required. Since
the
is
the size of the resolved set
event that the
of no
this is
rules, we
decision
type 'e', the
is used as a qualifier of
index can be employed to limit
-an
then any
of relation i,
value, and
a
'n',
is
qualifier type
serial search
the case
throughout all
eliminate
those cases
is
of these
where
the
qualifier type is not 'e'. The decision rules specified here
will
include
thus
only
<qualifier
'e'
type>
decision
variables.
We
assume,
furthermore,
that
entries
are
retrieved
distributed randomly throughout the relation.
The subcomponent cost(retrieval) thus becomes:
cost(retrieval)
=
( ($5cy(i)r/$6(j)q).($1rd(i,j)*se* + $lrd(i,j)*ce*n)
+ MIN((($5cy (i)r/$6 (j)q).$1rd(i,j)*ce*i(trss)
(trss/$lrd(i,j)*ce*i(trss) )
. ($4io+$4opc)
),
PAGE 94
ZHAPTER V
If
domain j
index,
j',
relation i, say domain
domain in
then some
It
has
and
transfer
some
of
the
$lrd(i,j)*ce*n
to
$1rd(i,j)*ce*i.
relation,
there
If
has
in
requests
the
to
variable
been
a
in the
on some domain
to drgp an index
the
necessary
decision
from
requests
transfer
then
therefor
is
criteria
tentative decision
now
other domain in the selection
become requests in which some
index.
compound
index will
had an
other domain
which no
at which
warrants an
were previously
that
reguests
at the time
see whether it
is being eviluated to
requests in
for some other
index decision has been made
a tentative
oposite
direction.
The
number of
is a
requests transferred
function of
the
number of compound requests that a given domain was involved
in as a fraction of all compound requests.
following
number
of
requests
from
Ie: Transfer the
$lrd(i,j)*ce*n
to
$1rd(i,j)*ce*i:
$1rd (idl)j*ce** . $jrijL&j*c e**
$1rd(i,*)**e**
$1rd (i,
.
$lrd(i,j)*ce**
*) **e**
5.4.2.2 cost (decoding index)
This is a function of both the CPU overhead time invoved in
the decoding of an iniex, as well as the necessary number of
I/O's to get the index
into primary memory.
However,
since
HAPTER V
CPU
the
I/O time,
with
comparison
small in
so
time is
PAGE 95
decision rules presented here will not take into account CPU
overhead in decoding indexes.
Note
(($4p/2) -
$6(d)g)
blocking factor
index
maximum
that the
is
.
number of I/O's
decoding the index is thus the
The cost of
($4bfx)
necessary to bring the index into primary memory.
Ie:
cost(decoding index) =
($1rd(i,j)**e** .
5. 1.2.3
The
($5cy(i)r/$4bfx) .
$4io)
cost (maintaining index)
cost of
is
index
maintaining the
the
function of
a
number of new entries that are made in the relation, as well
as of the number of times a
What
is
presented
here
require
the
However,
to be
be
to
index
and
the length
updates involving the
that the
domain in
of
a series
order to
index need not
all
memory.
rules should
of inserts
or
take into account
be brought
memory separately for each operation.
rules
updates
primary
into
brought
strictly correct, the decision
concerned with
the fact
be
updated.
the decision
insertions
that the
is
of
the purposes
for
assumed
value in the domain is
into primary
CHAPTER V
cost (maintaining index)
=
lu,dI** ).($4io
( $2rr(i)ien + $2rd(ij)
5.4.3
+ $4opc)
cost(building index)
The cost of building the index will
retrieval
of
each
entry
operation to the index.
the
PAGE 96
overhead of
a
relation,
in the
Notice that
single call
be the cost of a serial
to
and
a
write
the rule below has only
XRM.
This is
because
inversion is accomplished by a specific XRM routine.
cost(building index) =
$4opc+$4io. (($5cy(i)r/$4bfe)+($5cy(i)r/$4bfx))
5.4.4
cost(projected use without index)
In the event that there is no
used as a
in which no
index, any time the domain is
qualifier in a simple query, or
other domains in
serial retrieval is
necessary.
a compound quiry
the qualifier had
an index,
a
CHAPTER V
PAGE 97
cost (projected use without index) =
+
(($5cy (i)r/$4:fe) .$4io)
$5cy(i)r.$4opc). ($1rd(i, j)*se*+$1rd(i,j)*ce*n))
+
MIN(
(($5cy(i)r/$4bfe).$4io
($lrd(i,j)*ze*i(trss))
+ $5cy(i)r.$4opc)
,
((trss/$1rd(ij)*ce*i(trss)).
($4io+$4opc) )
This concludes the decision rules for indexing decisions. We
proceed with decision rules for factoring.
Factorinq Decisians.
5.5
Factoring decisions
are decisions regarding the
storing of
aggregations (or factored data) as opposed to computing them
each time
they are required.
The aggregations
system recognizes are:
. MAX
. MIN
. COUNT
UNIQUE
. AVERAGE
.
SUN
the namber of unique values
which this
CHAPTER V
PAGE 98
This information applies only to numeric domains.
In addition, the folloving should be noted:
are always reserved in
. storage for these aggregations
and so the cost of
the system tables,
storage is not
considered in the decision rules.
. COUNT is
as the RDS
always maintained by the system,
continuously.
uses it
. UNIQUE
is no
domain,
and so is
not
is
the underlying
always maintained
If SUM is stored, there is
reverse
COUNT of
more than the
true,
no need to store AVERAGE. The
howver
because
of
possible
roundoff errors.
The factoring decision is:
If
cost(maintaining aggregation)
<
cost(computing aggregation)
then store it.
5.5.1
If
not,
compute it
each time.
cost(computing aggregate)
This consists of a linear processing of the relation, and so
the cost is
simply:
PAGE 99
CHAPTER V
(($5cy (i) r/$4bfe) . $4 i3
+
$5cy(i)r.$4op: ).( $2rd (i,j)r(<aggr>)**
where <aggr>::= SUM
I AVERAGE I MIN I MAX
5.5.2 cost (maintaining aggregation)
This
computing
involves
or updating
maintaining it,
domain is
the aggregation
updated,
it each time
once,
and
a value
then
in that
deletel or inserted.
cost (maintaining aggregation) =
(($5cy(i)r/$4bfe).$4io + $5cy(i)r.$4opc)
+
($2rd(i,j) lu,dj**
+
$2rr(i)ien)
.
($4io
+
$4opc)
This
concludes the
liscussion of
factoring decisions.
We
proceed now with permanent join decisions.
5.6
Permanent Join Dacisions
This decision
domain(s)
rule is employed in
has (have) been added to
the event that
some new
the system and have been
CHAPTER V
key of the existing
will be
relation in which
the new
domain(s) belong(s), and the new domain(s).
because both relations
required.
This
is
and so a
have the same primary key,
a trivial matter.
join is
are
the relations
If
retrieval is
another
required, or
a join is
in the existing relation,
precisely,
more
This means that
used in conjunction with any
any time these new domains are
of those
the
the new relation
The format of
existing relation.
the primary
avoid restructuring
to
separate relations
up in
set
PAGE 100
retrievals
additional
then
left separate,
required
there will
certain
satisfying
in
be
requests. The reverse is not true. That is, if the relations
were to be permanently joined (restructuring were to occur),
they were left separate.
case where
to
the
drop
concentrate
that they
are
rather
of
-oncept
3n
This being so,
cost(projected
cost(projected
the
use
joined
over the
result in additional retrievals
permanently would
able
fact
the
where
no case
is
there
we are
use)
and
premium).
sp;The decision then becomes:
If
( cost(restructuring) + cost(storage if restructured) )
< (
cost(projected use
restructured)
cost(storage if
not
)
then restructure
joined)
premium) +
the relations
relation.
Breaking
into a
lown the
single (permanently
cost components,
we
PAGE 101
CHAPTER V
have:
5.6.1
cost(projected use premium)
necessary
whenever
relation are used
domains in
several
case of
a
is
This
domain (s) in
any
the
cost(projected use premium)
being
un-joined
new
some of those
in a reguest together with
the existing relation.
Ie:
=
$3rd(i,Ipi)d(j,IpI) .
pflpi
retrievals
additional
($4io + $4opc)
where:
Vp such that $8p(i,p)=1
This statistic is maintained separately by the RDS - ie: the
number
of
times each
relation
is
joined to
each
other
relation.
5.6.2
cost(storage if
The cost of
be
4 bytes
restructured)
storage if the relations
for
each
entry for
each
are restructured will
domain
relation excluding the primary key domains.
cost(storage
if
restr)=
in
the
new
Ie:
4.$5cy(i)r.$4sc.$4t.SUM($8d(i,j))
CHAPTER V
PAGE 102
Vj such that $8p(ij)=0
5.6.3
This is
cost(storage if
not restructured)
except that the
similar to the computation of 5.6.2,
restriction that
not be in
the lomain
the primary
key is
lifted. Ie:
cost(storage if not restr)
4.$5cy(i)r.$4sc.$$4t.SUH($8d(ij))
the new relation
where i is
5. 6.4 cost (restructuring)
operations
relation,
and
j
retrieve from
joined entry
the new
writing
three
key to
primary
processing of one relation,
keys consists of a serial
its primary
the same
relations with
of two
The restructuring
for each
the other
out again.
entry.
If
is the new relation, then:
cost (restructuring)= 3. ($5cy (i)r/$4bfe).$4io
+ 3.$5cy(i)r.$4opc
relation, and
In other
i is
using
the
worls,
existing
CHAPTER V
This
the
completes
rules
decision
restructuring
for
rules responsible
with decision
proceel now
decisions. We
PAGE 103
for establishing derived relations.
5.7
It
Derived Relation Decisions.
created solely for
to
create
out that
to point
is important
derived
derived relations
raisons of efficiency,
are
relations
structuring decisions of
quite
and
are
so decisions
independent
the type mentioned above,
of
and the
third normalizing prozess.
The
choice
exists
basically
relation defined by some user
between
storing
a
virtual
for some specific application
(or set of applications) or simulating that virtual relation
each time it
is
referenced.
If:
cost(simulating virtuil relation)
(cost(storage for
relation) +
overhead)
<
derived relation) + cost(use
cost(creating derived
relation) +
of derived
cost(update
PAGE 104
CHAPTER V
then continue simulating the virtual
it
is
relation. If not, then
cheaper to store it.
Similarly, if a derived relation
cease storing
it
relation would
to return
and
be the
is stored, the decision to
to simulating
same as that
above,
the virtual
except
that it
would exclude the 'cost(creating derived relation)'.
'cost(final)' are the same
All references to
earlier in
5.7.1
as those made
the chapter.
cost(simulating virtual relation)
A virtual relation
is simulated by performing
joins in the
user workspace of
various real relations that
are required
for a particular request.
cost (simulating virtual relation) =
cost(final).(
$1vr(m)*nu*
+ 1/$6(j)q.($lvd(m,*)*se* + $lvd(m,*)*ce**)
+ 1/2($1vd(m,*) *sn* + $1vd(m,*)*cn**)
)
5.7.2 cost(storage for derived relation)
The
storage for
a
lerived relation
will
consist of
the
CHAPTER V
overhead
domains of
per relation,
and
PAGE 105
the storage
each of
for
The cardinality
the derived relation.
the
vill be
approximated by the miximus cardinality.
cost(storage for derived relation) =
( 100 + $5cy(m)d.$5#d(m)v.4
where the
derived relation
x il
$5cy(O)r and O=i
x...
is
)
.
relation
Vi such that
$4sc.$4t
m, and
$5cy(m)d =
$8d(ij)=1 and Vj
where $8d(mj)=1.
5.7.3
cost(use of derived relation)
Before determining the use cost of the derived relation,
SDS would make
virtual relation
an inlexing decision for the
(see
2 above
all Irv in relation types).
A
except,
the
domains of the
substitute
'v'
for
derived relation may then be
PAGE 106
CHAPTER V
an analogous manner to a real relation.
treated in
Ie:
cost(use of derived relation a) =
($5cy(m)d/$4bfe).$1vr(m)*nu*.$4io
+ $5cy(m)d.$lvr(m)*nu*.$4opc
+
SUM(($8i(m,j).($5cy (m)d/$6(j)q).
$lvd(m,j)*js,cle**
.
($4io + $4opc)
+ (1-$8i(mj)) ($5zy(m)d/$4bfe).
$1vd(n,j)*Is,cle*
.
$4io
+$5cy(a)d.$lvd(m,j)*Is,cle* . $4opc)
+ (($5cy(m)d/$4bfe).$lvd(m,j)*js,cln** . $4io
+ $5cy(m)d.$lvd(m,j)*Is,cln** . $4opc
)
Vj where $8d(m,j)=1
5.7.4
cost(creating derived relation)
The cost
relation is no
of creating the derived
the cost(final),
since
that is,
in
fact the
more than
optimal way of
creating it.
5.7.5
The
cost(update overhead)
update overhead
for
a
derived relation
involves
an
additional I/0 and call to XRM for each change to any of the
CHAPTER V
the real relation that also appear in the derived
domains in
relation,
PAGE 107
m.
Ie:
cost (update overhead)
=
SUM($2rd(i,j)Iu,s,il** .
and Vi where $8d(i,j)=1.
$8d(m,j)=1,
This completes the discussion of
can be seen
Vj where
($4io + $4opc)
the SDS decision rules. It
that these rules are guite modular
in that any
Derived Relation
Decisions -
major decision -
for example,
are composed
of several subdecisions.
subdecisions
can be
replaced without
Any or all
of these
affecting any
other
part of the SDS.
These rules will be applied in the scenario of Chapter VII.
CHAPTER VI
The Request Decision Sgbsystem -
The RDS is
it
specifically,
More
database.
the
(RQS)
overseeing any requests that are
responsible for
against
made
PAGE 108
is
responsible for the fallowing functions:
. determine
information and
access control
legal -
request is
whether the
ie: check
decide based
on that
whether to perform the request.
.
system
the
whether
information
more
requires
-
ie:
satisfied, or
logically be
it can
determine whether
feasible
is
the request
whether
determine
to
resolve the request.
. determine
the most
of satisfying
efficient way
the
particular request, assuming it is deemed 'feasible'.
. update the relevent decision variables.
the
legality if
access control
relation level.
the
in a
request.
It
mechanisms will be
The :reation
controlled in such a
sees
will omit the
of this thesis, we
For purposes
virtual
is
envisioned that
implemented at
of virtual
that
which
the
the real
relations can
way as to make the data
relation only
question of
be
that the user
he (it)
is
permitted to see. Any data the user is not authorized to see
PAGE 109
CHAPTER VI
will be
thus
making the
supplied
fact that
is
there
some
being
data not
can
security
Thus,
the user.
to
invisible
mapping process,
data during the
removed from the
be
implemented as restrictions of real relations.
We proceed
RDS will contain for the resolving of requests.
detail the
We shall not
variable updates
which decision
points at
that the
collection of algorithms
now with the
are
performed for reasons of making an already difficult section
more unreadable.
it
Instead,
point specific decisian
should be fairly clear at which
context of the discussion at that point.
decision variable
completion
of Step
tentative and
step. The
All updates
only become
reason for this
Finally, note that
actually made
aplates are not
5.
from the
variables will be updated
final at
until
until the
that point
the conclusion
approach will become
are
of the
clear from
the iterative nature of the RDS.
Note that the notation developed
here. In
thus far will be continued
addition, the decision variable
be taken to read '$5cy(i)r
OR $5cy(i)d'.
'$5cy(ije' should
CHAPTER VI
PAGE 110
stepR-.
Step
purpose of
The
1
establish
is to
a
set of
truth
functions regarding the domains appearing in the reguest.
The truth functions set up are:
= 1
* $8od(k)
request.
If
for all k that are object
an entry is
a relation) then
specified
domains in
the
(ie: all domains in
such a truth function is
set up for
ever! domain in the relation.
.
$8eq(k)=1 for all domains
k that appear as qualifiers
with <qualifier type> 'e'.
. $8nq(k)=1 for all domains k that appear in the request
as qualifiers with <qualifier type> not 'e'.
Note that k may be a role name instead of a domain name.
If
the list
of domains appearing in
the request is
RIP,then
for each k<IRI find all relations i such that:
$8r(i,k)=1 or $8d(i,k)=1, and
$8x(i)r=1 or $8x(i)d=1.
but if $8x(i)d=1,
i should not be a
derived relation whose
derivation involved a restriction.
This step finds
all relations in which each
of the domains
VI
CHAPTER
in JRJ appears.
This step produces
These
each domain k ie: ji(k)j.
If
for
a list of relations for
lists are all members of a
where L (I ji(k)11 = L(IRI).
lIi(k)l1,
single list:
PAGE 111
any Ii(k)I we have
L(Ii(k)I)=0 then the
request is
because some
role- or domain name
is
'infeasible',
the request which has not been defined.
that a domain
that is
and if
appears more than once in
the case
It is
used in
also possible
a single relation,
the request must supply role names.
ITe:
$8d(i,k)=0,
and k''
and $8r(i,k')=1 and $8r(i,k'')=1 where k'
are role names.
qtep_3.
This step simply establishes a
eventually
list
of relations which will
(at the end of the alogorithm) become the optimum
list
of relations to satisfy the request.
Ie:
set
up
list
Ifinali,
initially
'infeasible',
and
cost(Ifinall)=2**30 (or some maximum number).
Step_4.
This step finds all possible
combinations of relations that
PAGE 112
CHAPTER VI
of
relations that
satisfy the
cantain
based
request,
established in Step 1.
of the step is
The results
can satisfy the request.
the domains
all
on the
a list
to
necessary
truth functions
set of
The procedure is as follows:
Vii(k) I < I li(k) 1 :
Initialize
if
Vjfk
lxl=Ii(k) I
IxI N Ii(j)
I = g then:
1x1=1xI U jij)j
lxi in ascending order,
order list
been 'done'.
already has
Ii(k)I.
If
not
and see if
go on
If yes,
then save a copy of this
that list
to the
next
lxi as 'done'
and do Step 5.
After completing Step 5:
If
and lxi is
cost(IxI)<ost(lfinall)
not 'infeasible'
then set:
Ifinall=s',
and cost(IfinalI)=cost(ixi)
where s'
is
the collection of set theoretic operators generated by
Step 5.
The results of this step (after repeated invocations of Step
5)
is the
theoretic
optimal
list Ifinall
operators,
cost(ifinall).
s',
and
a
as
an
collection of
associated
set
cost,
PAGE 113
CHAPTER VI
This step consists of a series of substeps whose function it
relation
more precisely,
via relations in
xir from Step 4.
IxI - for a given
any
- or
for resolving the request
the lowest cost
If
cost(Jx)
determine the lowest
is to
Can
relation iIxi
i'4txl then
then
be
not
joined to
any
is 'infeasible',
lxi
other
and
the
evaluation stops.
We proceed now with a detailed algorithm for determining the
minimum cost(lxi).
Ste_ 5.1
lxi with
all relations in
This step finds
cheapest way of ordering
key, and the
the same primary
the necessary joins,
and restrictions.
Repeat Vif1xI not already 'done':
Set
<null>,
Ix't to
cost(Ix'1)=2**30
and
(or some
large
number).
Now find
all relations
j
(say
Iji) in
lxi with
the same
primary key:
Te: Find all
j
such that $8p(j,k)=1 Yk where $8p(i,k)=1.
This results in a set lj'=i U iji
Vaflj'I
do the following:
PAGE 114
CHAPTER VI
Set Ir'I to <null>.
where
k~a
each
Far
r=($5cy(a)e/$6(k)g).
is
IreI
If
in Irli
this r
Insert
$8i(a,k)=1,
and
$8eq(k)=1
set
such that
in ascending order.
$8eg(k)=1 Vk where $8p(a,k)=1 then r=1,
ascending position in
is
This case
and insert in
Ir'I.
of the primary
where all domains
key are
qualifier domains with <qualifier type> 'e'.
with <qualifier type> 'e'
Also, if any qualifier domain
is
unique, then r=1.
Ie: If
$8eq(k)=1,
insert r in
If
and
Ir'I.
member of Ir'I; ie: the smallest r<jr'l.
=
cost(jj'I)
$8i(a,k)=1 then r=1,
and
ascending order in
Choose w = first
Then cost(lj'l)
$8u(k)=1
w.($4io + $4opc).L(Ij'I)
< cost(Ix'I)
cost(Ixe 1)=cost(Ij'),
then:
and
x'lI=Ir'
Reset Ire
to <null>.
The case now is where r will
resolve, but there is no index
on any of the qualifier domains:
If
$8eq(k)=1 and $8u(k)=1 then
r=1,
and insert in ascending
order position in Ir'I.
For each k<a where $8eq(k)=1 and $8d(a,k)=1,
but $8i(a,k)=O,
set:
r= ($5cy(a)e/$6(k)q),
and insert
r
in
ascending order
in
PAGE 115
CHAPTER VI
IrII .
that r=1
reason
the
that
(Note
when
$8u(k)=1
is
that
$6 (k)q=$5cy (a) e.)
After completing all k~a, pick w = first member of Ir'lI; ie:
the smallest r.
Then cost(Ij'I)
=
($5:y(a)e/$4bfe).$4io
+ $5cy(a)e.$4opc
+ (L(Ij'l)-1).w.($4io + $4opc)
If
cost(Ije')<cost(Ix'I)
then:
and
cost(Ix'I) = cost(IjeI),
Ix'I = Ir'I
Reset Ir'I to <null>.
Now,
for each k<a where $8nq(k)=1 and $8d(a,k)=1,
set
r=1/2.$5cy(a)e,
insert r
and
in
Irel
in
ascending
order.
After all k<a are completed,
ie: fir the smallest r.
cost(I j')
=
choose w=first uemb er of Ir'lI;
Then:
($5cy(a) e/$4 bfe) .$4io
+ $5,y(a)e.$4opc
+ (L(jj'j)-1).w.($4io + $4opc)
If
cost(Ij'I)<cost(Ix'I)
cost(izxI)
= cost(ij'i)
then:
and
PAGE 116
CHAPTER VI
Irxe
If at
=
jr'i
this stage jx'j
is
there were
required, since
null, then a serial
no qualifiers
retrieval is
request.
in the
Proceed as follows:
such that $5cy(a)e is
Find a*j I
a minimum Va*Ij'I.
and set s=a.
Mark a as 'done',
, do the following:
For each k/a, k(Ij
Mark k as 'done'.
If Vb where:
$8nq(b)=1
($8eq(b)=1,
($8d(s,b)=1
or $8od(b)=1)
and $8d(kb)=1)
and
then no
needed,
join is
s.
since all domains needed from k are already in
Otherwise, the set theoretic operators become:
* k(I=,pI)
s=s(Ipj)
cost(Ix' ) =
Vp4lpi where $8p(a,p)=1,
and
($5cy (a) e/$4bfe) .$4io
+ $5cy(a)e.$4opc
+ $5cy (a)e.($4io + $4opc).(L(Ij'1)-1)
If, hvwever,
procedure,
x'I was not
<null> at
the end of
the above
proceed as follows:
Choose a=first member of Jx'
and mark a as 'done'.
(ie: where r is smallest),
CHAPTER VI
117
PAGE
Set s=a.
The set theoretic 3perators then become:
R
s=s(Ipj)
(Ie,pi)
such
Vp4IpI
that
$8eq(p)=1
OR
$8nq(p)=1,
domain p. (see Appendix 2,
where '0' is the qaalifier on
Page 175)
Now, for all k/ea, k<Ij'j
do the following:
Mark k as 'done'.
If Vb where:
($8eq(b)=1,
$8nq(b)=1
or $8od(b)=1)
($8d(s,b)=1 and $8d(k,b)=1)
and
no join is
necessary,
since
all domains needed from relation k are already in s.
Otherwise,
establish
the
following
set
theore tic
operators:
If
$8eg(b)=0 and $8ng(b)=0 Vb~k,
s=s(jpi)
*
Or,
$8eq(b)=1
if
k(I=,pI)
Vp4lpl
or
then
where $8p(a,p)=1.
$8nq(b)=1
($8d(s,b)=O and $81(k,b)=1
for
some
b<k,
and
) then set theoretic operators
become:
s=(s(IpI)
*
$8eq (t) =1 or
k (I=,pi))
R
(Ie,tI)
$8nq(t)=1, and 8
Vt<IkI
such
is the qualifier
that
type on
CHAPTER
PAGE
VI
118
domain t (see Appendix 2, Page &pno3).
have been so included in
Once all kij'l
the set theoretic
operators, they can be removed from the list of relations to
be joined
( ie:
replace them
from
would result
that
relation
and
Jxj)
Also,
s.
the single
all by
up
set
truth
functions as follows:
$8d(s,b)=1
Vb where $8d(k,b)=1 for k<Jj'J, and
$8p(s,b)=1
Vb where $8p(a,b)=1.
Also set $5cy(s)v = r.
Nzte that
limit on
r in
this case is,
strictly speaking,
s,
the cardinality of
since other
an upper
qualifiers may
well result in reducing the size of r.
At the completion of Step 5.1,
there exists a collection of
(virtual) relations Is'I each generated
step.
Each s4js'I is
a relation
derived) relations
in
restricted
as
and
is
Ixi that
as outlined in this
consisting of all real (and
have the same
required by
the
primary key,
qualifiers in
the
request.
If
L(Is'I)=1
then steps 5.2
,
5.3 and 5.4 may
The algorithm continues in this case
be omitted.
with where it
left off
CHAPTER VI
PAGE 119
in Step 4.
Step 5.2
This step is
manner,
responsible for joining in
all those relations s*Is'l,
contains the
primary key
primary key as) tfls'j,
If there is
Step
5.3.
in the case where s~is'i
of (but
does not
no s and t
Otherwise
have the
same
t/s.
Vk where $8p(t,k)=1,
Ie: $8d(s,k)=1
the most efficient
s,t*Is'I.
where this is the
continue with
the
case, proceed to
algorithm of
Step
5.2.
Assuming s contains the primary key domains of t:
to see
check
domains (for
Is'I
whether
s already
this request) in
t,
contains
and if
so,
all the
needed
remove
t from
and continue with some other tis'l.
If
$8od(k)=o
Vk
where
$8d(tk)=1
and
$5cy(t)v=1
then
establish 2 set theoretic operators:
1) t
2)
- ie: leave t as it
s'=s(jpl) R (I=,pl)
is
in
p<IpI
Is' I, and
Vp where $8p(tp)=1,
values for the domains are obtained from
and
(1).
cost(Is, 1)=cost(t) + cost(s).
This
means that
if
there
are no
domains that
are to
be
CHAPTER VI
retrieved
from t,
($5cy(t)v=1),
will
ani t
then resolve
PAGE 120
resolve to
t and
use the
a single
entry
values for
the
primary key domains as additional qualifiers in s.
If $8od(k)=1
k where
for some
$8d(t,k)=1 then
proceed as
follows:
s'=(s(jgI) * t(J=,gI))
where g4|gI
R (Ie,pI)
Vg where $8p(t,g)=1,
and
Vp*ItI such that $3eg(p)=1 or $8nq(p)=1,
and
e is the qualifier type on domain p (see Appendix 2, Page
&pno3).
cost(s')=cost(s)
+ 2.$5cy(s)v.($4io + $4opc)
Remove t from the list of relations to be joined (ie: jx'j),
and set up additional truth functions for s' as follows:
$8d(s,b)=1 Vb where $8(t,b)=1.
At the conclusion of this step, there exists a collection of
virtual relations Is" I each generated from some member(s) s
in
Is'l. Furthermore,
these relations are only joinable in
particular way - namely,
as in Step 5.3.
a
CHAPTER VI
PAGE 121
Step 5.3
This step is responsible for joining relations s'*Is''I from
step 5.2
(or 5.1)
to yield
a single relation containing all
object domains in the request. Note
qualifier domains will
that by this stage, all
have been employed in
the necessary
restrictions.
We proceed as follows:
Choose
tfls''I such
relations in
Then,
that $5cy(t)v
is the
minimum of
all
Is''J.
Va*is''j, aps do the following:
If $8d(t,k)=1 Yk where :
($8eq(k)=1,
$8nq(k)=1 or $8od(k)=1),
and $8d(a,k)=1, then t
contains all the domains neccessary for the request that are
in a.
So,
remove a
from Is''I
a<ls''
.
See if
$8d(t,k)=1 and $8d(a,k)=1 for
two relations in Is''I
some other afls''I.
If so:
Contain a
and continue with
some k.
the next
Ie: see if
any
common domain. If not, try
CHAPTER VI
t=t(k) * a(k),
and
+ ($4io + $4opc).$5cy(t)v.$5cy(a)v
cost(t)=cost(t) + cost(a)
Remove relation
truth functions.
At
this
stage,
cheapest method
a from Is''1 and
this is
add the necessary
if
we
L(Is'')=1, then
for joining
not the case,
all relations
have
found
in lxi,
the
and we
Step 4.
and L(Is''I)>1, then the request is
#infeasible' with only the relations
relations must
set of
ie: $8d(tk)=1 Yk where $8d(a,k)=1.
continue where we left off in
If
PAGE 122
be found that will
in lxi, and some other
allow the request
to be
logically satisfied. This is the function of Step 5.4.
Step 5. 4
This step
attempts to
specified in
Ilxi will
find relations
allow the
which, although
request to
not
be completed.
Finding such fintermeliary' relations can be accomplished as
follows:
5.4.1) Set up list Ib'l=<null>
5.4.2) Choose some afls''1
CHAPTER VI
PAGE 123
5.4.3) Find some relation b such that bjIxl and:
$8d(a,j)=1 and $8d(b,j)=1.
5.4.4)
If
found,
5.4.5) If
not
set Ib'I
found,
= Ib'I
U b
then the request
is
'infeasible',
and
continue where we left off in Step 4.
5.4.6)
See
t*a,
tfls'',,
5.4.7) If
$8d(b,k)=l
if
$8d(t,k)=1
and
for
some
k,
b<Jb'j
yes: then set txI=IxI
U gbil, and remove relation
t from Is''f, replacing it with the single relation:
a=
(a(j) * b(j)) (k)
* t(k).
Continue with Step 5.4.12
5.4.8) If
not,
try 5.4.6 for some other t<(s''I, t/a
5.4.9) If that fails, fini relation cfix1, c/b such that:
$8d(b,j)=1 and $8d(c,j)=1
5.4.10) If found,
5.4.11)
If
for some
j.
Ib'I=Ib'I U c, and repeat from 5.4.6
not, the request
from where left off in Step 4.
is
'infeasible',
and continue
PAGE 124
CHAPTER VI
5.4.12) If
11, repeat 5.4.1 through 5.4.11 .
L(Is''I)
5.4.13) If L(Is''J)
restart step 5 again from 5.1
1, then
=
with the new lxi.
This
completes
in
all relations
First find
See
keys.
smallest set
the
list
of those
4hich
of data,
for
optimally
satisfying
works as follows:
it
Very briefly,
requests.
primary
alogorithm
the
and do
have the
that
will
resolve
same
to
the
relation with
a join of that
others of the same primary key.
After relations with the same primary keys have been joined,
see if
any one
other
relation.
relation contains
If
he retrieved
the primary
primary
key
key of
any
values
to
that can
be
retrieve from the other relation.
After
joining all
joined in
that way,
relations by
join
primary keys
remaining relations on some common
domain.
If
there is
no common domain,
has domains common to both.
find some other relation that
CHAPTER VI
This, then, completes
earlier,
the RDS
variables, and
function
can
is
the discussion of the
also responsible for
the appropriate
be
PAGE 125
surmised
from
As stated
updating decision
points for
the
RDS.
performing this
description
of
the
algorithm.
We proceed
now to
a scenario in
which the
decision rules
developed in Chapters V and VI will be applied.
PAGE 126
CHAPTER VII
Scenario for appligtion of Decision Rules.
scenario that applies some of
This chapter presents a brief
the decision rules
VI. Not all
developed in Chapters V and
of these rules be usel in the scenario, but a representative
them will be,
number of
and
that will serve
to illustrate
the use of others.
Consider a company
Manager. Each
awn
divided into Departments, each
Department
within
-ertain
a
single Department.
Parts,
which
are
who
are
at a time, and all projects
Assigned to work on one Project
fall
Employees,
employs
with its
Each
project
by
provided
the
requires
company's
SuupligErs.
f we were
to attempt to establish a company
data base for
this company we would have to:
. specify the entities in which we are interested,
. determine what data we want
to maintain about each of
these entities, and
. determine how these entities interact.
The interactions
Given
the company
can perhaps best be
structure above,
done diagramatically.
we
might diagram
the
CHAPTER VII
interaction between
The entity at
7.1.
'owned'
But
Figure
in some sense
the head of an arrow is,
*
sufficient
is not
diagram
appears in
as it
the entities
by the entity at the tail.
this
PAGE 127
to
express
certain
aspects of the structare.
For
example,
Manager,
one
any
for
only
is
there
2ne
Department while any Manager may have several Projects under
his control. We
will introduce an '='
terminology of
this is
a Autual
Te:
given
determined,
One other
difference
one
head of the
nature of the relationship.
arrow to signify the one-to-one
In the
near the
Chapter IV,
functions of
the truth
.e~endenc.y.
entity,
and vice versa.
case is
between
other
the
This is
not expressed
asses
where
is
entity
shown in
Figure 7.2.
namely the
in Figure 7.2;
one
uniquely
entity
is
uniquely
determined by another, and cases where it is not.
The concept of ownership is
See (7)
network model.
*
the same as that found in
the
CHAPTER
VII
PAGE 128
MANAGER
DEPARTMENT
J PROJECT
II
SUPPLIER
PARTS
EMPLOYEE
Filure 7 1
PAGE 129
CHAPTER VII
MANAGER
PROJECT
DEPA TMENT
SUPPLI ER
PA TS
EMPLOYEE
Figure
7
.2
CHAPTER VII
PAGE 130
**
The former is a case 3f a functional dependency
or
a one-to-many
mapping.
An
where several
Supplier might
mapping.
The
example of the
latter is
many-to-many
latter is Supplier
Suppliers may supply
supply many
a
Parts.
and Parts,
the same Part,
A
and one
possible method
for
diagramatically distinguishing between these two cases would
be a double-headed arrow
7.2 is
updated
that we
for many-to-many mappings.
to include this concept in
have a clear
concept of the
the entities, we can begin
Figure
Figure 7.3.
Now
relationships between
to consider what information, or
attributes, we wish to keep about each entity. ***
Assume for purposes
be maintained
of the scenario that
for each
entity are
the attributes to
as specified
in Figure
7.4.
Now
notice
that
some
attribute
(or
combination
of
Functional dependency, as explained in Chapter II, means
**
simply, given one entity, the other is uniquely determined,
but the reverse is not true. For example, given an Employee,
his Department is uniguely determined since an Employee can
only belong to one Department.
Note that consideration of attributes and consideration
of interrelationships between entities are orthoganal, and
as such, may be done in any order. The order presented here
is by no means mandatory.
***
PAGE 131
CHAPTER VII
MANAGER
I
DEPARm ENT
PR0
SUPPLIER
EMPLOYEE
!iagrge
7
.3
ECT
PAGE 132
CHAPTER VII
m_nime,
Manager(Mgr#,
office#)
Department(Dept#, dname)
Project (Proj#, paame,
Supplier(Supp#,
Part(p#,
guant,
s_name,
startdate,
enddate)
phone)
dite)
Employee (socsec,ename,hiredate,salary, title)
attributes) in
each entity
entity uniquely; such
that the concept
both
of Figure
7.4, identifies
as socsec would an
of 'uniquely defined' has
relationships between
entities,
the
Employee. Notice
been applied to
and within
entities
themselves.
We are now
in a position to define truth
structure between
the entities as
functions for the
depicted in
Figure 7.3,
and within the entities as defined in Figure 7.4.
First we
define functional dependencies within
entities by
PAGE 133
CHAPTER VII
up
setting
functionally
as
attributes
The
as depicted
functions are
resulting truth
on
an
that uniquely define the
attribute (or group of attributes)
entity.
dependent
in
Figure 7.5.
We will call the attribute(s)
in
an entity
on which the other attributes
functionally dependent
are
the
key of
the
entry.
has been omitted from Figure
Notice that Part
7.5. This is
between Part and Supplier,
because the many-to-many mapping
and Part and Project, as depicted by the double-headed arrow
in
Figure 7.3.
To
the entities
arrow,
functional dependency
at the
which yields in
end(s)
tail
truth
key attributes
entities, we include the
functions for such
of
establish
of the
double-headed
this case:
$8f(p#,Proj#,Supp#I, Iquantdatel)=1
This
same approach
arrow) if
would
there were not
be taken
(ie:
a double
headed
attribute(s) within an entity that
uniquely identified the entity.
CHAPTER VII
PAGE 134
,Im_name,office#)=1
$8f (Mgr#
$8f (IDept#I, Id_namel)=1
$8f(IProj#1, Ip_namestartdate,enddatel)1
$8f (ISupp# 1, Is_naae,phone I)= 1
$8f(IsocsecI,Ie_name,hiredate,salary,titlel)=1
Fi3gure 7. 5
care of intra-entity dependencies,
This takes
portrayed in Figure 7.3.
inter-entity dependencies that are
All
we
done
have
thus
double-headed arrow of
far
but neglects
is
account
take
Figure 7.3, but not
of
the
any other types
of arrows.
In order
to handle the
proceed as
single-headed arrows we
follows:
Add
the key
of
the entity
at
the tail
functionally dependent attributes of the
to
the list
of
entity at the head
of the arrow.
For the entities
Figure 7.3,
of Figure 7.5, and
this process generates
the interrelations
the entities
of
of Figure
CHAPTER VII
PAGE 135
$8f(jfgr#,Im_name,office#j)=1
$8f (IDept#1 ,1d_namel) =1
$8f (IProj#IIp_name,startdate,enddate,Ngr#,Dept#I)1
$8f(ISupp#I,Is_name,phonej) =1
$8f(Isoc-secIIe_name
Dept#,
,
hiredate ,
salary
,
title,
Proj#l)=1
$8f(|p#,Proj#,Supp# I,Iquant ,datel)=1
Figure7.6
7.6
Now all that is
'=#.
a
function is
dependengl,
the arrow head with the
indicates a one-to-one mapping,
This type of arrow
mutual
is
left to consider
and so
a
mutual
establishe1 for the keys
tail, and head of the arrow.
$8m (IMgr#I , IDept#I )=1
dependency
or
truth
of the entities at the
Thus, from Figure 7.3 we have:
CHAPTER VII
PAGE 136
We now have a set of truth functions that reflect the entity
attributes,
well
as
entities. Now the
as
the
interrelations
SDS, by applying the
the
algorithm specified
generates the third normalized
in Appendix 1,
between
relations of
Figure 7.7.
Implementation of Relations.
The procedure of diagramming
the inter-entity relationships
as outlined above proviles a logical method for establishing
the truth functions necessary for
the SDS to maintain third
normalization.
The next function
of the SDS at this stage
implementation of
the
the
relations
is to determine
7.7.
of Figure
The
decisions to be made are:
. virtualizing and encoding decisions,
. indexing decisions
. factoring decisions
No derived relation, or permanent join decisions can be made
at this point
new domains
base.
Thus
since there are no virtual
have been
all
defined to be
decision variables
relations, and no
included in
referring
to
the data
virtual
CHAPTER VII
PAGE 137
RR1 (5griDept#,aname,d_name,office#)
RR2 (Proji,Mgr#,p_name,startdate,enddate)
RR3(soc sec,
Proj#,
Dept#,
hiredate,
ename,
salary,
title)
sname,
RR4(Supp,
RR5(p#Proj#,Sqgg,
phone)
guant, date)
(Keys underlined)
and
relations,
relations with
the effect of
resulting in
the
same
the appropriate
be
key will
0,
decision rules
beeing null.
In order
have
point
to employ the various
values for
(ie:
in
some of
decision rules, we
the decision
definition
phase)
need to
variables. At
these
values
must
this
be
user-supplied.
For our purposes, we zan leal
with some of the aggregations
CHAPTER VII
of
decision variables,
and allow
PAGE 138
the SDS
these
to split
All
aggregations into detailed decision variables as needed.
user-supplied values,
which do not have
decision variables
will be 0.
know the
Suppose we
the use
following about
of the
data
base:
a) RR4 is
usually accessed on sname,
b) RR2 is
usually accessed on pname
c) RR1
is
usually
accessed either
or dname,
on mname
equally often on each
d) The
enddate of a project
after
the
stirtdate.
(in
RR2)
(All
is
3 months
always
projects
run
for
3
months.)
of RR3 has exactly 46 possible values, and is not
e) title
expected to change. It is also seldom accessed.
For (a), (b)
and (c),
the
SDS should consider
indexing the
relevent attributes.
For (d),
virtualizing of enddate
is
possible, and
for (e)
encoding of title is possible.
We will
set up the
following decision variable
use in further explanation:
values for
CHAPTER VII
$lrd(RR4,s~name)*se*
= 10
$1rd(RR2,p_name)*sa*
= 10
PAGE 139
$lrd(RR1,aname)*se* = 5
$lrd(RR1,a_name)*ce*n
$lrd(RR1,d_name)*se*
$1rd(RR1,d_name)*c*e*n
= 5
= 5
= 5
All other decision variables have initial values of 0.
Other pertinent data is:
$5cy(RR1)r = 60
$5cy(RR2)r = 123
$5cy(RR3)r = 2000
$5cy(RR4)r = 75
$5cy(RR5)r = 25000
$6(title)q = 46
$6(sname)q = 10
$6(pname)q = 89
$6(mname)q = 58
$6(dname)q = 60
$4bfe = 25
$4bfx = 350
CHAPTER VII
PAGE 140
$4io = 0.0012
$4opc = 0.005
$4t = 1
$4sc = 1.6 x 10**(-5)
7.1 VirtualiZing Decisions.
Suppose:
$1rd(RR2,enddate)*s**=1,
$1rd (RR2,enddate)*c**n=1,
$2rd(RR2,enddate)r**=2,
and
$2rd(RR2,startdate)u**=5 .
Using decision rule 5.3.1 of Chapter V:
Cost(making domain real),
since there is
cost(use if
and cost(virtualizing) are both 0,
no dati yet in
real)=
the system.
(1+1).(123/25)
x 0.0012
+ 123 x 0.005
= 0.63
cost(maintaining if
real) = 5 x
(0.0012 + 0.005)
= 0.03
cost(final) = (123/25 x 0.0012)
+ 123 x 0.005
= 0.62
cost(use if
virtual) = (0.62 x 2)+(0.62 x 2)
CHAPTER VII
PAGE
141
= 2.48
Applying the decision rule of 5.3.1, we have:
Make the domain real if:
(0.63 + 0.03)
<
(2.48).
In this case, the domain would
be made real.
(Note that the
main reason for this is the fact that the domains is used as
a qualifier,
thus reqairing a
linear search of the relation
to compute, and then test the value of the domain.
7.2
Encoding Decisioas.
The candidate here for encoding is 'title' in RR3.
Suppose:
$1rd(RR3,title)***** = 1, and
$2rd(RR3,title)jr,ul** = 2.
Then, applying the decision rule 5.3.2 of Chapter V:
cost(encoding) and cost(decoding) are both 0, since there is
as yet no data in
the data base.
cost(use if encoded) = 2.(1+2).(0.0012+0.005)
= 0.04
cost(use if unencoded)
(1+2).(0.0012+0.005)
= 0.02
PAGE 142
CHAPTER VII
cost(extra storage)
=
( (4 x 2000)
-
(6+(8 x 46)+100) )
x (1 x 1.6 x 10**(-5))
=.12
the decision rule
Thus, using
of 5.3.2, encode
the domain
if:
< (0.12 + 0.02).
(0.04)
In this case,
7.3
of RR3 would be encoded.
'title'
Indexing Decisions.
For each of the domains used as qualifiers with a <qualifier
type> of
'e', the
creating an
domain.
SDS would
index (if one did
evaluate the
desirability of
not already exist)
for that
If an index exists, the SDS would determine whether
it is still needed.
We shall
only follow one case
here; namely, for
RR2.
Applying the decision rule 5.4 of Chapter V:
cost(storage) = (50+(8 x 123)+(4 x 89))
x
((1.6
x 10**(-5))
x 1)
pname in
PAGE 143
CHAPTER VII
= 0.02
cost (projected use with index) =
x 10 x
(123/89)
(0.0012+0.005)
x 0.0012
+ 10 x (123/350)
= 0.09
cost(projected use without index) =
10 x (((123/25) x 0.0012)+(123 x 0.005))
= 6.21
cost(building index) =
0.005 + 0.0012 x((123/25)+(123/350))
= 0.01
Thus, the decision becomes:
Build an index for ths domain if:
(0.02 + 0.09 + 0.01)
which
would result
in
< (6.21)
a
decision
to build
an index
for
p_name of RR2.
No permanent
join, or derived
relation decisions
are made
for reasons outlined above.
Once the system has been 3perational for a while, and values
have been
generated for the different
decision
variables,
CHAPTER VII
an invocation of
similar way,
to
PAGE 144
the SDS would make similar
in
would,
It
the examples above.
decisions in a
addition,
make permanent join decisions (where applicable) and derived
decision rules
as specified by
relation decisions,
5.6 and
5.7 of Chapter V respectively.
This
scenario
has
demonstrate the manner
would be applied.
other
scenarios
presented
in
application
simple
which the various
The reader is
and
a
invited
other decision
similar to that employed here.
to
decision rules
to experiment with
rules
in
a
fashion
CHAPTER VIII
PAGE
145
Conclusion.
What has been presented here is:
. a methodology
data base
for pseudo-optimization of a
for the type of use currently being made thereof. This
is done by the SDS.
a
.
for
procedure
pseudo-optimaztion
of
request
handling, by the RDS.
The phrase
'pseudo-optimal' is
used in
word 'optimal', since the decision
largely heuristic,
preference to
the
rules presented here are
and as such may
well not be
optimal in
the accepted sense of the word.
The
SDS is
driven by
a collection
(maintained by the RDS) and
the
pseudo-optimal,
variables
a collection of truth functions
which are used in SDS decision rules.
is
of decision
The output of the SDS
third-normalized
data
base
structure.
The RDS is
driven primarily by a set of truth functions,
does make some use of decision variables.
RDS is:
but
The output of the
PAGE 146
CHAPTER VIII
. updated decision variables,
set-theoretic
of
collection
a
.
operators
to
best
satisfy the request.
The decision
rules in both the
to permit
modularized
rules
rules,
generally be
by
connected
modularity of
Furthermore,
the case,
degenerate
may well
major feature to
into
boolean
the decision
decision
rules without effects on other
specific, and
highly implementation
this will
of particular
replacement
rules, and parts of decision
parts of the subsystem.
are highly
SDS and the RDS
all decision rules are
it is
envisioned that
as generalized
decision
of
specific
a summation
variables.
rules presented
allow for easy replacement
As
the
such,
here may
be a
of those parts
of rules that are appropriate for other implementations.
Perhaps the most
here is
outstanding feature of the
the dynamic
approach taken
nature of both the SDS and
the RDS.
To
date, this has certainly not been true of subsystems used to
aid in the design
of the data base, and only
rarely in the
PAGE 147
CHAPTER VIII
request-handling function. **
In general,
in a
be stated
had to
queries have
way that
inherently specified the procedural steps to be taken in its
handling, and structuring decisions have always been made by
position of a data
someone in the
(or group)
person
decision model,
well
might
be
base administrator. This
aided
but such structuring, and
restructuring decisions
by
some type
of
more importantly
were never made dynamically
by the
system.
Various algorithms were developed for
of optimizing performance,
aiding in the process
the most major of
which is that
in Chapter VI for pseulo-optimizing request handling.
This
work can
complete
in
certainly
any sense.
not, nor
It
is
does
it,
merely to
claim to
demonstrate
be
a
methodology for approaching the
arena of automated decision
subsystems. As such, there are
many areas into which forays
must be made before such subsystems become complete.
IBM's San Jose Research Center has designed a query
language called 'Sequel' which does quite elaborate dynamic
request handling optimization. See (8).
**
PAGE 148
CHAPTER VIII
Future Research.
here
presented
to
situations.
other
the methodology
be to apply
first step should
Perhaps the
only
the
was
XRN
implementation considered here.
There is
also
a problem that arises from the
way that is rather analogous
is physically implemented in a
to the
attempt
little
system.
the user
interface that
There
separation
and
(or need)
is,
to
however,
the SDS
fact that XRM
sees. As
such, there
separate these aspects
a
should
clear
be
need
for
was
of the
such
a
broken down
into
two
in
fact,
the
distinct parts to handle:
and
. the logical structure,
. the physical structure.
It
might seem
that
virtual relations
logical system structure,
the fact that
a row,
whereas some
that an entity
but further reflection will reveal
real ralations can be
in a variety of ways. XRM
are,
physically implemented
treats, and stores each entity as
systems are
more column-oriented,
consists of a value from each
in
column. It is
CHAPTER VIII
PAGE 149
implement a relational system
even possible to
in
a system
of the IMS variety.
In this
throughout
this
thesis
in
may,
relations used
real
regard, the (third-normalized)
physically
be
fact
The SDS should be expanded
implemented in a number of ways.
to reflect this aspect of data management systems.
There were also places in the
of the
partial inaccuracy
These modifications,
be made
to those
though,
to
body of this thesis where the
decision rule
was pointed
out.
as well as many other refinements could
rules presented
recognize
here.
when fine-tuning
improvements, and when the
is important,
It
will
yield
major
benefits are substantially below
the costs of such efforts.
It
is
felt that the decision rules presented here are of the
type that might affect system
magnitude, particulariy
in cases
time. Fine
tuning these rules
only a few
percentage points.
that
research
conducted
performance by many orders of
where usage
might affect
Perhaps this
in this
vein
changes over
performance by
should indicate
attempt
to
first
PAGE 150
CHAPTER VIII
accomplish the
fine tuning is
orders-of-magnitude
attempted.
improvements
before any
PAGE 151
References
1)
Donovan, J.J.,
aystem
Programming,
McGraw
Chapter 7,
Hill, 1972.
Construction
C 2 !giler
Gries, David,
2)
Computrs, John Wiley & Sons, 1971
For
Digital
3) Codd, E.F., 'A Relational Model of Data for Large Shared
# 6, June
Data Banks', Communications of the ACM, Vol. 13,
1970.
Normalization
'Further
Cadd, E.F.,
4)
Relational Model', IBM, San Jose, 1971.
of the
Data
Base
'XRM - An Extended (N-ary) Relational
Larie, R.A.,
5)
Memory', IBM Cambridge Scientific Center, Cambridge, Ma,
Jan. 1974 (IBM Report G320-2096)
Completeness of Data
'Relational
E.F.,
Cadd,
6)
Sublanguages', IBM, Sin Jose, March 1972 (RJ 987)
Base
'data Structure Diagrams', Data Base
7) Bachman, C.W.,
(Quarterly News letter of the ACM-SIGBDP) Vol. 1, #2, 1969.
8) Astrahan, M.M., Chamberlin, D.D., 'Implementation of a
Structured English Query Language', IBM Research, San Jose,
Oct. 10, 1974.
Bibliography
Examination
'An
1) Ackermann, R.C.,
Prototype Information System', Masters
of Management, MIT, June 1973.
PAGE 152
and Modelling of a
Thesis, Sloan School
'Data Structure Diagrams', Data Base
2) Bachman, C.W.,
(Quarterly News letter of the ACM-SIGBDP) Vol. 1,#2, 1969.
3) Bachman, C.W., 'The Data Base Set Concept; Its Usage and
Internal
Systems,
Honeywell Information
Relaization',
Report, Jan. 31, 1973.
'Redacing the Retrieval Time of Scatter
Brent, R.P.,
4)
Storage Techniques', Zommunications of the ACM, Vol. 16, #2,
Feb. 1973.
5) Buchholz, W.,
Systems Journal 2,
'File Organization and Addressing',
(June 1963) pp 86-111.
'Some Approaches
6) Burkhard, W.A.,
Searching', Communications of the ACM,
1973.
IBM
to Best-Match File
#4, April
Vol. 16,
and Selection of File
'Evaluation
Cardenas, A.F.,
7)
A Model and System', Communications of the
Organization 8) Chamberlin, D.D., Gray,
16, #9, Sept. 1973.
ACM, Vol.
J.N., Traiger, I.L., 'Views, Authorization, and Lacking in a
Relational Data Base System', IBM Thomas J. Watson Research
(RJ 1486).
Center, Dec. 19, 1974
Chapin,
9)
Techniques',
1966.
Organization
of File
Comparison
'
N.,
Proceedings ACM, 24th National Conference,
Feature Analysis
Committee,
CODASYL Systems
10)
Generalized Data Base Manaement Systems, ACM, May 1971.
of
11) CODASYL Systems Committee, A Survey of Generalized Dita
ganageent Systems, ACM, May 1969.
Base
Model for Large Shared Data
E.F., 'A Relational
12) Codd,
Banks', Communications of the ACM, Vol. 13, #6, June 1970.
Data
Base
'Further Normalization of the Data
14) Codd, E.F.,
IBM Research, San Jose, 1971.
Model',
Relational
Base
Completeness of
13) Codd, E.F., 'Relational
Sublanguages', IBM Research, San Jose, Mar 1972.
PAGE 153
Bibliography
'Seven Steps to Rendezvous
15) Codd, E.F.,
San Jose, Jan 17, 1974.
Research,
User', IBM
16) Codd,
Tutorial',
with the Casual
E.F., 'Noraalized Data Base Structure; A
Nov. 1971.
IBM Research, San Jose,
Brief
'A Data Base Sublanguage founded on the
17) Codd, E.F.,
Proc. 1971 ACM-SIGFIDET Workshop,
Relational Calculus',
1972.
18) Collmeyer, A.J., Shemer, J.E., 'Analysis of Retrieval
Performance for Selected file Organization Techniques', Fall
Joint Computer Conference, 1970.
'Elements of Data Management
19) Dodd, G.D.,
Computing Surveys 1, June 1969.
Systems', ACM
20) Donovan, J.J., Sjgsters Prggramming, McGraw-Hill, 1972.
'Virtual
Schutzman, H.,
Madnick, S.,
Follinus, J.,
21)
Sloan School Working
Information in Data Base Systems',
Paper, Sloan School of Management, MIT.
22) Frank, R.L., Yamaguchi, K., 'A model for a Generalized
Data Access Method', National Computer Conference 1974.
The Consecutive
S.P., 'File Organization:
23) Ghosh,
Retrieval Property', Communications of the ACM, Vol. 15, #9,
Sept 1972.
Updating Mean and Standard
24) Hanson, R.J., 'Stably
Deviation of Data', Communications of the ACM, Vol. 18, #1,
Jan 1975.
25) Hsiao, D., 'A Formal System for Information Retrieval
from Files', Communications of the ACM, Vol. 13, #2, Feb
1970.
26) Huang, J.C., 'A Note on Information Organization and
July
#7,
Vol. 16,
Communications of the ACM,
Storage',
1973.
Approaches
'Some
27) Langefors, B.,
Information Systems', BIT(3), 1963.
to
the
Theory
of
28) Langefors, B., 'Information System Design Computations
using Generalized Matrix Algebra', BIT(5), 1965.
File Structures
29) Lefkovitz, D.,
1969.
Washington,
Press,
Spartant
for On-line
Systems,
Bibliography
PAGE 154
of Data Base Characteristics
30) Lowe, T.C., 'The Influence
and Usage on Direct Access File Organization', JACM, Vol.
15, #4, Oct. 1968.
31) Lowenthall, E., ' Functional Approach to the Design of
Storage Structures for Generalized Data Management Systems',
Ph.D. Thesis, U. Texas, Austin, Aug. 1971.
32) Luz,
Indices',
1970.
V.Y., 'Multi-attribute Retrieval with Combined
Nov.
Communication of the ACM, Vol. 13, #11,
'Key-to-Address
Dodd, N.,
Yuen, P.S.T.,
33) Lum, V.Y.,
Transformation Techniques: A Fundamental Performance Study
on Large Existing Formatted Files', Communications of the
ACM, Vol. 14, #4, Apr 1971.
S.E.,
34) Madnick,
McGraw-Hill, 1974.
Donovan,
J.J.,
operating
Systems,
'Toward the Automatic Design of Data
W.A.,
35) McCuskey,
Processing
Information
Large Scale
for
organization
Systems', Ph.D. Thesis, Case Western, Jan. 1969.
Design of
Automatic
'On
McCuskey, V.A.,
36)
Organization', Fall Joint Computer Conference, 1970.
Data
37) Mullin, J.K., 'Retrieval-Update Speed Tradeoffs Using
Combined Indices', Communications of the ACM, Vol. 14, #12,
Dec. 1971.
'Attribute Based File
38) Rothnie, J.B., Lozano, T.,
Organization in a Paged Memory Environment', Communications
17, #2, Feb 1974.
of the ACM, Vol.
39) SagamangJ.P., 'Aitomatic Selection of Storage Structure
Masters Thesis,
in A Generalized Data Management System',
UCLA, 1971.
40) Severence, D.G., #Identifier Search Mechanisms: A Survey
and Generalized Model', Computing Surveys, Vol. 6, #3, Sept.
1974.
41) Schachat, I.J., 'A Parameterized Model for Selecting the
Multi-attribute Retrieval
in
Optimum File Organization
Systems', Masters Thesis, Sloan School of Management, MIT,
June 1974.
41)
Shneiderman,
B.,
Scheuermann,
P.,
'Structured
Data
Bibliography
Structures',
1974.
communications of
the ACM,
PAGE 155
Vol.
17,
#10,
Oct.
Data Base Reorganization
'Optimum
Shneiderman, B.,
42)
Vol. 16, #6, June 1973.
ACM,
the
of
Communications
Points',
'A Stochastic Model for the Evaluation of
43) Siler, K.F.,
Large Scale Data Retrieval Systems...', Ph.D. Thesis, UCLA,
1971.
Wallace, R.M., 'Janus: A Data Management
44) Stamen, J.P.,
far the Behavioral Sciences', Cambridge
System
and Analysis
Project, Cambridge, Mi.
'Self Organizing Data
45) Stocker, P.M., Dearnley, P.A.,
Vol. 16, #2,
Journal,
Computer
Management Systems', The
1973.
'A Methodology for Comparison of File
46) Winkler, A.,
organization and Processing Procedures for Hierarchical
Storage Structures', Ph.D. Thesis, U. Texas, Austin, Aug.
1970.
'Management Information Systems - A
47) Ziering, C.A.,
Comparison of the Network and Relational Models of Data',
Masters Thesis, Sloan School of Management, MIT, June 1975.
PAGE 156
Appendix 1
with
deals
appendix
This
the
third
of
procedure
normalization.
presented here is
The alogorithm
detail
that
functions
the
of truth
by a set
driven
mutual
and
functional-
dependencies existing in the data. Notice that computational
gt zonsidered in
dependencies are
presented here.
the alogorithms
The issue of computational dependency is
as to
relevent decisions
in.
any of
belongs
a domain
which relation
not
We proceed now with the algorithm.
1 Apply transitivity to all mutual dependencies
if
Ie: VIaI,IbI,IcI
$8m(IaI,b)=1
and $8m(IbI,cI)=1 then
set $8m(IaI,IcI)=1.
Combine
all functional
with
dependencies
first
the same
list, and remove those with duplicate first lists.
Ie:
VIaI,Icl
jal=IcI, set
where
S8f(IcI,Idj)=1,
domains
$8f(IIIdI)=O.
to include
the
2
and
and set
set $8f(aI,e)=1 where teI=tbIUjdI,
$8f(lal,IbI)=0 and
dependent
$8f(IaI,IbI)=1,
Expand functionally
functionally
dependent
domains of all mutually dependent (sets of) domains.
Appendix 1
$8z (I a Isbi
where
Vialsibi
Ie:
PAGE 157
)=1,
$8f(laisiri)=1
and
$8f(ibijsis)=1:
modify Irl to jr'I, and is1 to is'I where
ir1I=Is'l= jrl U ist
$8f(aI,ir I)=1 and $8f(IbIis'1)=1.
This yields:
3
Remove dependencies on 2artial candidate keys.
(Underlined domains are
domains is
of
primary keys; if
underlined set of domains is
relation, then
any one
underlined in
than one set
more
each
a candidate key.)
where $8f(jal, IbI)=$8f (Icl,IdI) =1:
VIbisIdi
fbi N Idj=ibt and jai N IcI=IaI then :
If
-
Idj=idJ
Now
and
lxl=lIxl
-
9bi
V1xl
such
that
IxI)=1 and $8m(IrI,Ic!)=1.
$8f(jr,
4
ibi,
set
up
relations in
third
forms
normal
in
the
following two steps:
4.1
VibIIdI where $8f(IalbI)=1 and $8f(Icl,1 dl)=1:
Does some relation already set up
contain both jal and Ibi,
or both Ic! and Idj ?
yes: then
continue
mark that
with
a
functional dependency
different
one.
Ie:
as 'done',
Find
some
and
other
Appendix 1
PAGE 158
JalIbIIcl and idi.
d=0 ?
(0) No: Is Ibi N
set
and
RRi(Jaj,lbI)
relation
up
dependency $8f(lal,b)=1
relation',
not 'have
Yes: If $8f(ja,IbI)=1 does
(1)
mark
then
functional
'have relation',
as 'done' and
and continue
(2)
V(Ibl N IdI)/g0' do the following:
No:
Is $8m(lal,IcI)=1 ?
(3) Ie:
set
up relation
RRi(IAI,
"c1, lxi)
where
Ixl=iblUldi
Mark the functional
dependency
$8f(IcidI)1 as 'done'
of $8f(iaIIbI)=1 and $8f(Icl,dI)=1
and both
as 'have
relation'.
(4) No:
is Ibi N
set up 3 relations:
(5) Yes:
RRi(lal,
lbi),
RRj(lI,
Idi),
RRk (12i) where
lei is
If
c=P ?
and
leIlbI N idi
already in
some
RRm, m/i
and m/j,
then
delete RRk.
Mark those functional dependencies as 'done'
relation'.
(6)
No:
is
(7) Yes:
Is
IcJl=cl N Ibi ?
there is transitive dependence.
$8f(IaI,IbI)=1 'done'
?
and 'have
Appendix 1
(8)
!2: set
PAGE 159
1b lb' i where
4.1 for
restart step
-
lb'l=b1
Idi
and
dependency
this functional
set
i&.:
(9)
set
strike all
lbl=lb'l where lb'l=lbl -
domains Idi from
Idi
ad:
set up
the relation
$8f(taJ,lbl)=1. Restart
for functional dependency
step 4.1 for that functional dependency.
(10)
establish relations:
No:
Ibi) and
RRi(aji,
RRJ(ICi-Pldi)
those functional
Mark
as 'done'
dependencies
and
'have relation'.
4.2
Vlal,IbI
(created
(11)
N2:
is
where $8m(Ial,Ibl)=1,
in step
(4.1)
both jai
containing
some relation
there
and Ibi,
eg:
set up relation RRk (Jai, ibN)
(12) Yes: no action
4.3
For
all relations
$8m(laI,Ib)=1
RRi.
for any
RRi
created
lal,lbl*RRi, then
Examples are presented below
described above.
in
Decision points in
4.1 and
4.2,
remove Ibi
if
from
to clarify the procedure
(4.1) and
(4.2) above
Appendix 1
PAGE 160
the examples that follow.
have been numbered for use in
instances of 'dpn' in the examples mean 'decision
functional
expressing
point n'.
the examples is
Other notation that will appear in
that for
dependencies
mutual
and
All
diagramatically rather than in the form of truth functions.
will
'->'
(A,B)->(C) means that
'< --- >
For
example
C is functionally dependent
on A and
dependency.
example,
dependency.
functional
imply
implies
mutual
(AB)<--->(CD) means that
(A,B)
and
For
(C,D) are
mutually
dependent.
example 1
or:
(P) <---> (0, S
(P)
->R
or
(Q)
-R
or
$8m(IPI,Q,SI)=l
$8f(IPI,IRI)=1
:$8f(IQIrlR I)=1
Step 1: No actio n
Step 2:
set up (d)
Step 3:
Using (b)
thus, no action.
$3f(IQ,SIIR1)=1
and (c):
IRI N IRI=IRI,
and IPINIQI=W
Appendix 1
Using (b)
IRINIRI=IR I andtPjNIQ,Sj=Q
and (d):
161
PAGE
thus, no
action.
Using (c)
JR1 N IRI=IRI and IQINIQ,SI=IQI thus:
and (d):
liL
in
IRI=IRI-IRI=0
which
means
that
(d)
becomes
$8f(IQ,SI,<null>), which must be 0 (See pagexx Chapter IV)
Also: $8m(IPjIQ,SI)=1, so strike IRI from
(b)
as well.
We now have:
a)
$8m(IPIJQ,SI)=1
c) $8f(IQIIRI)=1
(d) are 0.
Both (b) and
Since (c) is
Step 4.1:
<
N IbI=0 VIbI
the only functional dependence, IRI
(c) take dpi:
RR1(Q, R)
Step 4.1 complete.
Step
4.2:
Since
$8m(IPIIQ,SI)=1,
relation, take dpll and set up:
RR2(P, %2)
Thus have relations:
RR1(Q, R),
RR2(P,
Step 4.3:
and
9Al)
No action
and
RR1 is
the
only
Appendix 1
PAGE 162
Example 2
i)
(A,B,C)<--->(D,E)
ii)
(D,E)<--->(GK)
or: $8m(ID,EI,IG,KI)=1
iii) (A,BC) -- >(YZ)
(G,K)
iv)
or: $8f(IG,K I,IXI)=1
$8m(IBCjl,13,Kj)=1
(v)
2:
lai=IA,B,CI and
for
becomes:
get
or: $8f(IA,B,CI,IY,ZI)=1
Apply transitivity to get:
Step 1:
Step
-- >(X)
or: $8m(IA,B,CIID,EI)=1
$8f(IA,B,CIIY,ZI)=1
(vi)
Ibl=ID,EI
(i),
(ie: no change),
and
$8f(ID,EIIY,EZ)=1
For |al=ID,EI
and Ibj=IG,KI from
(vi) becomes
$8f(ID,8I,IY,Z,XI)=1,
(iv)
becomes
(ii),
and
$8f(IG,Ki,IX,Y,ZI)=1
Por Iai=IA,B,Cl and
(iii)
(iii)
from
becomes
JbI=IG,KI from (v),
$8f(IA,B,ClIY,Z,XI)=1,
unaffected.
We now have:
i)
$8m(IA,B,ClID,E)=1
ii)
$8m(ID,EIIG,KI)=1
V)
$8m(IA,B,CIGLK)=1
iii) $8f (IA,B,C I, If,3,X1) =1
and
(iv)
Appendix 1
iv)
$8f(IG,KIIX,Y,ZI)=1
vi)
$8f(IDEIJY,Z,XI)=1
Step 3:
Since
IA,B,Cl N )G,KI=W,
N ID,E=W
IA,B,CI
and
PAGE 163
IGKI N ID,EI=O
no action in step 3.
Using (iii)
Step 4.1
get lal=IA,B,Cl,
and
(iv):
IcI=IG,Ki,
$8m(IaI,IcI)=1 from
Ibi N jdI/0, so take dp2.
set up RR1(AxBsg,
G!K,
IbI=IY,Z,XI and Idl=IX,Y,ZI
X,Y,Z),
and
mark (iv)
(v) so dp3.
as 'done'
and
'have relation'. Mark (iii) as 'have relation'.
Using (iii)
and (iv):
IcI=ID,EI, IbI=IY,Z,X
jaJ=jA,B,Cj,
and Idi=IY,Z,XI.
Ibi N IdI$0
$8m(Ial,IcI)=1 from
so take dp2.
(i),
so take
dp3.
Note
that (iii)
'have
relation',
so
simply
add Icj
and
id'I=IdI N JbI to RR1
Ie: get
RR1(APC,
§gK,
D,
Mark (vi) as 'done' and 'have
XY,Z)
relation'. Since there are no
further functional dependencies that are 'not done', proceed
to next step.
Step
4. 2:
Since
for
each truth
function
of
the
form
Appendix 1
$8f(laJ,ibi)=1
(i), (ii) and
(ie:
relation (viz:
RR 1) zontaining
lal
PAGE 164
(iv))
exists some
there
and Ibi,
take decision
point (12).
Step 4.3:
No action
Thus we have relation:
RR1(AgeC,
B
G1jA,
X,Y,Z)
Example 3
or: $8f(IA,B,Ci,,i D,E,FI)=1
(A,B,C) -- >(D,E,F)
i)
-- > F
ii) (DB)
or:
$8f(ID,EIIFI)=1
Step 1:
No mutual dependencies,
so no acti on.
Step 2:
No mutual dependencies,
so no acti on.
Step 3:
and Idj=FIj,
for ibI=ID,EI
Ibi N
IdI=9, so
no
action.
4.1:
Step
From
and
(i)
(ii)
:
lal=IA, B,CI,
lbl=ID,E,FI and idl=IFI.
Ibi N ldl/0
so take dp2.
$8m(Ial,lcj)=O so take dp4.
Ibi N jcl/0 so take dp6.
tcl N lbl=ID,El = Ic
(i)
is
'not done',
(i)
becomes
so take dp7
so take dp8.
$8f(IA,B,CI,D,E)=1, and
(ii) is unchanged.
Icl=ID,EI,
Appendix 1
PAGE 165
Restarting Step 4.1:
from
(i)
and
(ii):
ja=jA,B,CI,
icl=ID,Ei, IbI=ID,E
and
Idj=jFj.
Ibi N Id1=0 so dpl.
Set up realtion RR1(Aegj#,
DE), and
mark (i)
as 'done'
and
as 'done'
and
'have relation'.
From (ii):
Ibi=IFi,
lal=ID,EI,
IcI=Idj=<null>.
JbI N IdI=0 so take dpl.
Set up
'have
relation RR2(QLa,
F) and
mark
(ii)
relation'.
Step 4.2:
No action
Step 4.3:
No action
So we have relations:
RR1(A,B,,
DE), and
RR2(Qgg, F).
Example 4
i)
(A,B) -- > C
ii)
(D,E) -- >(CF)
Note that
or:
$8f(IA,BI,ICI)=1
or: $8f(ID,EI,IF,Cl)=1
(A,B)<--/-->(D,E)
ie: not mutually dependent.
Step 1: No mutual dependencies, so no action.
Appendix 1
Step 2: No mutual dependencies,
PAGE 166
so no action.
Step 3: No action
Step 4.1:
from (i) and
(ii):
lal=IA,BI,
IbI=IC i, IcI=D,EI
and IdI=IF,CI.
Ibi N IdIj0, so dp2.
$8m(IA,B,CI,ID,EI)=O, so dp4
Jbi N Icl=0 so dp5.
Set up relations:
RR1(AgB, C),
RR2(ag, F,C) and
RR3(C).
Mark (i) and (ii) as 'done' and 'have relation#.
Step 4.1 complete,
since all are 'done'.
Step 4.2:
No action.
Step 4.3:
No action
Thus,
we have relations:
RR 1 (.A
RR2(Dja,
,
C),r
F,C)
and
RR3(C).
This concludes the examples.
Appendix 2
In
the examples
and definitions
PAGE 167
that follow
relation names of the form : 'R<i>'.
This is
we will
use
for convenience
only; any character string may be used for a relation name.
Notation.
R<i> is the name of the i th relation
< means 'is a member of'
a list,
J....1 implies
or set
of the
items between
I ' s.
c(i) is the cardinality (number of entries) in R<i>
n (i)
is
the degree (number of domains) in
d(i,j) is
the
v(m) (i,j)
is
t (i)
is
ie:
th domain of R<i>,
(v (a)
t(i)
is
j=1,..n(i)
the m th value of d(i,j),
an n(i)-tuple in
a
L (jai)
j
(i,
R<i>
m=1,..c(i)
R<i>
1),v (a)
(i,#2),..v(a)
(i ,n (i))
1,...c(i)
the length of list
a
3 is the null set - ie: R<i>=W implies c(i)=O
atb means a is
a subset of b (a=b is legal)
aC-b means a is a 2r2p2
subset of b (afb)
Va means for all valaes of a
Examples
the
Appendix 2
The following examples will be
PAGE 168
used throughout this section
to explain deifnitions.
(NAME,
SOCS EC,
P HONE,
DEPT#)
213-07-1666,
232-1500,
15),
(Donovan,
621-49-2990,
6 17-1400,
15),
(Granger,
413-00-0299,
536-5176,
6),
(Smith,
839-41-6942,
253-1410,
6))
(NAME,
SOC SE C,
PHONE,
217-- 1-7322,
253-6671,
15),
(Smith,
213-07-1666,
232-1500,
15),
(Donovan,
621-49-2990,
617-1400,
15))
R1=( (Smith,
R2=( (Madnick,
AGE,
CITY)
R3=( (Madnick,
31
Peabody),
(Donovan,
34,
Ipswitch),
(Smith,
23,
Boston))
(PERSON,
(NAME,
R4=( (Smith,
PHONE)
232-1500),
(Donovan,617-1400))
DEPT#)
PAGE 169
Appendix 2
R5=(
(PERSON,
AGE,
CITY,
(Madnick,
31,
Peabody,
18),
(Donovan,
34,
Ipswitch,
43))
STREET_#)
Definitions
Symbol:
1) Unjion
U
(j=k is valid)
R<i>=R<j> U R<k>
Format:
c(i)=c(j)+c(k)-c(Rj N Rk)
n (i)=mat(n(j) ,n(k))
R<i>=
It(i)
< R<j>, OR
R5 = R1 U R2
Example:
R5=(
:t(i)
< R<k>l
t(i)
would yield:
213-07-1666,232-1500,15),
(Smith,
(Donovan,621-49-2990,617-1400,15),
(Granger,413-00-0199,536-5176,
,839-41-6942,253-0410,
(Smith
6),
6),
(Madnick,217-61-7232,253-6671,15))
2) flatersection
Format:
R<i> = R<j> N R<k>
(Note that if
R<i>
=
Symbol:
n(i) = n(j)
=
(i=j=k is valid)
n (j)/n(k), then B<i>=V)
: t(i)(j
It(i)
N
n(k)
ANp t(i)<kI
Appendix 2
Example:
R6=(
R6 = R1 N R2
PAGE 170
yields:
(Donovan,621-49-2990,617-1400,15),
213-07-1666,232-1 500, 15))
(Smith,
Format:
(Note: If
-
Symbol:
3) Difference
R<i> = R<j> - R<k>
n(j)/n(k) then:
n(i)=n(j)
c(i)=c(j)
)
R<i>=R<j>
n(i)=n(j)=n(k)
- c(k)
c(i)=c(j)
c(R<j> N R<k>)
-
R<i> = it(i) : t(i)*R<j> AND t(i)fR<k>I
Example:
R6=(
R6=R1 - R2
(Granger,413-00-0029,536-5 176,
6),
839-41-6942,253-0410,
6))
(Smith,
4)
yields:
Cartesian Product
Symbol:
(Sometimes called a 'Cardinal
Format:
(Note: if
must
R<i> = R<j> X R<k>
n(j) >
be treated
n(j)=n(k)=1.
)
n (i)=n(j)+n(k)
X
Product')
(j=k is valid)
1, or n(k) > 1, then
as a
single domain,
each t(j)
so that
(or t(k))
effectively
Appendix 2
PAGE 171
c (i)=c(j) .c(k)
Va
jji
a set of ordered pairs with
first
R<i> = I (v(a) (j,1),v(b) (k,1))
ie: R<i> is
Vb * k,
member from
R<j> and second from R<k>.
R5=(
yields:
R5 = R4 X R4
Example:
((Smith
,232-1500), (Smith
((Smith
,232-1500),
(Donovan,617-1400)),
,232-1500)),
((Donovan,617-1400),(Smith
((Donovan,617-1400),
5) Projection
(Donovan,617-1400))
)
P
Symbol:
R<i> = R<j> P
Format:
,232-1500)),
(d(j,1)), 1
11,2,...n(J)I
n (i)=L(l)
redundent
that
(Note
c(i)=c(j)
automatically deleted as proposed in
entries
Example:
R5=(
d(j,l) : 1
1,2,...n(j)
yields:
R5= R2 P (NAME,PHONE)
(1adnick,253-6671),
(Smith,
232-1500)
(Donovan,6 17-1400)
6) Join
Format:
Symbol:
R<i>
=
)
*
R<j>((d(jl)))
0 ::= > I < I = |,@
*
not
some versions. Use the
'compaction' operator to remove redundent entries.)
R<i> =
are
R<k>((ed(km)))
Appendix 2
15Ca1
1, 2,. .. n (J)
m 41l
1,2,....n (k)|
and d(j,l)
and d(k,m)
must be
PAGE
of the same data
type
172
(ie:
must be joinable).
n (i)=n(j)+n(k)-1
but we ignore
is duplication when a not '=',
There
is '='.
(g duplication of the join domain when a
that rare case here.)
c(i)=c(j)+c(k)-c(v(a) (j,1))=v(b)(k,m)),
a=1,..c 0)
b=1,..c(k))
R<i> =
Vb * j, Va * k,
I d(jb),d(ka)
v(g) (j,1) a v(d) (k,m); Ig 4 j,
Example 1)
yields:
AGE,CITY)
DEPT#,NAME,
PHONE,
(217-61-7232,253-6671,
15,
Madnick,
31,Peabody),
(213-07-1666,232-1500,
15,
Smith,
23,Boston),
Donovan,
34,tpswitch)
(621-49-2990,617-1400, 15,
Example 2)
R6=(
Vd * ki
R6=R2(NAME) * R3(=,PERSON)
(SOCSEC,
R6=(
but afam
R6=R3(CITY)
(NAME,
AGECITY,
(Madnick,
31,
*
R4(>,NAME)
Peabody,
617-1400),
this example makes
included simply to illustrate
7) Composiio
Format:
yields:
PHONE)
(Donovan, 34, Ipswitch,617-1400)
(Note that
)
no intuituve sense;
the use of 1*
Symbol:
R<i> = R<j>(I(j,l))
)
.
R<k>(d(ka))
when 8 /
it was
'='
)
Appendix 2
PAGE 173
1 * I 1,...n(j)
S*
1 1,...n(k)
d(km)
d(j,l) and
(ie: of the
must be joinable
same data
type)
n(i)=n(j) + n(k)
-2
c(i)=c(j)+c(k)-c(v(a) (j,1) = v(b) (km); a=1,...c(j);
b=1,...c(k))
( R<j>(d(jl))
R<i> =
Vb<j,
(ie:
*
) P (d(j,b));
R<k>(d(k,m))
except b/l
remove domain
d(j,1)
R<j>
on which
and R<k>
were
joined.)
Example:
R5=R2(NAME)
(SOCSEC,
.
PHONE,
yields:
R3(PERSON)
DEPT#,AGE,CITY)
R5= ( (217-61-7232,253-6671,15,
31,Peabody),
(213-07-1666,232-1500,15,
23,Boston),
(621-49-2990,617-1400,15,
34,Ipswitch)
8) Permutation
Format:
Symb3l:
)
M
R<i> = R<j> M (d(j,l));
1=
1,...n(j)
n (i)=n(j)
c(i)=c(j)
The only effect of this operator is
to re-order the domains
in a relation.
Example:
R5 = R3 M (PERSONCITY,AGE)
yields:
Appendix 2
PAGE 174
R5= ( (MadnickPeabody,31),
(Donovan,Ipswit:h,34),
(Smith,
Boston, 23)
9) Compaction
Format:
Symbol:
C
R<i> = C (R<j>)
(i=j is valid)
n (i)=n (j)
R<i> = I t(b) :t(b)/t(a)
This operator simply
;
a/b I
(Qg: R<i>=R<j> N R<j>)
removes all redundent entries
from a
relation.
10)
Restriction
Symbol:
10.1) Diadic restriction:
Format: R<i>=R<j>(d(j,1))
R R<k>(e,d(k,m)); 1 e1,...n(j)j
a iS1,1,...n(k)
where: L(1)=L(m),
and
n(k) <= n(j)
then n(i)=n(j)
a ::= >
R<i> =
I < I = I ,'
It(j)
: v(a) (j,f)
8 v(b) (kg) VfIl, Vgfm, Va~j, Vb4k
a=1,...c(j);
Example 1)
R6 =
b=1,...c(k)
R2(NAME,PHONE) R
yields
R6=( (Smith
,213-07-1666,232-1500,15),
(Donovan,621-49-2990,617-1400,15)
I
R4((=,NAHE),(=,PHONE))
PAGE 175
Appendix 2
Example 2)
R6=(
R6=R2(PH3NE) R R4(>,PHONE)
yields:
(Madnick,217-61-7232,253-6671,15),
(Donovan,621-49-2990,617-1400,15)
(Note: t(1) of R6 appears
)
because 253-6671 > 232-1500. the
fact that 253-6671 < 617-1400 does not affect this.)
10.2) Monadic restriction:
Format: R<i>
R<j>(1(j,1))
=
R (9,d(jm))
1 40I1,...n(j)I
a SIl1,...n (k)I
L (1)=L(m)
0 ::= > I < I = 1 ,'
n(i)=n(j)
R<i>
=
It(j) : v(a) (j,f) 9 v(b) (j,g),
Example:
R6=(
11)
R6 = R10(A3E) R (<,STREET#)
f41,Yg4m,Vab 4 j I
y ields:
(Donovan,34,Ipswitch,43)
Symbol:
Division
Format: R<i>
=
/
R<j>(d(j,1))
/ R<k>(d(km))
;
1Ea1,...n(j) I
assl1,...n(k) 1
This operator is the inverse of the cartesian product; ie:
Appendix 2
(R<j> X R<k>)
Example:
R5 /
/
R<k> = R<j>
Using R5 of
R4 = R4
(4) above:
PAGE 176
Appendix 3
PAGE 177
Decision Variables.
This appendix
used
that are
decision variables
lists the
throughout this thesis. They are listed in order of rule# as
outlined in Chapter IV.
Rule 1
i used
of relation
domain j
1) $lrd(i,j)rsen
as the
only egui-gualifier (<qualifier type> 'e') in
retrieval request; no joins.
2) $lrd(i,j)rsej
except join
(1),
as
same
involved in
resolving request.
3) $lrd(i,j)rcenn
domain
j
of relation
several qualifiers, as
an equi-qualifier;
other
of
and none
joins,
i used as one of
domains used
no
as
qualifiers had indexes
4)
$1rd (i,j)rceni(trss)
as (3),
same
index.
qualifier had
from
resulting
except some
'trss'
other
is size
of set
with
indexes
using domains
first.
5) $lrd(i,j)rcejn
same
as (3)
except joins
involved in
resolving request.
6)
$1rd(i,j)rceji(trss)
same as
(4),
except joins involved
Appendix 3
PAGE 178
in resolving request.
7) $lrd(i,j)rcnnn
domain
j
of relation
i used as one of
several qualifiers but not as equi-qualifier;
no
other
indexes,
domains
and
used
as
qualifiers
no joins involved
had
in resolving
request.
8)
$lrd(i,j)rcnni(trss)
same as
(4)
except
domain
j
not
used as equi-qualifier.
9)
$lrd(i,j)rcnjn
same
as (5)
j not used
except domain
as equi-qualifier.
10)
11)
$lrd(i,j)rcnji(trss)
$lrd(i,j)rnun
same as (8)
j
domain
unspecifiel
except joins involved
of
relation
qualifier (eg:
i
used
...j='all')
as
no
joins involved in satisfying request.
12)
$lrd(i,j)rnuj
13) $lrr(i)rnun
same as (11) except joins involved.
unspecified retrieval
serial retrieval
from relation i; eg:
of each entry
in relation.
No joins involved.
14) $lrd(i,j)rnuj
same as (13)
except joins involved in
request.
The
same set
<reguests> 'u'
and 'd'.
of 14
and 't,
decision variables
and
is
maintained
also for <relation
for
type>s 'v'
Appendix 3
PAGE 179
For <request> 'i', only one decision variable is maintained:
$lrr(i)inun
inserts of entries into
relation i; no joins
involved.
Rule 2
retrieval of only domain j of relation
1) $2rd(i,j)rsn
i; no joins involved in satisfying request.
same
2) $2rd(i,j)rsj
as (1) except joins
involved in
resolving request.
domain
3) $2rd(i,j)rcn
j
relation i one of several
of
retrieved; no joins involved.
4)
same as
$2rd(i,j)rcj
(3)
except joins involved in
satisfying request.
5) $2rd(i,j)r(<aggr>)sn
domain
j
some single <aggr> of
retrieval of
of relation i;
no joins involved in
request.
6) $2rd(i,j)r(<aggr>)sj
same as (5)
7) $2rd(i,j)r(<aggr>):-n
retrieval
one 3f them domain
except joins involved.
of several aggregations,
j
of relation i; no joins
Appendix 3
PAGE
180
involved.
8)
$2rd(i,j)r(<aggr>)Cj
sames
as
joins
(7) except
involved.
retrieval of whole entry from relation
9) $2rr(i)ren
i;
no joins involved.
same as (9)
10) $2rr(i)rej
The
sane
set
<request>s 'u'
of
decision
and '1',
except joins involved.
variables
and for
is
maintained
<relation type>s
for
'v' and
'd'.
For <request> 'i', only one decision variable is maintained:
$2rr (i) ien
inserts
of entries into
relation i;
no joins
involved.
Rule 3.
Only one decision variable is maintained of this type:
$3rd(i,j)d(k,m)
the
relation k via domains
number of times
j
relation i
and m respectively.
joined to
Appendix 3
PAGE 181
Rule 4
1) $4io
2)
cost per I/O operation
cost per cill
$4opc
to XRM
3) $4bfe
number of entries per IRM block
4)
number of index entries per block
$4bfx
5) $4sc
cost per byte per day of storage
6) $4t
time period since last SDS invocation
7) $4p
XRM blocksize
Rule 5
1) $5cy(i)r
cardinality of real relation i
2) $5#d(i)r
degree of real relation i
3)
$5cy(i)v
cardinality of virtual relation i
4)
$5d(i)v
degree 3f virtual relation i
5)
$5cy(k)d(<method>)
cardinality of
<method> is the method
derived relation
k.
of derivation. If the
derivation did not include restrictions, then
<method>: :=<null>.
6)
$5#d(k)d(<method>)
degree of derived relation k
Appendix 3
PAGE 182
Rule 6
$6 (j)g
number of unijue values in domain
j
Rule 7
$7r
user-supplied response-time weight factor.
Truth Functions.
j
$8d (ij)
domain
$8i(j,k)
domain k in relation
appears in relation i
j
is inverted. (For virtual
relations, $8i(j,k)=O always.)
$8p(i,j)
j
domain
is
one of the
primary key
domains of
relation i.
relation i is
$8x(i)r
$8x(i)d(<method>)
relation i
<method> is
did
a real relation.
is a
the method
of derivation.
involve
not
derived relation,
a
and
If <method>
restriction,
then
<method>:: =<nu 11>.
$8n(i,j)
domain
j
of relation i
is mandatory. le: a value
must be provided for this
domain before an entry in
relation i will be made.
Appendix 3
$8n(i,j)=1
Note
PAGE 183
Vj where $8p(i,j)=1.
(Primary key
domains are mandatory.)
domain j :ontains unique values (eg: socsec_#)
$8u(j)
$8r(i,j)
same as $8n (i,j)
name.
Also notice
$8d(i,j) Thus
except that
refers to a role
it
is
that $8r(i,j)
this is a
whether a role name is
a subset
truth function
of
that tests
in relation i.
$8_(j)<data type><starage strategy>
<data type>::=<character> I <fixed> I <float> I <vector> I
<bit>
<character>::= c
<fixed>::= x
<float>::= f
<vector>::= t(<size>)
<bit>::= b
<storage
strategy>::=<virtual>
(real
<
encoded>
(real
<
unencoded>
<virtual>::= v
<real encoded>::= e
<real unencoded>::= u
functions is
This set
of truth
domain j.
For exmaple,
if
to test
the data
type of
$8_(name)ce=l then domain 'name'
is an encoded character string.
Appendix 3
$8f(JmJ,Jn)
of
is a
the
PAGE 184
truth function that tests
domains
in
Ini
list
whether each
are
functionally
dependent on the whole list jal.
Note 1)
which
List Jai
is
members
of
not a
list
of all
list
Ini
are
domains on
functionally
dependent. Each n'4|nl may be functionally dependent
on some jxiimi
2)
also.
If
(ie:
InI=9
is
then
empty)
$8f(Jmi,jnj)=0.
$8m(pj,jIql)
is
a function that tests whether lists
mutually
are
Jgj
Ipi and
Ie:
dependent.
and also
$8f(IplPqP)=$8f(IqJIpi)=1,
$8m(lpI,Iq)
implies $8m(j,jpJ).
Transitivity also holds: $8m(Jpi ,Iq)=$8m(qi,IsI)=1
implies that $8m(JpJjsJ)=1.
$8c(tpJ,q) (<function>) is a function
(note
that q
is
not
dependent on lomains
is
defined as 'q=6.3
dependent
on
p.
which tests whether q
a list)
Ipl. For example, if
*
p' then q is
(<function>)
is
required to derive q from the list
$8od(k)
is
computationally
is
a truth function set
domain g
computationally
the
computation
of domains Ipi.
up for a request.
It
is
'1' if domain k appears as one of the object domains
in the request.
$8eq(k)
is a truth function used in requests. It is '1' if
PAGE 185
Appendix 3
domain
k appears
as
a
qualifier with
<qualifier
type> 'e'.
$8nq(k)
is
similar
to $8eg(k) except that
type> is not 'e'.
the <qualifier
Download