Network database evaluation using analytical modeling*

by TOBY J. TEOREY and LEWIS B. OBERLANDER
The University of Michigan
Ann Arbor, Michigan

INTRODUCTION
The proliferation of computerized databases and the widespread demand for timely access to data have resulted in the need to develop automated techniques for database design. In the past, database design was accomplished manually by data processing personnel through past experience and trial and error. However, the implementation of large integrated databases in organizations made it increasingly difficult for individual users to do the necessary design. A new approach was required to consider the many tradeoffs among user objectives and within the complex logical structures required to integrate the various types of data.

The stages of database design have only recently been well-defined,5 and much of that work is still done manually with some assorted design aids available at some of the stages. One of the more quantifiable stages in the database design process is the implementation of a physical database from a logical database design. Logical database design results in the definition of record types and their relationships under the constraints of a particular data model, but independent of physical structure. The design of the physical database must take into account the variations in access methods, i.e., the access paths and search mechanisms available, as well as physical clustering of records near each other within blocks or in a contiguously defined set of blocks. Considerably more insight is required in physical database design before determining whether or not it can be accomplished independently of logical database design.

Currently there exist some very general physical database designers10,13 which address tradeoffs among major classes of storage structure. On the other hand, several evaluators of physical databases have also been proposed. Some are simulation models and are quite expensive to run.2,7,9 Others are probabilistic and are limited by expected value assumptions.3,11

This paper describes the concepts leading to the development of an operational Database Design Evaluator (DBDE), which accounts for the more significant parameters affecting physical database design performance and yet is computationally fast enough for consideration as the central module of a physical designer. The DBDE is a software package which evaluates Honeywell's Integrated Data Store, or IDS.6 While the model produces "expected value" results, it goes beyond the usual expected value assumptions and considers the effects of the distribution of logical record sizes and placement. The DBDE is directly applicable to CODASYL database systems.
The most easily identifiable forerunner to the DBDE was
the general analytical method of evaluating physical database designs for single record type files proposed by
Yao.13,14 The general method decomposed all candidate organizations and access methods into several basic parameters: the number of index levels, the expected search length
at each index level and the level of usable data, and the
fraction of sequential accesses versus pointer (random) accesses at each level. All applications were assumed to consist of randomly addressed operations such as queries and
updates.
The File Design Analyzer11 extended the basic method to
consider both sequential and batched processing in addition
to random processing, and it extended the concept of the
pointer (random) access parameter to specify the degree of
randomness required by each configuration. An example of
this is the implementation of overflow in an indexed sequential organization. The pointer to an overflow record
could refer to another point on the same track, another track
within the same cylinder, a different cylinder, or even a different device. Each specification would imply a different
expected access time to the overflow area. This extension
enabled the File Design Analyzer to determine a range of
performance between "single access" (i.e., a dedicated device with no seek time, but including an average rotational
delay between consecutive sequential accesses) and "multiaccess" (i.e., a shared device with the worst case conditions
of all consecutive sequential accesses becoming random
accesses in reality). The purpose of this dichotomy was to
place bounds on I/O service time for the extremes of one
user and many users. No attempt was made to determine
the effect on queuing delays with this model or with the
DBDE.
The DBDE does, however, represent significant improvements over the previous models in the following areas:
1. Computation of database overflow probabilities on the basis of database growth and distribution of record size, and determination of the effect of increasing overflow on the I/O service time required to execute various data manipulation operations.
2. The effect of buffering on I/O service time.
3. The modeling of "currency" and dependent sequences of operations performed on network databases.

Other analytical models have been confined to storage structures for flat files or hierarchical systems, but have not considered network databases. The proliferation of CODASYL DBTG or similar implementations for network databases has increased the need for network evaluation models.
* Supported by the Defense Communications Agency, Contract Number DCA 100-75-C-0064, for the WWMCCS ADP Directorate, Reston, Virginia.
INTEGRATED DATA STORE (IDS)
The Integrated Data Store (IDS) is a host language Database Management System which interfaces extensively
with COBOL. It was developed in 1963 by General Electric
and is now being used on the Honeywell 6000 series and
other configurations. It is widely regarded as a forerunner
of the CODASYL DBTG concepts, and is very close to the
specifications of the 1971 DBTG report. 4 It was developed
to organize information so as to minimize redundancy and
duplication of records or data, provide a single database for
many applications, store and retrieve logically related data
in a cohesive manner, and provide an efficient query capability.6
The structure of IDS may best be summarized in terms of record classification (the interrelationships between records and access mechanisms) and chain classification.
Record classification
A record may either be a master record or a detail record.
Masters and details are related to each other by means of
chains. A chain may contain only one master record. However, a record may be the master or detail of more than one
chain, and a record may be a master in one chain and a
detail in another chain simultaneously. All IDS records of
a specific record type are fixed length and fixed format.
Different record types may have different lengths and formats.
Another form of record classification is by the retrieval
access mechanism. Three methods are allowed in IDS:
1. Primary records-retrieved by reference codes which
point directly to the physical page and line number on
which the record resides. The page is the basic unit of
transfer between secondary storage and main storage.
It is a physical block of fixed size within a subfile (subfiles are somewhat analogous to DBTG areas) that may contain many record types simultaneously. PAGE RANGE is a feature of IDS that specifies a set of physically contiguous
blocks (pages) to contain records of a given type or
types.
2. Calculated records-retrieved by calculating a page address (via a hash code) from a particular key value. A sequential search in the page, and possibly an overflow page, yields the desired record.
3. Secondary records-These are records that can only be retrieved by accessing the master records and then traversing the chain to which they belong. Often it is possible that the immediate master cannot be accessed directly. In this event, master records that are on a higher level than that of the required immediate master need to be accessed. The search is then performed from these higher level records down to the immediate master in question and then finally to the desired secondary record. Such higher level masters are known as entry points and are accessed via a primary record operation, a calculated (CALC) record operation, or a RETRIEVE master record operation.

Chain classification

In IDS a record may belong to any number of chains. If retrieval of a record is specified by a particular chain, it is retrieved via that chain. If the chain through which the record is to be retrieved is not specifically stated, the record is retrieved via its prime chain, which is declared a priori (in the RETRIEVAL VIA clause). This manner of retrieval by means of a chain applies only to the secondary records.

Finally, a chain may be implemented as any combination of the options chain PRIOR (backward pointer) and chain MASTER (pointer to the master record). Every chain has a NEXT (forward) pointer.

MODEL PARAMETERS

DBDE inputs

Categories of inputs and outputs are summarized in Figure 1.

Figure 1-The data base design evaluator (DBDE). [Diagram: user commands plus the database description, workload specification, and hardware characteristics enter through the user interface; the IDS (network) storage structure alternatives are evaluated; the outputs are main storage required, secondary storage required, I/O service time, and CPU time.]

a. Database Description

The IDS model requires the definition of master records and detail records of chains, the page size and page range of subfiles, and the record lengths and page ranges of all record types (see Table I for an example).

b. Workload Specification
The model identifies applications in terms of specific IDS
operations called: RETRIEVE, HEAD (i.e., find master
record), STORE, DELETE, MODIFY, and QUERY. An
application can be defined by a sequence of related retrieval
operations, which traverse the network in various ways,
ending with a specified operation on the desired record(s).
It is also possible to specify location mode in terms of
database entry points and type of access method, and the
PLACE NEAR option. PLACE NEAR allows the clustering of detail records near the master records with which they are most frequently processed.
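As a concrete illustration, the sketch below shows one way such a workload specification might be expressed as data. The structure, field names, and op codes are hypothetical, not DBDE input syntax; the operation sequence mirrors the retrieve-and-modify application evaluated later in Table II.

```python
# Hypothetical encoding of a DBDE application: a sequence of related IDS
# retrieval operations ending with the operation on the desired record.
application = [
    {"op": "RETRIEVE", "record_type": 1, "via": "CALC"},            # entry point
    {"op": "RETRIEVE", "record_type": 7, "via": "CHAIN", "chain": 8},
    {"op": "MODIFY",   "record_type": 7},                           # target operation
]
runs_per_time_unit = 5  # frequency with which the application is executed
```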
c. Hardware Characteristics
Hardware specifications are, for the most part, limited to
the timing characteristics and capacity levels of the secondary storage subsystem. It is assumed that only one type of
secondary storage device is used, although the alternative
types of devices could be evaluated by reinitializing these
parameters. In IDS, allocation across several devices is possible.
Other input parameters which fall loosely under the category of hardware are the estimated CPU time required to
initiate I/O starts and completions, the CPU time for moving
data within main storage, and the main storage required for
support software plus user applications.
One parameter that falls into none of the above categories is the request that the effect of overflow on performance be evaluated. The analyst may specify any number of time periods for which overflow performance data is to be computed,
with the database growth rate implicitly specified by the
current database size and the number of record additions
per time period. All computations in the DBDE are based
on a unit time period. The analyst may choose the most
meaningful time unit for his analysis, and merely needs to
be consistent with that time unit when specifying workload
parameters and overflow time periods.
TABLE I-Input data for test network database

HARDWARE SECTION

Secondary storage device              Honeywell 6060/DSS191
Average seek time                     30 millisec
Adjacent cyl. seek time               10 millisec
Avg. rotational latency               8.35 millisec
Full rotation time                    16.7 millisec
Bytes per word                        6
Transfer rate                         1,074,000 bytes/sec
Cylinders per disk pack               404
Tracks per cylinder                   19
Bytes per track                       11,904
Block gap size                        12 bytes
CPU time for I/O start + end          1 millisec
CPU time for core-core move           2.75 microsec/byte

IDS SECTION

IDS DATA DIVISION

Number of buffers (max. is 256)       50
Write verify ON?                      NO
Main storage reqmt. for software      100K bytes
CALC chain retrievals/time unit       0
RETRIEVE EACH operations/time unit    0

SUBFILE DIVISION
                      SUBFILE 1    SUBFILE 2    SUBFILE 3
First page number     1            101          151
Last page number      100          150          200
Page size             320 words    320 words    320 words
Disk number           1            2            3

RECORD DIVISION (max. 100 record types)

Record number             1      2      3      4      5      6      7      8
Record length (words)     10     90     60     35     65     25     40     75
Retrieval via             CALC   CHN.6  CALC   CALC   CHN.2  CHN.4  CHN.8  CHN.10
No. occurrences           50     50     50     50     50     50     50     50
First page number         1      1      101    151    151    1      1      101
Last page number          100    100    150    200    200    100    100    150
Workload per time unit:
  a. Retrieve direct      5      5      5      5      5      5      5      5
  b. Retrieve record      5      5      5      5      5      5      5      5
  c. Retrieve current     5      5      5      5      5      5      5      5
  d. Store                15     15     15     15     15     15     15     15
  e. Delete               10     10     10     10     10     10     10     10
  f. Modify               0      0      0      0      0      0      0      0

CHAIN DIVISION (max. 100 chain types)

Chain number              1      2      3      4      5      6      7      8      9      10
Master record             3      4      1      2      1      1      2      1      -      -
Detail record             4      5      5      6      6      2      7      7      -      8
Chain order               FIRST  FIRST  FIRST  FIRST  FIRST  FIRST  FIRST  FIRST  FIRST  FIRST
Link to next              YES    YES    YES    YES    YES    YES    YES    YES    YES    YES
Link to prior             NO     NO     NO     NO     NO     NO     NO     NO     NO     NO
Link to master            NO     NO     NO     NO     NO     NO     NO     NO     NO     NO
Record selection          CURR.  CURR.  CURR.  CURR.  CURR.  CURR.  UNIQ.  CURR.  CURR.  CURR.
Retrv. next/unit time     0      0      0      8      0      0      0      0      0      0
Retrv. master/unit time   0      0      0      0      0      1      3      8      0      0
Head opns./unit time      0      0      0      0      0      0      0      0      0      0
Place near master         NO     YES    NO     YES    NO     YES    NO     YES    NO     YES

DBDE outputs

The DBDE outputs reflect an attempt to present a fairly complete picture of resource usage and the components of response time. Enough performance data is given so the analyst can estimate the real cost of the applications evaluated.

a. Main Storage Requirement

This is a straightforward computation, adding the user estimate for object program storage space to the space required for buffers for each application.

b. Secondary Storage Requirement

In IDS this is input explicitly by the user (analyst) in terms of subfile specifications for first and last page numbers.

c. I/O Service Time

I/O time implies elapsed I/O service time, which affects either response time or turnaround time. It differs from channel time in that it always includes all the major components of I/O service: seek time, rotational delay, and data transfer. Bounds on I/O time are provided for the extremes of the "single access" and "multi-access" configurations described above.

d. CPU Time

The DBDE estimates the CPU time required for the database functions: overhead to start and complete I/O commands (for physical reads and writes), time for data movement between buffers and user work areas where applicable, database search time, and software write verification. The user CPU time to process logical records is merely echoed from the input data, and is an optional feature.
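A minimal sketch of the two I/O service-time bounds, using the device parameters listed in Table I; the helper names are ours, and a single one-page transfer is assumed. The results line up with the 10.14 and 40.14 millisecond single- and multi-access entries for a one-page retrieval in Table II.

```python
# "Single access" vs. "multi-access" I/O service-time bounds for one page,
# computed from the Table I device parameters (a sketch, not the DBDE's code).
AVG_SEEK_MS = 30.0          # average seek time
AVG_LATENCY_MS = 8.35       # average rotational latency
BYTES_PER_MS = 1074.0       # 1,074,000 bytes/sec transfer rate

def transfer_ms(block_bytes):
    return block_bytes / BYTES_PER_MS

def single_access_ms(block_bytes):
    # dedicated device: no seek, only an average rotational delay plus transfer
    return AVG_LATENCY_MS + transfer_ms(block_bytes)

def multi_access_ms(block_bytes):
    # shared device: every access degenerates to a full random access
    return AVG_SEEK_MS + AVG_LATENCY_MS + transfer_ms(block_bytes)

page_bytes = 320 * 6        # a 320-word page at 6 bytes per word
print(f"{single_access_ms(page_bytes):.2f} ms")   # about 10.14
print(f"{multi_access_ms(page_bytes):.2f} ms")    # about 40.14
```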
MAJOR COMPONENTS
An algorithm for database sizing and overflow computation
Given a database with N records of varying size, an algorithm to determine its size needs to answer the following
questions:
1. How many physical blocks (pages) will be required if
the database is organized sequentially and the ordering
results in random placement of records of a given size?
2. How many record occurrences will typically fit into a
block?
The first question is of interest in modeling the record
distribution or population of sequential and indexed sequential database organizations, which can be computed by the
DBDE. The second question is of interest in modeling the
placement of chains in IDS.
If all records were of the same size, R words, and the
physical block size were B words, then the number of records that could fit into a block would be
[B/R]-

where [expr]- denotes the "floor function" or "greatest integer equal to or smaller than" the actual value of the expression. The resulting database size would be

[N/[B/R]-]+ blocks

where [expr]+ denotes the "ceiling function" or "smallest integer equal to or greater than" the actual value of the expression.
The problem becomes significantly more difficult when
there are a variety of record sizes and not much is known
about their distribution in the database. The database size
is extremely sensitive to record placement when there are
very few record sizes, but a large difference in those sizes.
An example will help illustrate the possible extreme values. Suppose there were 1000 records of length 20 words
and 1000 records of length 90 words and a constant block
size of 100 words. If the ordering specified that all the large
records must come first and all the small records second,
the total size of the database would be:
DATABASE SIZE = [1000/[100/90]-]+ + [1000/[100/20]-]+
              = 1000 + 200
              = 1200 blocks
On the other hand, an ordering which alternated large and
small records would require 2000 blocks. Any other ordering
of records would produce a database size somewhere between these two extremes. For very large databases this
nearly 2-to-1 difference in size will greatly affect the estimated performance for many applications, and consequently
record distribution must be represented accurately in the
model.
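The two extremes can be checked directly from the floor and ceiling formulas above; the following sketch (the function name is ours) reproduces the arithmetic.

```python
from math import ceil, floor

# Blocks needed when all records of one size are stored contiguously.
def blocks_needed(n_records, record_words, block_words):
    return ceil(n_records / floor(block_words / record_words))

# Large records first, then small: 1000 + 200 = 1200 blocks.
print(blocks_needed(1000, 90, 100) + blocks_needed(1000, 20, 100))

# Alternating large and small: 90 + 20 > 100, so no block can hold a pair,
# and each of the 1000 pairs occupies two blocks: 2000 blocks.
print(2 * 1000)
```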
The algorithm assumes that the size of any record in the
database is independent of the size of its neighbors. If this
assumption is not true for sequential or indexed sequential
organizations, then the database size must be computed
manually. The IDS model allows dependent record placement to be specified in terms of page ranges. The algorithm
then enumerates the possible combinations of record sizes that would overflow a physical block and require the populating of the next block, and it obtains a distribution of unused space in each block. The expected used space in a block, called E(used), is easily computed as the difference between block size B and the expected value for unused space. The total database size is therefore computed as

DATABASE SIZE = [ Σ (Record_size_j * Number_of_occurrences_j) / E(used) ]+

where the sum is taken over all record types j, and a record type is defined by its length and number of occurrences.
The algorithm takes into account parameters such as percent fill and page overhead associated with specific storage structures in order to compute realistic values for E(used).12

As an example of the type of analysis required, let the database consist of 1000 records with an expected record size of 200 words and a block size of 320 words. The actual distribution of record size is:

Record Type    Record Size (Words)    Number of Occurrences
A              100                    300
B              200                    400
C              300                    300

For this simple configuration, a random distribution of record types across the database results in the seven possible arrangements of records in a block shown in Figure 2, occurring with probability Pi, i = 1, 2, ..., 7. From the data in Figure 2 we can compute the expected value of used space in each block and then the database size.

Figure 2-Random sequence distribution of record types in blocks. [The seven arrangements, as (probability of occurrence Pi, used space in the block): (.027, 300), (.063, 200), (.09, 100), (.12, 300), (.12, 300), (.28, 200), (.30, 300).]
E[used] = Σ (i = 1 to 7) Pi * USEDi = 247.7 words

Total data volume = Σ (j = 1 to 3) Record_size_j * Number_of_occurrences_j
                  = 100*300 + 200*400 + 300*300
                  = 200,000 words

Expected number of blocks required = [200,000 words / 247.7 words/block]+ = 808 blocks
The same number of fixed size records of 200 words each
would require 1000 blocks. Therefore a significant reduction
is possible in this simple case when the record sizes are
distributed over a larger range.
The above computation becomes extremely difficult as
the number of record types becomes large (i.e., 500 or
more), and even the computer time to exhaustively enumerate all cases is prohibitive. The DBDE uses an algorithm
to compute E(used) that reduces the computational complexity to a feasible range for 50-100 record types, which is
adequate for most IDS databases. The complete derivation
of this algorithm is given in Reference 12.
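For intuition, the exhaustive enumeration described above can be sketched in a few lines. This is the brute-force computation, not the reduced-complexity algorithm of Reference 12, and the function names are ours; run on the three-record-type example, it reproduces E(used) = 247.7 words and 808 blocks.

```python
from math import ceil

RECORDS = {100: 300, 200: 400, 300: 300}   # record size (words) -> occurrences
BLOCK = 320                                # block size in words

def expected_used(block_words, records):
    """E(used) for one block: draw record sizes independently in proportion
    to their populations and append them until the next draw no longer fits;
    that draw closes the block and starts the next one."""
    total = sum(records.values())
    probs = {size: n / total for size, n in records.items()}
    e_used = 0.0
    states = [(0, 1.0)]                    # (space used so far, probability)
    while states:
        used, p = states.pop()
        for size, q in probs.items():
            if used + size <= block_words:
                states.append((used + size, p * q))
            else:
                e_used += p * q * used     # this draw overflows; block closes
    return e_used

e_used = expected_used(BLOCK, RECORDS)                # 247.7 words
volume = sum(s * n for s, n in RECORDS.items())       # 200,000 words
print(f"E(used) = {e_used:.1f} words, blocks = {ceil(volume / e_used)}")  # 808
```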
A natural extension of the database sizing algorithm is to
consider new records being added to the database and the
probability that they will overflow the page where they
would be naturally placed. From this value one can determine the expected number of overflow records generated by
a given number of record additions over a particular time
period. Overflow in IDS significantly increases the expected
retrieval time for records.
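A hedged sketch of that extension: given some per-period overflow probability (the function below is a hypothetical stand-in for the value the sizing model derives), the expected overflow count accumulates with the record additions.

```python
# Expected overflow records after a number of time periods, assuming a fixed
# number of record additions per period.  p_overflow here is hypothetical;
# the DBDE derives it from the sizing algorithm and record-size distribution.
def expected_overflow_records(p_overflow, additions_per_period, periods):
    return sum(additions_per_period * p_overflow(t) for t in range(1, periods + 1))

p = lambda t: min(1.0, 0.02 * t)   # assumed growth of the overflow probability
print(expected_overflow_records(p, additions_per_period=100, periods=10))
```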
Buffering
The original purpose for buffering on second generation
(uniprogramming) computer systems was to provide a mechanism for overlapping CPU and I/O time and improve overall
performance of each application. Buffering for multiprogramming systems still produces an overlap of resources,
but not necessarily for the same application (job). While job
A is executing, job B may be inputting into its buffer while
job C is done with I/O and waiting for the CPU. The existence of a large number of jobs in the system causes other
delays which usually subsume the synchronous CPU-I/O
sequences of a sequential data processing operation. Consequently, the benefits of such buffering received by sequential or random database applications in multiprogramming environments are not easily measurable with
probability models, such as DBDE, but they can be better
approximated by a queuing model for the whole system.
The more directly measurable function of buffering in
multiprogramming systems is that of keeping active data in
main storage in order to reduce physical I/O operations
when data is continually referenced. The DBDE considers
this form of buffering, and it allows the analyst to specify
up to 256 buffers for IDS applications. Naturally, increasing
the number of buffers increases the probability that the next
record you need is already in main storage and consequently
decreases the expected I/O service time. However, increasing the buffer size too much may result in wasted storage,
increased cost for storage used, and main storage allocation
delays that may degrade overall performance. Therefore the
choice of proper buffer size is critical to an effective IDS
application. The DBDE implements the buffer concept by
maintaining buffer currency data based on the most recently
executed IDS operations and the database definition of
chains and record types. It also simulates the basic IDS
buffer replacement algorithm, least-recently-used (LRU).
Thus, for this part of the IDS model, the DBDE uses a
hybrid analytical and deterministic simulation technique.
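A minimal sketch of the deterministic part of that hybrid technique: an LRU buffer pool replayed against a page-reference trace, yielding the hit ratio that offsets physical I/O. The class and the trace are illustrative assumptions, not the DBDE's internals.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Toy LRU page-buffer pool (IDS's replacement policy, per the text)."""
    def __init__(self, n_buffers):
        self.n_buffers = n_buffers
        self.pages = OrderedDict()          # page number -> None, in LRU order
        self.hits = 0
        self.refs = 0

    def reference(self, page):
        self.refs += 1
        if page in self.pages:
            self.hits += 1
            self.pages.move_to_end(page)    # now the most recently used
        else:
            if len(self.pages) >= self.n_buffers:
                self.pages.popitem(last=False)   # evict least recently used
            self.pages[page] = None

    def hit_ratio(self):
        return self.hits / self.refs if self.refs else 0.0

pool = LRUBufferPool(n_buffers=3)
for page in [17, 4, 17, 9, 4, 17, 25, 4]:   # hypothetical reference trace
    pool.reference(page)
print(f"buffer hit ratio = {pool.hit_ratio():.2f}")
```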
Currency in a network database
IDS (as well as DBTG databases) has currency indicators which specify, for example, the most recently accessed record of a particular record type or the current master and detail records of a chain type. The DBDE maintains similar information
to represent the current status of access to the database.
Consequently the next operation is evaluated according to
current position and next position desired. The logical structure diagram of the database is implicitly known from the
form of the input parameters, and the algorithm consults
this information to compute I/O time and CPU overhead to
access the next record.
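The bookkeeping can be pictured as a small table of indicators, updated on every operation; the sketch below is a hypothetical representation, not the DBDE's internal one.

```python
# Currency indicators: the most recently accessed record of each record type
# and the current position on each chain type (illustrative structure).
currency = {"record_type": {}, "chain_type": {}}

def note_access(record_type, page, line, chain_type=None):
    """After an IDS operation, (page, line) of record_type becomes current."""
    currency["record_type"][record_type] = (page, line)
    if chain_type is not None:
        currency["chain_type"][chain_type] = (record_type, page, line)

# The cost of the next operation is then evaluated from the current position
# to the desired position; a reference to an already-current page, for
# instance, requires no physical read.
note_access(record_type=1, page=17, line=3, chain_type=7)
```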
EXAMPLE NETWORK DATABASE EVALUATIONS
The objectives of the test network database evaluations
were to validate the IDS model with live test data from an
existing database, to provide insight regarding the sensitivity
of database performance to the values of storage structure
and user workload parameters, and to compare the performance of alternative database designs.
The conclusions from preliminary validation tests are that
the DBDE is able to describe the major database design
parameters and their relationships accurately, that the input
data can be created easily and quickly, and that many meaningful experiments can be conducted at a single interactive
session with the program. The validation IDS database,
which represented a large equipment inventory system, was
supplied by a client U.S. government agency. The database
consisted of 155,000 records of varying length across three
major record types, and live test data was collected on an
IDS system interfaced with a Honeywell 6060 GCOS operating system for an extensive series of retrieval operations.
The test was conducted during peak hours in a large batch
multiprogramming environment. Estimates for elapsed time
for the live test data (54×10³ seconds) and the DBDE (42.7×10³ seconds) differed by 21 percent. The ability of the
DBDE to estimate total elapsed time was due to the availability of an extension which implements a central server
queuing model of the test system and estimates total elapsed
time based on CPU and I/O requirements derived by the
IDS model for the test database. Although these preliminary
results have been encouraging, we believe that the validation process should be an on-going one as more detailed monitor data is made available. Validation at the individual IDS operation level is currently not possible in the existing environment.
Example 1
The first test database chosen for experimentation was
the support database for the DBDE,1 which provided the opportunity to evaluate our own design for the DBDE before applying the technology to client systems. The test database
schema is illustrated in Figure 3. The database contains eight
record types and ten chain types. The database is quite small
(755 records), but this kept the cost down while allowing a
great deal of analysis of the model. Other input parameters
are summarized in Table I.

Figure 3-DBDE database schema. [Schema diagram: the eight record types connected by the ten chain types of Table I.]
The basic output produced by the program is illustrated
in Table II. Each operation is considered separately and
totals are produced for given frequencies of occurrence of
each operation. Table II illustrates a two-operation application that retrieves a record and modifies it. To accomplish
this task, 2 sub-operations are required for each retrieval
and 5 sub-operations are necessary for the modify operation.
This feature of providing detailed output for each operation
enables the analyst to obtain a better picture of cause and
effect relationships for overall system performance, and thus
make better decisions on parameter values to choose.
TABLE II-IDS retrieval and update application. Sample output

APPLICATION: Retrieve and Modify Record 7

                                      Record   Chain   I/O Service Time (MLSEC)      System CPU Time
ACTION                                Type     Type    Single Access   Multi-Access  Estimate (MLSEC)
Retrieve as CALC record               1        -        10.20           40.40         2.77
Retrieve as chain detail record       7        8          .07             .26          .02
Modify and delink                     7        -        50.69          200.70         8.50
Retrieve master w/o currency table    2        7        10.14           40.14         2.75
Link to chain 7                       7        7        20.28           80.28         2.00
Retrieve record 1 as CALC             1        -        10.20           40.40         2.77
Link to chain 8                       7        8        20.28           80.28         2.00
TOTAL                                                  121.86          482.46        20.81

Example 2

The estimates of database I/O requirements for several configurations are summarized in Figure 4. The purpose of this experiment was to investigate the tradeoff between I/O time and secondary storage space. The test configuration consisted of a single master record type (100 occurrences) and five detail record types (5000 occurrences each), with each detail record type contained in a separate chain type. The secondary storage space for this configuration was:

                  CHAIN NEXT          CHAIN NEXT, PRIOR, MASTER
Master records    100*(144 bytes)     100*(164 bytes)
Detail records    25000*(128 bytes)   25000*(136 bytes)
Total             3214K bytes         3416K bytes

The implementation of pointers for PRIOR and MASTER in each record represents an overall increase of slightly over 6 percent in total storage space. The results illustrated in Figure 4 show that the I/O time for most operations is independent of the chain pointer implementation. However, there is a slight increase in the STORE and a very significant decrease in the HEAD (i.e., retrieve master) operations. The results for "multi-access" were relatively the same as those in Figure 4, but all values were approximately 30 percent larger on a point-by-point basis. The increase for STORE is due to a greater number of pointers to reset, while the decrease for HEAD is due to direct access to the master record with a MASTER pointer. Overall, this experiment shows that the choice of pointer implementation depends on the types of applications to be run. Normally, a 6 percent increase in storage space is not significant; therefore, the increase in storage space appears to be less important than the effect on I/O time, which in turn affects total response time.

Figure 4-I/O service time for IDS operations with two pointer implementations ("single access" with dedicated disk). [Bar comparison, I/O service time in milliseconds (100-800), of the RETRIEVE FIRST, MODIFY FIRST, STORE, and HEAD (RETRIEVE MASTER) operations under CHAIN NEXT versus CHAIN NEXT, PRIOR, and MASTER.]
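The storage figures can be verified directly from the per-record byte counts quoted above; the following one-line check reproduces the totals and the "slightly over 6 percent" figure.

```python
# Example 2 storage arithmetic, straight from the byte counts in the text.
next_only = 100 * 144 + 25000 * 128     # masters + details, NEXT pointer only
full_ptrs = 100 * 164 + 25000 * 136     # NEXT, PRIOR, and MASTER pointers
print(next_only, full_ptrs)             # 3,214,400 vs. 3,416,400 bytes
print(f"increase = {(full_ptrs / next_only - 1) * 100:.1f}%")   # about 6.3%
```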
Example 3
We now return to the data structure for Example 1 to
illustrate the capability of the DBDE to predict degradation
in I/O service time as the probability of overflow increases.
Using the database configuration in Table I, we assume a
database growth rate (number of record occurrences of each
type) of 10 percent per time period for an arbitrary time
period length. The IDS model derives the overflow probability at discrete time period intervals for as many intervals
as the analyst wishes to specify. The assignment of a value
such as "3 days" for a time period has no effect on the
model; real time units for time periods are independent of
the overflow algorithm.
I/O service time to retrieve a record (type 7) is shown to
degrade at a nonlinear rate over 20 time periods in Figure
5. The linear 10 percent database growth rate is included for
comparison. The purpose of conducting this type of experiment is to compare storage structures, not just for the
current configuration, but also for future configurations that
involve more record occurrences and possibly a different
level of workload. Observation of the results in Figure 5
could also help determine when and how often the database
should be reorganized to clear the overflow areas.
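A hedged sketch of the shape of this experiment: a simple retrieval-cost model in which an extra access is paid whenever the target record has overflowed. The per-access constants come from Table II; the overflow-growth curve is a hypothetical stand-in for the probabilities the DBDE derives.

```python
# Expected I/O time to retrieve a record when a fraction p has overflowed
# (illustrative model; "single access" per-access times taken from Table II).
def retrieve_io_ms(p_overflow, page_ms=10.14, overflow_ms=40.14):
    # one access to the natural page, plus a random access on overflow
    return page_ms + p_overflow * overflow_ms

for t in (0, 5, 10, 15, 20):            # future time periods
    p = 1 - 0.96 ** t                   # assumed nonlinear overflow growth
    print(f"period {t:2d}: p = {p:.2f}, I/O = {retrieve_io_ms(p):6.2f} ms")
```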
Figure 5-I/O service time for the retrieve record 7 operation as a function of increasing IDS database size. [Total I/O service time in milliseconds (50-150, "single access") over future time periods (5-20); the linear 10 percent database growth line is shown for comparison.]
FUTURE DIRECTIONS
Experience in the development of the DBDE has led to
the conclusion that considerable insight can be gained into
the causes and effects of physical database performance.
Based on this experience it is felt that reasonable extensions
can be made in three major directions. The first potential
extension needed is to model the interface between the
DBMS and the operating system in its many facets: mutual
exclusion of processes attempting to read and update the
same data, queuing delays from mutual exclusion or other
resource contention, and details of operating system overhead to facilitate DBMS commands to access data. The
simulation model of Hulten and Soderlund7 has achieved
considerable success in this area.
Secondly, the evaluation can be extended to databases
based on hierarchical or other logical models of data because
they exhibit the same properties as network models: access
can be described in terms of some combination of direct
(hashing) or indexing plus a series of sequential or pointer
steps within a particular level of indexing or at the final level
of data. The generalization applies to existing search methods, indexing methods, and overflow techniques, although
the resulting implementation in the analytical model is more
arduous in some cases. At the level of detail exemplified by
the DBDE, each technique must be modeled individually,
and although models designed by Yao14 and Severance10 are
more general, they lack important detail available in the
DBDE.
The third major extension would be to incorporate the
DBDE into a physical database designer that optimizes user
specified performance measures as a function of user specified control parameters. Typically, the control parameters
would be defined as a subset of the database characteristics
currently input to the DBDE. Several of these characteristics are fixed (i.e., bound) at the logical database design
phase, and so the range of physical design control parameters is further narrowed. In IDS one could control subfile
page size, page range for subfiles and record types, PLACE
NEAR specifications, RETRIEVAL VIA clauses, and
pointer options.
ACKNOWLEDGMENTS
The research and development for the Database Design
Evaluator was conducted under Defense Communications
Agency Contract Number DCA 100-75-C-0064 for the
WWMCCS ADP Directorate, Reston, Virginia. The authors
also gratefully acknowledge the fine programming contributions made by Judy Botwick and Rick Haan.
REFERENCES
1. Bastarache, M. J. and E. A. Hershey, "The Data Base Management
System User Manual and Example," ISDOS Working Paper No. 89,
Dept. of Industrial and Operations Engin., The University of Michigan,
April, 1975.
2. Cardenas, A. F., "Evaluation and Selection of File Organization-A
Model and System," Comm. ACM, 16, 9, 1973, pp. 540-548.
3. Cardenas, A. F., "Analysis and Performance of Inverted Data Base
Structures," Comm. ACM, 18, 5, 1975, pp. 253-263.
4. CODASYL Data Base Task Group, April 1971 Report, ACM, New York.
5. Fry, J. P. and B. K. Kahn, "A Stepwise Approach to Database Design,"
Proc. ACM Southeast Regional Conference, Birmingham, Alabama,
April 1976.
6. Honeywell Information Systems, Inc. Integrated Data Store, Wellesley
Hills, Massachusetts, Order No. BR69, 1971.
7. Hulten, C. and L. Soderlund, "A Simulation Model for Performance
Analysis of Large Shared Data Bases," Proc. International Conf. on
Very Large Data Bases, Tokyo, Oct. 6-8, 1977.
8. Knuth, D. E. The Art of Computer Programming, Vol. 3: Searching and
Sorting, Addison-Wesley, Reading, Massachusetts, 1973.
9. Senko, M. E., V. Lum, and P. Owens, "A File Organization Evaluation
Model (FOREM)," Proc. IFIP 1968, pp. C19-C23.
10. Severance, D. G., Some Generalized Modeling Structures for Use in
Design of File Organizations, Ph.D. Dissertation, University of Michigan, 1972.
11. Teorey, T. J., and K. S. Das, "Application of an Analytical Model to
Evaluate Storage Structures," Proc. ACM/SIGMOD International Conference on the Management of Data, Washington, D.C., June 2-4, 1976, pp. 9-19.
12. Teorey, T. J., J. Botwick, R. A. Haan, and L. Oberlander, "Design
Specifications for the Database Design Evaluator," Data Translation
Project Working Paper DE 7.2, Graduate School of Business Admin.,
University of Michigan, 1977.
13. Yao, S. B. and A. G. Merten, "Selection of File Organization Using
Analytic Modeling", Proc. International Conference on Very Large Databases, Framingham, Massacliusetts, September 22-24, 1975, pp. 255267.
14. Yao, S. B., "An Attribute Based Model for Database Cost Analysis,"
ACM Trans. on Database Systems, 2, 1, March 1977, pp. 45-67.