Techniques for Structuring Database Records
SALVATORE T. MARCH
Department of Management Science, School of Management, University of Minnesota, Minneapolis,
Minnesota 55455
Structuring database records by considering data item usage can yield substantial efficiencies in the operating cost of database systems. However, since the number of possible physical record structures for databases of practical significance is enormous, and their evaluation is extremely complex, determining efficient record structures by full enumeration is generally infeasible. This paper discusses the techniques of mathematical clustering, iterative grouping refinement, mathematical programming, and hierarchic aggregation, which can be used to quickly determine efficient record structures for large, shared databases.
Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management--segmentation; D.4.3 [Operating Systems]: File Systems Management--file organization; D.4.8 [Operating Systems]: Performance--modeling and prediction; E.2 [Data]: Data Storage Representations--composite structures, primitive data items; H.2.1 [Database Management]: Logical Design--schema and subschema; H.2.2 [Database Management]: Physical Design--access methods
General Terms: Economics, Performance
Additional Key Words and Phrases: Aggregation, record segmentation, record structures
INTRODUCTION
Computer-based information systems are
critical to the operation and management
of large organizations in both the public
and private sectors. Such systems typically
operate on databases containing billions
of data characters [Jefferson, 1980] and
service user communities with widely varying information needs. Their efficiency
is largely dependent on the database
design.
Unfortunately, selecting an efficient
database design is a difficult task. Many
complex and interrelated factors must be
taken into consideration, such as the logical
structure of the data, the retrieval and update patterns of the user community, and
the operating costs and accessing charac-
teristics of the computer system. Lum
[1979] points out that despite much work
in the area, database design is still an art
that relies heavily on human intuition and
experience. Consequently, its practice is becoming more difficult as the applications
that the database must support become
more sophisticated.
An important issue in database design is
how to physically organize the data in secondary memory so that the information
requirements of the user community can be
met efficiently [Schkolnick and Yao, 1979].
The record structures of a database determine this physical organization. The task
of record structuring is to arrange the
database physically so that (1) obtaining
"the next" piece of information in a user
CONTENTS

INTRODUCTION
1. PROBLEM DEFINITION AND RELEVANT TERMINOLOGY
2. RECORD SEGMENTATION TECHNIQUES
   2.1 Mathematical Clustering
   2.2 Iterative Grouping Refinement
   2.3 Mathematical Programming
   2.4 A Comparative Analysis of Record Segmentation Techniques
3. HIERARCHIC AGGREGATION
4. SUMMARY AND DIRECTIONS FOR FURTHER RESEARCH
ACKNOWLEDGMENTS
REFERENCES

request has a low probability of requiring physical access to secondary memory and (2) a minimal amount of irrelevant data is transferred when secondary memory is accessed. This is accomplished by organizing the data in secondary memory according to the accessing requirements of the user community. Data that are commonly required together are then physically stored and accessed together. Conversely, data that are not commonly required together are not stored together. Thus the requirements of the user community are met with a minimal number of accesses to secondary memory and with a minimal amount of irrelevant data transferred.

This paper presents alternative techniques that can be used to quickly determine efficient record structures for large shared databases. In order to delineate the nature of this problem more clearly, the paper briefly describes both logical and physical database concepts prior to discussing the record structuring problem in detail.

At the logical level a database contains and provides controlled access to descriptions of instances of entities as required by its community of users. An entity is a category or grouping of things [Chen, 1976; Kent, 1978]; instances of an entity (or entity-instances) are members of that category (e.g., EMPLOYEE is an entity and a person who works for the company is an instance of that entity). Clearly, an entity-instance cannot be stored in a database (e.g., people do not survive well in databases); therefore, some useful description is actually stored. The description of an entity-instance is a set of values for selected characteristics (commonly termed data items) of the entity-instance (e.g., "John Doe" is the value of the data item EMP-NAME for an instance of the entity EMPLOYEE). Two types of data items are distinguished: (1) attribute descriptors (or attributes), which are characteristics of a single entity, and (2) relationship descriptors, which characterize an association between the entity described and some other (describing) entity.

Typically, each entity-instance is described by the same set of data items which, along with some membership criteria, is used to define the entity. In addition, each entity has some (subset of) data item(s), termed the identifier schema, whose value(s) is (are) used to identify unique entity-instances. The entity EMPLOYEE, for example, might be defined by the data items: EMP-NAME, EMP-NO, AGE, SEX, DEPARTMENT-OF-EMPLOYEE, and ASSIGNED-PROJECTS. The membership criterion is, of course, people who work for the company; individual EMPLOYEEs (entity-instances) are identified by the value of EMP-NO (EMP-NO is the identifier schema for the entity EMPLOYEE). EMP-NAME, EMP-NO, and AGE are attribute descriptors since they apply only to EMPLOYEE; DEPARTMENT-OF-EMPLOYEE and ASSIGNED-PROJECTS are relationship descriptors that associate EMPLOYEEs with the DEPARTMENTs in which they work and with the PROJECTs to which they are assigned. Relationship descriptors may be thought of as "logical connectors" from one entity to another.

At the physical level, data items are typically grouped to define database records (or records). In general, records may be quite complex, containing repeating groups of data items describing one or more entities. Further, for operating efficiency, these data items may be compressed or encoded or physically distributed over multiple record segments. Records are typically named
[Figure 1. A physical database with three file organizations (EMPLOYEE, DEPARTMENT, and PROJECT), each comprising a data set and its access paths; retrieval requests and database updates enter through the access paths, producing reports and other outputs. The interconnections carry the relationship descriptors DEPARTMENT-OF-EMPLOYEE and ASSIGNED-PROJECTS.]
for the entities whose data items they contain. The record EMPLOYEE, for example, might be defined as all attributes of the entity EMPLOYEE (EMP-NAME, EMP-NO, AGE, and SEX) plus the relationship descriptor DEPARTMENT-OF-EMPLOYEE. The EMPLOYEE record might, in turn, be divided into two record segments; EMP-SEG-1, containing EMP-NAME and EMP-NO, and EMP-SEG-2, containing AGE, SEX, and DEPARTMENT-OF-EMPLOYEE.
A record-instance at the physical level contains the description of an entity-instance at the logical level (i.e., strings of bits that represent the values of its data items). The set of all record-instances for a particular record is called a file. Similarly, the set of all segment-instances for a particular record segment is termed a subfile. If a record is unsegmented, it has exactly one segment (the record itself) and exactly one subfile (the entire file). Subfiles reside permanently in secondary memory in groups called data sets; that is, a data set is a physical allocation of secondary memory in which some number of subfiles are maintained. Each data set has a set of access paths (algorithms and structures) that are used to store, retrieve, and update the segment-instances in the data set. A data set with its associated access paths is termed a file organization. A physical database is a set of interconnected file organizations. Interconnections among file organizations must be maintained when the descriptions of related entities are stored in different data sets. As stated earlier, the terms record and segment define groupings of data items. For convenience, however, these terms are also used, respectively, in place of record-instance and segment-instance, provided that it is obvious from the context that instances and not their definitions are of concern.
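To keep these nested terms straight, the following minimal sketch (in Python; the class names, fields, and item lengths are illustrative assumptions, not drawn from the paper) models data items, segments, records, data sets, and file organizations as simple containers.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataItem:
    name: str
    length: int                 # encoded, compressed length in characters

@dataclass
class Segment:
    name: str
    items: List[DataItem]       # the fields of the segment's subfile

@dataclass
class Record:
    name: str
    segments: List[Segment]     # an unsegmented record has exactly one segment

@dataclass
class DataSet:
    name: str
    subfiles: List[Segment]     # subfiles maintained in this allocation of secondary memory

@dataclass
class FileOrganization:
    data_set: DataSet
    primary_access_path: str
    secondary_access_paths: List[str] = field(default_factory=list)

# Example: the segmented EMPLOYEE record described in the text (lengths are assumed).
emp_seg_1 = Segment("EMP-SEG-1", [DataItem("EMP-NAME", 30), DataItem("EMP-NO", 6)])
emp_seg_2 = Segment("EMP-SEG-2", [DataItem("AGE", 2), DataItem("SEX", 1),
                                  DataItem("DEPARTMENT-OF-EMPLOYEE", 4)])
employee = Record("EMPLOYEE", [emp_seg_1, emp_seg_2])
employee_fo = FileOrganization(DataSet("EMP-DS", [emp_seg_1, emp_seg_2]),
                               primary_access_path="indexed sequential")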
Figure 1 shows a physical database with three file organizations, EMPLOYEE, DEPARTMENT, and PROJECT, where each file organization is named for the file that its data set contains. That is, the EMPLOYEE file organization's data set contains the EMPLOYEE file, and so forth, for the DEPARTMENT and PROJECT file organizations. Interconnections between the EMPLOYEE file organization and the DEPARTMENT and PROJECT file organizations are required for the relationship descriptors DEPARTMENT-OF-EMPLOYEE and ASSIGNED-PROJECTS in the EMPLOYEE record. The values stored for these relationship descriptors are termed pointers, where a pointer is any data item that can be used to locate a record (or segment if records are segmented).
Two types of pointers are commonly used: symbolic pointers (typically
entity identifiers) and direct pointers (typically "relative record numbers" [IBM,
1974], i.e., the relative position of a record
in a data set). A symbolic pointer relies on
an access path of the "pointed to" file organization to retrieve records (segments).
Direct pointers, on the other hand, reference records (segments) by location. Hence
direct pointers are more efficient than symbolic pointers; however, they have wellknown maintenance disadvantages. For example, if a record (segment) is physically
displaced, in order to maintain some physical clustering when a new record (segment)
is added to the data set, all its direct
pointers must be updated. On the other
hand, symbolic pointers to that record (segment) axe still valid since its identifier was
not changed. Pointers (symbolic or direct)
are also used to connect records (segments)
residing in the same data set.
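A minimal sketch of the trade-off between the two pointer types, assuming a toy DEPARTMENT file held as a Python list and a dictionary standing in for an access path (all names here are illustrative):

def follow_direct(data_set, rrn):
    # A direct pointer is a relative record number: one positioned read, but it
    # becomes invalid if the target record is physically displaced.
    return data_set[rrn]

def follow_symbolic(index, data_set, identifier):
    # A symbolic pointer stores the target's identifier and relies on an access
    # path of the pointed-to file organization (here a dictionary index); it
    # survives reorganization at the cost of an extra lookup.
    return data_set[index[identifier]]

departments = [{"DEPT-NO": "D10", "NAME": "Accounting"},
               {"DEPT-NO": "D20", "NAME": "Research"}]
dept_index = {rec["DEPT-NO"]: pos for pos, rec in enumerate(departments)}

print(follow_direct(departments, 1)["NAME"])                     # direct pointer: position 1
print(follow_symbolic(dept_index, departments, "D20")["NAME"])   # symbolic pointer: "D20"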
The record structures illustrated in Figure 1 are extremely simple (i.e., unsegmented records each defined for a single entity). Many other alternatives exist. Any of these records may, for example, be segmented and the associated subfiles stored in individual data sets. Alternately, a single record may be defined containing all data items. Such a record could contain PROJECT as a repeating group within EMPLOYEE, which, in turn, is a repeating group within DEPARTMENT.
The criterion for selecting among alternative record structures is the minimization
of total system operating cost. This is typically approximated by the sum of storage,
retrieval, and update costs. The data storage cost for a physical database is easily
estimated by multiplying the size of each
subfile by the storage cost per character for
the media on which the subfile is maintained. In order to estimate retrieval and
update costs, however, the retrieval and
update activities of the user community
must be characterized and the access paths
available .to support those activities must
be known. A retrieval activity is characterized by selection (of entity-instances), projection (of data items pertaining to those
instances), and sorting (output ordering)
criteria and by a frequency of execution. An
update activity is characterized by selection
(of entity-instances) and projection (data
items updated for those instances) criteria
and, again, by frequency of execution. Selection and projection criteria for both retrieval and update activities determine
which subfiles must be accessed. The frequency of execution, of course, determines
how often the activity occurs. The total
cost of retrieval and update also depends
on what access paths are available to support those activities. In addition, the access
paths themselves typically incur storage
and maintenance costs.
The access paths of a file organization
define a set of record selection criteria
which are efficiently supported. Each file
organization has a primary access path
which dictates the physical positioning of
records in secondary memory. Hashing
functions, indexed sequential, and sequential files are examples of types of primary
access paths. As discussed by Severance
and Carlis [1977], the selection of a primary
access path for a file organization is based
on the characteristics of the "dominant"
database activity. In addition to having a
primary access path, file organizations typically have some number of secondary or
auxiliary access paths, such as inverted
files, lists, full indexes, and scatter tables.
These access paths are used to retrieve
subsets of records that have been stored via
the primary access path.
Secondary access paths are included in a
file organization in order to support the
retrieval activities more efficiently. They
do, however, typically increase data maintenance costs and require additional storage space. Subtle trade-offs between retrieval efficiency and storage and maintenance costs must be evaluated in selecting
the most cost effective secondary access
paths.
While access paths focus on the storage
and retrieval of records, the actual units
that they transfer between primary and
secondary memory, termed blocks, typically contain more than one record. The
number of records per block is termed the
blocking factor. The retrieval advantages
of blocking are obvious for sequential access paths where the number of secondary-memory accesses required to retrieve all of
a file organization's records is simply the
number of records divided by the blocking
factor. For direct access paths (including
secondary access paths), the analysis depends on both the blocking factor and the
proportion of records that are required, and
it is less straightforward. If records are unblocked (i.e., stored as one record per
block), then each record in a user request
requires a secondary-memory access.
Blocking may reduce the number of secondary-memory accesses since the probability that a single block contains multiple
required records increases (although not
linearly) with the blocking factor [Yao,
1977a]. The cost (measured in computer
time used) per secondary-memory access,
however, also increases (again not necessarily linearly) with the blocking factor
since more data must be transferred per
access [Severance and Carlis, 1977]. When
the proportion of records required by a user
request is large (over 10 percent [Martin,
1977]), the increased cost per secondary-memory access is typically outweighed by
the decreased number of accesses needed
to satisfy that request. Direct access paths
are more commonly used, however, when a
user request requires only a small proportion of records. The number of secondary-memory accesses is then approximately
equal to the number of records required
(i.e., approximately the same as when records are unblocked). In this case, blocking
does not reduce the number of secondary-memory accesses but only increases the
cost per block accessed. Therefore, the retrieval cost for that user is increased. Subtle
trade-offs between the number of accesses
and the cost per access must be evaluated
for each user request and blocking factors
that minimize total operating cost selected
when access paths are designed.
Both record structuring and access path
design are significant database design subproblems, and their solutions are interrelated [Batory, 1979]. However, access path
selection has been treated in detail elsewhere (see Severance and Carlis, 1977;
Martin, 1977; Cárdenas, 1977; Yao, 1977b;
March and Severance, 1978b) and will be
discussed further only as it relates to the
selection of efficient record structures. In
addition, the problems of data compression
and encoding are separable design issues
(see Maxwell and Severance, 1973; Aronson, 1977; Wiederhold, 1977; Severance,
1983), and in the following analysis it is
assumed that each data item has an encoded compressed length corresponding to
the average amount of space required to
store a single value for that data item.
To determine a set of efficient record
structures, one might proceed naively by
evaluating the operating cost of all possible
record structures (using time-space cost
equations from, e.g., Hoffer, 1975; March,
1978; or Yao, 1977a). However, the number
of possible alternatives is so large (on the
order of n^n, where n is the number of data
items [Hammer and Niamir, 1979]) and the
evaluation of each alternative is so complex
that this approach is computationally infeasible.
The remainder of this paper is organized
as follows: in Section 1 the record structuring problem is formally defined and relevant terminology is presented; in Sections
2 and 3 alternative record structuring techniques are discussed; finally, Section 4 presents a summary and discussion of directions for further research.
1. PROBLEM DEFINITION AND RELEVANT
TERMINOLOGY
As discussed above, a database contains
and provides access to descriptions of instances of entities as required by its community of users. These descriptions are organized into data sets that are permanently
maintained in secondary memory. The design of record structures for these data sets
is critical to the performance of the database. This is evidenced by the order of
magnitude reduction in total system operating cost reported by Carlis and March
[1980] when flat file record structures were
replaced by more complex hierarchic ones
(see also Gane and Sarson, 1979).
To review our terminology, recall that a
data item is a characteristic or a descriptor
of an entity. For instance, E M P - N A M E is
a data item (a characteristic of the entity
EMPLOYEE), whereas " J O H N DOE" is a
value for that data item. At the physical
level, data items are grouped into records;
the data items of a record define the rec-
[Figure 2. An aggregation of supplier and part records: each Supplier record (SUP-NAME, SUP-NO, ADDRESS, AGENT) is stored with the Part records (DESCRIPTION, PART-NO, PRICE, DISCOUNT) that it supplies.]
ord's fields. Record-instances contain the
actual data values. The set of all instances
of a record is a file. For convenience,
"record" is used for "record-instance" when
"instance" is implied by the context. For
operating efficiency the items of a record
may be partitioned to define some number
of record segments {thus a segment has
fields defined by the subset of data items
that it contains); the set of segment-instances for a segment is a subtile. Again, for
convenience, "segment" is used for "segment-instance" when "instance" is implied
by the context. In a database management
system, data are physically stored in data
sets. Each data set contains some number
of subfiles.
The record structuring problem is to define records and record segments (i.e., assign data items to records and record segments, thus defining their fields), to assign
the associated subfiles to data sets, and,
when more than one subfile is assigned to
the same data set, to physically organize
the segments within the data set. The objective is to meet the information requirements of the user community most efficiently.
The problem of record structuring is a
difficult one. Three factors contribute to
this difficulty: (1) modern databases typically have a complex logical data structure,
which yields an enormous number of alter-
native record structures; (2) the complex
nature of the user activity on that logical
data structure requires the analysis of subtle and intricate trade-offs in efficiency for
users of the database; and (3) the selection
of record structures and access paths is
highly interrelated--each can have a major
impact on the efficiency of the other. Each
of these difficulties is discussed in more
detail below.
The logical structure of a database describes the semantic interdependencies that
characterize a composite picture of how the
user community perceives the data. Within the context of commercially available
database management systems (DBMSs),
three data models have been widely followed to represent the logical structure of
data: hierarchic, network, and relational
(see Martin, 1977; Date, 1977; Tsichritzis
and Lochovsky, 1982). Since IMS [McGee,
1977] is the most widely used hierarchic
DBMS, its terminology is used in the following discussion. There are several network D B M S s {e.g., IDMS, IDS II, D M S
1100 [C~rdenas, 1977]); however, they are
all based on the work of the Data Base
Task Group (DBTG) of the Conference on
Data Systems Languages (CODASYL)
[CODASYL, 1971, 1978] and are collec-
tively referred to as CODASYL DBMSs;
therefore the CODASYL terminology is
used. Relational DBMSs [Codd, 1970, 1982]
[Figure 3. A segmentation of the supplier record for Suppliers 1 through N: SUP-SEG1 contains SUP-NO and AGENT; SUP-SEG2 contains NAME and ADDRESS.]
are best exemplified by INGRES [Held,
Stonebraker, and Wong, 1975] and System
R [Blasgen et al., 1981; Chamberlin et al.,
1981]; when terminology is unique to one of
these systems, it is so noted.
Each data model provides rules for
grouping data items into hierarchic segments in IMS, owner-member sets in CODASYL DBMSs, and normalized relations
in relational DBMSs. Database records
could be defined and their files stored in
data sets according to the groups specified
by a data model; however, performance
may be improved considerably by combining (or aggregating) some and/or partitioning (or segmenting) other of these groups.
Suppose, for example, that there are two
entities, SUPPLIER and PART, to be described in a database. Each SUPPLIER is
described by the attributes SUP-NAME,
SUP-NO, ADDRESS, and AGENT. Each
PART is described by the attributes DESCRIPTION, PART-NO, PRICE, and
DISCOUNT and by the relationship descriptor SUPPLIER-OF-PART (assume
that SUPPLIERs can supply many
PARTs, but each PART is supplied by
exactly one SUPPLIER). Database records
could be defined for both SUPPLIER and
PART and their files stored in SUPPLIER
and PART data sets, respectively. For
database activity directed through SUPPLIERs to the PARTs they supply, however, the aggregation (i.e., hierarchic arrangement) illustrated in Figure 2 is a more
efficient physical arrangement (see Gane
and Sarson, 1979; Schkolnick, 1977; Schkolnick and Yao, 1979). In this arrangement
each part record is stored with its SUPPLIER record, and the relationship descriptor SUPPLIER-OF-PART is represented by the physical location of PART
records.
On the other hand, database activities
typically require (project) only a subset of
the stored data items (see Hoffer and Severance, 1975; Hammer and Niamir, 1979).
For a user activity that requires only the
data items SUP-NO and AGENT, segmenting (i.e., partitioning into segments)
the SUPPLIER record, as shown in Figure
3, is an efficient physical arrangement. In
Figure 3 two segments are defined: SUP-SEG1, containing SUP-NO and AGENT,
and SUP-SEG2, containing NAME and
ADDRESS. In order to satisfy the above
mentioned user activity, only the subfile
defined by SUP-SEG1 needs to be accessed. This subfile is clearly smaller than
the SUPPLIER file; therefore fewer accesses and fewer data transfers are required
to satisfy this user activity from the SUP-SEG1 subfile than from the SUPPLIER
file. Hence this segmentation reduces the
cost of satisfying this user activity. If the
data items required by a user activity are
distributed over multiple segments, then
evaluating the cost is more complex. If
subfiles are independently accessed and
may be processed "in parallel," then cost
may still be reduced. However, this is not
common in practice; usually access to multiple subfiles proceeds serially and thus at
greater cost than access to a single subtile.
While record segmentation may significantly reduce retrieval costs, it can also
increase database maintenance costs. In
particular, inserting or deleting an instance
of a record typically requires access to each
of the record's subfiles. Clearly, the more
subfiles a record has, the more expensive
insertion and deletion operations become.
Record structuring may be viewed as the
aggregating and segmenting of logically defined data, within the context of a database
management system. IMS [McGee, 1977]
supports aggregation by permitting multiple hierarchic segments to be stored in the
same data set (termed "data set group" in
IMS). In this case, children segments are
physically stored immediately following
their parent segments. CODASYL DBMSs
support aggregation by allowing repeating
groups in the definition of a record type
and by permitting the storage of a MEMBER record type in the same data set
(termed "area" in CODASYL) as its
OWNER, "VIA SET" and " N E A R
OWNER". In CODASYL, a SET defines a
relationship by connecting O W N E R and
M E M B E R record types (e.g., if SUPP L I E R and P A R T are record types, then
S U P P L I E R - O F - P A R T could be implemented as a S E T and S U P P L I E R as the
O W N E R and P A R T as the MEMBER).
Storing a M E M B E R record type " N E A R
O W N E R " directs the D B M S to place all
M E M B E R records physically near {presumably on the same block as) their respective O W N E R records. In order to specify " N E A R O W N E R " in a CODASYL
DBMS, "VIA SET" must also be specified,
indicating that access to the M E M B E R
records occurs primarily from the O W N E R
record via the relationship defined by the
SET. Relational D B M S s do not support
aggregation per se (i.e., "normalizing" relations requires, among other things, the removal of repeating groups [Date, 1977]).
However, because substantial efficiencies
can be achieved through the use of aggregation, Guttman and Stonebreaker [1982]
suggest its possible inclusion in INGRES
under the name "ragged relations" (since
the relations are not normalized). In System R [Chamberlin et al., 1981] aggregation
can be approximated by initially loading
several tables into the same data set
(termed "segment" in System R) in hierarchic order and supplying "tentative recComputing Surveys, Vol. 15, No. 1, March 1983
ord identifiers" during I N S E R T operations
to maintain the physical clustering [Blasgen et al., 1981].
All three types of DBMSs support record
segmentation. In IMS, segmentation is accomplished simply by removing data items
from one segment to form new segments,
which become that segment's children.
Similarly, in a CODASYL DBMS, segmentation is accomplished by removing data
items from one record type to form new
record types and defining SETs with these
new record types as MEMBERS and
OWNED BY the original record type. The
new (MEMBER) record types are stored
VIA SET but not NEAR OWNER. In addition, a proposed CODASYL physical
level specification [CODASYL, 1978] suggests that record segmentation be supported at the physical level by allowing the
definition of multiple record segments,
termed storage records, for a single record
type. In a relational DBMS, record segmentation is accomplished by breaking up a
"base relation" [Date, 1977] into multiple
relations, each of which contains the identifier (key) of the original; these then become "base relations."
To be efficient, record structures must be
oriented toward some composite of users.
One design approach is simply to design the
record structures for the efficient processing of the single "most important" user
activity. Record structures that are efficient
for one user, however, may be extremely
inefficient for others. Since databases typically must serve a community of users
whose activities possess incompatible or
even conflicting data access characteristics,
such an approach is untenable.
An alternative design approach is to provide each user with a personalized database
consisting of only required data items, organized for efficient access. In a database
environment, however, many users often
require access to the same data items;
hence personalized databases would require
considerable storage and maintenance
overhead owing to data redundancy, and
are directly contrary to the fundamental
database concept of data sharing [Date,
1977]. The problem is to determine a set of
record structures that minimizes total system operating costs for all database users.
The task of structuring database records
is further complicated because the retrieval
and maintenance costs of a database depend on both its record structures and
its access paths. Consider, for example, a
database containing 1,000,000 instances of
a single record, where the record is defined
by a group of data items totaling 500 characters in length. Consider also a single user
request that selects 4 percent of the file
(i.e., 40,000 records) and projects 30 percent
(by length) of the data items (i.e., 150 characters) from each of these records. At the
physical level, records may be either unsegmented or segmented, and the access path
used to satisfy that user request may be
either sequential or direct.
Suppose that the physical database was
designed with unsegmented records. Satisfying the user request via a sequential access path would require a scan of the file
that, as discussed earlier, requires N / b
block-accesses, where N is the number of
records (here N = 1,000,000) and b is
the blocking factor of the file. Assuming a
realistic block length of 6,000 characters,
the blocking factor for this file is 12 (i.e., 12
500-character records may be stored in a
6,000-character block). Satisfying the user
request via a sequential access path would
require 83,334 block-accesses. Assuming
that the 40,000 required records are randomly distributed within the file, then
satisfying this user request via a direct
access path (such as an inverted list) would
require M(1 - (1 - 1/M)^K) block-accesses
[Yao, 1977a], where M = N / b and K is the
number of records selected (here K =
40,000). Again, assuming 6,000-character
blocks, yielding a blocking factor of 12, a
direct access path would require 31,768
block-accesses. If, instead, the physical
database was designed with segmented
records, one segment containing only those
data items required by the user request
under consideration, then only that subfile
must be processed. The expressions for the
number of accesses required for sequential
and direct access paths remain the same;
however, again assuming 6,000-character
blocks, the blocking factor increases to 40
(i.e., 40 150-character segments may be
stored in a 6,000-character block). A sequential access path would require 25,000
block-accesses to satisfy the user request; a
direct access path would require 19,953.
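The block-access counts in this example follow directly from the two expressions used above (N/b for a sequential scan and M(1 - (1 - 1/M)^K) for direct access); a small sketch that reproduces the four figures, under the same assumptions of 6,000-character blocks and uniformly distributed qualifying records:

import math

N = 1_000_000        # records in the file
K = 40_000           # records selected by the request (4 percent)
BLOCK = 6_000        # block length in characters

def sequential_accesses(record_length):
    b = BLOCK // record_length            # blocking factor
    return math.ceil(N / b)

def direct_accesses(record_length):
    b = BLOCK // record_length
    M = N / b                             # number of blocks in the (sub)file
    return round(M * (1 - (1 - 1 / M) ** K))

print(sequential_accesses(500))   # unsegmented, sequential: 83,334
print(direct_accesses(500))       # unsegmented, direct:     31,768
print(sequential_accesses(150))   # segmented, sequential:   25,000
print(direct_accesses(150))       # segmented, direct:       19,953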
Clearly, segmented records processed via
direct access path minimizes the number of
block-accesses required to satisfy the user
request; unfortunately, it may not be the
most efficient database design. Four additional factors must be considered: (1) the
types of direct access paths used to support
such user requests typically incur additional storage and maintenance costs; (2) in
a computer system where there is little
contention for storage devices, the cost per
access for a sequential access path is typically less than that for a direct access path,
and hence the number of block-accesses
may not be an adequate measure for comparison; (3) record segmentation typically
increases database maintenance costs by
increasing the number of subfiles that must
be accessed to insert and delete record instances; and (4) both record segmentation
and secondary access paths add to the complexity of the database design and hence
may increase the cost of software development and maintenance.
For the database and user request under
consideration, the simplest physical database, and the least costly to maintain, has
unsegmented records processed by a sequential access path. If a secondary access
path (say an inverted list) is added to that
physical database, then the number of
block-accesses is reduced by 62 percent
(from 83,334 to 31,768), a considerable savings even in a low-contention system. The
cost of maintaining the secondary access
path must, of course, be considered. On the
other hand, segmenting the records without
adding a secondary access path reduces the
number of block-accesses by 70 percent
(from 83,334 to 25,000) over unsegmented
records processed by a sequential access
path and by 21 percent (from 31,768 to
25,000) over unsegmented records processed by a direct access path. The addition
of a secondary access path further reduces
the number of block-accesses by 20 percent
(from 25,000 to 19,953). The significance of
this 20 percent reduction depends on the
relative cost of sequential access as compared to direct access and the additional
storage and maintenance costs incurred by
the secondary access path.
As illustrated by the above discussion,
structuring database records is a difficult
task, and the selection of inappropriate record structures may have considerable impact on the overall performance of the database. In the next two sections, alternative
techniques for structuring database records
are presented and compared. Techniques
for the segmentation of flat files are discussed in Section 2; a technique for the
aggregation of more generalized logical data
structures is discussed in Section 3.
2. RECORD SEGMENTATION TECHNIQUES
In this section, techniques for structuring
database records in a flat file environment
are discussed. A flat file is one that is logically defined by a single group of singlevalued data items (i.e., a single normalized
relation).
Table 1 shows the logical representation
of a flat file personnel database that will be
used in the following discussion. As is
shown in that table, each employee is
described by a set D of 30 data items d_i,
i = 1, ..., 30; for instance, d_1 = Employee
Name, d_2 = Pass Number. Each data item
has an associated length, li, corresponding
to the (average) amount of space required
to store a value for that data item, such as
30 characters for Employee Name. Processing requirements consist of a set U, of 18
user retrieval requests u_r, r = 1, ..., 18, each
of which is characterized by selection, projection, and ordering criteria (S, P, and O,
respectively), by the proportion of records
selected p_r, and by a frequency of access v_r.
For example, retrieval request R5 has a
selection criterion based upon Pay Rate, a
projection criterion requiring Security
Code, Assigned Projects, Project Hours,
Overtime Hours, and Shift. Its ordering
criterion is by Department Code within
Division Code. The request selects an average proportion of 0.9 (90 percent) of the
employees and is required daily (its frequency is 20 times per month).
Retrieval requests R1 through R8 each
select a large proportion of the stored data
records (requests R1 through R3, in fact,
require all data records). These retrievals
are likely to be most efficiently performed
by using a sequential access path which
retrieves the entire file (or required
subfiles). Retrieval requests R9 through
R18, on the other hand, each select a relatively small proportion of the stored records
(R15 through R16 retrieve individual data
records). These are likely to be best performed via some auxiliary access paths,
such as inverted lists or secondary indexes
[Severance and Carlis, 1977].
The record segmentation problem is to
determine an assignment of data items to
segments (recall that each segment defines
a subtile) that will optimize performance
for all users of the database. The problem
may be formally stated as
\min_{A \in R} \; \sum_{u_r \in U} v_r \sum_{s \in A} X_{rs}\, RC(r, s) \; + \; \sum_{s \in A} SC(s),

where R is the set of all possible segmentations (i.e., assignments of data items to segments), A is one particular segmentation, s is a segment of the segmentation A, and X_rs is a 0-1 variable defined as follows:

X_{rs} = \begin{cases} 1, & \text{if segment } s \text{ is required by user request } u_r; \\ 0, & \text{otherwise}. \end{cases}

Finally RC(r, s) is the cost of retrieval for the subfile defined by segment s for the retrieval request u_r, and SC(s) is the cost of storage for segment s, including overhead space required for its access paths (if any).
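A direct transcription of this objective, as a sketch only: the segmentation is represented as a list of sets of data item names, and RC and SC are supplied by the caller (these representations are assumptions made for illustration).

def segmentation_cost(segmentation, requests, RC, SC):
    """Evaluate sum_r v_r * sum_s X_rs * RC(r, s) + sum_s SC(s).

    segmentation: list of sets of data item names (the segments of A)
    requests:     list of (v_r, items_required) pairs, one per user request
    RC(r, s), SC(s): caller-supplied retrieval and storage cost functions
    """
    total = sum(SC(s) for s in segmentation)            # storage cost of each subfile
    for r, (v_r, needed) in enumerate(requests):
        for s in segmentation:
            if needed & s:                               # X_rs = 1: segment s holds a needed item
                total += v_r * RC(r, s)
    return total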
Hoffer and Severance [1975] point out
that, theoretically, this problem is trivial to
solve. Since retrieval costs are minimized
by accessing only relevant data and storage
and maintenance costs are minimized by
eliminating redundancy, each data item
should be maintained independently. Such
an arrangement would permit completely
selective data retrieval and would eliminate
data redundancy. Assuming that buffer
costs and file-opening costs are negligible,
this solution does, in fact, minimize retrieval cost for sequential fries that are independently processed in a paged-memory
environment [Day, 1965; Kennedy, 1973].
To show that this is true, it is sufficient to
look at the access cost for a single user, u_r.
If each data item is assigned to an independent segment (i.e., d_i is assigned to segment i), then the access cost for each subfile
[Table 1. The flat file personnel database used in Section 2: a set D of 30 data items with their encoded lengths, and 18 user retrieval requests, each characterized by selection, projection, and ordering criteria, the proportion of records selected, and a frequency of access.]
required by that user is given by

Cost(r, s) = a \frac{N}{B}\, l_s,

where a is the cost per page access (including the average seek and latency, as well as page-transfer costs), B is the block size, and hence (N/B) l_s is the number of page accesses required to scan a subfile, where each segment is of length l_s. Here, l_s is equal to l_i, the length of the corresponding data item d_i. The total access cost for that user is

\sum_{d_i \in D} x_{ri}\, a \frac{N}{B}\, l_i,

where each segment is now indexed by the data item that it contains.

If this is not the optimal solution, then there must exist some subset of data items d_m, ..., d_n such that the access cost for user u_r is reduced when those data items are assigned to a single segment, say segment j. If u_r requires any data item in segment j, then the subfile that it defines must be retrieved. Since the number of accesses required to retrieve a subfile via a sequential access path is a linear function of the segment length, the retrieval cost for the subfile defined by segment j is given by a(N/B)(l_m + ... + l_n). Mathematically this yields

a \frac{N}{B}(l_m + \cdots + l_n)\,\max(x_{rm}, \ldots, x_{rn}) < a \frac{N}{B}(x_{rm} l_m + \cdots + x_{rn} l_n).

Clearly, this inequality is false since l_m, ..., l_n > 0 and x_{rm}, ..., x_{rn} ∈ {0, 1}. Therefore, according to theory, the previous solution must have been optimal.

This appealing theoretical approach is untenable in practice for several reasons. First, the complex nature of database activity requires both random and sequential processing. For random processing (i.e., processing via a direct access path), the number of accesses required to satisfy a user request is not a linear function of the segment length. Hence, the preceding result is not valid. In addition, the characteristics of secondary storage devices and their data access mechanisms may not be well modeled by a paged-memory environment.

A paged-memory environment assumes that any access to any subfile requires the same amount of time and expense. Consider, however, a system in which all subfiles of the database reside on a dedicated disk and there is little contention among users of the database. Sequential access to a single subfile then requires only one (minimal) disk-arm-seek operation for each cylinder on which the subfile is stored [IBM, 1974]. Any access to a different subfile, however, requires an average disk-seek operation. Thus access time (and hence cost) increases significantly when more than one subfile must be sequentially processed for a single user request. Actual time and cost estimates for such an environment must take into consideration the physical location of subfiles and the distance that the disk arm must move.

Three major techniques have been suggested for the determination of efficient record segmentations: clustering, iterative grouping refinement, and bicriterion programming. Each technique varies in the assumptions it makes about access paths and the secondary memory environment it considers. These variations yield significantly different problem formulations. Each, however, is restricted to a single flat file representation, and none permit the replication of data.

The techniques are first discussed individually--clustering in Section 2.1, iterative grouping refinement in Section 2.2, and bicriterion programming in Section 2.3. Each then is used to solve the personnel database problem shown in Table 1. These results are analyzed and discussed in Section 2.4.

2.1 Mathematical Clustering

Hoffer [1975] and Hoffer and Severance [1975] developed, for the record-structuring problem, a heuristic procedure that uses a mathematical clustering algorithm and a branch-and-bound optimization algorithm. Their procedure is restricted to sequential access paths but includes a detailed costing model of secondary memory (in a moving-head disk environment). The solution to this problem is an assignment of data items to segments such that the sum of retrieval, storage, and maintenance costs is minimized.
Mathematically, clustering is simply the
grouping together of things which are in
some sense similar. The general idea behind
clustering in record segmentation is to organize the data items physically so that
those data items that are often required
together are stored in the same segment; in
a processing sense, such data items are
"similar." Conversely, data items that are
not often required together are stored in
different segments; they are "dissimilar." In
this way user retrieval requirements can be
met with a minimal amount of extraneous
data being transferred between main and
secondary memory. Unfortunately, user retrieval requirements are typically diverse,
and a method of evaluating trade-offs
among user requirements must be determined.
The clustering approach therefore first
establishes a measure of similarity between
each pair of data items. This measure reflects the performance benefits of storing
those data items in a common segment. It
then forms initial segments by using a
mathematical clustering algorithm to group
data items with high similarity measures.
Finally, a branch-and-bound algorithm selects an efficient merging of these initial
segments to define a record segmentation.
The factors included in the similarity
measure are critical to the value of this
approach. Hoffer and Severance [1975b]
developed a similarity measure for each
pair of data items (termed a pairwise similarity measure) based upon three characteristics:
(1) the (encoded) data item lengths (l_i for
data item d_i);
(2) the relative weight associated with each
retrieval request (v_r for request u_r);
(3) the probability that two given data
items will both be required in a retrieval
request (p_ijr for data items d_i and d_j in
request u_r).
The first characteristic, li, is obtained
directly from the problem definition (see
Table 1) and reflects the amount of data
that must be accessed in order to sequentially retrieve all values of that data item.
For any pair of data items, the bigger the
length of the data items, the more often
they must be retrieved together in order for
the retrieval advantage (for users requiring
both data items) to outweigh the retrieval
disadvantage (for users requiring one or the
other but not both data items). Hence a
pairwise data item similarity measure
should decrease with the lengths of the data
items.
The second characteristic, v_r, is also obtained directly from the problem definition and reflects the relative importance of each
and reflects the relative importance of each
user request measured by its frequency of
access. Again, for a pair of data items, the
more frequently they are retrieved together, the greater is the performance advantage of storing them in the same segment. A pairwise data item similarity measure should increase with the frequency
with which the data items are retrieved
together.
The final characteristic, p_ijr, is the probability of coaccess of data items d_i and d_j in retrieval request u_r and is defined as follows. Let p_ir be the probability that data item d_i is required by user request u_r. The value of p_ir is determined by the user request's (u_r's) use of data item d_i. Three possibilities exist:
(1) Data item d_i is used by u_r for selection. Since only sequential access is permitted, the value of d_i must be examined in each record to determine if the record is required by u_r; therefore the value of p_ir is equal to 1.
(2) Data item d_i is used by u_r for projection only. Data item d_i is required only from those records which have been selected; therefore the value of p_ir is equal to p_r, the proportion of records selected by u_r.
(3) Data item d_i is not used by request u_r. Data item d_i is irrelevant for request u_r and therefore the value of p_ir is 0.
The value of p_ijr is then given by

p_{ijr} = \begin{cases} 1, & \text{if } p_{ir} = p_{jr}; \\ 0, & \text{if } p_{ir} = 0 \text{ or } p_{jr} = 0; \\ p_r, & \text{otherwise}. \end{cases}
For any pair of data items d_i and d_j, Hoffer's similarity measure is given by

S_{ij}(\alpha) = \frac{\sum_{u_r} v_r\, S_{ij}^{(r)}(\alpha)}{\sum_{u_r} v_r \left\lceil S_{ij}^{(r)}(\alpha) \right\rceil},
where ⌈x⌉ denotes the smallest integer greater than or equal to x, S_ij^(r)(α) denotes the pairwise similarity of data items d_i and d_j in the user request u_r, and α is a parameter that controls the sensitivity of the measure. The numerator is the sum of the similarity measures for a pair of data items in a retrieval request, summed over all retrieval requests and weighted by the frequency of retrieval. The denominator simply normalizes the similarity measure to the interval [0, 1] since S_ij^(r)(α) is itself in the interval [0, 1], as discussed below.
S_ij^(r)(α) is given by

S_{ij}^{(r)}(\alpha) = \begin{cases} 0, & \text{if } p_{ijr} = 0; \\ \left(C_{ij}^{(r)}\right)^{\alpha}, & \text{otherwise}, \end{cases}

where C_ij^(r) represents the proportion of information useful to user request u_r, which is contained in a subfile consisting only of data items d_i and d_j. Assuming that p_ir ≥ p_jr, then it is given by

C_{ij}^{(r)} = \begin{cases} 0, & \text{if } p_{ijr} = 0; \\ \dfrac{l_i + p_{ijr}\, l_j}{l_i + l_j}, & \text{otherwise}. \end{cases}
Since 0 ≤ p_ijr ≤ 1 for all d_i, d_j, and u_r, and l_i ≥ 0, l_j ≥ 0 for all d_i and d_j, then 0 ≤ C_ij^(r) ≤ 1. Hence for any α ≥ 0, 0 ≤ (C_ij^(r))^α ≤ 1 and 0 ≤ S_ij^(r)(α) ≤ 1. For values of C_ij^(r) < 1 (i.e., when p_ijr < 1) larger values of α would cause (C_ij^(r))^α to decrease rapidly and hence reduce the pairwise similarity measure S_ij^(r)(α) for that user request rapidly as p_ijr decreased. Thus large values of α would yield low similarity measures (except, of course, when p_ijr = 1). Small values of α (particularly 0 ≤ α ≤ 1) would yield high similarity measures that would not decrease rapidly as p_ijr decreased.
Having as a basis these definitions, the pairwise similarity measure S_ij(α) does, in fact, exhibit the characteristics suggested earlier: for any pair of data items d_i and d_j (provided that p_ijr > 0 for some user request u_r), S_ij(α) decreases as the lengths l_i and l_j increase, and S_ij(α) increases as the frequency of access or the probability of coaccess of the data items increases.
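Under the definitions reconstructed above (p_ir from each request's use of an item, p_ijr from the pair, C_ij^(r) = (l_i + p_ijr l_j)/(l_i + l_j) with the more frequently accessed item contributing its full length, and S_ij^(r)(α) taken as (C_ij^(r))^α), the measure can be sketched as follows; the request encoding and function names are assumptions made for illustration.

from math import ceil

def p_ir(item, request):
    # request = (v_r, p_r, selection_items, projection_items)
    _, p_r, select, project = request
    if item in select:
        return 1.0            # examined in every record
    if item in project:
        return p_r            # needed only for the selected records
    return 0.0

def p_ijr(i, j, request):
    pi, pj = p_ir(i, request), p_ir(j, request)
    if pi == 0 or pj == 0:
        return 0.0
    if pi == pj:
        return 1.0
    return request[1]         # p_r

def similarity(i, j, li, lj, requests, alpha=1.0):
    num = den = 0.0
    for req in requests:
        v_r = req[0]
        pi, pj = p_ir(i, req), p_ir(j, req)
        p = p_ijr(i, j, req)
        if p == 0:
            s_r = 0.0
        else:
            # C_ij^(r): the item with the larger access probability contributes
            # its full length, the other contributes p_ijr of its length
            la, lb = (li, lj) if pi >= pj else (lj, li)
            s_r = ((la + p * lb) / (li + lj)) ** alpha
        num += v_r * s_r
        den += v_r * ceil(s_r)
    return num / den if den else 0.0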
Using this measure to form a data-item
data-item similarity matrix, Hoffer and
Severance applied the Bond Energy Algorithm (see McCormick, Schweitzer, and
White, 1972) and used the results to form
initial clusters (groups) of data items. The
Bond Energy Algorithm manipulates the
rows and columns of the similarity matrix,
clustering large similarity values into blocks
along the main, upper left to lower right
diagonal. These blocks correspond to
groups of similarly accessed data items that
should be physically stored together. However, because block boundaries are generally fuzzy, initial clusters must be subjectively established from the permuted
matrix.
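The quantity the Bond Energy Algorithm maximizes, the "measure of effectiveness" of McCormick, Schweitzer, and White [1972], is the sum over all matrix entries of each entry times its four nearest neighbors in the permuted matrix. The sketch below only evaluates that measure for a given ordering; it is not the permutation search itself.

def bond_energy(sim, order):
    """Measure of effectiveness of a symmetric similarity matrix `sim`
    under the permutation `order`, applied to both rows and columns."""
    n = len(order)
    a = [[sim[order[r]][order[c]] for c in range(n)] for r in range(n)]
    def at(r, c):
        return a[r][c] if 0 <= r < n and 0 <= c < n else 0.0
    return 0.5 * sum(a[r][c] * (at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1))
                     for r in range(n) for c in range(n))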
An optimal grouping of initial clusters is
determined by a branch-and-bound algorithm, which (implicitly) evaluates the expected cost of all possible combinations of
initial clusters and selects the one with
minimum cost. The branch-and-bound algorithm is more efficient than complete
enumeration since many (nonoptimal)
solutions are eliminated from consideration
without actually evaluating their costs. (See
Garfinkel and Nemhauser, 1972, for a general discussion of branch-and-bound algorithms.) The algorithm first divides the set
of all possible solutions into subsets and
then establishes upper and lower bounds
on cost for each subset. These bounds are
used to "fathom" (i.e., eliminate from consideration) subsets that do not contain an
optimal solution. A subset is fathomed if its
lower bound is higher than the upper bound
of some other subset (or if its lower bound
is higher than the cost of some known solution). Each remaining subset is further
divided into subsubsets, and the process is
repeated until only a small number of solutions remain (i.e., are not fathomed).
These remaining solutions are evaluated
and the minimum cost solution selected.
Hoffer and Severance [1975] report that
although the execution time of the branch-and-bound algorithm is significantly affected by the number of initial clusters selected, the performance of the final solution
is not--provided, of course, that some
"reasonable" criteria are used to establish
the initial clusters. They also suggest the
potential for developing an algorithmic pro-
cedure to identify a "reasonable" set of
initial clusters, thereby reducing the subjective nature of this approach. They do not,
however, present such a procedure.
2.2 Iterative Grouping Refinement
Hammer and Niamir [1979] developed an
heuristic procedure for the record structuring problem which iteratively forms and
refines record segments based on a stepwise
minimization ("hill climbing") approach.
Their procedure considers the number of
page accesses for multiple record segments
that have been stored nonredundantly. Access may be either sequential or random
via any combination of predefined secondary indexes (i.e., direct access paths), each
of which is associated with exactly one data
item. In order to evaluate random retrieval,
they define the selectivity, S_i, of a data
item (S_i is used to represent the selectivity
of data item d_i) as the expected proportion
of records that contain a given value for
that data item. The selectivity of a data
item is estimated by the inverse of the
number of distinct values the data item
may assume. If an index exists for a data
item, then the selectivity of the index is
equal to the selectivity of the data item.
The performance of a segmentation is
evaluated by the number of pages that must
be accessed in order to satisfy all user requests. For each subtile required by a user,
the number of pages that must be accessed
is given by
where C~ is the number of combinations of
k objects t a k e n j at a time, N is the number
of records in the file organization, b is the
blocking factor (i.e., segments per page) for
the subffle, and s is the number of segments
required from the subtile.
This expression is derived by Yao [1977a]
and briefly discussed below. Assuming that
the s segments required by a user request
are uniformly distributed over the pages on
which the subfile is stored, then the retrieval problem may be viewed as a series
of s Bernoulli trials without replacement
(see Feller, 1970). Then C_{N-b}^{s} / C_{N}^{s} is the
probability that a page does not contain
any of the s required records. One minus
this quantity is, of course, the probability
that a page contains at least one required
record. Multiplying by N/b, the total number of pages, gives the expected number of
pages which must be accessed. For a more
detailed discussion of this derivation, the
interested reader is directed to Yao's paper
[Yao, 1977a].
The remaining factor, the number of segments required from each subfile, is calculated as follows. Assuming that the selectivities of data items are independent, that
selection is done first with all indexed data
items, and that subfiles are linked and directly accessed, then the number of record
segments required from a subfile, f, by a
user u_r is given by

s = N \prod_{d_i \in L_f} S_i,

where L_f is the set of selection data items
required by user u_r that are either indexed
or appear in subfiles accessed before subfile
f is accessed.
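Both expressions translate directly; a minimal sketch (the example numbers at the end are hypothetical):

from math import comb, prod

def expected_pages(N, b, s):
    # (N / b) * (1 - C(N - b, s) / C(N, s)): expected number of pages that hold
    # at least one of s required segments spread uniformly over N/b pages.
    return (N / b) * (1 - comb(N - b, s) / comb(N, s))

def segments_required(N, selectivities):
    # s = N * product of the selectivities of the indexed (or previously
    # resolved) selection data items that restrict this subfile.
    return round(N * prod(selectivities))

# Example: 10,000 records, 25 segments per page, and a request whose indexed
# selection items have selectivities 0.1 and 0.05.
s = segments_required(10_000, [0.1, 0.05])      # 50 segments
print(s, expected_pages(10_000, 25, s))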
The record structuring proceeds as follows. An initial segmentation is formed,
with each data item stored in a separate
segment. This is termed the trivial segmentation. The procedure iteratively defines
new segments by selecting the pair of existing segments that, when combined, yields
greatest performance improvement. When
no pairwise combination of existing segments yields improved performance, single
data item variations of the current segmentation are evaluated. If any variation is
found that improves the performance, then
the procedure is repeated, using that variation as the initial segmentation; otherwise,
the procedure terminates. While it is possible for the procedure to cycle a large
number of times, experimental evidence
suggests that for reasonable design problems it is not expected to cycle more than
once before terminating with an efficient, if
not optimal, record structuring. Hammer
and Niamir report determining record
structures yielding access cost savings of up
to 90 percent over the cost of using a single-segment record structure. They do not,
however, consider the cost of data storage,
buffer space, or update operations, nor do
they consider the complexities of a moving-
head-disk environment in which such databases are likely to be maintained.
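The refinement loop itself can be outlined generically. The sketch below starts from the trivial segmentation and greedily merges the best pair of segments; the cost function is left abstract (it would be the page-access model above), and the single-data-item variation and restart steps described by Hammer and Niamir are omitted, so this is an outline under simplifying assumptions rather than their full procedure.

def refine(items, cost):
    """Greedy pairwise grouping: repeatedly merge the pair of segments
    whose merge most reduces cost(segmentation)."""
    segmentation = [frozenset([d]) for d in items]       # trivial segmentation
    improved = True
    while improved:
        improved = False
        best = cost(segmentation)
        best_pair = None
        for a in range(len(segmentation)):
            for b in range(a + 1, len(segmentation)):
                trial = [s for k, s in enumerate(segmentation) if k not in (a, b)]
                trial.append(segmentation[a] | segmentation[b])
                c = cost(trial)
                if c < best:
                    best, best_pair = c, (a, b)
        if best_pair is not None:
            a, b = best_pair
            merged = segmentation[a] | segmentation[b]
            segmentation = [s for k, s in enumerate(segmentation) if k not in (a, b)]
            segmentation.append(merged)
            improved = True
    return segmentation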
2.3 Mathematical Programming
Eisner and Severance [1976] proposed a
mathematical programming approach to
the record structuring problem, which was
extended by March and Severance [1977]
and March [1978]. This approach views
record structuring as a constrained optimization problem in which the objective is to
minimize the sum of retrieval and storage
costs.
Based on a behavioral model of database
usage that predicts subsets of "high"-use
and "low"-use data items, the procedure is
restricted to considering at most two segments: "primary" and "secondary." Data
items with high usefulness, relative to their
storage and transfer costs, are isolated and
stored in a primary segment; the remaining
data items are stored in a secondary segment. All activity must first access the primary subfile, and may access the secondary
subfile (at additional cost) only if the request cannot be satisfied by the primary
subfile. The possibility of retrieving data
items by only accessing the secondary
subfile is not considered.
Eisner and Severance [1976] analyze this
design problem by assuming that activity
to the primary subfile is either all random
(with both subfiles unblocked) or all sequential (with a blocked primary subfile
and an unblocked secondary subfile). In
addition, they assume that pointers stored
in the primary segments link the corresponding secondary segments. They formulate the problem as a bicriterion mathematical program and develop a parametric
solution procedure as discussed below.
A record segmentation, A, is defined by
a subset D_A ⊆ D of data items that are
assigned to the primary segment. The
length of each primary segment is given by
WA---- F, l i + p ,
-diEDA
where p is the length of a system pointer.
(Note that if both primary and secondary
segments are sequentially processed, then
the pointer is unnecessary and hence p is
set to 0.) The secondary segments contain
Computing Surveys, Vol. 15, No. 1, March 1983
diED-DA
and WA + WA-- W, a constant.
Associated with any segmentation A is a subset $U_A \subseteq U$ of users whose retrieval requirements are not satisfied by the data items assigned to the primary segment. The cumulative frequency of retrieval by satisfied users during time period T is

$$V_A = \sum_{u_j \in U - U_A} v_j.$$

The frequency of retrieval by dissatisfied users is

$$\bar{V}_A = \sum_{u_j \in U_A} v_j,$$

and $V_A + \bar{V}_A = V$, a constant.
When a user requires a data item di that
is not part of the primary segment, then the
secondary subfile must be accessed as well.
A frequently referenced data item, say di,
could be moved to the primary segment;
this inclusion, however, would create an
additional transportation burden (proportional to li, the length of data item di) for
all users of the database, and in particular
those users not needing di.
Consider a database with N records and assume that primary and secondary subfiles are both unblocked. Let $a_1$ be the cost of a single access to the primary subfile, $t_1$ be the unit cost of data transfer from the primary subfile, and $s_1$ be the unit cost of primary subfile storage over the time interval T. Similarly define $a_2$, $t_2$, and $s_2$ for the secondary subfile. Given a specific segmentation A and assuming that the primary and secondary subfiles are both sequentially processed, then

$$\mathrm{Cost}_A = N((a_1 + t_1 W_A)V + (a_2 + t_2 \bar{W}_A)\bar{V}_A + s_1 W_A + s_2 \bar{W}_A). \qquad (1)$$
This equation is explained as follows: $(a_1 + t_1 W_A)$ is the cost to access and transfer one primary segment. Multiplying by N gives the cost to retrieve the entire primary subfile. Since V is the total frequency of retrieval for all users, $N(a_1 + t_1 W_A)V$ is the total cost of retrieval from the primary subfile. Similarly, the term $N(a_2 + t_2 \bar{W}_A)\bar{V}_A$ is the total cost of retrieval from the secondary subfile. $N s_1 W_A$ is the cost of storage for the primary subfile, and $N s_2 \bar{W}_A$ is the cost of storage for the secondary subfile.

Figure 4. Network representation used by Eisner and Severance [1976]. (Figure not reproduced.)
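To make the bookkeeping in (1) concrete, the following is a minimal sketch (not the authors' code) that evaluates $\mathrm{Cost}_A$ for one candidate primary-segment assignment; every numeric parameter below is a hypothetical illustration, not a figure from the paper.

def cost(primary, lengths, users, N, a1, t1, s1, a2, t2, s2, V, p=0):
    """lengths: {item: length}; users: list of (needed_items, frequency);
    primary: set of data items stored in the primary segment."""
    W  = sum(lengths.values())
    WA = sum(lengths[i] for i in primary) + p        # primary segment length
    WB = W - sum(lengths[i] for i in primary)        # secondary segment length
    # frequency of users whose needs are NOT covered by the primary segment
    VB = sum(freq for needed, freq in users if not needed <= primary)
    return N * ((a1 + t1 * WA) * V
                + (a2 + t2 * WB) * VB
                + s1 * WA + s2 * WB)

# Hypothetical data items, users, and cost rates.
lengths = {'name': 20, 'addr': 50, 'pay': 10, 'skill': 15}
users   = [({'name'}, 60), ({'name', 'pay'}, 30), ({'addr', 'skill'}, 10)]
V       = sum(f for _, f in users)
print(cost({'name', 'pay'}, lengths, users,
           N=30_000, a1=1.0, t1=0.01, s1=0.001, a2=1.0, t2=0.01, s2=0.001, V=V))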
Defining R to be the set of all possible record segmentations, the design problem becomes

$$\min_{A \in R} N((a_1 + t_1 W_A)V + (a_2 + t_2 \bar{W}_A)\bar{V}_A + s_1 W_A + s_2 \bar{W}_A). \qquad (2)$$

This expression can be reduced to one in $W_A$ and $\bar{V}_A$ to yield

$$\min_{A \in R} K_3 - K_4(K_1 - W_A)(K_2 - \bar{V}_A), \qquad (3)$$

where

$$K_1 = W + a_2/t_2, \quad K_2 = (t_1 V + s_1 - s_2)/t_2, \quad K_3 = N(a_1 V + s_2 W + t_2 K_1 K_2), \quad K_4 = N t_2.$$
For reasonable values of $a_i$, $t_i$, and $s_i$, the quantities $K_1$, $K_2$, $K_3$, and $K_4$ are nonnegative constants, and the objective function (3) is minimized by that segmentation which solves

$$\max_{A \in R} (K_1 - W_A)(K_2 - \bar{V}_A). \qquad (4)$$

The expression (4) is termed a bicriterion mathematical program because its objective function is the product of two expressions (i.e., $(K_1 - W_A)$ and $(K_2 - \bar{V}_A)$). The fact that its solution also solves the objective (3) can be seen by noting that

(1) a constant may be added to or subtracted from an objective function without changing its optimal solution (hence $K_3$ can be subtracted from (3));
(2) an objective function may be multiplied or divided by any positive constant without changing its optimal solution (hence the result of subtracting $K_3$ from (3) can be divided by $K_4$);
(3) maximizing an objective is equivalent to minimizing its negative (hence the change from a minimization in (3) to a maximization in (4)).

Geoffrion [1967] has shown that if $K_1 \geq W_A \geq 0$, $K_2 \geq \bar{V}_A \geq 0$, and R is a continuous convex set, then a solution A* that solves the objective

$$\max_{A \in R} c(K_1 - W_A) + (1 - c)(K_2 - \bar{V}_A) \qquad (5)$$

over all $c \in [0, 1]$ also solves the objective (4).

Eisner and Severance [1976] also examined objective functions of this form specifically for the record segmentation problem, where the set R is discrete. They analyzed the problem using the network formulation illustrated in Figure 4. Data items and users are represented by nodes in the network. Activities are described by edges that connect each user $u_j$ to all data items required by that user. User $u_1$, for example, is shown to require data items $d_1$, $d_3$, and $d_5$. Two
additional nodes ($d_0$ and $u_0$) and a set of directed edges (from $d_0$ to $d_i$ for all $d_i \in D$ and from $u_j$ to $u_0$ for all $u_j \in U$) complete the network representation. The capacity of an edge $(d_0, d_i)$ is given by $c\,l_i$, where $l_i$ is the length of data item $d_i$ and c is a parameter that varies the relative cost of retrieval as compared to that of storage. The capacity of an edge $(u_j, u_0)$ is given by $(1 - c)v_j$, where $v_j$ is the frequency of retrieval by user $u_j$. The remaining network edges have infinite capacity. Intuitively, the capacities of the arcs represent
the relative importance of data item length
as compared to retrieval frequency for the
overall performance of a segmentation.
By using this network formulation and standard network theory (see Ford and Fulkerson, 1962), the maximum flow from $d_0$ to $u_0$ may be quickly calculated, and is given by

$$\min_{A \in R} cW_A + (1 - c)\bar{V}_A. \qquad (6)$$

Noting that eq. (5) may be written as

$$\max_{A \in R} cK_1 + (1 - c)K_2 - (cW_A + (1 - c)\bar{V}_A)$$

or

$$cK_1 + (1 - c)K_2 - \min_{A \in R} \big(cW_A + (1 - c)\bar{V}_A\big),$$

it is clear that for a given value of c, a
solution to (6) also solves (5). Using this
approach, Eisner and Severance developed
a procedure that produces a family of
"parametric" solutions for the parameter c.
On the basis of conditions established by
Eisner and Severance [1976] the procedure
terminates either with a parametric solution, which is known to be optimal, or with
the best parametric solution and a lower
bound on the optimal solution. If the best
parametric solution is not "sufficiently
close" to the lower bound, a branch-andbound procedure is used to find the optimal
solution.
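For a single value of c, the min-cut computation behind (6) can be sketched as follows. This is not Eisner and Severance's implementation: it leans on the third-party networkx package for max-flow/min-cut, and the item lengths and user frequencies are hypothetical. Data-item nodes that fall on the sink side of the cut form the primary segment; users left on the source side are the dissatisfied users.

import networkx as nx

def primary_segment(lengths, users, c):
    """lengths: {item: l_i}; users: list of (set_of_needed_items, v_j).
    Returns (cut_value, set of items assigned to the primary segment)."""
    G = nx.DiGraph()
    for item, l in lengths.items():
        G.add_edge('d0', ('d', item), capacity=c * l)
    for j, (needed, v) in enumerate(users):
        G.add_edge(('u', j), 'u0', capacity=(1 - c) * v)
        for item in needed:
            G.add_edge(('d', item), ('u', j))     # no capacity attribute => infinite
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, 'd0', 'u0')
    primary = {item for kind, item in sink_side - {'u0'} if kind == 'd'}
    return cut_value, primary

# Hypothetical data.
lengths = {'name': 20, 'addr': 50, 'pay': 10, 'skill': 15}
users = [({'name'}, 60), ({'name', 'pay'}, 30), ({'addr', 'skill'}, 10)]
for c in (0.1, 0.5, 0.9):
    print(c, primary_segment(lengths, users, c))

Sweeping c over [0, 1] produces a family of parametric solutions; the optimality conditions and the branch-and-bound fallback described above are not shown in this sketch.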
March and Severance [1977] extended
this work in the area of blocked sequential
retrieval. They generalized the analysis to
a situation in which primary and secondary
subfiles are arbitrarily blocked, subject to
the constraint of a limited buffer area; they
then presented an algorithm that determines an optimal combination of record
structures and block size, given a fixed main
memory buffer allocation.
The more complex design problem is analyzed as follows. First, the Eisner-Severance procedure is used to produce the family of parametric solutions. Ranges on the
primary block size for which each parametric solution is known to be optimal are then
established. Within each such range an optimal combination of primary block size
and segmentation is determined and the
cost of that combination calculated. The
least cost combination becomes a global
upper bound for the problem (all other
combinations may, of course, be discarded).
For each range in which an optimal solution
is not known, a lower bound is determined.
A range may be fathomed (i.e., eliminated
from further consideration) if its lower
bound is greater than the global upper
bound. For each range which remains, a
branch-and-bound procedure that either
fathoms the range or determines an optimal
segmentation-primary block size combination for that range is used. This segmentation-primary block size combination then
becomes the global upper bound (otherwise
the range would have been fathomed).
When all ranges have been analyzed, the
global upper bound is, in fact, optimal.
The procedure does guarantee to produce
an optimal assignment of data items to
primary and secondary segments, and total
cost savings of 60 percent over the cost of
using single-segment record structures have
been demonstrated. While this procedure
considers the cost of data storage, it does
not explicitly consider buffer space or update costs. The procedure has, however,
been interfaced with a model of secondary-memory management (see March, Severance, and Wilens, 1981) that implicitly considers these effects.
2.4 A Comparative Analysis of Record
Segmentation Techniques
Three approaches to record segmentation
have been discussed: mathematical clustering [Hoffer and Severance, 1975], iterative
grouping refinement [Hammer and Niamir,
1979], and bicriterion programming [Eisner, and Severance, 1976; March and Severance, 1977]. Table 2 characterizes each of
Table 2. A Characterization of Techniques for Record Segmentation

  Technique: Clustering
    Access paths: Sequential scan
    Secondary memory environment: Moving-head disk
    Class of record structures: 1 or more segments, no redundancy
    Evaluation criteria: Composite of retrieval, storage, and maintenance costs
    Type of algorithm: Mathematical clustering and branch and bound
  Technique: Iterative grouping refinement
    Access paths: Scan; predefined indexes
    Secondary memory environment: Paged
    Class of record structures: 1 or more segments, no redundancy
    Evaluation criteria: Number of page accesses
    Type of algorithm: Hill climbing
  Technique: Bicriterion programming
    Access paths: Any predefined access paths
    Secondary memory environment: Paged
    Class of record structures: 1 or 2 segments, no redundancy
    Evaluation criteria: Composite of retrieval and storage costs
    Type of algorithm: Bicriterion mathematical programming
these techniques by the scope of the problem addressed (i.e., the types of data access
paths, the secondary-memory environment
modeled, the class of alternative record
structures allowed, and the criteria by
which alternatives are evaluated) and the
type of algorithm used.
Assuming a flat file logical data structure,
stored without redundancy, each approach
establishes an assignment of data items to
subfiles, which improves the overall system
performance by either eliminating or reducing unnecessary data transfer. The mathematical clustering approach produces a permuted data-item × data-item similarity
matrix; this matrix must be subjectively
interpreted to form initial groups of data
items, which are then combined using a
branch-and-bound procedure. The iterative
grouping-refinement procedure iteratively
groups pairs of previously established data
item groups, and evaluates single data item
variations in the initial solution for improved performance. The mathematical
programming approach formulates the
problem as a bicriterion mathematical program which is solved by using a network
representation and classical operations research techniques. The solution isolates a
set of data items with high usefulness relative to their storage and transfer cost.
These data items are stored in a primary
segment; the remaining data items are
stored in a secondary segment.
The first two approaches are heuristic in
nature, and while both yield considerable
cost savings, neither can guarantee optimality. The first, mathematical clustering,
is more limited than the second, iterative
grouping refinement, in that it only consid-
ers sequential retrieval activity, while the
latter considers both sequential and random retrieval. The third approach, bicriterion programming, guarantees optimality;
however, it is restricted to a maximum of
two segments processed as primary and
secondary. It is capable of modeling both
sequential and random retrieval; but for
random retrieval, it assumes that both primary and secondary segments are unblocked.
Each of these techniques was used to
solve the personnel database problem
shown in Table 1. Extensions and/or modifications were made to the various techniques to permit solution of this problem
and a meaningful comparison of results. In
order to facilitate this comparison, the
characterization of the computer system
and data set shown in Table 3 was assumed for all experiments. In addition,
since the clustering technique is limited to
sequential-only data access, this restriction
was also initially placed on the other two
techniques. Following a presentation of the
results from each technique, a comparative
analysis is given.
When applied to the personnel database
problem shown in Table 1, the clustering
approach yields the permuted similarity
matrix of Table 4. The values in Table 4
represent the "normalized" similarities between each pair of data items. These values
range from a minimum of 1, when the data
Table 3. A Characterization of the Computer System and Data Set

  System parameters
    Disk device
    Minimum disk arm access time: 36.3 milliseconds
    Average disk arm access time: 43.3 milliseconds
    Rotation time: 16.7 milliseconds
    Size of a track (maximum block size): 13,030 characters
    Tracks per cylinder: 19
  Cost allocation
    Cost of buffer space: $2.27 x 10-~ per character-second
    Cost of disk space: $0.67 per cylinder per month
    Cost of retrieval time: $0.01 per second
  Data set parameters
    Number of data records: 30,000
items are "rarely" required together, to a
maximum of 100, when the data items are
always required together. This particular
example illustrates the difficulty of establishing initial clusters from the permuted
similarity matrix. It is clear, for example,
that data items 17, 18, 29, and 30 should be
initially assigned to the same segment since
their similarities are all 100. Data item 28,
on the other hand, has pairwise similarities
of 51, 51, 51, and 53 with data items 17, 18,
29, and 30, respectively. Should it also be
included? If not, where should it be initially
assigned? The initial clusters shown in
boxes in Table 4 were subjectively established by the author simply by intuitively
looking for "high" similarity values. The
results from (implicitly) evaluating, via a
branch-and-bound algorithm, all possible
combinations of these initial clusters, in
both a dedicated disk environment and a
paged-memory environment, are summarized in Table 5.
In order to similarly apply the iterative
grouping-refinement procedure to the personnel database problem, the calculation of
s (the number of segments required from a
subfile, f, by a user $u_r$) was modified to
allow a user to select multiple values for a
data item in a single retrieval. The modified
value is
$$s_{rf} = N \prod_{d_i \in I_{rf}} S_i \, M_{ri},$$
where N, Irf, and Si are as defined in Section
2.2 and Mri is the number of values of di
that are required for the selection criteria
of user $u_r$. In addition, in order to eliminate
the explicit storage of linkage data, it is
assumed that each segment-instance of a
single record-instance is stored and addressed by relative record number within
its respective subfile.
Again assuming the system and data set
parameters of Table 3, this procedure was
applied to the employee problem shown in
Table 1. To facilitate comparison with the
other record segmentation techniques,
three different accessing environments
were considered: (a) sequential access only,
(b) sequential access within a subfile (i.e.,
no indexes) but direct access (via relative
record number) from one segment to another of the same record, and (c) direct
access via a set of indexes (one for each
selection data item). The results of these
experiments are summarized in Table 6.
Finally, the bicriterion programming procedure was applied to the employee problem shown in Table 1. Again, the system
and data set parameters of Table 3 were
assumed and the three different accessing
environments described above were considered. In order to model the effects of secondary indexes, however, it must be assumed that both the primary and the secondary subffles are unblocked [Eisner and
Severance, 1976]. The results of these experiments are summarized in Table 7.
The results produced by each algorithm
are discussed below. Prior to this discussion, however, it should be noted that the
performance measures of each are not directly comparable. Although both the clustering algorithm and the bicriterion programming algorithm use "Operating Cost"
as the measure to be minimized, each evaluates this measure differently. The clustering algorithm includes the cost of buffer
[Table 4, the permuted data-item similarity matrix for the personnel database problem, with the subjectively established initial clusters shown in boxes, is not reproduced here.]
Table 5. Analysis of the Record Segmentation Selected via the Clustering Heuristic and Branch-and-Bound Algorithm

  Data item assignments (both environments): segment 1: 1-14, 16, 25-26; segment 2: 15; segment 3: 17-18, 28-30; segment 4: 19-24, 27

  (a) Dedicated disk environment
      Cost of retrieval and storage: unsegmented $134; trivial segmentation $120; selected segmentation $78
      Cost reduction: 41.8% over unsegmented; 35.0% over trivial segmentation
  (b) Paged-memory environment
      Cost of retrieval and storage: unsegmented $295; trivial segmentation $120; selected segmentation $121
      Cost reduction: 59.0% over unsegmented; -0.8% over trivial segmentation
Table 6. Analysis of the Record Segmentations Selected by the Iterative Grouping-Refinement Heuristic for the Personnel Database Problem

  (a) All sequential processing
      Selected segmentation: each data item stored in its own segment
      Number of accesses: unsegmented 294,824; trivial segmentation 41,979; selected segmentation 41,979
      Performance improvement: 85.8% over unsegmented; 0.0% over trivial segmentation
  (b) With no secondary indexes
      Selected segmentation (segment: data items): 1: 10; 2: 2; 3: 3; 4: 4; 5: 5; 6: 6; 7: 7; 8: 8; 9: 9; 10: 11; 11: 12; 12: 13; 13: 14; 14: 1, 15; 15: 16; 16: 17-18; 17: 27; 18: 19-20; 19: 26; 20: 21-22, 28; 21: 23-25; 22: 29; 23: 30
      Number of accesses: unsegmented 294,824; trivial segmentation 37,879; selected segmentation 30,159
      Performance improvement: 89.8% over unsegmented; 20.4% over trivial segmentation
  (c) All selection data items indexed
      Selected segmentation (segment: data items): 1: 1; 2: 2; 3: 3; 4: 4; 5: 5; 6: 6; 7: 7; 8: 8; 9: 9; 10: 10; 11: 11; 12: 12; 13: 13; 14: 14; 15: 15; 16: 16; 17: 17; 18: 18; 19: 19-24, 26-27; 20: 25; 21: 28; 22: 29; 23: 30
      Number of accesses: unsegmented 132,405; trivial segmentation 30,187; selected segmentation 29,804
      Performance improvement: 77.5% over unsegmented; 1.3% over trivial segmentation
space and the effects of disk arm movement
and rotational delay induced by the "simultaneous" processing of multiple subfiles
stored on a common disk. These factors are
ignored by the bicriterion programming algorithm, which assumes a paged-memory
Table 7. Analysis of the Record Segmentation Selected by the Bicriterion Programming Algorithm

  (a) Blocked primary and secondary, each sequentially processed
      Primary: 1-16; Secondary: 17-30
      Cost per month: unsegmented $206; segmented $96; cost reduction 53.4%
  (b) Blocked primary; unblocked, randomly processed secondary
      Primary: 1-16, 24-27; Secondary: 17-23, 28-30
      Cost per month: unsegmented $206; segmented $115; cost reduction 44.2%
  (c) Unblocked primary and secondary, both randomly processed (with an index for each selection data item)
      Primary: 1-16, 23-27; Secondary: 17-22, 28-30
      Cost per month: unsegmented $1884; segmented $1878; cost reduction 0.3%
environment and considers only page access and transfer costs. Both include the
cost of data storage. In contrast, the iterative grouping-refinement algorithm does
not explicitly consider cost but attempts to
minimize the number of page accesses (multiplying by the cost per page access, of
course, gives total access cost). It does not
consider the cost of data storage.
In a sequential-only data processing environment, the iterative grouping-refinement procedure selected the trivial partition in which each data item is stored in a
distinct segment (Table 6a). For the same
access environment, quite different segmentations were selected by the other approaches (see Tables 5a and 7a). The reason
for this discrepancy is the difference in the
criteria used to evaluate a segmentation.
The iterative grouping-refinement technique seeks to minimize the total number
of page accesses, which, as discussed above,
will always result in the selection of the
trivial partition if all subfiles are sequentially processed. The clustering technique,
on the other hand, seeks to minimize a
function of retrieval and storage costs that
takes into account the complexities of secondary memory and multiple file processing. It is important to note that when the
secondary memory environment is restricted to page accessing, then, provided
that the cost of buffer space is "reasonable,"
the trivial partition (i.e., with each data
item in its own segment) is also "optimal"
for the clustering technique; however, since
initial clusters are subjectively established,
this solution may never be considered (see
Table 4). Finally, while the bicriterion programming procedure seeks to minimize the
sum of data access costs and storage costs
in a paged-memory environment, it is limited to a maximum of two segments and
therefore does not consider the trivial partition.
In order to compare each procedure's
performance in generating efficient record
structures for sequentially processed files,
the segmentations selected by each procedure were evaluated assuming both a dedicated disk and paged-memory environments (by the evaluation procedure used in
the clustering technique). The results are
summarized in Table 8. As shown in that
table, the clustering algorithm selected an
efficient segmentation for both the dedicated disk and paged-memory environments, while the iterative grouping-refinement algorithm selected a segmentation
that was considerably less efficient for the
dedicated disk environment but slightly
more efficient for the paged-memory environment. The segmentation selected by the
bicriterion programming algorithm, while
never the most efficient, is also an efficient
segmentation. Given the increase in complexity as the number of segments per record increases, the restriction to two segments may actually be desirable. Note that
the bicriterion programming algorithm
took nearly two orders of magnitude less
computer time to produce its solution than
either of the other two algorithms. Timing
consideration~ are discussed in more detail
at the end of this section.
The remaining processing environments
require direct access to at least some subfiles. The clustering approach is inappropriate for such problems and therefore is
not included in the following discussion.
Table 8. Analysis of Solutions for Sequential-Only Retrieval

  Clustering (segments: 1: 1-14, 16, 25-26; 2: 15; 3: 17-18, 28-30; 4: 19-24, 27)
    Dedicated disk environment: cost of retrieval and storage $78; cost reduction 41.8% over unsegmented, 35.0% over trivial
    Paged-memory environment: cost of retrieval and storage $121; cost reduction 59.0% over unsegmented, -0.8% over trivial
  Iterative grouping refinement (trivial segmentation)
    Dedicated disk environment: $120; 10.4% over unsegmented, 0.0% over trivial
    Paged-memory environment: $120; 59.3% over unsegmented, 0.0% over trivial
  Bicriterion programming (segments: 1: 1-16; 2: 17-30)
    Dedicated disk environment: $82; 38.8% over unsegmented, 31.7% over trivial
    Paged-memory environment: $135; 54.2% over unsegmented, -12.5% over trivial
The segmentations selected by the iterative grouping-refinement algorithm and
the bicriterion programming algorithm are
widely different from each other in each of
the processing environments considered.
Again, this results from the evaluation measures used and the limitations of each procedure. The bicriterion programming procedure limits the number of segments to
two and requires that subfiles be unblocked in order to model random processing. Clearly, the ability to block segments contributes substantially to the processing efficiency when moderate-to-large portions of a file (subfile) are retrieved. Eisner and Severance [1976] point out that record segmentation has considerably greater potential for cost saving when subfiles are blocked. These effects are seen in Table 7, which shows significant cost reduction when at least the primary subfile is blocked
and minimal cost reduction when neither
primary nor secondary subfiles are blocked.
In addition, as discussed in Section 1, the
existence of secondary-access paths reduces
the potential for cost reduction. This is seen
in Table 6, where the percentage of reduction in the number of accesses drops from
89.8 percent to 77.5 percent when indexes
are added for the selection data items.
The effects of secondary indexes are even
more pronounced when the retrieval emphasis shifts from large subset to small subset. In the personnel problem (Table 1) the
first eight retrievals require large subsets
(90 percent or more) of the data records,
while the last ten retrievals require a small
subset (1 percent or less) of the data records. Focusing only on the small subset (or
so-called "executive" retrievals [Eisner and
Severance, 1976]), both algorithms were
used to determine efficient record segmentations. The results are summarized in Tables 9 and 10. The performance improvement decreased from well over 93 percent
to under 57 percent for the iterative grouping-refinement algorithm and from nearly
45 percent to 0 percent for the bicriterion
programming algorithm.
Consider the results from the iterative
grouping-refinement algorithm shown in
Table 9. The overall performance improvement obtained by indexes (with unsegmented records) is a 91 percent reduction
in the number of accesses required (178,204
to 15,784). The incremental improvement
obtained by segmentation is only 4 percent
(from 15,784 to 8,399). From the other perspective, however, the overall performance
improvement obtained by segmentation
alone (without indexes) is over 93 percent
(178,204 to 11,949). The incremental improvement due to indexing is under 2 percent (form 11,949 to 8,399), yielding a total
improvement of 95 percent. Either segmentation alone or indexing alone contributes
significant performance improvement and
greatly reduces the contribution of the
other. To see why this is true, we must
consider the effect of isolating a "selection"
Table 9. Analysis of the Record Segmentations Selected by the Iterative Grouping-Refinement Heuristic for Executive Retrievals

  (a) All sequential processing
      Selected segmentation: each data item stored in its own segment
      Number of accesses: unsegmented 178,204; trivial segmentation 21,400; selected segmentation 21,400
      Performance improvement: 88.0% over unsegmented; 0.0% over trivial segmentation
  (b) With no secondary indexes
      Selected segmentation (segment: data items): 1: 10; 2: 2; 3: 3; 4: 6; 5: 5; 6: 26; 7: 7; 8: 1, 8-9; 9: 11; 10: 12; 11: 13-14; 12: 15-16; 13: 27; 14: 17; 15: 18; 16: 24; 17: 4, 19-20, 22-23; 18: 21; 19: 25; 20: 28; 21: 29; 22: 30
      Number of accesses: unsegmented 178,204; trivial segmentation 17,301; selected segmentation 11,949
      Performance improvement: 93.3% over unsegmented; 30.9% over trivial segmentation
  (c) All selection data items indexed
      Selected segmentation (segment: data items): 1: 1, 5, 7-9, 13-14; 2: 2-4, 12, 19-24, 26-27; 3: 6; 4: 10; 5: 11; 6: 15-16; 7: 17; 8: 18; 9: 25; 10: 28; 11: 29; 12: 30
      Number of accesses: unsegmented 15,784; trivial segmentation 9,607; selected segmentation 8,399
      Performance improvement: 46.8% over unsegmented; 12.6% over trivial segmentation
Table 10. Analysis of the Record Segmentations Selected by the Bicriterion Programming Algorithm for Executive Retrievals

  (a) Blocked primary and secondary, each sequentially processed
      Primary: 1-14; Secondary: 15-30
      Cost per month: unsegmented $136; segmented $65; cost reduction 52.2%
  (b) Blocked primary; unblocked, randomly processed secondary
      Primary: 2-3, 5, 7, 10, 14, 24, 26-27; Secondary: 1, 4, 6, 8-9, 11-13, 15-23, 25, 28-30
      Cost per month: unsegmented $136; segmented $59; cost reduction 56.6%
  (c) Unblocked primary and secondary, both randomly processed (with an index for each selection data item)
      Primary: 1-14, 19-24, 26-27; Secondary: 15-18, 25, 28-30
      Cost per month: unsegmented $37; segmented $37; cost reduction 0.0%
data item in its own segment. Such an arrangement effectively reduces the search time required to satisfy that selection criterion. The subfile is itself an index since it merely contains (value, pointer) pairs for
each record in the file (the pointer part being explicitly maintained by position in the subfile). While the time required to search the subfile for the appropriate "values" can be reduced by secondary
Table 11. Computational Time (Seconds) for Each Record Segmentation Procedure

  Clustering
    Personnel database: sequential only 253.5; without indexes --; with indexes --
    Executive retrievals: without indexes --; with indexes --
  Iterative grouping refinement
    Personnel database: sequential only 152.3; without indexes 1202.1; with indexes 539.3
    Executive retrievals: without indexes 564.5; with indexes 282.9
  Bicriterion programming
    Personnel database: sequential only 5.6; without indexes 4.5; with indexes 4.3
    Executive retrievals: without indexes 2.7; with indexes 2.6
indexes, the improvement is mitigated by the fact that such subfiles are quite small compared to an unsegmented file.
Similar results are obtained from the bicriterion programming algorithm although
the effects of segmentation are not quite as
dramatic since the number of segments is
limited to two. As shown in Table 10, with
unsegmented records, indexes reduced the
total cost by 73 percent ($136 to $37); no
additional improvement was obtained by
segmentation. Again, from the other perspective, with no indexes, segmentation reduced the total cost by over 52 percent
($136 to $65). Indexing further reduced the
cost by over 20 percent (from $65 to $37),
yielding a total improvement of nearly 73
percent.
One final factor is considered: the run
time of the algorithms. Table 11 shows the
time required by each procedure, running
on a CDC Cyber 74 Computer, to solve
the database problems discussed above.
Clearly, the bicriterion programming approach is considerably faster than either
the mathematical clustering or the iterative
grouping-refinement approaches. However,
Hammer and Niamir have suggested techniques for improving the speed of the iterative grouping-refinement procedure that
were not incorporated into the implementation used by the author.
While the above techniques provide valuable insight into the record structuring
problem, the restriction to flat file data
representations severely limits their applicability for modern database design problems, where the logical data structure is
typically quite complex. The following section presents a record structuring technique, called "aggregation," that recognizes
the complexity of the logical data structures
typical of modern databases.
3. HIERARCHIC AGGREGATION
Aggregation is a type of record structuring
that physically arranges groups of data
items that have been defined by some data
model. It is motivated by the recognition
that in database systems the logical structure of the data and the activities on that
data are typically complex and interrelated.
As with record segmentation, the task of
determining an efficient aggregation may
be formally stated as that of minimizing
storage and retrieval costs over all database
users. However, retrieval and storage cost
functions, and the set of possible aggregations over which their sum is minimized,
are highly dependent upon the data model
under consideration.
This section discusses a technique
termed hierarchic aggregation. As its
name implies, hierarchic aggregation assumes that the logical data structure is
expressed in the hierarchic data model.
Thus the set of aggregations from which a
record structure is selected is restricted
to "child" segments being aggregated iteratively into "parent" segments. Such an
approach is also possible for the network
and relational data models; however, its
efficiency is highly dependent on the single
parent restriction of the hierarchic model,
and this restriction does not apply to the
network or relational models.
Schkolnick [1977] developed hierarchic
aggregation for use in designing IMS databases since IMS is a widely used hierarchic
DBMS. The scope of the problem addressed by Schkolnick is restricted to hier-
Figure 5. A graphical representation of a hierarchic personnel database. (Figure not reproduced; the tree has root NAME with children ADDRESS, PAYROLL, and SKILL, and SKILL has children EXPERIENCE and EDUCATION.)
archic organizations of data stored without redundancy in a paged secondary-memory environment, and sequentially processed via hierarchic scans. A hierarchic organization of data is simply one which can be represented by a tree structure. The basic element considered is termed a segment. Again, a segment is defined by the set of data items it contains; however, in this context, a segment is assumed to be indivisible (hence it is the "basic element"). Each segment is described by length (the number of characters required to store the segment) and type (an identifier for the items stored within the segment). Segments are organized into trees by means of a type tree. A type tree is an (m + 1)-tuple $T = (S, T_1, T_2, \ldots, T_m)$, where S is the root segment of the type tree, $T_1, T_2, \ldots, T_m$ are children type trees of S, and m is termed the root fanout of T; if m = 0, then T is called a root tree.

Figure 5 shows a graphical representation of a type tree, $T = (\mathrm{NAME}, T_1, T_2, T_3)$, for a personnel database. $T_1 = (\mathrm{ADDRESS})$ and $T_2 = (\mathrm{PAYROLL})$ are root trees, whereas $T_3 = (\mathrm{SKILL}, T_4, T_5)$ is a type tree containing $T_4 = (\mathrm{EXPERIENCE})$ and $T_5 = (\mathrm{EDUCATION})$. Each employee has exactly one NAME segment (containing, e.g., first, middle, last name, employee pass number, social security number), some number of ADDRESS and PAYROLL segments (containing, perhaps, current and historical information), and some number of SKILL type trees (each of which contains a SKILL segment and some number of EXPERIENCE and EDUCATION segments related to that SKILL). The number of instances of a type tree $T_j$ associated with a root segment S is denoted $Z_j$ and is called the Z-fanout of segment j.

A type tree defines a hierarchic organization. Given a type tree, a physical storage scheme must be devised which efficiently supports the activity upon that structure. Two storage schemes are immediately obvious. First, one may use the hierarchic preorder to linearize the structure and then simply store segments sequentially, according to that ordering, in a single file organization. Alternatively, one may store each segment in a separate file organization. The first structure has obvious advantages for the retrieval of all information for a particular instance of the main type tree (e.g., all information describing a single employee), while the second has advantages for the retrieval of all instances of a single segment tree (e.g., skill information for all employees). Unfortunately, retrieval requirements are rarely so simply characterized, and the selection of efficient record structures must take into consideration the type and frequencies of retrievals, as well as the actual structure of the hierarchy.

The number of possible record structures is a function of the number of segments. Consider an n-segment type tree T, and arbitrarily index each segment by the order in which it would appear in a preorder traversal (see Knuth, 1973) of that tree. An
aggregation C of that type tree T is a set $X_C = \{x_j(C)\}_{j=2}^{n}$, where $x_j(C)$ is a 0-1 variable such that

$$x_j(C) = \begin{cases} 1, & \text{if } j \text{ is stored with its parent;} \\ 0, & \text{otherwise.} \end{cases}$$

A file organization is defined for each segment which is not stored with its parent.
Since each segment has exactly one parent
(except the root segment, which has no
parents), there are 2 n-1 possible ways to
aggregate the segments of the type tree T.
Figure 6a shows several ways of aggregating
the segments of the type tree illustrated in
Figure 5. Figure 6b shows the corresponding record structures for those aggregations,
with repetitions of segments shown only
when those segments are aggregated.
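The $2^{n-1}$ count is easy to verify by enumeration. The sketch below (not Schkolnick's code) indexes the segments of the Figure 5 tree in preorder and, for one aggregation, derives which segments end up in the same file organization; the example aggregation corresponds to $X_2 = \{0, 0, 1, 0, 1\}$ used later in this section.

from itertools import product

# Preorder indexing of the Figure 5 tree:
# 1 NAME, 2 ADDRESS, 3 PAYROLL, 4 SKILL, 5 EXPERIENCE, 6 EDUCATION
names  = {1: 'NAME', 2: 'ADDRESS', 3: 'PAYROLL', 4: 'SKILL',
          5: 'EXPERIENCE', 6: 'EDUCATION'}
parent = {1: None, 2: 1, 3: 1, 4: 1, 5: 4, 6: 4}

def aggregations(parent):
    """Yield each aggregation as a dict {j: x_j} over the non-root segments."""
    non_root = [j for j, p in parent.items() if p is not None]
    for bits in product((0, 1), repeat=len(non_root)):
        yield dict(zip(non_root, bits))

def file_organizations(x, parent):
    """A segment starts a new file organization exactly when it is not
    stored with its parent; otherwise it joins its parent's file."""
    root_of = {}
    for j in sorted(parent):                      # preorder: parents first
        root_of[j] = j if parent[j] is None or x[j] == 0 else root_of[parent[j]]
    files = {}
    for j, r in root_of.items():
        files.setdefault(r, []).append(names[j])
    return list(files.values())

print(sum(1 for _ in aggregations(parent)))       # -> 32 = 2**(6-1)
# The aggregation X_2 = {0, 0, 1, 0, 1}: SKILL and EDUCATION stored with NAME.
x = {2: 0, 3: 0, 4: 1, 5: 0, 6: 1}
print(file_organizations(x, parent))
# e.g. [['NAME', 'SKILL', 'EDUCATION'], ['ADDRESS'], ['PAYROLL'], ['EXPERIENCE']]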
The objective of hierarchic aggregation
is to determine a set of record structures
that meets the database activity with the
minimal number of page accesses. Considering segments to be indivisible units and
given a set of user retrieval requests defined
by hierarchical scans (i.e., by the transitions
among structurally connected segments),
Schkolnick developed criteria for determining the number of page accesses required to
satisfy a retrieval request for a given set of
record structures. These criteria are based
upon the type of transition required and
the record structures (and thus the file organizations) defined by the selected aggregation. Since it is assumed that file organizations are independently processed, then
any transition from one file organization to
another is assumed to cause a page access.
Only transitions that occur within a single
file organization need be considered further.
Schkolnick argues that there are only
three types of transitions of concern:
(1) parent-child transition (PCT), a transition from a parent segment to the first
instance of one of its children (e.g., from
a NAME segment to the first instance
of a PAYROLL segment for a single
employee);
(2) twin transition (TT), a transition from
one instance to the next (in hierarchic
order) of the same segment type, provided that no parent segment appears
between them (e.g., from one instance
of a SKILL segment to the next for the
same employee);
(3) child-parent transition (CPT), a transition from a child segment j to the next instance (in hierarchic order) of a parent segment type i (e.g., from an instance of a SKILL segment for one
employee to the instance of the NAME
segment for the next employee). This
next instance is called the "uncle" segment of the given child segment.
All database activity may then be characterized by the number of such transitions
given by
$n_{ij}$ = the number of PCTs between parents of type i and children of type j,
$n_{ii}$ = the number of TTs between instances of segment type i, and
$n_{ji}$ = the number of CPTs between children of type j and parents of type i.
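These counts can be tallied directly from a hierarchic-scan trace. The following is a minimal sketch (not from the paper); the trace and type hierarchy at the bottom are hypothetical.

def tally(trace, parent_type):
    """trace: list of segment type names in hierarchic-scan order.
    parent_type: {child_type: parent_type} (None for the root type)."""
    def ancestors(t):
        while parent_type.get(t) is not None:
            t = parent_type[t]
            yield t
    n_pct, n_tt, n_cpt = {}, {}, {}
    for g, h in zip(trace, trace[1:]):
        if parent_type.get(h) == g:
            n_pct[(g, h)] = n_pct.get((g, h), 0) + 1      # parent-child (PCT)
        elif g == h:
            n_tt[g] = n_tt.get(g, 0) + 1                  # twin (TT)
        elif h in ancestors(g):
            n_cpt[(g, h)] = n_cpt.get((g, h), 0) + 1      # child-parent (CPT)
    return n_pct, n_tt, n_cpt

# Hypothetical scan of two NAME instances, each followed by two SKILL children.
trace = ['NAME', 'SKILL', 'SKILL', 'NAME', 'SKILL', 'SKILL']
parent_type = {'NAME': None, 'SKILL': 'NAME'}
print(tally(trace, parent_type))
# -> ({('NAME', 'SKILL'): 2}, {'SKILL': 2}, {('SKILL', 'NAME'): 1})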
The record structuring problem may be
formulated as
$$\min_{C \in \theta} P(C) = \sum_{i \in T} \Big( n_{ii} P_{ii}(C) + \sum_{j \in S(i)} [\, n_{ij} P_{ij}(C) + n_{ji} P_{ji}(C) \,] \Big),$$

where $\theta$ is the set of all possible aggregations, S(i) is the set of segments which are sons of segment $i \in T$, and $P_{ii}(C)$, $P_{ij}(C)$, and $P_{ji}(C)$ are the expected number of page faults for TT, PCT, and CPT transitions, respectively, for a given aggregation $C \in \theta$. Expressions for $P_{ii}(C)$, $P_{ij}(C)$, and $P_{ji}(C)$
are obtained by determining the expected
number of page references for each transition and then determining the relationship
between page references and page faults.
Assuming that each file organization has
some number of buffers in main memory
which hold the most recently accessed
pages from that file organization, then a
new page reference causes a page fault only
if the referenced page is not in a buffer.
For a specific aggregation, the number of
page faults caused by a set of transitions
depends on the frequency of each type of
transition, on the conditional probability
that each transition type is followed by any
of the others, and on the number of pages
held in main memory buffers. Since it is
[Figure 6, which shows several aggregations of the type tree of Figure 5 and their corresponding record structures, is not reproduced here.]
assumed that access is done via hierarchic scan and that each file organization has at least one buffer, conditions for page faults for each type of transition in terms of new page references are characterized as follows:

PCT: any new page reference causes a page fault (since it is highly unlikely that the page containing the child segment is in main memory from a previous access);
TT: any new page reference causes a page fault (again, since it is highly unlikely that the page containing the twin is in main memory from a previous access); and
CPT: a new page reference causes a page fault if and only if the parent segment is on a page different from the uncle segment (since the parent must be accessed before the child in a hierarchic scan).

In addition, since it is assumed that file organizations are independently maintained, any transition from one file organization to another is assumed to cause a page fault.

To determine the expected number of page references for each type of transition, the distance between segments in the aggregation must first be determined. Since aggregated segments are stored in hierarchic order, the distance between two segments is simply the sum of the lengths of all intervening segments in the hierarchic order. Consider, for example, the aggregation defined by $X_2 = \{0, 0, 1, 0, 1\}$ (see Figure 6) and assume that the segment lengths are given by $l_N = 20$, $l_A = 50$, $l_P = 10$, $l_S = 15$, $l_X = 30$, $l_E = 5$ for NAME, ADDRESS, PAYROLL, SKILL, EXPERIENCE, and EDUCATION segments, respectively. Further assume that the Z-fanout for each segment is given by $Z_A = 2$, $Z_P = 52$, $Z_S = 3$, $Z_X = 5$, $Z_E = 2$ (the Z-fanout is irrelevant for the root segment NAME). Then the distance between twin ADDRESS, PAYROLL, and EXPERIENCE segments ($D_{AA}$, $D_{PP}$, $D_{XX}$, respectively) is simply the length of each segment, 50, 10, and 15, respectively, since they are each root segments and not involved in any aggregations. Between each NAME segment, however, are three subtrees rooted at SKILL ($Z_S = 3$), each of which contains one SKILL segment and two EDUCATION segments ($Z_E = 2$). Thus, the distance between NAME segments, $D_{NN}$, is

$$D_{NN} = l_N + Z_S(l_S + Z_E l_E) = 85.$$

Functions are similarly calculated for the distance between parent-child segments and child-parent segments.

Given these distance functions, the probability of a new page reference for any type of transition is determined as follows. Consider a transition from a segment of type g to a segment of type h (where g may be the same as h). If the distance from g to h, $D_{gh}$, is greater than the page size, then the transition will always cause a new page reference. If, on the other hand, the distance between g and h, $D_{gh}$, is less than the page size, then a new page must be referenced only if the distance between segment g and the end of the page on which it is stored is less than $D_{gh}$. Assuming that g is randomly located on a page, then the probability of this occurring is simply the ratio of the distance between g and h, $D_{gh}$, and the page size, B. For a given aggregation C, the probability of a new page reference for a transition from a segment of type g to one of type h may be mathematically stated:

$$\delta_{gh}(C) = \min(D_{gh}(C)/B, 1).$$

Expressions for the expected number of page faults for each type of transition (assuming that i is the parent of j) for a given aggregation, C, are as follows:

(1) For a parent-child transition, if child segment j is stored with its parent (i.e., $x_j(C) = 1$), then $P_{ij}(C)$ is simply the probability of a new page reference; otherwise, $P_{ij}(C) = 1$. Mathematically, this may be stated as

$$P_{ij}(C) = \delta_{ij}(C)x_j(C) + (1 - x_j(C)).$$

(2) For a twin transition, $P_{jj}(C)$ is simply the probability of a new page reference, or

$$P_{jj}(C) = \delta_{jj}(C).$$

(3) For a child-parent transition, if the child segment j is stored with its parent
($x_j(C) = 1$), then $P_{ji}(C)$ is the probability of a new page reference; otherwise, it is the probability that the uncle segment (of type i) is stored on a different page than the parent (also of type i). Mathematically, this may be stated as

$$P_{ji}(C) = \delta_{ji}(C)x_j(C) + \delta_{ii}(C)(1 - x_j(C)).$$
The cost function for a subtree rooted at i may then be stated

$$K_i(C) = n_{ii}\delta_{ii}(C) + \sum_{j \in S(i)} \Big( n_{ij}[\delta_{ij}(C)x_j(C) + (1 - x_j(C))] + n_{ji}[\delta_{ji}(C)x_j(C) + \delta_{ii}(C)(1 - x_j(C))] + K_j(C_j) \Big),$$

where $C_j$ is the restriction of C to only those variables in the subtree rooted at j. Since $K_1(C) = P(C)$, the aggregation that minimizes $K_1(C)$ also minimizes P(C).
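Given precomputed inter-segment distances, the page-fault expressions above are simple to evaluate. The sketch below is not Schkolnick's algorithm: the distance values, transition counts, and page size are hypothetical, the distance formulas themselves are not reproduced, and cross-file transitions would have to be encoded by giving them a distance of at least B so that they always fault.

def delta(D, g, h, B):
    """Probability that a g -> h transition references a new page."""
    return min(D[(g, h)] / B, 1.0)

def cost(tree, x, D, n_pct, n_tt, n_cpt, B):
    """tree: {parent: [children]}; x[j] = 1 if segment j is stored with its
    parent; n_pct[(i, j)], n_tt[i], n_cpt[(j, i)]: transition counts."""
    total = 0.0
    for i, children in tree.items():
        total += n_tt.get(i, 0) * delta(D, i, i, B)          # twin transitions
        for j in children:
            d_ij, d_ji, d_ii = delta(D, i, j, B), delta(D, j, i, B), delta(D, i, i, B)
            p_pct = d_ij if x[j] else 1.0                    # parent-child
            p_cpt = d_ji if x[j] else d_ii                   # child-parent
            total += n_pct.get((i, j), 0) * p_pct + n_cpt.get((j, i), 0) * p_cpt
    return total

# Hypothetical two-segment tree with SKILL stored together with NAME.
tree = {'NAME': ['SKILL'], 'SKILL': []}
x    = {'SKILL': 1}
D    = {('NAME', 'NAME'): 85, ('SKILL', 'SKILL'): 60,
        ('NAME', 'SKILL'): 20, ('SKILL', 'NAME'): 45}
print(cost(tree, x, D,
           n_pct={('NAME', 'SKILL'): 1000}, n_tt={'NAME': 1000, 'SKILL': 2000},
           n_cpt={('SKILL', 'NAME'): 1000}, B=4000))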
A brute-force approach to the problem
generates $K_i(C)$ for all possible (i.e., $2^{n-1}$) aggregations $C \in \theta$. Schkolnick noted that
the brute-force approach spent a great deal
of time generating and evaluating record
structures that were known to be nonoptimal (i.e., ones that were not optimally aggregated). Using this information, he developed a branch-and-bound algorithm that
produces an optimal grouping of segments
in linear time. To test his algorithm,
Schkolnick collected usage statistics from
an actual 25-segment hierarchical database
for a period of 24 hours and used his branch-and-bound algorithm to determine an optimal record structuring. Five alternative
record structurings were evaluated for comparison. Schkolnick then loaded the database into each of the six configurations and
obtained measurements of the actual system performance. In no case was the performance of any of the alternative configurations better than that of the configuration
selected by the algorithm--in most cases it
was considerably worse.
While this approach has been studied
only for the restricted case of the hierarchic
data model, it is also applicable to
other data models. Teorey and Fry [1982]
suggest a two-step approach to aggregating
network structures. In the first step, the
network structure is broken down into a set
of overlapping hierarchies and the Schkolnick algorithm directly applied to each hierarchy. In the CODASYL terminology
used by network DBMSs, a network is defined by record types related via O W N E R M E M B E R SETs, and a record type may
be a M E M B E R of more than one SET.
Forming a set of hierarchies from a network
structure is relatively straightforward:
when a record type is a M E M B E R of only
one SET, then the O W N E R and MEMB E R record types are already hierarchically related and no further work need be
done; however, when a record type is a
M E M B E R of more than one SET, then a
distinct hierarchy must b e formed for each
O W N E R record type, with the M E M B E R
record type appearing as a child. Since a
record type may appear in more than one
hierarchy, it may also appear in more than
one aggregation. Assuming that data are
stored nonredundantly, then record types
appearing in more than one aggregation
conflict with the method of implementation. Such conflicts are resolved in the second step of the Teorey and Fry procedure
by evaluating all possible combinations of
conflicting aggregations.
Consider, for example, the "classic" EMPLOYEE-PROJECT database illustrated in Figure 7a with the three record types (EMPLOYEE, PROJECT, and ASSIGNMENT) and two sets (EMPLOYEE-ASSIGNMENT and PROJECT-ASSIGNMENT). Then two hierarchies must be created as shown in Figure 7b. If the application of the Schkolnick algorithm to each hierarchy results in ASSIGNMENT being aggregated together with both EMPLOYEE and PROJECT, then two alternative CODASYL implementations must be evaluated: (1) ASSIGNMENT aggregated into EMPLOYEE with PROJECT stored separately (denoted by the dashed box on the left side of Figure 7c) and (2) ASSIGNMENT aggregated into PROJECT with EMPLOYEE stored separately (again denoted by the dashed box on the right side of Figure 7c). This approach may similarly be applied to the relational model in which multiple relations can be hierarchically structured (e.g., by foreign keys or partial key matches).
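The first step of the Teorey and Fry procedure can be sketched as follows. This is a simplified, single-level illustration rather than their published algorithm: it only splits record types that are MEMBERs of more than one SET, and the sample network is the EMPLOYEE-PROJECT example.

def overlapping_hierarchies(sets):
    """sets: list of (OWNER, MEMBER) pairs describing a CODASYL network.
    Returns a list of hierarchies, each as {parent: [children]}."""
    owner_count = {}
    for _, member in sets:
        owner_count[member] = owner_count.get(member, 0) + 1

    base = {}        # single-owner MEMBERs stay with their OWNER
    extra = []       # one hierarchy per OWNER of a shared MEMBER
    for owner, member in sets:
        if owner_count[member] == 1:
            base.setdefault(owner, []).append(member)
        else:
            extra.append({owner: [member]})
    return ([base] if base else []) + extra

# ASSIGNMENT is a MEMBER of two SETs, so it appears in two hierarchies.
network = [('EMPLOYEE', 'ASSIGNMENT'), ('PROJECT', 'ASSIGNMENT')]
for h in overlapping_hierarchies(network):
    print(h)
# -> {'EMPLOYEE': ['ASSIGNMENT']}
#    {'PROJECT': ['ASSIGNMENT']}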
Figure 7. Aggregation of an employee-project database: (a) a CODASYL network; (b) overlapping hierarchies; (c) alternative CODASYL implementations. (Figure not reproduced.)
4. SUMMARY AND DIRECTIONS FOR
FURTHER RESEARCH
This paper has addressed the problem of
selecting efficient record structures for a
database system. Simply stated, the problem is to determine which data items should
be physically stored together so that the
total operating costs are minimized. The
solution depends on the intrinsic structure
of the data (as expressed in some data
model), its volume and volatility, characteristics of the user retrieval requirements
and the computer system environment, and
the selection of data access paths within
the database.
Two aspects of record structuring have
been delimited, segmentation and aggregation, and alternative techniques for each
aspect have been presented and discussed.
A complete record structuring for a database would include both aspects; however,
current techniques focus on only one or the
other. Mathematical clustering, iterative
grouping refinement, and mathematical
programming each address record segmentation. Mathematical clustering considers
only sequential processing but includes a
detailed costing model of secondary memory. Iterative grouping refinement considers both sequential and random processing
but assumes a paged-memory environment.
Both of these techniques are heuristic in
nature and cannot guarantee an optimal
solution. Mathematical programming, on
the other hand, does guarantee optimality;
however, it limits the number of segments
to two, processed as primary and secondary. Again, it assumes a paged-memory environment. Hierarchic aggregation addresses record aggregation. As its name implies, this technique was developed for the
hierarchic data model. It may, however,
also be applied to the network data model
by first decomposing the network into a set
of overlapping hierarchies.
There are several major areas of concern
for future research, each of which is motivated by weaknesses in the techniques discussed above. The first is data redundancy.
While data redundancy causes both update
efficiency and integrity concerns, it may be
of great practical significance in meeting
data retrieval requirements, particularly
where the volume of update is not significant. None of the techniques discussed
above address data redundancy.
A second area for research is that of
determining the effects of update operations. The costs associated with doing update operations are significantly different
from those of retrieval operations and
therefore affect the selection of record
structures in different ways. In addition,
alternative record structures vary greatly
in their requirements for backup and recovery procedures. Consider, for example, a
record structure that isolates all data items
subject to update. Clearly, the backup and
recovery considerations for the volatile
subfile are quite different from those for
the stable file, and substantial cost savings
may be realized by such an arrangement.
In any database serving a diversified user
community, the question of data security
must also be considered. Although in certain cases segmented record structures may
be more expensive to implement, for security reasons it may be desirable to isolate
sensitive data items in a high-security segment, perhaps stored on a separate disk
device and requiring specific access authority.
The final area for future research is the
integration of record structuring with the
selection of data access paths. What is
needed is a comprehensive model that will
include both aggregation and segmentation
and explicitly incorporate the effects of alternative data access paths. Such a model
would serve as the basis for an integrated
approach to database design optimization.
ACKNOWLEDGMENTS
The research described in this paper was supported in
part by the David W. Taylor Naval Ship Research
and Development Center under contract N00167-80C-0061. The author would like to thank the referees
for their incisivecomments on earlierversions of this
paper. Finally,the author would liketo thank Dennis
Severance, Mario Schkolnick, and D o n Batory for
their advice and encouragement in the preparation of
this paper.
REFERENCES
ARONSON, J. Data Compression--A Comparison of Methods. National Bureau of Standards, Special Publication 500-12, June 1977, 39 pp.
BATORY, D. S. "On searching transposed files." ACM Trans. Database Syst. 4, 4 (Dec. 1979), 531-544.
BLASGEN, M. W., ASTRAHAN, M. M., CHAMBERLIN, D. D., GRAY, J. N., KING, W. F., LINDSAY, B. G., LORIE, R. A., MEHL, J. W., PRICE, T. G., PUTZOLU, G. R., SCHKOLNICK, M., SELINGER, P. G., SLUTZ, D. R., STRONG, H. R., TRAIGER, I. L., WADE, B. W., AND YOST, R. A. "System R: An architectural overview." IBM Syst. J. 20, 1 (1981), 41-62.
CARDENAS, A. F. "Analysis and performance of inverted data base structures." Commun. ACM 18, 5 (May 1975), 253-263.
CARLIS, J. V., AND MARCH, S. T. "A computer aided database design methodology." MISRC Tech. Rep. TR-81-01, Management Information Systems Research Center, School of Management, Univ. of Minnesota, Minneapolis, Minn., July 1980.
CHAMBERLIN, D. D., ASTRAHAN, M. M., BLASGEN, M. W., GRAY, J. N., KING, W. F., LINDSAY, B. G., LORIE, R. A., MEHL, J. W., PRICE, T. G., PUTZOLU, G. R., SCHKOLNICK, M., SELINGER, P. G., SLUTZ, D. R., TRAIGER, I. L., WADE, B. W., AND YOST, R. A. "A history and evaluation of System R." Commun. ACM 24, 10 (Oct. 1981), 632-646.
CHEN, P. P. "The entity-relationship model--Toward a unified view of data." ACM Trans. Database Syst. 1, 1 (March 1976), 9-36.
CODASYL Data Base Task Group Report. ACM, New York, 1971.
CODASYL Data Description Language, Journal of Development. Secretariat of Canadian Government EDP Standards Committee, Hull, Quebec, 1978.
CODD, E. F. "A relational model of data for large shared data banks." Commun. ACM 13, 6 (June 1970), 377-387.
CODD, E. F. "Relational database: A practical foundation for productivity." Commun. ACM 25, 2 (Feb. 1982), 109-117.
DATE, C. J. An Introduction to Database Systems, 2nd ed. Addison-Wesley, Reading, Mass., 1977.
DAY, R. H. "On optimal extracting from a multiple file data storage system: An application of integer programming." Oper. Res. 13, 3 (May 1965), 482-494.
EISNER, M. J., AND SEVERANCE, D. G. "Mathematical techniques for efficient record segmentation in large shared databases." J. ACM 23, 4 (Oct. 1976), 619-635.
FELLER, W. An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, New York, 1970, 509 pp.
FORD, L. R., AND FULKERSON, D. R. Flows in Networks. Princeton Univ. Press, Princeton, N.J., 1962.
GANE, C., AND SARSON, T. Structured Systems Analysis. Prentice-Hall, Englewood Cliffs, N.J., 1979.
GARFINKEL, R. S., AND NEMHAUSER, G. L. Integer Programming. Wiley, New York, 1972.
GEOFFRION, A. M. "Solving bi-criterion mathematical programs." Oper. Res. 15, 1 (Jan. 1967), 39-54.
GUTTMAN, A., AND STONEBRAKER, M. "Using a relational database management system for computer aided design data." Bull. IEEE Comput. Soc. Tech. Comm. Database Eng. 5, 2 (June 1982), 21-28.
HAMMER, M., AND NIAMIR, B. "A heuristic approach to attribute partitioning." In Proc. ACM SIGMOD Conf. (Boston, Mass., May 30-June 1, 1979). ACM, New York, 1979, pp. 93-101.
HELD, G. H., STONEBRAKER, M. R., AND WONG, E. "INGRES: A relational data base system." In Proc. 1975 Nat. Computer Conf., vol. 44. AFIPS Press, Arlington, Va., 1975, pp. 409-416.
HOFFER, J. A. "A clustering approach to the generation of subfiles for the design of a computer data base." Ph.D. dissertation, Dep. of Operations Research, Cornell Univ., Ithaca, N.Y., Jan. 1975.
HOFFER, J. A., AND SEVERANCE, D. G. "The use of cluster analysis in physical data base design." In Proc. Int. Conf. Very Large Data Bases (Framingham, Mass., Sept. 22-24, 1975). ACM, New York, 1975, pp. 69-86.
IBM CORPORATION. Introduction to Direct Access Storage Devices and Organization Methods, Student Text (No. GC20-1649-8). IBM Corp., White Plains, N.Y., 1974.
JEFFERSON, D. K. "The development and application of data base design tools and methodology." In Proc. Sixth Int. Conf. Very Large Data Bases (Montreal, Canada, Oct. 1-3, 1980). ACM, New York, 1980, pp. 153-154.
KENNEDY, S. R. "Mathematical models of computer file organizations." Ph.D. dissertation, Dep. of Operations Research, Cornell Univ., Ithaca, N.Y., June 1973.
KENT, W. Data and Reality. North-Holland Publ., Amsterdam, 1978.
KNUTH, D. The Art of Computer Programming: Vol. 3, Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.
LUM, V. "1978 New Orleans Database Design Workshop Report." IBM Res. Rep. RJ2554 (33154), July 1979, 117 pp.
MARCH, S. T. "Models of storage structures and the design of database records based upon a user characterization." Ph.D. dissertation, Dep. of Operations Research, Cornell Univ., Ithaca, N.Y., May 1978.
MARCH, S. T., AND SEVERANCE, D. G. "The determination of efficient record segmentations and blocking factors for shared data files." ACM Trans. Database Syst. 2, 3 (Sept. 1977), 279-296.
MARCH, S. T., AND SEVERANCE, D. G. "A mathematical modeling approach to the automatic selection of database designs." In Proc. ACM SIGMOD Conf. (Austin, Tex., May 31-June 2, 1978). ACM, New York, 1978, pp. 52-65.
MARCH, S. T., SEVERANCE, D. G., AND WILENS, M. "Frame memory: A storage architecture to support rapid design and implementation of efficient databases." ACM Trans. Database Syst. 6, 3 (Sept. 1981), 441-463.
MARTIN, J. Computer Data-Base Organization, 2nd ed. Prentice-Hall, Englewood Cliffs, N.J., 1977.
MAXWELL, W. L., AND SEVERANCE, D. G. "Comparison of alternatives for the representation of data item values in an information system." In Proc. Wharton Conf. Research on Computers in Organizations (Univ. of Pennsylvania, Philadelphia, Pa., Oct. 1973). University of Pennsylvania, Philadelphia, Pa., 1973, pp. 121-136.
McCORMICK, W. T., JR., SCHWEITZER, P. J., AND WHITE, T. W. "Problem decomposition and data reorganization by a clustering technique." Oper. Res. 20, 5 (Sept. 1972), 993-1009.
McGEE, W. C. "The information management system, IMS/VS, Part II: Data base facilities." IBM Syst. J. 16, 2 (1977), 96-122.
SCHKOLNICK, M. "A clustering algorithm for hierarchical structures." ACM Trans. Database Syst. 2, 1 (March 1977), 27-44.
SCHKOLNICK, M., AND YAO, S. B. "Physical database design." 1978 New Orleans Data Base Design Research Report, IBM Res. Rep. RJ2554 (33154), July 1979.
SEVERANCE, D. G., AND CARLIS, J. V. "A practical approach to selecting record access paths." ACM Comput. Surv. (Dec. 1977), 259-272.
SEVERANCE, D. G. "A practitioner's guide to data compression." Inform. Syst. 8, 1 (1983, to appear).
TEOREY, T. J., AND FRY, J. P. Design of Database Structures. Prentice-Hall, Englewood Cliffs, N.J., 1982.
TSICHRITZIS, D. C., AND LOCHOVSKY, F. H. Data Models. Prentice-Hall, Englewood Cliffs, N.J., 1982.
WIEDERHOLD, G. Database Design. McGraw-Hill, New York, 1977, 658 pp.
YAO, S. B. "Approximating block accesses in database organizations." Commun. ACM 20, 4 (Apr. 1977), 260-261. (a)
YAO, S. B. "An attribute based model for database access cost analysis." ACM Trans. Database Syst. 2, 1 (March 1977), 45-67. (b)
Received July 1981; final revision accepted September 1982.