IDBD AN INTERACTIVE DESIGN TOOL FOR CODASYL-DBTG-TYPEDATA BASES Janis Roland Dahl' A. Bubenko jr* The Systems Development Laboratory 1. The Interactive Data Base Designer (IDBD) assumes as input a conceptual description of data to be stored in the data base (in terms of a binary data model) and an expected workload in terms of navigations in the conceptual model. Extensive checking of input is performed. The designer has the possibility to restrict the solution space of the design algorithm by prescribing implementation strategies for parts of the binary model. 2. University Stockholm, of Stockholm, Sweden Introdmction Before we can design a data base schema, compatible with some existing Data Base Management System, we have to determine what kind of data it should contain and what kind of updates, work-load, in terms of queries, and deletes it must be able to haninserts dle. examination of In order to permit alternative solutions the requirements must be stated in as implementation independent possible. By 'implementation terms as independent' we mean that there have been made no decisions on how to group data items in records, which access techniques to use and how to navigate in a structure of records and sets. ABSTRACT 1. Chalmers University of Technology, 41296 Gothenburg, Sweden (SYSLAB)3 Designing a data base is thus only tively small) part of a systems process. It is preceded by a activities the purpose of which is corporate information needs and the requirements of an information be developed. S- a (reladevelopment number of to analyze to specify system to The Interactive Data Base Designer (IDBD) presented in this paper has been developed to be compatible with two kinds of system design (philosophies) approaches. S-10691 8. This work has been supported by the National Swedish Board for Technical Development the analytical approach, The first kind, proceeds through development phases like - goal and problem analysis - activity analysis etc. set of and arrives at a comprehensive This set also specifications. requirements includes a conceptual information model of relevant parts of the enterprise and a set of information requirements [Bub-801. The conceptual information rode1 (CIM) describes and defines relevant phenomena (entities, relaassumptions, inference rules events, tions, of Discourse (UoD). etc.) of the Universe The CIM models the UoD in an extended time Proceedings of the Eighth International on Very Large Data Bases Conference 108 Mexico City, September, 1982 perspective in order and constraints. to capture dynamic rules + the tool existing The next step in this approach is to 'restrict' the CIM (from a time perspective point of view) and to decide what information to in the data base and how to conceptustore in ally navigate in this set of information to satisfy stated information requireorder ments (see Gus-82 for a comprehensive exposition of this problem). If the information to be stored in the data base is defined in terms of a binary data model and the conceptual navigations are specified assuming such a model then this is the required input to IDBD. a of is in be data of the follow- - description of the conceptual data model - its data item types (representing entity types) and relations the the - a description model defined of the workload of in terms of run-units the - a description be considered of certain parameters by the design algorithm to - a set of design directives restricting algorithm space. - comprehensive checking of the consistency of input data is performed (we have empirically found that it is difficult to supply a tool of this kind with correct input the first time; a waste of and computer resources is the time result of optimizing incorrect input) to the design solution its A consistency check is performed on the model Also an and the work-load descriptions. analysis is performed to estimate the number of references to the data items and the relations when navigating in the model. The input, interaction will be illustrated case, the enterprise assumed. - the designer has a possibility to prescribe certain implementation alternatives for parts of the model (or the whole model). This has the following advantages and processing of IDBD by a small practical GROSS. The following iS GROSSis a local supplier, i.e. it supplies parts to customers located in the same city. Parts are distributed by cars and one cargo is called a delivery. One delivery can conEvery day 1 to 20. tain several orders, there are several deliveries, 1 to 20. Customers send their orders to GROSS. An order includes 1 to 25 parttypes. When an order arrives, its day for delivery is determined. ' the designer can test his/hers own solution alternatives which may be 'natural' or which have other, preferred, non-quantifyable properties Proceedings of the Eighth International on Very Large Data Bases Input The input to the IDBD consists ing types of data: The DBTG-schema design algorithm of IDBD has the binary data model and a set of implementation strategies in common with design-aids developed at the University of Michigan [Mit-75, Ber-77B, Pur-791. It differs, however, from them in several important respects solutions an This paper describes and explains IDBD in Secterms of running a small sample case. tion 2 describes the input to IDBD - the conceptual binary data model and how to describe in the the work-load in terms of navigating model. User interaction, checking of input and how to supply IDBD with design directives is presented in section 3. The design algorithm and the results of performing a design run are given in section 4. 2. unnormal or *non-sensebe avoided augment - the description of the work-load practically realistic as navigation the conceptual binary model can defined. It is, of course, also advantageous to use the experimental approach as a complement to the purely analytical one in order to avoid guesswork concerning the requirements and the work-load. which gives to monitor to + the solution space can be drastically reduced thereby making IDBD realistic tool also for design large, complex data base schemata The other approach to data base design is the experimental one. In this case we assume that a 'fast prototype' is developed by the use of the CS4 system [Ber-77A]. CS4 employs a binary data model and is thus compatible with IDBD. Experimental use of the prototype system can provide us with statistics of navigation types and frequencies. - the tool is interactive designer a possibility design process can be used DB-schema can Conference 109 Mexico City, September, 1982 The following information requirements terms of queries - are defined in the ing design stages. 1. For a particular day, all the customers which are to he supplied, and for each customer: name and address. 2. For a particular parttype, which it is included delivery for each order. .7 . For a particular parttype and for t icular customer, the total amount for the part in order. 4. assumed are model entity The types represented in this case by a set of data In fact, as “partinfo” in the item types. item type can also data structure, a data represent a group of data item types. - in preced- For all orders, their amounts. Assume also that the have been defined: print Data item types are ITEM (in capital input and thereafter, per each data item: the orders in and the day of all parttypes following 1. Deletion 2. Insertion of a delivery. 3. Insertion of a new customer. 4. Updating a pardollar with of the conceptual We assume that the tional data structure basis of an analysis conceptual information tion requirements. data item size of acters the data item in number the data item cardinality of a security code In our case as follows: attributes. constitutes the basis for the conceptual (binary) data model and its work-load. Description the of char- (optional) If different data items cannot be placed in for instance because of the same record, data, constraints or distributed security types can be specified then the data item If the with different security code numbers. security code is not specified, it is put to zero. This 2.1 name of transactions of a delivery. of part described with the word letters) on one ‘line of on separate lines, one ITEM delivno delivday orderno ord-part partno partinfo amount custno custname custadr data model following binary relahas been created on the of the enterprise, the model and the informa- 5 6 6 0 8 92 8 5 30 30 the data item types are specified 150 30 300 1500 2000 2000 1500 1000 1000 1000 Rinary relations in the data model are described with the word RELATION on one input line followed by one line per relation containing: dday-dno cua t-ord - name of the - name of relation the data origins - name of the data tion is directed ord-o-p - the number - minimum item2 part-o-p Proceedings of the Eighth International on Very Large Data Bases relation Conference 110 item (iteml) from item to which (item2) of instances of the which the the rela- relation number of item1 related to one - maximum number item2 of item1 related to one - minimum item1 of item2 related to one number Mexico City, September, 1982 - maximum number of item2 related item1 The relations follows: RELATION dday-dno dno-ord cust-ord c-cname c-cadr ord-o-p o-p-amnt part-o-p p-pinfo in our case are to specified one dclivday i.--,-- as idday-dno ..!I.---, delivday delivno custno custno custno orderno ord-part partno partno delivno orderno orderno custname custadr ord-part amount ord-part partinfo 150111 300 1 1 300 1 1 1000 1 1 1000 1 1 1500 1 1 1500 1 1 1500 1 1 2000 1 1 1 0 1 1 1 1 0 1 delivno ' *._-_ _.-_-.' 20 20 50 1 1 25 1 200 1 jdno-ord ,..A-, orderno +- cust-ord For the specified input data for a relation and the cardinality of the participated data items, IDBD determines the average number of iteml(item2) related to item2(iteml), the type of the relation (l:l, l:M, M:l, M:N) and what type of mapping, total (T) and partial (P), that the domain and range of the relation participates in. This will be further discussed in section 3.1. 2.2 Description I-- (custno 5 of the Pork-load This access path could language as follows: The work-load is generated by the information requirements (queries) and the transactions. They imply a need to navigate in the binary relational structure. In order defined, to show how navigations two examples are given. can ; be described in natural F’O~ a particular delivday get all delivno related to it via the relation dday-dno for all these delivno get all orderno which are related to all these delivno via the relation dno-ord for all these orderno get the custno related to these orderno via the relation cust-ord for the custno, get custname related to it via the relation c-cname, custadr related to it via the relation c-cadr be Example 1 For query 1 in our example the following access path is defined: The navigation starts always at where the entry-point, i.e. item, and the operation for it In example I the starting item the level 1, the starting Is described. is dellvday. After delivday has been accessed;alI delivno This related to delivday are to be accessed. We call is done at the next higher level. the item delivday qualifier of delivno,. because delivno related to delivday is requested. In the access path, each time when the latest accessed Item Is used as a qualifier for next item wanted, a next higher level is specified. When the same item is used as qualifier for several required items, that can be specified at the same level. Custname and custadr have the same qualifier custno and therefore they are specified at the same level. Proceedings of the Eighth International on Very Large Data Bases Conference 111 Mexico City, September, 1982 Example 2 In query 3 we want to get the dollar amount for the parts in order for a particular parttype and for a particular customer. RUN-UNIT RU custinf 30 1 FIND UNIQ delivday RU amount 30 1 FIND UNIQ partno 2 2 GET ALL delivno GET ALL ord-part 3 3 GET ALL orderno GET orderno 4 4 GET custno 5 GET custname GET custadr GET custno 2 3 SELECT ord-part 3 GET amount IDBD uses description ord-o-p Run-units are UNIT followed Every run-unit taining : 4 - the word part-o-p - name of CL3 partno for word RDNthemselves. line con- RU the run-unit of the run-unit After the head line the run-unit is described in a hierarchical way with level numbers much like a COBOL data declaration. On each line there is either a level description or a data operation description. The level description lines are numbered The first line after the 1 and upwards. line is always a level description line level number 1. We access a unique partno, thereafter all ord-part related to it. At this stage we do not know for which of the ord-part we want to get the dollar amount, because we do not know to which customer ord-part is related. Therefore, we continue by accessing orderno for each of ord-part and then access the customer related to each order. Now we know which orders are related to the required customer. Now we need to know the ord-part related to those orders. We have already accessed ord-part and therefore we do not need to access them again. By a SELECT operation we can select the ord-part related to the required customer without additional accesses. Each level - the description level line from head with contains: number - optionally a cardinality Normally the cardinality for a level is one. In that case there is no need to specify the cardinality number. Otherwise, a real number both < 1 (a probability) and > 1 (a freThis cardinality quency) can be specified. multiplies with the cardinality on the next lower level to give the cardinality on the actual level. If a cardinality is specified cardinalon level 1, it multiplies with the ity of the run-unit. Going backwards in the access path can be done in two different ways. If the item to be accessed has the same qualifier as the item at the next higher level, then that higher level is specified. If one wants to skip one or more higher levels without making any access to the database, it can be done by using the special operation SELECT. On each level there can be specified zero, of data operation occurrences one or more Every data operation descripdescriptions. tion is defined on one line and contains: - data operation verb example shows the way query 1 be expressed in terms of a run- unit. - name of - optionally Proceedings of the Eighth International on Very Large Data Bases rules described with the by the run-units starts with a head - cardinality We can start the navigation either with access to partno or to custno. Here we choose to start with partno. The following and 3 can the following syntactical of the run-units. Conference 112 the required a relation data item name Mexico City, September, 1982 Usually there is no need to specify a relation name. Only if there are more than one relation type between two data item types, it is necessary to specify the relation name. Otherwise the IDBD finds the relation type itself. The data item at the previous level At level 1, where is called the qualifier. there is no lower level, the access is directly to the data item type and not via a relation as at levels > 1. The different data operation FIND verbs are: INSERT is analogous to DELETE. By MODIFY one or all data the qualifier are modified. items related to By LINK one or all data items are linked to This means that the relation the qualifier. instance(s) is(are) stored in the data base. By UNLINK one or all data items are unlinked. At level 1 only the following verbs can be specified: TI 11 data operation WIQ FIND GET A final example shows run-unit describing the tion nr 2 for inserting the inserted delivno delivday is not already base with a probability SELECT I 1[ 1 LINK ALL UNLINK where { ) means one of the elements inside ( 1 and [ lmeans that the element inside [ I is optional. FIND defines an entry point for the run-unit. FIND UNIQ implies some sort of direct access to the data item, while FIND SEQ implies a sequential browse through all data items of the mentioned data item type. data items which are By GET one or all related to the qualifier are accessed- the definition of a work-load of transacFor a new delivery. it is assumed that its stored in the data of 20%. RU ins-supp 30 1 INSERT UNIQ delivno 2 0.2 INSERT delivday 2 0.8 LINK delivday 2 INSERT ALL orderno .l LINK custno INSERT ALL ord-part 4 INSERT amount LINK partno 3. 3.1 User interaction Checking and anal~ais of input data By SELECT no access is made, because already This noraccessed data items are selected. mally also means a branch from a higher to some lower numbered level. After the input phase and before the optimithere is an interactive phase zation phase, certain consistency checks are where also made. By DELETE one or all data items related to DELETE causes the qualifier are deleted. also deletion (i.e. UNLINK) of the relation which has the deleted data item as instance, its destination. First of all the IDBD asks for values These are tain constants. Proceedings of the Eighth International on Very Large Data Bases Conference of cer- Mexico City, September, 1982 - maximum record acters length in number - counter length in number of characters - pointer length in number of - maximum secondary storage characters that is allowed in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 of char- characters number of - CALC storage factor, a factor greater or how much more equal to one, which tells secondary storage than nominal is required for hashed storage - CALC access factor, a factor greater or how many more equal to one, which tells accesses than one that is required due to synonyms in hashed storage. From the values of the first three of these constants and from the earlier description of data items, relations and run-units, the program decides which consistency constraints must hold. IDBD finds out for each relation which implementation alternatives (IAs) are possible , which are -wed and which IAs are impossible (OFF). The lists of IAs for these three cases can be displayed for the DBA. For the possible and unwanted cases the DBA can interactively change IAs. There is also a fourth case that the DBA can use, namely to specify that one special IA is to be used (ON). = = = = = = = = = = = = = = = = = = = = = Fixed Fixed Variable Variable Fixed Fixed Variable Variable Chain, Chain, Chain, Chain, Chain, Chain, Chain, Chain, Pointer Pointer Pointer Pointer Dummy duplication duplication reversed duplication duplication reversed aggregation aggregation reversed aggregation aggregation reversed next pointer next pointer reversed next + owner pointer next + owner pointer next + prior pointer next + prior pointer next + owner + prior next + owner + prior array array reversed array + owner pointer array + owner pointer record For illustration, the implementation tives 15 and 17 are shown below. reversed reversed pointer pointer rev. reversed alterna- :---6-------m The reason for changing the lists of IAs can be that an IA is already fixed or that the IA gives an unnormal solution. (An example of an unnormal solution would be the case, where “article-number” is suggested to be aggregated under “number-in-stock” instead of the other way around.) If the DBA reduces the solution space for the different relations, the execution time during the optimization phase is also substantially reduced. This is necessary for a large data base. After the user has finished the change of the (see section 3.2) IDBD takes the best IAs, relation, e.g. it every remaining IAs for the set of IAs, which are either ON, takes in that order. The possible or unwanted, alternatives are multinumber of possible plied and the total number of combinations of displayed for the user. DBTG-structures are The user has now the possibility to further The reason for reduce the solution space. this is that the computing time can be very long, if there are many DBTG-structure alterIt is also possible to natives to consider. limit the processing time to some value. the optimizing phase is that time After interrupted and the next phase in the program continues. There are 21 different implementation alternatives, numbered 1-21 (see also [Ber-77Bl). The IAs are displayed with their number. Here is a short description of each IA: Proceedings of the Eighth International on Very Large Data Bases Conference 114 Mexico City, September, 1982 For the different valid DBTG-structures IDBD calculates an estimate of the number of accesses to the data base for every run-unit. The need for secondary storage is also calculated. The best solutions are presented to the user. “The best solutions” are those with the least secondary number of accesses for a certain storage size. The solutions are presented in the order of increasing number of accesses. storage At the secondary the same time requirements are decreasing in order to (The belong to the set of best solutions. first 10 solutions are always added to the set .) For different some messages DBTG-structures such as - an entry point - there no access is for IDBD also an item path - there is no SYSTEM access) to an item Explanation: - type M:l, gives user interaction section. - min/av/max21 the reverse to an item is of data Data items name size delivno 5 delivday 6 orderno 6 ord-part 0 examplified partno partinfo amount C”6 tno custname custadr a 9.2 a 5 30 30 in 150 30 300 1500 2000 2000 1500 1000 1000 1000 dday-dno dno-ord cust-ord c-cname c-cadr ord-o-p o-p-amnt part-o-p p-pinfo be illusIDBD has the input particular seq.ref.no secur. Explanation 120 30 1630 20 l:M, (P) in the but concerns reference frequencies reference item2 delivday delivno custno custno custno orderno ord-part partno partno to relations: count ref.no fwd number % 150 1.2 delivno 540 4.2 orderno orderno 320 2.5 custname 320 2.5 custadr 2700 21.1 ord-part 2790 21.8 amount 98 0.8 ord-part 3000 23.5 partinfo ref.no bwd number X 141 1.1 75 0.6 443 3.5 98 0.8 2100 16.4 : shows the number of required from item1 to item2 in the (forwards). Every update (DELETE, INSERT and MODIFY) is as 2 references. - rel.no bwd shows the number of required references from item2 to item1 in the relation (backwards). These numbers give the designer some indication of which relations are critical for the efficiency of the data base system. Those relations can then be analyzed in greater detail. means the number of references needed which are accesses), to the data item type sequenof - seq .ref .no means the number which are needed to the tial accesses, data item type. These numbers give the data base tor or designer some indication data items there will be a need access and/or sequential access. Proceedings of the Eighth International on Very Large Data Bases is analogous, relation - rel.no fwd references relation operation calculated 300 Explanation: - DA-ref.no (logical directly of Relations with rel .name item1 items: card. DA-ref.no (l:l, (sequential The analysis of input data will trated by the following examples. given the following output from data checking procedures for this case : Analysis of mapping - min/av/maxC! shows the minimum, average and maximum number of item1 related to i tern2 Analysis A typical the next type - T/P stands for total (T) and partial mappings between the data items domain and range of the relation is missing entry is the M:N) Conference The analysis each of run-units shows for run-unit the following type of output (examplified for run-units “custinf”, “amount” and “ins-supp”) . administraof to which for direct 115 Mexico City, September, 1982 Look at/change Run-units ru-name cardinality level-no cardinality operation item1 Instructions? item2 relation , I FIND UNIQ delivday 2 GET ALL delivno delivday dday-dno 3 GET ALL orderno delivno dno-ord 4 GET custno orderno cust-ord rev. 5 GET custname custno c-cname GET custadr custno c-cadr . 1 LINE INSERT ALL 4 INSERT LINK delivno dday-dno delivno dday-dno delivno dno-ord custno ord-part orderno orderno amount partno ord-part ord-part The different ON POSSIBLE = UNWANTEDOFF - dday-dno ON= POSSIBLE= 1 UNWNTED= UNWANTED= rev. o-p-amnt part-o-p or alt. 2 3 relations specified 5 7 11 15 19 9 13 17 OFF= 4 6 change?y change=15 o more changes?n dday-dno ON= 15 POSSIBLE= 1 2 rev. cust-ord ord-o-p consistency constraints are: the relation shall have this impl. possible impl. alternatives not desirable impl. alternatives these impl. alt. are not permitted Do you want all displayed? (a/s)a 8 10 12 14 16 18 20 21 3 5 71119 9 13 17 OFF= 4 6 8 10 12 14 16 18 20 21 dno-ord ON= POSSIBLEJ- 1 2 3 5 7 11 15 19 UNWANTED= 9 13 17 OFF= 4 6 8 10 12 14'16 18 20 21 rev. rev. p-pinfo ON- The "Run-units" listing is merely just a structured reprint of the input data with the relations expanded and a note "rev." if the relation is used in backward direction, i.e. from item2 to iteml. Design (y/n)y You can change among the ON, POSSIBLE and UNWANTEDimplementation alternatives by writing the implementation alternative number and the letter 0, P or IJ respectively on one line Explanation: 3.2 (y/n)Y You will see all or specified relations with their relation name and what consistency constraints that hold for the different implementation alternatives. amount 30.00 1 FIND UNIQ partno 2 GET ALL ord-part partno part-o-p 3 GET orderno ord-part ord-o-p rev. 4 GET custno orderno cust-ord rev. 2 3.00 SELECT ord-part 3 GET amount ord-part o-p-amnt 30.00 ins-supp 1 INSERT UNIQ delivno 2 0.20 INSERT delivday 2 0.80 LINK delivday 2 INSERT ALL orderno alternatives? reversed 30.00 custinf implementation POSSIBLE= 5 UNWANTED= 6 9 10 11 12 OFF1 2 3 4 7 8 13 14 15 16 17 18 19 20 21 change?n Note: For all l:M-relations except for dnoord the suggested IA is in this example put chain with both owner and prior to 15, e.g. For all l:l-relations the only pospointer. sible IA is left with number 5, e.g fixed aggregation. directfree Supplying design directives to the IDBD for choosing among implementation alternatives is best illustrated by showing part of a typical interactive user-IDBD session. The following is an example interaction with the IDBD: of a IDBD will listing: Conference with the following There are 8 different DBTG-structures to examine. Do you want to reduce the solution space further7n Do you want to limit the CPU-time when optimizing?n user Submit values to the following constants maximum record length=512 counter length=2 pointer length=4 max. secondary storage=100000000 CALC storage factorll.2 CALC access factorl.5 Proceedings of the Eighth International on Very Large Data Bases then continue Note: Normally IDBD will save the best 10 solutions (with least number of accesses to In this case there are only the data base). of which 6 of these are 8 possible solutions, accepted as correct DBTG-structures. 116 Mexico City, September, 1982 4. The Design-aid 4.1 The algorithm preferable(possible) and unwanted IAs. In this way the tool helps the designer to choose good data structures. If the designer lets IDBD to decide, IDBD has then a smaller number of IAs to consider. The solution space is thereby considerably reduced. 4.1.1 Interactivity The three different design tools developed at the University of Michigan [Mit-75, Ber-77B, Pur-79 ] all work in a similar way. The design aids are typically batch programs. From a specified input the tools produce, sfter an optimization phase, efficient data structures. Some problems with this type of tools are that - they sometimes tions produce - they restrict themselves solution space unnormal to a - maximum record length which are are is not violated - if the data item types in the relation have different security codes, duplication and aggregation are ruled out solu- - for a partial there related sible all B reduced - the designer has no possibility of ing his/her own data structures, may be more natural or which may other (non-quantifiable) desirable perties - they take, in spite of algorithms, optimization to run on a computer for data bases. Some of the validity constraints, checked before the interaction, testwhich have pro- relation, where there exists a mapping from A to B, such that exist B data items that are not to any A data item, it is imposto aggregate B under A such that data items are represented. Suppose there is a l:M-type of relation between A and B data items so that one A data item is related to zero, one or more B data The validity constraints, which are items. are analochecked in this case, are (there gous rules for l:lM:l- and M:N-relations) sophisticated too long time normal sized - it To overcome these problems the design aid has The data base designer to be interactive. has then the possibility to manipulate with the solution space so that these problems do not have to arise. is impossible to aggregate - a chain or pointer array in reversed direction - there A under B can not be used is no need for dummy records 4.1.2 Validity constraints Certain validity in order to constraints must be satisfied arrive at a valid DBTG data structure. - duplicating of A under B is done fixed length, not variable length Some of these rules cannot be checked until a with records and sets is DBTG structure The validity constraints, which are created. tested, are - if references in the run-units only in the direction of the relatraverse tion, e.g. from A to B, there is no need for owner pointers or for duplication of A under B record its lengths are within permitted a record type has no repeating a level > 1 lim- - if there is only traversing in the opposite direction, duplication or aggregation of B under A and chain or pointer are made array without owner pointer unwanted groups on a data item type is not aggregated than once more that a set has not the same record both as an owner and as a member. me - if there is traversing in both tnions, we need owner pointers, chain or pointer array without pointer are made unwanted. can, howof the validity constraints be checked before the interaction with ever, This is the data base designer takes place. which IAs are valid or done by determining Unlike the not valid for every relation. Michigan design tools, IDBD does not just distinguish between valid and not valid IAs into IAs valid categorizes also but Proceedings of the Eighth International on Very Large Data Bases Conference with direce-g* owner 4.1.3 Cost caIcul.ation The storage cost is calculated as the sum of the lengths of all records, pointers and wasted areas for hashed wasted areas are These record types. "CALC estimated by the aid of the parameter storage factor". 117 Mexico City, September, 1982 There are a few assumptions about the record type implementation and the DBTG data base management system. If a record type is only with direct then it is accessed access, assumed that the record type will be hashed If a record type is (CALC in Codasyl). record accessed sequentially, then the type is made a member in a Codasyl SYSTEM set (Singular set). If there is a need for direct access to more than one data item type in a record type, then it is created a seconThis index is assumed to be dary index. implemented by a pointer array. The access cost is calculated for unit. We differentiate between of access costs. These are the number each three Run-unit verb GET FIRST MODIFY FIRST INSERT FIRST TABLE 1. runtypes 4.2 of alternative total storage - CALC accesses, hashed records. plied with the factor” total number set which are accesses to The number is multiparameter “CALC access are the the The number of accesses calculated is an estithe number of physical accesses. mation of Records, which are stored in the same block or already stored in primary storage in buffers, are not considered. There To get an estimate of the number of accesses, IDBD takes care of the hierarchical structure in the run-units with different number of data items required on each level. Average IDBD also considers values are used. - what kind of data entry to the record type - the combination run-unit . there is of IAs and verbs are defined in the of pointer of the number As an example consider the verbs GET accesses calculated, FIRST, MODIFY FIRST and INSERT FIRST for an implementation alternative with a chain with prior pointers, for instance IA=lS. Table 1 shows the figures. Proceedings of the Eighth international on Very Large Data Bases Conference cer- contains informa- structures types access cost of accesses cost for and each run-unit. This solution, information is, for each displayed as follows (examplified by the solution alternative with the lowest access cost). accesses - the cases where the data item types stored in the same record type for The results Each solution tion about which accesses In the case of INSERT FIRST both the owner and the previous first member have to be read and written and the new first member shall be written, e.g. 5 accesses. record accesses, pointers. 1 2 5 Number of pointer tain verbs scan where a - sequential accesses, through the records for a certain record These accesses can be type is made. types, if the cheaper than the other records are clustered into blocks. - pointer through Number of pointer accesses 118 are 6 record types Record type no 0 delivday Record length: 30 records 1 (delivday) has 6 char. CALC access 6 char. Record type no 0 delivno Record length : 2 (delivno 150 records ) has 5 char. CALC access 5 char. Record type no 0 orderno Record length: 300 records 3 (orderno ) has 6 char. seq. access 6 char. Record type no 0 custno 1 custname 1 custadr Record length: 4 (custno ) has 1000 records 5 char. CALC access 30 char. 30 char. 65 char. Record type no 0 ord-part 1 amount Record length: 5 (ord-part) 0 char. 8 char. 8 char. Record type no 0 partno 1 partinfo Record length: 6 (partno ) has 2000 records 8 char. CALC access 92 char. 100 char. has 1500 records Mexico City, September, 1982 There are 5 set types Table 2 shows the storage and access costs for the 6 alternatives in this sample case. 150 instances Set type no 1 (dday-dno) has 30 records Owner = record type no 1 150 records Member = record type no 2 The set is implemented with DBTG-set with owner pointer and prior pointer 300 instances Set type no 2 (dno-ord ) has 150 records Owner = record type no 2 300 records Member = record type no 3 The set is implemented with DBTG-set with owner pointer Solution number Access cost Storage cost No. of record types No. of set types 1 2 3 4 14045 14165 14315 76512 100025 100025 391836 393036 420636 390336 391356 410436 6 6" 5 5 6 6 6 4" 4 TABLE 2. delivday 1g delivno cost: Ron-unit ord-day has the access seq. access 0 CALC access 1 pointer access3 cost: Run-unit amount has the access 0 = seq. access CALC access 1 pointer access2 cost: Run-unit orders has the access seq. access = 300 CALC access 0 pointer access- 3000 cost: 1 times Run-unit del-supp has the accesr seq. access = 0 CALC access = 1 pointer access77 cost: 30 times Run-unit ins-supp has the access cost: 0 seq. access CALC access = 1 pointer access= 106 Run-unit ins-cust has the access cost: se*. access 0 CALC access 1 pointer access3 30 times Run-unit upd-part has the access 0 seq. access CALC access 1 pointer access1 The total number of accesses- Proceedings of the Eighth International on Very Large Data Bases orderno T ord-part amount e partno partinfo the data base is 391836 char. Run-unit custinf has the access = 0 seq. access CALC access = 1 pointer access= 25 R - 1500 instances Set type no 5 (part-o-p) has 2000 records Owner = record type no 6 1500 records Member = record type no 5 The set is implemented with DBTG-set with owner pointer and prior pointer for custno custadr 1500 instances Set type no 4 (ord-o-p ) has 300 records Owner = record type no 3 1500 records Member = record type no 5 The set is implemented with DBTG-set with owner pointer and prior pointer cost 30 times R = DBTC-set implemented with owner pointer = record 30 times - 14045 1 with chain 100 times El cost: costs The schemata for solution alternatives 1 and 4 are graphically illustrated in figures 1 and 2. 300 instances has Set type no 3 (cust-ord) 1000 records Owner = record type no 4 300 records Member = record type no 3 The set is implemented with DBTG-set with owner pointer and prior pointer The storage Access and storage - set Figure 1. type type Solution = seq. access +p = direct alternative'no. access 1 LO times 1500 times I Conference 119 Mexico City, September, 1982 1 i ; 2. alternative delivno that to in the records access The too1 IDBD is running on VAX computers The program operating system. standard UNIX call: idbd c-i infile. 1-r rfilel i-s [-u sifilel (-a outfild sofilel with call [-p UNIX is a -s The name of the input data file, which contains the data items, the relations and the run-units. outfile The name lengthy filename nal itself. produces parmfile answering questions Instead of about the values for the different parameters, these can be stored in The name of that file is a file. given with this parameter. rfile If there is just one DBTG structure to analyze, Put the IAs for the relations 1, 2 etc. in a file and name that file with this parameter at the program call. of the file, where the listings are stored. The /dev/tty means the termiThe filename /dev/null no output file. Proceedings of the Eighth International on Very Large Data Bases Mscur3sioll IDBD has the flexibility to optimize the design from a predefined, desired scope. Tests show that it is a valuable property in order to make a tool useful in a variety of practical design situations. parmfilel It is possible to name the different files at Otherwise the program call with parameters. the names of the files will be asked for by The different the IDBD during execution. files are infile TO Discussions with practitioners in the field one is not always has disclosed that interested in 'optimizing' the total DBschema. For many reasons (reliability, security, modifyability, comprehensibility etc.) the designers may wish to implement parts of the data base in a specific way and leave the rest of it (if any) to the 'optimizer'. The difficulty of predicting future workloads may - in the strict sense also make optimization - more an intellectual exercise than a realAn optimal solution alternaistic approach. have a merit. It tive may, nevertheless, against which other, provides a 'base-line' solutions can be measured in more 'natural', terms of access and storage costs. no. 4 In solution alternative no. 4 is duplicated under orderno. This means get an orderno from a given delivno records with ordernotdelivno, those have to be scanned. That is why the cost in this case raises tremendously* 4.3 continue processing after an interrupt with -s aofile give the name of the saved fi'le with this parameter. 5. :partno partinfo L---d Solution sifile amount r---- Figure It is possible to interrupt the program after the input phase in order to examine the output from the input checking procedures. Give in that case this parameter. The data is saved in the file sofile. custadr h---‘-7 Y sofile Conference 120 [Ber-77A] Berild S., Nachmens S.: A Tool for Data Base Design by Infological Simulation, Proc. 3rd Int. Conf. on VLDB, Tokyo, 1977. [Ber-77B? A Methodology for Berelian E.: Base Design in a Paging Data The University of Environment, Engineering Systems Michigan, Laboratory, Ann Arbor, USA, Ph.D. Thesis, 1977. [Bub-801 Bubenko jr J.A.: Information Modeling in the Context of Systems Lavington Development, in S.H. (Ed.) "Information Processing SO”, North Holland, 1980, pp. 395-411. 1 Gus-821 Gustafsson Bubenko jr Approach to Modeling, Conference Information Karlsson T., M.R., J.A.: A Declarative Conceptual Information IFIP WG8.1 Working "Comparative Review of System Design Metho- Mexico City, September, 1982 dologies", [Mit-751 [Pur-79 North Holland, 1982. Mitoma M-F., Irani K.B.: Automatic Database Schema Design and Optimization, Proc. of 1st Int. Conf. on VLDB, 1975. ] Purkayastha S.: Design of DBMSprocessable Logical Database Structures, The University of Michigan, Ann Arbor, USA, Ph.D. Thesis, 1979. Proceedings of the Eighth International on Very Large Data Bases Conference 121 Mexico City, September, 1982