DEZISION RULES FOR THE AUTOMATED GENERATION OF SrORAGE STRATEGIES IN MANAGEMENT SYSTEMS DATA by GRANT N SMITH S.B., MIT (1974) SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE at the MASS&AHUSETTS INSTITUTE OF TECHNOLOGY June, 1975 Signature of Author .......... ..... .......... Alfred P. Sl1oan'- chool otjAnageggp, Certified by..-.. .. - . .. ... . /7 May 9, . 1 5 , ,........ Thesis Supervisor .o .0 *.* . * . .... * ..... by . . .s.. ..... ....... Chairman, Departmental Committee on Graduate Students Accepted RC-HIVES JUN 13 1975 16RARIUS Abstract PAGE 2 DECISION RULES FOR THE AUTOMATED GENERATION OF STORAGE STRATEGIES IN DATA MANAGEMENT SYSTEMS. by GRANT N SMITH Submitted to the Alfred P. Sloan School of Management on May 9, 1975 in partial fulfillment of the requirements for the degree of Master of Science. Current methods of determining storage strategies (both logical and physical) rely usually on (1) expert opinion, and (2) the experience of the designers. There has been some work in the area of automated design, but the approaches taken to date generally apply only at generation time, thus leaving the resulting design in effect for the rest of the life of the system. Should usage of the system change over time, as experience shows that it will, large inefficiencies may result owing to the original choice of storage strategy. The work presented here attempts to introduce dynamic decisions regarding storage strategies that will be invoked (1) on a regular basis, and (2) when system performance degrades below an unacceptable level. These decisions involve both the structure of the data base (such as which fields are to be in which files), as well as indexing, data encoding, factoring and virtualizing decisions. Decision rules are described which achieve this result. Also described is a procedure whereby any given request will be most efficiently satisfied, making use of the current structure of the data base, indexes, etc. Finally, the set of decision variables required to drive the above decision subsystems is specified in detail. Thesis Supervisor: Stuart E. Madnick Title: Assistant Professor of Management Science PAGE 3 Acknowledgements I wish to invaluable advice, comments only the writing in which I Professor thank for his criticism throughout not Stuart and madnick E. of this thesis, but through all the years be associated with have been so fortunate as to him. In everyone's life, there is that is Professor one eternal optimist. In mine, Donovan. He John has been a powerful motivating force throighout this work. Last, but by no means least, thanks are due Chip Ziering for many hours of fruitfull - often heated - discussion. rable of Contents Introduction...... . . a...**..* * .... PAGE 4 .* ... The Relational Model of Data............ ....... Page 6 ....... 15 Page Shared Data Bases and User Flexibility.. ...... Decision Variables and Truth Functions.. ..... The Structural Decision Subsystem....... ....... Page The Request Decision Sub system.. Page 37 *.Page 53 68 Page 108 ............... Scenario for application of Decision Rules... conclusion.. .. . ......... References. . . . . .. . .. .. .. .. .... ........ .. Bibliography............ Appendix 1.......... .. ............ 00 * ......... Page 126 ........ Page 145 ........ Page .......... Page 152 151 .Page 156 Appendix 2.......... .Page 167 Appendix 3...... .Page 177 ... . * .. eO..e. .. * S 55 555550*05 SL SS * .. * . . S S SS S S * * 5555*B*O 55 * 55 * 0 L "L 9 *L S S. S 55 0555 *050 0S05 OS S 505000 055050055 C~ L L 5550 * 535 550505555 55 6ZL1 5005505 Z. L * 055 5.5555555 Li *fI *05055505 * 8 ti Se. SeS E IE 5.55 55.5 Eti * 55 *O.S**5 SSSsSS *05S5S~**. 555 t OE ti~ 550 .3.55.5 91t .SSSO 55 *55. a * * &SS B &S *SS 0 0 5 055 0 0. . . . . . . 5555 . . 35 . . 5 00 , L eJ 0a go Is- dsajnbT~ S Z!)Yd CHAPTER I PAGE 6 Introduction. For programming applications primarily stake at was if applications programs the (as structure the structure that the idea being been the avoidance expounded. from the logical mapping- function. Then, system the to machine data structure), the physical (or if into some be known) of underlying data structure available data was What rewriting of whenever the and came to applications program oriented data has Involved was a mapping base was changed. data data-independent of concept the nos- years some would take care of this were any change in the there physical structure, the mipping function would be changed so that change structure logical the same still would programs, and, in programs, or a be person generating Note that to mean either a not necessary for our purposes two classes, since as far as to user (be fact, any logical data structure). the term user presented as existed any it in requests this before applications of the form that against throughout we shall use program or a person. It is to distinguish between these the database is concerned, all PAGE 7 ZHAPTER I requests look alike. a division of responsibility, Arising from this approach is The logical structure of the data is and thus of expertise. in the responsibility of of the domain the user, while mapping function previously also physical structure and tha in the domain of the aser, have now been removed. This is in fact a desirable featire as the user may concentrate efforts on technicalities of in the bogged down rather problems applications-oriented than becoming data establishing a base. However, that is way things turned out. There not guite the were several attempts to design systems which would perform the mapping function ini handle the physical structures for the user, given the was that the turned out the physical user had Furthermore, no to be (and structure quite thus differentiation generally delineated the way mapping capabilities tended rather simplistic in Zoncept and that But logical structure. and so execution, of the user to be with the result knowledgable the it mapping about function). responsibility (or user the was group) now took on the responsibility of both the logical- and physical PAGE 8 CHAPTER I structures. True, for any one logical structure the user now had a choice of physical structure coming from a wider range previously have allowed, but than personal experieice might was a blessing whether that horror remains or a disguised unclear. Other promises made - and not kept - about data independence mapping function. Theoretically, a again revolve around the structure should be given physical several logical mapped into able to be This facility structures, and vice-versa. has not been realized to any notable extent. Furthermore, the namely the isolation data structure, Rarely, if ever, was of data of users from changes has not once established. It one physical purpose primary yet realized the physical and no one in the physical its full potential. data structure was a Herculean task structure, independence, altered to implement any was about to go in and mapping would tamper once it was working. Any one physical-to-lagical generally be performel structure only once, and decisions as to what CHAPTER I it should be decisions were, and still are, made a fixed with the data base. These by people. Much of the future uses of to the time, point in were male at one perception as PAGE 9 knowledge on which these decisions were based was knowledge gained from experience, and so was more akin to an art than a science. non-trivial some However, of subset such decisions are indeed logical and rational, and so subject to some measure of automition. It would made to take of some of advantage of the structured nature On the :ontrary, these decisions. there have task, and to this efforts addressed has been no attempt to claim that be inaccurate been several these efforts can be divided (perhaps unfairly) into two major groups: simulation-oriented generation to are notably decisions used prior to system structuring decisions. These static, one-time decisions made at the aid in discretion of some person, and requiring substantial human intervention. The results of decisions made at that point were life of the the effort to be lata base. influential throughout the credit is due However, to formalize some much major aspects of the decision. dynamic rules used continually throughout the life CHAPTER the of to base data I reguire major Whether any gained is arose the dilemma of to draw intended the years to provile logical-to-physical (albeit working a system. over purport results was on these with tamper inefficiently) This work that they ongoing basis. the system on an iction was taken to in However, the upon. questionable. 3nce again there whether and monitoring effort these efforts was in important aspect of 10 human intervention their interpretation and acting attempted to track usage system monitor results of this performance. The would, again, PAGE in dealing data structure invaluable insights on the some and independence, to mapping, and as systems with such propose a methodology for achieving some of the promises made earlier. It is important to emphasize is that this methodology a pretend to be all-encompassing. The since no one work could approach here will be to: 1) describe a system independence 3asel on in which there is true a physical-to-logical data mapping capability, 2) enhance some of this system the better with the ability to formulated decision perform tasks, PAGE 11 CHAPTER I the including reconfiguration dynamic structures without structures. Attention initial of system monitoring structuring of of also be will decisions made the data physical the alteration and use the logical paid to at the definition time. 3) further enhance the system capabilities that are oriented with decision toward the efficient satisfying of requests against the data base. The work here revolves around the relational model of data. be a dismissal of all other This should not be construed to models (such as the network model) as inferior. The author's familiarity with the relational model and the existence of a well-defined set of theoretical rules that can be applied in the model were the motivating It should also factors behind this decision. be pointel out that the herein used has embellishments relational model as and alterations derived from various personal experiences and sources of the author. responsibilities for any errors and inconsistencies model employed here should not the well-known names in the necessarily be attributed to behind the relational model; well be the fault of the author. The they may ZHAPTER I PAGE 12 Structure of Thesis. Chapter II will introiuce the relational model as needed for our purposes, and point out the differences, where applicable, between this molel and that found in most of the literature. Chapter III will address itself to the methodology employed for achieving data independence. of decision variables maintained Chapter IV presents a list by the system. Since there is information about system usage and performance consistent set required to regarding physical restructuring, support dynamic decisions a of statistical a long list of rules has been developed for naming these decision variables. Chapter V responsible will address for initial itself to the specification, decision and physical data dynamic reconfiguration of the the Structural Decisign Substga (or SDS), and rules subsequent structures chapter VI CHAPTER I concern will decisions those optimally satisfying requests PAGE 13 dynamically about made against the data base - made the Reguest Decision Subsistem (or RDS). scenario, and those decision Chapter VII presents i typical VI will be applied to the rules developed in Chapters V and demonstrate the scenario to effectiveness of the decision rules. Chapter VIII possibilities further explored, as well as that should, perhaps can, and ways to as to with some remarks concludes the thesis expand be rules the decision utilizing a similar methololgy to that employed here. Again, pointel must be it and are intended clearly to that the V and VI are developed in Chapters certainly dependent out on the not demonstrate situation specific (and implementation of universally rules decision Chapter III) applicable. a methodology and They there is are no intention of developing a comprehensive and universal set of rules. ZHAPTER I Finally, some familiirity with BNF (Backus-Normal assumed throughtout. PAGE 14 Form) is Good introductory sources are (1,2). CHAPTER II PAGE 15 The Relational |odel af Data. Probably the major model relational stumbling is the block in introducing The terminology. the concepts underlying this approiCh are familiar to us all. consider a regular report, one time or another. convenient format spell out the for each or table that we have all seen at In Figure 2.1 for representing such data. category. example, in 'dept#', has Figure 'description', difficulty in rows and that the Note 2.1 we a table; a The columns provide a value categories of data; the rows well be interchanged without loss For is such columns might or alteration of meaning. see the columns etc. And there interpreting the are 7 labelled rows. No-one information in Figure 2.1, and this is essentially the relational model. By convention in columns, and the relational model, put the data in the rows we always label the (ie: horizontally) just as is the case in Figure 2.1 . Furthermore, the columns are called domains, names. ani the column headings are thus domain This arises from the mathematical concept of a domain CHAPTER II PAGE Plant: White Plains, New York. Period ending: Aug 31. Summari of -oerations (in 000's) Difference Dppt# Descrittion La b3 r Expense Actual Budet (Actual-Budget) - 639 6464 7103 2990 1 Spray -1152 12829 13981 5915 Coating 2 + 400 2590 998 2190 3 Filing -1336 5243 3907 1637 Sanding 6 + 525 11275 10750 7 5915 Buffing - 152 8998 8846 4788 Assemble 10 and Pack TOTAL 22243 45911 48265 Fi-aur_ 2. 1 - 2354 16 CHAPTER as being a II PAGE 17 collection of objects (or numbers, information-carrying item). When or any other we choose a value from that collection we are choosing an item from that domain. Notice that each row is created of the from each called an entry. six domains. then the departments for example, information is Now, of the entries might just We the top of the we shuffle table, and In fact, the rows; it the rows in random order in a telephone as may be (as it would but directory) no lost by a random ordering of the entries. if we were to interchange domains 1 and 2 of Figure 2.1 (ie: 'Dept#' But and 'description') changed the domain provided we well. table is in decreasing 'dept#' order. no information if inconvenient to have be, important. the 'total' entry at easily put we lose is not single item in the Each row Notice also that the order the table (rows) in by choosing a notice that the order one entry must be the table is there would be names (column of the domains the same as that in all to remain meaningful. no problem headings) as within any other entries if Thus, the order of the CHAPTER II PAGE 18 domains is important, while that of the rows is not. PrimaEXr-Ks. the table far entry in only one can be that there may observe 2.1 we In Figure and the value of 'Depti', any one same applies for 'description', while there is no reason for this to be the case in any of the other columns. In fact, in column of is information alone which department it it is both departments, given a value then there is there is no for 'Dept#', information relevent to that row. Eg: say that problem.) 'Dept#' table; ie: for any value of But, ambiguity about any given Dept# = candidate £Rigary a 'Dept#' is (If meant. values in the entry. can uniquely determina all other we that is no that from determine not we can 2.1, Figure Thus, the 'labor' that it is in value '5915' and told given the twice. '5915' appears domain the value the 'labor' there is t§! for 2, we Thus, the only one entry in the table. In the event that combination there of domains is must property; namely, given a set of domains, the entry no be then some exhibit this such domain, found that of values for that comination zontaining those values in those CHAPTER II domains is A PAGE 19 uniquely determinel. table need hive not a primary key, but it is often advantageous from the standpoint of efficiency to do so. (In model propsed by the relational that are two entries may contain no relation Codd, et. al. identical, and always exists some primiry key, even so there if it is a combination of all domains. This is not the case here, as can be seen in the definition of the 'Join' operator in Appendix 2.) Normalization. Looking again the table at Figure 2.1, is some idditional 'Plant', and the are two We see 'Period' covered. the unler such information, heading of as the also that there 'Expense'; viz. and 'Budget'. 'Actual' Since columns printed above we notice that the relational tables, world as the a set of we must find some way to include that information in the table itself. As it the informarion in heading. model views now stands, it is not really part of the table; rather it is a Considering the fact that form of table the 'Plant' and the CHAPTER II of the table, we may assume 'Period' are printed at the top and we further of some iiportance, is that it PAGE 20 assume that there are other plants and other periods. up a distinct table for each one course of action is to set plant/period combination, each having an identical format to result in that of Figure 2.1 . This would distinct tables, and identical, yet actioa suggests itself: set plant/period combinations, a up a large number of so a second single table and somehow course of for all distinguish entries and period. This can be as belonging to some specific plant done by simply adding two domains to Figure 2.1: 'Plant' and 'Period'. Furthermore, notion of 'Budget'. we must of incorporating way find some domains 'Expense' into the two so, we might merely To do 'Actual expense', and 'Buiget Expense'. 'Actual' rename the the and domains The table now is as appears in Figure 2.2 . Notice, however, based on the that the table notion 3f 'Expense'; contains two we have domains each just renamed the V. -~ Summary of Operations (In Plant Period Dept* DescErig2tion Labor W Plns 10/31 1 W Plns 10/31 10/31 10/31 10/31 2 10/31 10 W Plns W Plns W Plns V Plns 3 6 7 Spray Coating Filing Sanding Buffing Assemble 2990 5915 998 1637 5915 4788 000's) Budget Actual Expense Expense 7103 6464 13981 12829 2190 2590 5243 3907 10750 11275 8846 8998 Difference (Actual-Budget) - 639 -1152 + 400 -1136 + 525 - 152 and Pack W Plns 10/31 TOTAL 22243 45911 Figure 2. 2 -2354 CHAPTER II two domains appearing single as in Figure 2.2. in either column domain: the prefixing 'Actual' specify the role In 'Expense' case, the fact, domain. and 'Budget' to domain is table, must be qualified by a to provide The reason the domain name used more suzh values chosen from of each of these domains in if a failure this are, in general, it PAGE 22 was to in any role name. If in for the table. In than once role names a there one is that event, a then ambiguity results. Use of a role name is is used not limited to cases in which a domain more than on:e in the same table, and any domain name may be qualified by a role name. Figure 2.2 is a version of the table which has unique domain (or rather role) names, and is set in contains all information some of it in the form of normalized table. In of taking information the plant and perioi the table itself table headings. general, of Figure we take the as opposed to This is called a normalizing a table consists that applies to all integral part of the entries More specifically, up in such a way that it 2.1) entries (such as and making it an themselves (as in Figure 2.2). primary key of tables higher CHAPTER II up in the hierarchy, ini make it the lower table. An PAGE 23 part of the primary key of this clarify help to will example point. The example Figure 2.3(a) appearing in is one domains, view. 2.3(a). Notice in the Notice that some domains table) are not really 'employee' ames of other tables. but the Figure 2.3(c) that might be form of a set of tables formed to store this logical (such as 'children' logical exist in a corporate data base. view of the data that might Figure 2.3(b) shows the is the normalized set of tables arising from that Figure 2.3(c) was derived from Figure 2.3(b) by the following steps: for each domain fact a name in a table (say A) table name table A, and make (say B) it part that is in take the primary of the key of primary key of table B. remove the table name (B) from table A. This is the process of normalization, and, in the relational model, 111 tables must be normalized (ie: must not contain CHAPTER II PAGE 24 EMPLOYEE (E MP. NAME, AGE) CHILDREN (CHILD. NAME, AGE) JOBHIST (JOB.DATE,TITLE) SALARY (SAL. DATE,SALARY) 1) 2) 3) 4) EMPLOYEE (EMP. NAgAGEJOBHISTCHILDREN) JOBHIST (JOB.DAT,TITLE,SALARY) SALARY(SAL.DAfTSALARY) CHILDREN (CHILD. VAgAGE) laL 1) EMPLOYEE (EMP. NAMAGE) 2) 3) JOBHIST (EMP. NAME, OB.DATgTIT LE) S ALARY (E MP. NA ME, JB.ATE, _AL.DATE, SALAR Y) 4) CHILDREN(EM_.NAEg,ZHILDNAEi, AGE) Fiqure 2.3 (Primary keys underlined) CHAPTER II PAGE 25 domain names that are in fact table names). Why 'relational' model? What we have been calling tables are call 'relations' in the relational an arbitrary name. entry is table. model. This is more than Remember that we described formed by selecting a value from each domain in In mathematical terminology, subset of all combinations of values, of above how an the domains. The name used for these the entries are a or a cartesian product such a subset is a 'relation'. More formally: The cartesian product of A and B (written A X B) is a set of ordered pairs, each first element of the pair coming from A, and each second element from B. A X B = Ie: J(a,b):a(A, b<Bj ('4' means 'is a member of') We can easily obtain an ordered n-tuple (where n>2) by this method: D1 X D2 X...X Dn A relation will = normally a(dl,d2,...dn): di Di, be written as i=1,...nj a relation name PAGE 26 CHAPTER II list of domain names. followed by an ordered, parenthesized For example: le: R1(D1,D2,D3). The (3) for referred to reader is Employee(name,emp#,dept#). further discussion of relations and normalization. Second Normal Form ani Functional Dependence. The process of normalization (namely, the described above removal of all domain names that are in fact relation names) is adequate for most situations in which the user is careful asigned to the various relations to ensure that the domains are in fact assigned to aided by the the 'correct# relations. process of diagramming the data in Figure 2.3(a). However, there are times This is base as shown when what seems quite logical will, in fact, give rise to problems. Consider Figure 'Dept#' the or in value of 'description' is other words, 'description' is on 'Dept#'. well; Notice that for 2.1 , Clearly, namely, any given uniquely determined; functionally dependent in this case, the reverse 'Dept#' is value of functionally is true as dependent on CHAPTER II PAGE 27 'description', but this need not be the case. Now, if for any reason we were no longer interested in Dept# therefor struck 2 and lose fact the that the second row 2 Dept# from Figure 2.1, we ie: the exist anywhere else. One is 'coating'; relationship (2,coating) loes not way to avoid this is to establish a new relation containing only the functionally domains dependent (Dept#, description). We may now strike either of these domains from the relation in Figure 2.1 without loss of information. are said to be These relations in second normal form; Ie: Domains not functionally lependent on each candidate key are stored in a separate relation. Figure 2.1 would thus contain 'Dept#' as a domain, but relation now contains 'Dept#' not 'description', and another and 'description'. Third Normal Form and transitive dependence. Third normalized relations are second normalized relations in which there exist no transitive dependencies. If B is functionally dependent on A, and C is functionally CHAPTER II dependent on B, PAGE 28 then by the algebraic is also dependent on A. But in a transitivity laws, C somwhat different manner, since it is also dependent on B which is dependent on A. this case, This is we say that C is true in any cise transitively dependent where In on A. the application of algebraic transitivity yields an additional functional dependency, as it did in the above case. Relations in third normal form would not contain any domains that were dependent other domain on any which is itself functionally dependent on some domain in the relation. For the case above, where C is we would establish a separate transitively dependent on A, relation containing domains B and C, and remove C from the relation containing A. It thus appears preferable to retain all relations in third normal form for the reasons outlined above. The reader is referred to (4) for a more treatment of second- and third normal forms. Transferability of Role Names. comprehensive CHAPTER II PAGE 29 Consider the existence of two relations: wife.socsec) marriage(husband.soc_sec, does Since 'marriage' 'marriage'. twice, name is role a and 'husband.socsec' the name list request to with socsec list Those supplied and age of the wife of phrased might be are: consider Now and wife.name that 'wife.socsecI. arbitrary retrieval language) domain contains mandatory. 617-03-2911. This and so domain 'socsect contains the 'person' Notice that and age), person (soc-sec, name, a a person (in some as follows: husband.socsec for wife.age '617-03-2911' ; Notice that referenced (viz: contains 'person' relation the 'name' and 'age') but the not the domains role name qualifier 'wife'. Intuitively, however, it is clear that the request is t3 satisfy the information needed present, but not in any way that the system can utilize. The way that the system mikes use of implicit information of the type name qualifier entry is by transferring in the example above 'wife' designated 'husband.socsec to the 'person' by 617-03-2911' relation only for that relationship the and the role the between corresponding CHAPTER 'wife.soc-sec'. 'wife' loas name qualifier in the 'person' PAGE 30 II not become a permanent role relation. Set Theoretic Notatil, Definitions and Examples. In chapter I was mentioned the fact collection of theoretical rules exists operate on relations (regardless of particular rules. This normalizel form). is perhaps most from those presented This where the that a well-defined which may be used to whether they are in any section outlines model used elsewhere(3,4). here differs Differences will be pointed out at appropriate points in the discussion. The following operations will be defined: these PAGE 31 CHAPTER II syabl 2peralion adi** Diadic/fj Union U Diadic Intersection N Diadic Difference - Diadic Cartesian Produzt X Diadic Projection P Monadic Join * Diadic Diadic Composition Permutation M Monadic Compaction C Monadic Restriction R Both Division / Diadic These operations are briefly described here, and are formally defined and examples given in Appendix 2. ** Diadic operators operate on two relations (they may both be the same relation); monadic operators operate on a single relation. CHAPTER II PAGE 32 Notation. R<i> is the name of the i th relation * means 'is a member of' J....j implies a list, or set of the items between the 'Il's. c(i) is the cardinality n(i) is the degree (number of domains) (number of entries) in d(i,j) is the j th domain of R<i>, v(m)(ij) is t(i) is ie: 0 is in R<i> j=1,..n(i) the m th value of d(ij), m=1,..c(i) an n(i)-tuple in R<i> (v (a) (i,1) t(i) a L(jaI) R<i> is , v(a) (ij2),..v(a) (i . n(i)) 1,...c(i the length of list a the null set - ie: R<i>=0 implies c(i)=O asp means a is a subset of b (a=b is legal) acb means a is a prgaer subset of b (a b) Va means for all values of a This notation will be used throughout the remainder of this CHAPTER II PAGE 33 thesis. Explanation of gperatars. Union The union of two sets consists of a set that contains all entries that appear in eitter of the two sets. Intersection The intersection of two sets is a set that contains only entries that appear in both of the two sets. Difference The two difference of entries that appear other. Eg: If the in sets is one a of the set that contains sets, but not in two sets were A and B, then 'A all the - B' is a set of all entries that appear in A, but not in B. Cartesian Product This is as defined on Page 25 Projection The projection of a relation is a procedure whereby some of the domains in the relation are removed. Join A join of two relations is the process whereby two relations may be combined into a single relation containing domains of the two being joined. all the CHAPTER II PAGE 34 Composit ion This is the same as the join, except which the relations are joined is that the removed. domain on This means that a composition is in fact, a projection of a join. Permutation A permutation appliel to re-ordering the domains in a relation consists of merely the relation. compaction. The compaction operator entries from is used for deleting a relation. conjunction with It is all redundent used most commonly the projection operator, which in may result in redundent entries. Restriction The restriction operator is used for selective retrieval from a relation. Division Division is the inverse of the cartesian product. Introduction to XRN. This section is intenled to be a very brief introduction to the pertinent points about XRM. XRM is a particular implementation of an n-ary relation data management model designed Center, San Jose (5). and built by IBM Scientific It operates basically as follows. CHAPTER II PAGE 35 XRM can handle two types of information: , character string data, and (32 bit) numeric data . fullword There are correspondingly two relations; one that handles major subcategories that handles character strings, n-tuples of numeric data. type (character or numeri. n-tuple) of and another Any one relation can only handle data of that type. Each entity in the system (character string, or n-tuple) is automatically assigned an IRM ID base. Given that ID, efficiently retrieved obtains its entity, and ID by string the entity by XRM. And, applying a then performing the character strings, are hashed; when entered into the data some number in the can given rapidly and the entity, XRM to that be hashing function retrieval. In the of the case of first bytes case of of the numeric n-tuples, all primary key domains are hashed. All IDs in XRM are fullword integers. In numeric relations (n-tuples), any domain can be inverted. CHAPTER II This is Once equivalent to such domain, an inversion that domain. If contain More inversions in the given said about that domain. a value for ID's of n-tuples value in that in the the inverted a linear search would be the implementation of Chapter V. For our purposes, will information, is index for given no inversion existed, necessary. concepts exists, will rapidly find all XRN relation building an PAGE 36 this introduction will suffice. be explained the reader is as needed. referred to (5). Additional For further CHAPTER III PAGE 37 Shared Data Bases and User Flexibinlity. based on the relational This chapter presents a methodology between the logical- and intentions of data independence is to allow the achieving independence model for physical data structures. One of the (it) user to view the struCture existing in the data he in a way most convenient for a specific uses application. This provided with the facility to means that the user should be define any relation containing any domains in any order, and be able it as to use such. Notice, some however, of the issues raised by permitting this flexibility. The most problem glaring arises divergence from the concept of bases. The benefits been adequately proposing the covered result specific applications. the many and have Now elsewhere, every user we are a powerful easy definition of relations for What does this Each of shared (or centralized) data facility for allowing concept? a of shared data banks are tool that allows rapid and data base as user now do to the centralized wants (and is able to CHAPTER have) different III relations for his application(s), basically gaining effiziency and of generality. maintain his PAGE 38 Each user own data which is convenience at the expense must, needed to furthermore, collect support his application, and rather than delegate that function to a central authority. The traditional maintainence methal of centralizing data has administrator been the appointment whose responsibility central shared data base, and to that data base. Zhinge to time consuming, convenience was and so collection and of it is a data to maintain base the ensure that all users conform the data base is expensive and generally sacrificed in favor to be of a avoided. User centralized data base. There is no need for sacrifice on either the user's part, or the data base concept of a administrator's part. 1:n mapping of physical demonstrates its user a specific value. There is no This is where to logical structures reason for mapping from the single denying a physical data base into a specific logical relation for some application. presents no problem if the the logical relation that This the user wishes to define on the physical data base is some subset of CHAPTER III PAGE 39 the domains existing in that physical data base. But what if the logical relation requires a mapping onto does not yet exist in the physical a domain that data base, and is yet to be created? one relation domain. existing the base data the physical in to include relevent new the the principle one could invoke Alternatively of a now from the logical to the physical relations, 1:n mapping, the containing relation new a create and redefine to is possibility required information. We expressed the have thus logical to physical relations; n:m mapping for a need ie: map onto several physical relations, from relation can a logical and a physical relation can be mapped into several logical relations. Before a proposing mappings, let efficiency. In us a methodology iddress very very large for briefly data constantly uses the same, small subset relation pays a high price implementing the base, a issue user n:m of that of data in a logical in performing the mapping each CHAPTER III PAGE 40 time. Some exception should be made in such a case whereby a physical relation data, and existing relations. This different to still is established containing that alongside should not the user; appear to be the however be the logical the same subset of original made to physical appear any relation defined and contain mast fully updated of relations that we information. We turn now to a methodology. Methodoloay. There are basically three categories have expressed a desire for in the above discussion: physical relitions in the physical (centralized) data base, logical (user defined) relations, and special physical efficiency-oriented relations. The terminology to be used here to be consistent literature): is as follows (and intended with the current terminology found in the CHAPTER III . - relations real physically in user-defined - mapped are exist that relations the data base * virtual relations which those PAGE 41 by the (logical) relations the onto system real relations . derived relations are subsets base. data - relations that real (physical) 3f the real relations They exist primarily constituting the for efficiency reasons. We thus have a basic system as shown in Figure 3.1 . Notice that the elements of the system shown in Figure 3.1 interact in a Figure 3.1 reformatted to yield can be easily The same is true of all a hierarchy. precisely, they form specific way; more Figure 3.2. other figures in this chapter: they can be expressed in an hierarchical relationship. Notice also that this system has not eliminated either the data base administrator or the need for some person (perhaps again the data base administrator) to specify the initial real relations. The features of the system thus far are: . the ability system to maintained define real virtual relations relations, and on the have the CHAPTER III Figure 3.1 PAGE 42 CHAPTER III FiAure 3.2 PAGE 43 CHAPTER III system perform mapping PAGE 44 functions virtual relation may in fact More than relation). (note that a be identical to a real one virtual relation may be defined on any one real relation. . the the ability to decide data base derived (the decision being alzinistrator) relation for reasons made by to create a of efficiency in (real) a particular application . the ability for a virtual relation to contain domains from more than one real relation. As can be seen, the enhancements are concerned only with thesystem mapping functions. Thus, users may define virtual relations consisting of domains in any (combination of) real relations, but may n2t define additional domains. It is also important to point out the following: . primary keys in virtual relations exist in the eye of the user only; they do not necessarily correspond to primary keys in real relations. . the are set-theoretiz operators defined in all by that are required creation of virtual relations, are a function only of the user Chapter II for the as virtual relations existing relations. The data base adminstritor, however, who needs the capability CHAPTER III to in perhaps facilities; additional form of the a not available to the command, DEFINE... or .REATE... needs domains and/or relations real define PAGE 45 user. Assuming a in some real is relation somewhere in the data will adding Thus exist. real relations) involves the user interacting adminstrator. basically Zonsist (or domains real with the data base data base administrator in The actions of the situation would this base and there the data base this new a decision required as to where in domain as real domains will have to exist additional real domaims domains, these new real to define user wishes following of the steps: 1) Determine If sa, merely existing real the user simply tell is he for some iifferent name utilizing a domaia. whether user the from to define a synonym equating the two names. 2) If (1) is determine not the Case, in which apply some set real relation of rules to the domain(s) belong(s). 3) Add the domain to that real relation, thus making it available relation. for use (Note in that any this user-defined step may virtual require CHAPTER III restructuring some real PAGE 46 Alternatively, relation. a new real relation could be established consisting of the new domain, and the primary key relation that should contain that reguired decision each time must the new be made real domain. A join is domain by of the the is used. This data base administrator.) Step (1) above appears to require some human intervention on the part of a person such as the data base adminstrator, who is familiar relations. with the But major indeed be formalized, Provided there real global system relations portions and the of steps (2) existing real and (3) can and automated. are same guidelines (eg: they must third-normal form - see Chapter II), for the all be then step maintaining of maintained (2) in above can be performed by the system. In a similar way, by supplying expected use to some information as to the be made of this new domain, the system can to include this new real domain in determine precisely haw CHAPTER III the data base - decision ie: parform step analogous is directly PAGE 47 Notice also that this (3). that required to in the creation of a derived relation. 0 Perhaps it would appear that all automation of the major share of steps (2) applying step (3) the capability of accomplished by the and (3) above is administrative overhead. But consider the reduction of some data that is base adminstrator does not dynamically, have which the (except perhaps at predetermined, discrete time intervals). This means that the real can be relations so structured current system usage and requirements. of the n:m mapping capability of the as to reflect the Furthermore, because system, these changes in the real relations - be it mere addition of a real domain or relation, or a restructuring of existing real relations in any way to the virtual are not visible user. We may now modify Figure there is some relations of the 3.1 to show the system funation controlling the the real relations in Decision Subsystem data base; namely, the appears in Figure 3.3 . If the Structural Figure 3.3 were reformatted into the SDS, which must be available to the real relation hanilers, of the hierarchy. structure of The modified version of Figure 3.1 (gQ). an hierarchical diagram, fact that would become the innermost level CHAPTER III f ilure 3. 3 PAGE 48 CHAPTER III We have discussed methodology for around the far in thus PAGE 49 a a non-technical manner functions revolving automating some of the maintainence of the real relations of the data a structure for the real relations of the data base. Now, given base (as specified by the SDS), and given also the possible determined by the SDS) existence of derived relations (also it becomes clear that there may well be more than one way to users are derivation mapped onto All requests from agiinst the data base. satisfy a request against virtual relations (or of virtual relations), more real, one or some set-theoretic from above, which or derived are relations. Once again some decisions are required in the mapping function to determine: . whether the request (and legally, - ie: is valid from an access can logically control point of view) be satisfied given the virtual relations involved, . how the request can be satisfied, and . how best to satisfy the request. The subsystem that controls these decisions is Decision Subsystem (RD2). The RDS the Request is responsible also for CHAPTER III PAGE 50 determining how well it is doing in terms of efficiency. if the performance RDS (perhaps decides as a thit result trigger the SDS in as shown of changing system an attempt to restore acceptable level. We RDS, system in is usage) it 3.4 . corresponding hierarchical diagram is will performance to an thus modify Figure 3.3 Figure degrading Its to include the position in the self-evident from this figure. The discussion non-technical in this in chapter nature in global functions and an attampt the techniques employed virtual relations are purposely in been to demonstrate interactions within the system subsystems - the major decision has SDS and the of the RDS. Furthermore, implementing both the of no consequence to real- and the discussion, and have no impact on the methodology proposed. that the Finally, notice ilentical in their requests against The requests set-theoretic virtual-, real- and conceptual underpinnings, either are made used throughout operations or derived on virtual relations in a will be relations, relations. (These are thus and consistent fashion. in the be format they of real-, operations are as CHAPTER III PAGE 51 D PLIONS m Fi.qure 3. 4 on a" a -m a*N CHAPTER defined in Chapter II.) that a user mapping III PAGE 52 This does not of necessity imply will employ set-theoretic requests from a higher-level request directly; a language to a set-theoretic algebra is a well-understood, and conceptually simple operation. Figure 3.4 (See (6)) might involve Thus our hierarchical an additional user and the virtual relations; view of layer between namely, the a request language - to - set theoretic operation mapping facility. What we have presentel in this chapter is a methodology for achieving true data independence and providing the user with powerful relations. facilities At centralized data by two decision the for defining same time, base concept. application-specific however, we The methodology subsystems which assume some preserve is the enhanced of the system structuring responsibility. We proceed now subsystems. to a detailed inspection of the decision CHAPTER IV PAGE 53 Decision Variables and Truth Functions. This chapter describes the naming conventions to be used in subsequent sections for the naming of decision variables. Since there variables, for is it a was naming them. (Backus-Normal rather large set of decided to establish a This Form) method is these decision consistent method presented here format, along with the in BNF appropriate explanations. Note that all '$<number>'. generate decision This signifies that name. variable the BNF A reference names rule begin with number used section containing a to these numbered rules appears as Appendix 3. <relation id> is an XRM-assigned internal ID; <domain #> is the position of a domain within a tuple. Rule# Rule CHAPTER IV 1 PAGE 54 <qualifier>: := $1<relation <category> <qualifier type> <unit> <request> type> <join info> <options> These are variables domains in the containing statictics about the capicity of qualifiers selection criteria that appear in in the use of list a request. <relation type>::= <virtual>I<real>J<derived> <virtual> ::= v <real> ::= r <derived> ::= d <unit>: :=<doma in> I<relation> <domain> ::= d(<relation id>,<domain#>) <relation> ::= r(<relation id>) <request> ::= <retrieve>l<update>l<insert>l<delete> <retrieve> ::= r <update> ::= u <insert> ::= i <delete> ::= d <category>::=<siaple>l<compound> l<non-specific> <simple> ::= s <compound> ::= c <non-specific> : := n <qualifier type>::=<equality>g<nonequality> <unspecified> <equality> ::= e of CHAPTER IV PAGE 55 <noneguality> ::= n <unspecified> u <join info>::=<join>J<nojoin> <join> ::= j <nojoin> ::= n <options>: :=<null>I<index> I<no index> <null> ::= <index> i(trss) ::= (trss='total resloved set size') <no index> ::= n ExamPge: $lrd(i,j)rsen j-th times the number of is the domain of real relation i is used as a simple (ie: the only) qualifier equality in a retrieval constraint. No request, and join was was needed to used as satisfy an the request. The <relation type>, <unit> and should <request> be self-evident. For <category>, if there is only one domain in the list of selection criteria, then the <category> is 's'. In the event that there are several domains in the qualifier the <category> is 'c, and retrieval from the relation, the non-specific). if there is a sequential <category> will be 'n' (or CHAPTER IV For <qualifier type>, PAGE 56 a constraint in a qualifier can be of essentially two types: . an equality constraint, such as 'age=26', . an inequality constraint, such as 'income > 20,000'. If there is 'n') no constriaint (as is the case when <category> is then the <gualifier type> is <join info> will be a - or unspecified. 'u' in the event that there was a join 'j' required in the reslowing of the request, and it will be 'n' if no join was necessiry. <options> will be null in the If <category> is 'c', event that <category> is however, whether some other domain in inversion in the real relation. 'i' - or size of when then the <options> will list of qualifiers If so, then 'index', and the system will also the resolved set (the those domains restriction). domains in If with there was no other are show had n <options> is store the total set of entries indexes 's'. used domain in that results first the in list the selection zriteria, then <options> is 'n'. a of CHAPTER IV PAGE 57 Rule# Rule 2 <retrieved <relation object>::= type> <unit> <request> <abject> <join info> This is a set of variables that will contain statistics as to the use of domains as the objects of a request. <relation type>::=<real>j<virtual>l<derived> <real> ::= r <virtual> :=v <derived> ::= d <unit>::=<domain>J<relation> <domain>::= d(<relation id>,<domain #>) <relation> ::= r(<relation id>) <request>::=<retrieve>i<update>I<insert>I<delete> <retrieve> ::= r <update> ::= u <insert> ::= i <delete> ::= d <object>::= <simple> I <compund> I <entry> (<aggregate>) <object> <simple> ::= s <compound> ::= c <entry> ::= e <aggregate>::=SUMIAVIMAXIMINICOUNTIUNIQUE <join info>::=<join>I<nojoin> -<join> ::= j CHAPTER IV PAGE 58 <nojoin> ::= n Exampie 1) $2rr(i)rej is the number is of times a whole entry real relation i, retrieved from when a join was necessary to resolve the request. 2) $2vd(i,j)r(AV)sn is the number of times that the average of the values in oly 's') j domiin retrieved; no of (since <object> is virtual relation i is joins were needed to resolve the rule is similar to <category> of rule #1. request. <object> in this The value of <object> will be s if this is the only domain specified for retrieval (or update) in the request. If there are several domains specified, <object> may also be in then <object> is 'c'. <aggregate> if the individual items were not required, but some aggregation of them was. Rule# Rule 3 This <joins>::= $3 <relation type> <domain> <domain> type of variable involvement of relations of times a maintains statistics in joins. It specifies relation was joined to some other specific domain. about the the number relation by a CHAPTER IV PAGE 59 <relation type> ::= <virtual>t<real>t<derived> <virtual> ::= Y <real> ::= r <derived> ::= d <domain>::= d(<relation id>,<domain #>) relation i the number is $3rd(i,j)d(k,m) Example: j was joined by domain times of that to domain m of relation k. Rule# Rule 4 <system data>::=$4<system variable> These variables store information about system parameters and costs. <system variable>::=<block factor> blocking size> I <cost-per-I/O> I factor> I <index I blocking <relation <cost/byte/day> <overhead per call to XRM> I <time period> <block size>::= p <index blocking factor> ::= bfx(<domain>) <domain>::=<relation id>,<domain #> <relation blo.king factor> ::= bfe(<relation id>) <cost-per-I/O>::= io <cost/byte/day>: := sc <overhead per call to XRM>::= opc CHAPTER IV PAGE 60 <time period>::= t In XRM, bfx(i,j) is Vk. So, for our constant purposes we 'bfx' and 'bfe'. Vi,j, and bfe(k) can refer is constant to them The <time period> 't' will simply as be the length of time since the last restructuring of the data base by the SDS. All SDS decisions are based on last restructuring occurred, and so the period since the decisions will be based on this time period. Rule# Rule <relation data>::=$5<relation variable> 5 These variables are used to store information regarding relations. <relation variable>::=<degree><type> I <cardinality><type> <degree>::= #d(<relation id>) <cardinality>::= cy(<relation id>) <type>: :=<real>I<virtual>I<derived> <real>::= r <virtual>::= v <derived>::= d (<method of derivation>) <degree> is th e number of domains in the relation; CHAPTER IV PAGE 61 <cardinality> is the number of entries in the relation. Rule# Rule <domain data>::=$6(<domain name>)<domain variable> 6 are usel These variables statistics for maintaining about domains. <domain variable>::= <# unique values> <# unique values> ::= q Eramp1e $6(state)q is the number in domain 'state'. be found relation in Notice that for $6(i)g is the same are numeric, which of unique values that will domains that as the cardinality domain i appears. of the character strings, For may be anything from 1 to the cardinality of the relation it in which it appears. Rule# Rule 7 <user information>::=$7<user variable> <user variable>: :=<response time weight factor> <response time weight factor>::=r The <response preference for time weight factor> how the response time is a is to be user-supplied weighted in CHAPTER IV structuring decisions. inclusive. For will be purposes of 0.5, which is the variable may '$7r' it to all of '$4sc' in a value from this thesis, be taken into account '$4io' value. thru 1 of $7r However, by merely appending and '$4opc' and by appending 0 the value essentially a null cases where decision rules, is PAGE 62 '(1-$7r)' appear in the to all instances the decision rules. Truth functions. In addition to set truth of conditions. The the statistical variables above, functions used names of these to test is '0'. tests For example, for negativity, if T(i) then specific truth functions all begin '1' if applying true; otherwise the value were a if a for with: '$8'. The value of a truth function is the function to a spezifiz case is these is i<0, truth function T(i)=1, that otherwise T(i)=0. The truth functions employed here are presented below. that the <type> of a relation (Note (real, derived or virtual) is not important in applying truth functions.) CHAPTER IV j $8d(ij) domain $8i(j,k) domain k in relation PAGE 63 appears in relation i j is inverted. (For virtual relations, $8i(j,k)=0 always.) $8p(i,j) domain domains of primary key one of the j is relation i. relation i is $8x(i)r relation i $8x(i)d(<method>) <method> is is a derived relation, If of derivation. the method a involve not did a real relation. and <method> restriction, then <method>::=<nill>. $8n(i,j) domain j of relation i must be proviled for this is le: a value mandatory. domain before an entry in relation i will be made. Note $8n(i,j)=1 *j (Primary key where $8p(ij)=1. domains are mandatory.) domain $8u(j) $8r(i,j) j contains unique values same as $8n(i,j) except that Also notice name. $8d(i,j) Thus that $8r(ij) this is a (eg: socsec_#) it refers to a role is a subset truth function of that tests whether a role name is in relation i. $8_(j)<data type><starage strategy> <data type>::=<character> I <bit> <character>::= c <fixed> I <float> I <vector> I CHAPTER IV PAGE 64 <fixed>::= x <float>::= f <vector>::= t(<size>) <bit>::= b <storage I <real strategy>::=<virtual> I encoded> <real unencoded> <virtual>::= v <real encoded>::= e <real unencoded>::= u This set domain is j. to test functions is of truth For exmaple, the data type of $8_(name)ce=1 then domain 'name' if an encoded character string. is a $8f(ImI,lnI) of the truth function that tests whether each are functionally in domiins list Inl dependent on the whole list ii. Note 1) List Imi is members which of not a list of all list ini are domains on functionally dependent. Each n'4lnj may be functionally dependent on some 1xlx/Im 2) If also. In I=O (ie: is empty) then $8f(ImlIni) =0. $8m(lpi,iq) 1i is a function that tests whether lists IpI and are autually dependent. Ie: PAGE 65 CHAPTER IV and also $8f(Ip ,PqP)=$8f(IqjqpI)=1, $8m(lpI,IqI) implies $8m(IgI,Ip1). Transitivity also holds: $8m(Jpgjgj)=$8m(Iqi,IsJ)=1 implies that S8m(ptIsI)=1. which tests whether q $8c(IpI,g) (<function>) is a function (note is that q p' then q is computationally defined as Iq=6.3 * p. (<function>) on dependent a truth function set is '1' if computation the is required to derive q from the list $8od(k) domain q Ipl. For example, if dependent on domains is computationally is a list) not of domains Ipf. up for a request. It is domain k appears as one of the object domains in the request. is a truth function used in requests. It is '1' if $8eq(k) domain k appears qualifier with a as <qualifier type> 'e'. is similar $8nq(k) to $8eg(k) except that the <qualifier type> is not 'e'. Note that binary truth functions may relations function) where (depending existence signifies 'true', or '1'. be implemented as on of an the entry unary and particular in the truth relation CHAPTER IV The complete list of iecision PAGE 66 variables and truth functions that are used in this collection of decision rules is listed in Appendix 3. In addition, the following notation will be employed in subsequent chapters: . an '*' appearing in any decision variable name means the sum of all the possible replacements of the **. For example: $lvd(i,j)*sej $lvd(ij)rsej = + $1vd(i,j)dsej + $lvd(i,j)usej ($lvd(i,j)isej is not used.) Or: $lrd(i,*)rsej = SUM($lrd(i,k)rsej) . a '' in a name means the replacements for the for k=1,...$5#d(i) product of all possible '%'. For example: $lvd(i,j)rseg = $1v(i,j)rsen.$1vd(i,j)rsej . a list between two * I' (vertical bars) means the sum of that list. For example: $lvd(ij)Id,ulsej = $1vd(i,j)dsej + $lvd(i,j)usej CHAPTER IV The reader is advised to become PAGE 67 familiar with the 7 rules and the various truth functions to avoid continual reference to Appendix 3 and thus to expedite reading. PAGE 68 ZHAPTER V The Structural Decision Subsystem SDS . This chapter presents a detailed exposition of the SDS. The SDS is charged with the responsibility for: . maintaining the database in third normal form, . structuring the real relations current system usage is in such a way that most efficiently serviced, . modifying any system descriptor tables to reflect any change in the structure of the real relations. Since we are not implentation here, we system descriptor concerned specifically will not address the tables. This is, with any modification of nevertheless, a SDS function. As outlined in is a Chapter III, the creation privelaged operation. indefinite number of virtual While any of real relations user may relations, the define an disorderly or random definition of real relations would ultimately destroy the centralized nature, and cohesiveness of the data base. CHAPTER V The SDS is PAGE 69 concerned only with real since virtual relations may, by relation restructuring definition, by the user that defined them. However, virtual relation use for purposes only be altered the SDS must examine of making decisions about derived relations. There are two separate points at which the SDS can be invoked: . at definition time, and . during the life of the system - or dynamically. In case, either namely, to the 'best' function structure the simply one variables. of the is database for identical; its source of values expected of SDS use for the decision At definition time, values for decision variables of truth functions) are (and the definition supplied the SDS between these two ocasions use. The distinction is of to the SDS. continuous monitoring use, which become the time the SDS is At any time thereafter provides accurate the values for the invoked. exogenous, records of and however, actual decision variables at ZHAPTER V It is important to note that the no way dependent values. Given a an the PAGE 70 operation of the SDS is in source of set of values, the decision the SDS variable can operate. Thus, the stage of system life (definition, or subsequent thereto) does not predetermine any particular operations to be performed by the SDS that are not required at other stages. The SDS presented here is cost centered. Ie: it attempts at all times to minimize cost as opposed, for example, response time. However, in light of the fact that response often the most crucial factor for many users, <response time weight factor> provided time is there may be a by the user ($7r). If none is supplied, the lefalut is 0.5 (which is in effect, null). All decision rules here assume that $7r=0.5 We proceed now to the SDS proper. 5.1 Maintaining thir normal form. The algorithms within the SDS for maintaining real relations in third normal form are driven of the variety presented in by a set of truth functions Chapter IV. CHAPTER V Every domain mast appear either domain in some truth PAGE 71 as a functionally dependent function, or some domain on which others are functionally. dependent. It the is responsibility co-operation these truth with the data base functions. The system can be) data, and the user system is not the In order to interrelations, and the data will employ network-like It (and in in define fact no without substantial do so, the user must peculiarities base administrator education in this function. (perhaps administrator) to able to third normalize user-provided information. understand of is of the responsible for is envisioned that the user diagrams to aid in this task (See the scenario of Chapter VII). Note that the user is not asked to provide third normalized relation definitibns per se, but rather the information that will enable the This sstem to define third normalized relations. distinction (possible) dynamic real domain is to is important nature of the if considers real relations. If be added to the data base, required as to which real relation it knowledge that the one a new a decision is belongs in. data base administrator has the Given the of the data CHAPTER V base, he is he would the clear candidate for making the decision, and simply relation. Notice data, base relefine and that this is administrator, restructure the alternatives the to affected onl.Y course open whereas the system, adequate information would be in consider PAGE 72 to the provided with the position to dinamically the and full, expensive, restructuring of the affected relation. It is thus to deemed preferable SDS to allow the and to information, order that dynamic molifications The algorithm for Suffice it Appendix 1. provide necessary third normalize in to relations be efficient. is third normalization detailed given a here that, to say the in set of functional- and mutual dependencies, the system can generate a database third normal form. (real relations) in computational dependencies are not Note that considered when third-normalizing. Now, assume that some new real domain database. for the set By having the new domain, 3f real normal form is is to be added to the functional- and mutual dependencies the system relatioas, this can determine where in the belongs if third new domain to be preserved. Furthermore, it is able to CHAPTER V determine the most PAGE 73 efficient method of including it in the data base. We proceed now assumption to that lecisions made third by normalization the SDS under the has occurred, and resulted in a set of real relations of the following type: <name>(<list of domains>) (<list of candidate keys>) The primary key will fewest domains. specified become the <candidate key> This maximizes the for all primary key with the chances of having values domains in selection criteria. For example: RR1(A,B,C,D,E) ((AB), (C,DE)) The primary key that tould be chosen here is 5.2 (A,B) Structuring Decisions. These decisions implementation are of the basically those that relations specified determine by the the third normalization process. These decisions are: 1) Encoding or virtualizing of domains 2) Indexing decisions (the creation of inversions) 3) Factoring decisions ' 4) Decisions to join permanently into a single relation CHAPTER V PAGE 74 any two relations that have the same primary key. One of the structuring decisions subsequently dismissei was that of third normal form) considered, and replacing a relation (in by two or more of its projections. The factors that are involved in any such decision are: a) the cost of transporting little-used relation to primary memory each relation is ased (A case domains of a time any part of the for splitting up the relation) b) the overhead relation, involved in and duplicate maintaining copies of an extra some domains (A case aginst splitting) c) The overhead iavolved in one of the performing a join each time domains split off is required for any reason (A case against splitting the relation) d) Pjggible reduced storage (A marginal case for splitting the relation up) However, since the co3t of (once the entry has primary memory is required anyway) is costs of value. There are occasions may be been located, and an I/O that it be clearly so minimal dominated by projection transporting unneeded domains to (b) smaller and (c). in which than (d) will is an uncertain the cardinality that of the of a original CHAPTER V relation, but this is never certain. As such, it appears that the by two or more of its PAGE 75 decision to replace a relation projections will never be made, and so was not included in the SDS. 5.3 Encoding and virtualizing decisions. 5.3.1 Virtualizing De-isions A virtual domain is one that is not stored physically, but rather is computed each time it is required from the domains on which it is computationally updating a virtual domain virtual nature of the domain dependent. Also, notice that is not is, a legal operation. The by definition, not visible to the user. A domain is a canlidate computationally dependent on is a for virtualization a set of other it is domains; Ie: p if $8c(IaI,p)=1. Any of functionally dependent (ie: j where candidate for virtualization the domains on which it if j<lpi) may also be virtual, but there must be a restriction CHAPTER V PAGE 76 to prevent circular camputational dependencies. If Namely: $8c(IrI,P)=$8c(iyji)=1 where a(IrI then: $8c(jxj.b)/1 Vb4*yi, Vii where p4Ixt The decision rule is: For a domain that is currently virtual, (cost(making domain real) + cost(use If: real) + if domain is cost(maintaining domain if real)) < (cost(using the domain if virtual)) then make it real. Otherwise leave it as virtual. Note that the cost of maintaining a virtual domain is 0. the event that a domains modified, on domain is which the it domiin clearly not the case if is real, then each time computationally itself must the domain is be any of the dependent modified. This virtual. Similarly, if the domain is currently real, if: (cost(virtualizing) + cost(use if (cost(use if domain virtual)) domain real) + cost(maintaining if < real)) then virtualize the domain; otherwise leave it as real. Separating out the various costs mentioned above, we get: 5* 3 1 tJ.l Cost (making domain real) In is is ZHAPTER V in the relation(s) of serial processing involves a This PAGE 77 on whibh it is computationally dependent exist, and computing the value of the virtual domain. It is which the domains to the appended then database new in the relation, form. Thus in the are basically two written out and there steps: locate the domains on whizh it is computationally dependent, and compute and store the value. Assuming that the domain in question domain d, is there are two possibilities when $80(tpI,d)=1 : Vj4IpI and $8x(i)=1 for that i, or a) $8d(i,j)=1 b) $8d(i,j)/1 for some j<IpI, and a single i In no joins case (a), domains on which d is all in relation i. of ca'se (b), Case (a) however, is there will be at least one join the general algorithm capable of fact, case. determining the would algorithm is Decision Subsystem form of If (RDS) case (b), there were cost domains in list Ipi for then case (a) such an and very possibly of Ipi, really a special in fact (case(b)) the and computing relation i which is, retrieval for all they are Computing the value of d consists simply necessary to retrieve all members several. the when retrieving computationally dependent; tuple from retrieving a value. In are necessary of a some serial the general case be automatically included. also required in determining by the In Request the cheapest way of CHAPTER V resloving a how to request. The concepts involved optimally retrieve all the desired function. PAGE 78 domains required to perform This algorithm is thus common to both this situation, and the RDS operations. detailed in are identical: Chapter IV. For our As such it has been purposes, it is enough to note that the invocation of the algorithm, given the list of domains Ipi, will result in way of performing the method the 'final' request. will call algorithm, and will be the 'cost (final) '. + ($5cy(i)r/4bfe) .$4io assuming that the the cost virtual domain, is real domain d is the cost of + $5cy(i)r.$4opc inserted plus the cost of writing it throughout all future lecision rules. the fact that relations are blocked, retrieve (or necessarily insert, update result in in computing the i. The component ($5cy(i)r/$4bfe).$4io to the cheapest the computation of cost(making domain real) becomes: cost(final) Ie: We method determined by the the cost of performing it And so, a cost estimate for the cheapest a value of out in the relation will become familiar It takes into account and that not each call or delete) real relation i. I/0. an entry will The cost component'$5cy(i)r.$4opc' covers the cost of the overhead in CHAPTER V each call to XRM. In this way, PAGE 79 we take account of the fact that each call involves some expense, but not necessarily an I/0. This will be found in most subsequent decision rules. 5.3.1.2 This is storage, Cost(use if domain real) basically comprised of plus the cost of retrieval the domain is the cost of additional (or other operations) if real. cast (additional storage) = 4.$5cy(i) .$4sc.$4t since each domain in IRM is At this point, the system a fullword domain. would make a decision regarding whether this domain should have an index (see 2 below) - ie: would determine whether $8i(ij)=1. If $8i(i,j)=1 then: cost(use if domain reil) = ( + $lrd(i,j)**e**.($4io + $4opc) ($lrd(i,j)*sn*+$lrd(i,j)*cn*n) (($5cy (i)r/$4bfe).$4io $5cy(i) r.$4opc) + (trss/$lrd(i,j)*cn*i(trss)) ($4io + $4opc) ) + CHAPTER V If PAGE 80 $8i(i,j)=0 then: cost(use if domain rel) = ((trss/$1rd(i,j)*cn*i(trss)) ($4io + $4opc) ($1rd(i,j)*s**+$lrd(ij)*c**n). + (($5cy (i) r/$4bfe) .$4io $$cy(i)r.$4opc) Thus, in is an index, any the event that there case where domain j is used as a qualifier in a <qualifier type> of 'e' selection index. criterion, it simply a is For <qualifier type> of 'n', the resolved selection set, after using the index is of no use, indexed, was criteria that needed. If some other domain and some serial search will be in of using case index, that only then need be the serially searched. If there cases, except those where other lomais in the request with an index, a index, then all is no there is some serial serach is 5.3.1.3 The required. cost (maintaining domain if cost of maintaining the real) domain if it is real is an CHAPTER V PAGE 8 1 additional update each time any of the domains on which the new real domain is z-omputationally dependent and in another is domain on which j same relation, in any updated relation, then is way. dependent is computationally there is because if This is no additional I/0, in a the or call to XRM. For domain j of relation i: cost = ( $2rd(ik1)u**.($4io + $4opc) ) Vm*IpI where $8c(IpIj)=1 and $8d(i,m)=0. 5.3.1.4 There is cost(virtualizing) a choice deleted from being virtual, as to whether the relition or whether cost(virtualizing) If it is it and not physically removed. not physically removed, is the domain just is If physically marked as the domain is then: = 0. physically removed, cost(virtualizing) = 2.( then: ($5cy(i)r/$4bfe).$4io + $5cy(i)r.$4opc ) for i where $8d(i,j)=1. le: the process of removal involves serially reading and ZHAPTER V then writing (with the PAGE 82 domain removed) each entry in the relation. remove the domain or not The decision whether to physically is: If deletion) < cost(physical wasted) cost(storage then physically delete the domain. cost (physical deletion) is as above. cost(storage wasted) = 4.$5cy(i) r.$4sc.$4t Thus the a two-tierred virtual domains are illegal; inserts and cost(virtuailizing) decision is decision rule. 5.3.1.5 cost(use if (Note: updates of domain virtual) deletes are unnecessary. Thus only retrievals and use of the domain as a qualifier are permissible for virtual domains.) The cost(use if doaiin virtual) is broken types of use: . use as a qualifier . the object of a retrieval request. 5.3.1.5.1 cost(use as a qualifier) down into two CHAPTER V value of processing and a serial This involves PAGE 83 computation of and checking for all entries, the virtual domain the that value against tha criteria specified in the qualifier. The 'cost(final)' same as is the that described in 1.1.1 above. For cases where there was some other domain in the selection criteria that the was indexed, is searched serially size of the set (on to be average) (trss/$lrd(i,j)*c**i(trss)). cost(use as a qualifier) = ( cost(final).($1rd(i,j)*s** + $1rd(i, j)*c**n) $1rd(i, j)*c**i(trss) .cost (final') ) where cost(final') is all instances of except that the same as cost(final), $5cy(i)r in the algorithm are replaced by 'trss'. 5.3.1.5.2 cost (retrieval) cost(retrieval) = ( $2rd(i,j)r**.cost(final) ) This ends the discussion of PAGE 84 V CHAPTER virtualizing decisions. We move on now to encoding decisions. Encodinqg Decisions 5.3.2 Encoding decisions are made only for domains which have a data type of 'character'; le: where $8_(j)cu=1 For purposes of coding scheme. voluminous, This was as the The scheme character that will 2). unless the is number of (ie: $6(j)g is one becoming to coding schemes the purpose here be employed here strings which case that if the to avoid possible Furthermore, consider only is is not but rather to present an approach. ! (log($6(j)g)/log integer, will done simply number of potentially infinite. to be complete, thesis, we this as bit ('1' means here is the strings the expression evaluates to an the value used.) unique values in constant) The encoding decision becomes: This may the domain encoding of of next length highest integer, in only be done is constant CHAPTER V If the domain (say d) is PAGE 85 not currently encoded, then encode it if: ( cost(encoding) + cost(use if encoded) ) < ( cost(extra storage) + cost(use if unencoded) ) Similarly, if it is carrently encoded, then decode if: ( + cost(decoding) cost(extra storage) + cost(use if decoded) ) < ( cost (use if encoded) ) Breaking down these casts into the individual components: 5.3.2.1 The cost cost(encoding) of relations in processing of all replacing the id of the required domain d encoding consists of which domain d appears, and of the character string with Additionally, length. serial the there is a bit string the cost of building, and maintaining the encoding relation. cost(encoding) = ( 2.SUM(($5cy(i)r/$4bfe).$4io + $5cy(i)r.$43pc ) + ( ($6(d)q/$4bfe).$4io Vi such that $8d(i,I)=1 ie: + $6(d)q.$4opc ) all relations in ) which d CHAPTER V PAGE 86 appears. cost(decoding) 5.3.2.2 and storing the actual values The cost of decoding a domain rather than the values is identical cost(encoding) see 1.2.1 encaded to that of encoding. Ie: cost(decoding) = 5.3.2.3 cost(use if encoded) The way an encoded domain is used is to employ value as a primary key for the encoding relation. the code This means that each time the domain is the object of a retrieval or an update, or each be an additional time it is used as a 1/3 and an retrieve either the code or qualifier, there will additional call to the value (depending on whether it is being used as a qualifier or is the object). cost(use if to XRM Thus: encoded) = 2.(Slrd(i,d)***** + $2rd(i,d)ir,ul**) CHAPTER V * PAGE 87 ($4io + $4opc) 5. 3. 2. 4 cost (use if anencoded) of the domain Since use if encoded involves an additional I/O and call to XRM each time the domain is used, it follows that the use of the 1/2 of unencoded should be lomain if Thus: that if encoded; the additional retrieval is avoided. cast (use if uneacoded) $2rd(i,d)Ir,ul**) 5.3.2.5 The cost . = ( $1rd(id)***** ($4io + $4opc) cost(extra storage) of the additional storage unencoded values will be the required if the domain is difference between the storage unencoded, and that the domain is store the required to required if encoded. The cost of storage if unencoded will be a fullword for each entry in each relation in cost(storage if where $8d(i,d)=1 which the domain appears. unenzodel) Ie: =4.$4sc.$4t.SUM($5cy(i)r) Vi CHAPTER V PAGE 88 The cost of storage if the domain is encoded will be: relation - about 100 bytes . the overhead for the extra in XRM . two fullwords per entry first being the code, in the encoding relation; the and the second the actual value, and . the sum of all the space in each relation in which the domain appears. Ie: cost(storage if encoded) = ( (SUM(!(log 8.$6(d)g + 100) . $6(d)q/log 2) + ($4sc.$4t) Vi such that $8d(i,d)=1. The additional storage is thus the difference between these two. Note that the additional storage may be negative in the event that there is only a small set of values, given the overhead. This would not alter the decision rule in any way, as it would way should happen. in favor of not encoding, which is what CHAPTER V Thus: PAGE 89 cost(extra storage) = ( 4.SUM($5cy(i)r (log $6(diq/log 2) (SUM(! - + 8.$6(d)q + 100) ) .$4sc.$4t This domains. 54 lecision concludes the regarding encoding rules of We proceed now to indexing decisions. Indexing Decisions. In XRM, as described in Chapter II, any domain in the system may have an inversion, or index, created for it. When there is an inversion on some domain, and that domain is used as a qualifier with <qualifier extremely rapid. themselves type> is taken here may be qualifiers of type 'n'. when the not address such we are interested easily or in that domain schemes that some for gualifiers but we shall For our purposes, approach There are to indexes 'n', then retrievals, with the specified value locating of tuples is type> 'e', address <qualifier schemes here. in an approach, and the to include extended CHAPTER V PAGE 90 In XRM indexes are built in a specific way. Specifically, an index entry 'value/id' is a primary key. Indexes are Given a value, it is pair, where the value implemented as is the binary relations. used as the primary key to locate the id of a tuple containing that value in that domain. However, the use of a value as There is a primary key has severe limitations. no reason why entries, and in a value should not appear fact, that is the case of primary keys). usually the case There in many (except in is thus some method needed which allows for this factor. The method employed in XRM is entries that have any Thus, while there bytes) for each pair, we now 'id/pointer' 'id' additional have by ISAM, decision rules, index with the start 'value'. storage specified domain. be two consisting of the chain There required is index is (8 'value/id' consisting of a having the therefor, per chain over one the Furthermore, while several levels of indexing, the XRM fullwords of a entry relation implementation. many schemes have found in each a in the ordinarily index entry, word of strict binary one value would pair, replaced to chain together all id's of such as that only one level deep. The however allow for a multi-level index. CHAPTER V PAGE 91 The decision rule for indexing is: + ( cost(storage) If Zost(projected use with index) + cost(building index) ) < cost(projezted use without index) then build an index. already exists Similarly, if an index eliminate the 'cost(building index) ' for a domain, then part of the decision rule. The projected use of the domain is based on that experienced in the preceding in assumption continue all again be $4t. There rules decision these which unchanged, assumption to make. will time period: is, in fact, In the event that use invoked, and will proceed is an that a implicit use will reasonable changes, the SDS under the same assumption. We proceed now to breik down the components of the decision CHAPTER V PAGE 92 rule. S.4. 1 , cost (storage) cost(storage) = SUM(((# entries i) at level . (space per entry at level i)) + (overheal for level i) ), i=1,...L where L is the number of levels of index. As stated above, in XRM, L=1. The following is also true of XRM: . overhead per inversion is . space per entry is approximately 50 bytes 8 bytes, plus 4 bytes per chain (see above) Thus, for XRH: cost(storage) = (50 + 8.$5cy(i)r + 4.$6(d)q).$4sc.$4t for an index on domain d of relation i. 5.4.2 cost(projected ase with index) This component of the decision down into three sub components. . cost(retrievals) rule can be These are: further broken ZHAPTER V PAGE 93 . cost(decoding index) . cost (maintaining index) 5. 4. 2. 1 cost (retrievals) is an index If there time that domain j on domain j In of entries. then the index required. Since the is the size of the resolved set event that the of no this is rules, we decision type 'e', the is used as a qualifier of index can be employed to limit -an then any of relation i, value, and a 'n', is qualifier type serial search the case throughout all eliminate those cases is of these where the qualifier type is not 'e'. The decision rules specified here will include thus only <qualifier 'e' type> decision variables. We assume, furthermore, that entries are retrieved distributed randomly throughout the relation. The subcomponent cost(retrieval) thus becomes: cost(retrieval) = ( ($5cy(i)r/$6(j)q).($1rd(i,j)*se* + $lrd(i,j)*ce*n) + MIN((($5cy (i)r/$6 (j)q).$1rd(i,j)*ce*i(trss) (trss/$lrd(i,j)*ce*i(trss) ) . ($4io+$4opc) ), PAGE 94 ZHAPTER V If domain j index, j', relation i, say domain domain in then some It has and transfer some of the $lrd(i,j)*ce*n to $1rd(i,j)*ce*i. relation, there If has in requests the to variable been a in the on some domain to drgp an index the necessary decision from requests transfer then therefor is criteria tentative decision now other domain in the selection become requests in which some index. compound index will had an other domain which no at which warrants an were previously that reguests at the time see whether it is being eviluated to requests in for some other index decision has been made a tentative oposite direction. The number of is a requests transferred function of the number of compound requests that a given domain was involved in as a fraction of all compound requests. following number of requests from Ie: Transfer the $lrd(i,j)*ce*n to $1rd(i,j)*ce*i: $1rd (idl)j*ce** . $jrijL&j*c e** $1rd(i,*)**e** $1rd (i, . $lrd(i,j)*ce** *) **e** 5.4.2.2 cost (decoding index) This is a function of both the CPU overhead time invoved in the decoding of an iniex, as well as the necessary number of I/O's to get the index into primary memory. However, since HAPTER V CPU the I/O time, with comparison small in so time is PAGE 95 decision rules presented here will not take into account CPU overhead in decoding indexes. Note (($4p/2) - $6(d)g) blocking factor index maximum that the is . number of I/O's decoding the index is thus the The cost of ($4bfx) necessary to bring the index into primary memory. Ie: cost(decoding index) = ($1rd(i,j)**e** . 5. 1.2.3 The ($5cy(i)r/$4bfx) . $4io) cost (maintaining index) cost of is index maintaining the the function of a number of new entries that are made in the relation, as well as of the number of times a What is presented here require the However, to be be to index and the length updates involving the that the domain in of a series order to index need not all memory. rules should of inserts or take into account be brought memory separately for each operation. rules updates primary into brought strictly correct, the decision concerned with the fact be updated. the decision insertions that the is of the purposes for assumed value in the domain is into primary CHAPTER V cost (maintaining index) = lu,dI** ).($4io ( $2rr(i)ien + $2rd(ij) 5.4.3 + $4opc) cost(building index) The cost of building the index will retrieval of each entry operation to the index. the PAGE 96 overhead of a relation, in the Notice that single call be the cost of a serial to and a write the rule below has only XRM. This is because inversion is accomplished by a specific XRM routine. cost(building index) = $4opc+$4io. (($5cy(i)r/$4bfe)+($5cy(i)r/$4bfx)) 5.4.4 cost(projected use without index) In the event that there is no used as a in which no index, any time the domain is qualifier in a simple query, or other domains in serial retrieval is necessary. a compound quiry the qualifier had an index, a CHAPTER V PAGE 97 cost (projected use without index) = + (($5cy (i)r/$4:fe) .$4io) $5cy(i)r.$4opc). ($1rd(i, j)*se*+$1rd(i,j)*ce*n)) + MIN( (($5cy(i)r/$4bfe).$4io ($lrd(i,j)*ze*i(trss)) + $5cy(i)r.$4opc) , ((trss/$1rd(ij)*ce*i(trss)). ($4io+$4opc) ) This concludes the decision rules for indexing decisions. We proceed with decision rules for factoring. Factorinq Decisians. 5.5 Factoring decisions are decisions regarding the storing of aggregations (or factored data) as opposed to computing them each time they are required. The aggregations system recognizes are: . MAX . MIN . COUNT UNIQUE . AVERAGE . SUN the namber of unique values which this CHAPTER V PAGE 98 This information applies only to numeric domains. In addition, the folloving should be noted: are always reserved in . storage for these aggregations and so the cost of the system tables, storage is not considered in the decision rules. . COUNT is as the RDS always maintained by the system, continuously. uses it . UNIQUE is no domain, and so is not is the underlying always maintained If SUM is stored, there is reverse COUNT of more than the true, no need to store AVERAGE. The howver because of possible roundoff errors. The factoring decision is: If cost(maintaining aggregation) < cost(computing aggregation) then store it. 5.5.1 If not, compute it each time. cost(computing aggregate) This consists of a linear processing of the relation, and so the cost is simply: PAGE 99 CHAPTER V (($5cy (i) r/$4bfe) . $4 i3 + $5cy(i)r.$4op: ).( $2rd (i,j)r(<aggr>)** where <aggr>::= SUM I AVERAGE I MIN I MAX 5.5.2 cost (maintaining aggregation) This computing involves or updating maintaining it, domain is the aggregation updated, it each time once, and a value then in that deletel or inserted. cost (maintaining aggregation) = (($5cy(i)r/$4bfe).$4io + $5cy(i)r.$4opc) + ($2rd(i,j) lu,dj** + $2rr(i)ien) . ($4io + $4opc) This concludes the liscussion of factoring decisions. We proceed now with permanent join decisions. 5.6 Permanent Join Dacisions This decision domain(s) rule is employed in has (have) been added to the event that some new the system and have been CHAPTER V key of the existing will be relation in which the new domain(s) belong(s), and the new domain(s). because both relations required. This is and so a have the same primary key, a trivial matter. join is are the relations If retrieval is another required, or a join is in the existing relation, precisely, more This means that used in conjunction with any any time these new domains are of those the the new relation The format of existing relation. the primary avoid restructuring to separate relations up in set PAGE 100 retrievals additional then left separate, required there will certain satisfying in be requests. The reverse is not true. That is, if the relations were to be permanently joined (restructuring were to occur), they were left separate. case where to the drop concentrate that they are rather of -oncept 3n This being so, cost(projected cost(projected the use joined over the result in additional retrievals permanently would able fact the where no case is there we are use) and premium). sp;The decision then becomes: If ( cost(restructuring) + cost(storage if restructured) ) < ( cost(projected use restructured) cost(storage if not ) then restructure joined) premium) + the relations relation. Breaking into a lown the single (permanently cost components, we PAGE 101 CHAPTER V have: 5.6.1 cost(projected use premium) necessary whenever relation are used domains in several case of a is This domain (s) in any the cost(projected use premium) being un-joined new some of those in a reguest together with the existing relation. Ie: = $3rd(i,Ipi)d(j,IpI) . pflpi retrievals additional ($4io + $4opc) where: Vp such that $8p(i,p)=1 This statistic is maintained separately by the RDS - ie: the number of times each relation is joined to each other relation. 5.6.2 cost(storage if The cost of be 4 bytes restructured) storage if the relations for each entry for each are restructured will domain relation excluding the primary key domains. cost(storage if restr)= in the new Ie: 4.$5cy(i)r.$4sc.$4t.SUM($8d(i,j)) CHAPTER V PAGE 102 Vj such that $8p(ij)=0 5.6.3 This is cost(storage if not restructured) except that the similar to the computation of 5.6.2, restriction that not be in the lomain the primary key is lifted. Ie: cost(storage if not restr) 4.$5cy(i)r.$4sc.$$4t.SUH($8d(ij)) the new relation where i is 5. 6.4 cost (restructuring) operations relation, and j retrieve from joined entry the new writing three key to primary processing of one relation, keys consists of a serial its primary the same relations with of two The restructuring for each the other out again. entry. If is the new relation, then: cost (restructuring)= 3. ($5cy (i)r/$4bfe).$4io + 3.$5cy(i)r.$4opc relation, and In other i is using the worls, existing CHAPTER V This the completes rules decision restructuring for rules responsible with decision proceel now decisions. We PAGE 103 for establishing derived relations. 5.7 It Derived Relation Decisions. created solely for to create out that to point is important derived derived relations raisons of efficiency, are relations structuring decisions of quite and are so decisions independent the type mentioned above, of and the third normalizing prozess. The choice exists basically relation defined by some user between storing a virtual for some specific application (or set of applications) or simulating that virtual relation each time it is referenced. If: cost(simulating virtuil relation) (cost(storage for relation) + overhead) < derived relation) + cost(use cost(creating derived relation) + of derived cost(update PAGE 104 CHAPTER V then continue simulating the virtual it is relation. If not, then cheaper to store it. Similarly, if a derived relation cease storing it relation would to return and be the is stored, the decision to to simulating same as that above, the virtual except that it would exclude the 'cost(creating derived relation)'. 'cost(final)' are the same All references to earlier in 5.7.1 as those made the chapter. cost(simulating virtual relation) A virtual relation is simulated by performing joins in the user workspace of various real relations that are required for a particular request. cost (simulating virtual relation) = cost(final).( $1vr(m)*nu* + 1/$6(j)q.($lvd(m,*)*se* + $lvd(m,*)*ce**) + 1/2($1vd(m,*) *sn* + $1vd(m,*)*cn**) ) 5.7.2 cost(storage for derived relation) The storage for a lerived relation will consist of the CHAPTER V overhead domains of per relation, and PAGE 105 the storage each of for The cardinality the derived relation. the vill be approximated by the miximus cardinality. cost(storage for derived relation) = ( 100 + $5cy(m)d.$5#d(m)v.4 where the derived relation x il $5cy(O)r and O=i x... is ) . relation Vi such that $4sc.$4t m, and $5cy(m)d = $8d(ij)=1 and Vj where $8d(mj)=1. 5.7.3 cost(use of derived relation) Before determining the use cost of the derived relation, SDS would make virtual relation an inlexing decision for the (see 2 above all Irv in relation types). A except, the domains of the substitute 'v' for derived relation may then be PAGE 106 CHAPTER V an analogous manner to a real relation. treated in Ie: cost(use of derived relation a) = ($5cy(m)d/$4bfe).$1vr(m)*nu*.$4io + $5cy(m)d.$lvr(m)*nu*.$4opc + SUM(($8i(m,j).($5cy (m)d/$6(j)q). $lvd(m,j)*js,cle** . ($4io + $4opc) + (1-$8i(mj)) ($5zy(m)d/$4bfe). $1vd(n,j)*Is,cle* . $4io +$5cy(a)d.$lvd(m,j)*Is,cle* . $4opc) + (($5cy(m)d/$4bfe).$lvd(m,j)*js,cln** . $4io + $5cy(m)d.$lvd(m,j)*Is,cln** . $4opc ) Vj where $8d(m,j)=1 5.7.4 cost(creating derived relation) The cost relation is no of creating the derived the cost(final), since that is, in fact the more than optimal way of creating it. 5.7.5 The cost(update overhead) update overhead for a derived relation involves an additional I/0 and call to XRM for each change to any of the CHAPTER V the real relation that also appear in the derived domains in relation, PAGE 107 m. Ie: cost (update overhead) = SUM($2rd(i,j)Iu,s,il** . and Vi where $8d(i,j)=1. $8d(m,j)=1, This completes the discussion of can be seen Vj where ($4io + $4opc) the SDS decision rules. It that these rules are guite modular in that any Derived Relation Decisions - major decision - for example, are composed of several subdecisions. subdecisions can be replaced without Any or all of these affecting any other part of the SDS. These rules will be applied in the scenario of Chapter VII. CHAPTER VI The Request Decision Sgbsystem - The RDS is it specifically, More database. the (RQS) overseeing any requests that are responsible for against made PAGE 108 is responsible for the fallowing functions: . determine information and access control legal - request is whether the ie: check decide based on that whether to perform the request. . system the whether information more requires - ie: satisfied, or logically be it can determine whether feasible is the request whether determine to resolve the request. . determine the most of satisfying efficient way the particular request, assuming it is deemed 'feasible'. . update the relevent decision variables. the legality if access control relation level. the in a request. It mechanisms will be The :reation controlled in such a sees will omit the of this thesis, we For purposes virtual is envisioned that implemented at of virtual that which the the real relations can way as to make the data relation only question of be that the user he (it) is permitted to see. Any data the user is not authorized to see PAGE 109 CHAPTER VI will be thus making the supplied fact that is there some being data not can security Thus, the user. to invisible mapping process, data during the removed from the be implemented as restrictions of real relations. We proceed RDS will contain for the resolving of requests. detail the We shall not variable updates which decision points at that the collection of algorithms now with the are performed for reasons of making an already difficult section more unreadable. it Instead, point specific decisian should be fairly clear at which context of the discussion at that point. decision variable completion of Step tentative and step. The All updates only become reason for this Finally, note that actually made aplates are not 5. from the variables will be updated final at until until the that point the conclusion approach will become are of the clear from the iterative nature of the RDS. Note that the notation developed here. In thus far will be continued addition, the decision variable be taken to read '$5cy(i)r OR $5cy(i)d'. '$5cy(ije' should CHAPTER VI PAGE 110 stepR-. Step purpose of The 1 establish is to a set of truth functions regarding the domains appearing in the reguest. The truth functions set up are: = 1 * $8od(k) request. If for all k that are object an entry is a relation) then specified domains in the (ie: all domains in such a truth function is set up for ever! domain in the relation. . $8eq(k)=1 for all domains k that appear as qualifiers with <qualifier type> 'e'. . $8nq(k)=1 for all domains k that appear in the request as qualifiers with <qualifier type> not 'e'. Note that k may be a role name instead of a domain name. If the list of domains appearing in the request is RIP,then for each k<IRI find all relations i such that: $8r(i,k)=1 or $8d(i,k)=1, and $8x(i)r=1 or $8x(i)d=1. but if $8x(i)d=1, i should not be a derived relation whose derivation involved a restriction. This step finds all relations in which each of the domains VI CHAPTER in JRJ appears. This step produces These each domain k ie: ji(k)j. If for a list of relations for lists are all members of a where L (I ji(k)11 = L(IRI). lIi(k)l1, single list: PAGE 111 any Ii(k)I we have L(Ii(k)I)=0 then the request is because some role- or domain name is 'infeasible', the request which has not been defined. that a domain that is and if appears more than once in the case It is used in also possible a single relation, the request must supply role names. ITe: $8d(i,k)=0, and k'' and $8r(i,k')=1 and $8r(i,k'')=1 where k' are role names. qtep_3. This step simply establishes a eventually list of relations which will (at the end of the alogorithm) become the optimum list of relations to satisfy the request. Ie: set up list Ifinali, initially 'infeasible', and cost(Ifinall)=2**30 (or some maximum number). Step_4. This step finds all possible combinations of relations that PAGE 112 CHAPTER VI of relations that satisfy the cantain based request, established in Step 1. of the step is The results can satisfy the request. the domains all on the a list to necessary truth functions set of The procedure is as follows: Vii(k) I < I li(k) 1 : Initialize if Vjfk lxl=Ii(k) I IxI N Ii(j) I = g then: 1x1=1xI U jij)j lxi in ascending order, order list been 'done'. already has Ii(k)I. If not and see if go on If yes, then save a copy of this that list to the next lxi as 'done' and do Step 5. After completing Step 5: If and lxi is cost(IxI)<ost(lfinall) not 'infeasible' then set: Ifinall=s', and cost(IfinalI)=cost(ixi) where s' is the collection of set theoretic operators generated by Step 5. The results of this step (after repeated invocations of Step 5) is the theoretic optimal list Ifinall operators, cost(ifinall). s', and a as an collection of associated set cost, PAGE 113 CHAPTER VI This step consists of a series of substeps whose function it relation more precisely, via relations in xir from Step 4. IxI - for a given any - or for resolving the request the lowest cost If cost(Jx) determine the lowest is to Can relation iIxi i'4txl then then be not joined to any is 'infeasible', lxi other and the evaluation stops. We proceed now with a detailed algorithm for determining the minimum cost(lxi). Ste_ 5.1 lxi with all relations in This step finds cheapest way of ordering key, and the the same primary the necessary joins, and restrictions. Repeat Vif1xI not already 'done': Set <null>, Ix't to cost(Ix'1)=2**30 and (or some large number). Now find all relations j (say Iji) in lxi with the same primary key: Te: Find all j such that $8p(j,k)=1 Yk where $8p(i,k)=1. This results in a set lj'=i U iji Vaflj'I do the following: PAGE 114 CHAPTER VI Set Ir'I to <null>. where k~a each Far r=($5cy(a)e/$6(k)g). is IreI If in Irli this r Insert $8i(a,k)=1, and $8eq(k)=1 set such that in ascending order. $8eg(k)=1 Vk where $8p(a,k)=1 then r=1, ascending position in is This case and insert in Ir'I. of the primary where all domains key are qualifier domains with <qualifier type> 'e'. with <qualifier type> 'e' Also, if any qualifier domain is unique, then r=1. Ie: If $8eq(k)=1, insert r in If and Ir'I. member of Ir'I; ie: the smallest r<jr'l. = cost(jj'I) $8i(a,k)=1 then r=1, and ascending order in Choose w = first Then cost(lj'l) $8u(k)=1 w.($4io + $4opc).L(Ij'I) < cost(Ix'I) cost(Ixe 1)=cost(Ij'), then: and x'lI=Ir' Reset Ire to <null>. The case now is where r will resolve, but there is no index on any of the qualifier domains: If $8eq(k)=1 and $8u(k)=1 then r=1, and insert in ascending order position in Ir'I. For each k<a where $8eq(k)=1 and $8d(a,k)=1, but $8i(a,k)=O, set: r= ($5cy(a)e/$6(k)q), and insert r in ascending order in PAGE 115 CHAPTER VI IrII . that r=1 reason the that (Note when $8u(k)=1 is that $6 (k)q=$5cy (a) e.) After completing all k~a, pick w = first member of Ir'lI; ie: the smallest r. Then cost(Ij'I) = ($5:y(a)e/$4bfe).$4io + $5cy(a)e.$4opc + (L(Ij'l)-1).w.($4io + $4opc) If cost(Ije')<cost(Ix'I) then: and cost(Ix'I) = cost(IjeI), Ix'I = Ir'I Reset Ir'I to <null>. Now, for each k<a where $8nq(k)=1 and $8d(a,k)=1, set r=1/2.$5cy(a)e, insert r and in Irel in ascending order. After all k<a are completed, ie: fir the smallest r. cost(I j') = choose w=first uemb er of Ir'lI; Then: ($5cy(a) e/$4 bfe) .$4io + $5,y(a)e.$4opc + (L(jj'j)-1).w.($4io + $4opc) If cost(Ij'I)<cost(Ix'I) cost(izxI) = cost(ij'i) then: and PAGE 116 CHAPTER VI Irxe If at = jr'i this stage jx'j is there were required, since null, then a serial no qualifiers retrieval is request. in the Proceed as follows: such that $5cy(a)e is Find a*j I a minimum Va*Ij'I. and set s=a. Mark a as 'done', , do the following: For each k/a, k(Ij Mark k as 'done'. If Vb where: $8nq(b)=1 ($8eq(b)=1, ($8d(s,b)=1 or $8od(b)=1) and $8d(kb)=1) and then no needed, join is s. since all domains needed from k are already in Otherwise, the set theoretic operators become: * k(I=,pI) s=s(Ipj) cost(Ix' ) = Vp4lpi where $8p(a,p)=1, and ($5cy (a) e/$4bfe) .$4io + $5cy(a)e.$4opc + $5cy (a)e.($4io + $4opc).(L(Ij'1)-1) If, hvwever, procedure, x'I was not <null> at the end of the above proceed as follows: Choose a=first member of Jx' and mark a as 'done'. (ie: where r is smallest), CHAPTER VI 117 PAGE Set s=a. The set theoretic 3perators then become: R s=s(Ipj) (Ie,pi) such Vp4IpI that $8eq(p)=1 OR $8nq(p)=1, domain p. (see Appendix 2, where '0' is the qaalifier on Page 175) Now, for all k/ea, k<Ij'j do the following: Mark k as 'done'. If Vb where: ($8eq(b)=1, $8nq(b)=1 or $8od(b)=1) ($8d(s,b)=1 and $8d(k,b)=1) and no join is necessary, since all domains needed from relation k are already in s. Otherwise, establish the following set theore tic operators: If $8eg(b)=0 and $8ng(b)=0 Vb~k, s=s(jpi) * Or, $8eq(b)=1 if k(I=,pI) Vp4lpl or then where $8p(a,p)=1. $8nq(b)=1 ($8d(s,b)=O and $81(k,b)=1 for some b<k, and ) then set theoretic operators become: s=(s(IpI) * $8eq (t) =1 or k (I=,pi)) R (Ie,tI) $8nq(t)=1, and 8 Vt<IkI such is the qualifier that type on CHAPTER PAGE VI 118 domain t (see Appendix 2, Page &pno3). have been so included in Once all kij'l the set theoretic operators, they can be removed from the list of relations to be joined ( ie: replace them from would result that relation and Jxj) Also, s. the single all by up set truth functions as follows: $8d(s,b)=1 Vb where $8d(k,b)=1 for k<Jj'J, and $8p(s,b)=1 Vb where $8p(a,b)=1. Also set $5cy(s)v = r. Nzte that limit on r in this case is, strictly speaking, s, the cardinality of since other an upper qualifiers may well result in reducing the size of r. At the completion of Step 5.1, there exists a collection of (virtual) relations Is'I each generated step. Each s4js'I is a relation derived) relations in restricted as and is Ixi that as outlined in this consisting of all real (and have the same required by the primary key, qualifiers in the request. If L(Is'I)=1 then steps 5.2 , 5.3 and 5.4 may The algorithm continues in this case be omitted. with where it left off CHAPTER VI PAGE 119 in Step 4. Step 5.2 This step is manner, responsible for joining in all those relations s*Is'l, contains the primary key primary key as) tfls'j, If there is Step 5.3. in the case where s~is'i of (but does not no s and t Otherwise have the same t/s. Vk where $8p(t,k)=1, Ie: $8d(s,k)=1 the most efficient s,t*Is'I. where this is the continue with the case, proceed to algorithm of Step 5.2. Assuming s contains the primary key domains of t: to see check domains (for Is'I whether s already this request) in t, contains and if so, all the needed remove t from and continue with some other tis'l. If $8od(k)=o Vk where $8d(tk)=1 and $5cy(t)v=1 then establish 2 set theoretic operators: 1) t 2) - ie: leave t as it s'=s(jpl) R (I=,pl) is in p<IpI Is' I, and Vp where $8p(tp)=1, values for the domains are obtained from and (1). cost(Is, 1)=cost(t) + cost(s). This means that if there are no domains that are to be CHAPTER VI retrieved from t, ($5cy(t)v=1), will ani t then resolve PAGE 120 resolve to t and use the a single entry values for the primary key domains as additional qualifiers in s. If $8od(k)=1 k where for some $8d(t,k)=1 then proceed as follows: s'=(s(jgI) * t(J=,gI)) where g4|gI R (Ie,pI) Vg where $8p(t,g)=1, and Vp*ItI such that $3eg(p)=1 or $8nq(p)=1, and e is the qualifier type on domain p (see Appendix 2, Page &pno3). cost(s')=cost(s) + 2.$5cy(s)v.($4io + $4opc) Remove t from the list of relations to be joined (ie: jx'j), and set up additional truth functions for s' as follows: $8d(s,b)=1 Vb where $8(t,b)=1. At the conclusion of this step, there exists a collection of virtual relations Is" I each generated from some member(s) s in Is'l. Furthermore, these relations are only joinable in particular way - namely, as in Step 5.3. a CHAPTER VI PAGE 121 Step 5.3 This step is responsible for joining relations s'*Is''I from step 5.2 (or 5.1) to yield a single relation containing all object domains in the request. Note qualifier domains will that by this stage, all have been employed in the necessary restrictions. We proceed as follows: Choose tfls''I such relations in Then, that $5cy(t)v is the minimum of all Is''J. Va*is''j, aps do the following: If $8d(t,k)=1 Yk where : ($8eq(k)=1, $8nq(k)=1 or $8od(k)=1), and $8d(a,k)=1, then t contains all the domains neccessary for the request that are in a. So, remove a from Is''I a<ls'' . See if $8d(t,k)=1 and $8d(a,k)=1 for two relations in Is''I some other afls''I. If so: Contain a and continue with some k. the next Ie: see if any common domain. If not, try CHAPTER VI t=t(k) * a(k), and + ($4io + $4opc).$5cy(t)v.$5cy(a)v cost(t)=cost(t) + cost(a) Remove relation truth functions. At this stage, cheapest method a from Is''1 and this is add the necessary if we L(Is'')=1, then for joining not the case, all relations have found in lxi, the and we Step 4. and L(Is''I)>1, then the request is #infeasible' with only the relations relations must set of ie: $8d(tk)=1 Yk where $8d(a,k)=1. continue where we left off in If PAGE 122 be found that will in lxi, and some other allow the request to be logically satisfied. This is the function of Step 5.4. Step 5. 4 This step attempts to specified in Ilxi will find relations allow the which, although request to not be completed. Finding such fintermeliary' relations can be accomplished as follows: 5.4.1) Set up list Ib'l=<null> 5.4.2) Choose some afls''1 CHAPTER VI PAGE 123 5.4.3) Find some relation b such that bjIxl and: $8d(a,j)=1 and $8d(b,j)=1. 5.4.4) If found, 5.4.5) If not set Ib'I found, = Ib'I U b then the request is 'infeasible', and continue where we left off in Step 4. 5.4.6) See t*a, tfls'',, 5.4.7) If $8d(b,k)=l if $8d(t,k)=1 and for some k, b<Jb'j yes: then set txI=IxI U gbil, and remove relation t from Is''f, replacing it with the single relation: a= (a(j) * b(j)) (k) * t(k). Continue with Step 5.4.12 5.4.8) If not, try 5.4.6 for some other t<(s''I, t/a 5.4.9) If that fails, fini relation cfix1, c/b such that: $8d(b,j)=1 and $8d(c,j)=1 5.4.10) If found, 5.4.11) If for some j. Ib'I=Ib'I U c, and repeat from 5.4.6 not, the request from where left off in Step 4. is 'infeasible', and continue PAGE 124 CHAPTER VI 5.4.12) If 11, repeat 5.4.1 through 5.4.11 . L(Is''I) 5.4.13) If L(Is''J) restart step 5 again from 5.1 1, then = with the new lxi. This completes in all relations First find See keys. smallest set the list of those 4hich of data, for optimally satisfying works as follows: it Very briefly, requests. primary alogorithm the and do have the that will resolve same to the relation with a join of that others of the same primary key. After relations with the same primary keys have been joined, see if any one other relation. relation contains If he retrieved the primary primary key key of any values to that can be retrieve from the other relation. After joining all joined in that way, relations by join primary keys remaining relations on some common domain. If there is no common domain, has domains common to both. find some other relation that CHAPTER VI This, then, completes earlier, the RDS variables, and function can is the discussion of the also responsible for the appropriate be PAGE 125 surmised from As stated updating decision points for the RDS. performing this description of the algorithm. We proceed now to a scenario in which the decision rules developed in Chapters V and VI will be applied. PAGE 126 CHAPTER VII Scenario for appligtion of Decision Rules. scenario that applies some of This chapter presents a brief the decision rules VI. Not all developed in Chapters V and of these rules be usel in the scenario, but a representative them will be, number of and that will serve to illustrate the use of others. Consider a company Manager. Each awn divided into Departments, each Department within -ertain a single Department. Parts, which are who are at a time, and all projects Assigned to work on one Project fall Employees, employs with its Each project by provided the requires company's SuupligErs. f we were to attempt to establish a company data base for this company we would have to: . specify the entities in which we are interested, . determine what data we want to maintain about each of these entities, and . determine how these entities interact. The interactions Given the company can perhaps best be structure above, done diagramatically. we might diagram the CHAPTER VII interaction between The entity at 7.1. 'owned' But Figure in some sense the head of an arrow is, * sufficient is not diagram appears in as it the entities by the entity at the tail. this PAGE 127 to express certain aspects of the structare. For example, Manager, one any for only is there 2ne Department while any Manager may have several Projects under his control. We will introduce an '=' terminology of this is a Autual Te: given determined, One other difference one head of the nature of the relationship. arrow to signify the one-to-one In the near the Chapter IV, functions of the truth .e~endenc.y. entity, and vice versa. case is between other the This is not expressed asses where is entity shown in Figure 7.2. namely the in Figure 7.2; one uniquely entity is uniquely determined by another, and cases where it is not. The concept of ownership is See (7) network model. * the same as that found in the CHAPTER VII PAGE 128 MANAGER DEPARTMENT J PROJECT II SUPPLIER PARTS EMPLOYEE Filure 7 1 PAGE 129 CHAPTER VII MANAGER PROJECT DEPA TMENT SUPPLI ER PA TS EMPLOYEE Figure 7 .2 CHAPTER VII PAGE 130 ** The former is a case 3f a functional dependency or a one-to-many mapping. An where several Supplier might mapping. The example of the latter is many-to-many latter is Supplier Suppliers may supply supply many a Parts. and Parts, the same Part, A and one possible method for diagramatically distinguishing between these two cases would be a double-headed arrow 7.2 is updated that we for many-to-many mappings. to include this concept in have a clear concept of the the entities, we can begin Figure Figure 7.3. Now relationships between to consider what information, or attributes, we wish to keep about each entity. *** Assume for purposes be maintained of the scenario that for each entity are the attributes to as specified in Figure 7.4. Now notice that some attribute (or combination of Functional dependency, as explained in Chapter II, means ** simply, given one entity, the other is uniquely determined, but the reverse is not true. For example, given an Employee, his Department is uniguely determined since an Employee can only belong to one Department. Note that consideration of attributes and consideration of interrelationships between entities are orthoganal, and as such, may be done in any order. The order presented here is by no means mandatory. *** PAGE 131 CHAPTER VII MANAGER I DEPARm ENT PR0 SUPPLIER EMPLOYEE !iagrge 7 .3 ECT PAGE 132 CHAPTER VII m_nime, Manager(Mgr#, office#) Department(Dept#, dname) Project (Proj#, paame, Supplier(Supp#, Part(p#, guant, s_name, startdate, enddate) phone) dite) Employee (socsec,ename,hiredate,salary, title) attributes) in each entity entity uniquely; such that the concept both of Figure 7.4, identifies as socsec would an of 'uniquely defined' has relationships between entities, the Employee. Notice been applied to and within entities themselves. We are now in a position to define truth structure between the entities as functions for the depicted in Figure 7.3, and within the entities as defined in Figure 7.4. First we define functional dependencies within entities by PAGE 133 CHAPTER VII up setting functionally as attributes The as depicted functions are resulting truth on an that uniquely define the attribute (or group of attributes) entity. dependent in Figure 7.5. We will call the attribute(s) in an entity on which the other attributes functionally dependent are the key of the entry. has been omitted from Figure Notice that Part 7.5. This is between Part and Supplier, because the many-to-many mapping and Part and Project, as depicted by the double-headed arrow in Figure 7.3. To the entities arrow, functional dependency at the which yields in end(s) tail truth key attributes entities, we include the functions for such of establish of the double-headed this case: $8f(p#,Proj#,Supp#I, Iquantdatel)=1 This same approach arrow) if would there were not be taken (ie: a double headed attribute(s) within an entity that uniquely identified the entity. CHAPTER VII PAGE 134 ,Im_name,office#)=1 $8f (Mgr# $8f (IDept#I, Id_namel)=1 $8f(IProj#1, Ip_namestartdate,enddatel)1 $8f (ISupp# 1, Is_naae,phone I)= 1 $8f(IsocsecI,Ie_name,hiredate,salary,titlel)=1 Fi3gure 7. 5 care of intra-entity dependencies, This takes portrayed in Figure 7.3. inter-entity dependencies that are All we done have thus double-headed arrow of far but neglects is account take Figure 7.3, but not of the any other types of arrows. In order to handle the proceed as single-headed arrows we follows: Add the key of the entity at the tail functionally dependent attributes of the to the list of entity at the head of the arrow. For the entities Figure 7.3, of Figure 7.5, and this process generates the interrelations the entities of of Figure CHAPTER VII PAGE 135 $8f(jfgr#,Im_name,office#j)=1 $8f (IDept#1 ,1d_namel) =1 $8f (IProj#IIp_name,startdate,enddate,Ngr#,Dept#I)1 $8f(ISupp#I,Is_name,phonej) =1 $8f(Isoc-secIIe_name Dept#, , hiredate , salary , title, Proj#l)=1 $8f(|p#,Proj#,Supp# I,Iquant ,datel)=1 Figure7.6 7.6 Now all that is '=#. a function is dependengl, the arrow head with the indicates a one-to-one mapping, This type of arrow mutual is left to consider and so a mutual establishe1 for the keys tail, and head of the arrow. $8m (IMgr#I , IDept#I )=1 dependency or truth of the entities at the Thus, from Figure 7.3 we have: CHAPTER VII PAGE 136 We now have a set of truth functions that reflect the entity attributes, well as entities. Now the as the interrelations SDS, by applying the the algorithm specified generates the third normalized in Appendix 1, between relations of Figure 7.7. Implementation of Relations. The procedure of diagramming the inter-entity relationships as outlined above proviles a logical method for establishing the truth functions necessary for the SDS to maintain third normalization. The next function of the SDS at this stage implementation of the the relations is to determine 7.7. of Figure The decisions to be made are: . virtualizing and encoding decisions, . indexing decisions . factoring decisions No derived relation, or permanent join decisions can be made at this point new domains base. Thus since there are no virtual have been all defined to be decision variables relations, and no included in referring to the data virtual CHAPTER VII PAGE 137 RR1 (5griDept#,aname,d_name,office#) RR2 (Proji,Mgr#,p_name,startdate,enddate) RR3(soc sec, Proj#, Dept#, hiredate, ename, salary, title) sname, RR4(Supp, RR5(p#Proj#,Sqgg, phone) guant, date) (Keys underlined) and relations, relations with the effect of resulting in the same the appropriate be key will 0, decision rules beeing null. In order have point to employ the various values for (ie: in some of decision rules, we the decision definition phase) need to variables. At these values must this be user-supplied. For our purposes, we zan leal with some of the aggregations CHAPTER VII of decision variables, and allow PAGE 138 the SDS these to split All aggregations into detailed decision variables as needed. user-supplied values, which do not have decision variables will be 0. know the Suppose we the use following about of the data base: a) RR4 is usually accessed on sname, b) RR2 is usually accessed on pname c) RR1 is usually accessed either or dname, on mname equally often on each d) The enddate of a project after the stirtdate. (in RR2) (All is 3 months always projects run for 3 months.) of RR3 has exactly 46 possible values, and is not e) title expected to change. It is also seldom accessed. For (a), (b) and (c), the SDS should consider indexing the relevent attributes. For (d), virtualizing of enddate is possible, and for (e) encoding of title is possible. We will set up the following decision variable use in further explanation: values for CHAPTER VII $lrd(RR4,s~name)*se* = 10 $1rd(RR2,p_name)*sa* = 10 PAGE 139 $lrd(RR1,aname)*se* = 5 $lrd(RR1,a_name)*ce*n $lrd(RR1,d_name)*se* $1rd(RR1,d_name)*c*e*n = 5 = 5 = 5 All other decision variables have initial values of 0. Other pertinent data is: $5cy(RR1)r = 60 $5cy(RR2)r = 123 $5cy(RR3)r = 2000 $5cy(RR4)r = 75 $5cy(RR5)r = 25000 $6(title)q = 46 $6(sname)q = 10 $6(pname)q = 89 $6(mname)q = 58 $6(dname)q = 60 $4bfe = 25 $4bfx = 350 CHAPTER VII PAGE 140 $4io = 0.0012 $4opc = 0.005 $4t = 1 $4sc = 1.6 x 10**(-5) 7.1 VirtualiZing Decisions. Suppose: $1rd(RR2,enddate)*s**=1, $1rd (RR2,enddate)*c**n=1, $2rd(RR2,enddate)r**=2, and $2rd(RR2,startdate)u**=5 . Using decision rule 5.3.1 of Chapter V: Cost(making domain real), since there is cost(use if and cost(virtualizing) are both 0, no dati yet in real)= the system. (1+1).(123/25) x 0.0012 + 123 x 0.005 = 0.63 cost(maintaining if real) = 5 x (0.0012 + 0.005) = 0.03 cost(final) = (123/25 x 0.0012) + 123 x 0.005 = 0.62 cost(use if virtual) = (0.62 x 2)+(0.62 x 2) CHAPTER VII PAGE 141 = 2.48 Applying the decision rule of 5.3.1, we have: Make the domain real if: (0.63 + 0.03) < (2.48). In this case, the domain would be made real. (Note that the main reason for this is the fact that the domains is used as a qualifier, thus reqairing a linear search of the relation to compute, and then test the value of the domain. 7.2 Encoding Decisioas. The candidate here for encoding is 'title' in RR3. Suppose: $1rd(RR3,title)***** = 1, and $2rd(RR3,title)jr,ul** = 2. Then, applying the decision rule 5.3.2 of Chapter V: cost(encoding) and cost(decoding) are both 0, since there is as yet no data in the data base. cost(use if encoded) = 2.(1+2).(0.0012+0.005) = 0.04 cost(use if unencoded) (1+2).(0.0012+0.005) = 0.02 PAGE 142 CHAPTER VII cost(extra storage) = ( (4 x 2000) - (6+(8 x 46)+100) ) x (1 x 1.6 x 10**(-5)) =.12 the decision rule Thus, using of 5.3.2, encode the domain if: < (0.12 + 0.02). (0.04) In this case, 7.3 of RR3 would be encoded. 'title' Indexing Decisions. For each of the domains used as qualifiers with a <qualifier type> of 'e', the creating an domain. SDS would index (if one did evaluate the desirability of not already exist) for that If an index exists, the SDS would determine whether it is still needed. We shall only follow one case here; namely, for RR2. Applying the decision rule 5.4 of Chapter V: cost(storage) = (50+(8 x 123)+(4 x 89)) x ((1.6 x 10**(-5)) x 1) pname in PAGE 143 CHAPTER VII = 0.02 cost (projected use with index) = x 10 x (123/89) (0.0012+0.005) x 0.0012 + 10 x (123/350) = 0.09 cost(projected use without index) = 10 x (((123/25) x 0.0012)+(123 x 0.005)) = 6.21 cost(building index) = 0.005 + 0.0012 x((123/25)+(123/350)) = 0.01 Thus, the decision becomes: Build an index for ths domain if: (0.02 + 0.09 + 0.01) which would result in < (6.21) a decision to build an index for p_name of RR2. No permanent join, or derived relation decisions are made for reasons outlined above. Once the system has been 3perational for a while, and values have been generated for the different decision variables, CHAPTER VII an invocation of similar way, to PAGE 144 the SDS would make similar in would, It the examples above. decisions in a addition, make permanent join decisions (where applicable) and derived decision rules as specified by relation decisions, 5.6 and 5.7 of Chapter V respectively. This scenario has demonstrate the manner would be applied. other scenarios presented in application simple which the various The reader is and a invited other decision similar to that employed here. to decision rules to experiment with rules in a fashion CHAPTER VIII PAGE 145 Conclusion. What has been presented here is: . a methodology data base for pseudo-optimization of a for the type of use currently being made thereof. This is done by the SDS. a . for procedure pseudo-optimaztion of request handling, by the RDS. The phrase 'pseudo-optimal' is used in word 'optimal', since the decision largely heuristic, preference to the rules presented here are and as such may well not be optimal in the accepted sense of the word. The SDS is driven by a collection (maintained by the RDS) and the pseudo-optimal, variables a collection of truth functions which are used in SDS decision rules. is of decision The output of the SDS third-normalized data base structure. The RDS is driven primarily by a set of truth functions, does make some use of decision variables. RDS is: but The output of the PAGE 146 CHAPTER VIII . updated decision variables, set-theoretic of collection a . operators to best satisfy the request. The decision rules in both the to permit modularized rules rules, generally be by connected modularity of Furthermore, the case, degenerate may well major feature to into boolean the decision decision rules without effects on other specific, and highly implementation this will of particular replacement rules, and parts of decision parts of the subsystem. are highly SDS and the RDS all decision rules are it is envisioned that as generalized decision of specific a summation variables. rules presented allow for easy replacement As the such, here may be a of those parts of rules that are appropriate for other implementations. Perhaps the most here is outstanding feature of the the dynamic approach taken nature of both the SDS and the RDS. To date, this has certainly not been true of subsystems used to aid in the design of the data base, and only rarely in the PAGE 147 CHAPTER VIII request-handling function. ** In general, in a be stated had to queries have way that inherently specified the procedural steps to be taken in its handling, and structuring decisions have always been made by position of a data someone in the (or group) person decision model, well might be base administrator. This aided but such structuring, and restructuring decisions by some type of more importantly were never made dynamically by the system. Various algorithms were developed for of optimizing performance, aiding in the process the most major of which is that in Chapter VI for pseulo-optimizing request handling. This work can complete in certainly any sense. not, nor It is does it, merely to claim to demonstrate be a methodology for approaching the arena of automated decision subsystems. As such, there are many areas into which forays must be made before such subsystems become complete. IBM's San Jose Research Center has designed a query language called 'Sequel' which does quite elaborate dynamic request handling optimization. See (8). ** PAGE 148 CHAPTER VIII Future Research. here presented to situations. other the methodology be to apply first step should Perhaps the only the was XRN implementation considered here. There is also a problem that arises from the way that is rather analogous is physically implemented in a to the attempt little system. the user interface that There separation and (or need) is, to however, the SDS fact that XRM sees. As such, there separate these aspects a should clear be need for was of the such a broken down into two in fact, the distinct parts to handle: and . the logical structure, . the physical structure. It might seem that virtual relations logical system structure, the fact that a row, whereas some that an entity but further reflection will reveal real ralations can be in a variety of ways. XRM are, physically implemented treats, and stores each entity as systems are more column-oriented, consists of a value from each in column. It is CHAPTER VIII PAGE 149 implement a relational system even possible to in a system of the IMS variety. In this throughout this thesis in may, relations used real regard, the (third-normalized) physically be fact The SDS should be expanded implemented in a number of ways. to reflect this aspect of data management systems. There were also places in the of the partial inaccuracy These modifications, be made to those though, to body of this thesis where the decision rule was pointed out. as well as many other refinements could rules presented recognize here. when fine-tuning improvements, and when the is important, It will yield major benefits are substantially below the costs of such efforts. It is felt that the decision rules presented here are of the type that might affect system magnitude, particulariy in cases time. Fine tuning these rules only a few percentage points. that research conducted performance by many orders of where usage might affect Perhaps this in this vein changes over performance by should indicate attempt to first PAGE 150 CHAPTER VIII accomplish the fine tuning is orders-of-magnitude attempted. improvements before any PAGE 151 References 1) Donovan, J.J., aystem Programming, McGraw Chapter 7, Hill, 1972. Construction C 2 !giler Gries, David, 2) Computrs, John Wiley & Sons, 1971 For Digital 3) Codd, E.F., 'A Relational Model of Data for Large Shared # 6, June Data Banks', Communications of the ACM, Vol. 13, 1970. Normalization 'Further Cadd, E.F., 4) Relational Model', IBM, San Jose, 1971. of the Data Base 'XRM - An Extended (N-ary) Relational Larie, R.A., 5) Memory', IBM Cambridge Scientific Center, Cambridge, Ma, Jan. 1974 (IBM Report G320-2096) Completeness of Data 'Relational E.F., Cadd, 6) Sublanguages', IBM, Sin Jose, March 1972 (RJ 987) Base 'data Structure Diagrams', Data Base 7) Bachman, C.W., (Quarterly News letter of the ACM-SIGBDP) Vol. 1, #2, 1969. 8) Astrahan, M.M., Chamberlin, D.D., 'Implementation of a Structured English Query Language', IBM Research, San Jose, Oct. 10, 1974. Bibliography Examination 'An 1) Ackermann, R.C., Prototype Information System', Masters of Management, MIT, June 1973. PAGE 152 and Modelling of a Thesis, Sloan School 'Data Structure Diagrams', Data Base 2) Bachman, C.W., (Quarterly News letter of the ACM-SIGBDP) Vol. 1,#2, 1969. 3) Bachman, C.W., 'The Data Base Set Concept; Its Usage and Internal Systems, Honeywell Information Relaization', Report, Jan. 31, 1973. 'Redacing the Retrieval Time of Scatter Brent, R.P., 4) Storage Techniques', Zommunications of the ACM, Vol. 16, #2, Feb. 1973. 5) Buchholz, W., Systems Journal 2, 'File Organization and Addressing', (June 1963) pp 86-111. 'Some Approaches 6) Burkhard, W.A., Searching', Communications of the ACM, 1973. IBM to Best-Match File #4, April Vol. 16, and Selection of File 'Evaluation Cardenas, A.F., 7) A Model and System', Communications of the Organization 8) Chamberlin, D.D., Gray, 16, #9, Sept. 1973. ACM, Vol. J.N., Traiger, I.L., 'Views, Authorization, and Lacking in a Relational Data Base System', IBM Thomas J. Watson Research (RJ 1486). Center, Dec. 19, 1974 Chapin, 9) Techniques', 1966. Organization of File Comparison ' N., Proceedings ACM, 24th National Conference, Feature Analysis Committee, CODASYL Systems 10) Generalized Data Base Manaement Systems, ACM, May 1971. of 11) CODASYL Systems Committee, A Survey of Generalized Dita ganageent Systems, ACM, May 1969. Base Model for Large Shared Data E.F., 'A Relational 12) Codd, Banks', Communications of the ACM, Vol. 13, #6, June 1970. Data Base 'Further Normalization of the Data 14) Codd, E.F., IBM Research, San Jose, 1971. Model', Relational Base Completeness of 13) Codd, E.F., 'Relational Sublanguages', IBM Research, San Jose, Mar 1972. PAGE 153 Bibliography 'Seven Steps to Rendezvous 15) Codd, E.F., San Jose, Jan 17, 1974. Research, User', IBM 16) Codd, Tutorial', with the Casual E.F., 'Noraalized Data Base Structure; A Nov. 1971. IBM Research, San Jose, Brief 'A Data Base Sublanguage founded on the 17) Codd, E.F., Proc. 1971 ACM-SIGFIDET Workshop, Relational Calculus', 1972. 18) Collmeyer, A.J., Shemer, J.E., 'Analysis of Retrieval Performance for Selected file Organization Techniques', Fall Joint Computer Conference, 1970. 'Elements of Data Management 19) Dodd, G.D., Computing Surveys 1, June 1969. Systems', ACM 20) Donovan, J.J., Sjgsters Prggramming, McGraw-Hill, 1972. 'Virtual Schutzman, H., Madnick, S., Follinus, J., 21) Sloan School Working Information in Data Base Systems', Paper, Sloan School of Management, MIT. 22) Frank, R.L., Yamaguchi, K., 'A model for a Generalized Data Access Method', National Computer Conference 1974. The Consecutive S.P., 'File Organization: 23) Ghosh, Retrieval Property', Communications of the ACM, Vol. 15, #9, Sept 1972. Updating Mean and Standard 24) Hanson, R.J., 'Stably Deviation of Data', Communications of the ACM, Vol. 18, #1, Jan 1975. 25) Hsiao, D., 'A Formal System for Information Retrieval from Files', Communications of the ACM, Vol. 13, #2, Feb 1970. 26) Huang, J.C., 'A Note on Information Organization and July #7, Vol. 16, Communications of the ACM, Storage', 1973. Approaches 'Some 27) Langefors, B., Information Systems', BIT(3), 1963. to the Theory of 28) Langefors, B., 'Information System Design Computations using Generalized Matrix Algebra', BIT(5), 1965. File Structures 29) Lefkovitz, D., 1969. Washington, Press, Spartant for On-line Systems, Bibliography PAGE 154 of Data Base Characteristics 30) Lowe, T.C., 'The Influence and Usage on Direct Access File Organization', JACM, Vol. 15, #4, Oct. 1968. 31) Lowenthall, E., ' Functional Approach to the Design of Storage Structures for Generalized Data Management Systems', Ph.D. Thesis, U. Texas, Austin, Aug. 1971. 32) Luz, Indices', 1970. V.Y., 'Multi-attribute Retrieval with Combined Nov. Communication of the ACM, Vol. 13, #11, 'Key-to-Address Dodd, N., Yuen, P.S.T., 33) Lum, V.Y., Transformation Techniques: A Fundamental Performance Study on Large Existing Formatted Files', Communications of the ACM, Vol. 14, #4, Apr 1971. S.E., 34) Madnick, McGraw-Hill, 1974. Donovan, J.J., operating Systems, 'Toward the Automatic Design of Data W.A., 35) McCuskey, Processing Information Large Scale for organization Systems', Ph.D. Thesis, Case Western, Jan. 1969. Design of Automatic 'On McCuskey, V.A., 36) Organization', Fall Joint Computer Conference, 1970. Data 37) Mullin, J.K., 'Retrieval-Update Speed Tradeoffs Using Combined Indices', Communications of the ACM, Vol. 14, #12, Dec. 1971. 'Attribute Based File 38) Rothnie, J.B., Lozano, T., Organization in a Paged Memory Environment', Communications 17, #2, Feb 1974. of the ACM, Vol. 39) SagamangJ.P., 'Aitomatic Selection of Storage Structure Masters Thesis, in A Generalized Data Management System', UCLA, 1971. 40) Severence, D.G., #Identifier Search Mechanisms: A Survey and Generalized Model', Computing Surveys, Vol. 6, #3, Sept. 1974. 41) Schachat, I.J., 'A Parameterized Model for Selecting the Multi-attribute Retrieval in Optimum File Organization Systems', Masters Thesis, Sloan School of Management, MIT, June 1974. 41) Shneiderman, B., Scheuermann, P., 'Structured Data Bibliography Structures', 1974. communications of the ACM, PAGE 155 Vol. 17, #10, Oct. Data Base Reorganization 'Optimum Shneiderman, B., 42) Vol. 16, #6, June 1973. ACM, the of Communications Points', 'A Stochastic Model for the Evaluation of 43) Siler, K.F., Large Scale Data Retrieval Systems...', Ph.D. Thesis, UCLA, 1971. Wallace, R.M., 'Janus: A Data Management 44) Stamen, J.P., far the Behavioral Sciences', Cambridge System and Analysis Project, Cambridge, Mi. 'Self Organizing Data 45) Stocker, P.M., Dearnley, P.A., Vol. 16, #2, Journal, Computer Management Systems', The 1973. 'A Methodology for Comparison of File 46) Winkler, A., organization and Processing Procedures for Hierarchical Storage Structures', Ph.D. Thesis, U. Texas, Austin, Aug. 1970. 'Management Information Systems - A 47) Ziering, C.A., Comparison of the Network and Relational Models of Data', Masters Thesis, Sloan School of Management, MIT, June 1975. PAGE 156 Appendix 1 with deals appendix This the third of procedure normalization. presented here is The alogorithm detail that functions the of truth by a set driven mutual and functional- dependencies existing in the data. Notice that computational gt zonsidered in dependencies are presented here. the alogorithms The issue of computational dependency is as to relevent decisions in. any of belongs a domain which relation not We proceed now with the algorithm. 1 Apply transitivity to all mutual dependencies if Ie: VIaI,IbI,IcI $8m(IaI,b)=1 and $8m(IbI,cI)=1 then set $8m(IaI,IcI)=1. Combine all functional with dependencies first the same list, and remove those with duplicate first lists. Ie: VIaI,Icl jal=IcI, set where S8f(IcI,Idj)=1, domains $8f(IIIdI)=O. to include the 2 and and set set $8f(aI,e)=1 where teI=tbIUjdI, $8f(lal,IbI)=0 and dependent $8f(IaI,IbI)=1, Expand functionally functionally dependent domains of all mutually dependent (sets of) domains. Appendix 1 $8z (I a Isbi where Vialsibi Ie: PAGE 157 )=1, $8f(laisiri)=1 and $8f(ibijsis)=1: modify Irl to jr'I, and is1 to is'I where ir1I=Is'l= jrl U ist $8f(aI,ir I)=1 and $8f(IbIis'1)=1. This yields: 3 Remove dependencies on 2artial candidate keys. (Underlined domains are domains is of primary keys; if underlined set of domains is relation, then any one underlined in than one set more each a candidate key.) where $8f(jal, IbI)=$8f (Icl,IdI) =1: VIbisIdi fbi N Idj=ibt and jai N IcI=IaI then : If - Idj=idJ Now and lxl=lIxl - 9bi V1xl such that IxI)=1 and $8m(IrI,Ic!)=1. $8f(jr, 4 ibi, set up relations in third forms normal in the following two steps: 4.1 VibIIdI where $8f(IalbI)=1 and $8f(Icl,1 dl)=1: Does some relation already set up contain both jal and Ibi, or both Ic! and Idj ? yes: then continue mark that with a functional dependency different one. Ie: as 'done', Find some and other Appendix 1 PAGE 158 JalIbIIcl and idi. d=0 ? (0) No: Is Ibi N set and RRi(Jaj,lbI) relation up dependency $8f(lal,b)=1 relation', not 'have Yes: If $8f(ja,IbI)=1 does (1) mark then functional 'have relation', as 'done' and and continue (2) V(Ibl N IdI)/g0' do the following: No: Is $8m(lal,IcI)=1 ? (3) Ie: set up relation RRi(IAI, "c1, lxi) where Ixl=iblUldi Mark the functional dependency $8f(IcidI)1 as 'done' of $8f(iaIIbI)=1 and $8f(Icl,dI)=1 and both as 'have relation'. (4) No: is Ibi N set up 3 relations: (5) Yes: RRi(lal, lbi), RRj(lI, Idi), RRk (12i) where lei is If c=P ? and leIlbI N idi already in some RRm, m/i and m/j, then delete RRk. Mark those functional dependencies as 'done' relation'. (6) No: is (7) Yes: Is IcJl=cl N Ibi ? there is transitive dependence. $8f(IaI,IbI)=1 'done' ? and 'have Appendix 1 (8) !2: set PAGE 159 1b lb' i where 4.1 for restart step - lb'l=b1 Idi and dependency this functional set i&.: (9) set strike all lbl=lb'l where lb'l=lbl - domains Idi from Idi ad: set up the relation $8f(taJ,lbl)=1. Restart for functional dependency step 4.1 for that functional dependency. (10) establish relations: No: Ibi) and RRi(aji, RRJ(ICi-Pldi) those functional Mark as 'done' dependencies and 'have relation'. 4.2 Vlal,IbI (created (11) N2: is where $8m(Ial,Ibl)=1, in step (4.1) both jai containing some relation there and Ibi, eg: set up relation RRk (Jai, ibN) (12) Yes: no action 4.3 For all relations $8m(laI,Ib)=1 RRi. for any RRi created lal,lbl*RRi, then Examples are presented below described above. in Decision points in 4.1 and 4.2, remove Ibi if from to clarify the procedure (4.1) and (4.2) above Appendix 1 PAGE 160 the examples that follow. have been numbered for use in instances of 'dpn' in the examples mean 'decision functional expressing point n'. the examples is Other notation that will appear in that for dependencies mutual and All diagramatically rather than in the form of truth functions. will '->' (A,B)->(C) means that '< --- > For example C is functionally dependent on A and dependency. example, dependency. functional imply implies mutual (AB)<--->(CD) means that (A,B) and For (C,D) are mutually dependent. example 1 or: (P) <---> (0, S (P) ->R or (Q) -R or $8m(IPI,Q,SI)=l $8f(IPI,IRI)=1 :$8f(IQIrlR I)=1 Step 1: No actio n Step 2: set up (d) Step 3: Using (b) thus, no action. $3f(IQ,SIIR1)=1 and (c): IRI N IRI=IRI, and IPINIQI=W Appendix 1 Using (b) IRINIRI=IR I andtPjNIQ,Sj=Q and (d): 161 PAGE thus, no action. Using (c) JR1 N IRI=IRI and IQINIQ,SI=IQI thus: and (d): liL in IRI=IRI-IRI=0 which means that (d) becomes $8f(IQ,SI,<null>), which must be 0 (See pagexx Chapter IV) Also: $8m(IPjIQ,SI)=1, so strike IRI from (b) as well. We now have: a) $8m(IPIJQ,SI)=1 c) $8f(IQIIRI)=1 (d) are 0. Both (b) and Since (c) is Step 4.1: < N IbI=0 VIbI the only functional dependence, IRI (c) take dpi: RR1(Q, R) Step 4.1 complete. Step 4.2: Since $8m(IPIIQ,SI)=1, relation, take dpll and set up: RR2(P, %2) Thus have relations: RR1(Q, R), RR2(P, Step 4.3: and 9Al) No action and RR1 is the only Appendix 1 PAGE 162 Example 2 i) (A,B,C)<--->(D,E) ii) (D,E)<--->(GK) or: $8m(ID,EI,IG,KI)=1 iii) (A,BC) -- >(YZ) (G,K) iv) or: $8f(IG,K I,IXI)=1 $8m(IBCjl,13,Kj)=1 (v) 2: lai=IA,B,CI and for becomes: get or: $8f(IA,B,CI,IY,ZI)=1 Apply transitivity to get: Step 1: Step -- >(X) or: $8m(IA,B,CIID,EI)=1 $8f(IA,B,CIIY,ZI)=1 (vi) Ibl=ID,EI (i), (ie: no change), and $8f(ID,EIIY,EZ)=1 For |al=ID,EI and Ibj=IG,KI from (vi) becomes $8f(ID,8I,IY,Z,XI)=1, (iv) becomes (ii), and $8f(IG,Ki,IX,Y,ZI)=1 Por Iai=IA,B,Cl and (iii) (iii) from becomes JbI=IG,KI from (v), $8f(IA,B,ClIY,Z,XI)=1, unaffected. We now have: i) $8m(IA,B,ClID,E)=1 ii) $8m(ID,EIIG,KI)=1 V) $8m(IA,B,CIGLK)=1 iii) $8f (IA,B,C I, If,3,X1) =1 and (iv) Appendix 1 iv) $8f(IG,KIIX,Y,ZI)=1 vi) $8f(IDEIJY,Z,XI)=1 Step 3: Since IA,B,Cl N )G,KI=W, N ID,E=W IA,B,CI and PAGE 163 IGKI N ID,EI=O no action in step 3. Using (iii) Step 4.1 get lal=IA,B,Cl, and (iv): IcI=IG,Ki, $8m(IaI,IcI)=1 from Ibi N jdI/0, so take dp2. set up RR1(AxBsg, G!K, IbI=IY,Z,XI and Idl=IX,Y,ZI X,Y,Z), and mark (iv) (v) so dp3. as 'done' and 'have relation'. Mark (iii) as 'have relation'. Using (iii) and (iv): IcI=ID,EI, IbI=IY,Z,X jaJ=jA,B,Cj, and Idi=IY,Z,XI. Ibi N IdI$0 $8m(Ial,IcI)=1 from so take dp2. (i), so take dp3. Note that (iii) 'have relation', so simply add Icj and id'I=IdI N JbI to RR1 Ie: get RR1(APC, §gK, D, Mark (vi) as 'done' and 'have XY,Z) relation'. Since there are no further functional dependencies that are 'not done', proceed to next step. Step 4. 2: Since for each truth function of the form Appendix 1 $8f(laJ,ibi)=1 (i), (ii) and (ie: relation (viz: RR 1) zontaining lal PAGE 164 (iv)) exists some there and Ibi, take decision point (12). Step 4.3: No action Thus we have relation: RR1(AgeC, B G1jA, X,Y,Z) Example 3 or: $8f(IA,B,Ci,,i D,E,FI)=1 (A,B,C) -- >(D,E,F) i) -- > F ii) (DB) or: $8f(ID,EIIFI)=1 Step 1: No mutual dependencies, so no acti on. Step 2: No mutual dependencies, so no acti on. Step 3: and Idj=FIj, for ibI=ID,EI Ibi N IdI=9, so no action. 4.1: Step From and (i) (ii) : lal=IA, B,CI, lbl=ID,E,FI and idl=IFI. Ibi N ldl/0 so take dp2. $8m(Ial,lcj)=O so take dp4. Ibi N jcl/0 so take dp6. tcl N lbl=ID,El = Ic (i) is 'not done', (i) becomes so take dp7 so take dp8. $8f(IA,B,CI,D,E)=1, and (ii) is unchanged. Icl=ID,EI, Appendix 1 PAGE 165 Restarting Step 4.1: from (i) and (ii): ja=jA,B,CI, icl=ID,Ei, IbI=ID,E and Idj=jFj. Ibi N Id1=0 so dpl. Set up realtion RR1(Aegj#, DE), and mark (i) as 'done' and as 'done' and 'have relation'. From (ii): Ibi=IFi, lal=ID,EI, IcI=Idj=<null>. JbI N IdI=0 so take dpl. Set up 'have relation RR2(QLa, F) and mark (ii) relation'. Step 4.2: No action Step 4.3: No action So we have relations: RR1(A,B,, DE), and RR2(Qgg, F). Example 4 i) (A,B) -- > C ii) (D,E) -- >(CF) Note that or: $8f(IA,BI,ICI)=1 or: $8f(ID,EI,IF,Cl)=1 (A,B)<--/-->(D,E) ie: not mutually dependent. Step 1: No mutual dependencies, so no action. Appendix 1 Step 2: No mutual dependencies, PAGE 166 so no action. Step 3: No action Step 4.1: from (i) and (ii): lal=IA,BI, IbI=IC i, IcI=D,EI and IdI=IF,CI. Ibi N IdIj0, so dp2. $8m(IA,B,CI,ID,EI)=O, so dp4 Jbi N Icl=0 so dp5. Set up relations: RR1(AgB, C), RR2(ag, F,C) and RR3(C). Mark (i) and (ii) as 'done' and 'have relation#. Step 4.1 complete, since all are 'done'. Step 4.2: No action. Step 4.3: No action Thus, we have relations: RR 1 (.A RR2(Dja, , C),r F,C) and RR3(C). This concludes the examples. Appendix 2 In the examples and definitions PAGE 167 that follow relation names of the form : 'R<i>'. This is we will use for convenience only; any character string may be used for a relation name. Notation. R<i> is the name of the i th relation < means 'is a member of' a list, J....1 implies or set of the items between I ' s. c(i) is the cardinality (number of entries) in R<i> n (i) is the degree (number of domains) in d(i,j) is the v(m) (i,j) is t (i) is ie: th domain of R<i>, (v (a) t(i) is j=1,..n(i) the m th value of d(i,j), an n(i)-tuple in a L (jai) j (i, R<i> m=1,..c(i) R<i> 1),v (a) (i,#2),..v(a) (i ,n (i)) 1,...c(i) the length of list a 3 is the null set - ie: R<i>=W implies c(i)=O atb means a is a subset of b (a=b is legal) aC-b means a is a 2r2p2 subset of b (afb) Va means for all valaes of a Examples the Appendix 2 The following examples will be PAGE 168 used throughout this section to explain deifnitions. (NAME, SOCS EC, P HONE, DEPT#) 213-07-1666, 232-1500, 15), (Donovan, 621-49-2990, 6 17-1400, 15), (Granger, 413-00-0299, 536-5176, 6), (Smith, 839-41-6942, 253-1410, 6)) (NAME, SOC SE C, PHONE, 217-- 1-7322, 253-6671, 15), (Smith, 213-07-1666, 232-1500, 15), (Donovan, 621-49-2990, 617-1400, 15)) R1=( (Smith, R2=( (Madnick, AGE, CITY) R3=( (Madnick, 31 Peabody), (Donovan, 34, Ipswitch), (Smith, 23, Boston)) (PERSON, (NAME, R4=( (Smith, PHONE) 232-1500), (Donovan,617-1400)) DEPT#) PAGE 169 Appendix 2 R5=( (PERSON, AGE, CITY, (Madnick, 31, Peabody, 18), (Donovan, 34, Ipswitch, 43)) STREET_#) Definitions Symbol: 1) Unjion U (j=k is valid) R<i>=R<j> U R<k> Format: c(i)=c(j)+c(k)-c(Rj N Rk) n (i)=mat(n(j) ,n(k)) R<i>= It(i) < R<j>, OR R5 = R1 U R2 Example: R5=( :t(i) < R<k>l t(i) would yield: 213-07-1666,232-1500,15), (Smith, (Donovan,621-49-2990,617-1400,15), (Granger,413-00-0199,536-5176, ,839-41-6942,253-0410, (Smith 6), 6), (Madnick,217-61-7232,253-6671,15)) 2) flatersection Format: R<i> = R<j> N R<k> (Note that if R<i> = Symbol: n(i) = n(j) = (i=j=k is valid) n (j)/n(k), then B<i>=V) : t(i)(j It(i) N n(k) ANp t(i)<kI Appendix 2 Example: R6=( R6 = R1 N R2 PAGE 170 yields: (Donovan,621-49-2990,617-1400,15), 213-07-1666,232-1 500, 15)) (Smith, Format: (Note: If - Symbol: 3) Difference R<i> = R<j> - R<k> n(j)/n(k) then: n(i)=n(j) c(i)=c(j) ) R<i>=R<j> n(i)=n(j)=n(k) - c(k) c(i)=c(j) c(R<j> N R<k>) - R<i> = it(i) : t(i)*R<j> AND t(i)fR<k>I Example: R6=( R6=R1 - R2 (Granger,413-00-0029,536-5 176, 6), 839-41-6942,253-0410, 6)) (Smith, 4) yields: Cartesian Product Symbol: (Sometimes called a 'Cardinal Format: (Note: if must R<i> = R<j> X R<k> n(j) > be treated n(j)=n(k)=1. ) n (i)=n(j)+n(k) X Product') (j=k is valid) 1, or n(k) > 1, then as a single domain, each t(j) so that (or t(k)) effectively Appendix 2 PAGE 171 c (i)=c(j) .c(k) Va jji a set of ordered pairs with first R<i> = I (v(a) (j,1),v(b) (k,1)) ie: R<i> is Vb * k, member from R<j> and second from R<k>. R5=( yields: R5 = R4 X R4 Example: ((Smith ,232-1500), (Smith ((Smith ,232-1500), (Donovan,617-1400)), ,232-1500)), ((Donovan,617-1400),(Smith ((Donovan,617-1400), 5) Projection (Donovan,617-1400)) ) P Symbol: R<i> = R<j> P Format: ,232-1500)), (d(j,1)), 1 11,2,...n(J)I n (i)=L(l) redundent that (Note c(i)=c(j) automatically deleted as proposed in entries Example: R5=( d(j,l) : 1 1,2,...n(j) yields: R5= R2 P (NAME,PHONE) (1adnick,253-6671), (Smith, 232-1500) (Donovan,6 17-1400) 6) Join Format: Symbol: R<i> = ) * R<j>((d(jl))) 0 ::= > I < I = |,@ * not some versions. Use the 'compaction' operator to remove redundent entries.) R<i> = are R<k>((ed(km))) Appendix 2 15Ca1 1, 2,. .. n (J) m 41l 1,2,....n (k)| and d(j,l) and d(k,m) must be PAGE of the same data type 172 (ie: must be joinable). n (i)=n(j)+n(k)-1 but we ignore is duplication when a not '=', There is '='. (g duplication of the join domain when a that rare case here.) c(i)=c(j)+c(k)-c(v(a) (j,1))=v(b)(k,m)), a=1,..c 0) b=1,..c(k)) R<i> = Vb * j, Va * k, I d(jb),d(ka) v(g) (j,1) a v(d) (k,m); Ig 4 j, Example 1) yields: AGE,CITY) DEPT#,NAME, PHONE, (217-61-7232,253-6671, 15, Madnick, 31,Peabody), (213-07-1666,232-1500, 15, Smith, 23,Boston), Donovan, 34,tpswitch) (621-49-2990,617-1400, 15, Example 2) R6=( Vd * ki R6=R2(NAME) * R3(=,PERSON) (SOCSEC, R6=( but afam R6=R3(CITY) (NAME, AGECITY, (Madnick, 31, * R4(>,NAME) Peabody, 617-1400), this example makes included simply to illustrate 7) Composiio Format: yields: PHONE) (Donovan, 34, Ipswitch,617-1400) (Note that ) no intuituve sense; the use of 1* Symbol: R<i> = R<j>(I(j,l)) ) . R<k>(d(ka)) when 8 / it was '=' ) Appendix 2 PAGE 173 1 * I 1,...n(j) S* 1 1,...n(k) d(km) d(j,l) and (ie: of the must be joinable same data type) n(i)=n(j) + n(k) -2 c(i)=c(j)+c(k)-c(v(a) (j,1) = v(b) (km); a=1,...c(j); b=1,...c(k)) ( R<j>(d(jl)) R<i> = Vb<j, (ie: * ) P (d(j,b)); R<k>(d(k,m)) except b/l remove domain d(j,1) R<j> on which and R<k> were joined.) Example: R5=R2(NAME) (SOCSEC, . PHONE, yields: R3(PERSON) DEPT#,AGE,CITY) R5= ( (217-61-7232,253-6671,15, 31,Peabody), (213-07-1666,232-1500,15, 23,Boston), (621-49-2990,617-1400,15, 34,Ipswitch) 8) Permutation Format: Symb3l: ) M R<i> = R<j> M (d(j,l)); 1= 1,...n(j) n (i)=n(j) c(i)=c(j) The only effect of this operator is to re-order the domains in a relation. Example: R5 = R3 M (PERSONCITY,AGE) yields: Appendix 2 PAGE 174 R5= ( (MadnickPeabody,31), (Donovan,Ipswit:h,34), (Smith, Boston, 23) 9) Compaction Format: Symbol: C R<i> = C (R<j>) (i=j is valid) n (i)=n (j) R<i> = I t(b) :t(b)/t(a) This operator simply ; a/b I (Qg: R<i>=R<j> N R<j>) removes all redundent entries from a relation. 10) Restriction Symbol: 10.1) Diadic restriction: Format: R<i>=R<j>(d(j,1)) R R<k>(e,d(k,m)); 1 e1,...n(j)j a iS1,1,...n(k) where: L(1)=L(m), and n(k) <= n(j) then n(i)=n(j) a ::= > R<i> = I < I = I ,' It(j) : v(a) (j,f) 8 v(b) (kg) VfIl, Vgfm, Va~j, Vb4k a=1,...c(j); Example 1) R6 = b=1,...c(k) R2(NAME,PHONE) R yields R6=( (Smith ,213-07-1666,232-1500,15), (Donovan,621-49-2990,617-1400,15) I R4((=,NAHE),(=,PHONE)) PAGE 175 Appendix 2 Example 2) R6=( R6=R2(PH3NE) R R4(>,PHONE) yields: (Madnick,217-61-7232,253-6671,15), (Donovan,621-49-2990,617-1400,15) (Note: t(1) of R6 appears ) because 253-6671 > 232-1500. the fact that 253-6671 < 617-1400 does not affect this.) 10.2) Monadic restriction: Format: R<i> R<j>(1(j,1)) = R (9,d(jm)) 1 40I1,...n(j)I a SIl1,...n (k)I L (1)=L(m) 0 ::= > I < I = 1 ,' n(i)=n(j) R<i> = It(j) : v(a) (j,f) 9 v(b) (j,g), Example: R6=( 11) R6 = R10(A3E) R (<,STREET#) f41,Yg4m,Vab 4 j I y ields: (Donovan,34,Ipswitch,43) Symbol: Division Format: R<i> = / R<j>(d(j,1)) / R<k>(d(km)) ; 1Ea1,...n(j) I assl1,...n(k) 1 This operator is the inverse of the cartesian product; ie: Appendix 2 (R<j> X R<k>) Example: R5 / / R<k> = R<j> Using R5 of R4 = R4 (4) above: PAGE 176 Appendix 3 PAGE 177 Decision Variables. This appendix used that are decision variables lists the throughout this thesis. They are listed in order of rule# as outlined in Chapter IV. Rule 1 i used of relation domain j 1) $lrd(i,j)rsen as the only egui-gualifier (<qualifier type> 'e') in retrieval request; no joins. 2) $lrd(i,j)rsej except join (1), as same involved in resolving request. 3) $lrd(i,j)rcenn domain j of relation several qualifiers, as an equi-qualifier; other of and none joins, i used as one of domains used no as qualifiers had indexes 4) $1rd (i,j)rceni(trss) as (3), same index. qualifier had from resulting except some 'trss' other is size of set with indexes using domains first. 5) $lrd(i,j)rcejn same as (3) except joins involved in resolving request. 6) $1rd(i,j)rceji(trss) same as (4), except joins involved Appendix 3 PAGE 178 in resolving request. 7) $lrd(i,j)rcnnn domain j of relation i used as one of several qualifiers but not as equi-qualifier; no other indexes, domains and used as qualifiers no joins involved had in resolving request. 8) $lrd(i,j)rcnni(trss) same as (4) except domain j not used as equi-qualifier. 9) $lrd(i,j)rcnjn same as (5) j not used except domain as equi-qualifier. 10) 11) $lrd(i,j)rcnji(trss) $lrd(i,j)rnun same as (8) j domain unspecifiel except joins involved of relation qualifier (eg: i used ...j='all') as no joins involved in satisfying request. 12) $lrd(i,j)rnuj 13) $lrr(i)rnun same as (11) except joins involved. unspecified retrieval serial retrieval from relation i; eg: of each entry in relation. No joins involved. 14) $lrd(i,j)rnuj same as (13) except joins involved in request. The same set <reguests> 'u' and 'd'. of 14 and 't, decision variables and is maintained also for <relation for type>s 'v' Appendix 3 PAGE 179 For <request> 'i', only one decision variable is maintained: $lrr(i)inun inserts of entries into relation i; no joins involved. Rule 2 retrieval of only domain j of relation 1) $2rd(i,j)rsn i; no joins involved in satisfying request. same 2) $2rd(i,j)rsj as (1) except joins involved in resolving request. domain 3) $2rd(i,j)rcn j relation i one of several of retrieved; no joins involved. 4) same as $2rd(i,j)rcj (3) except joins involved in satisfying request. 5) $2rd(i,j)r(<aggr>)sn domain j some single <aggr> of retrieval of of relation i; no joins involved in request. 6) $2rd(i,j)r(<aggr>)sj same as (5) 7) $2rd(i,j)r(<aggr>):-n retrieval one 3f them domain except joins involved. of several aggregations, j of relation i; no joins Appendix 3 PAGE 180 involved. 8) $2rd(i,j)r(<aggr>)Cj sames as joins (7) except involved. retrieval of whole entry from relation 9) $2rr(i)ren i; no joins involved. same as (9) 10) $2rr(i)rej The sane set <request>s 'u' of decision and '1', except joins involved. variables and for is maintained <relation type>s for 'v' and 'd'. For <request> 'i', only one decision variable is maintained: $2rr (i) ien inserts of entries into relation i; no joins involved. Rule 3. Only one decision variable is maintained of this type: $3rd(i,j)d(k,m) the relation k via domains number of times j relation i and m respectively. joined to Appendix 3 PAGE 181 Rule 4 1) $4io 2) cost per I/O operation cost per cill $4opc to XRM 3) $4bfe number of entries per IRM block 4) number of index entries per block $4bfx 5) $4sc cost per byte per day of storage 6) $4t time period since last SDS invocation 7) $4p XRM blocksize Rule 5 1) $5cy(i)r cardinality of real relation i 2) $5#d(i)r degree of real relation i 3) $5cy(i)v cardinality of virtual relation i 4) $5d(i)v degree 3f virtual relation i 5) $5cy(k)d(<method>) cardinality of <method> is the method derived relation k. of derivation. If the derivation did not include restrictions, then <method>: :=<null>. 6) $5#d(k)d(<method>) degree of derived relation k Appendix 3 PAGE 182 Rule 6 $6 (j)g number of unijue values in domain j Rule 7 $7r user-supplied response-time weight factor. Truth Functions. j $8d (ij) domain $8i(j,k) domain k in relation appears in relation i j is inverted. (For virtual relations, $8i(j,k)=O always.) $8p(i,j) j domain is one of the primary key domains of relation i. relation i is $8x(i)r $8x(i)d(<method>) relation i <method> is did a real relation. is a the method of derivation. involve not derived relation, a and If <method> restriction, then <method>:: =<nu 11>. $8n(i,j) domain j of relation i is mandatory. le: a value must be provided for this domain before an entry in relation i will be made. Appendix 3 $8n(i,j)=1 Note PAGE 183 Vj where $8p(i,j)=1. (Primary key domains are mandatory.) domain j :ontains unique values (eg: socsec_#) $8u(j) $8r(i,j) same as $8n (i,j) name. Also notice $8d(i,j) Thus except that refers to a role it is that $8r(i,j) this is a whether a role name is a subset truth function of that tests in relation i. $8_(j)<data type><starage strategy> <data type>::=<character> I <fixed> I <float> I <vector> I <bit> <character>::= c <fixed>::= x <float>::= f <vector>::= t(<size>) <bit>::= b <storage strategy>::=<virtual> (real < encoded> (real < unencoded> <virtual>::= v <real encoded>::= e <real unencoded>::= u functions is This set of truth domain j. For exmaple, if to test the data type of $8_(name)ce=l then domain 'name' is an encoded character string. Appendix 3 $8f(JmJ,Jn) of is a the PAGE 184 truth function that tests domains in Ini list whether each are functionally dependent on the whole list jal. Note 1) which List Jai is members of not a list of all list Ini are domains on functionally dependent. Each n'4|nl may be functionally dependent on some jxiimi 2) also. If (ie: InI=9 is then empty) $8f(Jmi,jnj)=0. $8m(pj,jIql) is a function that tests whether lists mutually are Jgj Ipi and Ie: dependent. and also $8f(IplPqP)=$8f(IqJIpi)=1, $8m(lpI,Iq) implies $8m(j,jpJ). Transitivity also holds: $8m(Jpi ,Iq)=$8m(qi,IsI)=1 implies that $8m(JpJjsJ)=1. $8c(tpJ,q) (<function>) is a function (note that q is not dependent on lomains is defined as 'q=6.3 dependent on p. which tests whether q a list) Ipl. For example, if * p' then q is (<function>) is required to derive q from the list $8od(k) is computationally is a truth function set domain g computationally the computation of domains Ipi. up for a request. It is '1' if domain k appears as one of the object domains in the request. $8eq(k) is a truth function used in requests. It is '1' if PAGE 185 Appendix 3 domain k appears as a qualifier with <qualifier type> 'e'. $8nq(k) is similar to $8eg(k) except that type> is not 'e'. the <qualifier