

A Review of Database Schemas

David Livingstone

5th December 2005 (30th March 2001)


Introduction

The purpose of this note is to review the traditional set of schemas used in databases, particularly as regards how the conceptual schemas affect the design of the storage of relations. It is sometimes considered that each base relation must be stored ‘as is’, i.e. each logical tuple must be mapped directly to a corresponding physical record. This is not so, and some relational DBMS products already provide some facilities to improve on this. The results of the review lead to a rationalisation of the database schema architecture.

Physical Data Independence

The contents of a base relation (see footnote 1) are physically stored. However the whole purpose of physical data independence is that the data can be stored in any desirable way. It does not need to be stored as the physical equivalent of a table. Any file or data structure or combination thereof can be used and manipulated, so long as the stored data appears to the user as the logical abstraction of a relation.

Another consequence of physical data independence is that a base relation could be stored in terms of other relations. There are three possibilities :-

1. A relation is fragmented into several smaller relations, and the relational fragments are physically stored instead.
2. A relation is merged with other relations into a bigger relation, which is physically stored instead.
3. Some combination of the above two methods.

Whenever one needs the original relation, its contents are created from the contents of the relevant stored relations.

The first possible method is well developed for the design of distributed databases. A relation is fragmented horizontally or vertically. Horizontal fragmentation uses restrictions and/or semi-joins to split the original relation into several sets of tuples (i.e. several relations), such that they can be unioned together to re-create the original relation. Vertical fragmentation uses projection to split the original relation into several sets of attributes (i.e. several relations), such that they can be naturally joined together to re-create the original relation. A fragment can itself be fragmented, either horizontally or vertically, so that the fragmentation can be continued recursively till fragments suitable for storage are obtained. There are design rules governing how the fragmentation should be specified, in order to ensure that the original relations can be re-created from the fragments and to ensure the fragments have desirable characteristics; these design rules are not further considered here.
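As an illustration only, the following Python sketch models a relation as a set of tuples and checks that a horizontally fragmented relation is re-created by union and a vertically fragmented one by natural join. The relation EMP, its attributes, and the helper functions rel, restrict, project and natural_join are assumptions made for this example; they are not taken from any DBMS.

```python
# A relation is modelled as a set of tuples. Each tuple is a frozenset of
# (attribute, value) pairs so that it is hashable. All names are hypothetical.

def rel(*rows):
    """Build a relation from dictionaries."""
    return {frozenset(r.items()) for r in rows}

def restrict(r, pred):
    """Keep the tuples that satisfy the predicate."""
    return {t for t in r if pred(dict(t))}

def project(r, attrs):
    """Keep only the named attributes; duplicates vanish because r is a set."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}

def natural_join(r, s):
    """Combine tuples that agree on all attributes they have in common."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return out

EMP = rel(
    {"id": 1, "name": "Ann", "site": "N"},
    {"id": 2, "name": "Bob", "site": "S"},
    {"id": 3, "name": "Cy", "site": "N"},
)

# Horizontal fragmentation: two restrictions whose union is the original.
north = restrict(EMP, lambda t: t["site"] == "N")
other = restrict(EMP, lambda t: t["site"] != "N")
horizontal_ok = (north | other) == EMP

# Vertical fragmentation: two projections, each retaining the key "id",
# whose natural join is the original.
names = project(EMP, {"id", "name"})
sites = project(EMP, {"id", "site"})
vertical_ok = natural_join(names, sites) == EMP
```

Each vertical fragment keeps the key attribute here; without it the natural join would not in general re-create the original relation, which is one of the design rules alluded to above.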

Footnote 1: All the named relations in a database, whether they be base relations, views or stored relations (as defined later) are actually relational variables, since they permanently exist as relations and their values are allowed to vary during their lifetimes. However traditional terminology is still used here, even though it doesn't always make clear that the relations are variables.


A variation of the first method is also used for single-site databases by some DBMSs - e.g. Oracle and RdB - in that relations can be split horizontally into sub-relations that are separately stored. However the splitting is not done via algebra or its SQL equivalent, but by syntax which is especially developed as part of the SQL storage control facilities. This is in conformance with the principles of physical data independence, in that it makes the fragmentation a purely storage issue and hides it from the SQL query user.

With the second method, relations can be merged together horizontally or vertically. A horizontal merger is derived by unioning together compatible relations, such that the original relations can be re-created via restrictions on the merged result. A vertical merger is derived by a natural join of appropriate relations, such that the original relations can be re-created via projections of the merger. Mergers can themselves be merged together, either horizontally or vertically, so that the merging can be continued recursively till a merger suitable for storage is obtained. Design rules corresponding to the fragmentation rules apply here too.
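A minimal sketch of the two kinds of merger, again in Python with relations as sets of tuples, may help. The relations CUST_UK, CUST_FR, ORDER_HEAD and ORDER_SHIP and their contents are invented for the example.

```python
# Relations are sets of plain tuples; attribute order is fixed by position.
# All relation names and values are hypothetical.

# Horizontal merger: union of two compatible relations (cust_id, country).
CUST_UK = {(1, "UK"), (2, "UK")}
CUST_FR = {(3, "FR")}
CUST = CUST_UK | CUST_FR

# The originals are re-created by restrictions on the merged relation.
uk_again = {t for t in CUST if t[1] == "UK"}
fr_again = {t for t in CUST if t[1] == "FR"}

# Vertical merger: natural join of ORDER_HEAD(order_id, cust_id) and
# ORDER_SHIP(order_id, city) on the shared attribute order_id.
ORDER_HEAD = {(10, 1), (11, 3)}
ORDER_SHIP = {(10, "Leeds"), (11, "Lyon")}
ORDERS = {
    (o, c, city)
    for (o, c) in ORDER_HEAD
    for (o2, city) in ORDER_SHIP
    if o == o2
}

# The originals are re-created by projections of the merged relation.
head_again = {(o, c) for (o, c, _city) in ORDERS}
ship_again = {(o, city) for (o, _c, city) in ORDERS}
```

The restrictions only recover the originals because the country attribute distinguishes them, and the projections only do so because every order appears in both joined relations; these are instances of the design rules mentioned above.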

Traditionally larger mergers are not considered so attractive for storage, as in general it is harder to access a desired data item from a larger set of data than it is from a smaller set. Nevertheless vertical mergers can be beneficial. If a query involves joining relations, and if the relation that is the result of the join were to be physically stored instead by means of vertical mergers using the same join properties as in the query, it would remove the need for the physical join operation to answer the query; and joins are potentially time-consuming. Despite this, only the Caché DBMS uses vertical mergers; it stores the tables representing hypercubes as a single joined table. (It also uses a system of coding the data to compress table sizes). Some DBMSs, e.g. Oracle, do allow clustering whereby data from two relations to be joined is stored so that the data of tuples that would be merged in the result is held in the same physical block. This is a “halfway house” that uses just physical design ideas rather than any logical design ideas.

The de-normalisation of some Conceptual schema designs often merges together relations in the original Conceptual schema. The motivation is usually to improve query performance by merging relations that have to be joined together in commonly occurring queries. However de-normalisation can lead to confusion and maintenance problems in the long run because the design of the Conceptual schema no longer reflects the inherent nature and structure of the data in the database. Queries and performance must inevitably be based on the inherent nature and structure of the data, since nothing else makes sense.

However, there have been research efforts to store vertical mergers in order to improve performance. For example, Schkolnik and Sorenson have investigated storing denormalised relations as ‘materialised joins’ in addition to the normalised relations (see [5]). Scholl, Paul, and Schek have investigated storing nested relations that are the logical equivalent of joins (see [6]).

Thus it would be helpful if database designers could separate out the logical relations that represent the inherent nature and structure of the data from the logical relations needed to optimise performance, particularly as performance optima can change much more frequently than the inherent nature and structure of the data.


The author has not come across any instances of the third method, combining both merging and fragmentation, but logically it is possible. If one is alive to the possibility, then opportunities for its use may emerge. (If merging is never formally used, then naturally it cannot be combined with fragmentation).

It becomes convenient to distinguish between base relations and stored relations. A stored relation is one whose data contents are physically stored in files such that there is always a direct link between the stored relation and its physical storage, even if the physical storage is changed (possibly radically and/or frequently) to optimise physical performance. A base relation is also one whose data contents are physically stored, but the storage may be direct - i.e. the base relation is also a stored relation - or it may be indirect in that it may be mapped onto other relation(s), possibly recursively, culminating in stored relation(s).

Note that, from a technical perspective, updating underlying stored relation(s) in order to update a base relation is in principle no different from updating underlying base relation(s) in order to update a view.

Consider the relations :

R1 (A, B, C)
R2 (A, D, E)
REL (A, B, C, D, E)

where A denotes an attribute that is a candidate key. Furthermore, suppose :

REL = R1 Join[ A ] R2
R1 = REL Project[ A, B, C ]
R2 = REL Project[ A, D, E ]

[Diagram relating R1, REL and R2]

Then is it possible to deduce unequivocally that R1 and R2 are two views derived from the base relation REL by projection, or that REL is a stored relation created by a natural join of the two base relations R1 and R2 and used to store their data ? The answer is that it is not, because the two cases are logically equivalent.
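The equivalence can be checked on sample data. The sketch below is illustrative only: the tuple values and the function names join_on_a, project_abc and project_ade are assumptions, and the relations are modelled as Python sets of positional tuples.

```python
# R1(A, B, C) and R2(A, D, E), with A a candidate key of both.
# The values are hypothetical sample data.
R1 = {(1, "b1", "c1"), (2, "b2", "c2")}
R2 = {(1, "d1", "e1"), (2, "d2", "e2")}

def join_on_a(r1, r2):
    """R1 Join[ A ] R2, giving tuples (A, B, C, D, E)."""
    return {(a, b, c, d, e) for (a, b, c) in r1 for (a2, d, e) in r2 if a == a2}

def project_abc(rel):
    """REL Project[ A, B, C ]."""
    return {(a, b, c) for (a, b, c, _d, _e) in rel}

def project_ade(rel):
    """REL Project[ A, D, E ]."""
    return {(a, d, e) for (a, _b, _c, d, e) in rel}

# Reading 1: R1 and R2 are base relations and REL is derived by joining them.
REL = join_on_a(R1, R2)

# Reading 2: REL is the given relation and R1, R2 are derived by projection.
r1_from_rel = project_abc(REL)
r2_from_rel = project_ade(REL)

# All three equations hold at once, so the data alone cannot say which
# relation is the "original" and which is derived.
all_hold = (
    r1_from_rel == R1
    and r2_from_rel == R2
    and join_on_a(r1_from_rel, r2_from_rel) == REL
)
```

Nothing in the values distinguishes the two readings; only the designer's stated purpose does.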

In fact it is generally true that when a relation(s) is derived from another relation(s) via algebraic operations, the derivation or mapping between the two relation(s) is independent of whether a view(s) is being derived from a base relation(s) or a base relation(s) is being derived from a stored relation(s). The differences between the two situations are that :

1. they have different purposes;


2. different approaches are used in defining views and stored relations: both are defined in terms of base relations, but whereas base relations underlie views, stored relations underlie base relations;
3. stored relations should obey design criteria which views are not constrained to obey.

However none of these affect the logical nature of the mappings.

Date and McGoveran have developed a method by which all logically-updateable views may be updated by updating the relations from which the views are derived : see ref. [3]. This method is very general and wide-ranging and so does not have the limitations of conventional ad hoc view updating methods. Consequently the same method can be used for updating base relations whose data is held in stored relations. Ref. [4] explicitly refers to this. In fact Date and McGoveran’s method is far more general than the horizontal and vertical partitioning described above. The only constraint is that updates must be logically possible. An example of where an update is impossible is an attempt to insert a tuple into a view formed by a GroupBy operation where the view includes an attribute formed by aggregation. The insert is impossible since there is no general way to determine which set of tuples, from all the different possible sets of tuples that are logically possible, is the set to be inserted into the underlying relation in order to create the inserted tuple in the view. As the fragmentation and merging methods do not incur this limitation, it does not cause a problem for them.
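The GroupBy difficulty can be made concrete with a small Python sketch. The relation SALES(sale_id, dept, amount), the aggregate view of totals per department, and all values are hypothetical; the point is only that two different sets of underlying tuples yield the same view tuple.

```python
from collections import defaultdict

def totals(sales):
    """Aggregate view: total amount per department, as a set of (dept, total)."""
    acc = defaultdict(int)
    for _sale_id, dept, amount in sales:
        acc[dept] += amount
    return set(acc.items())

# Suppose a user tries to insert ("toys", 30) into the view.
# Either of these underlying sets of SALES tuples would produce that tuple:
candidate_a = {(1, "toys", 30)}
candidate_b = {(1, "toys", 10), (2, "toys", 20)}

# Both candidates give the same view, so the view tuple alone cannot
# determine which tuples should be inserted into the underlying relation.
ambiguous = totals(candidate_a) == totals(candidate_b) == {("toys", 30)}
```

By contrast, a tuple inserted into a relation defined by union, restriction, projection on a key, or a key-preserving join has a determinable image in the underlying relations, which is why the fragmentation and merging mappings escape this limitation.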

A final bonus from distinguishing between stored and base relations is that a query optimiser that uses algebraic transformations can use these transformations in going from base to stored relations. Thus a greater part of the optimisation process can be done via algebraic transformations. (This was a major motivation for Scholl, Paul, and Schek’s work). The extra optimisation opportunity costs nothing since it merely uses the existing algebraic transformation part of the optimiser. Other things being equal, the part of the optimiser that deals with physical storage can now be simplified as the physical storage options can now be simplified without loss of optimisation.

Database Schemas

The fact that fragment relations and merged relations are not intended to be seen by the users, nor reflect the inherent nature or structure of the data, suggests that they be viewed in terms of another schema. In order to pursue this idea, first consider existing database architectures.

The traditional database schema architecture, based on the ANSI/SPARC standard, is the 3-layer one :

1. External (or Sub-) schemas.
2. Conceptual (or Logical) schema.
3. Internal (or Physical) schema.

Where fragment relations have been employed in distributed databases, the traditional schema architecture has been expanded (for example, see ref. [1]) to encompass the following :

1. Global External (or Sub-) schemas.
2. Global Conceptual (or Logical) schema.


3. (Global) Fragmentation schema.
4. (Global) Allocation (or Replication) schema.
5. Local Conceptual (or Logical) schemas.
6. Local Internal (or Physical) schemas.

Comparing the traditional and distributed cases, schemas 1 and 2 are the same in both of the cases. The distributed schema 6 is the traditional schema 3 at each node. The distributed schema 5 is the equivalent of the traditional schema 2 at each node. The distributed schemas 3 and 4 are new. The Fragmentation schema is the schema wherein is specified what fragments are to be used to store the data of base relations, and how the base relations relate to their fragments.

Thus it is proposed that :

• the Fragmentation schema become generalised and allow both fragmentation and merging;
• it be called the Storage schema to reflect the change;
• it be used by all databases, whether single-site or distributed.

Since some large single-site databases may also wish to store multiple copies of some relations, in order to improve resilience and/or query performance, one may as well also keep the Allocation schema as standard for all databases. In which case, there would be several Local Conceptual schemas, one for each separately stored part of the database, where typically each part is stored on a separate storage device. This could help to manage the stored data.

Where a single-site database has no multiple copies of relations, the Allocation schema would trivially be a one-to-one mapping of the Storage schema to the one and only ‘Local’ Conceptual schema; i.e. the Storage schema would be identical to the ‘Local’ Conceptual schema. Consequently one would expect the DBMS to automatically optimise this situation so that it caused no additional overhead. Even in this situation, if the database is large and complicated, it might make the database easier to manage if there were several ‘Local’ Conceptual schemas and the Allocation schema mapped different portions of the Storage schema to each ‘Local’ Conceptual schema. These ‘Local’ Conceptual schemas would each be a physically coherent portion of the database that was best managed as a whole.

From a design point of view, adding the Storage schema provides a pleasing symmetry for the top three schema levels of the database. The central schema of the three, the Conceptual schema, consists of all the base relations in the database. These should be designed to the best possible logical design standards to reflect the inherent structure and characteristics of the data, i.e. be a canonical model of the real-world data. They would be designed using normalisation and without regard to application or performance considerations. The External schemas in the layer above would be derived from it, and each would reflect what was best for the application(s) that it supported. The Storage schema in the layer below would also be derived from it, but would reflect what was most desirable from a performance point of view.

It is conjectured that the lack of the Storage schema in the past has encouraged database designers to incorporate the physical design possibilities for each relation in the Conceptual schema, rather than separate out the logical and physical aspects of the relational design. Also that this lack has led to the dilemma of de-normalisation : should Conceptual schema relations be de-normalised for performance or maintained as a valid logical design to support the long term maintenance of the database ?

However the ability to fragment and/or merge relations allows for a lot more possibilities for designing stored relations than merely de-normalising. The introduction of the Storage schema could remove all such impediments to thought.

As the horizontal fragmentation storage options of many DBMSs (e.g. Oracle and RdB) indicate, there is a perceived need to create storage fragments. Creating the new schema allows this to be done openly, with the full power of relational algebra, or its SQL equivalent, instead of having to re-invent parts of the algebra/SQL under cover of the storage options; and yet by being in a separate schema, it maintains physical data independence, which is so important. It also gives free rein to consider other storage designs that go beyond just simple horizontal fragmentation.

It is also possible to surmise that the lack of the Storage schema is in turn due to the lack of a general-purpose ‘view’ updating mechanism, such as that proposed by Date and McGoveran. It is obvious that a general-purpose Storage schema is infeasible without such a mechanism. Yet hitherto no such mechanism has been implemented in any DBMS. View updating is limited and ad hoc in SQL. In turn this is due to the nature of SQL which either does not implement relational principles fully or implements them anomalously.

Were Date and McGoveran’s view updating mechanism to be found to be unsound or flawed, then it could not be used for even a pure relational database. Either another mechanism (if this were logically possible) or a workaround would be needed, at least for the algebraic operators involved in horizontal and vertical fragmentation. Otherwise there would still be no support for the separation of the logical and physical design concerns, and for the distributed case no support for the geographical design concerns.

Appraisal of the Schemas

Consideration of the various schemas reveals that schemas of relations fall into two categories :

1. sets of relations,
2. sets of mappings between relations.

Let us consider each category of schemas in turn.

Schemas that are Sets of Relations

The schemas containing sets of relations are the Subschemas, the Conceptual schema, the Storage schema, and the Local Storage schemas.

Note that the use of ‘external’ and ‘internal’ as schema names has been abandoned for simplicity, as has the use of the adjective ‘Global’.

These schemas represent a four-layered architecture. Hence they can be portrayed as follows :-

Subschemas :- ................
Conceptual Schema :-
Storage Schema :-
Local Storage Schemas :- ................

View relations appear only in Subschemas, although Subschemas may also include base relations. The Conceptual schema contains only base relations. The Storage schema and the Local Storage schemas contain only stored relations, any of which may also be a base relation.

It would be possible to have a Subschema that contains all the Conceptual schema’s base relations plus some views. (The terminology of subschema here - with its normal meaning of subset of the Conceptual schema - would still be appropriate, as a subset does not have to be a proper subset; it can be equal to the other set. Even the additional views are not holding additional information, merely additionally displaying existing information in more convenient ways).

To clarify this architecture, consider a small example that just illustrates the architecture’s top three layers. Let V denote a view, B a base relation, and S a stored relation.

The following represents a small database :-

Subschemas :- V1 B1 B3 V2 B2 V3
Conceptual Schema :- B1 B2 B3 B4
Storage Schema :- S1 S2 S3 B3 B4


This shows that some of the base relations appear in Subschemas and the Storage schema as well as (of course) in the Conceptual schema.

If these schemas were to be portrayed using a Venn diagram (since they are all sets of relations), they would be shown as :-

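[Venn diagram of the three sets: V1, V2 and V3 lie only in the Subschemas; B1 and B2 lie in both the Subschemas and the Conceptual schema; B3 lies in all three; B4 lies in the Conceptual and Storage schemas; S1, S2 and S3 lie only in the Storage schema.]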

This example raises the following question. When a view is created, in what schema is it immediately held ? One does not want the inconvenience of having to assign it to a Subschema as part of the view creation operation. The proposal here is that there is a system Subschema (called ‘Views’) to which any view is immediately and automatically assigned. Views can then be moved or copied from there to other Subschemas as and when desired.

Still this architecture, because it refers only to relations, omits the actual physical storage structures that hold the relational data, the files, indexes, etc. A Local Physical schema is needed for each Local Storage schema, where the Local Physical schema is the set of all physical objects - files, indexes, etc. - used to actually store the data of the Local Storage schema’s relations. Thus another architectural layer, consisting of the Local Physical schemas, should be added on to the bottom of the architecture as follows :-

Subschemas :- ................
Conceptual Schema :-
Storage Schema :-
Local Storage Schemas :- ................
Local Physical Schemas :- ................

Note that the above architecture assumes a distributed database, or a centralised database where it is useful to have local schemas for each of a set of storage devices attached to the single computer. In the case of a simple centralised database where the Storage schema is not divided up between several Local Storage schemas, the architecture could be simplified to :-

Subschemas :- ................
Conceptual Schema :-
Storage Schema :-
Physical Schema :-

These architectures give rise to some identities that are always true, and which could be useful in managing the database :-


Dist[ Union ] Set-of-Subschemas-except-Views Diff Conceptual-schema = Views-Subschema
(N.B. Dist[ Union ] means Distributed Union; see footnote 2.)
Subschema-X Diff Conceptual-schema = Set-of-Views-in-X
Conceptual-schema Diff Subschema-X = Set-of-Base-Relations-not-in-X
Conceptual-schema Diff Storage-schema = Set-of-Base-Relations-not-directly-stored
Storage-schema Diff Conceptual-schema = Set-of-all-Storage-only-Relations
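Because the set schemas are ordinary sets, these expressions can be evaluated directly. The following Python sketch uses the small example database given earlier; the division of the Subschema relations into three Subschemas named X, Y and Z is an assumption made for illustration, as is the use of strings to stand for relations.

```python
from functools import reduce

# Set schemas from the small example; relation names stand for the relations.
conceptual = {"B1", "B2", "B3", "B4"}
storage = {"S1", "S2", "S3", "B3", "B4"}

# Hypothetical grouping of V1 B1 B3 V2 B2 V3 into three Subschemas.
subschema_x = {"V1", "B1"}
subschema_y = {"B3", "V2"}
subschema_z = {"B2", "V3"}
views_subschema = {"V1", "V2", "V3"}   # the system Subschema called 'Views'

def dist_union(sets):
    """Distributed union: the union of every set in a collection of sets."""
    return reduce(set.union, sets, set())

subschemas_except_views = [subschema_x, subschema_y, subschema_z]

all_views = dist_union(subschemas_except_views) - conceptual
views_in_x = subschema_x - conceptual
base_not_in_x = conceptual - subschema_x
base_not_directly_stored = conceptual - storage
storage_only = storage - conceptual
```

In this example every view has been copied into some other Subschema, so the first expression yields exactly the contents of the ‘Views’ Subschema.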

The schema architecture for a distributed database still only has 5 layers compared to the usual 6. This is because the Allocation schema, which is a set of mappings between relations, is missing.

Schemas that are Sets of Mappings

Adding the Allocation schema yields the following architecture :-

Subschemas :- ................
Conceptual Schema :-
Storage Schema :-
Allocation Schema :-
Local Storage Schemas :- ................
Local Physical Schemas :- ................

Footnote 2: These identities assume a mathematical notation. The operators (including the comparators) could be implemented in a relational algebra language, e.g. RAQUEL.


It is readily seen that besides the Allocation schema, other sets of mappings between the layers of the architecture must exist in order to link adjacent layers, and that the database designer needs to be aware of and use these mappings. Two other mapping schemas are :-

1. the View schema (the mappings that define the views in terms of other relations);
2. the Equivalence schema (the mappings that define base relations in terms of stored relations).

They are like the Allocation schema in that they are mappings between relations, but they differ in that they arise automatically from the definitions of views and fragments/mergers, whereas the Allocation schema mappings must be entered manually because their choice is part of the design of the database.

Displayed pictorially, a small View schema might look as follows :-

[Diagram: the views V1 and V2 linked by three mappings to the base relations B1, B2 and B3.]

Three mappings are used to define two views, because one view is defined in terms of two relations but the other only in terms of one relation. The mappings correspond to the definition of the views. Naturally the actual algebraic definitions of the views, or SQL equivalents, also need to be stored in the schema.

Similarly a small Equivalence schema might look like :-

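[Diagram: the base relation B1 mapped to the stored relations S1 and S2; the base relations B2 and B3 mapped to the stored relation S3.]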

Here one base relation is stored in two (fragment) relations, while two other base relations are stored in one (merged) relation. Again the mappings correspond to the definitions, whose algebraic/SQL expression needs to be stored.
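A mapping schema can be modelled as a set of ordered pairs, which makes its domain and range easy to compute. The sketch below is an assumption-laden illustration in Python: the pairs follow one reading of the Equivalence schema diagram (B1 stored as S1 and S2; B2 and B3 stored together as S3), and the function names dom, ran, images and preimages are invented for the example.

```python
# The Equivalence schema as a set of (base relation, stored relation) pairs.
# The pairing is an assumed reading of the diagram above.
EQUIVALENCE = {("B1", "S1"), ("B1", "S2"), ("B2", "S3"), ("B3", "S3")}

def dom(mapping):
    """Domain: every relation appearing on the left of some pair."""
    return {x for x, _y in mapping}

def ran(mapping):
    """Range: every relation appearing on the right of some pair."""
    return {y for _x, y in mapping}

def images(mapping, x):
    """The relations that x is mapped to."""
    return {y for a, y in mapping if a == x}

def preimages(mapping, y):
    """The relations that are mapped to y."""
    return {a for a, b in mapping if b == y}

# A base relation with several stored images is fragmented; a stored
# relation with several base preimages is a merger.
fragmented = {b for b in dom(EQUIVALENCE) if len(images(EQUIVALENCE, b)) > 1}
mergers = {s for s in ran(EQUIVALENCE) if len(preimages(EQUIVALENCE, s)) > 1}
```

The pairs record only which relations are linked; as the text notes, the algebraic or SQL definition behind each mapping would have to be stored as well.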

Another set of mappings is from relations to their physical storage arrangements. These are called Local Conversion schemas. Their purpose is to define how each stored relation’s data is actually stored in terms of the physical objects. There will need to be a Local Conversion schema to map between each Local Storage schema and each corresponding Local Physical schema. Because Local Conversion schemas do not deal purely with the relational model, they could be handled differently by different DBMSs. In particular, it could be that Local Physical schemas could be derived automatically from Local Conversion schemas.


Thus there are 4 kinds of mapping schema. If each kind is made a layer of the architecture in the same way that the Allocation schema is, then one ends up with a 9-layer architecture. It can be portrayed as follows :-

Subschemas :- ................
View Schema :-
Conceptual Schema :-
Equivalence Schema :-
Storage Schema :-
Allocation Schema :-
Local Storage Schemas :- ................
Local Conversion Schemas :- ................
Local Physical Schemas :- ................

Again, a simple centralised database could have an architecture that was simplified in the obvious way.


Although 9 layers may seem excessively complex, note that in reality they are all there anyway. Nothing new has been introduced. The only change is to make explicit what was previously implicit.

Together the 9 layers provide a comprehensive conceptual structure for the database so that its designers can better envisage what they need to design, and better monitor the progress of their design; it also supports the amendment of the design of current databases when evolutionary changes need to be made. All 9 components are necessary. However a proper DBMS should be able to automate the production of much of the schemas in ways already indicated, thereby giving the database designer the maximum support with the minimum of effort. A proper DBMS should also provide the tools - via its data dictionary and/or user friendly commands - to help the designer use all the schemas effectively.

There are some further identities that are always true, and which could be useful in managing the database :-

Dist[ Union ] Set-of-Subschemas-except-Views

Dom[ View ] Union Conceptual-schema

(N.B. Dom[ View ] is the domain of the View schema mapping).

Conceptual-schema Diff Dom[ Equivalence ] Union Storage-schema

Ran[ Allocation ]

Set-of-all-Stored-Relations

Dist[ Union ] Local-Storage-schemas

(N.B. Ran[ Allocation ] is the range of the Allocation schema mapping).

All the identities take advantage of the fact that the set and mapping schemas conform to the principles of traditional mathematical sets and relations respectively; i.e. a set schema is a set of items (in this case a set of database relations) and a mapping schema is a set of mappings (in this case between database relations).

Conclusion

The above is an attempt to provide a rational framework in which the study of the design of relational databases could be carried out. It tries to rise above the peculiarities of any individual relational DBMS so that it can provide a basis not only for a range of current relational DBMSs (so that they can be compared and one can easily transfer from working with one to working with another) but also for future developments in relational DBMSs (so that new developments can be evaluated using rational criteria).

The framework is based on a rationalisation of the current standard approaches to centralised and distributed databases, and can be used for both.

The proposals can be summarised as :-

1. The standard database schema architecture should be expanded to include a Storage schema. This will facilitate the design of the best storage mechanisms while encouraging the Conceptual schema to be a design based purely on the inherent nature and structure of the data. This overcomes the dilemma which many database designers face as to which of these two choices to go for.


2. To achieve this, stored relations should be differentiated from base relations. The latter may be stored indirectly via mappings to stored relations, while the former are always stored directly.

3. Date and McGoveran’s method for view updating can and should also be applied to the updating of base relations whose data are held indirectly in stored relations. If this method is inadequate, then some logically equivalent means to the same end must be found.

4. Any reasonable relational DBMS has an optimiser which uses algebraic transformations as part of its optimisation strategy. Such an optimiser will be able to use the storage definitions of base relations for additional optimisation. Furthermore, it will not require an extension of the optimiser to accomplish this - the extra optimisation comes for free - since the extra optimisation can be done by algebraic transformations.

5. It may be useful to make an Allocation schema a standard part of the database architecture, since centralised databases may also wish to have more than one copy of some data. However this does imply that to avoid being a burden in the simple centralised case, one-to-one mappings should appear as the default, and the DBMS should automatically note one-to-one mappings and not create any overhead for them.

6. It is worthwhile to rationalise the schema architecture, which should be built from two kinds of schema : schemas that are sets of relations and schemas that are sets of mappings between relations. A mapping schema forms a layer of the architecture that appears between two architectural layers formed from sets of relations. The mappings show how the relations above and below in the architecture relate to each other. Together they provide much better support for designing and maintaining the database.

7. A distributed database would have 5 set schemas with 4 mapping schemas sandwiched between them, while a simple centralised database could reduce this to 4 set schemas with 3 mapping schemas.

8. A DBMS should support the database designer by automating as much as possible of schema creation and by providing easy ways to inspect the contents of schemas. It should also be able to exploit a variety of identities between schemas.

References

[1] Distributed Databases : Principles and Systems. S. Ceri & G. Pelagatti (McGraw-Hill Computer Science Series, 1985).

[2] Private communication from Hugh Darwen, 1999.

[3] An Introduction to Database Systems. C. J. Date (Addison-Wesley, 2000), ch. 9, section 4.

[4] An Introduction to Database Systems. C. J. Date (Addison-Wesley, 2000), ch. 23, page 698. (David McGoveran was the original author of this chapter).

[5] The Effects of Denormalisation on Database Performance. M. Schkolnik & P. Sorenson (Res. Rep. RJ3082(38128), IBM Research Lab, San Jose, 1981).


[6] Supporting Flat Relations by a Nested Relational Kernel. M. H. Scholl, H. B. Paul, & H. J. Schek (Proceedings of 13th VLDB Conference, Brighton, 1987).


Appendix One - Retrieving Meta Data about Schemas

To fully exploit the database schema architecture, it is useful to have operations that retrieve information about schemas. This is data dictionary or data catalogue data, i.e. data about data, so-called meta data. This is so that the designer can easily check the database’s current design. It is useful not only in developing and checking the initial design of the database, but also for monitoring an existing database and ensuring its design evolves appropriately. It is also useful for mundane everyday reasons such as checking the spelling of relation names in order to enter a query properly.

Given this requirement, it does not logically follow that new algebra operators or SQL commands are required. This data should be kept in the data dictionary or catalogue of the database. As this consists of a set of relations holding meta data about the database, they can be queried in the same way as any other relation (albeit for security they are usually made read-only).
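As a small illustration of querying the catalogue like any other relation, the following Python sketch uses SQLite, whose catalogue is exposed as the table sqlite_master. SQLite is chosen here purely for convenience and is not one of the products discussed in this note; the DEPT and EMP tables are illustrative only.

```python
import sqlite3

# Illustrative in-memory database; DEPT and EMP are example names only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPT (DeptNo INTEGER PRIMARY KEY, Budget REAL, Area TEXT)")
conn.execute("CREATE TABLE EMP (EmpNo INTEGER PRIMARY KEY, EName TEXT, DeptNo INTEGER)")

# The catalogue is itself a relation, so an ordinary query retrieves meta data.
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
relation_names = [name for (name,) in rows]
print(relation_names)
```

No special command is needed: the same SELECT machinery that serves user data also serves the meta data.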

However, because the data dictionary/catalogue is itself a non-trivial database in size and complexity, it is common to make it easier to access for the database designer and user. Sometimes this is done by having special commands. RdB follows this approach. However this is non-standard. Another approach is to allow views to be created to supplement the data dictionary/catalogue base relations. Usually each view holds an abbreviated version of a base table (or a join of base tables), and for convenience contains just commonly-needed data rather than all the meta data that the DBMS requires to be recorded in order to run properly. By making queries on the views, users can further tailor the information to what they want. This approach is followed by Oracle, who additionally allow user-written scripts to be created which can display the contents of multiple data dictionary/catalogue base relations and views in different formats. Unfortunately data dictionary/catalogue base relations are not standardised, so one still needs to learn the structure of each DBMS’s data dictionary/catalogue. It would be useful if there were a standard set of relations in the data dictionary/catalogue to avoid this problem. DBMSs could still have their own internal design of relations, as long as they could present the standard interface via views.

Note that obtaining schema information via a data dictionary/catalogue is orthogonal to what the schema architecture actually is, and is always a necessity.

Another approach which avoids the user having to know about the data dictionary/catalogue is to use operators that take parameters and return relations holding data dictionary/catalogue information. Their output can also be further manipulated in the same way that views can, to obtain more precisely the data of interest. In fact the operators can be considered as parameterised views. This approach was successfully used in the IBM Business System 12 Relational DBMS [2]. However unless the operators or their SQL equivalents become standard, they will still have to be learnt for each DBMS.
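The idea of an operator that takes a parameter and returns a relation of catalogue data can be sketched in Python over SQLite. The function name attributes_of and the example tables are hypothetical; this is only an illustration of the parameterised-view idea, not the operators of any particular DBMS.

```python
import sqlite3

def attributes_of(conn, relation):
    """Hypothetical parameterised view: the attribute names of one relation."""
    # Check the name against the catalogue before interpolating it into SQL.
    known = {name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if relation not in known:
        raise KeyError(relation)
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{relation}")')}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPT (DeptNo INTEGER PRIMARY KEY, Budget REAL, Area TEXT)")
conn.execute("CREATE TABLE EMP (EmpNo INTEGER PRIMARY KEY, EName TEXT, DeptNo INTEGER)")

# The result is an ordinary set, so it can be manipulated further.
shared = attributes_of(conn, "DEPT") & attributes_of(conn, "EMP")
print(shared)
```

The caller never needs to know how the catalogue is laid out internally; only the operator and its parameter have to be learnt.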

Note that parameterised views would be a useful general-purpose extension to the relational model anyway. However here they are merely considered as a way of front-ending the data dictionary/catalogue to be more flexible than views. If the operators were sufficiently powerful and flexible, then they could be made the only way to query the data dictionary/catalogue, which would yield the benefit of making it more secure.

An operator or parameterised view is briefly considered for illustration. Let it be called the Meta operator. For example,

Meta [Conceptual]
Meta [Storage]

would give all the relations in the Conceptual schema and Storage schema respectively.

Meta [ Schema-Name ]

would give all the relations of the Subschema or Local Storage schema called ‘Schema-Name’. (By contrast ‘Conceptual’ and ‘Storage’ would be standard names).

In these examples, each result relation would consist of a single attribute containing relation names. This is because they are all set schemas.

The mathematical identities could be exploited. Thus

Meta [Conceptual] Diff Meta [Storage]

would contain all the base relations that were not directly stored but were stored via fragments or merged relations (see footnote 3).
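The set-schema use of the Meta operator can be mimicked with Python sets. The function meta, the dictionary SET_SCHEMAS and the directly stored relation PROJECT are assumptions introduced for illustration; the remaining names follow the worked example in Appendix Two.

```python
# Each set schema is modelled as a set of relation names.
# PROJECT is a hypothetical base relation that is stored 'as is'.
SET_SCHEMAS = {
    "Conceptual": {"DEPT", "EMP", "PROJECT"},
    "Storage": {"D-N", "D-S", "EMP1", "EMP2-N", "EMP2-S", "PROJECT"},
}

def meta(schema_name):
    """Sketch of Meta [schema_name]: the relation names in a set schema."""
    return set(SET_SCHEMAS[schema_name])

# Meta [Conceptual] Diff Meta [Storage]
not_directly_stored = meta("Conceptual") - meta("Storage")
print(sorted(not_directly_stored))
```

Because each result is a single-attribute relation of names, ordinary set difference is all that is needed.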

The mapping schemas could be similarly handled by the Meta operator, although the output relations would now need two attributes to display the mapping. Again the mathematical identities could be exploited. Thus

Meta [Conceptual] Diff ( Meta [Equivalence] Project[ Dom ] ) Union Meta [Storage]

would contain all the stored relations. Note that ‘Dom’ is assumed to be the name of the attribute holding the domains of the mappings.
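A mapping schema can be sketched in Python as a set of pairs, with projection onto the first component playing the role of Project[ Dom ]. The sketch assumes left-to-right evaluation of Diff and Union, and the relation PROJECT is again a hypothetical base relation stored ‘as is’; neither assumption comes from a RAQUEL definition.

```python
CONCEPTUAL = {"DEPT", "EMP", "PROJECT"}
STORAGE = {"D-N", "D-S", "EMP1", "EMP2-N", "EMP2-S", "PROJECT"}

# Equivalence schema as (Dom, target) pairs: base relation -> stored relation.
EQUIVALENCE = {
    ("DEPT", "D-N"), ("DEPT", "D-S"),
    ("EMP", "EMP1"), ("EMP", "EMP2-N"), ("EMP", "EMP2-S"),
}

def project_dom(mapping):
    """Project a two-attribute mapping relation onto its 'Dom' attribute."""
    return {dom for dom, _target in mapping}

# Meta [Conceptual] Diff ( Meta [Equivalence] Project[ Dom ] )
directly_stored = CONCEPTUAL - project_dom(EQUIVALENCE)

# ... Union Meta [Storage]
all_stored = directly_stored | STORAGE
print(sorted(directly_stored), sorted(all_stored))
```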

Other parameters would be needed to obtain (say) the attributes occurring in a relation, the foreign keys of a relation, the definition of views, etc. As this would require a comprehensive set of parameters, it is not further discussed here.

3 This and the following examples assume the Meta operator is used in conjunction with a RAQUEL-like relational algebra.


Appendix Two - A Worked Example

This example is written using RAQUEL.

Let a distributed database consist of the two relational variables EMP and DEPT , each defined as follows :-

DEPT <==Attributes[ DeptNo <== Dom-DeptNos; Budget <== Money;
Area <== { ‘North’, ‘South’ } ]
DEPT <==Key[ Primary <== DeptNo ]

EMP <==Attributes[ EmpNo <== Dom-EmpNos; EName <== Dom-Names;
DeptNo <== Dom-DeptNos; Salary <== Money; Tax <== Money ]
EMP <==Key[ Primary <== EmpNo ]
EMP Project[ DeptNo ] <==Reference[ DeptNoForKey ] DEPT Project[ DeptNo ]

Notes :

1. The attribute assignment creates an empty base relation value with the attributes named (each of which is drawn from the data type assigned to it) and assigns it to the variable named on the LHS. By default, base relations have their contents physically stored, and all their attributes participate in the only candidate key.

2. The key assignment creates the candidate key specified by the parameter, and replaces any default key with it.

3. The reference assignment creates a foreign key (whose name is given as a parameter) such that the LHS value is constrained to be a subset of the RHS value (see footnote 4).
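The subset constraint described in note 3 can be illustrated with a short Python sketch. The helper names project and reference_holds and the sample tuples are assumptions made for illustration; they are not RAQUEL.

```python
def project(relation, attrs):
    """Project a relation (a list of dict tuples) onto attrs, dropping duplicates."""
    return {tuple(t[a] for a in attrs) for t in relation}

def reference_holds(lhs, rhs, attrs):
    """True if the LHS projection is a subset of the RHS projection."""
    return project(lhs, attrs) <= project(rhs, attrs)

# Sample values only.
DEPT = [
    {"DeptNo": 10, "Budget": 500, "Area": "North"},
    {"DeptNo": 20, "Budget": 700, "Area": "South"},
]
EMP = [
    {"EmpNo": 1, "EName": "Ann", "DeptNo": 10, "Salary": 3000, "Tax": 600},
    {"EmpNo": 2, "EName": "Bob", "DeptNo": 20, "Salary": 2800, "Tax": 500},
]

ok = reference_holds(EMP, DEPT, ["DeptNo"])

# An employee in a non-existent department violates the constraint.
bad_emp = EMP + [{"EmpNo": 3, "EName": "Cy", "DeptNo": 99, "Salary": 2500, "Tax": 400}]
violated = not reference_holds(bad_emp, DEPT, ["DeptNo"])
print(ok, violated)
```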

It is decided to distribute the two base relations among the following fragments :-

D-N <--Store DEPT Restrict[ Area = ‘North’ ]
D-S <--Store DEPT Restrict[ Area = ‘South’ ]
EMP1 <--Store EMP Project[ EmpNo, Salary, Tax ]
EMP2-N <--Store EMP Join[[ DeptNo ] D-N Project[ EmpNo, EName, DeptNo ]
EMP2-S <--Store EMP Join[[ DeptNo ] D-S Project[ EmpNo, EName, DeptNo ]

The store assignment indicates that :

• the expression on the RHS is to be evaluated and physically stored as the value of the relational variable on the LHS;

• the LHS relation is a stored relation.

The DEPT relation is split into two horizontal fragments that are stored in D-N and D-S; note that from the domain of the attribute ‘Area’ it is known that there are only these two possible values of ‘Area’. EMP is split into two vertical fragments; both fragments contain the only candidate key of EMP. However one vertical fragment is further fragmented horizontally by means of a left-hand semi-join, so that the employees are split between a northern and a southern group that correspond to the two areas.

4 This could in principle be extended to provide more inclusion constraint options.

All the integrity constraints that apply to the value of the RHS expression are also derived and assigned to the LHS relation variable. This is equivalent to working out the integrity constraints that apply to a view.

Since DEPT and EMP are to be replaced by these fragments, it is now necessary to remove the original DEPT and EMP relation variables (otherwise the data they contain will be stored twice, in two different ways), and then specify how DEPT and EMP can be derived from their fragments. This is done as follows :-

DEPT <--Remove @

EMP <--Remove @

The removal assignment deletes all the tuples specified on the RHS from the relation variable on the LHS, and if the result is an empty relation then it removes that relation variable from the database altogether. Since ‘@’ is just a shorthand for the value of the relation variable on the LHS, in both the above cases the variables are removed from the database.
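The behaviour of the removal assignment can be sketched in Python, modelling the database as a dictionary from variable names to sets of tuples. The function remove and the sample data are illustrative assumptions.

```python
def remove(db, name, tuples):
    """Sketch of '<--Remove': delete tuples; drop the variable if none remain."""
    remaining = db[name] - set(tuples)
    if remaining:
        db[name] = remaining
    else:
        del db[name]  # an empty result removes the relation variable itself

db = {"DEPT": {(10, "North"), (20, "South")}}

# Removing some tuples leaves the variable in place.
remove(db, "DEPT", {(10, "North")})
after_partial = set(db["DEPT"])

# Passing the variable's own value mimics the '@' shorthand.
remove(db, "DEPT", db["DEPT"])
print(after_partial, "DEPT" in db)
```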

DEPT ==Equate D-N Union D-S
EMP ==Equate ( EMP2-N Union EMP2-S ) Join[ EmpNo ] EMP1

“==” assignments are bindings. The LHS is bound to the RHS in the manner specified by the assignment. Thus ==Equate states that :

• the relational variable on the LHS equates to the expression on the RHS;

• the value of a LHS relation is to be obtained by executing the expression on the RHS;

• the LHS relation is a base relation.

Thus DEPT can be re-created by unioning D-N and D-S. EMP can be re-created by unioning the two horizontal fragments together, and then by a natural join of that result with EMP1. Since the integrity constraints that apply to the value of the RHS expression apply to the LHS relation variable, the integrity constraints on DEPT and EMP are the same as those on the original versions of DEPT and EMP.
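The fragmentation and its reversal can be checked with a small Python model of the relational operators involved. The helper functions and the sample tuples are assumptions for illustration; relations are modelled as sets of frozensets of (attribute, value) pairs, and the semi-join matches on the common attribute DeptNo.

```python
def rel(*rows):
    """Build a relation from dict tuples."""
    return {frozenset(r.items()) for r in rows}

def restrict(r, pred):
    return {t for t in r if pred(dict(t))}

def project(r, attrs):
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}

def join(r, s):
    """Natural join on all common attributes."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return out

def semijoin(r, s):
    """Left semi-join: tuples of r that match some tuple of s."""
    return {t for t in r if join({t}, s)}

DEPT = rel({"DeptNo": 10, "Budget": 500, "Area": "North"},
           {"DeptNo": 20, "Budget": 700, "Area": "South"})
EMP = rel({"EmpNo": 1, "EName": "Ann", "DeptNo": 10, "Salary": 3000, "Tax": 600},
          {"EmpNo": 2, "EName": "Bob", "DeptNo": 20, "Salary": 2800, "Tax": 500},
          {"EmpNo": 3, "EName": "Cy", "DeptNo": 10, "Salary": 2500, "Tax": 400})

# The five fragments, following the store assignments.
D_N = restrict(DEPT, lambda t: t["Area"] == "North")
D_S = restrict(DEPT, lambda t: t["Area"] == "South")
EMP1 = project(EMP, {"EmpNo", "Salary", "Tax"})
EMP2_N = project(semijoin(EMP, D_N), {"EmpNo", "EName", "DeptNo"})
EMP2_S = project(semijoin(EMP, D_S), {"EmpNo", "EName", "DeptNo"})

# The equate bindings re-create the original relations.
dept_again = D_N | D_S
emp_again = join(EMP2_N | EMP2_S, EMP1)
print(dept_again == DEPT, emp_again == EMP)
```

The reconstruction is lossless here because both vertical fragments keep the candidate key EmpNo and the two horizontal fragments cover every value of Area.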

Next, allocation storage assignments are used to specify on which nodes of the network replicas of the fragments are to be physically stored :-

D-N ==Allocate[ Newcastle, Manchester ]
D-S ==Allocate[ Birmingham, London ]
EMP1 ==Allocate[ Newcastle ]
EMP2-N ==Allocate[ Newcastle, Manchester ]
EMP2-S ==Allocate[ Birmingham, London ]

Fragments containing northern data are stored on the Newcastle and Manchester nodes, while those containing southern data are stored on the Birmingham and London nodes. Data about employee salaries and taxes, the EMP1 relation, is stored in Newcastle (assumed to be the company head office).

The above statements together define the Allocation schema. They also automatically define the contents of the Local Storage schemas.
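How the Local Storage schemas follow automatically from the Allocation schema can be sketched in Python by inverting the fragment-to-node mapping. The dictionary layout and the function name local_storage_schemas are assumptions; the allocations are those of the example.

```python
from collections import defaultdict

# Allocation schema: fragment -> nodes holding a replica.
ALLOCATION = {
    "D-N": ["Newcastle", "Manchester"],
    "D-S": ["Birmingham", "London"],
    "EMP1": ["Newcastle"],
    "EMP2-N": ["Newcastle", "Manchester"],
    "EMP2-S": ["Birmingham", "London"],
}

def local_storage_schemas(allocation):
    """Invert the allocation mapping: node -> set of fragments stored there."""
    local = defaultdict(set)
    for fragment, nodes in allocation.items():
        for node in nodes:
            local[node].add(fragment)
    return dict(local)

LOCAL = local_storage_schemas(ALLOCATION)
for node in sorted(LOCAL):
    print(node, sorted(LOCAL[node]))
```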

Finally, a series of storage statements specify how the fragments are to be actually physically stored at their node(s). For example :-

Newcastle/D-N ==Physical[ IndexSeq ......... ]
Manchester/D-N ==Physical[ Seq ......... ]
Birmingham/D-S ==Physical[ IndexSeq ......... ]
London/D-S ==Physical[ Seq ......... ]
EMP1 ==Physical[ Heap ......... ]
Newcastle/EMP2-N ==Physical[ Hash ......... ]
Manchester/EMP2-N ==Physical[ B-Tree ......... ]
Birmingham/EMP2-S ==Physical[ Hash ......... ]
London/EMP2-S ==Physical[ B-Tree ......... ]

Each of the RHS keywords indicates a particular physical storage arrangement, and would have various parameters to make the specification complete. The statements above are intended only to illustrate a possible range of storage arrangements that might be used. Note that where a fragment is stored in more than one location, each replica of the fragment has to be distinguished from the others. This is particularly important where different replicas may be physically stored in different ways, as here. In the above example, this is done by prefixing the replica name with the location name, but some other notation might be preferable.

These statements together define the Local Conversion schemas, and automatically generate the contents of the Local Physical schemas, which will consist of the physical file definitions on the RHS of the statements.
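The derivation of the Local Physical schemas from the Local Conversion schemas can be sketched in Python. The dictionary layout and function name are assumptions; EMP1 is keyed under Newcastle because that is the only node to which it is allocated, and the elided storage parameters are not modelled.

```python
# Local Conversion schemas: (node, fragment) -> physical organisation.
LOCAL_CONVERSION = {
    ("Newcastle", "D-N"): "IndexSeq",
    ("Newcastle", "EMP1"): "Heap",
    ("Newcastle", "EMP2-N"): "Hash",
    ("Manchester", "D-N"): "Seq",
    ("Manchester", "EMP2-N"): "B-Tree",
    ("Birmingham", "D-S"): "IndexSeq",
    ("Birmingham", "EMP2-S"): "Hash",
    ("London", "D-S"): "Seq",
    ("London", "EMP2-S"): "B-Tree",
}

def local_physical_schemas(conversion):
    """Collect, per node, the physical file definitions on the RHS."""
    physical = {}
    for (node, _fragment), organisation in conversion.items():
        # A list, not a set, so that two files of the same kind stay distinct.
        physical.setdefault(node, []).append(organisation)
    return physical

PHYSICAL = local_physical_schemas(LOCAL_CONVERSION)
for node, files in PHYSICAL.items():
    print(node, files)
```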

Finally for completeness, some views are created to be used in populating Global External schemas :-

DEPT-EMP <--View DEPT Join[ DeptNo ] EMP

DEPT-N <--View DEPT Restrict [ Area = ‘North’ ]

DEPT-S <--View DEPT Restrict [ Area = ‘South’ ]

EMP-FIN <--View EMP Project[ EmpNo, Salary, Tax ]

The above statements together define the View schema.

The views are put into the following subschemas, along with base relations where required :-

Departments <==Create[ SubSchema ]
Departments <--SubInsert DEPT-EMP , DEPT

Northern <==Create[ SubSchema ]
Northern <--SubInsert DEPT-N , EMP

Southern <==Create[ SubSchema ]
Southern <--SubInsert DEPT-S , EMP

Financial <==Create[ SubSchema ]
Financial <--SubInsert EMP-FIN

There is one statement to create each subschema, and another to insert relations into it.

Together the statements define the Subschemas. (Note that this syntax is provisional).

The Subschemas are intended to be plausible rather than a logical necessity.

The following shows the schema architecture that would result :-

SubSchemas :-

Views        Departments  Northern  Southern  Financial
DEPT-EMP     DEPT-EMP     DEPT-N    DEPT-S    EMP-FIN
DEPT-N       DEPT         EMP       EMP
DEPT-S
EMP-FIN

View Mapping Schema :-

DEPT-N    ->  DEPT
DEPT-S    ->  DEPT
DEPT-EMP  ->  DEPT, EMP
EMP-FIN   ->  EMP

Conceptual Schema :-

DEPT  EMP

Equivalence Schema :-

DEPT  ->  D-N, D-S
EMP   ->  EMP1, EMP2-N, EMP2-S

Storage Schema :-

D-N  D-S  EMP1  EMP2-N  EMP2-S

Allocation Schema :-

D-N     ->  Newcastle, Manchester
D-S     ->  Birmingham, London
EMP1    ->  Newcastle
EMP2-N  ->  Newcastle, Manchester
EMP2-S  ->  Birmingham, London

Local Storage Schemas :-

Newcastle  Manchester  Birmingham  London
D-N        D-N         D-S         D-S
EMP1       EMP2-N      EMP2-S      EMP2-S
EMP2-N

Local Conversion Schemas :-

Newcastle    D-N     ->  IndexSeq
             EMP1    ->  Heap
             EMP2-N  ->  Hash
Manchester   D-N     ->  Seq
             EMP2-N  ->  B-Tree
Birmingham   D-S     ->  IndexSeq
             EMP2-S  ->  Hash
London       D-S     ->  Seq
             EMP2-S  ->  B-Tree

Local Physical Schemas :-

Newcastle  Manchester  Birmingham  London
IndexSeq   Seq         IndexSeq    Seq
Heap       B-Tree      Hash        B-Tree
Hash

The mapping schemas are shown here pictorially. In reality, the full details of the mappings would have to be stored, not just the ‘mapping arrows’.

It so happens in the above example that at each node, the physical storage specification of each fragment is unique. However in general a node might store several fragments with the same specification, in which case it would be necessary to distinguish between each instance, at least as regards where physically the data is stored. For simplicity, this example also ignores any kind of physical storage specification that clusters the storage of two or more fragments into one physical file.


Note that certain views have been deliberately made to contain the same set of data as certain fragments :-

DEPT-N ≡ D-N
DEPT-S ≡ D-S
EMP-FIN ≡ EMP1

This is because in practice, views which contain commonly used data are very likely to be put in fragments for performance reasons, simply because they are commonly used. However note that a view is very different from a fragment, and either could be changed at any time to suit the needs of the database users. Of course one would expect that the DBMS would be able to use the logical equivalence of their definitions to optimise performance.

The design of the above database shows how a database that was originally a single-site database, containing just DEPT and EMP as stored base relations, could have been re-designed in order to evolve into one that was distributed. However the distributed database could have had this design from the outset if it had initially been designed as a distributed database. In this case, the base relation variables and fragments would have been specified as follows :-

D-N <==Attributes[ DeptNo <== Dom-DeptNos; Budget <== Money;
Area <== { ‘North’ } ]
D-N <==Key[ Primary <== DeptNo ]
D-N <--Store D-N

D-S <==Attributes[ DeptNo <== Dom-DeptNos; Budget <== Money;
Area <== { ‘South’ } ]
D-S <==Key[ Primary <== DeptNo ]
D-S <--Store D-S

EMP1 <==Attributes[ EmpNo <== Dom-EmpNos; Salary <== Money;
Tax <== Money ]
EMP1 <==Key[ Primary <== EmpNo ]
EMP1 <--Store EMP1

EMP2-N <==Attributes[ EmpNo <== Dom-EmpNos; EName <== Dom-Names;
DeptNo <== Dom-DeptNos ]
EMP2-N <==Key[ Primary <== EmpNo ]
EMP2-N Project[ DeptNo ] <==Reference[ DeptNorthFKey ] D-N Project[ DeptNo ]
EMP2-N <--Store EMP2-N

EMP2-S <==Attributes[ EmpNo <== Dom-EmpNos; EName <== Dom-Names;
DeptNo <== Dom-DeptNos ]
EMP2-S <==Key[ Primary <== EmpNo ]
EMP2-S Project[ DeptNo ] <==Reference[ DeptSouthFKey ] D-S Project[ DeptNo ]
EMP2-S <--Store EMP2-S

DEPT ==Equate D-N Union D-S
EMP ==Equate ( EMP2-N Union EMP2-S ) Join[ EmpNo ] EMP1

Note that the specification of the first five relations automatically makes them base relations; this is typically what is required. Since they are not to appear in the Conceptual schema, they are then converted to stored relations. Finally the equate assignments create DEPT and EMP as base relations.

These statements do not show the design process as clearly as the first approach, but are logically equivalent.

In general, it is more likely that the fragments would be designed with additional general integrity constraints - corresponding to the constraints appearing in the original restrictions - as opposed to using type constraints in the fragment design; the latter happened to be easier for this simple example. In general any kind of constraints can be used to design the fragments.

Using Venn diagrams, the top three set schemas - i.e. layers 1, 3 and 5 - would be as follows :-

[Venn diagram: the subschemas Northern, Southern, Financial and Departments, the Conceptual Schema and the Storage Schema, drawn as sets over the relations DEPT-EMP, DEPT-N, DEPT-S, EMP-FIN, DEPT, EMP, D-N, D-S, EMP1, EMP2-N and EMP2-S.]


Using Venn diagrams, the set schemas from layers 5 and 7 would be:-

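[Venn diagram: the Storage Schema (D-N, D-S, EMP1, EMP2-N, EMP2-S) with the Local Storage schemas of Newcastle, Manchester, Birmingham and London drawn as sets over its fragments.]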

Although diagramming techniques have been used in the example, the main purpose of the example is to illustrate the schema architecture. Such diagrams can be useful, but a DBA tool to help design and manage the database schemas could use more sophisticated means.
