DATABASE DESIGN
Physical Database Design for Relational Databases
11/30/2023 Physical Design

COMPARISON OF LOGICAL AND PHYSICAL DATABASE DESIGN
Logical database design is largely independent of implementation details, but dependent on a target data model.
Logical database design is concerned with the what; physical database design is concerned with the how.
Sources of information for the physical design process include the global logical data model and the documentation that describes the model (the output of the logical design phase).
The physical database designer must know how the computer system hosting the DBMS operates, as well as the functionality of the target DBMS.

PHYSICAL DATABASE DESIGN
Physical database design is the process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures.
Note that the physical design process is not an independent process; there is feedback between the conceptual, logical and application design.

OVERVIEW OF PHYSICAL DATABASE DESIGN METHODOLOGY
This phase of database design is broken down into six steps:
a) Translate global logical data model for target DBMS
b) Design physical representation
c) Design user views
d) Design security mechanisms
e) Consider the introduction of controlled redundancy
f) Monitor and tune the operational system

A) TRANSLATE GLOBAL LOGICAL DATA MODEL FOR TARGET DBMS
The main aim of this step is to produce, from the global logical data model, a relational database schema that can be implemented in the target DBMS.
The designer needs to know the functionality of the target DBMS, such as how to create base relations and whether the system supports the definition of:
- PKs, FKs, and AKs;
- required data, i.e. whether the system supports NOT NULL;
- domains.
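As a rough illustration of checking what a target DBMS supports, the sketch below uses SQLite (through Python's standard sqlite3 module) as a stand-in target DBMS. The column names email, name and rent are assumptions made for the example; only staffNo, Staff and PropertyForRent come from these notes. It defines a PK, an AK (expressed as NOT NULL plus UNIQUE), required data (NOT NULL) and an FK, and then tries a few inserts to see which ones the DBMS rejects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is enabled

conn.executescript("""
CREATE TABLE Staff (
    staffNo TEXT PRIMARY KEY,                  -- PK
    email   TEXT NOT NULL UNIQUE,              -- AK (assumed column)
    name    TEXT NOT NULL                      -- required data (assumed column)
);
CREATE TABLE PropertyForRent (
    propertyNo TEXT PRIMARY KEY,               -- PK
    rent       REAL NOT NULL,                  -- required data (assumed column)
    staffNo    TEXT REFERENCES Staff(staffNo)  -- FK
);
""")

def try_sql(sql):
    """Return True if the DBMS accepts the statement, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

ok_staff  = try_sql("INSERT INTO Staff VALUES ('S1', 'a@example.com', 'Ann')")
dup_ak    = try_sql("INSERT INTO Staff VALUES ('S2', 'a@example.com', 'Bob')")  # AK violated
null_name = try_sql("INSERT INTO Staff VALUES ('S3', 'c@example.com', NULL)")   # NOT NULL violated
bad_fk    = try_sql("INSERT INTO PropertyForRent VALUES ('P1', 400.0, 'S9')")   # no such staff
good_fk   = try_sql("INSERT INTO PropertyForRent VALUES ('P2', 400.0, 'S1')")

print(ok_staff, dup_ak, null_name, bad_fk, good_fk)
```

Only the first and last inserts should be accepted. A different target DBMS may need different syntax, or may lack some of these facilities, which is why this check belongs at the start of the step.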
TRANSLATE GLOBAL LOGICAL DATA MODEL FOR TARGET DBMS CONT’D
The designer also needs to know whether the target DBMS supports the definition of relational integrity constraints and enterprise constraints.
Documentation after/within each step is paramount; state why a particular approach was selected among the alternatives available (if any).
This step is made up of three steps, namely:
- Design base relations
- Design representation of derived data
- Design enterprise constraints

STEP A.1: DESIGN BASE RELATIONS
The objective of this step is to decide how to represent the base relations identified in the global logical data model in the target DBMS.
For each relation, the following have already been defined:
- the name of the relation;
- a list of simple attributes in brackets;
- the PK and, where appropriate, AKs and FKs;
- a list of any derived attributes and how they should be computed;
- referential integrity constraints for any FKs identified.

STEP A.1: DESIGN BASE RELATIONS CONT’D
For each attribute, the following need to be defined:
- its domain, consisting of a data type, length, and any constraints on the domain;
- an optional default value for the attribute;
- whether the attribute can hold nulls.
After defining these, we decide how to implement the base relations; this depends on the target DBMS. Three ways of implementing are considered, namely SQL, Microsoft Access and Oracle.
The design of the base tables is then documented.

[Figure: DBDL for the PropertyForRent relation]

STEP A.2: DESIGN REPRESENTATION OF DERIVED DATA
The objective of this step is to decide how to represent any derived data present in the global logical data model in the target DBMS.
Examine the logical data model and data dictionary, and produce a list of all derived attributes.
A derived attribute can be stored in the database or calculated every time it is needed.
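The two options for a derived attribute (store it, or calculate it when needed) can be sketched as follows. This is a minimal illustration using SQLite through Python's sqlite3 module, not a prescribed implementation. The derived attribute noOfProperties (the number of properties a staff member handles) and the helper function names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (
    staffNo        TEXT PRIMARY KEY,
    noOfProperties INTEGER NOT NULL DEFAULT 0   -- stored copy of the derived value
);
CREATE TABLE PropertyForRent (
    propertyNo TEXT PRIMARY KEY,
    staffNo    TEXT REFERENCES Staff(staffNo)
);
INSERT INTO Staff (staffNo) VALUES ('S1'), ('S2');
""")

def assign_property(property_no, staff_no):
    """Insert a property and keep the stored count consistent (the extra update cost)."""
    with conn:
        conn.execute("INSERT INTO PropertyForRent VALUES (?, ?)", (property_no, staff_no))
        conn.execute(
            "UPDATE Staff SET noOfProperties = noOfProperties + 1 WHERE staffNo = ?",
            (staff_no,),
        )

def computed_count(staff_no):
    """Option 1: calculate the derived value every time it is needed."""
    row = conn.execute(
        "SELECT COUNT(*) FROM PropertyForRent WHERE staffNo = ?", (staff_no,)
    ).fetchone()
    return row[0]

def stored_count(staff_no):
    """Option 2: read the value stored alongside the operational data."""
    row = conn.execute(
        "SELECT noOfProperties FROM Staff WHERE staffNo = ?", (staff_no,)
    ).fetchone()
    return row[0]

for property_no in ("P1", "P2", "P3"):
    assign_property(property_no, "S1")

print(computed_count("S1"), stored_count("S1"))
```

Both functions should agree as long as every update path maintains the stored column. That maintenance is the consistency cost that has to be weighed against the cost of recomputing the value.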
STEP A.2: DESIGN REPRESENTATION OF DERIVED DATA CONT’D
The option selected is based on:
- the additional cost to store the derived data and keep it consistent with the operational data from which it is derived;
- the cost to calculate it each time it is required.
The less expensive option is chosen, subject to performance constraints.
The derived attribute is stored in the database when:
- it is accessed frequently by a query or queries;
- it is accessed by a query that is critical for performance purposes;
- the DBMS cannot cope with the algorithm to calculate the derived attribute.

[Figure: PropertyForRent relation and Staff relation with derived attribute noOfProperties]

STEP A.3: DESIGN ENTERPRISE CONSTRAINTS
The main objective of this step is to design the enterprise constraints for the target DBMS.
Updates to relations may be constrained by enterprise rules governing the ‘real world’ transactions represented by the updates.
The design of such constraints depends on the choice of DBMS.

STEP A.3: DESIGN ENTERPRISE CONSTRAINTS CONT’D
Some DBMSs provide more facilities than others for defining enterprise constraints. Example:
CONSTRAINT StaffNotHandlingTooMuch
    CHECK (NOT EXISTS (SELECT staffNo
                       FROM PropertyForRent
                       GROUP BY staffNo
                       HAVING COUNT(*) > 100))
Constraints that cannot be expressed in a given DBMS are designed into the application.

B) DESIGN PHYSICAL REPRESENTATION
The objective is to determine the optimal file organizations to store the base relations and the indexes that are required to achieve acceptable performance; that is, the way in which relations and tuples will be held on secondary storage.
This step is broken down into four steps, namely:
- Analyze transactions
- Choose file organizations
- Choose indexes
- Estimate disk space requirements
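Returning to the StaffNotHandlingTooMuch constraint of Step A.3: not every DBMS accepts a subquery inside a CHECK clause. SQLite, used here as an assumed target DBMS through Python's sqlite3 module, is one that does not. The sketch below is one possible workaround, not the method of these notes: a trigger rejects an insert that would take a staff member past the limit of 100 properties. It covers inserts only; a second trigger would be needed for updates that change staffNo.

```python
import sqlite3

MAX_PROPERTIES = 100  # limit taken from the StaffNotHandlingTooMuch example

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE PropertyForRent (
    propertyNo INTEGER PRIMARY KEY,
    staffNo    TEXT NOT NULL
);

-- Reject the insert when the staff member already handles the maximum number.
CREATE TRIGGER StaffNotHandlingTooMuch
BEFORE INSERT ON PropertyForRent
WHEN (SELECT COUNT(*) FROM PropertyForRent
      WHERE staffNo = NEW.staffNo) >= {MAX_PROPERTIES}
BEGIN
    SELECT RAISE(ABORT, 'staff member already handles the maximum number of properties');
END;
""")

def add_property(staff_no):
    """Return True if the property was stored, False if the constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO PropertyForRent (staffNo) VALUES (?)", (staff_no,))
        return True
    except sqlite3.DatabaseError:
        return False

accepted = [add_property("S1") for _ in range(MAX_PROPERTIES)]
rejected = add_property("S1")   # one property too many for S1
other_ok = add_property("S2")   # a different staff member is unaffected

print(all(accepted), rejected, other_ok)
```

Where the DBMS offers neither mechanism, the same count-then-insert check would sit in the application code, as the notes state.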
STEP B: DESIGN PHYSICAL REPRESENTATION
A number of factors may be used to measure efficiency:
- Transaction throughput: number of transactions processed in a given time interval.
- Response time: elapsed time for completion of a single transaction.
- Disk storage: amount of disk space required to store the database files.
However, no one factor is always correct. Typically, one factor has to be traded off against another to achieve a reasonable balance.

STEP B.1: ANALYZE TRANSACTIONS
The objective is to understand the functionality of the transactions that will run on the database and to analyze the important transactions.
Attempt to identify performance criteria, such as:
- transactions that run frequently and will have a significant impact on performance;
- transactions that are critical to the business operation;
- times during the day/week when there will be a high demand made on the database (called the peak load).

STEP B.1: ANALYZE TRANSACTIONS CONT’D
Use this information to identify the parts of the database that may cause performance problems, e.g. the parts that are frequently accessed by transactions/queries.
To select appropriate file organizations and indexes, the high-level functionality of the transactions also needs to be known, such as:
- the attributes that are updated in an update transaction;
- the criteria used to restrict the tuples that are retrieved in a query.

STEP B.1: ANALYZE TRANSACTIONS CONT’D
It is often not possible to analyze all expected transactions, so investigate the most ‘important’ ones.
To help identify which transactions to investigate, the following can be used:
- a transaction/relation cross-reference matrix, showing the relations that each transaction accesses, and/or
- a transaction usage map, indicating which relations are potentially heavily used.
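A transaction/relation cross-reference matrix can be kept as a very small data structure. The sketch below is only an illustration: the three transactions, their hourly frequencies and the Branch relation are invented for the example. It builds the matrix and then totals the accesses per relation, which gives a crude usage ranking.

```python
from collections import Counter

# Hypothetical workload: transaction -> (relations accessed, runs per hour).
transactions = {
    "A: register new property": ({"PropertyForRent", "Staff"}, 5),
    "B: list properties handled by a staff member": ({"PropertyForRent", "Staff"}, 60),
    "C: list staff at a branch": ({"Staff", "Branch"}, 20),
}

relations = sorted({rel for rels, _ in transactions.values() for rel in rels})

# Cross-reference matrix: True where the transaction accesses the relation.
matrix = {
    name: {rel: rel in rels for rel in relations}
    for name, (rels, _) in transactions.items()
}

# Usage totals: transaction runs per hour that touch each relation.
usage = Counter()
for rels, runs_per_hour in transactions.values():
    for rel in rels:
        usage[rel] += runs_per_hour

def render(matrix, relations):
    """Return the matrix as plain text, one transaction per row."""
    lines = ["transaction".ljust(46) + " ".join(r.ljust(16) for r in relations)]
    for name, row in matrix.items():
        cells = " ".join(("X" if row[r] else "-").ljust(16) for r in relations)
        lines.append(name.ljust(46) + cells)
    return "\n".join(lines)

print(render(matrix, relations))
print(usage.most_common())
```

With these made-up figures Staff is touched most often, so it would be the first relation to examine when choosing file organizations and indexes.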
STEP B.2: CHOOSE FILE ORGANIZATIONS
The objective is to determine an efficient file organization for each base relation, that is, an efficient way to store the data.
File organizations include:
- Heap (unordered),
- Hash,
- Indexed Sequential Access Method (ISAM),
- B+-Tree, and
- Clusters.

HEAP (UNORDERED)
This is the simplest type of file organisation, in which records are placed on the disk in the same order as they are inserted.
It is known as unordered because the records are not sorted on the value of any field.
Insertion of data (records) is efficient, since each record is placed after the last one.

HEAP (UNORDERED) CONT’D
Retrieval: a linear search must be done in order to retrieve a record from the file.
Deletion: the page containing the record to be deleted is first retrieved, then the record is marked as deleted. The space it occupied is not re-used, so performance deteriorates over time. Re-organisation is needed to reclaim the space.

HEAP (UNORDERED) CONT’D
It is suitable as a storage structure when:
- bulk loading data into the database tables, since there is no overhead of calculating which page a record should be put on;
- the relation contains few pages;
- bulk retrieving, i.e. when every tuple in the relation is retrieved each time the relation is accessed;
- the relation has an additional access structure, e.g. an index key.

HASH
Records are placed on the disk (secondary storage) according to a hash function and not in a sequential pattern.
The hash function calculates the address of the page in which the record is to be stored, based on one or more fields in the record (the hash field/key).
Hash files may be called direct/random files because records appear randomly distributed across the available file space.
The problem with hash functions is that a unique address for each record is not guaranteed, because of collisions.
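The contrast between the two organizations can be sketched in a few lines. This is a toy in-memory model, not how a DBMS lays out pages: the HeapFile class, the key values and the page count NUM_PAGES are all invented for the illustration, and the hash function is assumed to be a simple remainder on an integer key.

```python
class HeapFile:
    """Toy heap file: records are appended; deletion only marks the record."""

    def __init__(self):
        self.records = []  # each entry is [key, value, deleted_flag]

    def insert(self, key, value):
        self.records.append([key, value, False])  # always placed after the last record

    def delete(self, key):
        for rec in self.records:
            if rec[0] == key and not rec[2]:
                rec[2] = True  # space is not reclaimed until re-organisation
                return True
        return False

    def find(self, key):
        """Linear search; returns (value, number of records examined)."""
        for examined, rec in enumerate(self.records, start=1):
            if rec[0] == key and not rec[2]:
                return rec[1], examined
        return None, len(self.records)


NUM_PAGES = 7  # assumed number of pages in the hash file

def page_for(key):
    """Assumed hash function: remainder of the key divided by the page count."""
    return key % NUM_PAGES


heap = HeapFile()
pages = {p: [] for p in range(NUM_PAGES)}
for key in (10, 17, 23, 4, 31):
    heap.insert(key, f"rec{key}")
    pages[page_for(key)].append(key)

value, examined = heap.find(31)  # found only after scanning every earlier record
heap.delete(17)                  # marked as deleted; the slot stays occupied

print(value, examined, len(heap.records))
print(page_for(31), pages[page_for(31)])  # keys 10, 17 and 31 collide on one page
```

The heap lookup has to scan, while the hash lookup goes straight to one page. The price is that several keys may land on the same page.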
HASH CONT’D
The hash function is chosen so that the records are as evenly distributed as possible throughout the file. Techniques include:
- Folding (uses arithmetic functions, e.g. addition)
- Division-remainder (MOD)
Collision: this occurs when the same address is generated for two or more records (records with the same address are known as synonyms).

HASH CONT’D
It is a good storage structure when:
- tuples are retrieved based on an exact match on the hash field;
- the access order is random.
However, it is not good when:
- the hash field is frequently updated;
- retrieval is based on:
  - a pattern match of the hash field,
  - a range of values for the hash field,
  - a field other than the hash field,
  - only a part of the hash field.

INDEXED SEQUENTIAL ACCESS METHOD (ISAM)
This is based on an index: a data structure that allows the DBMS to locate particular records in a file more quickly, thus speeding up the response to user queries (enhanced performance).
It follows the same analogy and search approach as a book index.
Each index item is ordered and consists of one or more references to where the required data item can be found; this eliminates the need to scan sequentially through the file to get the required record.

INDEXED SEQUENTIAL ACCESS METHOD (ISAM) CONT’D
An index structure is associated with a particular search key and contains records consisting of the key and the address of the logical record in the file containing the key value.
The file containing the logical records is known as the data file, and the file containing the index records is known as the indexing file.
The records in the indexing file are ordered based on the indexing field, usually a single attribute.

INDEXED SEQUENTIAL ACCESS METHOD (ISAM) CONT’D
Problems that are faced:
- The ISAM index is static (usually created when the file is created).
- Updates to the relation cause the indexing file to lose the access sequence.
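The idea of an indexing file holding (key, address) records ordered on the indexing field can be sketched as below. This is a simplified in-memory model with invented staff records, not ISAM itself: it builds an index record for every data record and uses binary search from Python's bisect module, which is enough to show why both exact-match and range retrieval avoid a sequential scan of the data file.

```python
from bisect import bisect_left, bisect_right

# Data file: logical records in arrival order (hypothetical staff records).
data_file = [
    ("S21", "White"),
    ("S05", "Lee"),
    ("S37", "Beech"),
    ("S14", "Ford"),
    ("S09", "Howe"),
]

# Indexing file: (key, address) records ordered on the indexing field.
index_file = sorted((rec[0], addr) for addr, rec in enumerate(data_file))
keys = [key for key, _ in index_file]

def lookup(key):
    """Exact key match: binary search of the index, then one access to the data file."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return data_file[index_file[i][1]]
    return None

def range_lookup(low, high):
    """Range of values: the matching index entries are contiguous because the index is ordered."""
    lo, hi = bisect_left(keys, low), bisect_right(keys, high)
    return [data_file[addr] for _, addr in index_file[lo:hi]]

print(lookup("S14"))
print(range_lookup("S05", "S14"))
```

Because the index here is built once from the data file, inserting a new record would require rebuilding or patching it. That is the static-index problem noted for ISAM.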
INDEXED SEQUENTIAL ACCESS METHOD (ISAM) CONT’D
ISAM supports data retrieval based on:
- Exact key match
- Pattern matching
- Range of values
- Part key specification

B+-TREE
This is a file organisation structure in which the data or indexes are held in a tree format.
A tree has a hierarchy of nodes:
- Parent (root and parent nodes)
- Child nodes (child and leaf)
Terminology:
- Depth of the tree
- Balanced tree (B-tree)
- Degree/order of the tree

B+-TREE CONT’D
[Figure: structure of a node, holding Key value 1 and Key value 2]
Each node in the tree is actually a page or a reference to an actual tuple.
The rules for a B+-tree include:
- If the root is not a leaf, it must have at least two children.
- For a tree of order n, each node (except the root and leaf nodes) must have between n/2 and n pointers and children.

B+-TREE CONT’D
- For a tree of order n, the number of key values in a leaf node must be between (n-1)/2 and (n-1).
- The number of key values contained in a nonleaf node is 1 less than the number of pointers.
- The tree must always be balanced: every path from the root to a leaf node has the same length.
- Leaf nodes are linked in order of key values.

B+-TREE CONT’D
This is a more reliable/adaptable storage structure than hashing, because it supports data retrieval based on:
- Exact value match
- Pattern matching
- Range of values
- Part key specification

B+-TREE CONT’D
It maintains the access key order even when a relation is updated, so retrieval based on the access key is more efficient than in ISAM.
Updating a relation does not impede performance; the index grows as the relation grows.
Note: the B+-tree is best suited when the relation is frequently updated; it contains one more level of index than ISAM.

CLUSTERS
Clusters are groups of one or more tables physically stored together because they share common columns and are often used together; this improves access time.
The cluster key refers to the related columns in the clustered tables.

[Figure: Staff table and Branch table stored together on the cluster key]

STEP B.3: CHOOSE INDEXES
The main objective of this step is to determine whether adding indexes will improve the performance of the system.
There are three types of indexes. They are distinguished by ordering, that is, by the ordering field, the ordering key and the non-ordering fields:
- Ordering: the sorting of the records in a file.
- Ordering field: the field(s) on which sorting is based.
- Ordering key: an ordering field that is also the primary key/the key of the file.
- Non-ordering field: all other fields in the file that are not the ordering field(s).

STEP B.3: CHOOSE INDEXES CONT’D
Types of indexes:
- Primary index: the data file is sequentially ordered by an ordering key field and the indexing field is built on the ordering key field, and thus is guaranteed to have a unique value for each record.
- Clustering index: the data file is sequentially ordered by a non-key field and the indexing field is built on this non-key field, and thus there can be more than one record corresponding to a value in the indexing field. The non-key field is known as the clustering attribute.

STEP B.3: CHOOSE INDEXES CONT’D
Types of indexes cont’d:
- Secondary index: an index defined on a non-ordering field of the data file.
Note: a file may have at most one primary or clustering index, but several secondary indexes.
An index may be sparse (index records for only some of the search key values) or dense (an index record for every search key value).

STEP B.3: CHOOSE INDEXES CONT’D
Choosing indexes can be done in two ways.
One approach is to keep tuples unordered and create as many secondary indexes as necessary.
Another approach to choosing indexes is to order the tuples in the relation by specifying a primary or clustering index.

STEP B.3: CHOOSE INDEXES CONT’D
In this case, choose the attribute for ordering or clustering the tuples as:
- the attribute that is used most often for join operations, as this makes the join operation more efficient, or
- the attribute that is used most often to access the tuples in a relation in order of that attribute (e.g. in SQL, the attribute used most in the ORDER BY clause).

STEP B.3: CHOOSE INDEXES CONT’D
If the ordering attribute chosen is a key of the relation, the index will be a primary index; otherwise, the index will be a clustering index.
Each relation can only have either a primary index or a clustering index.
Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be used to retrieve data more efficiently.

STEP B.3: CHOOSE INDEXES CONT’D
There is an overhead involved in the maintenance and use of secondary indexes that has to be balanced against the performance improvement gained when retrieving data. The overhead includes:
- adding an index record to every secondary index whenever a tuple is inserted;
- updating a secondary index when the corresponding tuple is updated;
- the increase in disk space needed to store the secondary index;
- possible performance degradation during query optimization, as all secondary indexes have to be considered.

STEP B.3: CHOOSE INDEXES – GUIDELINES FOR CHOOSING ‘WISH-LIST’
(1) Do not index small relations.
(2) Index the PK of a relation if it is not a key of the file organization.
(3) Add a secondary index to a FK if it is frequently accessed.
(4) Add a secondary index to any attribute that is heavily used as a secondary key.
(5) Add a secondary index on attributes that are involved in: selection or join criteria; ORDER BY; GROUP BY; and other operations involving sorting (such as UNION or DISTINCT).
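Whether a secondary index is actually used can be checked rather than assumed. The sketch below is an illustration with SQLite through Python's sqlite3 module, with invented data and an assumed index name idx_pfr_staffNo. It asks the optimizer for its plan before and after a secondary index is created on the FK staffNo of PropertyForRent. The exact wording of the plan text varies between SQLite versions, so the code only looks for the index name in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PropertyForRent (
    propertyNo INTEGER PRIMARY KEY,
    rent       REAL,
    staffNo    TEXT
);
""")

# Hypothetical data: 1000 properties spread over 50 staff members.
conn.executemany(
    "INSERT INTO PropertyForRent (rent, staffNo) VALUES (?, ?)",
    [(400.0 + i, f"S{i % 50}") for i in range(1000)],
)
conn.commit()

QUERY = "SELECT propertyNo, rent FROM PropertyForRent WHERE staffNo = ?"

def plan(sql):
    """Return the optimizer's plan for the query as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("S1",)).fetchall()
    return " ".join(row[-1] for row in rows)  # the last column holds the description

plan_before = plan(QUERY)  # expected to be a scan of the whole table
conn.execute("CREATE INDEX idx_pfr_staffNo ON PropertyForRent (staffNo)")
plan_after = plan(QUERY)   # expected to mention the new index

print(plan_before)
print(plan_after)
```

The index speeds up this selection, but every later insert into PropertyForRent now also has to add an index record. That is the overhead described in the notes.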
STEP B.3: CHOOSE INDEXES – GUIDELINES FOR CHOOSING ‘WISH-LIST’ CONT’D
(6) Add a secondary index on attributes involved in built-in functions.
(7) Add a secondary index on attributes that could result in an index-only plan.
(8) Avoid indexing an attribute or relation that is frequently updated.
(9) Avoid indexing an attribute if the query will retrieve a significant proportion of the tuples in the relation.
(10) Avoid indexing attributes that consist of long character strings.

STEP B.4: ESTIMATE DISK SPACE REQUIREMENTS
The aim of this step is to estimate the amount of disk space that will be required by the database, that is, how much space the implementation of the database will take on secondary storage.
Estimating disk usage is highly dependent on the DBMS and the hardware used to support the DBMS.
The estimate is based on:
- the size of the tuples in a relation;
- the number of tuples in a relation (taking future growth into account through a growth factor).

STEP C: DESIGN USER VIEWS
The objective is to design the user views that were identified during the Requirements Collection and Analysis stage of the relational database application lifecycle.
The views are developed from the local conceptual and logical designs produced in the previous phases.
Views are used to restrict user access to the database, e.g. in a multi-user environment.

STEP C: DESIGN USER VIEWS CONT’D
Advantages of views include:
- Convenience: users are presented with only that part of the DB that they need.
- Reduced complexity: views are used to simplify complex queries.
- Improved security: users are assigned access rights to only the parts of the DB that hold the data appropriate to their functions.
- Customization: the same base tables are seen differently by different users.
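A user view that presents only part of a base table can be sketched as follows, using SQLite through Python's sqlite3 module. The Staff columns, the sample rows and the view name StaffB3 are assumptions made for the illustration. SQLite has no GRANT statement, so the assignment of access rights on the view is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (
    staffNo  TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    salary   REAL NOT NULL,
    branchNo TEXT NOT NULL
);
INSERT INTO Staff VALUES
    ('S1', 'Ann',  30000, 'B3'),
    ('S2', 'Bob',  24000, 'B3'),
    ('S3', 'Cara', 18000, 'B5');

-- Users at branch B3 see only their own branch, and no salary column.
CREATE VIEW StaffB3 AS
    SELECT staffNo, name
    FROM Staff
    WHERE branchNo = 'B3';
""")

cursor = conn.execute("SELECT * FROM StaffB3 ORDER BY staffNo")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()

print(view_columns)
print(view_rows)
```

A second view over the same base table, for example one for branch B5, would give another group of users a different picture of the same data.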
STEP C: DESIGN USER VIEWS CONT’D
Further advantages of views include:
- Data integrity: using the WITH CHECK OPTION clause, no row that fails the condition in the WHERE clause of the view can be added to or updated in the base tables through the view.
- Currency: changes in the base tables are immediately reflected in the view.
- Data independence: the view gives a consistent and unchanging picture of the structure of the database even if the base tables are changed, e.g. by adding columns.

STEP D: DESIGN SECURITY MEASURES
The aim of this step is to design the security measures for the database as specified by the users, that is, how the security requirements will be realized.
There are two types of database security:
- System security: deals with access to and use of the database at the system level, i.e. restricting access by the use of usernames and passwords.
- Data security: deals with access to and use of (the actions that can be performed on) database objects such as tables/relations and views.

STEP E: MONITORING AND TUNING AN OPERATIONAL DATABASE
In many cases, when a system is being developed, it is fast enough because the test data is small. However, once the system becomes operational, the volume of operational data is far higher than that of the test data, and the system may slow down.
There are actions that can be taken to speed it up. This can be done at the application program level or at the database level.
At the database level, we aim at reducing computationally expensive operations. The most common expensive operation is the join.

CONTROLLED REDUNDANCY
Combining 1:1 relationships: where two tables were created but the entities have a one-to-one relationship, they can be merged into a single table so that accessing the associated data no longer needs a join.
Duplicating non-key columns in 1:* relationships: the duplicated column can be one of the most frequently accessed fields.
Additional modules in the application program have to be added to ensure the consistency of the duplicated fields.

CONTROLLED REDUNDANCY CONT’D
Duplicating columns in *:* relationships: this is done as in the one-to-many case, though the data is taken from two tables and put in the central table that was created from the relationship. A module to manage updates also has to be created.
Introducing repeating groups: this is commonly done for tables created from multivalued attributes. Frequently queried attributes can be incorporated into the main table. For example, the first three hobbies can be put in the person table so that a large proportion of joins are eliminated.

EXTRACT TABLES
Extract tables are completely denormalized tables that can store highly duplicated data.
Queries are run against the extract tables to serve the users. However, inputs and updates are made in the normalized database tables.
At regular intervals, data is transferred from the database and posted into an extract table. The extract table is therefore not always up to date.
It is desirable in cases where updates take place once and querying takes place continuously. It is also desirable where speed is required but a small inaccuracy is practically acceptable.

VIEWS
A view is a stored query accessible as a virtual table. A view is composed of the result set of the stored query. It is continuously updated and therefore has dynamic data.
Where many operations would require similar joins, the joins are defined once in a view and the operations query the view.
There are two types of views: updateable and non-updateable views. A view is updateable if (i) it is from a single table and (ii) it includes all the required fields that have no default values. If these conditions are not satisfied, it is a non-updateable view.
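The difference between an extract table and a view can be sketched as below, using SQLite through Python's sqlite3 module. The table contents and the names StaffPropertyView, StaffPropertyExtract and refresh_extract are assumptions made for the illustration. The view is evaluated against the base tables on every query, while the extract table is a copy that stays as it was until the next refresh.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE PropertyForRent (
    propertyNo TEXT PRIMARY KEY,
    staffNo    TEXT REFERENCES Staff(staffNo)
);
INSERT INTO Staff VALUES ('S1', 'Ann'), ('S2', 'Bob');
INSERT INTO PropertyForRent VALUES ('P1', 'S1'), ('P2', 'S1');

-- The join is written once, in the view.
CREATE VIEW StaffPropertyView AS
    SELECT s.staffNo, s.name, p.propertyNo
    FROM Staff s JOIN PropertyForRent p ON p.staffNo = s.staffNo;
""")

def refresh_extract():
    """Rebuild the denormalized extract table from the normalized tables."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS StaffPropertyExtract")
        conn.execute("CREATE TABLE StaffPropertyExtract AS SELECT * FROM StaffPropertyView")

def count(table):
    """Number of rows currently visible in the named table or view."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchall()[0][0]

refresh_extract()  # the periodic transfer
conn.execute("INSERT INTO PropertyForRent VALUES ('P3', 'S2')")  # update to the base table
conn.commit()

view_count = count("StaffPropertyView")      # 3: the view reflects the new row at once
stale_count = count("StaffPropertyExtract")  # 2: the extract still holds the old copy
refresh_extract()
fresh_count = count("StaffPropertyExtract")  # 3: accurate again after the next transfer

print(view_count, stale_count, fresh_count)
```

Queries against the extract table avoid the join entirely, which is where the speed comes from. Between refreshes the answers can lag behind the base tables.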