Chapter 5: Physical Database Design and Performance Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99258 chen@jepson.gonzaga.edu Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-1 Objectives • • • • • • • • Definition of terms Describe the physical database design process Choose storage formats for attributes Select appropriate file organizations Describe three types of file organization Describe indexes and their appropriate use Translate a database model into efficient structures Know when and how to use denormalization Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-2 Physical Database Design • Purpose–translate the logical description of data into the technical specifications for storing and retrieving data • Goal–create a design for storing data that will provide adequate performance and insure database integrity, security, and recoverability Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-3 Physical Design Process Inputs Normalized Volume Decisions relations Attribute data types estimates Physical record descriptions Attribute definitions Response time Data (doesn’t always match logical design) expectations security needs Leads to Backup/recovery needs Integrity expectations DBMS technology used Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems File organizations Indexes and database architectures Query optimization TM 5-4 Figure 5-1 Composite usage map (Pine Valley Furniture Company) Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 5 TM 5-5 What will we learn from Usage Map? • Why should Manufacutred_Part and Purchased_Part tables be separated? • Why should SUPPLIER and SUPPLIES tables be combined? Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-6 Figure 5-1 Composite usage map (Pine Valley Furniture Company) (cont.) Data (transaction) volumes Database access frequencies are estimated from “transaction volumes.” Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 7 TM 5-7 Figure 5-1 Composite usage map (Pine Valley Furniture Company) (cont.) Access Frequencies (per hour) Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-8 Figure 5-1 Composite usage map (Pine Valley Furniture Company) (cont.) Usage analysis: 14,000 purchased parts accessed per hour 8000 quotations accessed from these 140 purchased part accesses 7000 suppliers accessed from these 8000 quotation accesses Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-9 Figure 5-1 Composite usage map (Pine Valley Furniture Company) (cont.) Usage analysis: 7500 suppliers accessed per hour 4000 quotations accessed from these 7500 supplier accesses 4000 purchased parts accessed from these 4000 quotation accesses Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-10 Designing Fields • Field: smallest unit of data in database • Field design – Choosing data type – Coding, compression, encryption – Controlling data integrity Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 11 TM 5-11 Choosing Data Types • • • • • • • CHAR–fixed-length character VARCHAR2–variable-length character (memo) LONG–large number NUMBER–positive/negative number INEGER–positive/negative whole number DATE–actual date BLOB–binary large object (good for graphics, sound clips, etc.) Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 12 TM 5-12 Figure 5-2 Example code look-up table (Pine Valley Furniture Company) Code saves space, but costs an additional lookup to obtain actual value Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 13 TM 5-13 Field Data Integrity • Default value–assumed value if no explicit value – • • • e.g., s_state CHAR(2) DEFAULT ‘WI’; Range control–allowable value limitations (constraints or validation rules) Null value control–allowing or prohibiting empty fields Referential integrity–range control (and null value allowances) for foreign-key to primary-key matchups Sarbanes-Oxley Act (SOX) legislates importance of financial data integrity Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 14 TM 5-14 Handling Missing Data • Substitute an estimate of the missing value (e.g., using a formula) • Construct a report listing missing values • In programs, ignore missing data unless the value is significant (sensitivity testing) Triggers can be used to perform these operations Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-15 Physical Records • Physical Record: A group of fields stored in adjacent memory locations and retrieved together as a unit • Page: The amount of data read or written in one I/O operation • Blocking Factor: The number of physical records per page Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-16 Denormalization • Transforming normalized relations into unnormalized physical record specifications • Benefits: – Can improve performance (speed) by reducing number of table lookups (i.e. reduce number of necessary join queries) • Costs (due to data duplication) – Wasted storage space – Data integrity/consistency threats • Common denormalization opportunities – One-to-one relationship (Fig. 5-3) – Many-to-many relationship with attributes (Fig. 5-4) – Reference data (1:N relationship where 1-side has data not used in any other relationship) (Fig. 5-5) Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-17 Figure 5-3 A possible denormalization situation: two entities with oneto-one relationship mandatory Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems optional 18 TM 5-18 Figure 5-4 A possible denormalization situation: a many-to-many relationship with nonkey attributes Extra table access required Null description possible Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-19 Figure 5-5 A possible denormalization situation: reference data Extra table access required Data duplication Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-20 Partitioning • Horizontal Partitioning: Distributing the rows of a table into several separate files – Useful for situations where different users need access to different rows – Three types: Key Range Partitioning, Hash Partitioning, or Composite Partitioning • Vertical Partitioning: Distributing the columns of a table into several separate relations – Useful for situations where different users need access to different columns – The primary key must be repeated in each file • Combinations of Horizontal and Vertical Partitions often correspond with User Schemas (user views) Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-21 Partitioning (cont.) • Advantages of Partitioning: – Efficiency: Records used together are grouped together – Local optimization: Each partition can be optimized for performance – Security, recovery – Load balancing: Partitions stored on different disks, reduces contention – Take advantage of parallel processing capability • Disadvantages of Partitioning: – Inconsistent access speed: Slow retrievals across partitions – Complexity: Non-transparent partitioning – Extra space or update time: Duplicate data; access from multiple partitions Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-22 This figure shows how data are recorded on magnetic disks. Figure: Magnetic Disk Components Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 4-23 TM 5-23 Oracle 11g Horizontal Partitioning Methods • Range partitioning – Partitions defined by range of field values – Could result in unbalanced distribution of rows – Like-valued fields share partitions • Hash partitioning – Partitions defined via hash functions – Will guarantee balanced distribution of rows – Partition could contain widely varying valued fields • List partitioning – Based on predefined lists of values for the partitioning key • Composite partitioning – Combination of the other approaches Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-24 Designing Physical Files • Physical File: – A named portion of secondary memory allocated for the purpose of storing physical records – Tablespace–named set of disk storage elements in which physical files for database tables can be stored – Extent–contiguous section of disk space • Constructs to link two pieces of data: – Sequential storage – Pointers : field of data that can be used to locate related fields or records Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 25 TM 5-25 Figure 5-6 DBMS terminology in an Oracle 11g environment Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-26 File Organizations • Technique for physically arranging records of a file on secondary storage • Factors for selecting file organization: – – – – – – Fast data retrieval and throughput Efficient storage space utilization Protection from failure and data loss Minimizing need for reorganization Accommodating growth Security from unauthorized use • Types of file organizations – Sequential (Fig. 5-7a): the most efficient with storage space. – Indexed (Fig. 5-7b) : quick retrieval • Join Index, Fig 5-8 – Hashed (Fig. 5-7c) : easiest to update Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 27 TM 5-27 Figure 5-7a Sequential file organization 1 2 Records of the file are stored in sequence by the primary key field values If sorted – every insert or delete requires resort If not sorted Average time to find desired record = n/2 n Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 28 TM 5-28 Indexed File Organizations • Indexed File Organization: the storage of records either sequentially or nonsequentially with an index that allows software to locate individual records • Index: a table or other data structure used to determine in a file the location of records that satisfy some condition • Primary keys are automatically indexed • Other fields or combinations of fields can also be indexed; these are called secondary keys (or nonunique keys) • Oracle has a CREATE INDEX operation, and MS ACCESS allows indexes to be created for most field types Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-29 Figure 5-7b Indexed file organization uses a tree search Average time to find desired record = depth of the tree Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-30 Figure 5-7c Hashed file or index organization Hash algorithm Usually uses divisionremainder to determine record position. Records with same position are grouped in lists Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 31 TM 5-31 Figure 5-8 Join Indexes–speeds up join operations b) Join index for matching foreign key (FK) and primary key (PK) a) Join index for common non-key columns Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-32 Understand their general concepts Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 33 TM 5-33 Clustering Files • In some relational DBMSs, related records from different tables can be stored together in the same disk area • Useful for improving performance of join operations • Primary key records of the main table are stored adjacent to associated foreign key records of the dependent table • e.g. Oracle has a CREATE CLUSTER command Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 34 TM 5-34 Cluster • Clustering is a method of storing tables that are intimately related and often joined together into the same area on disk. For example, instead of the WORKER table being in one section of the disk and the WORKERSKILL table being somewhere else, their rows could be interleaved together in a single area, called a cluster. • The cluster key is the field (or fields) by which the table are usually joined in a query (e.g., Worker_ID in the example). To cluster tables, you must own the tables you are going to cluster together. Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-35 Fig. 4-10: Mapping an entity with a multivalued attribute (a) Employee entity type with multivalued attribute [Two relations WORKER created Worker_ID with one Name containing Age all of the Loding attributes {Skill, Ability} except the multiMultivalued attribute becomes a separate relation with foreign key valued attribute, (b) Mapping a multivalued attribute and the second WORKER one Lodging Name Age Worker_ID contains the pk (on WORKERSKILL the first one) and Ability Skill Worker_ID the multivalued One–to–many relationship between original entity and new relation attribute] Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-36 The following is the basic syntax of the create cluster command: CREATE CLUSTER cluster_name (field-1 datatype, Field-2 datatype …) CREATE CLUSTER WORKERandSKILL ( Judy VARCHAR2(25)); Cluster created. This creates a cluster (a space is set aside, as it would be for a table) with nothing in it. The use of Worker_Cluster for the cluster key is irrelevant; you’ll never use it again. Next, tables are created to be included in this cluster. Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-37 CREATE TABLE WORKER ( Worker_ID NUMBER(2), Name VARCHAR2 (25) NOT NULL, Age NUMBER (2), Lodging VARCHAR2 (15) ) CLUSTER WORKERandSKILL (Worker_ID) ; Prior to inserting rows into WORKER, you must create a cluster index: CREATE INDEX WORKERandSKILL_NDX ON CLUSTER WORKERandSKILL; Notice that nowhere does either statement say explicitly that the Worker_ID column goes into the July cluster key. The mentioned in their respective cluster statements. Multiple columns and cluster keys are matched first to first, second to second, third to third, and so on. Now a second table is added to the cluster. Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-38 CREATE TABLE WORKERSKILL ( Worker_ID NUMBER(2), Skill VARCHAR2(25) NOT NULL, Ability VARCHAR2 (15) NOT NULL ) CLUSTER WORKERandSKILL (Worker_ID) ; We will investigate how the tables are stored in the cluster. Recall that WORKER table has four fields: Worker_ID, Name, Age, and Lodging. The WORKERSKILL table has three fields: Worker_ID, Skill, and Ability. When these two tables are clustered, each unique Worker_ID is actually stored only once, in the cluster key. To each Worker_ID are attached the columns from both of these tables. The data from both of these tables is actually stored in a single location, almost as if the cluster where a big table containing data drawn from both of the tables that make it up. Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-39 Figure: How data is stored in clusters AGE LODGING 23 29 22 18 16 43 41 55 15 33 27 PAPA KING ROSE HILL CRANMER ROSE HILL MATTS WEITBROCHT ROSE HILL PAPA KING 21 25 33 35 32 67 26 25 NAME WORKER_ID ADAH TALBOT ANDREW DYE BART SARJEANT DICK JONES DONALD ROLLO ELBERT TALBOT GEORGE OSCAR GERHARDT KENTGEN HELEN BRANDT JED HOPKINS JOHN PEARSON 1 2 3 4 5 6 7 8 9 10 11 ROSE HILL ROSE HILL CRANMER WEITBROCHT MATTS MULLERS PALMER WALLBOM PAT LAVAY PETER LAWSON RICHARD KOCH ROLAND BRANDT VICTORIA LYNN WILFRED LOWELL 12 13 14 15 16 17 18 CRANMER WILLIAM SWING 19 MATTS ROSE HILL from WORKER table SKILL ABILITY PHOTO GOOD SMITHY EXCELLENT COMPUTER SLOW COMBINE DRIVER VERY FAST COMBINE DRIVR WOODCUTTER SMITHY GOOD AVERAGE GOOD SMITHY PHOTO COMPUTER PRECISE AVERAGE AVERAGE from WORKERSKILL table CLUSTER key Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-40 Figure: How data is stored in clusters AGE LODGING 23 29 22 18 16 43 41 55 15 33 27 PAPA KING ROSE HILL CRANMER ROSE HILL MATTS WEITBROCHT ROSE HILL PAPA KING 21 25 33 35 32 67 26 25 NAME WORKER_ID ADAH TALBOT ANDREW DYE BART SARJEANT DICK JONES DONALD ROLLO ELBERT TALBOT GEORGE OSCAR GERHARDT KENTGEN HELEN BRANDT JED HOPKINS JOHN PEARSON 1 2 3 4 5 6 7 8 9 10 11 ROSE HILL ROSE HILL CRANMER WEITBROCHT MATTS MULLERS PALMER WALLBOM PAT LAVAY PETER LAWSON RICHARD KOCH ROLAND BRANDT VICTORIA LYNN WILFRED LOWELL 12 13 14 15 16 17 18 CRANMER WILLIAM SWING 19 MATTS ROSE HILL SKILL ABILITY PHOTO GOOD SMITHY EXCELLENT COMPUTER SLOW COMBINE DRIVER VERY FAST COMBINE DRIVR WOODCUTTER SMITHY GOOD AVERAGE GOOD SMITHY PHOTO COMPUTER PRECISE AVERAGE AVERAGE Why we do not simply create a table like the one shown above? Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-41 Optimizing Query Processing A Quick Way to Speed Access to Your data • Indexes may be used to improve the efficiency of data searches to meet particular search criteria after the table has been in use for some time. Therefore, the ability to create indexes quickly and efficiently at any time is important. • SQL indexes can be created on the basis of any selected attributes Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-42 Optimizing Query Processing (conti.) For example: SELECT index_name FROM user_indexes; SELECT * FROM inventory WHERE inv_qoh>= 130; CREATE INDEX qoh_index ON inventory (inv_qoh); Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems Note that this statement defines an index called qoh_index for the qoh column in the inventory table. This index ensures that in the next example SQL only needs to look at row in the database that satisfy the WHERE condition, and is, therefore, quicker to produce an answer. TM 5-43 SQL> SELECT * 2 FROM inventory 3 WHERE inv_qoh>= 130; INV_ID ITEM_ID COLOR ---------- ---------- -------------------3 3 Khaki 4 3 Khaki 6 3 Navy 7 3 Navy 9 4 Eggplant 10 4 Eggplant 11 4 Eggplant 19 5 Bright Pink 20 5 Bright Pink 21 5 Bright Pink INV_SIZE INV_PRICE INV_QOH ---------- ---------- ---------S 29.95 150 M 29.95 147 S 29.95 139 M 29.95 137 S 59.95 135 M 59.95 168 L 59.95 187 10 15.99 148 11 15.99 137 12 15.99 134 10 rows selected. CREATE INDEX qoh_index ON inventory (inv_qoh); SQL> SELECT * 2 FROM inventory 3 WHERE inv_qoh>= 130; INV_ID ITEM_ID COLOR ---------- ---------- -------------------21 5 Bright Pink 9 4 Eggplant 7 3 Navy 20 5 Bright Pink 6 3 Navy 4 3 Khaki 19 5 Bright Pink 3 3 Khaki 10 4 Eggplant 11 4 Eggplant INV_SIZE INV_PRICE INV_QOH ---------- ---------- ---------12 15.99 134 S 59.95 135 M 29.95 137 11 15.99 137 S 29.95 139 M 29.95 147 10 15.99 148 S 29.95 150 M 59.95 168 L 59.95 187 10 rows selected. Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-44 Optimizing Query Processing (conti.) You may even create an index that prevents you from using a value that has been used before. Such a feature is especially useful when the index field (attribute) is a primary key whose values must not be duplicated: CREATE UNIQUE INDEX <index_field> ON <tablename> (the key field); SELECT index_name FROM user_indexes; DROP INDEX <index_name>; Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-45 Summary on Optimizing Query Processing (conti.) The indexes are defined so as to optimize the processing of SELECT statements. ; An index is never explicitly referenced in a SELECT statement; the syntax of SQL does not allow this; During the processing of a statement, SQL itself determines whether an existing index will be used; An index may be created or deleted at any time; When updating, inserting or deleting rows, SQL also maintains the indexes on the tables concerned. This means that, on the one hand, the processing time for SELECT statements is reduced, while, on the other hand, the processing time for update statements (such as INSERT, UPDATE and DELETE) can increase. Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-46 Rules for Using Indexes 1. Use on larger tables 2. Index the primary key of each table 3. Index search fields (fields frequently in WHERE clause) 4. Fields in SQL ORDER BY and GROUP BY commands 5. When there are >100 values but not when there are <30 values Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-47 Rules for Using Indexes (cont.) 6. 7. 8. 9. Avoid use of indexes for fields with long values; perhaps compress values first If key to index is used to determine location of record, use surrogate (like sequence nbr) to allow even spread in storage area DBMS may have limit on number of indexes per table and number of bytes per indexed field(s) Be careful of indexing attributes with null values; many DBMSs will not recognize null values in an index search Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-48 Query Optimization • Parallel query processing–possible when working in multiprocessor systems • Overriding automatic query optimization– allows for query writers to preempt the automated optimization • Picking data block size–factors to consider include: – Block contention, random and sequential row access speed, row size • Balancing I/O across disk controllers Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-49 Query Design Guidelines • • • • • • • 1. Understand how indexes are used 2. Keep optimization statistics up-to-date 3. Use compatible data types for fields and literals 4. Write simple queries 5. Break complex queries into multiple, simple parts 6. Don’t use one query inside another 7. Don’t combine a table with itself Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 50 TM 5-50 Query Design Guidelines (cont.) • • • • • 8. Create temporary tables for groups of queries 9. Combine update operations 10. Retrieve only the data you need 11. Don’t have the DBMS sort without an index 12. Learn! – Learning to Learn and Learning to Change! • 13. Consider the total query processing time for ad hoc queries Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems 51 TM 5-51 Database Architectures (Model) (Figure-Extra) Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems Legacy Systems Current Technology Data Warehouses multidimensional 52 TM 5-52 Exercise and Homework • Exercises (p.234) – #1 (index), 5 (storage), 8 (denormalization) • HW (p.235) [usage map] – #19 Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems TM 5-53