Overview of Physical Database Design
File Structures
Query Optimization
Index Selection
Additional Choices in Physical Database
Design
8-1
Overview of Physical Database
Design
Sequence of decision-making processes.
Decisions involve the storage level of a database: file structure and optimization choices.
Importance of providing detailed inputs and using tools that are integrated with
DBMS usage of file structures and optimization decisions
8-2
Closest to the hardware and operating system.
Physical records organized into files.
The number of physical record accesses is an important measure of database performance.
Difficult to predict physical record accesses
8-3
(a) Multiple LRs per PR
PR
LR
LR
LR
(b) LR split across PRs
LR
PR
PR
(c) PR containing LRs from different tables
PR
LR
T1
LR
T2
LR
T2
8-4
Application buffers:
Logical records (LRs)
LR
1
DBMS Buffers:
Logical records (LRs) inside of physical records (PRs) read
PR
1 read
LR
1
Operating system:
Physical records
(PRs) on disk
LR
2
LR
2
PR
1
LR
3 write write
PR
2
LR
4
LR
3
LR
4
PR
2
8-5
Minimize response time to access and change a database.
Minimizing computing resources is a substitute measure for response time.
Database resources
Physical record transfers
CPU operations
Communication network usage (distributed processing)
8-6
Main memory and disk space
Minimizing main memory and disk space can lead to high response times.
Useful to consider additional memory and disk space
8-7
Weight combines physical record accesses and CPU usage
Weight is usually close to 0
Mmany CPU operations can be performed in the time to perform one physical record transfer.
Combined Measure of Database Performance :
PRA + W * CPU-OP
8-8
Table design
(from logical database design)
Table profiles
Application profiles
Physical database design
File structures
Data placement
Data formatting
Denormalization
Knowledge about file structures and query optimization
8-9
Difficulty of physical database design
Number of decisions
Relationship among decisions
Detailed inputs
Complex environment
Uncertainty in predicting physical record accesses
8-10
Inputs of Physical Database
Design
Physical database design requires inputs specified in sufficient detail.
Table profiles used to estimate performance measures.
Application profiles provide importance of applications.
8-11
Tables
Number of rows
Number of physical records
Columns
Number of unique values
Distribution of values
Correlation of columns
Relationships: distribution of related rows
8-12
Specify distribution of values
Two dimensional graph
Column values on the x axis
Number of rows on the y axis
Variations
Equal-width: do not work well with skewed data
Equal-height: control error by the number of ranges
8-13
Salary Histogram (Equal Width)
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
10000 -
50000
50001 -
90000
90001 -
130000
130001 -
170000
170001 -
210000
210001 -
250000
250001 -
290000
290001 -
330000
330001 -
370000
370001 -
410000
Salary
8-14
Salary Histogram (Equal Width)
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
10000 -
50000
50001 -
90000
90001 -
130000
130001 -
170000
170001 -
210000
210001 -
250000
250001 -
290000
290001 -
330000
330001 -
370000
370001 -
410000
Salary
8-15
Application profiles summarize the queries, forms, and reports that access a database.
Typical Components of an Application Profile
Application Type Statistics
Query Frequency; distribution of parameter values
Form
Report
Frequency of insert, update, delete, and retrieval operations to the main form and the subform
Frequency; distribution of parameter values
8-16
Selecting among alternative file structures is one of the most important choices in physical database design.
In order to choose intelligently, you must understand characteristics of available file structures.
8-17
Simplest kind of file structure
Unordered: insertion order
Ordered: key order
Simple to maintain
Provide good performance for processing large numbers of records
8-18
PR
1
StdSSN Name ...
123-45-6789 Joe Abbot ...
788-45-1235 Sue Peters ...
122-44-8655 Pat Heldon ...
Insert a new logical record in the last physical record .
543-01-9593 Tom Adtkins
PR n 466-55-3299 Bill Harper ...
323-97-3787 Mary Grant ...
8-19
StdSSN Name ...
PR
1 122-44-8655 Pat Heldon ...
123-45-6789 Joe Abbot ...
323-97-3787 Mary Grant ...
Rearrange physical record to insert new logical record.
PR n 466-55-3299 Bill Harper ...
788-45-1235 Sue Peters ...
543-01-9593 Tom Adtkins
8-20
Support fast access by unique key value
Convert a key value into a physical record address
Mod function: typical hash function
Divisor: large prime number close to the file capacity
Physical record number: hash function plus the starting physical record number
8-21
.
StdSSN StdSSN Mod 97 PR Number
122448655 26 176
123456789 39 189
323973787 92 242
466553299
788451235
543019593
80
24
13
230
174
163
8-22
PR
163 543-01-9593 Tom Adtkins
PR
189 123-45-6789 Joe Abbot
PR
174 788-45-1235 Sue Peters
PR
230
466-55-3299 Bill Harper
PR
176
122-44-8655 Pat Heldon
PR
242
323-97-3787 Mary Grant
8-23
Home address = Hash function value + Base address
(122448946 mod 97 = 26) + 150
PR
176
Home address (176) is full.
...
122-44-8655 Pat Heldon
...
122-44-8752 Joe Bishop
...
122-44-8849 Mary Wyatt
...
122-44-8946 Tom Atkins adtkins
PR
177
Linear probe to find a
122-44-8753 Bill Hayes
...
...
8-24
Poor performance for sequential search
Reorganization when capacity exceeds
70%
Dynamic hash files reduce random search performance but eliminate periodic reorganization
8-25
A popular file structure supported by most
DBMSs.
Btree provides good performance on both sequential search and key search.
Btree characteristics:
Balanced
Bushy: multi-way tree
Block-oriented
Dynamic
8-26
Root node
Level
0
Level
1
Level
2
...
...
...
Leaf nodes
8-27
Key
1
Key
2
... Key d
... Key
2d
Pointer 1 Pointer 2 Pointer 3 Pointer d+1 ...
Pointer 2d+1
Each non root node contains at least half capacity
( d keys and d +1 pointers).
Each non root node contains at most full capacity
(2 d keys and 2 d +1 pointers).
8-28
(a) Initial Btree
20 45 70
22 28 35 40
(b) After inserting 55
20 45 70
50 60 65
22 28 35 40
(c) After inserting 58
50 55 60 65
20 45 58 70
Middle key value
(58) moved up
22 28 35 40 50 55
Node split
60 65
8-29
(a) Initial Btree
20 45 70
22 28 35
(b) After deleting 60
20 45 70
50 60 65
22 28 35
(c) After deleting 65
20 35 70
50 65
Borrowing a key
22 28 45 50
8-30
The height of Btree dominates the number of physical record accesses operation.
Logarithmic search cost
Upper bound of height: log function
Log base: minimum number of keys in a node
Insertion cost
Cost to locate the nearest key
Cost to change nodes
8-31
Provides improved performance on sequential and range searches.
In a B+tree, all keys are redundantly stored in the leaf nodes.
To ensure that physical records are not replaced, the B+tree variation is usually implemented.
8-32
Link to first leaf
Index set
Sequence set
...
...
8-33
Determining usage of an index for a query
Complexity of condition determines match.
Single column indexes: =, <, >, <=, >=, IN
<list of values>, BETWEEN, IS NULL,
LIKE ‘Pattern’ (meta character not the first symbol)
Composite indexes: more complex and restrictive rules
8-34
C2 BETWEEN 10 and 20: match on C2
C3 IN (10,20): match on C3
C1 <> 10: no match
C4 LIKE 'A%‘: match on C4
C4 LIKE '%A‘: no match
C2 = 5 AND C3 = 20 AND C1 = 10: matches on index with C1, C2, and C3
8-35
Can be useful for stable columns with few values
Bitmap:
String of bits: 0 (no match) or 1 (match)
One bit for each row
Bitmap index record
Column value
Bitmap
DBMS converts bit position into row identifier.
8-36
Faculty Table
RowId FacSSN … FacRank
1 098-55-1234 Asst
6
7
8
9
2
3
4
5
10
11
123-45-6789 Asst
456-89-1243 Assc
111-09-0245 Prof
931-99-2034 Asst
998-00-1245 Prof
287-44-3341 Assc
230-21-9432 Asst
321-44-5588 Prof
443-22-3356 Assc
559-87-3211 Prof
12 220-44-5688 Asst
Bitmap Index on FacRank
FacRank Bitmap
Asst
Assc
Prof
110010010001
001000100100
000101001010
8-37
Bitmap identifies rows of a related table.
Represents a precomputed join
Can define for a join column or a non-join column
Typically used in query dominated environments such as data warehouses
(Chapter 16)
8-38
Sequential search
Key search
Range search
Usage
Unordered Ordered Hash
Y
Linear
N
Primary only
Y
Linear
Y
Primary only
Extra PRs Y
Constant time
N
Primary or secondary
B+tree
Primary or secondary
Bitmap
N
Logarithmic Y
Y Y
Secondary only
8-39
Query optimizer determines implementation of queries.
Major improvement in software productivity
Improve performance with knowledge about the optimization process
8-40
Query
Syntax and semantic analysis
Parsed query
Query transformation
Relational algebra query
Access plan interpretation
A ccess
Plan
Access plan evaluation
A ccess
Plan
Code generation
Query results Machine code
8-41
Optimizer evaluates thousands of access plans
Access plans vary by join order, file structures, and join algorithm.
Some optimizers can use multiple indexes on the same table.
Access plan evaluation can consume significant resources
8-42
Sort Merge Join
Sort(FacSSN)
Sort Merge Join
BTree(FacSSN)
Faculty
Btree(OfferNo)
Enrollment
BTree(OfferNo)
Offering
8-43
Sort Merge Join
Sort(OfferNo) BTree(OfferNo)
Sort Merge Join Enrollment
BTree(FacSSN)
Faculty
BTree(FacSSN)
Offering
8-44
Nested loops: inner and outer loops; universal
Sort merge: join column sorting or indexes
Hybrid join: combination of nested loops and sort merge
Hash join: uses internal hash table
Star join: uses bitmap join indexes
8-45
Monitor poorly performing access plans
Look for problems involving table profiles and query coding practices
Use hints carefully to improve results
Override optimizer judgment
Cover file structures, join algorithms, and join orders
Use as a last result
8-46
Detailed and current statistics needed
Beware of uniform value assumption and independence assumption
Use hints to overcome optimization blind spots
Estimation of result size for parameterized queries
Correlated columns: multiple index access may be useful
8-47
Avoid functions on indexable columns
Eliminate unnecessary joins
For conditions on join columns, test the condition on the parent table.
Do not use the HAVING clause for row conditions.
Avoid repetitive binding of complex queries
Beware of queries that use complex views
8-48
Most important decision
Difficult decision
Choice of clustered and nonclustered indexes
8-49
Index set
<Abe, 1> <Adam, 2> <Bill, 4> <Bob, 3>
1. Abe, Denver, ...
2. Adam, Boulder, ...
3. Bob, Denver, ...
4. Bill, Aspen, ...
Sequence set
<Carl, 5> <Carol, 6> ...
5. Carl, Denver, ...
6. Carol, Golden, ...
Physical records containing rows
8-50
Index set
<Abe, 6> <Adam, 2>
1. Carl,Denver, ...
2. Adam, Boulder, ...
<Bill, 4> <Bob, 5>
3. Carol, Golden, ...
4. Bill, Aspen, ...
Sequence set
<Carl, 1> <Carol, 3> ...
5. Bob, Denver, ...
6. Abe, Denver, ...
Physical records containing rows
8-51
SQL statements and weights
Table profiles
Index
Selection
Clustered index choices
Nonclustered index choices
8-52
Balance retrieval against update performance
Nonclustering index usage:
Few rows satisfy the condition in the query
Join column usage if a small number of rows result in child table
Clustering index usage:
Larger number of rows satisfy a condition than for nonclustering index
Use in sort merge join algorithm to avoid sorting
More expensive to maintain
8-53
Application weights are difficult to specify.
Distribution of parameter values needed
Behavior of the query optimization component must be known.
The number of choices is large.
Index choices can be interrelated.
8-54
Rule 1 : A primary key is a good candidate for a clustering index.
Rule 2 : To support joins, consider indexes on foreign keys.
Rule 3 : A column with many values may be a good choice for a non-clustering index if it is used in equality conditions.
Rule 4 : A column used in highly selective range conditions is a good candidate for a non-clustering index.
Rule 5 : A combination of columns used together in query conditions may be good candidates for nonclustering indexes if the joint conditions return few rows, the DBMS optimizer supports multiple index access, and the columns are stable.
8-55
Rule 6 : A frequently updated column is not a good index candidate.
Rule 7 : Volatile tables (lots of insertions and deletions) should not have many indexes.
Rule 8 : Stable columns with few values are good candidates for bitmap indexes if the columns appear in WHERE conditions.
Rule 9 : Avoid indexes on combinations of columns. Most optimization components can use multiple indexes on the same table.
8-56
To create the indexes, the CREATE
INDEX statement can be used.
The word following the INDEX keyword is the name of the index.
CREATE INDEX is not part of SQL:2003.
Examples
CREATE INDEX StdGPAIndex ON Student (StdGPA)
CREATE UNIQUE INDEX OfferNoIndex ON Offering
(OfferNo
)
CREATE BITMAP INDEX OffYearIndex ON Offering
(OffYear)
8-57
Additional choice in physical database design
Denormalization combines tables so that they are easier to query.
Use carefully because normalized designs have important advantages.
8-58
Better update performance
Require less coding to enforce integrity constraints
Support more indexes to improve query performance
8-59
Collection of associated values.
Normalization rules force repeating groups to be stored in an M table separate from an associated one table.
If a repeating group is always accessed with its associated parent table, denormalization may be a reasonable alternative.
8-60
Normalized Denormalized
Te rr itory
TerrNo
TerrName
TerrLoc
1
M
Te rr itorySale s
TerrNo
TerrQtr
TerrSales
Te rr itory
TerrNo
TerrName
TerrLoc
Qtr1Sales
Qtr2Sales
Qtr3Sales
Qtr4Sales
8-61
Denormalizing a Generalization
Hierarchy
Normalized Denormalized
Em p
EmpNo
EmpName
EmpHireDate
1
Em p
EmpNo
EmpName
EmpHireDate
EmpSalary
EmpRate
1
SalaryEm p
EmpNo
EmpSalary
1
HourlyEm p
EmpNo
EmpRate
8-62
Normalized
De pt
DeptNo
DeptName
DeptLoc
1
Em p
M
EmpNo
EmpName
DeptNo
Denormalized
De pt
DeptNo
DeptName
DeptLoc
1
M
Em p
EmpNo
EmpName
DeptNo
DeptName
8-63
Record formatting decisions involve compression and derived data.
Compression is a trade-off between inputoutput and processing effort.
Derived data is a trade-offs between query and update operations.
8-64
Storing Derived Data to
Improve Query Performance
Or der
OrdNo
OrdDate
OrdA mt
1 derived data
Pr oduct
ProdNo
ProdName
ProdPrice
1
M
Or dLine
OrdNo
ProdNo
Qty
M
8-65
Parallel processing can improve retrieval and modification performance.
Retrieving many records can be improved by reading physical records in parallel.
Many DBMSs provide parallel processing capabilities with RAID systems.
RAID is a collection of disks (a disk array) that operates as a single disk.
8-66
Striping in RAID
Storage Systems
Each stripe consists of four adjacent physical records. Three stripes are shown separated by dotted lines.
PR1 PR2 PR3 PR4
PR5 PR6 PR7 PR8
PR9 PR10 PR11 PR12
8-67
Transaction processing: add computing capacity and improve transaction design.
Data warehouses: add computing capacity and store derived data.
Distributed databases: allocate processing and data to various computing locations.
8-68
Goal: minimize response time
Constraints: disk space, memory, communication bandwidth
Table profiles and application profiles must be specified in sufficient detail.
Environment: file structures and query optimization
Monitor and possibly improve query optimization results
Index selection: most important decision
Other techniques: denormalization, record formatting, and parallel processing
8-69