Chapter 10 of Database Application Development and Design

advertisement

Outline

 Overview of Physical Database Design

 File Structures

 Query Optimization

 Index Selection

 Additional Choices in Physical Database

Design

8-1

Overview of Physical Database

Design

 Sequence of decision-making processes.

 Decisions involve the storage level of a database: file structure and optimization choices.

 Importance of providing detailed inputs and using tools that are integrated with

DBMS usage of file structures and optimization decisions

8-2

Storage Level of Databases

 Closest to the hardware and operating system.

 Physical records organized into files.

 The number of physical record accesses is an important measure of database performance.

 Difficult to predict physical record accesses

8-3

Logical Records (LR) and

Physical Records (PR)

(a) Multiple LRs per PR

PR

LR

LR

LR

(b) LR split across PRs

LR

PR

PR

(c) PR containing LRs from different tables

PR

LR

T1

LR

T2

LR

T2

8-4

Transferring Physical Records

Application buffers:

Logical records (LRs)

LR

1

DBMS Buffers:

Logical records (LRs) inside of physical records (PRs) read

PR

1 read

LR

1

Operating system:

Physical records

(PRs) on disk

LR

2

LR

2

PR

1

LR

3 write write

PR

2

LR

4

LR

3

LR

4

PR

2

8-5

Objectives

 Minimize response time to access and change a database.

 Minimizing computing resources is a substitute measure for response time.

 Database resources

 Physical record transfers

 CPU operations

 Communication network usage (distributed processing)

8-6

Constraints

 Main memory and disk space

 Minimizing main memory and disk space can lead to high response times.

 Useful to consider additional memory and disk space

8-7

Combined Measure of

Database Performance

 Weight combines physical record accesses and CPU usage

 Weight is usually close to 0

 Mmany CPU operations can be performed in the time to perform one physical record transfer.

Combined Measure of Database Performance :

PRA + W * CPU-OP

8-8

Inputs, Outputs, and

Environment

Table design

(from logical database design)

Table profiles

Application profiles

Physical database design

File structures

Data placement

Data formatting

Denormalization

Knowledge about file structures and query optimization

8-9

Difficulty of physical database design

 Number of decisions

 Relationship among decisions

 Detailed inputs

 Complex environment

 Uncertainty in predicting physical record accesses

8-10

Inputs of Physical Database

Design

 Physical database design requires inputs specified in sufficient detail.

 Table profiles used to estimate performance measures.

 Application profiles provide importance of applications.

8-11

Table Profile

 Tables

 Number of rows

 Number of physical records

 Columns

 Number of unique values

 Distribution of values

 Correlation of columns

 Relationships: distribution of related rows

8-12

Histogram

 Specify distribution of values

 Two dimensional graph

 Column values on the x axis

 Number of rows on the y axis

 Variations

 Equal-width: do not work well with skewed data

 Equal-height: control error by the number of ranges

8-13

Equal-Width Histogram

Salary Histogram (Equal Width)

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

10000 -

50000

50001 -

90000

90001 -

130000

130001 -

170000

170001 -

210000

210001 -

250000

250001 -

290000

290001 -

330000

330001 -

370000

370001 -

410000

Salary

8-14

Equal-Height Histogram

Salary Histogram (Equal Width)

9000

8000

7000

6000

5000

4000

3000

2000

1000

0

10000 -

50000

50001 -

90000

90001 -

130000

130001 -

170000

170001 -

210000

210001 -

250000

250001 -

290000

290001 -

330000

330001 -

370000

370001 -

410000

Salary

8-15

Application profiles

 Application profiles summarize the queries, forms, and reports that access a database.

Typical Components of an Application Profile

Application Type Statistics

Query Frequency; distribution of parameter values

Form

Report

Frequency of insert, update, delete, and retrieval operations to the main form and the subform

Frequency; distribution of parameter values

8-16

File structures

 Selecting among alternative file structures is one of the most important choices in physical database design.

 In order to choose intelligently, you must understand characteristics of available file structures.

8-17

Sequential Files

 Simplest kind of file structure

 Unordered: insertion order

 Ordered: key order

 Simple to maintain

 Provide good performance for processing large numbers of records

8-18

Unordered Sequential File

PR

1

StdSSN Name ...

123-45-6789 Joe Abbot ...

788-45-1235 Sue Peters ...

122-44-8655 Pat Heldon ...

Insert a new logical record in the last physical record .

543-01-9593 Tom Adtkins

PR n 466-55-3299 Bill Harper ...

323-97-3787 Mary Grant ...

8-19

Ordered Sequential File

StdSSN Name ...

PR

1 122-44-8655 Pat Heldon ...

123-45-6789 Joe Abbot ...

323-97-3787 Mary Grant ...

Rearrange physical record to insert new logical record.

PR n 466-55-3299 Bill Harper ...

788-45-1235 Sue Peters ...

543-01-9593 Tom Adtkins

8-20

Hash Files

 Support fast access by unique key value

 Convert a key value into a physical record address

 Mod function: typical hash function

 Divisor: large prime number close to the file capacity

 Physical record number: hash function plus the starting physical record number

8-21

Example: Hash Function

Calculations for StdSSN Key

.

StdSSN StdSSN Mod 97 PR Number

122448655 26 176

123456789 39 189

323973787 92 242

466553299

788451235

543019593

80

24

13

230

174

163

8-22

Hash File after Insertions

PR

163 543-01-9593 Tom Adtkins

PR

189 123-45-6789 Joe Abbot

PR

174 788-45-1235 Sue Peters

PR

230

466-55-3299 Bill Harper

PR

176

122-44-8655 Pat Heldon

PR

242

323-97-3787 Mary Grant

8-23

Collision Handling Example

Home address = Hash function value + Base address

(122448946 mod 97 = 26) + 150

PR

176

Home address (176) is full.

...

122-44-8655 Pat Heldon

...

122-44-8752 Joe Bishop

...

122-44-8849 Mary Wyatt

...

122-44-8946 Tom Atkins adtkins

PR

177

Linear probe to find a

122-44-8753 Bill Hayes

...

...

8-24

Hash File Limitations

 Poor performance for sequential search

 Reorganization when capacity exceeds

70%

 Dynamic hash files reduce random search performance but eliminate periodic reorganization

8-25

Multi-Way Tree (Btrees) Files

 A popular file structure supported by most

DBMSs.

 Btree provides good performance on both sequential search and key search.

 Btree characteristics:

 Balanced

 Bushy: multi-way tree

 Block-oriented

 Dynamic

8-26

Structure of a Btree of Height 3

Root node

Level

0

Level

1

Level

2

...

...

...

Leaf nodes

8-27

Btree Node Containing Keys and Pointers

Key

1

Key

2

... Key d

... Key

2d

Pointer 1 Pointer 2 Pointer 3 Pointer d+1 ...

Pointer 2d+1

Each non root node contains at least half capacity

( d keys and d +1 pointers).

Each non root node contains at most full capacity

(2 d keys and 2 d +1 pointers).

8-28

Btree Insertion Examples

(a) Initial Btree

20 45 70

22 28 35 40

(b) After inserting 55

20 45 70

50 60 65

22 28 35 40

(c) After inserting 58

50 55 60 65

20 45 58 70

Middle key value

(58) moved up

22 28 35 40 50 55

Node split

60 65

8-29

Btree Deletion Examples

(a) Initial Btree

20 45 70

22 28 35

(b) After deleting 60

20 45 70

50 60 65

22 28 35

(c) After deleting 65

20 35 70

50 65

Borrowing a key

22 28 45 50

8-30

Cost of Operations

 The height of Btree dominates the number of physical record accesses operation.

 Logarithmic search cost

 Upper bound of height: log function

 Log base: minimum number of keys in a node

 Insertion cost

 Cost to locate the nearest key

 Cost to change nodes

8-31

B+Tree

 Provides improved performance on sequential and range searches.

 In a B+tree, all keys are redundantly stored in the leaf nodes.

 To ensure that physical records are not replaced, the B+tree variation is usually implemented.

8-32

B+tree Illustration

Link to first leaf

Index set

Sequence set

...

...

8-33

Index Matching

 Determining usage of an index for a query

 Complexity of condition determines match.

 Single column indexes: =, <, >, <=, >=, IN

<list of values>, BETWEEN, IS NULL,

LIKE ‘Pattern’ (meta character not the first symbol)

 Composite indexes: more complex and restrictive rules

8-34

Index Matching Examples

 C2 BETWEEN 10 and 20: match on C2

 C3 IN (10,20): match on C3

 C1 <> 10: no match

 C4 LIKE 'A%‘: match on C4

 C4 LIKE '%A‘: no match

 C2 = 5 AND C3 = 20 AND C1 = 10: matches on index with C1, C2, and C3

8-35

Bitmap Index

 Can be useful for stable columns with few values

 Bitmap:

 String of bits: 0 (no match) or 1 (match)

 One bit for each row

 Bitmap index record

 Column value

 Bitmap

 DBMS converts bit position into row identifier.

8-36

Bitmap Index Example

Faculty Table

RowId FacSSN … FacRank

1 098-55-1234 Asst

6

7

8

9

2

3

4

5

10

11

123-45-6789 Asst

456-89-1243 Assc

111-09-0245 Prof

931-99-2034 Asst

998-00-1245 Prof

287-44-3341 Assc

230-21-9432 Asst

321-44-5588 Prof

443-22-3356 Assc

559-87-3211 Prof

12 220-44-5688 Asst

Bitmap Index on FacRank

FacRank Bitmap

Asst

Assc

Prof

110010010001

001000100100

000101001010

8-37

Bitmap Join Index

 Bitmap identifies rows of a related table.

 Represents a precomputed join

 Can define for a join column or a non-join column

 Typically used in query dominated environments such as data warehouses

(Chapter 16)

8-38

Summary of File Structures

Sequential search

Key search

Range search

Usage

Unordered Ordered Hash

Y

Linear

N

Primary only

Y

Linear

Y

Primary only

Extra PRs Y

Constant time

N

Primary or secondary

B+tree

Primary or secondary

Bitmap

N

Logarithmic Y

Y Y

Secondary only

8-39

Query Optimization

 Query optimizer determines implementation of queries.

 Major improvement in software productivity

 Improve performance with knowledge about the optimization process

8-40

Translation Tasks

Query

Syntax and semantic analysis

Parsed query

Query transformation

Relational algebra query

Access plan interpretation

A ccess

Plan

Access plan evaluation

A ccess

Plan

Code generation

Query results Machine code

8-41

Access Plan Evaluation

 Optimizer evaluates thousands of access plans

 Access plans vary by join order, file structures, and join algorithm.

 Some optimizers can use multiple indexes on the same table.

 Access plan evaluation can consume significant resources

8-42

Access Plan Example 1

Sort Merge Join

Sort(FacSSN)

Sort Merge Join

BTree(FacSSN)

Faculty

Btree(OfferNo)

Enrollment

BTree(OfferNo)

Offering

8-43

Access Plan Example 2

Sort Merge Join

Sort(OfferNo) BTree(OfferNo)

Sort Merge Join Enrollment

BTree(FacSSN)

Faculty

BTree(FacSSN)

Offering

8-44

Join Algorithms

 Nested loops: inner and outer loops; universal

 Sort merge: join column sorting or indexes

 Hybrid join: combination of nested loops and sort merge

 Hash join: uses internal hash table

 Star join: uses bitmap join indexes

8-45

Improving Optimization Results

 Monitor poorly performing access plans

 Look for problems involving table profiles and query coding practices

 Use hints carefully to improve results

 Override optimizer judgment

 Cover file structures, join algorithms, and join orders

 Use as a last result

8-46

Table Profile Deficiencies

 Detailed and current statistics needed

 Beware of uniform value assumption and independence assumption

 Use hints to overcome optimization blind spots

 Estimation of result size for parameterized queries

 Correlated columns: multiple index access may be useful

8-47

Query Coding Practices

 Avoid functions on indexable columns

 Eliminate unnecessary joins

 For conditions on join columns, test the condition on the parent table.

 Do not use the HAVING clause for row conditions.

 Avoid repetitive binding of complex queries

 Beware of queries that use complex views

8-48

Index Selection

 Most important decision

 Difficult decision

 Choice of clustered and nonclustered indexes

8-49

Clustering Index Example

Index set

<Abe, 1> <Adam, 2> <Bill, 4> <Bob, 3>

1. Abe, Denver, ...

2. Adam, Boulder, ...

3. Bob, Denver, ...

4. Bill, Aspen, ...

Sequence set

<Carl, 5> <Carol, 6> ...

5. Carl, Denver, ...

6. Carol, Golden, ...

Physical records containing rows

8-50

Nonclustering Index Example

Index set

<Abe, 6> <Adam, 2>

1. Carl,Denver, ...

2. Adam, Boulder, ...

<Bill, 4> <Bob, 5>

3. Carol, Golden, ...

4. Bill, Aspen, ...

Sequence set

<Carl, 1> <Carol, 3> ...

5. Bob, Denver, ...

6. Abe, Denver, ...

Physical records containing rows

8-51

Inputs and Outputs of Index

Selection

SQL statements and weights

Table profiles

Index

Selection

Clustered index choices

Nonclustered index choices

8-52

Trade-offs in Index Selection

 Balance retrieval against update performance

 Nonclustering index usage:

 Few rows satisfy the condition in the query

 Join column usage if a small number of rows result in child table

 Clustering index usage:

 Larger number of rows satisfy a condition than for nonclustering index

 Use in sort merge join algorithm to avoid sorting

 More expensive to maintain

8-53

Difficulties of Index Selection

 Application weights are difficult to specify.

 Distribution of parameter values needed

 Behavior of the query optimization component must be known.

 The number of choices is large.

 Index choices can be interrelated.

8-54

Selection Rules I

Rule 1 : A primary key is a good candidate for a clustering index.

Rule 2 : To support joins, consider indexes on foreign keys.

Rule 3 : A column with many values may be a good choice for a non-clustering index if it is used in equality conditions.

Rule 4 : A column used in highly selective range conditions is a good candidate for a non-clustering index.

Rule 5 : A combination of columns used together in query conditions may be good candidates for nonclustering indexes if the joint conditions return few rows, the DBMS optimizer supports multiple index access, and the columns are stable.

8-55

Selection Rules II

Rule 6 : A frequently updated column is not a good index candidate.

Rule 7 : Volatile tables (lots of insertions and deletions) should not have many indexes.

Rule 8 : Stable columns with few values are good candidates for bitmap indexes if the columns appear in WHERE conditions.

Rule 9 : Avoid indexes on combinations of columns. Most optimization components can use multiple indexes on the same table.

8-56

Index Creation

 To create the indexes, the CREATE

INDEX statement can be used.

 The word following the INDEX keyword is the name of the index.

 CREATE INDEX is not part of SQL:2003.

 Examples

CREATE INDEX StdGPAIndex ON Student (StdGPA)

CREATE UNIQUE INDEX OfferNoIndex ON Offering

(OfferNo

)

CREATE BITMAP INDEX OffYearIndex ON Offering

(OffYear)

8-57

Denormalization

 Additional choice in physical database design

 Denormalization combines tables so that they are easier to query.

 Use carefully because normalized designs have important advantages.

8-58

Normalized designs

 Better update performance

 Require less coding to enforce integrity constraints

 Support more indexes to improve query performance

8-59

Repeating Groups

 Collection of associated values.

 Normalization rules force repeating groups to be stored in an M table separate from an associated one table.

 If a repeating group is always accessed with its associated parent table, denormalization may be a reasonable alternative.

8-60

Denormalizing a Repeating

Group

Normalized Denormalized

Te rr itory

TerrNo

TerrName

TerrLoc

1

M

Te rr itorySale s

TerrNo

TerrQtr

TerrSales

Te rr itory

TerrNo

TerrName

TerrLoc

Qtr1Sales

Qtr2Sales

Qtr3Sales

Qtr4Sales

8-61

Denormalizing a Generalization

Hierarchy

Normalized Denormalized

Em p

EmpNo

EmpName

EmpHireDate

1

Em p

EmpNo

EmpName

EmpHireDate

EmpSalary

EmpRate

1

SalaryEm p

EmpNo

EmpSalary

1

HourlyEm p

EmpNo

EmpRate

8-62

Codes and Meanings

Normalized

De pt

DeptNo

DeptName

DeptLoc

1

Em p

M

EmpNo

EmpName

DeptNo

Denormalized

De pt

DeptNo

DeptName

DeptLoc

1

M

Em p

EmpNo

EmpName

DeptNo

DeptName

8-63

Record Formatting

 Record formatting decisions involve compression and derived data.

 Compression is a trade-off between inputoutput and processing effort.

 Derived data is a trade-offs between query and update operations.

8-64

Storing Derived Data to

Improve Query Performance

Or der

OrdNo

OrdDate

OrdA mt

1 derived data

Pr oduct

ProdNo

ProdName

ProdPrice

1

M

Or dLine

OrdNo

ProdNo

Qty

M

8-65

Parallel Processing

 Parallel processing can improve retrieval and modification performance.

 Retrieving many records can be improved by reading physical records in parallel.

 Many DBMSs provide parallel processing capabilities with RAID systems.

 RAID is a collection of disks (a disk array) that operates as a single disk.

8-66

Striping in RAID

Storage Systems

Each stripe consists of four adjacent physical records. Three stripes are shown separated by dotted lines.

PR1 PR2 PR3 PR4

PR5 PR6 PR7 PR8

PR9 PR10 PR11 PR12

8-67

Other Ways to Improve

Performance

 Transaction processing: add computing capacity and improve transaction design.

 Data warehouses: add computing capacity and store derived data.

 Distributed databases: allocate processing and data to various computing locations.

8-68

Summary

 Goal: minimize response time

 Constraints: disk space, memory, communication bandwidth

 Table profiles and application profiles must be specified in sufficient detail.

 Environment: file structures and query optimization

 Monitor and possibly improve query optimization results

Index selection: most important decision

 Other techniques: denormalization, record formatting, and parallel processing

8-69

Download