Uploaded by Ssendi Samuel

Physical Design 2

DATABASE DESIGN
Physical Database Design for
Relational Databases
11/30/2023
Physical Design
COMPARISON OF LOGICAL AND PHYSICAL
DATABASE DESIGN
- Logical database design is largely independent
of implementation details, but dependent on a
target data model.
- Logical database design is concerned with the
what; physical database design is concerned with
the how.
- Sources of information for the physical design process
include the global logical data model and the
documentation that describes the model (the
output of the logical design phase).
- The physical database designer must know how the
computer system hosting the DBMS operates, as
well as the functionality of the target DBMS.
PHYSICAL DATABASE DESIGN
The process of producing a description of the implementation
of the database on secondary storage; it describes the base
relations, file organizations, and indexes used to achieve
efficient access to the data, and any associated
integrity constraints and security measures.

Note that the physical design process is not an
independent process; there is feedback between it and
the conceptual, logical and application design.
OVERVIEW OF PHYSICAL DATABASE
DESIGN METHODOLOGY
This phase of database design is broken down into six
steps:
a) Translate global logical data model for target DBMS
b) Design physical representation
c) Design user views
d) Design security mechanisms
e) Consider the introduction of controlled redundancy
f) Monitor and tune the operational system
STEP A: TRANSLATE GLOBAL LOGICAL DATA
MODEL FOR TARGET DBMS
The main aim of this step is to produce a relational
database schema that can be implemented in the target DBMS
from the global logical data model.

Need to know the functionality of the target DBMS, such as how to
create base relations and whether the system supports the
definition of:
- PKs, FKs, and AKs;
- required data, i.e. whether the system supports NOT NULL;
- domains;
- relational integrity constraints;
- enterprise constraints.

This step is made up of three steps, namely:
- Design base relations
- Design representation of derived data
- Design enterprise constraints

Documentation after or within each step is
paramount; state why a particular approach was
selected among the alternatives available
(if any).
STEP A.1: DESIGN BASE RELATIONS
The objective of this step is to decide how to
represent the base relations identified in the global
logical data model in the target DBMS.
For each relation, the following have already been defined:
- the name of the relation;
- a list of simple attributes in brackets;
- the PK and, where appropriate, AKs and FKs;
- a list of any derived attributes and how they
should be computed;
- referential integrity constraints for any FKs
identified.
STEP A.1: DESIGN BASE RELATIONS CONT’D
For each attribute, need to define:
- its domain, consisting of a data type, length, and
any constraints on the domain;
- an optional default value for the attribute;
- whether the attribute can hold nulls.
After defining these, decide how to implement the
base relations; this depends on the target DBMS. Three
implementation routes are considered: SQL,
Microsoft Access and Oracle.
The design of the base relations is then documented.
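The attribute decisions above (domain, default value, nulls) can be sketched as table definitions. This is only an illustration: SQLite, driven from Python, stands in for the target DBMS, and the Staff and PropertyForRent columns, lengths, ranges and defaults shown are assumptions, not taken from the slides.

```python
import sqlite3

# Hypothetical columns and domains; SQLite stands in for the target DBMS.
DDL = """
CREATE TABLE Staff (
    staffNo TEXT PRIMARY KEY CHECK (length(staffNo) <= 5),
    name    TEXT NOT NULL
);
CREATE TABLE PropertyForRent (
    propertyNo TEXT PRIMARY KEY CHECK (length(propertyNo) <= 5),  -- domain: short code
    rooms      INTEGER NOT NULL DEFAULT 4                          -- default value
               CHECK (rooms BETWEEN 1 AND 15),                     -- domain constraint
    rent       REAL NOT NULL CHECK (rent > 0),                     -- nulls not allowed
    staffNo    TEXT REFERENCES Staff(staffNo)                      -- FK, nulls allowed
);
"""

def build_schema():
    """Create an in-memory database holding the two base relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
    conn.executescript(DDL)
    return conn

conn = build_schema()
# rooms is omitted, so the declared default applies.
conn.execute("INSERT INTO PropertyForRent (propertyNo, rent) VALUES ('PA14', 650)")
default_rooms = conn.execute(
    "SELECT rooms FROM PropertyForRent WHERE propertyNo = 'PA14'").fetchone()[0]
```

A row that violates a domain constraint or leaves a required attribute null is rejected by the DBMS itself, which is the point of recording these decisions in the base relation design.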
DBDL FOR THE PROPERTYFORRENT RELATION

STEP A.2: DESIGN REPRESENTATION OF
DERIVED DATA
To decide how to represent any derived data
present in the global logical data model in the
target DBMS.

Examine the logical data model and data dictionary, and
produce a list of all derived attributes.

A derived attribute can be stored in the database or calculated
every time it is needed.
STEP A.2: DESIGN REPRESENTATION OF DERIVED
DATA CONT’D
The option selected is based on:
- the additional cost to store the derived data and keep it
consistent with the operational data from which it is
derived;
- the cost to calculate it each time it is required.
The less expensive option is chosen, subject to performance
constraints.
The derived attribute is stored in the database when:
- it is accessed frequently by a query or queries;
- it is accessed by a query that is critical for performance;
- the DBMS cannot cope with the algorithm to
calculate the derived attribute.
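The two options can be compared side by side. The sketch below assumes a noOfProperties attribute on Staff that counts the rows in PropertyForRent for that member of staff; SQLite stands in for the target DBMS, and the helper names and sample key values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (staffNo TEXT PRIMARY KEY,
                    noOfProperties INTEGER NOT NULL DEFAULT 0);  -- stored derived value
CREATE TABLE PropertyForRent (propertyNo TEXT PRIMARY KEY,
                              staffNo TEXT REFERENCES Staff(staffNo));
""")

def computed_count(conn, staff_no):
    """Option 1: calculate the derived value each time it is needed."""
    return conn.execute("SELECT COUNT(*) FROM PropertyForRent WHERE staffNo = ?",
                        (staff_no,)).fetchone()[0]

def add_property(conn, property_no, staff_no):
    """Option 2: store the value and pay the cost of keeping it consistent."""
    with conn:  # one transaction, so the stored count cannot drift from the rows
        conn.execute("INSERT INTO PropertyForRent VALUES (?, ?)", (property_no, staff_no))
        conn.execute("UPDATE Staff SET noOfProperties = noOfProperties + 1 "
                     "WHERE staffNo = ?", (staff_no,))

def stored_count(conn, staff_no):
    """Read the stored derived value directly, with no aggregation."""
    return conn.execute("SELECT noOfProperties FROM Staff WHERE staffNo = ?",
                        (staff_no,)).fetchone()[0]

conn.execute("INSERT INTO Staff (staffNo) VALUES ('SG14')")
for pno in ("PG4", "PG16", "PG36"):
    add_property(conn, pno, "SG14")
```

Storing the value makes every insert do extra work, while computing it makes every read do the counting; which is cheaper depends on how often each happens.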
PROPERTYFORRENT RELATION AND STAFF
RELATION WITH DERIVED ATTRIBUTE
NOOFPROPERTIES
STEP A.3: DESIGN ENTERPRISE
CONSTRAINTS
The main objective of this step is to design the
enterprise constraints for the target DBMS.

Updates to relations may be constrained by
enterprise rules governing the ‘real world’
transactions represented by the updates.

How such constraints are implemented depends on the choice of
DBMS.
STEP A.3: DESIGN ENTERPRISE
CONSTRAINTS CONT’D
Some DBMSs provide more facilities than others for
defining enterprise constraints. Example:

CONSTRAINT StaffNotHandlingTooMuch
    CHECK (NOT EXISTS (SELECT staffNo
                       FROM PropertyForRent
                       GROUP BY staffNo
                       HAVING COUNT(*) > 100))

The constraints that cannot be expressed in a given
DBMS are designed into the application.
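Designing a constraint into the application might look like the sketch below. It assumes the StaffNotHandlingTooMuch rule (no member of staff handles more than 100 properties) and uses SQLite as a stand-in DBMS, which does not accept subqueries inside CHECK constraints. The function and exception names are hypothetical.

```python
import sqlite3

MAX_PROPERTIES_PER_STAFF = 100  # limit taken from the StaffNotHandlingTooMuch rule

class StaffLimitExceeded(Exception):
    """Raised when an insert would break the enterprise rule."""

def assign_property(conn, property_no, staff_no, limit=MAX_PROPERTIES_PER_STAFF):
    """Insert a property only if the member of staff stays within the limit."""
    with conn:  # check and insert run in the same transaction
        handled = conn.execute(
            "SELECT COUNT(*) FROM PropertyForRent WHERE staffNo = ?", (staff_no,)
        ).fetchone()[0]
        if handled >= limit:
            raise StaffLimitExceeded(f"{staff_no} already handles {handled} properties")
        conn.execute("INSERT INTO PropertyForRent VALUES (?, ?)", (property_no, staff_no))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PropertyForRent (propertyNo TEXT PRIMARY KEY, staffNo TEXT)")
```

The drawback of this approach is that every program that inserts into PropertyForRent has to go through the same check; a rule held in the DBMS is enforced regardless of which program makes the update.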
STEP B: DESIGN PHYSICAL REPRESENTATION
To determine the optimal file organizations to store the
base relations and the indexes that are required to
achieve acceptable performance; that is, the way in
which relations and tuples will be held on
secondary storage.

This step is broken down into four steps, namely:
- Analyze transactions
- Choose file organizations
- Choose indexes
- Estimate disk space requirements
STEP B: DESIGN PHYSICAL REPRESENTATION CONT’D
A number of factors may be used to measure
efficiency:
- Transaction throughput: number of transactions
processed in a given time interval.
- Response time: elapsed time for completion of a
single transaction.
- Disk storage: amount of disk space required to
store the database files.
However, no one factor is always correct. Typically,
one factor has to be traded off against another to achieve
a reasonable balance.
STEP B.1: ANALYZE TRANSACTIONS
To understand the functionality of the
transactions that will run on the database
and to analyze the important transactions.
Attempt to identify performance criteria, such as:
- transactions that run frequently and will have a
significant impact on performance;
- transactions that are critical to the business
operation;
- times during the day/week when there will be a
high demand made on the database (called the
peak load).
STEP B.1: ANALYZE TRANSACTIONS CONT’D
- Use this information to identify the parts of the
database that may cause performance problems,
e.g. the parts that are frequently accessed by
transactions/queries.
- To select appropriate file organizations and
indexes, also need to know the high-level functionality
of the transactions, such as:
- attributes that are updated in an update
transaction;
- criteria used to restrict the tuples that are retrieved
in a query.
STEP B.1: ANALYZE TRANSACTIONS CONT’D
- It is often not possible to analyze all expected
transactions, so investigate the most ‘important’ ones.
- To help identify which transactions to investigate,
can use:
- a transaction/relation cross-reference matrix,
showing the relations that each transaction accesses,
and/or
- a transaction usage map, indicating which
relations are potentially heavily used.
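A transaction/relation cross-reference matrix can be held as a simple mapping. The transactions, relations and access codes below are hypothetical examples, and counting the transactions per relation is only a crude stand-in for a full usage map.

```python
# Hypothetical transactions and the relations they touch:
# I = insert, R = read, U = update, D = delete.
ACCESS = {
    "T1 register new property": {"PropertyForRent": "I", "Staff": "R"},
    "T2 list properties for rent": {"PropertyForRent": "R"},
    "T3 update rent": {"PropertyForRent": "RU"},
    "T4 list staff at branch": {"Staff": "R", "Branch": "R"},
}

def cross_reference(access):
    """Return (relations, rows): one row of access codes per transaction."""
    relations = sorted({rel for used in access.values() for rel in used})
    rows = {txn: [used.get(rel, "") for rel in relations]
            for txn, used in access.items()}
    return relations, rows

def usage_counts(access):
    """Count how many transactions touch each relation."""
    counts = {}
    for used in access.values():
        for rel in used:
            counts[rel] = counts.get(rel, 0) + 1
    return counts

relations, matrix = cross_reference(ACCESS)
# The relation touched by the most transactions is a candidate for closer analysis.
heaviest = max(usage_counts(ACCESS).items(), key=lambda kv: kv[1])[0]
```

In practice the counts would be weighted by how often each transaction runs, since one frequent transaction can matter more than several rare ones.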
STEP B.2: CHOOSE FILE ORGANIZATIONS
To determine an efficient file organization for each
base relation, that is, an efficient way to store the data.
File organizations include:
- Heap (unordered),
- Hash,
- Indexed Sequential Access Method (ISAM),
- B+-tree, and
- Clusters.
HEAP (UNORDERED)
This is the simplest type of file organisation, in
which records are placed on the disk in the same
order as they are inserted.

It is known as unordered because records are
placed on the disk in no particular key order.

Insertion of data (records) is efficient, since each
record is placed after the last one.
HEAP (UNORDERED) CONT’D
- A linear search must be done in order to
retrieve a record from the file.
- Deletion: the page containing the record to
be deleted is first retrieved, then the
record is marked as deleted. The space it
occupied is not re-used, so performance
deteriorates as deletions accumulate.
- Re-organisation is needed to reclaim the
space.
HEAP (UNORDERED) CONT’D
It is suitable as a storage structure when:
- Bulk loading data into the database tables;
there is no overhead of calculating which page
the record should be put on.
- The relation contains only a few pages.
- Bulk retrieving, i.e. when every tuple in the
relation is retrieved each time the relation is
accessed.
- The relation has an additional access structure,
e.g. an index key.
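The heap behaviour described above (append on insert, linear search on retrieval, mark on delete, re-organise to reclaim space) can be sketched with a toy in-memory structure. The class, the tiny page size and the slot layout are assumptions for illustration only; a real DBMS manages disk pages.

```python
class HeapFile:
    """Toy heap file: fixed-capacity pages, records appended in arrival order."""

    def __init__(self, page_size=2):
        self.page_size = page_size
        self.pages = [[]]

    def insert(self, record):
        # Efficient: always goes after the last record, no placement calculation.
        if len(self.pages[-1]) == self.page_size:
            self.pages.append([])
        self.pages[-1].append({"data": record, "deleted": False})

    def find(self, predicate):
        # Linear search: pages are read in order until a match is found.
        pages_read = 0
        for page in self.pages:
            pages_read += 1
            for slot in page:
                if not slot["deleted"] and predicate(slot["data"]):
                    return slot["data"], pages_read
        return None, pages_read

    def delete(self, predicate):
        # The slot is only marked as deleted; its space is not reused.
        for page in self.pages:
            for slot in page:
                if not slot["deleted"] and predicate(slot["data"]):
                    slot["deleted"] = True
                    return True
        return False

    def reorganise(self):
        # Rebuild the file without the deleted slots to reclaim the space.
        live = [s["data"] for p in self.pages for s in p if not s["deleted"]]
        self.pages = [[]]
        for record in live:
            self.insert(record)
```

Finding the most recently inserted record reads every page, which is why a heap suits small relations, bulk loads and full scans rather than selective lookups.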
HASH
Records are placed on the disk (secondary
storage) according to a hash function and not in a
sequential pattern.

The hash function calculates the address of the
page in which the record is to be stored, based on
one or more fields in the record (the hash field/key).
Hash files may be called direct/random files
because records appear randomly distributed
across the available file space.
HASH CONT’D
The hash function is chosen so that the records are as
evenly distributed as possible throughout the file. Techniques include:
- Folding (applies arithmetic functions, e.g.
addition, to parts of the hash field)
- Division-remainder (MOD)

The problem with hash functions is that a unique
address for each record is not guaranteed, because of
collisions.
Collision: this occurs when the same address is
generated for two or more records (records with the
same address are known as synonyms).
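Division-remainder hashing and the resulting collisions can be sketched as follows. The integer keys and the seven-page file are hypothetical, and the digit-sum `fold` helper is just one simple folding variant chosen for illustration.

```python
def page_address(key, num_pages):
    """Division-remainder hashing: the page address is key MOD num_pages."""
    return key % num_pages

def fold(key):
    """One simple folding variant (an assumption): add the decimal digits of the key."""
    return sum(int(d) for d in str(key))

def load(keys, num_pages):
    """Place each key in its hashed page; keys sharing a page are synonyms."""
    pages = {addr: [] for addr in range(num_pages)}
    for key in keys:
        pages[page_address(key, num_pages)].append(key)
    return pages

pages = load([12, 19, 26, 31, 40], num_pages=7)
# Pages that received more than one key hold synonyms, i.e. collisions occurred.
synonyms = {addr: keys for addr, keys in pages.items() if len(keys) > 1}
```

The sample keys were picked to collide heavily; a good hash function and file size would spread real keys far more evenly.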
HASH CONT’D
It is a good storage structure when:
- tuples are retrieved based on an exact match
on the hash field;
- the access order is random.

However, it is not good when:
- the hash field is frequently updated;
- tuples are retrieved based on:
- a pattern match of the hash field;
- a range of values for the hash field;
- a field other than the hash field;
- only a part of the hash field.
INDEXED SEQUENTIAL ACCESS METHOD (ISAM)
This is based on an index: a data structure that
allows the DBMS to locate particular records in a
file more quickly, thus speeding up response to
user queries (enhanced performance).

It uses the same analogy and search
approach as a book index.

Each index item is ordered and consists of one or
more references to where one can find the
particular data item required; this eliminates
the need to scan sequentially through the file
to get the required record.
INDEXED SEQUENTIAL ACCESS METHOD
(ISAM) CONT’D
An index structure is associated with a particular
search key and contains records consisting of the
key and the address of the logical record in the
file containing the key value.

The file containing the logical records is known
as the data file, and the file containing the index
records is known as the index file.
The records in the index file are ordered based
on the indexing field, usually a single attribute.
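The relationship between the data file and the ordered index records can be sketched with two in-memory lists. The staff records are hypothetical, a list position plays the role of a record address, and binary search over one flat index is a simplification of a real multi-level ISAM index.

```python
import bisect

# Data file: logical records in arrival order (hypothetical staff records).
data_file = [
    {"staffNo": "SG37", "name": "Ann"},
    {"staffNo": "SA9", "name": "Mary"},
    {"staffNo": "SL21", "name": "John"},
    {"staffNo": "SG5", "name": "Susan"},
]

# Index records: (key, address) pairs kept in key order; the address is the
# position of the logical record in the data file.
index_file = sorted((rec["staffNo"], addr) for addr, rec in enumerate(data_file))
index_keys = [key for key, _ in index_file]

def lookup(key):
    """Binary-search the index, then fetch the record from the data file."""
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return data_file[index_file[pos][1]]
    return None

def range_lookup(low, high):
    """Return records whose key lies in [low, high], in key order."""
    start = bisect.bisect_left(index_keys, low)
    end = bisect.bisect_right(index_keys, high)
    return [data_file[addr] for _, addr in index_file[start:end]]
```

Because the index records are ordered, the same structure serves exact-match, range and partial-key searches without scanning the data file.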
INDEXED SEQUENTIAL ACCESS METHOD
(ISAM) CONT’D
It supports data retrieval based on:
- Exact key match
- Pattern matching
- Range of values
- Part key specification
Problems that are faced:
- The ISAM index is static (usually created
when the file is created).
- Updates to the relation cause the
index file to lose the access sequence.
B+-TREE
This is a file organisation structure in which the
data or indexes are held in a tree format.

A tree has a hierarchy of nodes:
- Parent nodes (the root and other parent nodes)
- Child nodes (child nodes and leaf nodes)

Terminology:
- Depth of the tree
- Balanced tree (B-tree)
- Degree/order of the tree
B+-TREE CONT’D
- The structure of each node is as below:
[Figure: node layout holding Key value 1 and Key value 2]
- Each node in the tree is actually a
page/reference to an actual tuple.
- The rules for a B+-tree include:
- If the root is not a leaf, it must have at least
two children.
- For a tree of order n, each node (except the root
and leaf nodes) must have between n/2 and n
pointers and children.
B+-TREE CONT’D
- For a tree of order n, the number of key values
in a leaf node must be between (n-1)/2 and
(n-1).
- The number of key values contained in a non-leaf node is 1 less than the number of pointers.
- The tree must always be balanced: every path from
the root to a leaf node has the same length.
- Leaf nodes are linked in order of key values.
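The occupancy rules for a tree of order n can be turned into a small checker. The sketch assumes that non-integer halves are rounded up, which the slides do not state; the function names and the sample node are hypothetical.

```python
import math

def node_bounds(order):
    """Occupancy limits for a B+-tree of the given order.

    Assumes non-integer halves are rounded up.
    """
    return {
        "internal_pointers": (math.ceil(order / 2), order),
        "leaf_keys": (math.ceil((order - 1) / 2), order - 1),
    }

def internal_node_ok(keys, pointers, order, is_root=False):
    """Check one non-leaf node: key count, pointer range, and key ordering."""
    low, high = node_bounds(order)["internal_pointers"]
    if is_root:
        low = 2  # a non-leaf root needs at least two children
    return (len(keys) == len(pointers) - 1
            and low <= len(pointers) <= high
            and list(keys) == sorted(keys))
```

For example, `node_bounds(4)` gives the limits for a tree of order 4, and `internal_node_ok([10, 20], ["p0", "p1", "p2"], 4)` checks a node with two keys and three pointers.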
B+-TREE CONT’D
This is a more reliable/adaptable storage
structure than hashing.
This is because it supports data retrieval based on:
- Exact value match
- Pattern matching
- Range of values
- Part key specification
B+-TREE CONT’D
- Updating a relation does not impede
performance; the index grows as the relation grows.
- It maintains the access key order even
when a relation is updated, thus
retrieval based on the access key is more
efficient than in ISAM.
Note: Best suited when the relation is
frequently updated; it contains one more
level of index than ISAM.
CLUSTERS
- Clusters are groups of one or more tables
physically stored together because they
share common columns and are often used
together; this improves access time.
- The cluster key refers to the related
columns in the clustered tables.
CLUSTERS CONT’D
[Figure: Staff table and Branch table, with the cluster key indicated]
STEP B.3: CHOOSE INDEXES
The main objective of this step is to determine
whether adding indexes will improve the
performance of the system.
There are three types of indexes. They are distinguished by
the ordering of the file, that is, by the ordering field, the
ordering key and the non-ordering fields.
Ordering
- This refers to the sorting of the records in a file.
- Ordering field: the field(s) on which sorting is based.
- Ordering key: the ordering field is also the
primary key/the key of the file.
- Non-ordering field: all other fields in the file that
are not the ordering field(s).
STEP B3: CHOOSE INDEXES CONT’D
Types of Indexes:
- Primary index:
The data file is sequentially ordered by an
ordering key field and the indexing field is
built on the ordering key field, and thus is
guaranteed to have a unique value for each
record.
- Clustering index:
The data file is sequentially ordered by a non-key
field and the indexing field is built on this
non-key field, and thus there can be more
than one record corresponding to a value in the
indexing field. The non-key field is known as
the clustering attribute.
STEP B3: CHOOSE INDEXES CONT’D
Types of Indexes cont’d:
- Secondary index:
This is the type of index defined on a non-ordering field of the data file.
Note:
- A file may have at most one primary index or one clustering
index, but several secondary indexes.
- An index may be sparse (index records for only some of the search key
values) or dense (an index record for every search key value).
STEP B3: CHOOSE INDEXES CONT’D
Choosing indexes can be done in two ways:
- One approach is to keep tuples unordered and
create as many secondary indexes as
necessary.
- Another approach is to order tuples in the
relation by specifying a primary or clustering
index.
STEP B.3: CHOOSE INDEXES
In this case, choose the attribute for ordering or
clustering the tuples as:
- the attribute that is used most often for join
operations, as this makes the join operation more
efficient, or
- the attribute that is used most often to access the
tuples in a relation in order of that attribute (e.g.
in SQL, the attribute used most in the ORDER BY
clause).
STEP B.3: CHOOSE INDEXES
- If the ordering attribute chosen is a key of the relation,
the index will be a primary index; otherwise, the index
will be a clustering index.
- Each relation can only have either a primary index
or a clustering index.
- Secondary indexes provide a mechanism for
specifying an additional key for a base relation
that can be used to retrieve data more efficiently.
STEP B.3: CHOOSE INDEXES
There is an overhead involved in the maintenance and use of
secondary indexes that has to be balanced against the
performance improvement gained when retrieving data.
The overhead includes:
- adding an index record to every secondary index
whenever a tuple is inserted;
- updating a secondary index when the corresponding tuple is
updated;
- the increase in disk space needed to store the secondary
index;
- possible performance degradation during query
optimization, which has to consider all secondary indexes.
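The retrieval benefit of a secondary index can be observed by asking the DBMS for its query plan before and after the index is created. This is a sketch only: SQLite stands in for the target DBMS, `EXPLAIN QUERY PLAN` is SQLite-specific, and the table contents and the index name `idx_property_staff` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PropertyForRent ("
             "propertyNo TEXT PRIMARY KEY, city TEXT, rent REAL, staffNo TEXT)")
conn.executemany(
    "INSERT INTO PropertyForRent VALUES (?, ?, ?, ?)",
    [(f"P{i}", "Glasgow" if i % 2 else "London", 300 + i, f"S{i % 10}")
     for i in range(1000)],
)

def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN rows as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

QUERY = "SELECT * FROM PropertyForRent WHERE staffNo = 'S3'"
plan_before = plan(QUERY)  # no index on staffNo yet
conn.execute("CREATE INDEX idx_property_staff ON PropertyForRent(staffNo)")
plan_after = plan(QUERY)   # the planner can now use the secondary index
```

The price is paid on the update side: every insert into PropertyForRent now also adds an entry to the index, which is the overhead listed above.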
STEP B.3: CHOOSE INDEXES –
GUIDELINES FOR CHOOSING ‘WISH-LIST’
(1) Do not index small relations.
(2) Index PK of a relation if it is not a key of the file
organization.
(3) Add secondary index to a FK if it is frequently accessed.
(4) Add secondary index to any attribute that is heavily used
as a secondary key.
(5) Add secondary index on attributes that are involved in:
selection or join criteria; ORDER BY; GROUP BY; and
other operations involving sorting (such as UNION or
DISTINCT).
STEP B.3: CHOOSE INDEXES –
GUIDELINES FOR CHOOSING ‘WISH-LIST’
(6) Add secondary index on attributes involved in built-in
functions.
(7) Add secondary index on attributes that could result in an
index-only plan.
(8) Avoid indexing an attribute or relation that is frequently
updated.
(9) Avoid indexing an attribute if the query will retrieve a
significant proportion of the tuples in the relation.
(10) Avoid indexing attributes that consist of long character
strings.
STEP B.4: ESTIMATE DISK SPACE REQUIREMENTS
The aim of this step is to estimate the amount of
disk space that will be required by the database,
that is, how much space the implementation of
the database will need on secondary storage.
- Estimating disk usage is highly dependent on the
DBMS and the hardware used to support the
DBMS.
The estimate is based on:
- the size of the tuples in a relation;
- the number of tuples in a relation (allowing for
future growth, i.e. a growth factor).
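A back-of-the-envelope estimate for one relation can be built from the tuple size, the tuple count and a growth factor. Everything specific here is an assumption: the 4096-byte page, compound annual growth, whole tuples per page, and the sample figures. Real DBMSs add per-page and per-tuple overhead that this sketch ignores.

```python
import math

def estimate_relation_bytes(tuple_bytes, tuples_now, annual_growth, years,
                            page_bytes=4096):
    """Rough size of one relation: whole tuples per page, pages rounded up."""
    # Project the tuple count forward using compound growth.
    future_tuples = math.ceil(tuples_now * (1 + annual_growth) ** years)
    # Assume tuples do not span pages.
    tuples_per_page = page_bytes // tuple_bytes
    if tuples_per_page == 0:
        raise ValueError("tuple does not fit in one page")
    pages = math.ceil(future_tuples / tuples_per_page)
    return pages * page_bytes

estimate = estimate_relation_bytes(tuple_bytes=200, tuples_now=10_000,
                                   annual_growth=0.10, years=3)
```

Summing such estimates over all base relations and their indexes gives a first figure for the database; the DBMS documentation should be used for the actual overhead factors.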
STEP C: DESIGN USER VIEWS
To design the user views that were identified
during the Requirements Collection and Analysis
stage of the relational database application
lifecycle.
- The views are developed from the local conceptual and
logical designs that were produced in the previous
phases.
- Views are used to restrict user access to the
database, e.g. in a multi-user environment.
STEP C: DESIGN USER VIEWS CONT’D
Advantages of views include:
- Convenience: Users are presented
with only that part of the DB that they
need.
- Reduced complexity: They are used
to simplify complex queries.
- Improved security: Users are
assigned access rights to only the parts
of the DB that hold the data they need for
their work.
- Customization: The same base tables
are seen differently by different users.
STEP C: DESIGN USER VIEWS CONT’D
- Data integrity: Using the WITH
CHECK OPTION clause, no row that
fails to satisfy the condition in the
WHERE clause can be added to or
updated in the base tables through the view.
- Currency: Changes in the base tables
are immediately reflected in the view.
- Data independence: A consistent and
unchanging picture of the structure of
the database, even if the base tables are
changed, e.g. by adding columns.
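Some of these advantages can be seen with a single view. The sketch uses SQLite as a stand-in DBMS; the Staff columns, the sample rows and the view name StaffB003 are hypothetical. SQLite views are read-only, so WITH CHECK OPTION is not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, name TEXT, salary REAL, branchNo TEXT);
INSERT INTO Staff VALUES ('SG37', 'Ann', 12000, 'B003'), ('SL21', 'John', 30000, 'B005');

-- Hypothetical view for users at branch B003: salary is hidden (security) and
-- only that branch's rows are visible (convenience, customization).
CREATE VIEW StaffB003 AS
    SELECT staffNo, name FROM Staff WHERE branchNo = 'B003';
""")

def view_rows():
    """Return the rows currently visible through the view."""
    return conn.execute("SELECT * FROM StaffB003 ORDER BY staffNo").fetchall()

before = view_rows()
conn.execute("INSERT INTO Staff VALUES ('SG5', 'Susan', 24000, 'B003')")
after = view_rows()  # currency: the new base-table row shows up immediately
```

Users of the view never see the salary column or other branches' rows, and they do not need to know how Staff is filtered to get their data.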
STEP D: DESIGN SECURITY MEASURES
The aim of this step is to design the security
measures for the database as specified by the
users, that is, how the security requirements will be
realized.
There are two types of database security:
- System security:
Deals with access to and use of the database
at the system level, i.e. restricting access by the use of
usernames and passwords.
- Data security:
Deals with access to and use of (the actions that can be
performed on) the database objects, such as the
tables/relations and views.
STEP F: MONITORING AND TUNING AN
OPERATIONAL DATABASE
- In many cases, when a system is being developed,
it is fast enough because the test data set is small.
However, once the system becomes operational, the
volume of operational data is far larger than the test data,
and the system may slow down.
- There are actions that can be taken to speed it up.
These can be taken at the application program level
or at the database level. At the database level,
the aim is to reduce computationally expensive
operations. The commonest expensive operation
is the join operation.
CONTROLLED REDUNDANCY
- Combining 1:1 relationships: Where
two tables were created but the entities
have a one-to-one relationship, they can be
merged into a single table so that
accessing the associated data no longer
needs a join.
- Duplicating non-key columns in 1:*
relationships: The duplicated column can be one of the
most frequently accessed fields. Additional
modules have to be added to the application program
to ensure consistency of the
duplicated fields.
CONTROLLED REDUNDANCY CONT’D
- Duplicating columns in *:*
relationships: This is done as in the one-to-many
case, except that data is taken from the two tables
and put in the central table that was
created from the relationship. A module to
manage updates also has to be created.
- Introducing repeating groups: This is
commonly done on tables derived from multivalued
attributes. Frequently queried attributes
can be incorporated into the main table.
For example, the first three hobbies can
be put in the person table so that a large
proportion of joins are eliminated.
EXTRACT TABLES
- Extract tables are completely denormalized
tables that can store highly duplicated data.
Queries are run against the extract tables
to serve the users. However, inserts and
updates are done in the normalized database
tables.
- At regular intervals, data is transferred from the
database and posted to an extract table. The
extract table is therefore not always up to date.
- It is desirable in cases where updates take place
rarely and querying takes place continuously.
- It is also desirable where speed is required and a
small inaccuracy is acceptable in practice.
VIEWS
- A view is a stored query accessible as a virtual
table. A view is composed of the result set of the
stored query. It is continuously updated and
therefore has dynamic data.
- Where many operations would perform similar
joins, the joins are done once in a view and the
operations query the view.
- There are two types of views: updateable and
non-updateable. A view is updateable if (i)
it is from a single table and (ii) it includes all fields
that are required and have no default values. If
these conditions are not satisfied, then it is a non-updateable view.