04CM20145Lecture8 - Department of Computer Science

advertisement
Dr Alwyn Barry
Dr Joanna Bryson
CM20145
Further DB Design –
Normalization
Last Time
 Database design is an ongoing,
iterative process.
 Requirements come from data, user
demands, design issues.
 Change occurs:
 Corporations & technologies grow.
 Programmers & users learn.
 Views / security.
 Lossless-join decomposition
Now: Science for improving design.
Design Process & Normalization
 We assume a schema R is given.
 R could have been generated when
converting E-R diagram to a set of tables.
 R could have been a single relation
containing all attributes that are of interest
(called universal relation).
 Normalization breaks R into smaller
relations.
R could be the result of any ad hoc
design of relations, which we then test
& convert to normal form.
Overview
 First Normal Form.
 Functional Dependencies.
 Second Normal Form.
 Third Normal Form.
 Boyce-Codd Normal Form.
 Fourth Normal Form.
 Fifth Normal Form.
 Domain Key / Normal Form.
 Design Process & Problems.
First Normal Form – 1NF
 You aren’t supposed to have more
than one value per attribute of a tuple.
 All tuples have the same number of
attributes.
 Necessary for a relational database.
Name
Office
Office Hours
Barry
2.23
1pm, 4pm
Bryson
L2.27
BAD
11am, 5pm
Getting Caught Out With 1NF
 A domain is atomic if its elements are
considered to be indivisible units.
 Examples of non-atomic domains:
 Set-valued attributes, composite attributes.
 Identifiers like CS101 that can be broken up
into parts.
 A relational schema R is in first normal
form if the domains of all attributes of
R are atomic.
 Non-atomic values:
 complicate storage,
 encourage redundancy,
 Depend on interpretation built into
application programs.
Are You Atomic?
 Atomicity is not an intrinsic property of
the elements of the domain.
 Atomicity is a property of how the
elements of the domain are used.
 E.g. strings containing a possible delimiter
(here: a space)
 cities = “Melbourne Sydney”
(non-atomic: space separated list)
 surname = “Fortescue Smythe”
(atomic: compound surname)
 E.g. strings encoding two separate fields
 bucs_login = cssjjb
 If the first two characters are extracted to find the
department, the domain bucs_login is not atomic.
 Leads to encoding of information in application
program rather than in the database.
Second Normal Form (2NF)
 Violated when a nonkey column is a fact
about part of the primary key.
 A column is not fully functionally dependent
on the full primary key.
 CUSTOMER-CREDIT in this case:
From Watson
ORDER
ITEMID
CUSTOMERID
QUANTITY
CUSTOMER-CREDIT
Desk
JJB
25
OK
Chair
AMB
3
POOR
ITEM
ORDER
*itemid
…
quantity
…
CUSTOMER
*customerid
customer-credit
…
BAD
FIX
Def: Functional Dependency
 Let R be a relation schema
  R and   R
 The functional dependency (FD)    holds on R
(“ is FD on ”) iff for any legal relations r(R):
 whenever any two tuples t1 and t2 of r agree on
the attributes 
 they also agree on the attributes .
 i.e. (t1) = (t2)   (t1) =  (t2)
 Example: Consider r(A,B) with the following
instance of r:
A: Initials
B: Chore
JJB
Grading
AMB
Setting Tutorials
JJB
Writing Unit Reviews
 A  B does NOT hold, but B  A does hold
Functional Dependencies: Uses
 Way to encode “business rules”.
 Specify constraints on the set of
legal relations.
 We say that F holds on R if all legal
relations on R satisfy the set of FDs
F.
 Test relations to see if they are
legal under a given set of FDs.
 If a relation r is legal under a set F
of FDs, we say that r satisfies F.
Functional Dependencies
 An FD is an assertion about a schema,
not an instance.
 If we only consider an instance or a
few instances, we can’t tell if an FD
holds.
 Inspecting only a few bird species (e.g.
crows, cardinals and canaries) we might
conclude colour  species.
 However, this would be a bad FD as we
would find out if we found some ravens.
 Thus, identifying FDs is part of the
data modelling process.
Trivial Functional Dependencies
 An FD is trivial if it is satisfied by
all instances of a relation
 E.g.
 customer-name, loan-number 
customer-name
 customer-name  customer-name
 In general,    is trivial if   
 Permitting such FDs makes
certain definitions and algorithms
easier to state.
Functional Dependency vs Key
 FDs can express the same constraints
we could express using keys:
 Superkeys:
 K is a superkey for relation schema R if
and only if K  R
 Candidate keys:
 K is a candidate key for R if and only if
 K  R, and
 there is no K’  K such that K’  R
 Of course, which candidate key
becomes the primary key is arbitrary.
FDs <> Keys
 FDs can represent more information than
keys can on their own.
 Consider the following Loan-info-schema:
Loan-info-schema = (customer-name, loan-number,
branch-name, amount).
We expect these FDs to hold:
loan-number  amount
loan-number  branch-name
We could try to express this by making loannumber the key, however the following FD
does not hold:
loan-number  customer-name
 Incidentally, this isn’t a very good table!
(¬2NF)
FD Closure
 Given a set F of FDs, other FDs are logically
implied.
 E.g. If A  B and B  C, we can infer that A  C
 The set of all FDs implied by F is the closure of
F, written F+ .
 Find F+ by applying Armstrong’s Axioms:
 if   , then   
 if   , then     
 if   , and   , then   
(reflexivity)
(augmentation)
(transitivity)
 Additional rules (derivable from Armstrong’s
Axioms):
 If    and    holds, then     holds (union)
 If     holds, then    holds and    holds
(decomposition)
 If    holds and     holds, then     holds
(pseudotransitivity)
Bad Decomposition Example
(From Last Time)
 A Non Lossless-Join Decomposition
R = (A, B)  R1 = (A), R2 = (B)
A B
A
B
A
B





1
2
A(r)
B(r)




1
2
1
2
1
2
1
r
A (r) ⋈ B (r)
 Thus, r is different to A (r) ⋈ B (r)
 So A,B is not a lossless-join
decomposition of R.
FDs & Lossless Decomposition
 All attributes of an original schema (R)
must appear in the decomposition (R1,
R2):
R = R 1  R2
 Lossless-join decomposition.
For all possible relations r on schema R
r = R1 (r) ⋈ R2 (r)
 A decomposition of R into R1 and R2 is
lossless-join if and only if at least one
of the following dependencies is in F+:
 R1  R2  R1
 R1  R2  R2
Second Normal Form (2NF)
 Violated when a nonkey column is a fact
about part of the primary key.
 A column is not fully functionally dependent
on the full primary key.
 CUSTOMER-CREDIT in this case:
From Watson
ORDER
ITEMID
CUSTOMERID
QUANTITY
CUSTOMER-CREDIT
Desk
JJB
25
OK
Chair
AMB
3
POOR
ITEM
ORDER
*itemid
…
quantity
…
CUSTOMER
*customerid
customer-credit
…
BAD
FIX
Third Normal Form (3NF)
 Violated when a nonkey column is a
fact about another nonkey column.
 A column is not fully functionally
dependent on the primary key.
 R is 3NF iff R is 2NF and has no
transitive dependencies.
 EXCHANGE RATE violates this.
FIX
STOCK
STOCK CODE
NATION
GOOG
USA
NOK
FIN
EXCHANGE RATE
0.67
BAD
0.46
NATION
STOCK
*nation code
nation name
exchange rate
*stock code
firm name
stock price
stock quantity
stock dividend
stock PE
Boyce-Codd (BCNF)
 Arises when a table:
 has multiple candidate keys,
 the candidate keys are composite,
 the candidate keys overlap.
 R is BCNF iff every determinant is a cand. key.
 E.g. Assume one consultant per problem per client,
and one problem per consultant.
 If client-problem is the primary key, how do you add
a new consultant?
 Like 3NF but now worry about all fields.
ADVISOR
BAD
CLIENT
PROBLEM
CONSULTANT
Alpha
Marketing
Gomez
Alpha
Production
Raginiski
CLIEN T
CLIEN T-
P ROBLEM
P ROBLEM
*clientno
…
FIX
* problemcode
*cltp ro bdate
…
CON SULTA NT
* consultid
…
…
Design Goals & their discontents
 Goals for a relational database design:
 eliminate redundancies by decomposing
relations,
 must be able to recover original data using
lossless joins,
 prefer not to loose dependencies.
 BCNF:
 no redundancies,
 no guarantee of dependency preservation.
 3NF:
 dependency preservation,
 but possible redundancies.
Fourth normal form (4NF)
 A row should not contain two or more
independent multivalued facts.
 4NF iff BCNF & no non-trivial multivalued dependencies.
 Multivalued dependency means the value of
one attributed determines a set of values for
another.
STUDENT
STUDENTID
BAD
STUDENTSPORT
proficiency
…
SPORT
SUBJECT
…
50
Football
English
…
50
Football
Music
…
50
Tennis
Botany
…
SPORT
50
Karate
Botany
…
*sportcode
…
STUDENT
ENROLMENT
*studentid
…
*sectionno
…
FIX
SUBJECT
*subjectid
…
Fifth normal form (5NF)
 5NF iff a relation has no join
dependency.
 The schemas R1, R2,.., Rn have a
join dependency over R if they
define a lossless-join
decomposition over R.
 This is way too complicated, don’t
worry about it.
Domain Key Normal Form
 Every constraint on the relation
must be a logical consequence of
the domain constraints and the
key constraints that apply to the
relation.
 Key: unique identifier.
 Constraint: rule governing attribute
values.
 Domain: set of values of the same
data type.
 No known algorithm gives DK/NF.
E-R Model and Normalization
 When an E-R diagram is carefully
designed, identifying all entities
correctly, the tables generated should
not need further normalization.
 However, in a real (imperfect) design
there can be FDs from non-key
attributes of an entity to other attributes
of the entity.
 The keys identified in E-R diagrams
might not be minimal - FDs can help us
to identify minimal keys.
 FDs from non-key attributes of a
relationship set are possible, but rare.
Denormalization & Performance
 May want to use non-normalized
schema for performance.
 E.g. displaying customer-name along with
account-number and balance requires join
of account with depositor.
 Alternative 1: Use denormalized relation
containing attributes of account as well as
depositor.
 faster lookup.
 extra space and extra execution time for updates.
 extra coding work for programmer and possibility
of error in extra code.
 Alternative 2: use a materialized view
defined as account ⋈ depositor
 as above, except less extra coding, errors.
Limits of Normalization
 Examples of bad database design, not
caught by normalization.
 Good:
 earnings(company-id, year, amount)
 Bad:
 earnings-2000, earnings-2001, earnings2002, etc., on (company-id, earnings)
 all are BCNF, but querying across years difficult
 needs a new table each year
 company-year(company-id, earnings2000,earnings-2001, earnings-2002)
 in BCNF, but querying across years difficult
 requires new attribute each year
Summary 1 – Rules to Watch
 1NF: attributes not atomic.
 2NF: non-key attribute FD on part of





key.
3NF: one non-key attribute FD on
another.
Boyce-Codd NF: overlapping but
otherwise independent candidate
keys.
4NF: multiple, independent multivalued attributes.
5NF: join dependency.
Domain Key / NF: all constraints either
domain or key
Summary 2 – Concepts
 Functional Dependencies:

Axioms & Closure.
 Lossless-join decomposition.
 Design Process.
 Normalization Problems.
Next: Interfaces and Architectures
Reading & Exercises
 Reading
 Connolly & Begg Chapter (13, 14)
 Silberschatz Chapters 7.
 Any other book, the
design/normalization chapter.
 Exercises:
 Silberschatz
 7.1, 7.2, 7.16, 7.23, 7.24, 7.27-29
Next Week
•
Architectures and Implementations
•
Integrity and Security
Slides after and including
this one you are not
responsible for, but I am
saving in case I decide to
use them in the future.
Goal: Formalize “Good Design”
 Process:
 Decide whether a particular relation R is in
“good” form.
 In the case that a relation R is not in
“good” form, decompose it into a set of
relations {R1, R2, ..., Rn} such that:
 each relation is in good form,
 the decomposition is a lossless-join
decomposition.
 Theory:
 Constraints on the set of legal relations.
 Require that the value for a certain set of
attributes determines uniquely the value
for another set of attributes –
functional dependencies.
Download