28SpCS157BMid3Revision - Department of Computer Science

Mid3 Revision 2
Prof. Sin Min Lee
Department of Computer Science
San Jose State University
Functional Dependencies

FDs are defined over two sets of attributes: X, Y ⊆ R
Notation: X -> Y, read as "X determines Y"
If X -> Y, then all tuples that agree on X must also agree on Y

X | Y | Z
--|---|--
1 | 2 | 3
2 | 4 | 5
1 | 2 | 4
1 | 2 | 7
2 | 4 | 8
3 | 7 | 9
Q6. (1 point) Given the relation Supplies:
Snumber | Pnumber | Qty
--------|---------|----
101 | 1 | 20
101 | 2 | 30
102 | 1 | 14
103 | 4 | 21
104 | 4 | 10
105 | 1 | 5
what will be returned by the SQL query:
Select Pnumber From Supplies Group By Pnumber Having Count(*) =
(Select Max(Count(*)) From Supplies Group By Pnumber)
(a) 1
(b) 2
(c) 3
(d) 4
Answer: a
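The query in Q6 keeps the part numbers whose group size equals the largest group size. A minimal Python sketch of the same logic, using the Supplies rows from the question (the function name `most_supplied_parts` is hypothetical, chosen for illustration):

```python
from collections import Counter

# Rows of Supplies as (Snumber, Pnumber, Qty), copied from the question.
supplies = [
    (101, 1, 20),
    (101, 2, 30),
    (102, 1, 14),
    (103, 4, 21),
    (104, 4, 10),
    (105, 1, 5),
]

def most_supplied_parts(rows):
    """Return the Pnumber values whose row count equals the maximum row count."""
    counts = Counter(pnumber for _, pnumber, _ in rows)   # Group By Pnumber
    top = max(counts.values())                            # Max(Count(*))
    return sorted(p for p, c in counts.items() if c == top)  # Having Count(*) = ...

print(most_supplied_parts(supplies))
```

Pnumber 1 appears in three rows, more than any other part, which matches answer (a).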
Q4. (1 point) Consider the relation R(ABCDE) with FDs:
FD1. AB -> D, FD2. AB -> E, FD3. D -> A, and FD4. D -> B.
The number of keys of R is:
(a) 1 (b) 2 (c) 3 (d) 10
Answer: (b). There are 2 candidate keys: {A,B,C} and {C,D}.
There are 10 superkeys: {CD}, {ABC}, {ACD}, {BCD}, {CDE}, {ABCD}, {ABCE}, {ACDE}, {BCDE}, {ABCDE}.
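The keys in Q4 can be checked by brute force: compute the attribute closure of every subset of R and keep those that determine all attributes. The sketch below assumes single-letter attribute names; the helper names `closure`, `superkeys` and `candidate_keys` are illustrative, not from the slides.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def superkeys(schema, fds):
    """Every attribute subset whose closure is the whole schema."""
    everything = set(schema)
    return [frozenset(c)
            for n in range(1, len(schema) + 1)
            for c in combinations(sorted(schema), n)
            if closure(c, fds) == everything]

def candidate_keys(schema, fds):
    """Superkeys with no proper subset that is also a superkey."""
    sks = superkeys(schema, fds)
    return [k for k in sks if not any(other < k for other in sks)]

fds = [("AB", "D"), ("AB", "E"), ("D", "A"), ("D", "B")]
keys = candidate_keys("ABCDE", fds)
print(sorted("".join(sorted(k)) for k in keys))
print(len(superkeys("ABCDE", fds)))
```

Enumerating all subsets is exponential in the number of attributes, which is fine for exam-sized schemas but not for large ones.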


2nd Normal Form
The relation has to be in 1st Normal Form, and each attribute A in relation schema R must meet one of the following criteria:
It appears in a candidate key.
It is not partially dependent on a candidate key.
- No need to check if the primary key has only one attribute
- Create a new relation for each partial key and its dependent attributes
Partial dependency

A functional dependency a -> b is called a partial dependency if there is a proper subset g of a such that g -> b. We say that b is partially dependent on a.
2NF example

A | B | C | D
--|---|---|--
1 | 1 | 3 | 2
2 | 2 | 3 | 1
3 | 2 | 4 | 4
4 | 1 | 4 | 3
1 | 2 | 1 | 2
2nd Normal Form (cont.)
Lots(Property-Id#, County-name, Lot#, Area, Price, Tax-Rate)
Tax-Rate is partially dependent on candidate key {County-name, Lot#}
2NF (cont.)
The relation is split in two:
(Property-Id#, County-name, Lot#, Area, Price)
(County-name, Tax-Rate)
3rd Normal Form

in 2nd Normal Form
no non-key attribute is functionally dependent on another non-key attribute
3NF (cont.)
(Property-Id#, County-name, Lot#, Area, Price)
is split into:
(Property-Id#, County-name, Lot#, Area)
(Area, Price)
Inventory(PartNbr, {Warehouse, Location}, QOH, Weight, PartColor)
PartNbr --> Weight, PartColor
PartNbr + Warehouse --> QOH (QOH is quantity on hand)
Warehouse --> Location
Sample Data
PartNbr | Warehouse | Location | QOH | Weight | PartColor
--------|-----------|----------|-----|--------|----------
01 | 500 | NW | 135 | 11.75 | Blue
01 | 600 | SW | 210 | 11.75 | Blue
01 | 800 | East | 192 | 11.75 | Blue
02 | 500 | NW | 75 | 2.50 | Red
02 | 800 | East | 45 | 2.50 | Red
03 | 500 | NW | 290 | 21.35 | Green
03 | 600 | SW | 83 | 21.35 | Green
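Which normal form is the Inventory table in?
Answer: the key is {PartNbr, Warehouse}.
The table is in 1NF but not 2NF: Weight and PartColor depend on PartNbr alone, and Location depends on Warehouse alone.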
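A 2NF check amounts to looking for non-key attributes determined by a proper subset of the key. The sketch below does this for the Inventory FDs, assuming {PartNbr, Warehouse} is the only candidate key; the function names are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attributes determined by attrs under fds, given as (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def partial_dependencies(key, fds):
    """Map each proper subset of key to the non-key attributes it determines."""
    found = {}
    for n in range(1, len(key)):
        for subset in combinations(sorted(key), n):
            determined = closure(subset, fds) - key
            if determined:
                found[frozenset(subset)] = determined
    return found

inventory_fds = [
    ({"PartNbr"}, {"Weight", "PartColor"}),
    ({"PartNbr", "Warehouse"}, {"QOH"}),
    ({"Warehouse"}, {"Location"}),
]
key = {"PartNbr", "Warehouse"}

violations = partial_dependencies(key, inventory_fds)
for subset, attrs in violations.items():
    print(sorted(subset), "->", sorted(attrs))
```

Any non-empty result means the relation is not in 2NF; QOH does not appear because it depends on the whole key.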
Q2. (1 point) Hospital(Patient, Insurance, Doctor, {Test, Result})
Patient --> Insurance, Doctor
Patient + Test --> Result
Sample Data
Patient | Insurance | Doctor | Test | Result
--------|-----------|--------|------|-------
Tweety | Red Cross | Livingston | Brain Scan | Not Found
Tweety | Red Cross | Livingston | Blood work | Yes and red
Sylvester | Red Shield | Kilder | Cat Scan | Yes he is a Cat
Sylvester | Red Shield | Kilder | X Rays | No broken bones
Sylvester | Red Shield | Kilder | Flea check | None
Which normal form is the Hospital table in?
Q7. (1 point) Given the following table
(a) Draw the functional dependency graph of this table.
(b) Can D be in 3NF?
Closure of F

Let F be a set of functional
dependencies. The closure of F,
denoted by F+, is the set of all
functional dependencies logically
implied by F.
Armstrong's Axioms

Reflexivity rule. If a is a set of attributes and b ⊆ a, then a -> b.
Augmentation rule. If a -> b holds and g is a set of attributes, then ga -> gb holds.
Transitivity rule. If a -> b holds and b -> g holds, then a -> g holds.
Q3. (1 point) Suppose we have R(A,B,C,D) with
FD1. A,B -> C
FD2. A,C -> B
FD3. B,D -> A
Identify all the candidate keys.
Decompositions in General
R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)
R1(A1, ..., An, B1, ..., Bm)
R2(A1, ..., An, C1, ..., Cp)
If A1, ..., An -> B1, ..., Bm
then the decomposition is lossless
Note: A1, ..., An -> C1, ..., Cp is not necessarily required
Example: name -> price, hence the first decomposition is lossless
BCNF Decomposition Algorithm
Repeat
  choose A1, …, Am -> B1, …, Bn that violates the BCNF condition
  split R into R1(A1, …, Am, B1, …, Bn) and R2(A1, …, Am, [others])
  continue with both R1 and R2
Until no more violations
[Figure: R1 covers the A's and the B's; R2 covers the A's and the others]
Is there a 2-attribute relation that is not in BCNF?
Summary of BCNF Decomposition
Find a dependency that violates the BCNF condition:
A1, A2, …, An -> B1, B2, …, Bm
Heuristic: choose B1, B2, …, Bm "as large as possible"
Decompose:
[Figure: R1 covers the A's and the B's; R2 covers the A's and the others]
Continue until there are no BCNF violations left.
2-attribute relations are BCNF
Example Decomposition
Person(name, SSN, age, hairColor, phoneNumber)
SSN -> name, age
age -> hairColor
Decompose in BCNF (in class):
Step 1: find all keys (How? Compute S+, for various sets S)
Step 2: now decompose
Other Example

R(A,B,C,D)
A -> B, B -> C
Key: AD
Violations of BCNF: A -> B, A -> C, A -> BC
Pick A -> BC: split into R1(A,B,C) and R2(A,D)
What happens if we pick A -> B first?
Q5. (1 point) Given the FDs {B->D, AB->C, D->B} and the
relation R(A, B, C, D), give two distinct lossless-join
decompositions to BCNF, indicating the keys of each of the
resulting relations.
Answer: Relations in the first lossless-join decomposition:
R1(A, B, C)
R2(B, D)
Relations in the second lossless-join decomposition:
R1(A, C, D)
R2(B, D)
Lossless Decompositions
A decomposition is lossless if we can recover:
R(A,B,C)
Decompose
R1(A,B)
R2(A,C)
Recover
R’(A,B,C) should be the same as
R(A,B,C)
R’ is in general larger than R. Must ensure R’ = R
Q8. (2 points) Consider the relation schema R(A,B,C,D) with FDs F = {AB -> C; BC -> D; A -> B}. Which FD has an extraneous attribute on the left-hand side?
a. AB -> C
b. BC -> D
c. Both (b) and (a)
d. None of the above
Answer: a
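An attribute on the left-hand side of an FD is extraneous if the remaining left-hand attributes still determine the right-hand side under F. The sketch below applies that test to Q8, assuming the FDs read AB -> C, BC -> D and A -> B (the arrows are lost in the extracted slide text); the function names are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def extraneous_lhs(fd, fds):
    """Left-hand attributes of fd that can be dropped without weakening F."""
    lhs, rhs = fd
    dropped = []
    for attr in lhs:
        reduced = set(lhs) - {attr}
        # attr is extraneous if the reduced left side still implies rhs under F
        if reduced and set(rhs) <= closure(reduced, fds):
            dropped.append(attr)
    return dropped

F = [("AB", "C"), ("BC", "D"), ("A", "B")]  # assumed reading of Q8
for fd in F:
    print(fd, extraneous_lhs(fd, F))
```

For AB -> C the test reports B, because A -> B already holds and so A alone determines C; BC -> D has no extraneous attribute.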
Multivalued Dependencies (MVDs)

Let R be a relation schema and let a ⊆ R and b ⊆ R.
The multivalued dependency
a ->> b
holds on R if, in any legal relation r(R), for all pairs of
tuples t1 and t2 in r such that t1[a] = t2[a], there exist
tuples t3 and t4 in r such that:
t1[a] = t2[a] = t3[a] = t4[a]
t3[b] = t1[b]
t3[R - b] = t2[R - b]
t4[b] = t2[b]
t4[R - b] = t1[R - b]
MVD (Cont.)

Tabular representation of a ->> b
X ->> Y is trivial if
(a) Y ⊆ X or
(b) Y U X = R
Multivalued Dependencies


There are database schemas in BCNF that do not seem to be
sufficiently normalized
Consider a database
classes(course, teacher, book)
such that (c,t,b) ∈ classes means that t is qualified to teach c,
and b is a required textbook for c

The database is supposed to list for each course the set of
teachers any one of which can be the course’s instructor, and
the set of books, all of which are required for the course (no
matter who teaches it).
Multivalued Dependencies

classes
course | teacher | book
-------|---------|-----
database | Avi | DB Concepts
database | Avi | Ullman
database | Hank | DB Concepts
database | Hank | Ullman
database | Sudarshan | DB Concepts
database | Sudarshan | Ullman
operating systems | Avi | OS Concepts
operating systems | Avi | Shaw
operating systems | Jim | OS Concepts
operating systems | Jim | Shaw


There are no non-trivial functional dependencies and therefore
the relation is in BCNF
Insertion anomalies – i.e., if Sara is a new teacher that can teach
database, two tuples need to be inserted
(database, Sara, DB Concepts)
(database, Sara, Ullman)
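The MVD definition can be checked mechanically on the classes instance. The sketch below tests course ->> teacher by building, for every pair of tuples that agree on course, the tuple t3 required by the definition and looking it up; the function name `satisfies_mvd` is hypothetical. Iterating over ordered pairs covers t4 as well, since t4 is the t3 of the swapped pair.

```python
def satisfies_mvd(rows, schema, alpha, beta):
    """Check alpha ->> beta on one relation instance (rows are tuples)."""
    pos = {name: i for i, name in enumerate(schema)}
    a_idx = [pos[x] for x in alpha]
    b_idx = [pos[x] for x in beta]
    existing = set(rows)
    for t1 in rows:
        for t2 in rows:
            if all(t1[i] == t2[i] for i in a_idx):
                # t3 takes its beta values from t1 and everything else from t2
                t3 = list(t2)
                for i in b_idx:
                    t3[i] = t1[i]
                if tuple(t3) not in existing:
                    return False
    return True

schema = ("course", "teacher", "book")
classes = [
    ("database", "Avi", "DB Concepts"),
    ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"),
    ("database", "Hank", "Ullman"),
    ("database", "Sudarshan", "DB Concepts"),
    ("database", "Sudarshan", "Ullman"),
    ("operating systems", "Avi", "OS Concepts"),
    ("operating systems", "Avi", "Shaw"),
    ("operating systems", "Jim", "OS Concepts"),
    ("operating systems", "Jim", "Shaw"),
]

print(satisfies_mvd(classes, schema, ["course"], ["teacher"]))

# Inserting only one of the two Sara tuples breaks the dependency.
with_one_sara_row = classes + [("database", "Sara", "DB Concepts")]
print(satisfies_mvd(with_one_sara_row, schema, ["course"], ["teacher"]))
```

The second check fails because (database, Sara, Ullman) is missing, which is the insertion anomaly described on the slide.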
Multivalued Dependencies

Therefore, it is better to decompose classes into:

teaches
course | teacher
-------|--------
database | Avi
database | Hank
database | Sudarshan
operating systems | Avi
operating systems | Jim

text
course | book
-------|-----
database | DB Concepts
database | Ullman
operating systems | OS Concepts
operating systems | Shaw

We shall see that these two relations are in Fourth Normal Form (4NF)
Example:
F = { A -> B, B -> C }
A+ = ABC
B+ = BC
C+ = C
AB+ = ABC
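The closures in this example can be reproduced with the usual fixed-point computation: keep adding the right-hand side of any FD whose left-hand side is already covered. A minimal sketch, assuming the example reads F = {A -> B, B -> C} and using single-letter attribute names:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs whenever lhs is already inside the closure
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

F = [("A", "B"), ("B", "C")]
for start in ["A", "B", "C", "AB"]:
    print(start + "+ =", "".join(sorted(closure(start, F))))
```

Each pass either adds at least one attribute or stops, so the loop runs at most once per attribute in the schema.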

First Normal Form

Every field contains only atomic values



No lists or sets.
Implicit in our definition of the relational
model.
Second Normal Form


every non-key attribute is fully functionally
dependent on the ENTIRE primary key.
Mainly of historical interest.
FDs in a BCNF Relation
Intuitively, in a BCNF relation, the only nontrivial
dependencies are those in which a key determines
some attributes.
Each tuple can be thought of as an entity or
relationship, identified by a key and described by the
remaining attributes.
[Figure: Key | Nonkey attr_1 | Nonkey attr_2 | ... | Nonkey attr_k]
Partial Dependencies
[Figure: Key, Attributes X, Attributes A; A not in a key]
Transitive Dependencies
[Figure: Key, Attributes X, Attributes A; two cases: A not in a key, A in a key]

Motivation of 3NF

By making an exception for certain dependencies
involving key attributes, we can ensure that every
relation schema can be decomposed into a
collection of 3NF relations using only lossless-join,
dependency-preserving decompositions.
Such a guarantee does not exist for BCNF
relations.
3NF weakens the BCNF requirements just enough to
make this guarantee possible.
Unlike BCNF, some redundancy is possible
with 3NF.

The problems associated with partial and transitive
dependencies persist if there is a nontrivial
dependency X -> A and X is not a superkey, even if
the relation is in 3NF because A is part of a key.
Reserves

Assume: sid -> cardno (a sailor uses a unique credit
card to pay for reservations).
Reserves is not in 3NF

sid is not a key and cardno is not part of a key
In fact, (sid, bid, day) is the only key.
(sid, cardno) pairs are stored redundantly.
Reserves

Assume: sid -> cardno, and cardno -> sid (we know
that credit cards also uniquely identify the owner).
Reserves is in 3NF

(cardno, bid, day) is also a key for Reserves.
sid -> cardno does not violate 3NF, because cardno is part of a key.
1. Suppose that in our banking example, we had an alternative design including the schema:
BC-schema = (loan#, cname, street, ccity)
We can see this is not BCNF, as the functional dependency
cname -> street ccity
holds on this schema, and cname is not a superkey.
2. If we have customers who have several addresses, though, then we no longer wish to enforce this functional dependency, and the schema is in BCNF.
3. However, we now have the repetition-of-information problem. For each address, we must repeat the loan numbers for a customer, and vice versa.
4. Figure 1 shows a tabular representation of this. It looks horrendously complicated, but is really rather simple.
A simple example is a table with the schema (name, address, car), as shown in Figure 2.
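Figure 1: Tabular representation of a ->> b

   | a       | b         | R - a - b
---|---------|-----------|----------
t1 | a1...ai | ai+1...aj | aj+1...an
t2 | a1...ai | bi+1...bj | bj+1...bn
t3 | a1...ai | ai+1...aj | bj+1...bn
t4 | a1...ai | bi+1...bj | aj+1...an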

Figure 2: (name, address, car) where name ->> address and name ->> car

name | address | car
-----|---------|----
Tom | North Rd. | Toyota
Tom | Oak St. | Honda
Tom | North Rd. | Honda
Tom | Oak St. | Toyota

What is a Decomposition?


Let R be a relation schema. A set of
relation schemas {R1, R2, …, Rn} is a
decomposition of R if R = R1 U R2 U … U Rn.
That is, {R1, R2, …, Rn} is a decomposition
of R if, for i = 1, 2, …, n, each Ri is a subset of
R, and every attribute in R appears in at
least one Ri.

Normalization Using Functional Dependencies

Desirable properties of Decomposition

1. Lossless-Join Decomposition

Let R be a relation schema, and let F be a set of
functional dependencies on R. Let R1 and R2 form
a decomposition of R. This decomposition is a
lossless-join decomposition of R if at least one of
the following functional dependencies is in F+:
R1 ∩ R2 -> R1
R1 ∩ R2 -> R2

2. Dependency Preservation

When an update is made to the database, the
system should be able to check if it satisfies all the
given functional dependencies. If we want to
check updates efficiently, we should design
relational-database schemas that allow update
validation without the computation of joins.

To decide whether joins must be computed we
need to determine what functional
dependencies may be tested by checking each
relation individually.

Cont.

Let F be a set of functional dependencies on a schema R, and let R1, R2, …, Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F+ that include only attributes of Ri.
Let F' = F1 U F2 U … U Fn. F' is a set of functional dependencies on schema R; in general, F' ≠ F. However, it may be that F'+ = F+. If the latter is true, then every dependency in F is logically implied by F', and if we verify that F' is satisfied, we have verified that F is satisfied.
We say that a decomposition having the property F'+ = F+ is a dependency-preserving decomposition.

Algorithm to test dependency preservation

compute F+;
for each schema Ri in D do
begin
  Fi := the restriction of F+ to Ri;
end
F' := ∅;
for each restriction Fi do
begin
  F' := F' U Fi;
end
compute F'+;
if (F'+ = F+) then return (true)
else return (false);

Note: since the first step, the computation of F+, takes
exponential time, it is often easier not to apply the
algorithm.
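Because computing F+ is expensive, a commonly used alternative avoids it: for each FD a -> b in F, grow a set starting from a using only closures restricted to the individual schemas Ri, and check whether b is reached. The sketch below shows that alternative test, not the algorithm on the slide; attribute names are single letters and the function names are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves_dependencies(fds, decomposition):
    """True if every FD in fds can be enforced using the Ri one at a time."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                ri = set(ri)
                # What the part of result inside Ri determines, kept within Ri
                gained = closure(result & ri, fds) & ri
                if not gained <= result:
                    result |= gained
                    changed = True
        if not set(rhs) <= result:
            return False
    return True

# R(A,B,C) with A -> B, B -> C, split into (A,B) and (B,C)
print(preserves_dependencies([("A", "B"), ("B", "C")], ["AB", "BC"]))
# Same FDs, split into (A,B) and (A,C): B -> C can no longer be checked locally
print(preserves_dependencies([("A", "B"), ("B", "C")], ["AB", "AC"]))
```

This runs in time polynomial in the size of F and the decomposition, since it only ever computes attribute closures.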
Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set
F of functional dependencies
if, for all functional dependencies in F+ of the form
a -> b, where a ⊆ R and b ⊆ R, at least one of the
following holds:
• a -> b is a trivial functional dependency (b ⊆ a).
• a is a superkey for schema R.
Cont.
A database design is in BCNF if each member of
the set of relation schemas that constitutes the
design is in BCNF.
To determine whether these schemas are in BCNF,
we need to determine what functional dependencies
apply to them.
Note: examples are available in text P225-226
BCNF Decomposition Algorithm
result := {R};
done := false;
compute F+;
while (not done) do
  if (there is a schema Ri in result that is not in BCNF)
  then begin
    let a -> b be a nontrivial functional dependency that holds on Ri
    such that a -> Ri is not in F+, and a ∩ b = ∅;
    result := (result - Ri) U (Ri - b) U (a, b);
  end
  else done := true;
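The loop above can be sketched in Python. Instead of materialising F+, this version finds a violation in a schema Ri by computing attribute closures of subsets of Ri and intersecting them with Ri, and it takes the right-hand side "as large as possible". It assumes single-letter attributes; `find_violation` and `bcnf_decompose` are hypothetical names, and the search over subsets is exponential, so it only suits small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(schema, fds):
    """Return (lhs, rhs) for a nontrivial FD on schema whose lhs is not a superkey."""
    attrs = sorted(schema)
    for n in range(1, len(attrs)):
        for lhs in combinations(attrs, n):
            determined = closure(lhs, fds) & schema
            rhs = determined - set(lhs)
            if rhs and determined != schema:
                return set(lhs), rhs
    return None

def bcnf_decompose(schema, fds):
    """Split schema until no relation has a BCNF violation."""
    result, todo = [], [frozenset(schema)]
    while todo:
        r = todo.pop()
        violation = find_violation(r, fds)
        if violation is None:
            result.append(r)
        else:
            lhs, rhs = violation
            todo.append(frozenset(lhs | rhs))  # (a, b)
            todo.append(frozenset(r - rhs))    # Ri - b
    return result

# R(A,B,C,D) with A -> B and B -> C
parts = bcnf_decompose("ABCD", [("A", "B"), ("B", "C")])
print(sorted("".join(sorted(p)) for p in parts))
```

For R(A,B,C,D) with A -> B and B -> C, the first split yields (A,B,C) and (A,D); (A,B,C) is then split again because B -> C still violates BCNF there.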
Cont.
• Not every BCNF decomposition is dependency preserving
• We cannot always satisfy all three design goals:
1. BCNF
2. Lossless join
3. Dependency preservation
Cont.
Example:
Banker-schema = (branch-name, customer-name, banker-name)
This Banker-schema indicates that a customer has a "personal
banker" in a particular branch. The set F of functional
dependencies that we require to hold on the Banker-schema
is
banker-name -> branch-name
branch-name customer-name -> banker-name
Banker-schema is not in BCNF because banker-name is not
a superkey.
Third Normal Form
A relation schema R is in 3NF with respect to a set
F of functional dependencies if, for all functional
dependencies in F+ of the form a -> b, where a ⊆ R
and b ⊆ R, at least one of the following holds:
• a -> b is a trivial functional dependency.
• a is a superkey for R.
• Each attribute A in b - a is contained in a
candidate key for R.

Transitive Dependencies

The definition of 3NF allows certain
functional dependencies that are not
allowed in BCNF. A dependency a -> b that
satisfies only the third condition of the 3NF
definition is not allowed in BCNF, but is
allowed in 3NF. These dependencies are
examples of transitive dependencies.

Cont.

If a relation schema is in BCNF, then all
functional dependencies are of the form
“superkey determines a set of attributes,”
or the dependency is trivial. So
a BCNF schema cannot have any
transitive dependencies.
Every BCNF schema is also in 3NF, and
BCNF is therefore a more restrictive
constraint than is 3NF.


Algorithm for Dependency-preserving,
lossless-join decomposition into 3NF

Let Fc be a canonical cover for F;
i := 0;
for each functional dependency a -> b in Fc do
  if none of the schemas Rj, j = 1, 2, …, i contains ab
  then begin
    i := i + 1;
    Ri := ab;
  end
if none of the schemas Rj, j = 1, 2, …, i contains a candidate key for R
  then begin
    i := i + 1;
    Ri := any candidate key for R;
  end
return (R1, R2, …, Ri)

Comparison of BCNF and 3NF

3NF has the advantage that it is
always possible to obtain a 3NF design
without sacrificing a lossless join or
dependency preservation. So it is generally
preferable to choose 3NF.

Conclusion
Now we have three design goals for a
relational-database design:
1. BCNF
2. Lossless join
3. Dependency preservation

If we cannot achieve all three, we settle for:
1. 3NF
2. Lossless join
3. Dependency preservation

Testing for Lossless Join




Fortunately, there is a simple test to
determine if a decomposition into two
schemes is lossless
Let R1 and R2 be a decomposition of R
Let F be the set of FDs of R
If either (R1 ∩ R2) -> (R1 - R2) or
(R1 ∩ R2) -> (R2 - R1) belongs to F+, the
decomposition is lossless
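The test can be sketched with an attribute closure: the common attributes R1 ∩ R2 must determine all of R1 or all of R2. The example reuses the FDs from Q5 ({B->D, AB->C, D->B}); the function name `is_lossless_binary` is hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless_binary(r1, r2, fds):
    """Lossless-join test for a decomposition into exactly two schemas."""
    r1, r2 = set(r1), set(r2)
    common_closure = closure(r1 & r2, fds)
    # (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 must follow from the FDs
    return r1 <= common_closure or r2 <= common_closure

fds = [("B", "D"), ("AB", "C"), ("D", "B")]
print(is_lossless_binary("ABC", "BD", fds))   # first decomposition from Q5
print(is_lossless_binary("ACD", "BD", fds))   # second decomposition from Q5
print(is_lossless_binary("AB", "CD", fds))    # no common attributes
```

Both Q5 decompositions pass because the shared attribute (B or D) determines all of R2(B, D); the last split shares no attributes and fails.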
Data Mining and KDD
Putting the results
in practical use
What is Data Mining?

“the automated extraction of hidden
predictive information from large databases”

Algorithms produce patterns, rules

Predict future trends/behavior

Used to make business decisions
Classification



Items belong to classes
Given past items’ classification, predict class
of new item
Example: Issuing credit cards


Use information: income, educational background,
age, current debts
Credit worthiness: Bad, good, excellent
Decision Tree Classifiers



Internal Node has predicate
Leaf node is class
To classify instance



Start at root node
Traverse tree until reach leaf node
At each internal node, make a decision
Credit Risk Decision Tree
Decision Tree Construction

Some Definitions


Purity: the more instances at a leaf that
belong to a single class, the greater the purity
Best split: the split giving the maximum
information gain ratio (info gain / info
content)

Choose the attribute and condition resulting in
maximum purity
Decision Tree Construction
Association Rules

antecedent -> consequent

if -> then
beer -> diaper (Walmart)
economy bad -> higher unemployment
higher unemployment -> higher unemployment
benefits cost
Rules are associated with a population,
support, and confidence
Association Rules


Population: instances such as grocery store
purchases
Support


% of population satisfying antecedent and
consequent
Confidence

% consequent true when antecedent true
Association Rules

Population: MS, MSA, MSB, MA, MB, BA
M = Milk, S = Soda, A = Apple, B = Beer
Support (M -> S) = 3/6
(MS, MSA, MSB) / (MS, MSA, MSB, MA, MB, BA)
Confidence (M -> S) = 3/5
(MS, MSA, MSB) / (MS, MSA, MSB, MA, MB)
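Both numbers can be recomputed from the six purchases. The sketch below encodes each purchase as a string of item letters (M, S, A, B as defined above) and treats the rule as Milk -> Soda; the function names are illustrative.

```python
# The six purchases, one string of item letters per transaction.
transactions = ["MS", "MSA", "MSB", "MA", "MB", "BA"]

def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    items = set(antecedent) | set(consequent)
    matching = sum(1 for t in transactions if items <= set(t))
    return matching / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = [t for t in transactions if set(antecedent) <= set(t)]
    with_both = [t for t in with_antecedent if set(consequent) <= set(t)]
    return len(with_both) / len(with_antecedent)

print(support(transactions, "M", "S"))     # 3 of 6 purchases
print(confidence(transactions, "M", "S"))  # 3 of the 5 purchases containing Milk
```

Support is measured against the whole population, while confidence is measured only against the purchases that contain the antecedent.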
Clustering

“The process of dividing a dataset into
mutually exclusive groups such that the
members of each group are as "close"
as possible to one another, and different
groups are as "far" as possible from one
another, where distance is measured
with respect to all available variables.”
Clustering




Birch Algorithm
points inserted into multidimensional
tree
items guided to leaf nodes "near"
representative internal nodes
nearby points clustered into one leaf
node
Clustering


Example of Clustering
predict what new movies a person is
interested in



1) a person’s past movie preferences
2) others with similar preferences
3) preferences of those in the pool for new
movies
Clustering



1) cluster people with similar movie
preferences
2) given a new movie goer, find a
cluster of similar movie goers
3) then predict the cluster's new movie
preferences
Amazon Examples
Amazon Examples