NF-SS: A Normal Form for Semistructured Schemata Gillian Dobbie

DASWIS 2001
NF-SS: A Normal Form for Semistructured Schemata
Xiaoying Wu, Tok Wang Ling, Sin Yeung Lee, Mong Li Lee
National University of Singapore
Gillian Dobbie
University of Auckland, New Zealand
Outline
1. Motivations
2. Semistructured schema and its data tree
3. Integrity constraints for semistructured data
4. NF-SS: Normal Form for Semistructured Schemata
5. Designing semistructured schemata into NF-SS
6. Discussion of the design approach
7. Comparison with related proposals
8. Summary
1. Motivation: Example 1
<!ELEMENT department (course+)>
<!ATTLIST department name ID #REQUIRED>
<!ELEMENT course (student*)>
<!ATTLIST course cid ID #REQUIRED
                 title CDATA #IMPLIED>
<!ELEMENT student (grade?)>
<!ATTLIST student sid ID #REQUIRED
                  name CDATA #REQUIRED
                  age CDATA #IMPLIED>
<!ELEMENT grade (#PCDATA)>
Schema tree (the multiplicity is shown before each sub-object type):
department (name)
  + course (cid, title)
    * student (sid, age, name)
      ? grade
1. Motivation (cont.)
• Redundancy: name and age of a student
• Updating anomalies:
– Insertion
– Rewriting
– Deletion
Data tree:
department (name: CS)
  course (cid: cs4221, title: database design)
    student (sid: s01, name: Jack, age: 21)
    student (sid: s02, name: Tom)
      grade: “A”
  course (cid: cs5220, title: data Mining)
    student (sid: s01, name: Jack, age: 21)
1. Motivation:Example 2
<!ELEMENT teacher (ClassRoom*)>
<!ATTLIST teacher tid ID #REQUIRED
                  name CDATA #REQUIRED>
<!ELEMENT ClassRoom (subject*)>
<!ATTLIST ClassRoom room# ID #REQUIRED>
<!ELEMENT subject (time)>
<!ATTLIST subject cid ID #REQUIRED>
<!ELEMENT time EMPTY>
<!ATTLIST time day CDATA #REQUIRED
               hour CDATA #REQUIRED>
Schema tree as drawn on the slide:
teacher (tid, name)
  * ClassRoom (room#)
    * subject (cid)
      * time (day, hour)
Path anomaly:
– The schema doesn’t reflect the integrity constraint:
tid, day, hour → cid, room#
2. Semistructured Schema and Data tree
A semistructured schema is defined to be D = (E, A, B, P, R, r), where
• E is a finite set of object types in D.
• A is a finite set of attributes, disjoint from E.
• B is a set of basic domain types such as string, integer, Boolean, etc.
• P is a function from E to object type definitions, in which each symbol in {*, +, ?, 1} is called a multiplicity.
e.g.: P(course) = student*
• R is a function from E to the power set of A.
e.g.: R(student) = {sid, name, age}
• r ∈ E is called the object type of the root.
e.g.: r = department
[Figure: the schema tree of Example 1, annotated with E (object types), A (attributes), r (root object type) and the multiplicity symbols.]
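As a rough illustration, the six components of D can be held in ordinary Python containers. This is only a sketch: the class name `Schema`, the method `check`, and the choice of dicts and sets are assumptions made here, not part of the original definition.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    """D = (E, A, B, P, R, r), kept as plain Python containers."""
    E: frozenset   # object types
    A: frozenset   # attributes, disjoint from E
    B: frozenset   # basic domain types
    P: dict        # object type -> list of (sub-object type, multiplicity)
    R: dict        # object type -> set of attribute names
    r: str         # object type of the root

    def check(self):
        """Validate the side conditions stated in the definition."""
        assert self.E.isdisjoint(self.A), "A must be disjoint from E"
        assert self.r in self.E, "the root must be an object type"
        for parent, children in self.P.items():
            assert parent in self.E
            for child, mult in children:
                assert child in self.E and mult in {"*", "+", "?", "1"}
        for obj, attrs in self.R.items():
            assert obj in self.E and set(attrs) <= self.A
        return True

# The department schema of Example 1.
department_schema = Schema(
    E=frozenset({"department", "course", "student", "grade"}),
    A=frozenset({"name", "cid", "title", "sid", "age"}),
    B=frozenset({"string", "integer", "Boolean"}),
    P={"department": [("course", "+")],
       "course": [("student", "*")],
       "student": [("grade", "?")],
       "grade": []},
    R={"department": {"name"},
       "course": {"cid", "title"},
       "student": {"sid", "name", "age"},
       "grade": set()},
    r="department",
)
```

Here `department_schema.P["course"]` plays the role of P(course) = student*, and `department_schema.R["student"]` the role of R(student).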
2. Semistructured Schema and Data tree (Cont.)
A data tree T with respect to a semistructured schema D = (E,
A, B, P, R, r) is defined to be a tree T=(V, lab, obj, att, val, root),
showing a database instance.
Data tree:
department (name: CS)
  course (cid: cs4221, title: database design)
    student (sid: s01, name: Jack, age: 21)
    student (sid: s02, name: Tom)
      grade: “A”
  course (cid: cs5220, title: data Mining)
    student (sid: s01, name: Jack, age: 21)
2. Semistructured Schema and Data tree (Cont.)
• The path of a node n in semistructured schema D is denoted as PathD(n). e.g.: PathD for student is /department/course/student
• The path of a node v in data tree T is denoted as PathT(v). e.g.: PathT for student “s02” is /department/course/student
• The target set of node n in T, T[n], is {v : v ∈ V, n ∈ E ∪ A ∧ PathT(v) = PathD(n)}. e.g.: the target set T[student] includes the node of the student with sid “s02”, etc.
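The target set can be computed by walking the data tree along the schema path. The sketch below assumes a hypothetical `Node` class in which attributes are stored as child nodes carrying a value; the slides' formal tuple (V, lab, obj, att, val, root) is not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str                                   # object type or attribute name
    value: object = None                       # set for attribute and leaf nodes
    children: list = field(default_factory=list)

def target_set(root, schema_path):
    """Return all nodes v of the data tree whose PathT(v) equals schema_path."""
    steps = schema_path.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    frontier = [root]
    for step in steps[1:]:
        # Descend one level, keeping only children with the expected tag.
        frontier = [c for n in frontier for c in n.children if c.tag == step]
    return frontier

def student(sid, name, age=None, grade=None):
    kids = [Node("sid", sid), Node("name", name)]
    if age is not None:
        kids.append(Node("age", age))
    if grade is not None:
        kids.append(Node("grade", grade))
    return Node("student", children=kids)

# The data tree of Example 1.
tree = Node("department", children=[
    Node("name", "CS"),
    Node("course", children=[
        Node("cid", "cs4221"), Node("title", "database design"),
        student("s01", "Jack", 21), student("s02", "Tom", grade="A")]),
    Node("course", children=[
        Node("cid", "cs5220"), Node("title", "data Mining"),
        student("s01", "Jack", 21)]),
])

students = target_set(tree, "/department/course/student")
sids = [c.value for s in students for c in s.children if c.tag == "sid"]
```

With this tree, `students` holds three nodes, because student s01 is stored once under each course.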
[Figure: the schema tree of Example 1 next to a data tree fragment: department (name: CS) / course (cid: cs4221, title: database design) / student (sid: s02, name: Tom) / grade “A”.]
2. Semistructured Schema and Data tree (Cont.)
• Two nodes from two data trees w.r.t. schema D satisfy value equality (written =v) iff
– they are attribute nodes with the same tag and the same value; or
– they are object nodes having the same tag and their children are pairwise value equal.
• Let T1 and T2 be two data trees w.r.t. schema D = (E, A, B, P, R, r), and X ∈ E ∪ A. T1 and T2 agree on X, denoted as T1 =X T2, iff the following condition holds: ∀ t1 ∈ T1[X], ∃ t2 ∈ T2[X], such that (t1 =v t2)
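A minimal sketch of the two definitions follows. It rests on two assumptions made here: "pairwise value equal" is read as a one-to-one matching of children regardless of order, and the agreement condition is read as "for every t1 in T1[X] there is some t2 in T2[X] with t1 =v t2". Nodes are written as (tag, payload) pairs, a representation chosen only for brevity.

```python
def value_equal(n1, n2):
    """Value equality for nodes given as (tag, payload) pairs.

    An attribute node carries a scalar payload; an object node carries a
    list of child nodes.
    """
    tag1, p1 = n1
    tag2, p2 = n2
    if tag1 != tag2 or isinstance(p1, list) != isinstance(p2, list):
        return False
    if not isinstance(p1, list):
        return p1 == p2                      # attribute nodes: same value
    if len(p1) != len(p2):
        return False
    unmatched = list(p2)
    for c1 in p1:                            # match each child exactly once
        for i, c2 in enumerate(unmatched):
            if value_equal(c1, c2):
                del unmatched[i]
                break
        else:
            return False
    return True

def agree(targets1, targets2):
    """T1 and T2 agree on X: each node of T1[X] has a value-equal node in T2[X]."""
    return all(any(value_equal(t1, t2) for t2 in targets2) for t1 in targets1)

jack_a = ("student", [("sid", "s01"), ("name", "Jack"), ("age", 21)])
jack_b = ("student", [("age", 21), ("sid", "s01"), ("name", "Jack")])
tom = ("student", [("sid", "s02"), ("name", "Tom"), ("grade", "A")])
```

Under this reading the two s01 nodes are value equal even though their children are listed in a different order, while `agree([jack_a, tom], [jack_b])` fails because Tom has no counterpart.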
[Figure: the data tree of Example 1. The two student nodes with sid s01 (name: Jack, age: 21), one under course cs4221 and one under course cs5220, are value equal.]
3. Integrity Constraints for Semistructured Data
• Extended Functional Dependency (EFD)
Let D = (E, A, B, P, R, r) be a semistructured schema, and let X ⊆ E ∪ A and Y ⊆ E ∪ A. Y is extended functionally dependent on X, denoted as X → Y. Let S denote a set of data trees that are images of D. S satisfies X → Y iff, for any data trees T1, T2 in S, if they agree on every component in X, then they agree on Y; that is, ∀ T1, T2 ∈ S: (∀ x ∈ X, T1 =x T2) ⇒ T1 =Y T2.
• Inference rules for EFDs
E1 (reflexivity): if Y ⊆ X, then X → Y, for any X, Y ⊆ E ∪ A
E2 (augmentation): if X → Y, then XZ → YZ, for any X, Y, Z ⊆ E ∪ A
E3 (transitivity): if X → Y and Y → Z, then X → Z, for any X, Y, Z ⊆ E ∪ A
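Rules E1 to E3 have the same shape as Armstrong's axioms for relational functional dependencies, so the standard closure computation can be used to test whether an EFD follows from a given set. The sketch below assumes that this carries over unchanged; the function names and the `object@attribute` strings are illustrative choices, not notation from the slides.

```python
def closure(components, efds):
    """Return every component derivable from `components` with rules E1-E3.

    `efds` is an iterable of (lhs, rhs) pairs of frozensets, read as lhs -> rhs.
    """
    result = set(components)          # E1: X determines each of its own parts
    changed = True
    while changed:
        changed = False
        for lhs, rhs in efds:
            # E2 and E3: once lhs is covered, rhs can be added
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def implies(efds, lhs, rhs):
    """True if lhs -> rhs can be derived from `efds`."""
    return set(rhs) <= closure(lhs, efds)

# Some EFDs of the department example.
efds = [
    (frozenset({"course@cid"}), frozenset({"course@title"})),
    (frozenset({"student@sid"}), frozenset({"student@name", "student@age"})),
    (frozenset({"course@cid", "student@sid"}), frozenset({"grade"})),
]
```

For instance, `implies(efds, {"course@cid", "student@sid"}, {"grade", "student@age"})` holds, while `implies(efds, {"course@cid"}, {"grade"})` does not.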
3. Integrity Constraints for Semistructured Data (Cont.)
• Notation: O1[@X1], …, Oi[@Xi], …, On-1[@Xn-1] → On[@Xn]
• An EFD X → Y is a partial EFD if there exists an X’ ⊂ X such that X’ → Y. Otherwise, it is a full EFD.
e.g.: (1) course[@cid], student[@sid] → student[@name] is a partial EFD
(2) student[@sid] → student[@name] is a full EFD
• X → Y is said to be coherent iff /X/Y is a path in D; otherwise it is called an incoherent EFD.
e.g.: teacher[@tid], time[@day, @hour] → subject[@cid] is an incoherent EFD, since /teacher/time/subject is not a path in the schema.
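Both properties can be tested mechanically. The sketch below makes two assumptions that go beyond the slide: a partial EFD is detected by trying every proper, non-empty subset of the left-hand side, and "/X/Y is a path in D" is read as "each listed object type is a descendant of, or the same as, the one listed before it". The helper names are hypothetical.

```python
from itertools import combinations

def closure(components, efds):
    """Components derivable from `components`; efds are (lhs, rhs) frozenset pairs."""
    result = set(components)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in efds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_partial(lhs, rhs, efds):
    """True if some proper, non-empty subset of lhs already determines rhs."""
    lhs = list(lhs)
    for size in range(1, len(lhs)):
        for subset in combinations(lhs, size):
            if set(rhs) <= closure(subset, efds):
                return True
    return False

def is_coherent(object_types, parent):
    """True if the object types, in the given order, lie on one downward path.

    `parent` maps each object type to its parent (None for the root).
    """
    def ancestor_or_self(a, d):
        while d is not None:
            if d == a:
                return True
            d = parent.get(d)
        return False
    return all(ancestor_or_self(a, d)
               for a, d in zip(object_types, object_types[1:]))

efds = [(frozenset({"student@sid"}), frozenset({"student@name"}))]
parent = {"teacher": None, "ClassRoom": "teacher",
          "subject": "ClassRoom", "time": "subject"}
```

With the teacher schema, the order teacher, time, subject is rejected because time sits below subject, which matches the incoherent example above.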
Schema tree of Example 2:
teacher (tid, name)
  * ClassRoom (room#)
    * subject (cid)
      * time (day, hour)
3. Integrity Constraints for Semistructured Data (Cont.)
• If there exist Y, Z ⊆ E ∪ A such that X → Y, Y → Z and Y ↛ X, then Z is transitively extended functionally dependent on X via Y.
e.g.: age is transitively dependent on course via student, since
(1) course[@cid] → student[@sid],
(2) student[@sid] → student[@age], and
(3) student[@sid] ↛ course[@cid]
[Figure: the schema tree of Example 1: department (name) / + course (cid, title) / * student (sid, age, name) / ? grade.]
3. Integrity Constraints for Semistructured Data (Cont.)
• Theorem: Let D = (E, A, B, P, R, r) be a semistructured schema, and X, Y, Z ⊆ E ∪ A. If Z is transitively dependent on X via Y, then there exists a data tree of D where a rewriting anomaly occurs upon updating the values of Z.
[Figure: the data tree of Example 1. Student s01 (name: Jack, age: 21) is stored under both course cs4221 and course cs5220, so updating the age means rewriting both nodes.]
3. Integrity Constraints for Semistructured Data (Cont.)
• Key Constraints: based on EFD semantics
• Notation: KO = O1[@X1]/…/Oi[@Xi]/…/On[@Xn]/O[@X] for the key of an object type O in semistructured schema D, where /O1/…/O is a path in D.
If n equals one, then KO is called an absolute key. Otherwise it is called a relative key.
Example
• Kbook = book[@isbn]. Kbook is an absolute key.
• Kchapter = book[@isbn]/chapter[@number]. Kchapter is a relative key.
• Ksection = book[@isbn]/chapter[@number]/section[@number]. Ksection is a relative key.
Schema tree:
book (isbn)
  + chapter (number)
    + section (number)
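To make the difference between absolute and relative keys concrete, the sketch below checks the three keys of the book example on instance data. It assumes that a key must identify each node uniquely, and it uses nested dicts as a stand-in for a data tree; the function name `key_violations` is hypothetical.

```python
def key_violations(books):
    """Return the key values of Kbook, Kchapter and Ksection that occur twice.

    `books` is a list of dicts of the form
    {"isbn": ..., "chapters": [{"number": ..., "sections": [{"number": ...}]}]}.
    """
    seen, duplicates = set(), []
    for book in books:
        keys = [(book["isbn"],)]                                    # Kbook
        for chapter in book.get("chapters", []):
            keys.append((book["isbn"], chapter["number"]))          # Kchapter
            for section in chapter.get("sections", []):
                keys.append((book["isbn"], chapter["number"],
                             section["number"]))                    # Ksection
        for key in keys:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
    return duplicates

books = [
    {"isbn": "111", "chapters": [
        {"number": 1, "sections": [{"number": 1}, {"number": 2}]},
        {"number": 2, "sections": [{"number": 1}]}]},
    {"isbn": "222", "chapters": [
        {"number": 1, "sections": [{"number": 1}]}]},
]
```

Chapter number 1 may appear in both books because Kchapter is relative to the book; a second chapter 1 inside book 111 would be reported as the duplicate key ("111", 1).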
3. Integrity Constraints for Semistructured Data (Cont.)
Let D be a semistructured schema and O be its root object type. The set of basic dependencies of D, denoted as BD(D), is defined as follows:
• Let X, Y be children of O. Non-trivial extended functional dependencies of the form X → Y, where X is a key of O or Y is part of a key of O, are in BD(D).
• Let O1 be a sub-object type of O, and let D1 be the schema tree rooted at O1 with KO added as attribute(s) of O1; then BD(D1) ⊆ BD(D).
• No other non-trivial dependency is in BD(D).
4. NF-SS
Let D be a semistructured schema and O be its root object type. D is in Normal Form for Semistructured Schemata (NF-SS) iff
1. O has at least one key.
2. For any non-trivial EFD of the form X → Y satisfied by O, where X and Y are attributes of O, either X is a key of O or Y is part of a key of O.
3. For any sub-object type O1 of O:
(a) If KO is added to O1 as its components, with everything else unchanged, the schema tree rooted at O1 is in NF-SS.
(b) KO ∩ KO1 = ∅ or KO ⊆ KO1, where KO and KO1 are the keys of O and O1 respectively.
(c) O1 is not transitively dependent on KO.
4. Any non-trivial EFD in D can be derived from BD(D) by using the inference rules for EFDs.
5. Designing Semistructured Schema into NF-SS
• We adopt a restructuring approach for the design.
• We propose four heuristic restructuring rules
– Decomposing object types.
– Creating new object types.
– Regrouping components of an object type.
• Objective
– Remove transitive, partial and incoherent EFDs from the given dependency and key constraints.
5. Designing Semistructured Schema into NF-SS(cont.)
Rule 1. (Remove Transitive Dependency by Decomposition)
Given an object type O in a semistructured schema D, if there are some non-prime component(s) Y of O that are transitively dependent on some key of O, i.e., KO → X, X → Y, X ↛ KO and X ∩ KO = ∅, then restructure the schema as follows:
1. Duplicate X to form new node(s) Z.
2. Move Y and all the descendants of Y and their corresponding edges under Z.
3. Make X a foreign key of O, and add a reference edge from the original node X to Z.
5. Designing Semistructured Schema into NF-SS (cont.)
• Example 5.1: schema D satisfies the following EFDs:
(1) department[@name] → course[@cid]
(2) course[@cid] → department
(3) course[@cid] → course[@title]
(4) course[@cid] → student[@sid]
(5) course[@cid], student[@sid] → grade
(6) student[@sid] → student[@name, @age]
Schema before restructuring:
department (name)
  + course (cid, title)
    * student (sid, age, name)
      ? grade
Schema after applying Rule 1:
department (name)
  + course (cid, title)
    * student1 (sid, referencing student2)
      ? grade
  student2 (sid, age, name)
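The effect of Rule 1 on the student object type can be sketched as a transformation of a schema dictionary. This is a simplified illustration under several assumptions: X and Y are attributes only (no sub-object types are moved), the new object type is stored as a separate entry rather than attached at a particular place in the tree, and the names `apply_rule1` and `refs` are made up for the example.

```python
import copy

def apply_rule1(schema, obj, x, y, new_obj):
    """Sketch of Rule 1 for attribute-only X and Y.

    `schema` maps each object type to
    {"attrs": [...], "children": [(sub-object type, multiplicity)], "refs": {...}}.
    """
    result = copy.deepcopy(schema)     # leave the input schema untouched
    node = result[obj]
    # Steps 1 and 2: duplicate X into a new object type and move Y under it.
    result[new_obj] = {"attrs": list(x) + list(y), "children": [], "refs": {}}
    node["attrs"] = [a for a in node["attrs"] if a not in y]
    # Step 3: X stays in O as a foreign key referencing the new object type.
    for a in x:
        node["refs"][a] = new_obj
    return result

schema = {
    "department": {"attrs": ["name"], "children": [("course", "+")], "refs": {}},
    "course": {"attrs": ["cid", "title"], "children": [("student", "*")], "refs": {}},
    "student": {"attrs": ["sid", "name", "age"], "children": [("grade", "?")], "refs": {}},
    "grade": {"attrs": [], "children": [], "refs": {}},
}

# student[@sid] -> student[@name, @age] is the transitive part: X = {sid}, Y = {name, age}.
after = apply_rule1(schema, "student", ["sid"], ["name", "age"], "student2")
```

After the call, `student` keeps only `sid` (playing the role of student1 in the example) and `student2` holds `sid`, `name` and `age`.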
5. Designing Semistructured Schema into NF-SS(cont.)
Rule 2. (Remove Path Anomaly by Path Splitting)
Given a semistructured schema D, suppose there exists an incoherent EFD O1[@X1], …, On[@Xn] → Y, where Y is either an object type or an attribute, and there exists a path P that contains {O1, …, On, Y}. Path P can be split into two sub-paths P1 and P2, where P1 contains only {O1, …, On} and Y, while P2 contains {O1, …, On} and (P − Y).
5. Designing Semistructured Schema into NF-SS (cont.)
• Example 5.2: schema D satisfies the following EFDs:
(1) teacher[@tid], time → ClassRoom
(2) teacher[@tid], time → subject
Schema before restructuring:
teacher (tid, name)
  * ClassRoom (room#)
    * subject (cid)
      * time (day, hour)
Schema after applying Rule 2:
teacher (tid, name)
  * time (day, hour)
    ClassRoom (room#)
    subject (cid)
5. Designing Semistructured Schema into NF-SS(cont.)
Rule 3. (Remove Partial Dependency by Creating a New Object Type)
Given an object type O in a semistructured schema, let X be a set of prime attributes of O, and Y be the set of O’s attributes. Let O1 be a sub-object type of O. If (KO − X) → O1 and no proper superset of X satisfies this property, then restructure the schema as follows:
1. (KO ∩ Y − X) becomes the only attribute(s) of O, while O1 remains its sub-object type.
2. Create a new object type O2 that is a direct component of O.
3. Move the rest of the components of O and all their descendants and corresponding edges under O2.
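5. Designing Semistructured Schema into NF-SS (cont.)
• Example 5.3: schema D is shown in Figure (a). It satisfies the EFDs {O[@A, @B] → D, O[@A, @B] → O2, O[@A] → O1, O[@A] → E}, and the key of O is {A, B}.
(a) Un-normalized schema, as O1 is partially dependent on the key {A, B}: O has A, B, D, E and sub-object types O1 (C) and O2 (F).
Applying Rule 3 (figure label: O[@K, @B] → O2) gives:
(b) Un-normalized schema, as O’[@A] → E is an incoherent EFD: O’ has A and sub-object types O1 (C) and O3; O3 has B, D, E and sub-object type O2 (F).
Applying Rule 2 (figure label: O’[@A] → E) gives:
(c) Normalized schema: O’’ has A, E and sub-object types O1 (C) and O3; O3 has B, D and sub-object type O2 (F).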
5. Designing Semistructured Schema into NF-SS(cont.)
Rule 4. (Restructuring to Satisfy Condition 3(b) of the NF-SS Definition)
Given an object type O in a semistructured schema D, let X be a set of O’s attributes and single-valued atomic sub-object types, and let O1 be a complex sub-object type of O. Suppose O1 has relative key KO1, but KO ⊄ KO1 and KO1 ∩ KO ≠ ∅. Let Y = KO ∩ KO1 ∩ X, with Y ≠ ∅. D is restructured as follows:
1. O1 remains a sub-object type of O.
2. Make Y the components of O.
3. Create a new object type O2 as a child of O; the remaining components of O (excluding Y) become children of O2.
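5. Designing Semistructured Schema into NF-SS (cont.)
• Example 5.4: schema D in Figure (a) satisfies the EFDs (1) O[@K, @A] → O1 and (2) O[@K, @B] → O2, and the key of O is {K, A, B}.
(a) Un-normalized schema, as O1 and O2 are partially dependent on {K, A, B}: O has K, A, B and sub-object types O1 (C, D) and O2 (E, F).
Applying Rule 3 (figure label: O[@K, @B] → O2) gives:
(b) Un-normalized schema, as KO’ = O’[@K, @A] and KO3 = O’[@K]/O3[@B], such that KO’ ⊄ KO3: O’ has K, A and sub-object types O1 (C, D) and O3; O3 has B and sub-object type O2 (E, F).
Applying Rule 4 (figure label: KO’ ⊄ KO3) gives:
(c) Normalized schema: O’’ has K and sub-object types O4 and O3; O4 has A and sub-object type O1 (C, D); O3 has B and sub-object type O2 (E, F).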
5. Designing Semistructured Schema into NF-SS(cont.)
Algorithm 1: Restructuring Algorithm
Input: A set S of semistructured schemas, and a set of EFDs for S.
Output: A set of semistructured schemas that are in NF-SS.
Begin
1. for each semistructured schema D in S do
   if D is not in NF-SS then repeat until no further change:
   (1) if there exists a transitive EFD KO → X, X → Y and X ↛ KO for an object type O in D:
       Case X ∩ KO = ∅: apply Rule 1 to remove the transitive EFD.
       Case X ⊂ KO: apply Rule 3 to remove the transitive EFD.
       Case X ⊄ KO: apply Rule 4 to remove the transitive EFD.
   (2) if there exists an incoherent EFD then apply Rule 2 to remove it.
2. output S.
End
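The control flow of the algorithm can be sketched as a case selection plus a fixpoint loop. The sketch assumes that the three cases of step (1) are read as: X disjoint from KO, X a proper subset of KO, and X overlapping KO without being contained in it. The callbacks `find_violation` and `apply_rule` are hypothetical placeholders for the detection and restructuring steps, which are not implemented here.

```python
def choose_rule(x, key):
    """Pick the rule for a transitive EFD KO -> X, X -> Y, following step (1)."""
    x, key = set(x), set(key)
    if key <= x:
        # X would determine KO by reflexivity, so the EFD is not transitive.
        raise ValueError("X must not contain the whole key")
    if not x & key:
        return "Rule 1"      # X and KO are disjoint
    if x < key:
        return "Rule 3"      # X is a proper part of the key
    return "Rule 4"          # X overlaps KO without being contained in it

def restructure(schema, find_violation, apply_rule):
    """Repeat until no further change.

    find_violation(schema) returns a violation or None;
    apply_rule(schema, violation) returns the restructured schema.
    """
    while (violation := find_violation(schema)) is not None:
        schema = apply_rule(schema, violation)
    return schema
```

For example, `choose_rule({"sid"}, {"cid", "sid"})` selects Rule 3, since sid alone is part of the key. The loop terminates only if every rule application removes a violation without creating a new one, which the sketch does not check.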
6. Discussion of Restructuring Approach for Designing
• Are the restructuring rules complete? No.
– covering is not guaranteed
– dependency preservation is not guaranteed
• Does it give a unique solution? No.
– the result depends on the order in which the dependencies are examined
• The design task can be made easier if more semantics are available.
– In [5], we have proposed another approach for designing semistructured databases using ORA-SS, a semantically rich model.
• Nevertheless, it does give practical heuristics and provides insights into the normalization task for semistructured databases.
7. Comparison with Related Proposal
• The first attempt to define a normal form for semistructured data ([ER’99] S.Y.Lee, M.L.Lee, T.W.Ling, and L.A.Kalinichenko) [3]
– Defines a schema called S3-Graph, which makes no distinction between element nodes and attribute nodes and has no cardinality specification.
– Proposes S3-NF, but lacks key constraints, an essential part of database design.
– The decomposition method may not be able to remove some other kinds of anomalies, such as partial dependency and path anomaly, that may exist in a schema.
• The most recent proposal: XNF (XML Normal Form) ([ER 2001] D.W.Embley and W.Y.Mok) [2]
– It mainly provides algorithms to translate a schema, represented in a conceptual model called CM hypergraphs, into a scheme-tree forest in XNF.
– Like S3-Graph, a scheme tree doesn’t lend itself to XML definition.
– XNF isn’t formulated with the concept of key.
– The algorithms given suffer from inefficiency.
– A large set of results is expected.
8. Summary
• A normal form for semistructured schemata
– It incorporates integrity constraints.
– It guarantees no redundancy and hence no undesirable updating anomalies for the conforming semistructured databases.
– It gives more reasonable representations of real-world semantics.
• Restructuring approach for designing semistructured databases
– A set of heuristic restructuring rules is proposed.
– An algorithm for iteratively restructuring a schema into NF-SS is developed.
– It provides insights into the normalization task for semistructured databases.
References
1. J. Clark and S. DeRose. XML Path Language (XPath). W3C Working Draft, November 1999. http://www.w3.org/TR/xpath.
2. D. W. Embley and W. Y. Mok. Developing XML Documents with Guaranteed “Good” Properties. Proceedings of the 20th International Conference on Conceptual Modeling (ER), 2001.
3. S. Y. Lee, M. L. Lee, T. W. Ling and L. A. Kalinichenko. Designing Good Semi-structured Databases. Proceedings of the 18th International Conference on Conceptual Modeling (ER), 1999.
4. T. W. Ling and L. L. Yan. NF-NR: A Practical Normal Form for Nested Relations. Journal of Systems Integration, Vol. 4, 1994, pp. 309-340.
5. X. Wu, T. W. Ling, M. L. Lee and G. Dobbie. Designing Semistructured Databases Using the ORA-SS Model. Accepted for publication in Proceedings of the 2nd International Conference on Web Information Systems Engineering (WISE), IEEE Computer Society, Kyoto, Japan, December 2001.
Q&A