Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach

advertisement
Resolving Structural Conflicts in
the Integration of XML Schemas:
A Semantic Approach
Xia Yang
Mong Li Lee
Tok Wang Ling
National University of Singapore
1
Outline
Introduction
 Background
 Preliminaries
 Motivating Example
 Integration Algorithm
 Related work
 Conclusion

2
Introduction



Recent research in integrating XML data
sources has mainly concentrated on schema
matching.
The XML Schema or DTD is lacking in
semantics.
The source schemas are heterogeneous,
containing various conflicts involving naming
conflict, cardinality conflict, structural conflict.
3
Background



Most of the work of integration of XML has
focused on the matching problem to find
equivalent elements among the different sources.
LSD [4] employs instance information and
machine learning techniques base on the
instance information in their integration work.
E. Jeong and C.-N. Hsu [7] use schema learning
to generate a set of tree grammar rules from the
DTDs in a class and optimizes the rules to
transforms them into an integrated view.
[4].A. Doan, P. Domingos, A. Levy. Learning Source Descriptions for Data Integration. WebDB, 2000.
[7].E. Jeong, C.-N. Hsu. Induction of Integrated View for XML Data with Heterogeneous DTDs. ACM CIKM, 2001.
4
Background (cont.)

E. Jeong and C.-N. Hsu [7]
 DTD
clustering clusters DTDs in similar domains into
classes.
 Schema learning applies a tree grammar inference
technique to generate a set of tree grammar rules
from the DTD in a class from the previous step.
 Minimization optimizes the rules generated in the
previous step and transforms them into an integrated
view.
5
Background (cont.)



These work lack of semantic meaning, which
may lead to the wrong integrated schema.
All these work do not take into consideration the
importance of the individual data sources, and
how the majority of the local schemas model
their data.
They are binary strategies, cannot take the
importance of sources into consideration.
6
Preliminaries
 ORA-SS
Model
 The
ORA-SS model (Object-RelationshipAttribute model for Semi-Structured data) is a
semantically rich data model that has been
designed for semi-structured data [5].
 The ORA-SS model distinguishes between
objects, relationships and attributes.
[5]. G. Dobbie, X. Wu, T.W. Ling, M.L. Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. Technical Report TR21/00, National University of Singapore, 2000.
7
Preliminaries (cont.)

ORA-SS Schema Diagram example.
project
jp,2,1:n,1:n
jno
part
funds
project manager
jps,3,1:n,1:n
sno
supplier
jps
pno
uno
mno
name
quantity
8
Preliminaries (cont.)

Assumptions for the algorithm





The input to the proposed integration algorithm is a set of ORASS schemas with source weight.
The output of the algorithm is an integrated schema, also
modeled in ORA-SS.
The integrated schema should contain all the information
modeled in the original schemas. Further, the integrated schema
should be as simple and concise as possible to facilitate users’
understanding.
For meaningful integration to occur, we assume that the various
sources model similar domains.
Object classes and relationship sets with the same label name
are considered to be semantically equivalent. Attributes of the
same object class (or relationship set) with the same label name
are also semantically equivalent.
9
Motivating Example
project manager
project
organization
jno
project
manager
part
local funds
foreign funds
mno
name
email
org name
pno
fno
lno
abbreviation
(a) Schema S1, sw1=1
(b) Schema S2, sw2=1
project
project
js,2,1:n,1:n
jno
jp,2,1:n,1:n
staff
supplier
jno
jsp,3,1:n,1:n
part
name
funds
project manager
ordinary staff
sno
mno
part
jps,3,1:n,1:n
project manager
sno
pno
full name
name
jps
uno
mno
name
address eno
org name
abbreviation
supplier
pno
quantity
full name
(c) Schema S3, sw3=7
(d) Schema S4, sw4=1
The swi under each schema indicates the source weight, i.e., the importance of a source.
This is determined by users or computed based on some statistic information.
10
Motivating Example





A. Resolve attribute-object class conflict.
B. Resolve generalizations and
specializations.
C. Merge the schemas to obtain an
integrated graph.
D. Transform integrated graph to resolve
structural conflicts and remove redundancy.
E. Augment Graph with Attributes.
11
A. Resolve attribute-object class conflict.

This occurs when a concept has been
modeled as an attribute in one schema,
and as an object class in another schema.

This conflict can be resolved by
transforming the attribute to an object
class.
12
A. Resolve attribute-object class conflict. (cont.)
(example)
project manager
project
organization
jno
project
manager
pno
part
local funds
mno
foreign funds
name
email
org name
fno
lno
abbreviation
(a) Schema S1, sw1=1
full name
(b) Schema S2, sw2=1
project
jno project manager
mno
pno
part
local funds
lno
foreign funds
fno
(c) Schema S1’: Attribute “project manager” in schema S1 has
been transformed into an object class “project manager” in S1’.
13
B. Resolve generalizations and specializations.

A generalization exists when an object
class in one schema is the union of
several object classes in another schema.

The integrated schema will include the
generalization isa hierarchy.
14
B. Resolve generalizations and specializations. (cont.)
(example)
project
project
jp,2,1:n,1:n
jno
jno
project
manager
pno
part
local funds
lno
foreign funds
part
funds
project manager
jps,3,1:n,1:n
sno
fno
pno
Schema S1, sw1=1
supplier
jps
uno
mno
name
quantity
Schema S4, sw4=1
funds
local funds
lno
foreign funds
fno
Build a generalization hierarchy from part of S1,
which is used for next step to generate an integrated graph
15
C. Merge the schemas to obtain an integrated
graph.



Each node in the graph denotes an object class,
and edges represent the relationship sets
among the object classes.
To facilitate processing, attributes are first
omitted from the integrated graph. The attributes
will be incorporated into the final integrated
schema.
Compute the edge weight.
16
C. Merge the schemas to obtain an
integrated graph. (cont.)

Compute the edge weight.
 For
each original source, the edge weight is the
source weight multiplied by the number of relationship
sets involved in this edge.
 The edge weight in the integrated graph is the sum of
all the edge weights of this edge from the original
sources.
17
C. Merge the schemas to obtain an integrated graph.
(cont.) (example)
project manager
project
organization
jno
project
manager
part
foreign funds
local funds
mno
name
email
org name
pno
fno
lno
abbreviation
(a) Schema S1, sw1=1
(b) Schema S2, sw2=1
project
project
js,2,1:n,1:n
jno
jp,2,1:n,1:n
staff
supplier
jno
jsp,3,1:n,1:n
part
name
funds
project manager
ordinary staff
sno
mno
part
jps,3,1:n,1:n
project manager
sno
pno
full name
name
jps
uno
mno
name
address eno
org name
abbreviation
supplier
pno
quantity
full name
(c) Schema S3, sw3=7
(d) Schema S4, sw4=1
18
C. Merge the schemas to obtain an integrated
graph. (cont.)

Example of edge weight
 Since
we have “project” as the parent of “project
manager” in schemas S1 and S4, the weight of the
edge from “project” to “project manager” is given by
the sum of the weights of these schemas, that is,
1+1=2.
 Since “project” is the parent of “staff” in schema S3
only, the weight of this edge is 7. Since the edge from
“project” to “supplier” in S3 is actually involved in two
relationship sets js and jsp, its edge weight would be
given by 7*2=14.
19
C. Merge the schemas to obtain an integrated
graph. (cont.) (example)
project
js,2,1:n,1:n
14
supplier
3
staff
2
7
jsp,3,1:n,1:n
1
2
7
7
project manager
funds
7
ordinary staff
1
local funds
1
foreign funds
1
part
7
organization
1
org name
Integrated graph obtained from the schemas in page 10.
20
D. Transform integrated graph to resolve structural
conflicts and remove redundancy.

D-1. Differentiate semantically different relationship
sets among equivalent object classes.

D-2. Remove relationship sets that are projections of
higher degree relationship sets.

D-3. Resolve ancestor-descendant conflicts.

D-4. Remove transitive relationship sets.

D-5. Remove other type of multiple parent nodes.
21
multiple parent nodes

If a node has more than one incoming
edges in an integrated graph, it is called a
multiple parent node.
22
D-1. Differentiate semantically different relationship
sets among equivalent object classes.

If the relationship sets among the same
object classes are semantically different.
Then duplicate the nodes and make
foreign key-key references in the
integrated graph. Move the object classes
involved in n-nary (n>2) relationship set.
23
D-1. Differentiate semantically different relationship
sets among equivalent object classes. (cont.) (example)
Ph1 and ph2 are semantically different.
person
ph2,2,1:n,1:n
person
ph1,2,1:n,1:n
house
name
name
address
address
phc,3,1:1,1:n
house
contract
type
type
time
cid
Schema S5
Schema S6
person
person
ph1,2,1:n,1:n
ph1,2,1:n,1:n
ph2,2,1:n,1:n
ph2,2,1:n,1:n
house
house1
house2
house
phc,3,1:1,1:n
address
address
address
type
phc,3,1:1,1:n
contract
contract
Transformed graph G56’
Integrated graph G56
because ph1 and ph2 are semantically different.
back
24
D-2. Remove relationship sets that are projections of
higher degree relationship sets.
A schema may model a relationship set
that is a projection of another relationship
set in another schema.
 Keep the complete relationship set in the
integrated schema.

25
D-2. Remove relationship sets that are projections of
higher degree relationship sets. (cont.) (example)
If jp is a projection of jsp.
project
js,2,1:n,1:n
jno
jsp,3,1:n,1:n
jp
project
jno
manager
part
part
name
local funds
foreign funds
lno
fno
project manager ordinary staff
sno
pno
pno
staff
supplier
project
mno
name
org name
abbreviation
Schema S1
full name
Schema S3
project
project
supplier
supplier
part
part
Part of Integrated graph G13
address eno
Part of Transformed graph G13’
26
D-3. Resolve ancestor-descendant
conflicts.



An ancestor-descendant conflict arises when a schema
models an object class A as an ancestor of object class
B, and another schema models B as the ancestor of A.
Such conflicts appear as cycles in the integrated graph.
The simplest form of this conflict is the parent-child
conflict.
In the ancestor-descendant conflict cycle, remove the
edge with the smallest edge weight, if this relationship
set can be derived from other relationship sets in the
cycle. If there are two edges with the same smallest
edge weight, remove either one.
27
D-3. Resolve ancestor-descendant conflicts. (cont.)
(example)
project
js,2,1:n,1:n
14
supplier
3
staff
2
7
jsp,3,1:n,1:n
1
2
7
7
project manager
funds
7
ordinary staff
1
local funds
1
foreign funds
1
part
7
organization
1
org name
Parent-child conflict
28
D-3. Resolve ancestor-descendant conflicts. (cont.)
(example)
project
js,2,1:n,1:n
14
3
supplier
2
7
staff
2
7
jsp,3,1:n,1:n
7
project manager
funds
7
ordinary staff
1
1
local funds
foreign funds
1
part
7
organization
1
org name
Parent-child conflict: the edge from part to supplier is removed.
29
D-3. Resolve ancestor-descendant conflicts. (cont.)
(example)
head secretary
department
dh
sd
dh
head
dname
hs
hid
department
sid
sd
department
head
hs
head secretary
head secretary
dname
sid
Schema S7
department
Schema S8
Integrated graph G78
head secretary
dh
head
sd
hs
department
head secretary
head
hs
dh
head secretary
head
Case 1:
Case 2(a):
if sd can be derived by dh and hs
if hs derived by sd and dh
and sw7=2 and sw8=1)
and sw7=1 and sw8=2
Transformed graph G78’
Transformed graph G78’’(a)
sd
department
Case 2(b):
if dh derived by hs and sd)
and sw7=1 and sw8=2
Transformed graph G78’’(b)
30
D-4. Remove transitive relationship
sets and redundant object classes.



If one relationship set from object class A to object class
B can be derived from relationship sets which is from A
to other object class sets and back to B, it is called
transitive relationship set.
Transitive relationship sets are also redundant, and can
be removed so that the resulting integrated graph will be
concise, if the intermediate node has attribute or other
sub-object classes.
If the intermediate node has no attribute and no other
sub-object classes, it will be considered as redundant
object classes.
31
D-4. Remove transitive relationship sets and
redundant object classes. (cont.) (example)
project
js,2,1:n,1:n
14
3
supplier
2
7
staff
2
7
jsp,3,1:n,1:n
7
project manager
funds
7
ordinary staff
1
local funds
1
foreign funds
1
part
7
organization
1
org name
32
D-4. Remove transitive relationship sets and
redundant object classes. (cont.) (example)
project
js,2,1:n,1:n
14
2
7
supplier
staff
jsp,3,1:n,1:n
7
7
project manager
funds
7
ordinary staff
1
local funds
1
foreign funds
part
7
org name
33
D-5. Remove other type of multiple
parent nodes.

If a node has more than one incoming edges in
an integrated graph, it is called a multiple parent
node.
 Case
1: D-1 Different relationship sets among the
same object classes.
 Case 2: D-2 Relationship sets that are projections of
the higher degree relationship sets.
 Case 3: D-3 Ancestor-descendant conflicts.
 Case 4: D-4 Transitive relationship sets.
 Case 5: D-5 Others. As examples in the following
page.
34
D-5. Remove multiple parent nodes. (cont.)
(example)
Case 5:
school
project
school
jd
scname
jno
student
stduent
stduent
jd
snu
email
snu
Schema S9
project
address
mark
Schema S10
Integrated graph G9-10
school
project
jd
student
student1
student2
jd
snu
email
address
snu
snu
mark
Transformed Graph G9-10’
back
35
Transformed graph (summary)
project
js,2,1:n,1:n
14
supplier
3
staff
2
7
jsp,3,1:n,1:n
1
2
7
7
project manager
funds
7
ordinary staff
1
local funds
1
foreign funds
1
part
7
organization
1
org name
Original integrated graph
36
Transformed graph (cont.)
project
js,2,1:n,1:n
14
2
7
supplier
staff
jsp,3,1:n,1:n
7
7
project manager
funds
7
ordinary staff
1
local funds
1
foreign funds
part
7
org name
Transformed graph
37
E. Augment Graph with Attributes




Augment the graph with the attributes of object classes
in the integrated schema.
Augment the graph with the attributes of relationship
sets in the integrated schema.
For the attributes of duplicated object classes in case D1 and D-5, the attributes will become the attributes of the
original ones, not the duplicated object classes.D-1 D-5
For attributes of relationship sets which have been
removed, the attributes will be the attributes of
relationship sets which could derive this relationship set.
38
Final integrated schema (example)
project
js,2,1:n,1:n
jno
supplier
funds
staff
jsp,3,1:n,1:n
sno
name
part
project manager
ordinary staff
local funds
foreign funds
jsp
pno
quantity mno
name
email address eno
lno
fno
org name
abbreviation
full name
back
39
Integration Algorithm
1.
Preprocessing.
a. Resolve attribute-object class conflict.
b. Resolve generalizations and specializations.
2.
3.
4.
Construct integrated graph.
Transform graph.
Augment graph with attributes.
40
Step 3 Transform Graph





3.1 Differentiate semantically different
relationship sets among equivalent object
classes.
3.2 Remove relationship sets that are
projections of higher degree relationship sets.
3. 3 Resolve any ancestor-descendant conflicts
which create cycles in G.
3.4 Remove transitive relationship sets and
redundant object classes.
3.5 Remove other multiple parent nodes.
41
Related work (compare with [7])
project
js,2,1:n,1:n
jno
supplier
staff
local funds
foreign funds
funds
jsp,3,1:n,1:n
name
part
project manager
ordinary staff
sno quantity
pno
lno
mno name
fno
uno
email address eno
org name
abbreviation
full name
problems by Jeong and C.-N. Hsu [7].
42
Related work (compare with [7]) (cont.)
project
js,2,1:n,1:n
jno
supplier
funds
staff
jsp,3,1:n,1:n
name
sno
part
project manager
ordinary staff
local funds
foreign funds
jsp
pno
quantity mno
name
email address eno
lno
fno
org name
abbreviation
full name
Integrated schema obtained by our approach
43
Related work (compare with [7]) (cont.)



Our proposed method employs the ORA-SS conceptual
model which is able to capture the semantics necessary
for the resolution of structural conflict during integration.
We could take into consideration the importance of the
individual data sources, and how the majority of the local
schemas model their data.
We employ n-nary strategy. While The binary strategy
will not be able to utilize the source importance and how
the majority of the sources model the data. N-nary
strategy is also faster compare to the binary strategy.
( For example, sw1=2,sw2=1,sw3=1,sw4=1and schema 2, 3, 4 are same. The binary
strategy might treat sw1 as the most important schema, while in fact schema 2, 3 and
4 are.)
44
Conclusion





In this paper, we have introduced a semantic approach to resolve
structural conflicts in the integration of XML schemas.
We employed the ORA-SS semantic data model to capture the
implicit semantics in an XML schema.
We presented a comprehensive n-nary algorithm to integrate XML
schemas.
our algorithm takes into account the data semantics, the importance
of a source, and how the majority of the sources model their data.
Structural conflicts such as attribute/object class conflict, ancestordescendant conflict are resolved in our approach. We also remove
redundant object classes and relationship sets such as transitive
relationship sets, and relationship sets, which are projections of
higher degree relationship sets in order to obtain a concise
integrated schema.
45
References






















1. S.Castano, V. Antonellis, S. C. Vimercati, M. Melchiori. An XML-Based Framework for Information Integration over the Web. IIWAS,
2000.
2.Y.B. Chen, T.W. Ling, M.L. Lee. Designing Valid XML Views. ER, 2002.
3.P. Buneman, S. Davidson, W. Fan, C. Hara, W.C. Tan. Keys for XML. WWW, 2001.
4.A. Doan, P. Domingos, A. Levy. Learning Source Descriptions for Data Integration. WebDB, 2000.
5.G. Dobbie, X. Wu, T.W. Ling, M.L. Lee. ORA-SS: An Object-Relationship-Attribute Model for Semi-structured Data. Technical Report
TR21/00, National University of Singapore, 2000.
6.E. Rahm, P. Bernstein. On Matching Schemas Automatically. MSR Tech. Report MSR-TR-2001-17, 2001.
7.E. Jeong, C.-N. Hsu. Induction of Integrated View for XML Data with Heterogeneous DTDs. ACM CIKM, 2001.
8.T.W. Ling, M.L. Lee. Relational to Entity-Relationship Schema Translation Using Semantic and Inclusion Dependencies, in Journal of
Integrated Computer-Aided Engineering, John-Wiley Publishers, Vol 2, No 2, pages 125-145, 1995.
9.M.L. Lee, T.W. Ling. Resolving Structural Conflicts in the Integration of Entity-Relationship Schemas. OOER, 1995.
10.M.L. Lee, T.W. Ling. Resolving Constraint Conflicts in the Integration of Entity-Relationship Schemas. ER, 1997.
11.M.L. Lee, T.W. Ling, W.L. Low. Designing Functional Dependencies for XML, EDBT, 2002.
12.M.L. Lee, L.H. Yang, W. Hsu, X. Yang. XClust: Clustering XML Schemas for Effective Integration, ACM CIKM, 2002.
13.D. Maier. Theory of Relational Databases. Computer Science Press, 1983.
14.J. Madhavan, P.A. Bernstein, E. Rahm. Generic Schema Matching with Cupid. VLDB, 2001.
15.R. Mello, S. Castano, C.A. Heuser. A Method for the Unification of XML. Information and Software Technology Journal, 2002.
16.P. Mitra, G. Wiederhold and J. Jannink. Semi-automatic Integration of Knowledge Sources. Fusion, 1999.
17.P. Mitra, G. Wiederhold, M. Kersten. A Graph-Oriented Model for Articulation of Ontology Interdependencies. EDBT 2000.
18.F. Naumann, U. Leser, J.C. Freytag. Quality-driven Integration of Heterogeneous Information Systems. VLDB, 1999.
19.C. Reynaud, J.-P. Sirot, D. Vodislav. Semantic Integration of XML Heterogeneous Data Sources. IDEAS, 2001.
20.http://www.cogsci.princeton.edu/~wn
21.Xyleme. A dynamic warehouse for XML Data of the Web. IEEE Data Engineering Bulletin 24(2):40-47, 2001.
22.L.L. Yan, T.W. Ling. Translating Relational Schema with Constraints into OODB Schema. IFIP DS-5 Semantics of Interoperable
Database Systems. 1992
46
Download