Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach Xia Yang Mong Li Lee Tok Wang Ling National University of Singapore 1 Outline Introduction Background Preliminaries Motivating Example Integration Algorithm Related work Conclusion 2 Introduction Recent research in integrating XML data sources has mainly concentrated on schema matching. The XML Schema or DTD is lacking in semantics. The source schemas are heterogeneous, containing various conflicts involving naming conflict, cardinality conflict, structural conflict. 3 Background Most of the work of integration of XML has focused on the matching problem to find equivalent elements among the different sources. LSD [4] employs instance information and machine learning techniques base on the instance information in their integration work. E. Jeong and C.-N. Hsu [7] use schema learning to generate a set of tree grammar rules from the DTDs in a class and optimizes the rules to transforms them into an integrated view. [4].A. Doan, P. Domingos, A. Levy. Learning Source Descriptions for Data Integration. WebDB, 2000. [7].E. Jeong, C.-N. Hsu. Induction of Integrated View for XML Data with Heterogeneous DTDs. ACM CIKM, 2001. 4 Background (cont.) E. Jeong and C.-N. Hsu [7] DTD clustering clusters DTDs in similar domains into classes. Schema learning applies a tree grammar inference technique to generate a set of tree grammar rules from the DTD in a class from the previous step. Minimization optimizes the rules generated in the previous step and transforms them into an integrated view. 5 Background (cont.) These work lack of semantic meaning, which may lead to the wrong integrated schema. All these work do not take into consideration the importance of the individual data sources, and how the majority of the local schemas model their data. They are binary strategies, cannot take the importance of sources into consideration. 6 Preliminaries ORA-SS Model The ORA-SS model (Object-RelationshipAttribute model for Semi-Structured data) is a semantically rich data model that has been designed for semi-structured data [5]. The ORA-SS model distinguishes between objects, relationships and attributes. [5]. G. Dobbie, X. Wu, T.W. Ling, M.L. Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. Technical Report TR21/00, National University of Singapore, 2000. 7 Preliminaries (cont.) ORA-SS Schema Diagram example. project jp,2,1:n,1:n jno part funds project manager jps,3,1:n,1:n sno supplier jps pno uno mno name quantity 8 Preliminaries (cont.) Assumptions for the algorithm The input to the proposed integration algorithm is a set of ORASS schemas with source weight. The output of the algorithm is an integrated schema, also modeled in ORA-SS. The integrated schema should contain all the information modeled in the original schemas. Further, the integrated schema should be as simple and concise as possible to facilitate users’ understanding. For meaningful integration to occur, we assume that the various sources model similar domains. Object classes and relationship sets with the same label name are considered to be semantically equivalent. Attributes of the same object class (or relationship set) with the same label name are also semantically equivalent. 9 Motivating Example project manager project organization jno project manager part local funds foreign funds mno name email org name pno fno lno abbreviation (a) Schema S1, sw1=1 (b) Schema S2, sw2=1 project project js,2,1:n,1:n jno jp,2,1:n,1:n staff supplier jno jsp,3,1:n,1:n part name funds project manager ordinary staff sno mno part jps,3,1:n,1:n project manager sno pno full name name jps uno mno name address eno org name abbreviation supplier pno quantity full name (c) Schema S3, sw3=7 (d) Schema S4, sw4=1 The swi under each schema indicates the source weight, i.e., the importance of a source. This is determined by users or computed based on some statistic information. 10 Motivating Example A. Resolve attribute-object class conflict. B. Resolve generalizations and specializations. C. Merge the schemas to obtain an integrated graph. D. Transform integrated graph to resolve structural conflicts and remove redundancy. E. Augment Graph with Attributes. 11 A. Resolve attribute-object class conflict. This occurs when a concept has been modeled as an attribute in one schema, and as an object class in another schema. This conflict can be resolved by transforming the attribute to an object class. 12 A. Resolve attribute-object class conflict. (cont.) (example) project manager project organization jno project manager pno part local funds mno foreign funds name email org name fno lno abbreviation (a) Schema S1, sw1=1 full name (b) Schema S2, sw2=1 project jno project manager mno pno part local funds lno foreign funds fno (c) Schema S1’: Attribute “project manager” in schema S1 has been transformed into an object class “project manager” in S1’. 13 B. Resolve generalizations and specializations. A generalization exists when an object class in one schema is the union of several object classes in another schema. The integrated schema will include the generalization isa hierarchy. 14 B. Resolve generalizations and specializations. (cont.) (example) project project jp,2,1:n,1:n jno jno project manager pno part local funds lno foreign funds part funds project manager jps,3,1:n,1:n sno fno pno Schema S1, sw1=1 supplier jps uno mno name quantity Schema S4, sw4=1 funds local funds lno foreign funds fno Build a generalization hierarchy from part of S1, which is used for next step to generate an integrated graph 15 C. Merge the schemas to obtain an integrated graph. Each node in the graph denotes an object class, and edges represent the relationship sets among the object classes. To facilitate processing, attributes are first omitted from the integrated graph. The attributes will be incorporated into the final integrated schema. Compute the edge weight. 16 C. Merge the schemas to obtain an integrated graph. (cont.) Compute the edge weight. For each original source, the edge weight is the source weight multiplied by the number of relationship sets involved in this edge. The edge weight in the integrated graph is the sum of all the edge weights of this edge from the original sources. 17 C. Merge the schemas to obtain an integrated graph. (cont.) (example) project manager project organization jno project manager part foreign funds local funds mno name email org name pno fno lno abbreviation (a) Schema S1, sw1=1 (b) Schema S2, sw2=1 project project js,2,1:n,1:n jno jp,2,1:n,1:n staff supplier jno jsp,3,1:n,1:n part name funds project manager ordinary staff sno mno part jps,3,1:n,1:n project manager sno pno full name name jps uno mno name address eno org name abbreviation supplier pno quantity full name (c) Schema S3, sw3=7 (d) Schema S4, sw4=1 18 C. Merge the schemas to obtain an integrated graph. (cont.) Example of edge weight Since we have “project” as the parent of “project manager” in schemas S1 and S4, the weight of the edge from “project” to “project manager” is given by the sum of the weights of these schemas, that is, 1+1=2. Since “project” is the parent of “staff” in schema S3 only, the weight of this edge is 7. Since the edge from “project” to “supplier” in S3 is actually involved in two relationship sets js and jsp, its edge weight would be given by 7*2=14. 19 C. Merge the schemas to obtain an integrated graph. (cont.) (example) project js,2,1:n,1:n 14 supplier 3 staff 2 7 jsp,3,1:n,1:n 1 2 7 7 project manager funds 7 ordinary staff 1 local funds 1 foreign funds 1 part 7 organization 1 org name Integrated graph obtained from the schemas in page 10. 20 D. Transform integrated graph to resolve structural conflicts and remove redundancy. D-1. Differentiate semantically different relationship sets among equivalent object classes. D-2. Remove relationship sets that are projections of higher degree relationship sets. D-3. Resolve ancestor-descendant conflicts. D-4. Remove transitive relationship sets. D-5. Remove other type of multiple parent nodes. 21 multiple parent nodes If a node has more than one incoming edges in an integrated graph, it is called a multiple parent node. 22 D-1. Differentiate semantically different relationship sets among equivalent object classes. If the relationship sets among the same object classes are semantically different. Then duplicate the nodes and make foreign key-key references in the integrated graph. Move the object classes involved in n-nary (n>2) relationship set. 23 D-1. Differentiate semantically different relationship sets among equivalent object classes. (cont.) (example) Ph1 and ph2 are semantically different. person ph2,2,1:n,1:n person ph1,2,1:n,1:n house name name address address phc,3,1:1,1:n house contract type type time cid Schema S5 Schema S6 person person ph1,2,1:n,1:n ph1,2,1:n,1:n ph2,2,1:n,1:n ph2,2,1:n,1:n house house1 house2 house phc,3,1:1,1:n address address address type phc,3,1:1,1:n contract contract Transformed graph G56’ Integrated graph G56 because ph1 and ph2 are semantically different. back 24 D-2. Remove relationship sets that are projections of higher degree relationship sets. A schema may model a relationship set that is a projection of another relationship set in another schema. Keep the complete relationship set in the integrated schema. 25 D-2. Remove relationship sets that are projections of higher degree relationship sets. (cont.) (example) If jp is a projection of jsp. project js,2,1:n,1:n jno jsp,3,1:n,1:n jp project jno manager part part name local funds foreign funds lno fno project manager ordinary staff sno pno pno staff supplier project mno name org name abbreviation Schema S1 full name Schema S3 project project supplier supplier part part Part of Integrated graph G13 address eno Part of Transformed graph G13’ 26 D-3. Resolve ancestor-descendant conflicts. An ancestor-descendant conflict arises when a schema models an object class A as an ancestor of object class B, and another schema models B as the ancestor of A. Such conflicts appear as cycles in the integrated graph. The simplest form of this conflict is the parent-child conflict. In the ancestor-descendant conflict cycle, remove the edge with the smallest edge weight, if this relationship set can be derived from other relationship sets in the cycle. If there are two edges with the same smallest edge weight, remove either one. 27 D-3. Resolve ancestor-descendant conflicts. (cont.) (example) project js,2,1:n,1:n 14 supplier 3 staff 2 7 jsp,3,1:n,1:n 1 2 7 7 project manager funds 7 ordinary staff 1 local funds 1 foreign funds 1 part 7 organization 1 org name Parent-child conflict 28 D-3. Resolve ancestor-descendant conflicts. (cont.) (example) project js,2,1:n,1:n 14 3 supplier 2 7 staff 2 7 jsp,3,1:n,1:n 7 project manager funds 7 ordinary staff 1 1 local funds foreign funds 1 part 7 organization 1 org name Parent-child conflict: the edge from part to supplier is removed. 29 D-3. Resolve ancestor-descendant conflicts. (cont.) (example) head secretary department dh sd dh head dname hs hid department sid sd department head hs head secretary head secretary dname sid Schema S7 department Schema S8 Integrated graph G78 head secretary dh head sd hs department head secretary head hs dh head secretary head Case 1: Case 2(a): if sd can be derived by dh and hs if hs derived by sd and dh and sw7=2 and sw8=1) and sw7=1 and sw8=2 Transformed graph G78’ Transformed graph G78’’(a) sd department Case 2(b): if dh derived by hs and sd) and sw7=1 and sw8=2 Transformed graph G78’’(b) 30 D-4. Remove transitive relationship sets and redundant object classes. If one relationship set from object class A to object class B can be derived from relationship sets which is from A to other object class sets and back to B, it is called transitive relationship set. Transitive relationship sets are also redundant, and can be removed so that the resulting integrated graph will be concise, if the intermediate node has attribute or other sub-object classes. If the intermediate node has no attribute and no other sub-object classes, it will be considered as redundant object classes. 31 D-4. Remove transitive relationship sets and redundant object classes. (cont.) (example) project js,2,1:n,1:n 14 3 supplier 2 7 staff 2 7 jsp,3,1:n,1:n 7 project manager funds 7 ordinary staff 1 local funds 1 foreign funds 1 part 7 organization 1 org name 32 D-4. Remove transitive relationship sets and redundant object classes. (cont.) (example) project js,2,1:n,1:n 14 2 7 supplier staff jsp,3,1:n,1:n 7 7 project manager funds 7 ordinary staff 1 local funds 1 foreign funds part 7 org name 33 D-5. Remove other type of multiple parent nodes. If a node has more than one incoming edges in an integrated graph, it is called a multiple parent node. Case 1: D-1 Different relationship sets among the same object classes. Case 2: D-2 Relationship sets that are projections of the higher degree relationship sets. Case 3: D-3 Ancestor-descendant conflicts. Case 4: D-4 Transitive relationship sets. Case 5: D-5 Others. As examples in the following page. 34 D-5. Remove multiple parent nodes. (cont.) (example) Case 5: school project school jd scname jno student stduent stduent jd snu email snu Schema S9 project address mark Schema S10 Integrated graph G9-10 school project jd student student1 student2 jd snu email address snu snu mark Transformed Graph G9-10’ back 35 Transformed graph (summary) project js,2,1:n,1:n 14 supplier 3 staff 2 7 jsp,3,1:n,1:n 1 2 7 7 project manager funds 7 ordinary staff 1 local funds 1 foreign funds 1 part 7 organization 1 org name Original integrated graph 36 Transformed graph (cont.) project js,2,1:n,1:n 14 2 7 supplier staff jsp,3,1:n,1:n 7 7 project manager funds 7 ordinary staff 1 local funds 1 foreign funds part 7 org name Transformed graph 37 E. Augment Graph with Attributes Augment the graph with the attributes of object classes in the integrated schema. Augment the graph with the attributes of relationship sets in the integrated schema. For the attributes of duplicated object classes in case D1 and D-5, the attributes will become the attributes of the original ones, not the duplicated object classes.D-1 D-5 For attributes of relationship sets which have been removed, the attributes will be the attributes of relationship sets which could derive this relationship set. 38 Final integrated schema (example) project js,2,1:n,1:n jno supplier funds staff jsp,3,1:n,1:n sno name part project manager ordinary staff local funds foreign funds jsp pno quantity mno name email address eno lno fno org name abbreviation full name back 39 Integration Algorithm 1. Preprocessing. a. Resolve attribute-object class conflict. b. Resolve generalizations and specializations. 2. 3. 4. Construct integrated graph. Transform graph. Augment graph with attributes. 40 Step 3 Transform Graph 3.1 Differentiate semantically different relationship sets among equivalent object classes. 3.2 Remove relationship sets that are projections of higher degree relationship sets. 3. 3 Resolve any ancestor-descendant conflicts which create cycles in G. 3.4 Remove transitive relationship sets and redundant object classes. 3.5 Remove other multiple parent nodes. 41 Related work (compare with [7]) project js,2,1:n,1:n jno supplier staff local funds foreign funds funds jsp,3,1:n,1:n name part project manager ordinary staff sno quantity pno lno mno name fno uno email address eno org name abbreviation full name problems by Jeong and C.-N. Hsu [7]. 42 Related work (compare with [7]) (cont.) project js,2,1:n,1:n jno supplier funds staff jsp,3,1:n,1:n name sno part project manager ordinary staff local funds foreign funds jsp pno quantity mno name email address eno lno fno org name abbreviation full name Integrated schema obtained by our approach 43 Related work (compare with [7]) (cont.) Our proposed method employs the ORA-SS conceptual model which is able to capture the semantics necessary for the resolution of structural conflict during integration. We could take into consideration the importance of the individual data sources, and how the majority of the local schemas model their data. We employ n-nary strategy. While The binary strategy will not be able to utilize the source importance and how the majority of the sources model the data. N-nary strategy is also faster compare to the binary strategy. ( For example, sw1=2,sw2=1,sw3=1,sw4=1and schema 2, 3, 4 are same. The binary strategy might treat sw1 as the most important schema, while in fact schema 2, 3 and 4 are.) 44 Conclusion In this paper, we have introduced a semantic approach to resolve structural conflicts in the integration of XML schemas. We employed the ORA-SS semantic data model to capture the implicit semantics in an XML schema. We presented a comprehensive n-nary algorithm to integrate XML schemas. our algorithm takes into account the data semantics, the importance of a source, and how the majority of the sources model their data. Structural conflicts such as attribute/object class conflict, ancestordescendant conflict are resolved in our approach. We also remove redundant object classes and relationship sets such as transitive relationship sets, and relationship sets, which are projections of higher degree relationship sets in order to obtain a concise integrated schema. 45 References 1. S.Castano, V. Antonellis, S. C. Vimercati, M. Melchiori. An XML-Based Framework for Information Integration over the Web. IIWAS, 2000. 2.Y.B. Chen, T.W. Ling, M.L. Lee. Designing Valid XML Views. ER, 2002. 3.P. Buneman, S. Davidson, W. Fan, C. Hara, W.C. Tan. Keys for XML. WWW, 2001. 4.A. Doan, P. Domingos, A. Levy. Learning Source Descriptions for Data Integration. WebDB, 2000. 5.G. Dobbie, X. Wu, T.W. Ling, M.L. Lee. ORA-SS: An Object-Relationship-Attribute Model for Semi-structured Data. Technical Report TR21/00, National University of Singapore, 2000. 6.E. Rahm, P. Bernstein. On Matching Schemas Automatically. MSR Tech. Report MSR-TR-2001-17, 2001. 7.E. Jeong, C.-N. Hsu. Induction of Integrated View for XML Data with Heterogeneous DTDs. ACM CIKM, 2001. 8.T.W. Ling, M.L. Lee. Relational to Entity-Relationship Schema Translation Using Semantic and Inclusion Dependencies, in Journal of Integrated Computer-Aided Engineering, John-Wiley Publishers, Vol 2, No 2, pages 125-145, 1995. 9.M.L. Lee, T.W. Ling. Resolving Structural Conflicts in the Integration of Entity-Relationship Schemas. OOER, 1995. 10.M.L. Lee, T.W. Ling. Resolving Constraint Conflicts in the Integration of Entity-Relationship Schemas. ER, 1997. 11.M.L. Lee, T.W. Ling, W.L. Low. Designing Functional Dependencies for XML, EDBT, 2002. 12.M.L. Lee, L.H. Yang, W. Hsu, X. Yang. XClust: Clustering XML Schemas for Effective Integration, ACM CIKM, 2002. 13.D. Maier. Theory of Relational Databases. Computer Science Press, 1983. 14.J. Madhavan, P.A. Bernstein, E. Rahm. Generic Schema Matching with Cupid. VLDB, 2001. 15.R. Mello, S. Castano, C.A. Heuser. A Method for the Unification of XML. Information and Software Technology Journal, 2002. 16.P. Mitra, G. Wiederhold and J. Jannink. Semi-automatic Integration of Knowledge Sources. Fusion, 1999. 17.P. Mitra, G. Wiederhold, M. Kersten. A Graph-Oriented Model for Articulation of Ontology Interdependencies. EDBT 2000. 18.F. Naumann, U. Leser, J.C. Freytag. Quality-driven Integration of Heterogeneous Information Systems. VLDB, 1999. 19.C. Reynaud, J.-P. Sirot, D. Vodislav. Semantic Integration of XML Heterogeneous Data Sources. IDEAS, 2001. 20.http://www.cogsci.princeton.edu/~wn 21.Xyleme. A dynamic warehouse for XML Data of the Web. IEEE Data Engineering Bulletin 24(2):40-47, 2001. 22.L.L. Yan, T.W. Ling. Translating Relational Schema with Constraints into OODB Schema. IFIP DS-5 Semantics of Interoperable Database Systems. 1992 46