On View Support for a Native XML DBMS Ting Chen , Tok Wang Ling School of Computing, National University of Singapore Daofeng Luo, Xiaofeng Meng Information School , Remin University of China 1 Outline View for XML Documents ORA-SS: Object-Relationship-Attribute Model for Semi-structured Data ORA-SS class diagram and instance diagram ORA-SS for view schema definition Element-Based Clustering (EBC) Basic Approach ORA-SS and EBC XML View Transformation Two Main Approaches Problems Problem Definition Algorithm Conclusion 2 View for XML Documents Two main approaches to define views Define views in script languages like XQuery or XSLT Define views by Schema-Mapping General but demanding from users’ point of view because XQuery and XSLT scripts are complex Difficult to optimize from performance view point E.g. Clio[7] and eXeclon[3] A declarative approach; alleviate users from writing complex scripts to perform view transformation Schema mappings can then be translated into XQuery (or XSLT) scripts We focus on the problem of view transformation through schema mapping 3 View for XML Documents View for XML Documents via SchemaMapping Problem: Current XML schema formats are not able to express views with semantic constraints, resulting in ambiguity E.g. (Next Slide) The source XML file contains information about researchers working under different projects and the publication list for each researcher. 4 View for XML Documents The view schema in Fig (c) of the above diagram is such an example. It has at least two possible meanings which can lead to different view results! 1. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper. 2. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper who work for the project. 5 ORA-SS Data Model ORA-SS[2] Object Class Relationship Type Attribute( Object attribute or Relationship attribute) E.g. An ORA-SS Instance Diagram * Compared with the XML document in Slide 5, two extra fields (attribute Date and sub-element Position) are added in the above ORASS instance diagram 6 ORA-SS Data Model ORA-SS Schema Diagram •There are two binary relationship types in the schema: Project-Researcher(JR) and Researcher-Paper(RP). The set of papers under a researcher doesn’t depend on the project he/she works in. •Position is an attribute of relationship type JR instead of Researcher. This means that a researcher may hold different positions across projects he works in. •Date is a single-valued attribute of object class Paper. Different occurrences of the same paper will always have the same Date value. •J_Name,R_Name and P_ID are identifiers of object classes Project , Researcher and Paper respectively as indicated by solid circles. Identifier values are used to tell if two object occurrences are identical. 7 ORA-SS Data Model ORA-SS for View Schema Definition It is able to define views with different semantics View (a) has two binary relationship types. The intention of the view schema is to find all the papers published by researchers in a project; and for each paper to find all of its authors. View (b) has only one ternary relationship type. The view is defined to find all the papers published by researchers in a project; however, for each paper View (b) only finds those authors working for the project. View (a) View (b) 8 ORA-SS Data Model ORA-SS for View Schema Definition The two view schemas in Slide 8 correspond to two different XSLT scripts. The following is the XSLT script for schema (a): <root> <xsl:for-each-group select="root/Project" group-by="@J_Name"> <Project> <J_Name><xsl:value-of select="@J_Name"/></J_Name> <xsl:for-each-group select="current-group()/Researcher/Paper" group-by="@P_Name"> <Paper> <xsl:variable name="vPName" select="@P_Name"/> <P_Name><xsl:value-of select="@P_Name"/></P_Name> <xsl:for-each-group select="/root/Project/Researcher[Paper/@P_Name =$vPName]" group-by="@R_Name"> <Researcher> <R_Name><xsl:value-of select="@R_Name"/></R_Name> </Researcher> </xsl:for-each-group> </Paper> </xsl:for-each-group> </Project> </xsl:for-each-group> </root> 9 ORA-SS Data Model The following is the XSLT script for schema (b): <root> <xsl:for-each-group select="root/Project" group-by="@J_Name"> <Project> <J_Name><xsl:value-of select="@J_Name"/></J_Name> <xsl:for-each-group select="current-group()/Researcher/Paper" group-by="@P_Name"> <Paper> <xsl:variable name="vPName" select="@P_Name"/> <P_Name><xsl:value-of select="@P_Name"/></P_Name> <xsl:for-each-group select="current-group()/.." group-by="@R_Name"> <Researcher> <R_Name><xsl:value-of select="@R_Name"/></R_Name> </Researcher> </xsl:for-each-group> </Paper> </xsl:for-each-group> </Project> </xsl:for-each-group> </root> The main difference of two scripts lies in the third xsl:for-each-group directive for Researcher. Script for Schema (a) needs to search the whole document to find the complete author list of a paper because authors may not work for the same project. On the other hand, script for Schema (b) avoids the global search because it only needs find authors of the paper working for the same project. 10 Element-Based-Clustering (EBC) Element Based Clustering[5] Extension of Element Based (EB[6]) Strategy Element nodes (records) with the same tag name are clustered and organized as a list A A A B A A C B B A C B 11 Element-Based-Clustering (EBC) EBC: Node labeling EBC gives labels for nodes in a XML document Labels for nodes in a XML data tree can be calculated in the following manner: 1. The root element has label nil 2. Perform a pre-order traversal (i.e. Document order) on the XML document For node x: (“+” here means string concatenation) label(x) = label(x.parent) + “.” + position of x in x.parent’s childList Node A is ancestor of node B if Label(A) is the prefix of node Label(B) and vice versa 12 ORA-SS and EBC •How does ORA-SS schema help tune XML document storage? Project j1(1) Researcher j2(2) r1(1.1) r2(1.2) Position Leader(1.2.3) Paper r2(2.1) Staff(2.1.3) r3(2.2) Leader(2.2.2) p1,05/2002(1.1.1) p1,05/2002(1.2.1) p2,03/2000(1.2.2) p1,05/2002(2.1.1) p1,03/2000(2.1.2) p2,05/2002(2.2.1) 13 ORA-SS and EBC 1. Object identifier of an object will be stored together with the object. Objects of the same class form a cluster. Each cluster is a sequential file. 2. Relationship attribute values will be stored in separate cluster. 3. For object attribute values, we need some heuristics. It is attempting to store object attributes together with the object since they are likely to be accessed at the same time. However, if an object class has too many objects attributes, to store all attribute values together with the object brings us to a situation similar to Subtree-Based storage strategy. One solution is to store only those essential (this can be determined by users) object attributes with the object and leave other to separated clusters. 4. Node (which can be object, relationship value and object attribute value) labels will be stored together with nodes. 14 View Transformation: Problem Definition Problem Definition Given an XML document D1, a source schema V1 and a valid view schema V2 of V1, transform D1 to document D2, so that D2 is a valid document under V2. Source Document View Transformation View Document User Defined Schema Mapping Source Schema View Schema 15 View Transformation •View schema defined in DTD can have different interpretations which result in different views Proj View Schema (Ambiguous) JP;2 Slide 8(a) Paper Proj Paper PR;2 Researcher ? Slide 8(b) Proj Researcher Paper JPR;3 Researcher 16 View Transformation View schema expressed in ORA-SS is unambiguous The relationship set of an ORA-SS view schema clearly defines how view document should be constructed Two basic techniques are used in construction of a single relationship in view schema Structural join: SJ (based on object labels) For each relationship R in view schema, we first use structural join to find the set of paths of type R such that the object occurrences in each path locate on the same path in source document. Value join : Merge (based on logical object identifiers) In XML document an object can have many occurrences. Two occurrences are considered as the same if they have identical object identifier. The set of paths resulted from structural join is then value joined (or merged) using logical object keys. 17 View Transformation What makes things complicated? A path in view schema can have more than one relationship and we need to join two relationship together. E.g. View schema in Slide 8(a) contains two relationships. Two relationships are “joined” on their overlapping object classes. So the results constructed for one relationship may be used in construction of another. Value join (merge) based on logical keys destroys the “sorted-ness” of the output path list of structural join. The consequence is that the output of value join can’t be used in subsequent structural join with paths from other relationships efficiently. (Structural join requires sorted input lists) Solution: Duplicated-Preserving Merge (D-Merge). It keeps the structural join output path list intact. For two occurrences of the same object in the list, their child contents will be merged and then each of them will have its own copy of the merged content. D-Merge is used only when a relationship needs to join with another relationship. 18 View Transformation: Algorithm The relationship set of a view schema determines the view transformation process. Take the two view schemas in Slide 8 as an example: View(a): View(b): 1. L := SJ(list(R), list(P), {P,R}) 1. L := SJ(list(J), list(P), {J,P}) 2. L := D-Merge(L, {P}) 2. L := SJ(L, list(R), {J,P,R}); 3. L := SJ(L, list(J), {J,P}); 3. L := Merge(L,{J}); 4. L := Merge(L,{J}); 19 View Transformation: Algorithm Structural Join Based on Object Label Binary Structural Join[1] Input Two sorted (on node numbers) node lists: AList of potential ancestor nodes and DList of potential descendants nodes Output OutputList = [(ai; dj)] of join results, in which ai is the parent/ancestor of dj and ai is from AList and dj is from DList EBC schemes stores elements with the same tag name in pre-ordered( i.e. sorted on element number) way; structural join can be naturally applied 20 View Transformation: Algorithm Structural Join: Based on Object Label Complex Structural Join: (Example: structural join of three sorted input lists) Input Sorted node lists: A,B,C Output OutputList = [(ai; bj ; ck)] of join results, in which ai, bj , ck (from List A,B,C respectively) are located on the same path in source document Two binary joins: Step 1: Join A and B OutputList AB: [(ai; bj )] sorted on ai Step 2: Join AB (using ai as node label) and C OutputList = [(ai; bj ; ck)] sorted on ai Important: ck should be on the same path as both ai AND bj 21 View Transformation: Algorithm A motivating example: Source Document: Source Schema Proj Proj JR;2 Researcher RP;2 Paper View Schema JP;2 Paper PR;2 Researcher 22 View Transformation: Algorithm Data Structures Record Object label Object key (object identifier) ChildList: an array of children record references Tuple A tuple consists of an array of records A tuple has a root record Example Tuple Array Index: 0 Record:r1 (ChildList: <1,2>), ROOT 1 Record:r2 (ChildList: <3>) Array of Tuples 2 Record:r3 (ChildList: nil) 3 Corresponding Tree: Record: r4 (ChildList: nil) r1 r2 r3 r4 23 View Transformation: Algorithm Operations Structural Join (SJ) Input Output L3: Array of Tuples of type <A1,A2,A3,…, An,B1,B2,B3,…, Bn> For each tuple in L3, its objects whose types are specified in M MUST locate on the same path in the source document Merge (Merge) Input L1: Array of Tuples of type < A1,A2,A3,…, An> Mask M: Bit Array, which specifies a list of object classes Output L1: Array of Tuples of type <A1,A2,A3,…, An> L2: Array of Tuples of type <B1,B2,B3,…, Bn> Mask M: Bit Array, which specifies the object classes which participate in structural join L: Array of Tuples For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the two tuples will be merged. Duplicate-Preserving Merge (D-Merge) Input L1: Array of Tuples of type < A1,A2,A3,…, An> Mask M: Bit Array, which specifies a list of object classes Output L: Array of Tuples For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the contents of the two tuples will be merged. t1 and t2 will then both have the merged content. Note: Structural Join operation uses object number while Merge and D-Merge use object identifer. 24 View Transformation: Algorithm Definition 1:Relationship R1 is lower than R2, or R1 < R2, if the top-most participating object class of R1 is a descendent of the top-most participating object class of R2. Definition 2: A relationship is independent if its set of participating object classes is not included in any other relationship. Otherwise the relationship is nested. Algorithm (Main steps): Step 0. Initialize L1 := null, L2 := null, M := { }, //M contains the set of object classes that has been structurally joined L[O] : = null for each O in the view schema Step 1. Put the independent relationships of the ORA-SS view schema into a partially ordered set (S, < ) Step 2: While (S is not empty){ Extract the first relationship R from S; If (M∩R is not null) L1 := L[O] for the highest O(O has the smallest level in the view schema) in M∩R; Else L1 := null; For each object class O in (R – M) sorted decreasingly by their depths in the view schema M := M U {O}; L2 := Cluster[O]; //Cluster[O] is an array storing objects of class O if(L1 != null ) L1 := StructuralJoin(L1,L2,M∩R); else L1: = L2; L[O] = L1; If R is the highest level relationship L1 := Merge(L1, R.top_most_object_class); Else L1 := D-Merge(L1, R – {object classes of R which doesn’t appear in any other independent relationship}); } Step 3: return L[Root]; where Root is the root of the view schema; 25 View Transformation: Algorithm Example: View Schema Slide 8a Step 1: Construct Relationship Project-Paper Step 1.1. L := SJ (list(P), list(R), {P,R}) View Schema Source Doc: Proj JP;2 Paper PR;2 Step 1.2. L := D-Merge (L, {P}) Researcher S = {PR,JP}; 26 View Transformation: Algorithm Example: View Schema Slide 8a Step 2: Construct Relationship Project-Paper Step 2.1. L := SJ(L, list(J), {J,P}) View Schema Source Doc: Proj JP;2 Paper Step 2.2. L := Merge (L, {J} ) PR;2 Researcher S = {JP}; 27 Conclusion We demonstrate how to combine an efficient XML storage scheme (EBC) and an expressive XML data model (ORA-SS) to provide XML view support for a native DBMS system. ORA-SS, used as view schema definitions, can express a great variety of constraints graphically. More importantly, it can avoid ambiguity, which is a typical problem if by DTD or XML Schema is used as view schema format. In our transformation method, a relationship type is the basic unit of transformation. Both object label based structural join and logical key based value join are employed to construct view results. ORASS view schema information can guide correct view transformation. 28 Reference 1. 2. 3. 4. 5. 6. 7. Shurug Al-Khalifa, H. V. Jagadish, Nick Kouda, Jignesh M. Patel, Divesh Srivastava, YuqingWu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In Proceedings of ICDE, 2002 Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee: ORA-SS: An ObjectRelationship-Attribute Model for Semistructured Data TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000. eXcelon. An General XML Data Manager. http://www.exceloncorp.com/ H. V. Jagadish, Shurug AL-Khalifa, et al. TIMBER: A Native XML Database. Technical Report, University of Michigan,April 2002. Xiaofeng Meng, Daofeng Luo, Mong Li Lee, Jing An. OrientStore: A Schema Based Native XML Storage System. In Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003 J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J.Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, Vol.26(3):5466,September 1997. Lucian Popa, Mauricio A. Hern´andez ,Yannis Velegrakis , Ren´ee J. Miller, Felix Naumann, Howard Ho. Mapping XML and Relational Schemas with Clio. In ICDE 2002 Demo, 2002 29