On View Support for a Native XML DBMS Daofeng Luo, Xiaofeng Meng

On View Support for a
Native XML DBMS
Ting Chen , Tok Wang Ling
School of Computing, National University of Singapore
Daofeng Luo, Xiaofeng Meng
Information School, Renmin University of China
1
Outline
• View for XML Documents
    - Two Main Approaches
    - Problems
• ORA-SS: Object-Relationship-Attribute Model for Semi-structured Data
    - ORA-SS class diagram and instance diagram
    - ORA-SS for view schema definition
• Element-Based Clustering (EBC)
    - Basic Approach
    - ORA-SS and EBC
• XML View Transformation
    - Problem Definition
    - Algorithm
• Conclusion
2
View for XML Documents

Two main approaches to define views
• Define views in script languages like XQuery or XSLT
    - General, but demanding from the users' point of view because XQuery and XSLT scripts are complex
    - Difficult to optimize from a performance viewpoint
• Define views by schema mapping
    - E.g. Clio[7] and eXcelon[3]
    - A declarative approach; relieves users of writing complex scripts to perform view transformation
    - Schema mappings can then be translated into XQuery (or XSLT) scripts
We focus on the problem of view transformation through schema mapping
3
View for XML Documents

View for XML Documents via Schema Mapping
• Problem: Current XML schema formats are not able to express views with semantic constraints, resulting in ambiguity
• E.g. (Next Slide) The source XML file contains information about researchers working under different projects and the publication list for each researcher.
4
View for XML Documents
The view schema in Fig (c) of the above diagram is such an example. It has at
least two possible meanings which can lead to different view results!
1. For each project, list all the papers published by project members; for each
paper of the project, list all the authors of the paper.
2. For each project, list all the papers published by project members; for each
paper of the project, list all the authors of the paper who work for the project.
5
ORA-SS Data Model

ORA-SS[2]
• Object Class
• Relationship Type
• Attribute (object attribute or relationship attribute)
E.g. An ORA-SS Instance Diagram
* Compared with the XML document in Slide 5, two extra fields (attribute Date and sub-element Position) are added in the above ORA-SS instance diagram
6
ORA-SS Data Model

ORA-SS Schema Diagram
• There are two binary relationship types in the schema: Project-Researcher (JR) and Researcher-Paper (RP). The set of papers under a researcher doesn't depend on the project he/she works in.
• Position is an attribute of relationship type JR instead of Researcher. This means that a researcher may hold different positions across the projects he/she works in.
• Date is a single-valued attribute of object class Paper. Different occurrences of the same paper will always have the same Date value.
• J_Name, R_Name and P_ID are identifiers of object classes Project, Researcher and Paper respectively, as indicated by solid circles. Identifier values are used to tell whether two object occurrences are identical.
7
ORA-SS Data Model
ORA-SS for View Schema Definition

• ORA-SS is able to define views with different semantics:
    - View (a) has two binary relationship types. The intention of the view schema is to find all the papers published by researchers in a project, and for each paper to find all of its authors.
    - View (b) has only one ternary relationship type. The view is defined to find all the papers published by researchers in a project; however, for each paper View (b) only finds those authors working for the project.
[Figure: ORA-SS schema diagrams of View (a) and View (b)]
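The difference between the two readings can be checked on a small data set. The following Python sketch is only an illustration: the nested-dictionary layout, the function names `view_a` and `view_b`, and the sample data (including paper p3) are assumptions made for this example, not part of ORA-SS or of the system described in these slides.

```python
from typing import Dict, List

# Hypothetical source: project -> researcher -> papers of that researcher.
Source = Dict[str, Dict[str, List[str]]]

source: Source = {
    "j1": {"r1": ["p1"], "r2": ["p1", "p2"]},
    "j2": {"r2": ["p1", "p2"], "r3": ["p3"]},
}


def view_a(src: Source) -> Dict[str, Dict[str, List[str]]]:
    """Two binary relationships: every paper lists ALL of its authors."""
    authors: Dict[str, set] = {}
    for researchers in src.values():
        for researcher, papers in researchers.items():
            for paper in papers:
                authors.setdefault(paper, set()).add(researcher)
    return {
        project: {
            paper: sorted(authors[paper])
            for papers in researchers.values()
            for paper in papers
        }
        for project, researchers in src.items()
    }


def view_b(src: Source) -> Dict[str, Dict[str, List[str]]]:
    """One ternary relationship: only authors working for the project."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for project, researchers in src.items():
        per_paper: Dict[str, List[str]] = {}
        for researcher, papers in researchers.items():
            for paper in papers:
                per_paper.setdefault(paper, []).append(researcher)
        result[project] = {p: sorted(rs) for p, rs in per_paper.items()}
    return result


a = view_a(source)
b = view_b(source)
```

In this data r1 wrote p1 but works only for j1, so under j2 the paper p1 has authors r1 and r2 in `a` but only r2 in `b`; the two views agree on j1.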
8
ORA-SS Data Model
ORA-SS for View Schema Definition

The two view schemas in Slide 8 correspond to two different
XSLT scripts. The following is the XSLT script for schema (a):

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <root>
      <xsl:for-each-group select="root/Project" group-by="@J_Name">
        <Project>
          <J_Name><xsl:value-of select="@J_Name"/></J_Name>
          <xsl:for-each-group select="current-group()/Researcher/Paper" group-by="@P_Name">
            <Paper>
              <xsl:variable name="vPName" select="@P_Name"/>
              <P_Name><xsl:value-of select="@P_Name"/></P_Name>
              <!-- Search the whole document for every author of this paper -->
              <xsl:for-each-group select="/root/Project/Researcher[Paper/@P_Name = $vPName]" group-by="@R_Name">
                <Researcher>
                  <R_Name><xsl:value-of select="@R_Name"/></R_Name>
                </Researcher>
              </xsl:for-each-group>
            </Paper>
          </xsl:for-each-group>
        </Project>
      </xsl:for-each-group>
    </root>
  </xsl:template>
</xsl:stylesheet>
9
ORA-SS Data Model

The following is the XSLT script for schema (b):

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <root>
      <xsl:for-each-group select="root/Project" group-by="@J_Name">
        <Project>
          <J_Name><xsl:value-of select="@J_Name"/></J_Name>
          <xsl:for-each-group select="current-group()/Researcher/Paper" group-by="@P_Name">
            <Paper>
              <xsl:variable name="vPName" select="@P_Name"/>
              <P_Name><xsl:value-of select="@P_Name"/></P_Name>
              <!-- Only the researchers that own this paper within the current project -->
              <xsl:for-each-group select="current-group()/.." group-by="@R_Name">
                <Researcher>
                  <R_Name><xsl:value-of select="@R_Name"/></R_Name>
                </Researcher>
              </xsl:for-each-group>
            </Paper>
          </xsl:for-each-group>
        </Project>
      </xsl:for-each-group>
    </root>
  </xsl:template>
</xsl:stylesheet>

The main difference between the two scripts lies in the third xsl:for-each-group directive, the one
for Researcher. The script for Schema (a) needs to search the whole document to find the
complete author list of a paper, because the authors may not all work for the same project.
The script for Schema (b), on the other hand, avoids the global search because it only
needs to find the authors of the paper who work for the same project.
10
Element-Based-Clustering (EBC)

Element Based Clustering[5]
• Extension of the Element Based (EB[6]) strategy
• Element nodes (records) with the same tag name are clustered and organized as a list
[Figure: element nodes tagged A, B and C, clustered by tag name]
11
Element-Based-Clustering (EBC)

EBC: Node labeling
• EBC assigns a label to every node in an XML document
• Labels for nodes in an XML data tree can be calculated in the following manner:
    1. The root element has label nil
    2. Perform a pre-order traversal (i.e. document order) of the XML document. For node x:
       label(x) = label(x.parent) + “.” + position of x in x.parent’s childList
       (“+” here means string concatenation)
• Node A is an ancestor of node B if and only if Label(A) is a proper prefix of Label(B)
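The labeling rule can be sketched in a few lines of Python. This is a minimal illustration, not the storage system's code: the `Node` class, the function names and the sample tree are assumptions. Labels are kept as tuples of integers rather than dotted strings so that the prefix test compares whole components; on raw strings, "1.1" would wrongly look like a prefix of "1.10".

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Label = Tuple[int, ...]  # (1, 2, 3) stands for "1.2.3"; the empty tuple is nil


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def assign_labels(root: Node) -> List[Tuple[Label, Node]]:
    """Label every node in pre-order (document order)."""
    labelled: List[Tuple[Label, Node]] = []

    def visit(node: Node, label: Label) -> None:
        labelled.append((label, node))
        # A child's label is its parent's label plus its position in childList.
        for position, child in enumerate(node.children, start=1):
            visit(child, label + (position,))

    visit(root, ())  # rule 1: the root element has label nil
    return labelled


def is_ancestor(a: Label, b: Label) -> bool:
    """A is an ancestor of B exactly when label(A) is a proper prefix of label(B)."""
    return len(a) < len(b) and b[: len(a)] == a


def show(label: Label) -> str:
    """Render a label in the dotted form used on the slides."""
    return ".".join(str(part) for part in label) if label else "nil"


# Hypothetical document: a root with two Projects, the first with two Researchers.
doc = Node("root", [
    Node("Project", [Node("Researcher"), Node("Researcher")]),
    Node("Project"),
])
labels = [(show(label), node.tag) for label, node in assign_labels(doc)]
```

Because the traversal is pre-order, the returned list is already in document order, which is the order each cluster needs for structural joins.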
12
ORA-SS and EBC
• How does the ORA-SS schema help tune XML document storage?
[Figure: EBC clusters for the example document; each entry is shown as value(label)]
Project cluster:    j1(1), j2(2)
Researcher cluster: r1(1.1), r2(1.2), r2(2.1), r3(2.2)
Position cluster:   Leader(1.2.3), Staff(2.1.3), Leader(2.2.2)
Paper cluster:      p1,05/2002(1.1.1), p1,05/2002(1.2.1), p2,03/2000(1.2.2), p1,05/2002(2.1.1), p1,03/2000(2.1.2), p2,05/2002(2.2.1)
13
ORA-SS and EBC
1. The object identifier of an object is stored together with the object.
Objects of the same class form a cluster. Each cluster is a
sequential file.
2. Relationship attribute values are stored in a separate cluster.
3. For object attribute values, we need some heuristics.
It is tempting to store object attributes together with the object,
since they are likely to be accessed at the same time. However, if
an object class has too many object attributes, storing all attribute
values together with the object leads to a situation similar to the
Subtree-Based storage strategy. One solution is to store only the
essential object attributes (as determined by users) with the
object and leave the others in separate clusters.
4. Node labels are stored together with the nodes (a node can be an object,
a relationship attribute value or an object attribute value).
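The storage rules above can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: real clusters are sequential files, and the `build_clusters` function and the tuple layout are hypothetical. The sample entries are the ones listed for project j1 on Slide 13, with the Date attribute kept inside the Paper record and the relationship attribute Position in its own cluster.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Label = Tuple[int, ...]
Entry = Tuple[Label, str]  # (node label, stored value)

# (label, cluster name, stored value), listed in document order.
nodes: List[Tuple[Label, str, str]] = [
    ((1,), "Project", "j1"),
    ((1, 1), "Researcher", "r1"),
    ((1, 1, 1), "Paper", "p1,05/2002"),
    ((1, 2), "Researcher", "r2"),
    ((1, 2, 1), "Paper", "p1,05/2002"),
    ((1, 2, 2), "Paper", "p2,03/2000"),
    ((1, 2, 3), "Position", "Leader"),
]


def build_clusters(doc_nodes: List[Tuple[Label, str, str]]) -> Dict[str, List[Entry]]:
    """Group nodes by class; each cluster keeps document order."""
    clusters: Dict[str, List[Entry]] = defaultdict(list)
    for label, cluster_name, value in doc_nodes:
        # Rule 4: the label is stored together with the node.
        clusters[cluster_name].append((label, value))
    return dict(clusters)


clusters = build_clusters(nodes)
```

Since the input is scanned in document order, every cluster comes out sorted on its labels without an extra sort.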
14
View Transformation: Problem Definition

Problem Definition
Given an XML document D1, a source schema V1 and a valid
view schema V2 of V1, transform D1 into a document D2 such that D2
is a valid document under V2.
[Figure: View Transformation takes the Source Document, the Source Schema, the View Schema and the User Defined Schema Mapping, and produces the View Document]
15
View Transformation
• A view schema defined in DTD can have different interpretations, which
result in different views
[Figure: the ambiguous view schema Proj / Paper / Researcher can be read either as Slide 8(a), with two binary relationship types (JP;2 and PR;2), or as Slide 8(b), with one ternary relationship type (JPR;3)]
16
View Transformation

View schema expressed in ORA-SS is unambiguous
• The relationship set of an ORA-SS view schema clearly defines how the view document should be constructed
• Two basic techniques are used to construct a single relationship in the view schema:
    - Structural join: SJ (based on object labels)
      For each relationship R in the view schema, we first use structural join to find the set of paths of type R such that the object occurrences in each path lie on the same path in the source document.
    - Value join: Merge (based on logical object identifiers)
      In an XML document an object can have many occurrences. Two occurrences are considered the same if they have identical object identifiers. The set of paths resulting from the structural join is then value joined (or merged) using logical object keys.
17
View Transformation

What makes things complicated?
• A path in the view schema can have more than one relationship, and we need to join two relationships together. E.g. the view schema in Slide 8(a) contains two relationships.
• Two relationships are “joined” on their overlapping object classes, so the results constructed for one relationship may be used in the construction of another.
• Value join (merge) based on logical keys destroys the “sorted-ness” of the output path list of structural join. The consequence is that the output of value join can’t be used efficiently in a subsequent structural join with paths from other relationships. (Structural join requires sorted input lists.)
• Solution: Duplicate-Preserving Merge (D-Merge). It keeps the structural join output path list intact. For two occurrences of the same object in the list, their child contents are merged and then each of them gets its own copy of the merged content.
• D-Merge is used only when a relationship needs to join with another relationship.
18
View Transformation: Algorithm


The relationship set of a view schema determines the
view transformation process.
Take the two view schemas in Slide 8 as an example:
View (a):
1. L := SJ(list(R), list(P), {P,R})
2. L := D-Merge(L, {P})
3. L := SJ(L, list(J), {J,P});
4. L := Merge(L,{J});
View (b):
1. L := SJ(list(J), list(P), {J,P})
2. L := SJ(L, list(R), {J,P,R});
3. L := Merge(L,{J});
19
View Transformation: Algorithm

Structural Join
• Based on object labels
• Binary Structural Join[1]
    - Input: two sorted (on node numbers) node lists: AList of potential ancestor nodes and DList of potential descendant nodes
    - Output: OutputList = [(ai, dj)] of join results, in which ai is the parent/ancestor of dj, ai is from AList and dj is from DList
• The EBC scheme stores elements with the same tag name in pre-order (i.e. sorted on element number), so structural join can be applied naturally
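A binary structural join over two sorted label lists can be sketched as a single stack-based pass. This is an illustration in the spirit of the structural-join literature, not necessarily the exact algorithm of [1]; the function names are assumptions, and labels are tuples of integers so that tuple ordering coincides with document order. The sample labels are the Researcher and Paper labels from Slide 13.

```python
from typing import List, Tuple

Label = Tuple[int, ...]


def is_ancestor(a: Label, b: Label) -> bool:
    """True when a is a proper prefix of b."""
    return len(a) < len(b) and b[: len(a)] == a


def structural_join(alist: List[Label], dlist: List[Label]) -> List[Tuple[Label, Label]]:
    """Return all (a, d) with a an ancestor of d.

    Both inputs must be sorted in document order. The stack always holds a
    chain of nested ancestor candidates, so each list is scanned only once.
    """
    output: List[Tuple[Label, Label]] = []
    stack: List[Label] = []
    i = 0
    for d in dlist:
        # Push every ancestor candidate that precedes d in document order.
        while i < len(alist) and alist[i] < d:
            while stack and not is_ancestor(stack[-1], alist[i]):
                stack.pop()
            stack.append(alist[i])
            i += 1
        # Candidates whose subtree has ended can never match again.
        while stack and not is_ancestor(stack[-1], d):
            stack.pop()
        output.extend((a, d) for a in stack)
    return output


researchers = [(1, 1), (1, 2), (2, 1), (2, 2)]
papers = [(1, 1, 1), (1, 2, 1), (1, 2, 2), (2, 1, 1), (2, 1, 2), (2, 2, 1)]
pairs = structural_join(researchers, papers)
```

Each paper is paired with the one researcher it sits under; joining the Project labels `[(1,), (2,)]` with the same papers pairs every paper with its project.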
20
View Transformation: Algorithm

Structural Join:
• Based on object labels
• Complex Structural Join (example: structural join of three sorted input lists)
    - Input: sorted node lists A, B, C
    - Output: OutputList = [(ai, bj, ck)] of join results, in which ai, bj, ck (from lists A, B, C respectively) are located on the same path in the source document
• Two binary joins:
    - Step 1: Join A and B
      OutputList AB: [(ai, bj)] sorted on ai
    - Step 2: Join AB (using ai as node label) and C
      OutputList = [(ai, bj, ck)] sorted on ai
    - Important: ck should be on the same path as both ai AND bj
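The "Important" remark can be made concrete with a small Python sketch. It uses a deliberately naive nested-loop join rather than a sorted merge, purely to show why ck must be checked against both ai and bj; the function names and the `check_all` switch are assumptions introduced for this example.

```python
from typing import List, Tuple

Label = Tuple[int, ...]
Path = Tuple[Label, ...]


def on_same_path(x: Label, y: Label) -> bool:
    """True when one label is a proper prefix of the other."""
    shorter, longer = sorted((x, y), key=len)
    return len(shorter) < len(longer) and longer[: len(shorter)] == shorter


def join(paths: List[Path], nodes: List[Label], check_all: bool = True) -> List[Path]:
    """Extend each path with every node that lies on the same path.

    With check_all=False only the first label (ai) is tested, which is the
    mistake the slide warns about.
    """
    output: List[Path] = []
    for path in paths:  # iterating in input order keeps the result sorted on ai
        for node in nodes:
            targets = path if check_all else path[:1]
            if all(on_same_path(label, node) for label in targets):
                output.append(path + (node,))
    return output


A = [(1,)]                    # one Project
B = [(1, 1), (1, 2)]          # two Researchers under it
C = [(1, 1, 1), (1, 2, 1)]    # one Paper under each Researcher

ab = join([(a,) for a in A], B)          # Step 1: join A and B
abc = join(ab, C)                        # Step 2: ck checked against ai AND bj
loose = join(ab, C, check_all=False)     # ck checked against ai only
```

Here `abc` holds the two genuine paths, while `loose` holds four triples, two of which pair a researcher with a paper from a different branch.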
21
View Transformation: Algorithm

A motivating example:
[Figure: the source document, shown with the two schemas below]
Source Schema: Proj, Researcher (JR;2), Paper (RP;2)
View Schema: Proj, Paper (JP;2), Researcher (PR;2)
22
View Transformation: Algorithm

Data Structures
• Record
    - Object label
    - Object key (object identifier)
    - ChildList: an array of children record references
• Tuple
    - A tuple consists of an array of records
    - A tuple has a root record
Example (one tuple from an array of tuples)
Tuple Array Index   Record       ChildList
0                   r1 (ROOT)    <1,2>
1                   r2           <3>
2                   r3           nil
3                   r4           nil
Corresponding Tree: r1 has children r2 and r3; r2 has child r4
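The record and tuple structures could be modelled roughly as below. This is a sketch under assumptions: the class and field names are hypothetical, ChildList entries are taken to be indexes into the tuple's record array (as in the example), and labels are left unset because the example does not give them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Record:
    key: str                         # object key (object identifier)
    label: Optional[str] = None      # object label; not given in the example
    child_list: List[int] = field(default_factory=list)  # indexes of child records


@dataclass
class RecordTuple:
    records: List[Record]
    root: int = 0                    # index of the root record

    def tree(self, index: Optional[int] = None) -> Tuple[str, list]:
        """Rebuild the nested tree (key, [children]) starting at a record."""
        record = self.records[self.root if index is None else index]
        return (record.key, [self.tree(child) for child in record.child_list])


# The example from the slide: r1 -> (r2 -> r4), r3
example = RecordTuple(records=[
    Record("r1", child_list=[1, 2]),
    Record("r2", child_list=[3]),
    Record("r3"),
    Record("r4"),
])
nested = example.tree()
```

Storing children as array indexes keeps a tuple self-contained, so a whole tuple can be copied or merged without touching other tuples.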
23
View Transformation: Algorithm

Operations
• Structural Join (SJ)
    - Input
        L1: Array of Tuples of type <A1,A2,A3,…,An>
        L2: Array of Tuples of type <B1,B2,B3,…,Bn>
        Mask M: Bit Array, which specifies the object classes that participate in the structural join
    - Output
        L3: Array of Tuples of type <A1,A2,A3,…,An,B1,B2,B3,…,Bn>
        For each tuple in L3, the objects whose types are specified in M MUST lie on the same path in the source document.
• Merge (Merge)
    - Input
        L1: Array of Tuples of type <A1,A2,A3,…,An>
        Mask M: Bit Array, which specifies a list of object classes
    - Output
        L: Array of Tuples
        Any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M are merged.
• Duplicate-Preserving Merge (D-Merge)
    - Input
        L1: Array of Tuples of type <A1,A2,A3,…,An>
        Mask M: Bit Array, which specifies a list of object classes
    - Output
        L: Array of Tuples
        For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the contents of the two tuples are merged. t1 and t2 then both have the merged content.
Note: The Structural Join operation uses object numbers, while Merge and D-Merge use object identifiers.
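The contrast between Merge and D-Merge can be sketched on the Paper occurrences of Slide 13. This is a simplified illustration, not the system's implementation: each row stands for one tuple whose masked class is Paper, the row layout and function names are assumptions, and the mask is implicit.

```python
from typing import Dict, List, Tuple

# (paper label, paper identifier, researcher identifiers attached to it)
Row = Tuple[str, str, List[str]]

rows: List[Row] = [
    ("1.1.1", "p1", ["r1"]),
    ("1.2.1", "p1", ["r2"]),
    ("1.2.2", "p2", ["r2"]),
    ("2.1.1", "p1", ["r2"]),
]


def contents_by_key(tuples: List[Row]) -> Dict[str, List[str]]:
    """Union the child contents of all occurrences sharing an identifier."""
    combined: Dict[str, List[str]] = {}
    for _, key, children in tuples:
        bucket = combined.setdefault(key, [])
        for child in children:
            if child not in bucket:
                bucket.append(child)
    return combined


def merge(tuples: List[Row]) -> List[Tuple[str, List[str]]]:
    """Merge: one entry per identifier; occurrences and their labels are lost."""
    return list(contents_by_key(tuples).items())


def d_merge(tuples: List[Row]) -> List[Row]:
    """D-Merge: every occurrence survives, each with its own merged copy."""
    combined = contents_by_key(tuples)
    return [(label, key, list(combined[key])) for label, key, _ in tuples]


merged = merge(rows)
preserved = d_merge(rows)
```

`preserved` has the same length and label order as `rows`, so it can still feed a later structural join, whereas `merged` is keyed by identifier only.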
24
View Transformation: Algorithm



Definition 1: Relationship R1 is lower than R2, or R1 < R2, if the top-most participating object class of R1
is a descendant of the top-most participating object class of R2.
Definition 2: A relationship is independent if its set of participating object classes is not included in any
other relationship. Otherwise the relationship is nested.
Algorithm (main steps):
Step 0. Initialize L1 := null, L2 := null,
        M := { },   // M contains the set of object classes that have been structurally joined
        L[O] := null for each object class O in the view schema
Step 1. Put the independent relationships of the ORA-SS view schema into a partially ordered set (S, <)
Step 2. While (S is not empty) {
            Extract the first relationship R from S;
            If (M∩R is not empty)
                L1 := L[O] for the highest O (O has the smallest level in the view schema) in M∩R;
            Else
                L1 := null;
            For each object class O in (R – M), sorted decreasingly by depth in the view schema {
                M := M ∪ {O};
                L2 := Cluster[O];   // Cluster[O] is an array storing objects of class O
                If (L1 != null)
                    L1 := StructuralJoin(L1, L2, M∩R);
                Else
                    L1 := L2;
                L[O] := L1;
            }
            If R is the highest level relationship
                L1 := Merge(L1, R.top_most_object_class);
            Else
                L1 := D-Merge(L1, R – {object classes of R which do not appear in
                              any other independent relationship});
        }
Step 3. Return L[Root], where Root is the root of the view schema.
25
View Transformation: Algorithm

Example: View Schema Slide 8a
S = {PR,JP};
Step 1: Construct Relationship Paper-Researcher
Step 1.1. L := SJ (list(P), list(R), {P,R})
Step 1.2. L := D-Merge (L, {P})
[Figure: view schema Proj, Paper (JP;2), Researcher (PR;2), shown next to the source document]
26
View Transformation: Algorithm

Example: View Schema Slide 8a
S = {JP};
Step 2: Construct Relationship Project-Paper
Step 2.1. L := SJ(L, list(J), {J,P})
Step 2.2. L := Merge (L, {J} )
[Figure: view schema Proj, Paper (JP;2), Researcher (PR;2), shown next to the source document]
27
Conclusion

We demonstrate how to combine an efficient XML storage
scheme (EBC) and an expressive XML data model (ORA-SS) to
provide XML view support for a native XML DBMS.

ORA-SS, used as the view schema definition language, can express a great
variety of constraints graphically. More importantly, it avoids the
ambiguity that typically arises when DTD or XML Schema
is used as the view schema format.

In our transformation method, a relationship type is the basic unit
of transformation. Both object-label-based structural join and
logical-key-based value join are employed to construct view
results. ORA-SS view schema information guides correct view
transformation.
28
Reference
1. Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In Proceedings of ICDE, 2002.
2. Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000.
3. eXcelon. An General XML Data Manager. http://www.exceloncorp.com/
4. H. V. Jagadish, Shurug Al-Khalifa, et al. TIMBER: A Native XML Database. Technical Report, University of Michigan, April 2002.
5. Xiaofeng Meng, Daofeng Luo, Mong Li Lee, Jing An. OrientStore: A Schema Based Native XML Storage System. In Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.
6. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, Vol. 26(3):54-66, September 1997.
7. Lucian Popa, Mauricio A. Hernández, Yannis Velegrakis, Renée J. Miller, Felix Naumann, Howard Ho. Mapping XML and Relational Schemas with Clio. In ICDE 2002 Demo, 2002.
29