Storing XML in ORDBMS
By Amine Kaddara
Supervisor: Dr Hachim Haddouti
I. Introduction
II. Object-Relational databases and XML
a. Object-Relational databases: Definition
b. Storing XML in ORDBMS: Motivation
III. Mapping XML to ORDBMS
a. Introduction
b. Mapping Schemas
i. Mapping DTDs to Object Schemas
ii. Mapping Object Schemas to Relational Database Schemas
c. Mapping Complex Content Models
i. Mapping Sequences and Choices
ii. Mapping Repeated Children
iii. Mapping Subgroups
iv. Mapping Single-Valued and Multi-Valued Attributes
v. Mapping ID/IDREF(S) Attributes
d. Generating Schema
i. Generating Relational Database Schema from DTDs
ii. Generating DTDs from Database Schema
IV. Related Technologies
a. Querying XML in ORDBMS
b. Java DOM
c. Java Data Objects
I. Introduction
This paper discusses a storage system for XML based on an object-relational DBMS. I first
introduce object-relational database technology and then explain the motivations for storing
XML in object-relational database management systems. Next, I analyze the mapping from XML
document structures (DTDs in this case) to an object schema, and from the object schema to the
object-relational database schema. Based on the DTD structure, I then show how the mapping is
applied to the different components of a DTD. The final part introduces the JDOM (Java DOM)
and JDO (Java Data Objects) APIs.
II. Object-Relational databases and XML
a. Object-Relational databases: Definition
Object-relational database management systems combine object and relational technology.
These systems put an object-oriented front end on a relational database (RDBMS). Programs
written in an object-oriented programming language interface with the database as if the data
were stored as objects. The system, however, converts these objects into tables, rows, and
columns, and then handles the data in the same way as a relational database does.
When data is retrieved, it must be reassembled from simple data into complex objects. The
main benefit of this type of database is that the software that converts data between the
RDBMS format and the object format is provided. Programmers therefore do not need to write
code to convert between the two formats, and database access is easy from an object-oriented
language.
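The conversion described above can be pictured with a minimal sketch. It only illustrates the idea of flattening an object into a row and reassembling it; the class Part, the column names PART_NUMBER and PRICE, and the use of a map as a stand-in for a table row are all assumptions made for this example, not the API of any particular ORDBMS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: class, column names, and the Map-as-row are assumptions.
class ObjectRowSketch {
    // A simple object as the application sees it.
    static final class Part {
        final String number;
        final double price;

        Part(String number, double price) {
            this.number = number;
            this.price = price;
        }
    }

    // Flatten the object into one row: column name -> column value.
    static Map<String, Object> toRow(Part part) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("PART_NUMBER", part.number);
        row.put("PRICE", part.price);
        return row;
    }

    // Reassemble the object from a row read back from the table.
    static Part fromRow(Map<String, Object> row) {
        return new Part((String) row.get("PART_NUMBER"), (Double) row.get("PRICE"));
    }

    public static void main(String[] args) {
        Part restored = fromRow(toRow(new Part("P-100", 9.5)));
        System.out.println(restored.number + " " + restored.price);
    }
}
```

In an object-relational system this pair of conversions is supplied by the product, which is the benefit the paragraph above points to.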
b. Storing XML in ORDBMS: Motivation
XML is widely expected to become the standard for structured documents on the Web. The
number of documents and applications that require the manipulation of large sets of data is
growing, so XML documents must be stored and managed efficiently. Several types of XML
storage systems exist, and most of them use relational DBMSs. Storing XML documents as
database records requires a specification of the mapping from the document structures to the
database schema. Most commercial DBMSs provide such specification languages, but these
languages are proprietary and limited to specifying a mapping to relational databases only.
Database vendors today offer hybrid systems that combine their relational DBMS and
object-relational technology in the same product. Another important aspect of ORDBMSs is that
they allow a more expressive type system, which fits the purpose of XML: user-defined tags
that represent real-world entities.
III. Mapping XML to ORDBMS
a. Introduction
The most important part of storing XML is how the XML model is mapped to the
object-relational database model. The object-relational mapping strategy models the data in
XML documents rather than the documents themselves. As a consequence, this kind of mapping is
better suited to data-centric documents. Another important characteristic of this mapping is
that it is bidirectional: it can be used to transfer data both from XML documents to the
database and from the database to XML documents. As a result, canonical mappings can be used
to build XML query languages over non-XML databases. The canonical mappings define virtual
XML documents that can be queried with a language such as XQuery. The mapping also allows
data binding, which is the marshalling and unmarshalling of data between XML documents and
objects.
b. Mapping Schemas
i. Mapping DTDs to Object Schemas
The mapping relies on the following conventions, which draw an analogy between XML data
types and the data types of an object-oriented programming language:
- Simple element types and attributes are mapped to scalar (single-valued) data types.
- Complex element types are mapped to classes, with each element type in the content model of
  the complex type mapped to a property of that class.
- References to complex element types are mapped to pointers/references to an object of the
  class to which the complex element type is mapped. The data type of each property is the
  data type to which the referenced element type is mapped.
- Attributes are mapped to properties, with the data type of the property determined from the
  data type of the attribute.
Example:
DTD                                    Classes
=========================              =============
<!ELEMENT A (B, C)>                    class A {
<!ELEMENT B (#PCDATA)>                     String b;
<!ATTLIST A                    ==>         C      c;
          F CDATA #REQUIRED>               String f;
                                       }
<!ELEMENT C (D, E)>                    class C {
<!ELEMENT D (#PCDATA)>         ==>         String d;
<!ELEMENT E (#PCDATA)>                     String e;
                                       }
- Simple element types B, D, and E and the attribute F are all mapped to Strings (other data
  types are possible if the programmer changes them explicitly or if an XML Schema is used).
- Complex element types A and C are mapped to classes A and C.
- The content models and attributes of A and C are mapped to properties of classes A and C.
- The reference to C in the content model of A is mapped to a property whose type is a
  pointer/reference to an object of class C, because element type C is mapped to class C.
Note: If an element type is referenced in two different content models, each reference must be
mapped separately.
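As an illustration of this first mapping step, the sketch below fills objects of classes A and C from a document that follows the example DTD. It uses the DOM parser that ships with the JDK; the class name DtdToObjectsSketch, the helper child, and the sample document are assumptions introduced for this example, and the code is one possible hand-written binding rather than a generated one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical hand-written binding for the example DTD (A, B, C, D, E, attribute F).
class DtdToObjectsSketch {
    static final class C { String d; String e; }
    static final class A { String b; C c; String f; }

    // Returns the first child element with the given name.
    static Element child(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                return (Element) n;
            }
        }
        throw new IllegalArgumentException("missing child element " + name);
    }

    // Unmarshals an <A> document into an A object that references a C object.
    static A unmarshal(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse document", ex);
        }
        A a = new A();
        a.b = child(root, "B").getTextContent();   // simple element -> scalar property
        a.f = root.getAttribute("F");              // attribute -> scalar property
        Element cElement = child(root, "C");       // complex element -> object reference
        a.c = new C();
        a.c.d = child(cElement, "D").getTextContent();
        a.c.e = child(cElement, "E").getTextContent();
        return a;
    }

    public static void main(String[] args) {
        A a = unmarshal("<A F=\"f1\"><B>b1</B><C><D>d1</D><E>e1</E></C></A>");
        System.out.println(a.b + " " + a.f + " " + a.c.d + " " + a.c.e);
    }
}
```

The sketch does not validate the document against the DTD; it only shows which part of the document ends up in which property.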
ii. Mapping Object Schemas to Relational Database Schemas
The second step of the object-relational mapping is to map the object schema to the database
schema. The mapping follows these rules:
- Classes are mapped to tables (known as class tables).
- Scalar properties are mapped to columns.
- Pointer/reference properties are mapped to primary key/foreign key relationships.
- If the relationship between the parent and child elements is one-to-one, the primary key can
  be in either table.
- If the relationship is one-to-many, the primary key must be on the "one" side of the
  relationship, regardless of whether this is the parent or the child.
- A primary key column can be created as part of the mapping.
- If a primary key column is created as part of the mapping, its value must be generated by
  the database.
Example:
Classes                        Tables
============                   =================
class A {                      Table A:
    String b;                      Column b
    C      c;          ==>         Column c_fk
    String f;                      Column f
}
class C {                      Table C:
    String d;          ==>         Column d
    String e;                      Column e
}                                  Column c_pk
The tables are joined by a primary key (C.c_pk) and a foreign key (A.c_fk).
Note: Names can be changed during the mapping. The DTD, the object schema, and the relational
schema can all use different names. For example:
DTD: <!ELEMENT Part (Number, Price)> => Class name: class PartClass
Class name: class PartClass => Table name: Table PRT
Also, the objects involved in the mapping are conceptual. That is, there is no need to
instantiate them when transferring data between an XML document and a relational database.
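The second mapping step can be sketched as a small generator that turns a class description into a table definition. Everything in it is an assumption made for illustration: the method createTable, the VARCHAR(255) and INTEGER column types, and the _pk/_fk naming, which simply follows the example tables. Scalar columns are emitted before foreign keys, so the column order may differ from the order of the properties in the class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical generator: column types and the _pk/_fk naming are assumptions.
class ClassToTableSketch {
    // Builds the CREATE TABLE statement for one class table.
    static String createTable(String table, boolean withPrimaryKey,
                              List<String> scalarProperties, List<String> referencedTables) {
        List<String> columns = new ArrayList<>();
        if (withPrimaryKey) {
            // A primary key column created as part of the mapping.
            columns.add(table.toLowerCase(Locale.ROOT) + "_pk INTEGER PRIMARY KEY");
        }
        for (String property : scalarProperties) {
            // Scalar property -> column.
            columns.add(property + " VARCHAR(255)");
        }
        for (String target : referencedTables) {
            // Pointer/reference property -> foreign key to the referenced class table.
            String prefix = target.toLowerCase(Locale.ROOT);
            columns.add(prefix + "_fk INTEGER REFERENCES " + target + "(" + prefix + "_pk)");
        }
        return "CREATE TABLE " + table + " (" + String.join(", ", columns) + ")";
    }

    public static void main(String[] args) {
        System.out.println(createTable("C", true,
                Arrays.asList("d", "e"), Collections.<String>emptyList()));
        System.out.println(createTable("A", false,
                Arrays.asList("b", "f"), Arrays.asList("C")));
    }
}
```

For classes A and C this prints one statement per class table, with A.c_fk referencing C.c_pk.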
c. Mapping Complex Content Models
i. Mapping Sequences and Choices
Each element type referenced in a sequence is mapped to a property, which is then mapped
either to a column or to a primary key/foreign key relationship. Each element type referenced
in a choice is mapped in the same way. The only difference is that the properties and columns
generated for a choice can be null.
Example:
class A {                            Table A
    String b; // Nullable                Column b    // Nullable
    C c;      // Nullable      ==>       Column c_fk // Nullable
}
ii. Mapping Repeated Children
Repeated children are mapped to multi-valued properties and then either to multiple columns in
a table or to a separate table, known as a property table.
- If a content model contains repeated references to an element type, the references are
  mapped to a single property, which is an array of known size.
- This property can then be mapped either to multiple columns in a table or to a property
  table.
- Children that are optional in their parent are mapped to nullable properties, then to
  nullable columns.
Example:
DTD                              Classes                 Tables
======================           ==============          ==============
<!ELEMENT E (K, K, K)>           class E {               Table E
<!ELEMENT K (#PCDATA)>    ==>        String[] k;   ==>       Column k1
                                 }                           Column k2
                                                             Column k3

<!ELEMENT A (B+, C?)>            class A {               Table A
<!ELEMENT B (#PCDATA)>    ==>        String[] b;   ==>       Column a_pk
<!ELEMENT C (#PCDATA)>               String c; // nullable   Column c // nullable
                                 }                       Table B
                                                             Column a_fk
                                                             Column b
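To make the property table concrete, the following sketch turns the multi-valued property b of one A object into rows of Table B, each carrying the parent's key. The Row class, the method toPropertyRows, and the integer key are assumptions for this example; rows are produced in document order, but no separate order column is stored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Row and toPropertyRows are illustrative names.
class PropertyTableSketch {
    // One row of property table B: (a_fk, b).
    static final class Row {
        final int aFk;
        final String b;

        Row(int aFk, String b) {
            this.aFk = aFk;
            this.b = b;
        }
    }

    // Each value of the multi-valued property becomes one row that points back to its parent.
    static List<Row> toPropertyRows(int aPk, String[] values) {
        List<Row> rows = new ArrayList<>();
        for (String value : values) {
            rows.add(new Row(aPk, value));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (Row row : toPropertyRows(1, new String[] {"first", "second"})) {
            System.out.println("a_fk=" + row.aFk + ", b=" + row.b);
        }
    }
}
```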
iii. Mapping Subgroups
- References in subgroups are mapped to properties of the parent class, then to columns in the
  class table.
Example:
<!ELEMENT A (B, (C | D))>          class A {                          Table A
<!ELEMENT B (#PCDATA)>      ==>        String b; // Not nullable  ==>     column b    // Not nullable
<!ELEMENT C (#PCDATA)>                 String c; // Nullable              column c    // Nullable
<!ELEMENT D (E, F)>                    D d;      // Nullable              column d_fk // Nullable
                                   }
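Because both branches of the choice become nullable properties, the table alone no longer records that exactly one of them must be present. A minimal sketch of a check that an application might run before storing an A object is shown below; the class ChoiceCheckSketch and the method matchesContentModel are assumptions for this example and are not part of the mapping itself.

```java
// Hypothetical check for the content model (B, (C | D)) of the example.
class ChoiceCheckSketch {
    static final class D { String e; String f; }
    static final class A { String b; String c; D d; }

    // B is required; exactly one of C and D must be present.
    static boolean matchesContentModel(A a) {
        boolean hasC = a.c != null;
        boolean hasD = a.d != null;
        return a.b != null && (hasC ^ hasD);
    }

    public static void main(String[] args) {
        A a = new A();
        a.b = "b1";
        a.c = "c1";
        System.out.println(matchesContentModel(a));
    }
}
```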
iv. Mapping Single-Valued and Multi-Valued Attributes
- Single-valued attributes (CDATA, ID, IDREF, NMTOKEN, ENTITY, NOTATION, and enumerated) are
  mapped to single-valued properties and then to columns.
- Multi-valued attributes (IDREFS, NMTOKENS, and ENTITIES) are mapped to multi-valued
  properties and then to property tables.
- The order in which attributes occur is not significant, but the order in which values occur
  in multi-valued attributes is considered significant.
Example:
DTD                                 Classes              Tables
============================        ============         ===========
<!ELEMENT A (B, C)>                 class A {            Table A
<!ATTLIST A                             String b;            Column b
          D CDATA #REQUIRED>  ==>       String c;    ==>     Column c
<!ELEMENT B (#PCDATA)>                  String d;            Column d
<!ELEMENT C (#PCDATA)>              }
and:
DTD                                 Classes              Tables
========================            ==============       ==============
<!ELEMENT A (B, C)>                 class A {            Table A
<!ATTLIST A                             String   b;          Column a_pk
          D IDREFS #IMPLIED>  ==>       String   c;  ==>     Column b
<!ELEMENT B (#PCDATA)>                  String[] d;          Column c
<!ELEMENT C (#PCDATA)>              }                    Table D
                                                             Column a_fk
                                                             Column d
v. Mapping ID/IDREF(S) Attributes
- ID/IDREF(S) attributes are mapped to primary key/foreign key relationships.
- IDs only need to be unique within a given XML document. Thus, if the data from more than one
  document is stored in the same table, there is no guarantee that the IDs will be unique. The
  solution is either to change the ID by prefixing it, or to map the attribute to two columns,
  one of which contains a value that is unique to each document and the other of which
  contains the ID.
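Both remedies can be sketched in a few lines. The names prefixedId and twoColumnKey, the document keys, and the ":" separator are assumptions chosen for this example; a real system would pick a separator that cannot occur in its document keys.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helpers: method names, document keys, and the ":" separator are assumptions.
class DocumentScopedIdSketch {
    // Option 1: one key column, with the document key prefixed to the ID.
    static String prefixedId(String documentKey, String xmlId) {
        return documentKey + ":" + xmlId;
    }

    // Option 2: two key columns, (document key, ID).
    static String[] twoColumnKey(String documentKey, String xmlId) {
        return new String[] {documentKey, xmlId};
    }

    public static void main(String[] args) {
        // The same ID "p1" appears in two documents; the prefixed keys stay distinct.
        Set<String> keys = new HashSet<>();
        keys.add(prefixedId("doc1", "p1"));
        keys.add(prefixedId("doc2", "p1"));
        System.out.println(keys.size());
    }
}
```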
d. Generating Schema
i. Generating Relational Database Schema from DTDs
Relational schemas are generated by reading through the DTD and processing each element type:
- Complex element types generate class tables with primary key columns.
- Simple element types are ignored except when processing content models.
To process a content model:
- Single references to simple element types generate columns; if the reference is optional
  (? operator), the column is nullable.
- Repeated references to simple element types generate property tables with foreign keys.
- References to complex element types generate foreign keys in remote class tables.
- PCDATA in mixed content generates a property table with a foreign key.
- Optionally, order columns are generated for all referenced element types and PCDATA.
To process attributes:
- Single-valued attributes generate columns; if the attribute is optional, the column is
  nullable.
- Multi-valued attributes generate property tables with foreign keys.
- If an attribute has a default, it is used as the column default.
Example:
DTD                                                        Tables
=================================================          =================
<!ELEMENT Order (OrderNum, Date, CustNum, Item*)>          Table Order
<!ELEMENT OrderNum (#PCDATA)>                                  Column OrderPK
<!ELEMENT Date (#PCDATA)>                          ==>         Column OrderNum
<!ELEMENT CustNum (#PCDATA)>                                   Column Date
                                                               Column CustNum
<!ELEMENT Item (ItemNum, Quantity, Part)>                  Table Item
<!ELEMENT ItemNum (#PCDATA)>                                   Column ItemPK
<!ELEMENT Quantity (#PCDATA)>                      ==>         Column ItemNum
                                                               Column Quantity
                                                               Column OrderFK
<!ELEMENT Part (PartNum, Price)>                           Table Part
<!ELEMENT PartNum (#PCDATA)>                                   Column PartPK
<!ELEMENT Price (#PCDATA)>                         ==>         Column PartNum
                                                               Column Price
                                                               Column ItemFK
- In the first step, we generate tables for complex element types and primary keys for these
  tables.
- In the second step, we generate columns for references to simple element types.
- In the final step, we generate foreign keys for references to complex element types.
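The three steps can be sketched as three passes over a simplified description of the DTD. The sketch assumes that the content models have already been parsed into a map from each complex element type to its child references, with the occurrence operator (?, *, +) kept as a suffix; this representation, the class DtdToSchemaSketch, and the PK/FK column naming (each foreign key is named after the parent table) are assumptions for this example. Nullability, mixed content, attributes, and order columns are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical three-pass generator over pre-parsed content models.
class DtdToSchemaSketch {
    // Complex element types of the Order DTD and their child references.
    static Map<String, List<String>> orderExample() {
        Map<String, List<String>> models = new LinkedHashMap<>();
        models.put("Order", Arrays.asList("OrderNum", "Date", "CustNum", "Item*"));
        models.put("Item", Arrays.asList("ItemNum", "Quantity", "Part"));
        models.put("Part", Arrays.asList("PartNum", "Price"));
        return models;
    }

    // Strips a trailing occurrence operator from a reference.
    static String name(String ref) {
        return ref.replaceAll("[?*+]$", "");
    }

    static boolean repeated(String ref) {
        return ref.endsWith("*") || ref.endsWith("+");
    }

    // Returns table name -> column names.
    static Map<String, List<String>> generate(Map<String, List<String>> models) {
        Map<String, List<String>> tables = new LinkedHashMap<>();
        // Step 1: one class table with a primary key per complex element type.
        for (String type : models.keySet()) {
            tables.put(type, new ArrayList<>(Arrays.asList(type + "PK")));
        }
        // Step 2: columns (or property tables) for references to simple element types.
        for (Map.Entry<String, List<String>> model : models.entrySet()) {
            for (String ref : model.getValue()) {
                String child = name(ref);
                if (models.containsKey(child)) {
                    continue; // complex element type, handled in step 3
                }
                if (repeated(ref)) {
                    tables.put(child, new ArrayList<>(Arrays.asList(model.getKey() + "FK", child)));
                } else {
                    tables.get(model.getKey()).add(child);
                }
            }
        }
        // Step 3: foreign keys in the remote class tables of referenced complex types.
        for (Map.Entry<String, List<String>> model : models.entrySet()) {
            for (String ref : model.getValue()) {
                String child = name(ref);
                if (models.containsKey(child)) {
                    tables.get(child).add(model.getKey() + "FK");
                }
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        System.out.println(generate(orderExample()));
    }
}
```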
ii. Generating DTDs from Database Schema
DTDs are generated by starting from a single "root" table or a set of root tables and
processing each one:
- Each root table generates an element type with element content in the form of a single
  sequence.
- Each data (non-key) column in the table generates an element type with PCDATA-only content
  and a reference in the sequence; nullable columns generate optional references.
Primary and foreign keys are processed as follows:
- The remote table is processed in the same manner as a root table.
- A reference to the element type for the remote table is added to the sequence.
- If the key is the primary key, the reference is optional and repeated (*), because there is
  no guarantee that a matching row exists in the foreign key table, nor that only one such
  row exists.
- If the key is the primary key, PCDATA-only element types are optionally generated for each
  column in the key. If these are generated, references to these element types are added to
  the sequence. This is useful only if primary keys contain data.
- If the key is a foreign key and is nullable, the reference is optional (?).
Example:
Tables                          DTD
==================              ===================================================
Table Orders                    <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
    Column OrderNum             <!ELEMENT OrderNum (#PCDATA)>
    Column Date                 <!ELEMENT Date (#PCDATA)>
    Column CustNum              <!ELEMENT CustNum (#PCDATA)>
Table Items                     <!ELEMENT Items (ItemNum, Quantity, Parts)>
    Column OrderNum      ==>    <!ELEMENT ItemNum (#PCDATA)>
    Column ItemNum              <!ELEMENT Quantity (#PCDATA)>
    Column Quantity             <!ELEMENT Parts (PartNum, Price)>
    Column PartNum              <!ELEMENT PartNum (#PCDATA)>
Table Parts                     <!ELEMENT Price (#PCDATA)>
    Column PartNum
    Column Price
- First, we generate an element type for the root table (Orders).
- Next, we generate PCDATA-only elements for the data columns (Date and CustNum) and add
  references to these elements to the content model of the Orders element.
- Then we generate a PCDATA-only element for the primary key (OrderNum) and add a reference
  to it to the content model.
- We then add an element for the table (Items) to which the primary key is exported, as well
  as a reference to it in the content model. The data and primary key columns in the remote
  (Items) table are processed in the same way.
- Finally, we process the foreign key table (Parts).
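The walk over the tables can be sketched as a short recursive procedure. The Table class below is an assumed, simplified description of the schema: for each table the caller lists the columns that should become PCDATA-only elements, the remote tables to which the primary key is exported (referenced with *), and the remote tables reached through a foreign key (referenced once). Foreign key columns themselves, nullable columns, and cycles between tables are not handled in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical recursive generator over a hand-built schema description.
class SchemaToDtdSketch {
    static final class Table {
        final String name;
        final List<String> columns;                                   // columns emitted as elements
        final List<Table> primaryKeyExportedTo = new ArrayList<>();   // referenced with *
        final List<Table> foreignKeyTargets = new ArrayList<>();      // referenced once

        Table(String name, String... columns) {
            this.name = name;
            this.columns = Arrays.asList(columns);
        }
    }

    // The Orders/Items/Parts schema of the example, with Orders as the root table.
    static Table ordersExample() {
        Table orders = new Table("Orders", "Date", "CustNum", "OrderNum");
        Table items = new Table("Items", "ItemNum", "Quantity");
        Table parts = new Table("Parts", "PartNum", "Price");
        orders.primaryKeyExportedTo.add(items);
        items.foreignKeyTargets.add(parts);
        return orders;
    }

    static List<String> generate(Table root) {
        List<String> dtd = new ArrayList<>();
        process(root, dtd);
        return dtd;
    }

    private static void process(Table table, List<String> dtd) {
        // Build the single sequence: columns first, then references to remote tables.
        List<String> sequence = new ArrayList<>(table.columns);
        for (Table remote : table.primaryKeyExportedTo) {
            sequence.add(remote.name + "*");
        }
        for (Table remote : table.foreignKeyTargets) {
            sequence.add(remote.name);
        }
        dtd.add("<!ELEMENT " + table.name + " (" + String.join(", ", sequence) + ")>");
        for (String column : table.columns) {
            dtd.add("<!ELEMENT " + column + " (#PCDATA)>");
        }
        // Remote tables are processed in the same manner as the root table.
        for (Table remote : table.primaryKeyExportedTo) {
            process(remote, dtd);
        }
        for (Table remote : table.foreignKeyTargets) {
            process(remote, dtd);
        }
    }

    public static void main(String[] args) {
        for (String line : generate(ordersExample())) {
            System.out.println(line);
        }
    }
}
```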
IV. Related Technologies
a. Querying XML in ORDBMS
XQuery can be used as a query language for object-relational databases. If an
object-relational mapping is used, hierarchies of tables are treated as a single document and
joins are specified in the mapping, so there is no need to specify the joins explicitly inside
the query. With XPath, an object-relational mapping must be used to query data over more than
one table. This is because XPath does not support joins across documents.
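As a small illustration of querying such a document with XPath, the sketch below evaluates a path expression with the javax.xml.xpath API that ships with the JDK. The document is a hand-written stand-in for the virtual document that a mapping over the Orders, Items, and Parts tables might expose; the class XPathQuerySketch and the sample data are assumptions for this example, and no database is involved.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch: the XML string stands in for a virtual document over joined tables.
class XPathQuerySketch {
    // Evaluates an XPath expression against a document and returns the string value.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException ex) {
            throw new IllegalArgumentException("could not evaluate " + expression, ex);
        }
    }

    public static void main(String[] args) {
        String order = "<Orders><OrderNum>1234</OrderNum>"
                + "<Items><ItemNum>1</ItemNum><Quantity>2</Quantity>"
                + "<Parts><PartNum>A-1</PartNum><Price>9.5</Price></Parts></Items>"
                + "</Orders>";
        // The path crosses what would be three tables without naming any join.
        System.out.println(evaluate(order, "/Orders/Items/Parts/Price"));
    }
}
```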
b. Java DOM
JDOM is an open-source, tree-based (DOM-like), pure Java API for parsing, creating,
manipulating, and serializing XML documents. It represents an XML document as a tree composed
of elements, attributes, comments, processing instructions, text nodes, CDATA sections, and so
on. It is written in and for Java: it consistently uses the Java coding conventions and the
Java class library, and its classes implement the Cloneable and Serializable interfaces.
A JDOM tree is fully read-write. All parts of the tree can be moved, deleted, and added to,
subject to the usual restrictions of XML. Unlike DOM, there are no read-only sections of the
tree that cannot be changed.
c. Java Data Objects
Sun's Java Data Objects (JDO) standard allows the persistence of Java objects into databases.
It supports transactions and multiple users, and it differs from JDBC in that you do not have
to think about SQL or database models. It differs from serialization in that it allows
multiple users and transactions. JDO lets Java developers use their object model as the data
model, so there is no need to spend time going between the "data" side and the "object" side.