Storing XML in ORDBMS

By Amine Kaddara
Supervisor: Dr Hachim Haddouti

I. Introduction
II. Object-Relational Databases and XML
   a. Object-Relational Databases: Definition
   b. Storing XML in ORDBMS: Motivation
III. Mapping XML to ORDBMS
   a. Introduction
   b. Mapping Schemas
      i. Mapping DTDs to Object Schemas
      ii. Mapping Object Schemas to Object-Relational Database Schemas
   c. Mapping Complex Content Models
      i. Mapping Sequences and Choices
      ii. Mapping Repeated Children
      iii. Mapping Subgroups
      iv. Mapping Single-Valued and Multi-Valued Attributes
      v. Mapping ID/IDREF(S) Attributes
   d. Generating Schema
      i. Generating Relational Database Schema from DTDs
      ii. Generating DTDs from Database Schema
IV. Related Technologies
   a. Querying XML in ORDBMS
   b. Java DOM
   c. Java Data Objects

I. Introduction

In this paper, I discuss a storage system for XML based on an object-relational DBMS. I first introduce object-relational database technology and then explain the motivations for storing XML in object-relational database management systems. To that end, I analyze the mapping from XML document structures (DTDs, in this case) to an object schema, and from the object schema to the object-relational database schema. Then, based on the DTD structure, I show how the mapping is carried out on the different components of a DTD. The last part introduces the JDOM (Java DOM) and JDO (Java Data Objects) APIs.

II. Object-Relational Databases and XML

a. Object-Relational Databases: Definition

Object-relational database management systems (ORDBMSs) combine object and relational technology. These systems put an object-oriented front end on a relational database (RDBMS). Programs written in an object-oriented programming language interface with the database as if the data were stored as objects. However, the system converts these objects into tables, rows, and columns.
It then handles the data in the same way as a relational database does. When the data is retrieved, it must be reassembled from simple data into complex objects. The main benefit of this type of database is that the software that converts data between the relational format and the object format is provided. Programmers therefore do not need to write code to convert between the two formats, and database access is easy from an object-oriented language.

b. Storing XML in ORDBMS: Motivation

It is widely accepted that XML will be the standard for documents carrying structural information on the Web. The number of documents and applications that require manipulation of large sets of data is growing, so efficient management and storage of these XML documents is required. Several types of XML storage systems exist, but most of them use relational DBMSs. Storing XML documents as database records requires a specification of the mapping from the document structures to the database schema. Most commercial DBMSs provide such specification languages, but the languages are proprietary and limited to specifying mappings to relational databases.

Database vendors today offer hybrid systems that combine their relational DBMS and object-relational technology in the same product. Another important aspect of ORDBMSs is that they allow a more expressive type system, which matches the purpose of XML: user-defined tags that represent real-world entities.

III. Mapping XML to ORDBMS

a. Introduction

The most important part of storing XML is how to map the XML model to the object-relational database model. The object-relational mapping strategy models the data in XML documents rather than the documents themselves. As a consequence, this kind of mapping is better suited to data-centric documents.
Another important characteristic of this mapping is that it is bidirectional: it can be used to transfer data both from XML documents to the database and from the database to XML documents. As a result, we can use canonical mappings, with which XML query languages can be built over non-XML databases. The canonical mappings define virtual XML documents that can be queried with a language such as XQuery. Another important feature of this mapping is that it allows data binding, that is, the marshalling and unmarshalling of data between XML documents and objects.

b. Mapping Schemas

i. Mapping DTDs to Object Schemas

The mapping relies on the following conventions, which draw an analogy between XML data types and the data types of an object-oriented programming language:

- Simple element types and attributes are mapped to scalar (single-valued) data types.
- Complex element types are mapped to classes, with each element type in the content model of the complex type mapped to a property of that class.
- References to complex element types are mapped to pointers/references to an object of the class to which the complex element type is mapped.
- The data type of each property is the data type to which the referenced element type is mapped.
- Attributes are mapped to properties, with the data type of the property determined from the data type of the attribute.

Example:

   DTD                                  Classes
   =========================            =============
   <!ELEMENT A (B, C)>                  class A {
   <!ELEMENT B (#PCDATA)>                  String b;
   <!ATTLIST A                  ==>        C c;
             F CDATA #REQUIRED>            String f;
                                        }

   <!ELEMENT C (D, E)>                  class C {
   <!ELEMENT D (#PCDATA)>       ==>        String d;
   <!ELEMENT E (#PCDATA)>                  String e;
                                        }

Simple element types B, D, and E and the attribute F are all mapped to Strings (other data types are possible if the programmer changes them explicitly or if an XML Schema is used). Complex element types A and C are mapped to classes A and C.
The content models and attributes of A and C are mapped to properties of classes A and C. The reference to C in the content model of A is mapped to a property whose type is a pointer/reference to an object of class C, because element type C is mapped to class C.

Note: if an element type is referenced in two different content models, each reference must be mapped separately.

ii. Mapping Object Schemas to Relational Database Schemas

The second step of the object-relational mapping is to map the object schema to the database schema. The mapping follows these rules:

- Classes are mapped to tables (known as class tables).
- Scalar properties are mapped to columns.
- Pointer/reference properties are mapped to primary key/foreign key relationships.
- If the relationship between the parent and child elements is one-to-one, the primary key can be in either table.
- If the relationship is one-to-many, the primary key must be on the "one" side of the relationship, regardless of whether this is the parent or the child.
- A primary key column can be created as part of the mapping.
- If a primary key column is created as part of the mapping, its value must be generated by the database.

Example:

   Classes                      Tables
   ============                 =================
   class A {                    Table A:
      String b;                    Column b
      C c;              ==>        Column c_fk
      String f;                    Column f
   }

   class C {                    Table C:
      String d;         ==>        Column d
      String e;                    Column e
   }                               Column c_pk

The tables are joined by a primary key (C.c_pk) and a foreign key (A.c_fk).

Note: names can be changed during the mapping. The DTD, the object schema, and the relational schema can all use different names. For example:

   DTD:     <!ELEMENT Part (Number, Price)>
   Class:   class PartClass
   Table:   Table PRT

Also, the objects involved in the mapping are conceptual. That is, there is no need to instantiate them when transferring data between an XML document and a relational database.

c. Mapping Complex Content Models

i. Mapping Sequences and Choices

Each element type referenced in a sequence is mapped to a property, which is then mapped either to a column or to a primary key/foreign key relationship. Each element type referenced in a choice is mapped in the same way. The only difference from the way sequences are mapped is that the properties and columns can be null, because only one member of the choice is present in a given document.

Example (choice):

   Classes                          Tables
   ========================         ==========================
   class A {                        Table A
      String b; // Nullable   ==>      Column b    // Nullable
      C c;      // Nullable            Column c_fk // Nullable
   }

ii. Mapping Repeated Children

Repeated children are mapped to multi-valued properties and then either to multiple columns in a table or to a separate table, known as a property table. If a content model contains repeated references to an element type, the references are mapped to a single property, which is an array of known size. This property can then be mapped either to multiple columns in a table or to a property table. Children that are optional in their parent are mapped to nullable properties and then to nullable columns.

Example:

   DTD                          Classes                         Tables
   ======================       ==========================      ======================
   <!ELEMENT E (K, K, K)>       class E {                       Table E
   <!ELEMENT K (#PCDATA)>  ==>     String[] k;             ==>     Column k1
                                }                                  Column k2
                                                                   Column k3

   <!ELEMENT A (B+, C?)>        class A {                       Table A
   <!ELEMENT B (#PCDATA)>  ==>     String[] b;             ==>     Column a_pk
   <!ELEMENT C (#PCDATA)>          String c;  // Nullable          Column c  // Nullable
                                }                               Table B
                                                                   Column a_fk
                                                                   Column b

iii. Mapping Subgroups

References in subgroups are mapped to properties of the parent class and then to columns in the class table.

Example:

   <!ELEMENT A (B, (C | D))>       class A {                           Table A
   <!ELEMENT B (#PCDATA)>             String b; // Not nullable           Column b    // Not nullable
   <!ELEMENT C (#PCDATA)>     ==>     String c; // Nullable       ==>     Column c    // Nullable
   <!ELEMENT D (E, F)>                D d;      // Nullable               Column d_fk // Nullable
                                   }

iv. Mapping Single-Valued and Multi-Valued Attributes

Single-valued attributes (CDATA, ID, IDREF, NMTOKEN, ENTITY, NOTATION, and enumerated) are mapped to single-valued properties and then to columns. Multi-valued attributes (IDREFS, NMTOKENS, and ENTITIES) are mapped to multi-valued properties and then to property tables. The order in which attributes occur is not significant, but the order in which values occur in a multi-valued attribute is considered significant.

Example:

   DTD                              Classes               Tables
   ============================     ============          ===========
   <!ELEMENT A (B, C)>              class A {             Table A
   <!ATTLIST A                         String b;             Column B
             D CDATA #REQUIRED> ==>    String c;     ==>     Column C
   <!ELEMENT B (#PCDATA)>              String d;             Column D
   <!ELEMENT C (#PCDATA)>           }

and:

   DTD                              Classes               Tables
   ============================     ==============        ==============
   <!ELEMENT A (B, C)>              class A {             Table A
   <!ATTLIST A                         String b;             Column a_pk
             D IDREFS #IMPLIED> ==>    String c;     ==>     Column b
   <!ELEMENT B (#PCDATA)>              String[] d;           Column c
   <!ELEMENT C (#PCDATA)>           }                     Table D
                                                             Column a_fk
                                                             Column d

v. Mapping ID/IDREF(S) Attributes

ID/IDREF(S) attributes are mapped to primary key/foreign key relationships. IDs only need to be unique inside a given XML document. Thus, if the data from more than one document is stored in the same table, there is no guarantee that the IDs will be unique. The solution is either to change the ID by prefixing it, or to map the attribute to two columns, one of which contains a value that is unique to each document and the other of which contains the ID.

d. Generating Schema

i. Generating Relational Database Schema from DTDs

Relational schemas are generated by reading through the DTD and processing each element type:

- Complex element types generate class tables with primary key columns.
- Simple element types are ignored except when processing content models.

To process a content model:

- Single references to simple element types generate columns; if the reference is optional (? operator), the column is nullable.
- Repeated references to simple element types generate property tables with foreign keys.
- References to complex element types generate foreign keys in remote class tables.
- PCDATA in mixed content generates a property table with a foreign key.
- Optionally, order columns are generated for all referenced element types and PCDATA.

To process attributes:

- Single-valued attributes generate columns; if the attribute is optional, the column is nullable.
- Multi-valued attributes generate property tables with foreign keys.
- If an attribute has a default, it is used as the column default.

Example:

   DTD                                                      Tables
   =================================================        =================
   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>        Table Order
   <!ELEMENT OrderNum (#PCDATA)>                               Column OrderPK
   <!ELEMENT Date (#PCDATA)>                                   Column OrderNum
   <!ELEMENT CustNum (#PCDATA)>                                Column Date
   <!ELEMENT Item (ItemNum, Quantity, Part)>                   Column CustNum
   <!ELEMENT ItemNum (#PCDATA)>                     ==>     Table Item
   <!ELEMENT Quantity (#PCDATA)>                               Column ItemPK
   <!ELEMENT Part (PartNum, Price)>                            Column ItemNum
   <!ELEMENT PartNum (#PCDATA)>                                Column Quantity
   <!ELEMENT Price (#PCDATA)>                                  Column OrderFK
                                                            Table Part
                                                               Column PartPK
                                                               Column PartNum
                                                               Column Price
                                                               Column ItemFK

In the first step, we generate tables for complex element types and primary keys for these tables. In the second step, we generate columns for references to simple element types. In the final step, we generate foreign keys for references to complex element types.

ii. Generating DTDs from Database Schema

DTDs are generated by starting from a single "root" table or a set of root tables and processing each as follows:

- Each root table generates an element type with element content in the form of a single sequence.
- Each data (non-key) column in the table generates an element type with PCDATA-only content and a reference in the sequence; nullable columns generate optional references.
- Primary and foreign keys are processed with the following steps:
   - The remote table is processed in the same manner as a root table.
   - A reference to the element type for the remote table is added to the sequence.
   - If the key is the primary key, the reference is optional and repeated (*). This is because there is no guarantee that a row will exist in the foreign key table, nor that only a single row will exist.
   - If the key is the primary key, PCDATA-only element types are optionally generated for each column in the key. If these are generated, references to these element types are added to the sequence. This is useful only if primary keys contain data.
   - If the key is a foreign key and is nullable, the reference is optional (?).

Example:

   Tables                       DTD
   ==================           ===================================================
   Table Orders                 <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
      Column OrderNum           <!ELEMENT OrderNum (#PCDATA)>
      Column Date               <!ELEMENT Date (#PCDATA)>
      Column CustNum            <!ELEMENT CustNum (#PCDATA)>
   Table Items                  <!ELEMENT Items (ItemNum, Quantity, Parts)>
      Column OrderNum    ==>    <!ELEMENT ItemNum (#PCDATA)>
      Column ItemNum            <!ELEMENT Quantity (#PCDATA)>
      Column Quantity           <!ELEMENT Parts (PartNum, Price)>
      Column PartNum            <!ELEMENT PartNum (#PCDATA)>
   Table Parts                  <!ELEMENT Price (#PCDATA)>
      Column PartNum
      Column Price

In the first step, we generate an element type for the root table (Orders). Next, we generate PCDATA-only elements for the data columns (Date and CustNum) and add references to these elements to the content model of the Orders element. Then we generate a PCDATA-only element for the primary key (OrderNum) and add a reference to it to the content model. We then add an element for the table (Items) to which the primary key is exported, as well as a reference to it in the content model, and process the data and primary key columns in the remote (Items) table in the same way. Finally, we process the foreign key table (Parts).

IV. Related Technologies

a. Querying XML in ORDBMS

XQuery can be used as a query language for object-relational databases. If an object-relational mapping is used, hierarchies of tables are treated as a single document and joins are specified in the mapping, so there is no need to specify the joins explicitly inside the query. With XPath, an object-relational mapping must be used to query data across more than one table, because XPath does not support joins across documents.

b. Java DOM

JDOM is an open source, tree-based (like DOM), pure Java API for parsing, creating, manipulating, and serializing XML documents. It represents an XML document as a tree composed of elements, attributes, comments, processing instructions, text nodes, CDATA sections, and so on. It is written in and for Java: it consistently uses the Java coding conventions and the class library, and it implements the Cloneable and Serializable interfaces.

A JDOM tree is fully read-write. All parts of the tree can be moved, deleted, and added to, subject to the usual restrictions of XML. Unlike DOM, there are no read-only sections of the tree that cannot be changed.

c. Java Data Objects

Sun's Java Data Objects (JDO) standard allows the persistence of Java objects into databases. It supports transactions and multiple users. It differs from JDBC in that you do not have to think about SQL or database models, and from serialization in that it allows multiple users and transactions. It allows Java developers to use their object model as a data model, so there is no need to spend time going between the "data" side and the "object" side.

References:

Storing and Querying XML Data in Object-Relational DBMSs, by Kanda Runapongsa and Jignesh M. Patel.
XML Content Management based on Object-Relational Database Technology, by B. Surjanto, N. Ritter, H. Loeser.
A New Inlining Algorithm for Mapping XML DTDs to Relational Schemas, by Shiyong Lu, Yezhou Sun, Mustafa Atay, Farshad Fotouhi.
http://www.comptechdoc.org/independent/database/basicdb/dataordbms.html
http://java.sun.com/products/jdo
http://www.jdom.org/docs/apidocs
www.rpbourret.com/xml/XMLDBLinks.htm
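
As an illustration of the schema-generation rules in Section III.d.i, the following is a minimal Java sketch, not a definitive implementation. It assumes a simplified in-memory model of a DTD in which each element type maps to the list of element types it references, and an empty list stands for (#PCDATA). The names DtdToTables, Ref, generate, and orderDtd are hypothetical and chosen only for this example. Naming a foreign key after the parent element type is also an assumption of the sketch. Nullable columns, attributes, mixed content, and order columns are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the table-generation rules; not taken from any library. */
final class DtdToTables {

    /** One reference in a content model; "Item*" becomes new Ref("Item", true). */
    static final class Ref {
        final String name;
        final boolean repeated;
        Ref(String name, boolean repeated) { this.name = name; this.repeated = repeated; }
    }

    /** An element type is complex when its content model references other element types. */
    static boolean isComplex(Map<String, List<Ref>> dtd, String name) {
        List<Ref> refs = dtd.get(name);
        return refs != null && !refs.isEmpty();
    }

    /** Returns table name -> ordered column names for the simplified DTD model. */
    static Map<String, List<String>> generate(Map<String, List<Ref>> dtd) {
        Map<String, List<String>> tables = new LinkedHashMap<>();

        // Step 1: each complex element type gets a class table with a primary key.
        for (String name : dtd.keySet()) {
            if (isComplex(dtd, name)) {
                List<String> cols = new ArrayList<>();
                cols.add(name + "PK");
                tables.put(name, cols);
            }
        }
        // Step 2: single references to simple element types become columns;
        // repeated ones become property tables with a foreign key to the parent.
        for (Map.Entry<String, List<Ref>> e : dtd.entrySet()) {
            for (Ref ref : e.getValue()) {
                if (isComplex(dtd, ref.name)) continue;
                if (ref.repeated) {
                    List<String> cols = new ArrayList<>();
                    cols.add(e.getKey() + "FK");
                    cols.add(ref.name);
                    tables.put(ref.name, cols);
                } else {
                    tables.get(e.getKey()).add(ref.name);
                }
            }
        }
        // Step 3: references to complex element types become foreign keys
        // in the remote (child) class table.
        for (Map.Entry<String, List<Ref>> e : dtd.entrySet()) {
            for (Ref ref : e.getValue()) {
                if (isComplex(dtd, ref.name)) {
                    tables.get(ref.name).add(e.getKey() + "FK");
                }
            }
        }
        return tables;
    }

    /** The Order DTD from Section III.d.i, written in the simplified model. */
    static Map<String, List<Ref>> orderDtd() {
        Map<String, List<Ref>> dtd = new LinkedHashMap<>();
        dtd.put("Order", List.of(new Ref("OrderNum", false), new Ref("Date", false),
                new Ref("CustNum", false), new Ref("Item", true)));
        dtd.put("Item", List.of(new Ref("ItemNum", false), new Ref("Quantity", false),
                new Ref("Part", false)));
        dtd.put("Part", List.of(new Ref("PartNum", false), new Ref("Price", false)));
        for (String simple : List.of("OrderNum", "Date", "CustNum", "ItemNum",
                "Quantity", "PartNum", "Price")) {
            dtd.put(simple, List.of()); // (#PCDATA): a simple element type
        }
        return dtd;
    }

    public static void main(String[] args) {
        generate(orderDtd()).forEach((table, cols) -> System.out.println(table + " " + cols));
    }
}
```

Running main prints one line per class table (Order, Item, Part) with its columns in the order the three steps produce them: primary key first, then the columns for simple element types, then the foreign key to the parent table. Keeping the three steps as separate passes mirrors the description in the text; a real generator would more likely work from a parsed DTD than from a hand-built map.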