Querying XML Data in DB2

Native XML Support in
DB2 Universal Database
Matthias Nicola, Bert van der Linden
IBM Silicon Valley Lab
Presented by Mo Liu, Frate, Joseph and John Russo
Some material in the talk is adapted from
the slides of this paper’s conference talk.
Agenda
• What is DB2 9 (Viper)?
• Native XML in the forthcoming version of DB2
• Native XML Storage
• XML Schema Support
• XML indexes
• Querying XML data in DB2
• Summary
What is DB2 9 (Viper)?
IBM DB2 9 is the next-generation hybrid data server with optimized management of both XML and relational data.
IBM extended DB2 to include:
• New storage techniques for efficient management of hierarchical structures inherent in XML documents
• New indexing technology
• New query language support (for XQuery), a new graphical query builder (for XQuery), and new query optimization techniques
• New support for validating XML data based on user-supplied schemas
• New administrative capabilities, including extensions to key database utilities
• Integration with popular application programming interfaces (APIs)
XML Databases
XML-enabled Databases
• The core data model is not XML (but e.g. relational)
• Mapping between XML data model and DB’s data model is required, or XML is stored as text
• E.g.: DB2 XML Extender v8
Native XML Databases
• Use the hierarchical XML data model to store and process XML internally
• No mapping, no storage as text
• Storage format = processing format
• E.g.: Forthcoming version of DB2
XML in Relational Databases –
Today's Challenge
XML must be force-fit into the relational data model. There are two choices:
1. Shredding or decomposing
− Mapping from XML to relational is often too complex
− Loses hierarchical dependencies
− Loses digital signature
− Often requires dozens or hundreds of tables
− Difficult to change original XML document
2. Large Object (BLOB, CLOB, Varchar)
Allows fast insert and retrieval of full documents, but requires XML parsing at query execution time.
− Search performance is slow (must parse at search time)
− Retrieval of sub-documents is expensive
− Update inside the document is slow
− Indexing is inefficient (based on relative position)
− Difficult to join with relational data
− Costs get worse as document size increases
DB2 Hybrid XML Engine - Overview
Integration of XML & Relational
Capabilities in DB2
• Native XML data type (not Varchar, not CLOB, not object-relational)
• XML capabilities in all DB2 components
• Applications combine XML & relational data

Integrating XML and Relational in DB2
DB2 Hybrid XML Engine - Interfaces
Data Definition
create table dept(deptID char(8), deptdoc xml);
Insert
insert into dept(deptID, deptdoc) values (?,?)
Index
create index xmlindex1 on dept(deptdoc)
generate key using xmlpattern '/dept/name' as sql varchar(30);
Retrieve
select deptdoc from dept where deptID = ?
SQL based Query
select deptID, xmlquery('$d/dept/name' passing deptdoc as "d") from dept where
deptID <> 'PR27';
XQuery based Query
for $book in db2-fn:xmlcolumn('BOOKS')/book
for $entry in db2-fn:xmlcolumn('REVIEWS')/entry
where $book/title = $entry/title
return <review> {$entry/review/text()} </review>;
Native XML Storage
Efficient Document Tree Storage
Information for Every Node
• Tag name, encoded as unique StringID
• A nodeID
• Node kind (e.g. element, attribute, etc.)
• Namespace / Namespace prefix
• Type annotation
• Pointer to parent
• Array of child pointers
• Hints to the kind & name of child nodes (for early-out navigation)
• For text/attribute nodes: the data itself
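The per-node information listed above can be pictured with a small in-memory model. This is only an illustrative sketch, not DB2's on-disk layout: the class names StringTable and Node, the field names, and the "#text" label are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


class StringTable:
    """Maps each distinct tag name to a small integer (a stand-in for StringID)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def intern(self, name: str) -> int:
        # Assign the next free integer the first time a name is seen.
        return self._ids.setdefault(name, len(self._ids))


@dataclass(eq=False, repr=False)
class Node:
    string_id: int                      # tag name, encoded as an integer
    kind: str                           # "element", "attribute" or "text"
    parent: Optional["Node"] = None     # pointer to parent
    children: List["Node"] = field(default_factory=list)  # child pointers
    data: Optional[str] = None          # only set for text/attribute nodes

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def child_name_hints(self) -> Set[int]:
        # Hint for early-out navigation: which tag ids occur directly below.
        return {c.string_id for c in self.children}


tags = StringTable()
dept = Node(tags.intern("dept"), "element")
emp = dept.add(Node(tags.intern("employee"), "element"))
name = emp.add(Node(tags.intern("name"), "element"))
name.add(Node(tags.intern("#text"), "text", data="John Doe"))

# A navigation step looking for <phone> can skip this <employee> subtree
# without visiting its children, because the hint says no such child exists.
can_skip = tags.intern("phone") not in emp.child_name_hints()
```

Storing integers instead of repeated tag strings keeps node records small, and the child hints let a path step rule out a subtree early.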
XML Node Storage Layout
XML Storage: “Regions Index”
XML Indexes in DB2
• Need index support to manage millions of XML documents
• Path-specific value indexes on XML columns to index frequently used elements and attributes
• XML-aware full-text indexing

XML Value Indexes
• Table DEPT has two fields: “id” and “deptdoc”
• Field “deptdoc” is an XML document:
<dept>
<employee id="901">
<name>John Doe</name>
<phone>408 555 1212</phone>
<office>344</office>
</employee>
</dept>
CREATE INDEX idx1 ON DEPT(deptdoc) GENERATE KEY USING
XMLPATTERN '/dept/employee/name' AS SQL VARCHAR(35)
• Creates an XML value index on employee name for all documents
XML Value Indexes (continued)
• “xmlpattern” identifies the XML nodes to be indexed
• Subset of the XPath language
− Wildcards and namespaces allowed
− XPath predicates such as /a/b[c=5] not supported
• “AS SQL” is necessary to define the data type, since DB2 does not require a single XML schema for all documents in a table (so DB2 may not know which data type to use for the index)
XML Value Indexes: Data Types
Allowed data types for indexes:
• VARCHAR(n)
• VARCHAR HASHED
• DOUBLE
• DATE
• TIMESTAMP
DB2 index manager enhanced to handle special XML types (e.g., +0, -0, +INF, -INF, NaN)
XML Value Indexes (continued)
• If a node's value cannot be cast to the index type:
− No error is raised
− No index entry is created for that node
• A single document (e.g., the XML field of a single record) may contribute 0, 1, or multiple index entries
− Different from a relational index
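The two rules above (a failed cast is silently skipped, and one document may yield zero, one, or many entries) can be illustrated with a short sketch. The function index_entries and the ElementTree-style relative path are assumptions for illustration and simplify the XMLPATTERN clause; DOUBLE is modelled with Python's float.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def index_entries(rid: int, doc: str, path: str) -> List[Tuple[float, int]]:
    """Return (value, rid) entries for nodes matching `path`, cast to DOUBLE.

    `path` is relative to the document's root element (ElementTree syntax).
    """
    entries = []
    for node in ET.fromstring(doc).findall(path):
        try:
            entries.append((float(node.text), rid))
        except (TypeError, ValueError):
            # Value does not cast to the index type: no error, no entry.
            continue
    return entries


doc_two = ("<dept><employee><office>344</office></employee>"
           "<employee><office>12</office></employee></dept>")
doc_bad = "<dept><employee><office>N/A</office></employee></dept>"
doc_none = "<dept/>"

two = index_entries(1, doc_two, "employee/office")    # two entries, one row
bad = index_entries(2, doc_bad, "employee/office")    # cast fails, no entry
none = index_entries(3, doc_none, "employee/office")  # no matching node
```

A relational index always has exactly one entry per row; here the number of entries depends on how many matching, castable nodes each document contains.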
XML Value Indexes: unique indexes
• Unique indexes are enforced within a document, and across all documents
• Example of unique index on employee id:
CREATE UNIQUE INDEX idx2 ON DEPT(deptdoc) GENERATE KEY
USING XMLPATTERN '/dept/employee/@id' AS SQL DOUBLE

XML Value Indexes: multiple elements
or attributes
• Can create indexes on multiple elements or attributes
• Example: create index on all text nodes:
CREATE INDEX idx3 ON DEPT(deptdoc)
GENERATE KEY USING XMLPATTERN '//text()' AS SQL VARCHAR HASHED
• Example: create index on all attributes:
CREATE INDEX idx4 ON DEPT(deptdoc)
GENERATE KEY USING XMLPATTERN '//@*' AS SQL DOUBLE
XML Value Indexes: namespaces
• Can index in a particular namespace
• XMLPATTERN can contain namespace declarations and prefixes
• Example:
CREATE INDEX idx5 ON DEPT(deptdoc)
GENERATE KEY USING XMLPATTERN
'declare namespace m="http://www.me.com/"; /m:dept/m:employee/m:name'
AS SQL VARCHAR(45)
XML Value Indexes: internal
• For each XML document, each unique path is mapped to an integer PathID (like StringID for tags)
• Each index entry includes:
− PathID to identify the path of the indexed node
− Value of the node cast to the index type
− RowID to identify rows containing the matching documents
− NodeID to identify matching nodes and regions within the documents
XML Value Indexes: atomic vs. non-atomic
• A node is atomic:
− if it is an attribute, or
− if it is a text node, or
− if it is an element that has no child elements and exactly one text node child
• Indexes are typically defined for atomic nodes
• Possible to define an index on non-atomic nodes, e.g. index on '/dept/employee'

XML Value Indexes: atomic vs. non-atomic
• '/dept/employee' is non-atomic since it has child elements
• Single index entry for the whole “employee” element, built from the concatenation of all text nodes under “employee”
• Can be useful for mixed content in text-oriented XML, e.g.:
<title>The benefits of <bold>XML</bold></title>
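As a rough illustration of the concatenation rule, the following sketch computes the value that would be indexed for the mixed-content title element, and checks the element form of the atomic-node definition. The helper is_atomic is a hypothetical name, and ElementTree is used only as a convenient stand-in for the XML store.

```python
import xml.etree.ElementTree as ET


def is_atomic(elem: ET.Element) -> bool:
    # Element with no child elements and a single text node child.
    return len(elem) == 0 and bool(elem.text)


title = ET.fromstring("<title>The benefits of <bold>XML</bold></title>")

# Concatenate every text node below the element, in document order.
index_value = "".join(title.itertext())

title_is_atomic = is_atomic(title)      # has a <bold> child element
bold_is_atomic = is_atomic(title[0])    # only a text node child
```

Here index_value is the string "The benefits of XML", which is what a search on the title text would expect to match even though the element itself is non-atomic.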
XML Full Text Indexes
• Allows full-text search of XML columns
• Can be fully indexed or partially indexed
• Example of full index:
CREATE INDEX myIndex FOR TEXT ON DEPT(deptdoc)
FORMAT XML CONNECT TO PERSONNELDB
• Example query:
SELECT deptdoc FROM dept WHERE
CONTAINS(deptdoc, 'SECTIONS("/dept/comment") "Brazil"') = 1
Internal index structure
System RX: One Part Relational, One Part XML
Kevin Beyer, Roberta J Cochrane,
Vanja Josifovski, Jim Kleewein, George Lapis,
Guy Lohman, Bob Lyle, Fatma Özcan,
Hamid Pirahesh, Normen Seemann,
Tuong Truong, Bert Van der Linden, Brian Vickery,
Chun Zhang
Internal index structure
• XML index implemented with two B+ trees
− Path Index
− Value Index
Internal index structure: Path Index
• Path Index maps a reverse path (revPath) to a generated path identifier (pathId)
• A “reverse path” is a list of node labels from leaf to root
− Compressed into a vector of label identifiers
• Analogous to the COLUMNS catalog of a relational database
• Used for efficient processing of descendant queries
− Example: “//name” query
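Why a reverse path helps with descendant queries can be seen in a small sketch: once paths are stored leaf-first, a query such as //name becomes a prefix match on the key. The PathIndex class, the label encoding, and the pathId numbering below are assumptions for illustration. A dictionary with a linear scan stands in for the B+ tree, where the matching keys would sit next to each other.

```python
from typing import Dict, List, Tuple

# Label dictionary: each node label becomes a small integer (assumed encoding).
LABELS: Dict[str, int] = {}


def rev_path(path: str) -> Tuple[int, ...]:
    """Encode '/book/author/name' as label ids ordered from leaf to root."""
    steps = [s for s in path.split("/") if s]
    return tuple(LABELS.setdefault(s, len(LABELS)) for s in reversed(steps))


class PathIndex:
    def __init__(self) -> None:
        self.path_ids: Dict[Tuple[int, ...], int] = {}

    def add(self, path: str) -> int:
        # Generate a new pathId the first time a distinct path is seen.
        return self.path_ids.setdefault(rev_path(path), len(self.path_ids))

    def descendant_matches(self, suffix: str) -> List[int]:
        """pathIds for '//suffix': the reversed suffix is a prefix of revPath."""
        key = rev_path(suffix)
        return sorted(pid for rp, pid in self.path_ids.items()
                      if rp[:len(key)] == key)


index = PathIndex()
for p in ["/book/title", "/book/author/name", "/book/editor/name", "/book/author"]:
    index.add(p)

name_paths = index.descendant_matches("name")          # both .../name paths
author_name = index.descendant_matches("author/name")  # only /book/author/name
```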
Internal index structure: Value
Index
• Value Index used to represent nodes
• Consists of the following key:
− pathId
− value
− nodeId
− rid
Internal index structure: Value
Index
• “value” is the representation of the node’s data value when cast to the index’s data type
• “rid” identifies the row in the table (used for locking)
• “nodeId” identifies a node within the document
− uses a Dewey node identifier
− can provide quick access to a node in the XML store
• “pathId” is used to answer queries on a specific path
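A Dewey node identifier labels each node with the sequence of child positions on the way down from the root. The sketch below is a simplified illustration over elements only; the function names and the choice to start numbering at 1 are assumptions, not DB2's actual encoding.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

Dewey = Tuple[int, ...]


def assign_dewey(root: ET.Element) -> Dict[Dewey, ET.Element]:
    """Label each element with the child positions on the path from the root."""
    labels: Dict[Dewey, ET.Element] = {}

    def visit(elem: ET.Element, label: Dewey) -> None:
        labels[label] = elem
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def is_ancestor(a: Dewey, b: Dewey) -> bool:
    # a is an ancestor of b exactly when a is a proper prefix of b.
    return len(a) < len(b) and b[:len(a)] == a


doc = ET.fromstring(
    "<dept><employee><name>John Doe</name>"
    "<phone>408 555 1212</phone></employee></dept>"
)
nodes = assign_dewey(doc)
```

Two properties make such identifiers useful in an index key: sorting them gives document order, and ancestor or descendant relationships can be decided by comparing the identifiers alone, without touching the stored document.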
Internal index structure: Tradeoffs
of Value Index key fields
• Order of keys is a tradeoff
• pathId first allows quick retrieval for queries on a specific path
− e.g., an index on //name might match many paths
− a query on /book/author/name still has consecutive index entries
− but a query like //name='Maggie' will need to examine a separate location in the index for each matching path
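The tradeoff can be made concrete with a sorted list standing in for the B+ tree. The sample keys, the pathId assignments, and the probe helper are invented for this sketch; only the key order (pathId, value, nodeId, rid) comes from the slides.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Index keys sorted as (pathId, value, nodeId, rid). The pathIds are made up:
# 1 = /book/author/name, 2 = /book/editor/name
keys: List[Tuple[int, str, str, int]] = sorted([
    (1, "Alice", "1.1.1", 10),
    (1, "Maggie", "1.1.1", 11),
    (1, "Maggie", "1.2.1", 12),
    (2, "Bob", "1.3.1", 10),
    (2, "Maggie", "1.3.1", 13),
])


def probe(path_id: int, value: str) -> List[int]:
    """Return rids for one (pathId, value) pair using a single range scan."""
    lo = bisect_left(keys, (path_id, value))
    hi = bisect_right(keys, (path_id, value, chr(0x10FFFF)))
    return [rid for (_, _, _, rid) in keys[lo:hi]]


# /book/author/name = 'Maggie': one pathId, one contiguous range.
specific = probe(1, "Maggie")

# //name = 'Maggie': one separate probe per matching pathId.
descendant = [rid for pid in (1, 2) for rid in probe(pid, "Maggie")]
```

With value first instead of pathId, the descendant query would be a single range, but every query on a specific path would have to filter out entries from other paths that share the value.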
XML Schema Support
• Optional XML Schema validation
− Insert, Update, Query
• Limited support for DTDs and external entities
• Type annotation produced by validation is persisted with the document (used in query execution)
• Conforms to XML Query standard, XML Schema standard, XML standard

XML Schema Support
• Register XML Schemas and DTDs in the DB
• DB then stores type-annotated documents on disk, and compiles execution plans with references to the XML Schemas
• Schemas are stored in the DB itself, for performance
− XML Schema Repository (XSR)
XML Schema Support: XSR
• XSR consists of several new database catalog tables, holding:
− Original XML schema documents for each XML schema
− Binary representation of the schema for fast reference
XML Schema Support: Registration
Example:
REGISTER XMLSCHEMA http://my.dept.com FROM dept.xsd AS
departments.deptschema complete
• Schema URI is http://my.dept.com
• File with schema document is “dept.xsd”
• Schema identifier in DB is “deptschema”
• Belongs to relational DB schema “departments”

XML Schema Support: Validation
• “XMLVALIDATE” function to validate documents in SQL statements
• Schema for validation
− is specified explicitly, or
− can be deduced from the schemaLocation hints in the instance documents
• Referenced by schema URI or by identifier
XML Schema Support: Validation
• Example (explicit by URI):
INSERT INTO DEPT(deptdoc)
VALUES xmlvalidate(? according to
xmlschema uri 'http://my.dept.com')
• Example (explicit by ID):
INSERT INTO DEPT(deptdoc)
VALUES xmlvalidate(? according to
xmlschema id departments.deptschema)

XML Schema Support: Validation
• Example (implicit)
INSERT INTO dept(deptdoc) VALUES
xmlvalidate(?)
• DB2 tries to deduce the schema from the input document
• It then tries to find that schema in the repository

XML Schema Support: First repository
design principle

• Repository will not
− require users to modify a schema before it is registered
− require users to modify XML documents before they are inserted and validated
• Once a document is validated in the DB, it will never require updates to remain valid
− Considered infeasible to bulk-update all existing documents to become valid
XML Schema Support: Second repository
design principle
• Enable schema evolution
− Sequence of changes in an XML schema over its lifetime
− New or evolving business needs
• How to accomplish schema evolution is much-debated
− no standards
− business demands require it, so constrain the problem
XML Schema Support: Second repository
design principle
• Flexibility of the schema repository is of “paramount importance”
• DB2’s schema repository does not require the namespace or the schema URI of each registered schema to be unique (user does not have control)
• Database-specific schema identifier must be unique (user does have control)

XML Schema Support: Second repository
design principle



• Built-in support for one very simple type of schema evolution
• If the new schema is backwards-compatible with the old schema, then the old schema can be replaced with the new schema in the schema repository
• DB2 verifies that all possible elements and attributes in the old schema have the same named types in the new schema
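The compatibility check described above can be sketched over a drastically simplified view of a schema. Representing a schema as a flat mapping from declared element or attribute names to named types is an assumption made for this example, as is the function name. The real check works on the registered schema documents.

```python
from typing import Dict

# Hypothetical flattened view of a schema: declared name -> named type.
Schema = Dict[str, str]


def is_backward_compatible(old: Schema, new: Schema) -> bool:
    """Every element/attribute of the old schema keeps its named type."""
    return all(new.get(name) == type_name for name, type_name in old.items())


v1 = {"dept/name": "xs:string", "dept/employee/@id": "xs:integer"}
v2 = {**v1, "dept/employee/email": "xs:string"}   # only adds a declaration
v3 = {**v1, "dept/employee/@id": "xs:string"}     # changes an existing type

can_replace_with_v2 = is_backward_compatible(v1, v2)
can_replace_with_v3 = is_backward_compatible(v1, v3)
```

Under this rule documents already validated against the old schema stay valid, which is what the first repository design principle requires.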
Querying XML Data in DB2

• Options Supported
− XQuery/XPath as a stand-alone language
− SQL embedded in XQuery
− XQuery/XPath embedded in SQL/XML
− Plain SQL for full-document retrieval
• DB2 treats SQL and XQuery as primary query languages.
− Both will operate independently on their data models
− Can also be integrated
Sample Tables
create table ship (
shipNo varchar(5) primary key not null,
capacity decimal(7,2),
class int,
purchDate date,
maintenance xml
)
create table captain (
captID varchar(5) primary key not null,
lname varchar(20),
fname varchar(20),
DOB date,
contact xml
)
Notice the xml datatype of the maintenance and contact columns.
Sample XML Data
Ship.maintenance
<mrecord>
<log>
<mntid>2353</mntid>
<shipno>39</shipno>
<vendorid>2345</vendorid>
<captid>9875</captid>
<maintdate>01/10/2007</maintdate>
<service>Removed rust on hull </service>
<resolution>complete</resolution>
<cost>13450.96</cost>
<nextservice>01/10/2008</nextservice>
</log>
<log>
<mntid>1254</mntid>
<shipno>39</shipno>
<vendorid>1253</vendorid>
<captid>9234</captid>
<maintdate>09/20/2005</maintdate>
<service>Replace rudder</service>
<resolution>complete</resolution>
<cost>34532.21</cost>
<nextservice>NA</nextservice>
</log>
</mrecord>
Sample XML Data
Captain.contactinfo
<contactinfo>
<Address>
<street>234 Rolling Lane</street>
<city>Rockport</city>
<state>MA</state>
<zipcode>01210</zipcode>
</Address>
<phone>
<work>9783412321</work>
<home>9722342134</home>
<cell>9782452343</cell>
<satellite>2023051243</satellite>
</phone>
<email>love2fish@finmail.com</email>
</contactinfo>
Standalone XQuery in DB2
for $s in db2-fn:xmlcolumn('ship.maintenance')
let $ml := $s//log
where $ml/cost > 10000
order by $ml/shipno
return <MaintenanceLog>
{$ml/shipno,$ml}
</MaintenanceLog>
db2-fn:xmlcolumn returns the sequence of all documents in the XML column.
SQL Embedded in XQuery
for $m in db2-fn:sqlquery('select maintenance from ship where class = 1')
let $ml := $m//log
order by $ml/shipno
return
<maintenanceLog>
{$ml}
</maintenanceLog>
This will return the maintenance documents for all class 1 ships.
Select Statement using XML
Column
select shipno, class, maintenance
from ship
where class = 1
• This will produce the maintenance document for each ship that is class 1.
• We can also create views this way

SQL/XML Queries

• Restricting results using XML element values
select captid, lname, fname from captain
where xmlexists('$c/contactinfo/Address[state="MA"]'
passing captain.contact as "c")
• This will return the captid, lname and fname of all captains who live in Massachusetts
SQL/XML Queries

• Projecting XML element values
• Two functions: XMLQuery and XMLTable
− XMLQuery retrieves value for 1 element
− XMLTable retrieves value for multiple elements
• XMLQuery example:
select xmlquery('$c/contactinfo/email'
passing contact as "c")
from captain
where state = 'MA'
• This will return email addresses for all captains in Massachusetts
SQL/XML Queries
XMLQuery (Continued)
• We could also look for only the first email for each captain by changing the first line:
select xmlquery('$c/contactinfo/email[1]' …
• Similarly, we could use xmlexists to qualify:
select xmlquery('$c/contactinfo/email'
passing contact as "c")
from captain
where state = 'MA'
and xmlexists('$c/contactinfo/email'
passing contact as "c")
SQL/XML Queries
XMLTable
• XMLTable retrieves XML elements
• Elements are mapped into result set columns
• Maps XML data as relational data

SQL/XML Queries
XMLTable Example
select s.shipNo, sm.mid, sm.vid, sm.md, sm.cost
from ship s,
xmltable('$c/mrecord/log' passing s.maintenance as "c"
columns mid varchar(4) path 'mntid',
vid varchar(4) path 'vendorid',
md date path 'maintdate',
cost decimal(7,2) path 'cost') as sm
This will produce a list of maintenance logs for all ships, one row per log element.
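To make the XMLTable mapping more tangible, the sketch below mimics it outside the database: the row-generating expression selects each log element, and each column path is evaluated relative to that element. This is a Python emulation for illustration only. The function log_rows is a hypothetical name, and the sample document is a trimmed version of the Ship.maintenance data.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal
from typing import List, Tuple

maintenance = """
<mrecord>
  <log><mntid>2353</mntid><vendorid>2345</vendorid><cost>13450.96</cost></log>
  <log><mntid>1254</mntid><vendorid>1253</vendorid><cost>34532.21</cost></log>
</mrecord>
""".strip()


def log_rows(ship_no: str, doc: str) -> List[Tuple[str, str, str, Decimal]]:
    """One output row per /mrecord/log element, one column per child path."""
    rows = []
    for log in ET.fromstring(doc).findall("log"):   # row-generating expression
        rows.append((
            ship_no,                                 # relational column
            log.findtext("mntid"),                   # column path 'mntid'
            log.findtext("vendorid"),                # column path 'vendorid'
            Decimal(log.findtext("cost")),           # column path 'cost'
        ))
    return rows


rows = log_rows("39", maintenance)
```

One ship row with a two-entry maintenance document yields two result rows, which is the same fan-out the SQL query produces when it joins ship with the XMLTable result.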
Joining XML and Relational Data
select c.captid, c.lname, c.fname
from captain c, ship s
where xmlexists('$s/mrecord/log[captid=$c]'
passing s.maintenance as "s", c.captid as "c")
If the captain was the captain of any ship when it underwent maintenance, he or she will be listed.
Using FLWR Expressions in
SQL/XML
select captid,
xmlquery('for $c in $cn/contactinfo
let $x := $c//city
return $x' passing contact as "cn")
from captain
where class = 1
Returns captid as well as city information
XMLElement
XMLElement allows you to publish relational data as XML
select xmlelement(name "captain",
xmlelement(name "captid", captid),
xmlelement(name "lname", lname),
xmlelement(name "fname", fname),
xmlelement(name "class", class))
from captain
where class <= 2
XMLElement
Output from previous command
<captain>
<captid>3563</captid>
<lname>Smith</lname>
<fname>John</fname>
<class>2</class>
</captain>
…
Aggregating and Grouping Data
select xmlelement(name "captainlist",
xmlagg(xmlelement(name "captain",
xmlforest(captid as "captid", lname as "lname", fname as "fname", class as "class"))
order by captid))
from captain
group by class
This query produces one captainlist element per class (three here), each with a number of captains.
Updating and Deleting XML Data

• Updates
− Use the XMLParse function. You must supply the entire XML document for the column being updated. If you supply only the one element you want to change, the rest of the data will be lost.
• Deletion
− Same as standard SQL
− Can also use xmlexists to use XML as a qualifier
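Because an update replaces the whole column value, an application that wants to change a single element typically reads the document, modifies it in memory, and writes the complete document back. A minimal sketch of the in-memory step follows. The helper with_new_email and the sample address are hypothetical, and the database round trip is left out.

```python
import xml.etree.ElementTree as ET


def with_new_email(contact_xml: str, email: str) -> str:
    """Return the complete contact document with only <email> changed."""
    root = ET.fromstring(contact_xml)
    root.find("email").text = email
    return ET.tostring(root, encoding="unicode")


old_doc = ("<contactinfo><phone><cell>9782452343</cell></phone>"
           "<email>love2fish@finmail.com</email></contactinfo>")

# The full document string, not just the <email> fragment, is what would be
# passed to XMLParse in the UPDATE statement.
new_doc = with_new_email(old_doc, "captain@example.com")
```

Passing only the changed email element instead of new_doc would overwrite the column with that fragment and drop the phone data.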
Query Execution Plans
• Separate parsers for SQL and XQuery statements
• Integrated query compiler for both languages
• QGMX is an internal query graph model
• Query execution plans contain special operators for navigation (XSCAN), XML index access (XISCAN) and joins over XML indexes (XANDOR)
Source: [2]
Query Run-time Evaluation

3 major components added for processing queries over XML:
• XML Navigation
• XML Index Runtime
• XQuery Function Library
Summary
• Problems with CLOB and shredded XML storage
• Native XML support in DB2 offers:
− Hierarchical and parsed representation
− Path-specific XML indexing
− New XML join and query methods
− Integration of SQL and XQuery
References
[1] Nicola, M. and van der Linden, B. 2005. Native XML support in DB2 universal
database. In Proceedings of the 31st international Conference on Very Large Data
Bases (Trondheim, Norway, August 30 - September 02, 2005). Very Large Data
Bases. VLDB Endowment, 1164-1174.
[2] Beyer, K., Cochrane, R. J., Josifovski, V., Kleewein, J., Lapis, G., Lohman, G., Lyle,
B., Özcan, F., Pirahesh, H., Seemann, N., Truong, T., Van der Linden, B., Vickery, B.,
and Zhang, C. 2005. System RX: one part relational, one part XML. In Proceedings of
the 2005 ACM SIGMOD international Conference on Management of Data
(Baltimore, Maryland, June 14 - 16, 2005). SIGMOD '05. ACM Press, New York, NY,
347-358.
[3] http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0603saracco2/