XML + Databases = ? Mike Carey (DIMACS Workshop, 3/2000)

XML + Databases = ?
(DIMACS Workshop, 3/2000)
Mike Carey
Exploratory Database Systems Department
IBM Almaden Research Center
carey@almaden.ibm.com
Plan for Today’s Talk

Thoughts on DB and web technologies
– The web and web “querying”
– Semistructured databases
– Object-relational databases
– XML and databases

XML/DB research at IBM Almaden
– The XPERANTO project
• Motivation and approach
• Whirlwind tour of the system
The Web is Great at Supporting URL-Based Sharing


Ex: Online conference proceedings
Web browsers have given us
– Universal file access (ftp++)
– Universal document access (html)
– Universal service access (forms)

What more could we navigational
couch potatoes possibly want?
– Universal platform for e-shopping!
The Web is Lousy at Supporting Parametric Searches

Ex: Find all the used Musicman Sterling bass
guitars currently available for under $750
within a 50-mile radius of my San Jose home

This is hard for a number of reasons
– Data buried in web pages, news groups,
classified ads, store sites, auction sites, …
– No schema (no metal fish, please!)
– No data types (miles, US$, instruments)
– No regularity within/across (good!) sites
Aren’t We Supposed to be the Experts on Data Management?

The DB community brought the world
– Data models, schemas, and views
– Query languages, optimizers, fast joins
– Scalable parallel servers
– Federated database systems

What do we have in our bag of tricks?
– Semistructured databases
– Object-relational database systems
Is Semistructured Database Technology the Answer?

Database characteristics
– Collections of [name, value] pairs or
maybe [name, type, value] triples
– Collections typically set<any> or list<any>

System characteristics
– “Typeloose” query languages
– Indexes for nested, typeloose structures
– Appropriate query processing techniques
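
A minimal Python sketch of these characteristics, assuming a semistructured collection is modeled as nested lists of (name, value) pairs. The `find` helper and the sample listings are hypothetical illustrations, not part of any particular semistructured system.

```python
# Each object is a list of (name, value) pairs; a value is an atom or a nested object.
# Objects in the same collection need not share names or value types.
listings = [
    [("item", "bass guitar"), ("brand", "Musicman"), ("price", 700)],
    [("item", "bass guitar"), ("price", "$650 obo"),
     ("seller", [("name", "Pat"), ("city", "San Jose")])],
    [("title", "Amp for sale"), ("cost", 300)],
]

def find(obj, name):
    """Typeloose lookup: yield every value stored under `name`, at any depth."""
    for key, value in obj:
        if key == name:
            yield value
        if isinstance(value, list):      # nested object: keep searching inside it
            yield from find(value, name)

prices = [v for obj in listings for v in find(obj, "price")]
cities = [v for obj in listings for v in find(obj, "city")]
```

The query succeeds without any schema, but note that `prices` mixes an integer with a free-text string, which is the kind of looseness the following slides weigh.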
Are Semistructured Databases the Answer? (2)

No, because schemas are critical for
– Data readers
• What info is in a given collection?
• Thus, what queries might make sense?
– Data writers
• What should I call this piece of info?
• Is it okay to put this kind of data here?
– Efficient/effective query processors
• Indexing, statistics, ... (e.g., range queries)
• Integration mappings (e.g., unit conversions)
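
To make the last two points concrete, here is a small Python sketch, assuming a hypothetical source that reports prices as text and distances in kilometres. The `Listing` type, the `from_source` mapping, and the sample values are illustrative assumptions only.

```python
from dataclasses import dataclass

KM_PER_MILE = 1.609344

@dataclass
class Listing:
    price_usd: float
    distance_miles: float

# Without type information, a "range query" compares strings lexicographically.
raw_prices = ["1000", "700", "95"]
wrong = [p for p in raw_prices if p < "750"]

def from_source(price_text, distance_km):
    """Integration mapping: convert the source's text price and km into the target schema."""
    return Listing(price_usd=float(price_text),
                   distance_miles=distance_km / KM_PER_MILE)

typed = [from_source(p, d) for p, d in [("1000", 10.0), ("700", 40.0), ("95", 120.0)]]
# With a schema, "under $750 within 50 miles" is an ordinary typed range predicate.
right = [l.price_usd for l in typed if l.price_usd < 750 and l.distance_miles <= 50]
```

Here `wrong` admits "1000" and rejects "95", while `right` keeps only the 700-dollar listing that is also within range.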
Are Semistructured Databases the Answer? (3)

They have some nice features, though
– Flexible, dynamic schemas
• Forgiving w.r.t. variations and exceptions
• Schema evolution is not a big deal
– Richer data modeling (vs. relational)
• Nested structures, ordered collections
– More powerful query languages
• Blurring of schema and data querying
• Ordering, nesting, restructuring handled
Is Object-Relational Database Technology the Answer?

Database characteristics
– Base types, user-defined structured types,
inheritance, reference types, collections
– Collections are well-typed

System characteristics
– Extended SQL-based query languages
– Support for methods (fenced/unfenced)
– Also triggers, LOBs, extensible indexes
Are Object-Relational Databases the Answer? (2)

No, because most O-R DBMSs have
– Overly rigid schemas
• Every instance is of one (known) type
• Evolving a type can be a major burden
• Distributed type management is hard
– Crufty old storage managers
• Ragged or sparse records poorly supported
– Insufficient power in extended SQL
• Prehistoric assumptions get in the way
• Weak on restructuring, schema-querying
Is XML the Answer?
(Yes!! ...What Was the Question Again?)

Structured documents (for the web)
<book>
  <booktitle> Tables Are The Answer </booktitle>
  <author id="cdate">
    <name>
      <firstname> Chris </firstname>
      <lastname> Date </lastname>
    </name>
    <address>
      <city> Saratoga </city>
      <state> CA </state>
    </address>
  </author>
</book>
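
As a small illustration of how such a structured document can be consumed as data, the following Python sketch parses the book example with the standard library's `xml.etree.ElementTree`. The variable names are illustrative, and the document text is the slide's example with straight quotes.

```python
import xml.etree.ElementTree as ET

doc = """
<book>
  <booktitle> Tables Are The Answer </booktitle>
  <author id="cdate">
    <name>
      <firstname> Chris </firstname>
      <lastname> Date </lastname>
    </name>
    <address>
      <city> Saratoga </city>
      <state> CA </state>
    </address>
  </author>
</book>
"""

book = ET.fromstring(doc)
title = book.findtext("booktitle").strip()          # element content
author = book.find("author")
author_id = author.get("id")                        # attribute value
full_name = " ".join(author.findtext(f"name/{part}").strip()
                     for part in ("firstname", "lastname"))
city = author.findtext("address/city").strip()      # nested path lookup
```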
Is XML the Answer? (2)

W3C’s XML Schema working group
– Typed elements, attributes, documents
– Simple types and complex types
– Derived types (extension, restriction)
– Facets, anonymous types, groups, …
– Uniqueness, keys and key references

W3C’s XML Query working group
– XML-QL, XPath, XQL, XSLT, XSQL, …
– Recommendation due in late 2000 (?)
Is XML the Answer? (3)

XML Schema might help because
– XML has achieved a huge mindshare for
data interchange on the web
– DTD standardization is happening for
documents within vertical industries, and
XML Schemas should take over
– When finished, XML Schema should be a
widely used schema description tool
• Similar to O-R schemas, but with more
flexibility (and web-based sex appeal)
Some Useful XML+DB Topics

Publish documents with XML Schemas
from O-R databases
– B2B e-commerce messages
– B2C comparison shopping (if permitted!)
– Robust O-R DB-resident web sites with
XML for page content generation

Use XML Schema as the central data
model for data integration middleware
– I.e., web information integration
Useful XML+DB Topics (2)

Build a “native” XML Repository on top
of an O-R DBMS
– Map from XML Schema model to O-R
DBMS modeling constructs
– Map from XML queries to O-R queries
(including tag variables and loose typing)
– Thereby provide XML document storage
management with industrial-strength
robustness, scalability, and performance
Useful XML+DB Topics (3)

Evolve XML-QL into a complete web
data manipulation language
– Typing a la XML Schema
– Ordered/unordered collections
– XPath-inspired expressions
– Easier grouping and aggregation
– Updates (insert/delete, modify)
– Etc.
The XPERANTO Project

Middleware for publishing O-R (or plain
relational) DB content on the web
– Provides a virtual XML document view
– Based on a “pure XML” approach
– Using XML-QL (as W3C placeholder)

Born at Almaden in summer of 1999
– Mike Carey, Dana Florescu, Zack Ives,
Ying Lu, Jai Shanmugasundaram, Beau
Shekita, Subbu Subramanian
The XPERANTO Belief System

Databases contain, and will continue to
contain, the world’s “data jewels”
– Transactional data (RDBMS)
– Important multimedia assets (ORDBMS)

XML application developers of the
future may not love SQL like we do
– View databases as default XML documents
– Let them define appropriate (query-able)
views of these XML documents
XPERANTO Architecture
[Architecture diagram. Query Translation: XML-QL Parser, Query Rewrite, SQL Translation, XML Tagger. Metadata Services: XML Schema Generator, View Services, Type & Table Services. O-R Database: SQL Query Processor, Stored Tables, System Catalog. Labeled flows: Views, XQGM, XML Schema, Table & Type Info, Catalog Info, SQL Queries, Data Tuples.]
XPERANTO Components

XML-QL Parser
– Neutral query representation (XQGM)

Query Rewrite
– View composition and other rewrites

SQL Translation
– Produce the SQL query (or queries) that fetch
the required data from the underlying DBMS

XML Tagger
– Tag and structure the tabular results
XPERANTO Components (2)

View Services
– Repository for XML view definitions

Type & Table Services
– Interface (and cache) for DB catalog info

XML Schema Generator
– Give DB catalog info in XML Schema form
for default views
– Infer XML Schema info for queries and
non-default view definitions
Consider a Simple O-R Schema
CREATE TABLE book
  (bookID CHAR(30), name VARCHAR(255),
   publisher VARCHAR(30))
CREATE TABLE publisher
  (name VARCHAR(30), address VARCHAR(255))
CREATE TYPE author_type AS
  (bookID CHAR(30), first VARCHAR(30),
   last VARCHAR(30))
CREATE TABLE author OF author_type
  (REF IS ssn USER GENERATED)
Part of the Default XML View
<simpleType name="string255" source="string">
  <maxLength value="255" />
</simpleType>
<simpleType name="string30" source="string">
  <maxLength value="30" />
</simpleType>
<complexType name="bookTupleType">
  <element name="bookID" type="string30" />
  <element name="name" type="string255" />
  <element name="publisher" type="string30" />
</complexType>
<complexType name="bookSetType">
  <element name="bookTuple" type="bookTupleType" maxOccurs="*" />
</complexType>
<element name="book" type="bookSetType" />
.
.
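
The mapping from catalog information to a schema fragment like the one above can be sketched in a few lines of Python. This is an illustrative assumption about how such a generator might work, not XPERANTO's actual code: the `catalog` dictionary and `schema_for` function are hypothetical, only character columns are handled, and the output mimics the draft XML Schema syntax used on the slide.

```python
# Hypothetical catalog info: table name -> list of (column name, max string length).
catalog = {"book": [("bookID", 30), ("name", 255), ("publisher", 30)]}

def schema_for(table, columns):
    """Emit a tuple type, a set type, and a top-level element for one table."""
    lines = []
    # One bounded string type per distinct column length, longest first.
    for length in sorted({n for _, n in columns}, reverse=True):
        lines += [f'<simpleType name="string{length}" source="string">',
                  f'  <maxLength value="{length}" />',
                  '</simpleType>']
    lines.append(f'<complexType name="{table}TupleType">')
    lines += [f'  <element name="{col}" type="string{n}" />' for col, n in columns]
    lines.append('</complexType>')
    lines.append(f'<complexType name="{table}SetType">')
    lines.append(f'  <element name="{table}Tuple" type="{table}TupleType" maxOccurs="*" />')
    lines.append('</complexType>')
    lines.append(f'<element name="{table}" type="{table}SetType" />')
    return "\n".join(lines)

schema_text = schema_for("book", catalog["book"])
```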
XPERANTO’s Default Views

XPERANTO generates default O-R to
XML Schema mappings
– Each DB shown as an XML file
– Subtyping handled via XML Schema’s
refinement facilities
– OIDs and references become ids/idrefs

“Don’t use this at home!”
– Application developers are expected to
define the real view(s) using XML-QL
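
A rough Python sketch of what "each DB shown as an XML file" could look like at the instance level, assuming a database is given as a dictionary of tables. The `default_view` function and the sample rows are hypothetical; the element naming (table name, then `<table>Tuple`) follows the schema fragment and query paths shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical database contents: table name -> list of rows (column -> value).
database = {
    "book": [{"bookID": "b1", "name": "XML Views", "publisher": "Kluwer"}],
    "publisher": [{"name": "Kluwer", "address": "Norwell"}],
}

def default_view(db_name, tables):
    """Wrap every table as an element holding one <tableTuple> per row."""
    root = ET.Element(db_name)
    for table, rows in tables.items():
        table_el = ET.SubElement(root, table)
        for row in rows:
            tuple_el = ET.SubElement(table_el, f"{table}Tuple")
            for column, value in row.items():
                ET.SubElement(tuple_el, column).text = value
    return root

library = default_view("library", database)
xml_text = ET.tostring(library, encoding="unicode")
```

In a real system this view would stay virtual and be evaluated on demand rather than materialized as done here.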
Creating a Better XML View
WHERE <library.book.bookTuple>
        <bookID> $bid </>
        <name> $bname </>
        <publisher> $bpub </>
      </> IN "db2:xml:books/library",
      $bpub = "Kluwer"
CONSTRUCT <book id=$bid>
            <name> $bname </>
            {WHERE <library.publisher.publisherTuple>
                     <name> $bpub </>
                     <address> $addr </>
                   </> IN "db2:xml:books/library"
             CONSTRUCT <publisher>
                         <address> $addr </>
                       </>}
            {WHERE <library.author.authorTuple>
                     <bookID> $bid </>
                     <first> $fname </>
                     <last> $lname </>
                   </> IN "db2:xml:books/library"
             CONSTRUCT <author first=$fname last=$lname/>}
          </>
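
For readers unfamiliar with XML-QL, the following Python sketch mimics what the view computes when evaluated directly over a default-view document. It is only an analogy under stated assumptions: the sample data is invented, the `kluwer_books` function is hypothetical, and the enclosing `<books>` root is added here so the result is a single well-formed document.

```python
import xml.etree.ElementTree as ET

default_doc = ET.fromstring("""
<library>
  <book>
    <bookTuple><bookID>b1</bookID><name>XML Views</name><publisher>Kluwer</publisher></bookTuple>
    <bookTuple><bookID>b2</bookID><name>Other</name><publisher>Acme</publisher></bookTuple>
  </book>
  <publisher>
    <publisherTuple><name>Kluwer</name><address>Norwell</address></publisherTuple>
  </publisher>
  <author>
    <authorTuple><bookID>b1</bookID><first>Ann</first><last>Lee</last></authorTuple>
    <authorTuple><bookID>b1</bookID><first>Bo</first><last>Kim</last></authorTuple>
    <authorTuple><bookID>b2</bookID><first>Cy</first><last>Roe</last></authorTuple>
  </author>
</library>
""")

def kluwer_books(doc):
    result = ET.Element("books")                      # wrapper root (assumption)
    for b in doc.findall("book/bookTuple"):           # outer WHERE pattern
        bid, bname, bpub = (b.findtext(t) for t in ("bookID", "name", "publisher"))
        if bpub != "Kluwer":                          # $bpub = "Kluwer"
            continue
        book = ET.SubElement(result, "book", id=bid)  # outer CONSTRUCT
        ET.SubElement(book, "name").text = bname
        for p in doc.findall("publisher/publisherTuple"):   # first nested query
            if p.findtext("name") == bpub:
                pub = ET.SubElement(book, "publisher")
                ET.SubElement(pub, "address").text = p.findtext("address")
        for a in doc.findall("author/authorTuple"):          # second nested query
            if a.findtext("bookID") == bid:
                ET.SubElement(book, "author",
                              first=a.findtext("first"), last=a.findtext("last"))
    return result

view = kluwer_books(default_doc)
```

The shared variables `$bid` and `$bpub` in the XML-QL text play the role of the equality tests in the inner loops.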
XPERANTO Query Rewrite

XML-QL queries first translated into
XQGM representation
– Neutral, well-poised for more features
– Easier to go from XML-QL to SQL
– Borrow rewrites from DB2 UDB engine

XQGM is an extension of DB2’s QGM
– XML data type for “columns”
– Set of XML-specific functions
SQL Generation and XML Document Tagging/Structuring

Sorted Outer Union queries are used to
obtain the data
– Fetch the data in one query that brings it
back in the appropriate order
– Tag and nest it to create XML document

Advantages of this approach
– Shown to be stable as well as fast
– Simple (linear-space) tagging possible
• Just watch for nesting-related changes
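
The tagging step can be sketched as a streaming pass over the sorted rows. This Python sketch assumes a seven-column row layout (type, bookID, bookName, pubName, pubAddr, authFirst, authLast) and assumes rows arrive grouped by `bookID` with the book row first. The `tag` function and sample rows are hypothetical, not XPERANTO's tagger.

```python
from xml.sax.saxutils import escape, quoteattr

def tag(rows):
    """Yield XML text piece by piece; the only state kept is the currently open book."""
    open_book = None
    for rtype, book_id, book_name, _pub_name, pub_addr, first, last in rows:
        if book_id != open_book:              # nesting-related change: a new parent starts
            if open_book is not None:
                yield "</book>"
            open_book = book_id
            yield f"<book id={quoteattr(book_id)}>"
        if rtype == "0":                      # book row
            yield f"<name>{escape(book_name)}</name>"
        elif rtype == "1":                    # publisher row
            yield f"<publisher><address>{escape(pub_addr)}</address></publisher>"
        elif rtype == "2":                    # author row
            yield f"<author first={quoteattr(first)} last={quoteattr(last)}/>"
    if open_book is not None:
        yield "</book>"

rows = [
    ("0", "b1", "XML Views", None, None, None, None),
    ("1", "b1", None, "Kluwer", "Norwell", None, None),
    ("2", "b1", None, None, None, "Ann", "Lee"),
    ("0", "b2", "More Views", None, None, None, None),
]
document = "".join(tag(rows))
```

Because the input is already in document order, the tagger never has to buffer or revisit earlier rows.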
Outer Union Query Example
WITH OuterUnion (type, bookID, bookName, pubName, pubAddr,
                 authFirst, authLast) AS
(
  SELECT '0', b.bookID, b.name, NULL, NULL, NULL, NULL
  FROM book b
  WHERE b.publisher = 'Kluwer'
  UNION ALL
  SELECT '1', b.bookID, NULL, p.name, p.address, NULL, NULL
  FROM book b, publisher p
  WHERE b.publisher = 'Kluwer' AND b.publisher = p.name
  UNION ALL
  SELECT '2', b.bookID, NULL, NULL, NULL, a.first, a.last
  FROM book b, author a
  WHERE b.publisher = 'Kluwer' AND b.bookID = a.bookID
)
SELECT * FROM OuterUnion ORDER BY bookID
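
To see the shape of the result, the query can be tried against a toy database. The sketch below uses Python's built-in `sqlite3` as a stand-in for DB2, with plain tables instead of the typed `author` table and invented sample rows. It also sorts by `type` after `bookID`, an addition made here so that each book row deterministically precedes its publisher and author rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (bookID TEXT, name TEXT, publisher TEXT);
CREATE TABLE publisher (name TEXT, address TEXT);
CREATE TABLE author (bookID TEXT, first TEXT, last TEXT);
INSERT INTO book VALUES ('b2', 'More Views', 'Kluwer'),
                        ('b1', 'XML Views', 'Kluwer'),
                        ('b3', 'Other', 'Acme');
INSERT INTO publisher VALUES ('Kluwer', 'Norwell'), ('Acme', 'Nowhere');
INSERT INTO author VALUES ('b1', 'Ann', 'Lee'), ('b3', 'Cy', 'Roe');
""")

QUERY = """
WITH OuterUnion (type, bookID, bookName, pubName, pubAddr, authFirst, authLast) AS (
  SELECT '0', b.bookID, b.name, NULL, NULL, NULL, NULL
  FROM book b WHERE b.publisher = 'Kluwer'
  UNION ALL
  SELECT '1', b.bookID, NULL, p.name, p.address, NULL, NULL
  FROM book b, publisher p
  WHERE b.publisher = 'Kluwer' AND b.publisher = p.name
  UNION ALL
  SELECT '2', b.bookID, NULL, NULL, NULL, a.first, a.last
  FROM book b, author a
  WHERE b.publisher = 'Kluwer' AND b.bookID = a.bookID
)
SELECT * FROM OuterUnion ORDER BY bookID, type
"""

# Each row fills only the columns of its own branch; the rest are NULL (None).
rows = conn.execute(QUERY).fetchall()
row_kinds = [(r[0], r[1]) for r in rows]
```

Each branch of the union contributes one kind of row, and the final sort interleaves them so that all rows for a given book are adjacent.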
XPERANTO Project Summary

Goal is to publish O-R data in XML form
– Default XML views
– XML-QL for defining useful views
– “Look Ma, no SQL!”

Currently (re)building our prototype
– View composition is our first stop
– Updates in addition to queries
– Queries over both data and metadata
– Other needs for XML web sites...?
A Few Closing Remarks

DB community must ensure that the
web will support real queries…!
– XML Schema and XML Query standards
need ongoing input from DB researchers
– Large-scale technologies needed for XML
indexing, caching, querying, etc.

DB community should also work on
important underlying technologies
– Publishing XML both from and to RDBMSs
and ORDBMSs, for example!