XML + Databases = ?
(DIMACS Workshop, 3/2000)

Mike Carey
Exploratory Database Systems Department
IBM Almaden Research Center
carey@almaden.ibm.com

Plan for Today’s Talk

Thoughts on DB and web technologies
  – The web and web “querying”
  – Semistructured databases
  – Object-relational databases
  – XML and databases
XML/DB research at IBM Almaden
  – The XPERANTO project
      • Motivation and approach
      • Whirlwind tour of the system

The Web is Great at Supporting URL-Based Sharing

Ex: Online conference proceedings
Web browsers have given us
  – Universal file access (ftp++)
  – Universal document access (html)
  – Universal service access (forms)
What more could we navigational couch potatoes possibly want?
  – Universal platform for e-shopping!

The Web is Lousy at Supporting Parametric Searches

Ex: Find all the used Musicman Sterling bass guitars currently available for under $750 within a 50-mile radius of my San Jose home
This is hard for a number of reasons
  – Data buried in web pages, news groups, classified ads, store sites, auction sites, …
  – No schema (no metal fish, please!)
  – No data types (miles, US$, instruments)
  – No regularity within/across (good!) sites

Aren’t We Supposed to be the Experts on Data Management?

The DB community brought the world
  – Data models, schemas, and views
  – Query languages, optimizers, fast joins
  – Scalable parallel servers
  – Federated database systems
What do we have in our bag of tricks?
  – Semistructured databases
  – Object-relational database systems

Is Semistructured Database Technology the Answer?

Database characteristics
  – Collections of [name, value] pairs or maybe [name, type, value] triples
  – Collections typically set<any> or list<any>
System characteristics
  – “Typeloose” query languages
  – Indexes for nested, typeloose structures
  – Appropriate query processing techniques

Are Semistructured Databases the Answer? (2)

No, because schemas are critical for
  – Data readers
      • What info is in a given collection?
      • Thus, what queries might make sense?
  – Data writers
      • What should I call this piece of info?
      • Is it okay to put this kind of data here?
  – Efficient/effective query processors
      • Indexing, statistics, ... (e.g., range queries)
      • Integration mappings (e.g., unit conversions)

Are Semistructured Databases the Answer? (3)

It has some nice features, though
  – Flexible, dynamic schemas
      • Forgiving w.r.t. variations and exceptions
      • Schema evolution is not a big deal
  – Richer data modeling (vs. relational)
      • Nested structures, ordered collections
  – More powerful query languages
      • Blurring of schema and data querying
      • Ordering, nesting, restructuring handled

Is Object-Relational Database Technology the Answer?

Database characteristics
  – Base types, user-defined structured types, inheritance, reference types, collections
  – Collections are well-typed
System characteristics
  – Extended SQL-based query languages
  – Support for methods (fenced/unfenced)
  – Also triggers, LOBs, extensible indexes

Are Object-Relational Databases the Answer? (2)

No, because most O-R DBMSs have
  – Overly rigid schemas
      • Every instance is of one (known) type
      • Evolving a type can be a major burden
      • Distributed type management is hard
  – Crufty old storage managers
      • Ragged or sparse records poorly supported
  – Insufficient power in extended SQL
      • Prehistoric assumptions get in the way
      • Weak on restructuring, schema-querying

Is XML the Answer? (Yes!! ...What Was the Question Again?)

Structured documents (for the web)

  <book>
    <booktitle> Tables Are The Answer </booktitle>
    <author id="cdate">
      <name>
        <firstname> Chris </firstname>
        <lastname> Date </lastname>
      </name>
      <address>
        <city> Saratoga </city>
        <state> CA </state>
      </address>
    </author>
  </book>

Is XML the Answer? (2)

W3C’s XML Schema working group
  – Typed elements, attributes, documents
  – Simple types and complex types
  – Derived types (extension, restriction)
  – Facets, anonymous types, groups, …
  – Uniqueness, keys and key references
W3C’s XML Query working group
  – XML-QL, XPath, XQL, XSL/T, XSQL, …
  – Recommendation due in late 2000 (?)

Is XML the Answer? (3)

XML Schema might help because
  – XML has achieved a huge mindshare for data interchange on the web
  – DTD standardization is happening for documents within vertical industries, and XML Schemas should take over
  – When finished, XML Schema should be a widely used schema description tool
      • Similar to O-R schemas, but with more flexibility (and web-based sex appeal)

Some Useful XML+DB Topics

Publish documents with XML Schemas from O-R databases
  – B2B e-commerce messages
  – B2C comparison shopping (if permitted!)
  – Robust O-R DB-resident web sites with XML for page content generation
Use XML Schema as the central data model for data integration middleware
  – I.e., web information integration

Useful XML+DB Topics (2)

Build a “native” XML Repository on top of an O-R DBMS
  – Map from XML Schema model to O-R DBMS modeling constructs
  – Map from XML queries to O-R queries (including tag variables and loose typing)
  – Thereby provide XML document storage management with industrial-strength robustness, scalability, and performance

Useful XML+DB Topics (3)

Evolve XML-QL into a complete web data manipulation language
  – Typing a la XML Schema
  – Ordered/unordered collections
  – XPath-inspired expressions
  – Easier grouping and aggregation
  – Updates (insert/delete, modify)
  – Etc.

The XPERANTO Project

Middleware for publishing O-R (or plain relational) DB content on the web
  – Provides a virtual XML document view
  – Based on a “pure XML” approach
  – Using XML-QL (as W3C placeholder)
Born at Almaden in summer of 1999
  – Mike Carey, Dana Florescu, Zack Ives, Ying Lu, Jai Shanmugasundaram, Beau Shekita, Subbu Subramanian

The XPERANTO Belief System

Databases contain, and will continue to contain, the world’s “data jewels”
  – Transactional data (RDBMS)
  – Important multimedia assets (ORDBMS)
XML application developers of the future may not love SQL like we do
  – View databases as default XML documents
  – Let them define appropriate (query-able) views of these XML documents

XPERANTO Architecture

[Figure: architecture diagram. Query Translation: XML-QL Parser, Query Rewrite, SQL Translation, XML Tagger. Metadata Services: XML Schema Generator, View Services, Type & Table Services. O-R Database: SQL Query Processor, Stored Tables, System Catalog. Labeled flows: Views, XQGM, XML Schema, Table & Type Info, Catalog Info, SQL Queries, Data Tuples.]

XPERANTO Components

XML-QL Parser
  – Neutral query representation (XQGM)
Query Rewrite
  – View composition and other rewrites
SQL Translation
  – Produce SQL query(s) to get the required data from the underlying DBMS
XML Tagger
  – Tag and structure the tabular results

XPERANTO Components

View Services
  – Repository for XML view definitions
Type & Table Services
  – Interface (and cache) for DB catalog info
XML Schema Generator
  – Give DB catalog info in XML Schema form for default views
  – Infer XML Schema info for queries and non-default view definitions

Consider a Simple O-R Schema

  Create Table book
    (bookID CHAR(30), name VARCHAR(255), publisher VARCHAR(30))
  Create Table publisher
    (name VARCHAR(30), address VARCHAR(255))
  Create Type author_type AS
    (bookID CHAR(30), first VARCHAR(30), last VARCHAR(30))
  Create Table author OF author_type
    (REF IS ssn USER GENERATED)

Part of the Default XML View

  <simpleType name="string255" source="string">
    <maxLength value="255" />
  </simpleType>
  <simpleType name="string30" source="string">
    <maxLength value="30" />
  </simpleType>
  <complexType name="bookTupleType">
    <element name="bookID" type="string30" />
    <element name="name" type="string255" />
    <element name="publisher" type="string30" />
  </complexType>
  <complexType name="bookSetType">
    <element name="bookTuple" type="bookTupleType" maxOccurs="*" />
  </complexType>
  <element name="book" type="bookSetType" />
  . .

XPERANTO’s Default Views

XPERANTO generates default O-R to XML Schema mappings
  – Each DB shown as an XML file
  – Subtyping handled via XML Schema’s refinement facilities
  – OIDs and references become ids/idrefs
“Don’t use this at home!”
  – Application developers are expected to define the real view(s) using XML-QL

Creating a Better XML View

  WHERE <library.book.bookTuple>
          <bookID> $bid </>
          <name> $bname </>
          <publisher> $bpub </>
        </> IN "db2:xml:books/library",
        $bpub = "Kluwer"
  CONSTRUCT
    <book id=$bid>
      <name> $bname </>
      { WHERE <library.publisher.publisherTuple>
                <name> $bpub </>
                <address> $addr </>
              </> IN "db2:xml:books/library"
        CONSTRUCT <publisher> <address> $addr </> </> }
      { WHERE <library.author.authorTuple>
                <bookID> $bid </>
                <first> $fname </>
                <last> $lname </>
              </> IN "db2:xml:books/library"
        CONSTRUCT <author first=$fname last=$lname/> }
    </>

XPERANTO Query Rewrite

XML-QL queries first translated into XQGM representation
  – Neutral, well-poised for more features
  – Easier to go from XML-QL to SQL
  – Borrow rewrites from DB2 UDB engine
XQGM is an extension of DB2’s QGM
  – XML data type for “columns”
  – Set of XML-specific functions

SQL Generation and XML Document Tagging/Structuring

Sorted Outer Union queries are used to obtain the data
  – Fetch the data in one query that brings it back in the appropriate order
  – Tag and nest it to create XML document
Advantages of this approach
  – Shown to be stable as well as fast
  – Simple (linear-space) tagging possible
      • Just watch for nesting-related changes

Outer Union Query Example

  WITH OuterUnion (type, bookID, bookName, pubName, pubAddr, authFirst, authLast) AS (
    SELECT '0', b.bookID, b.name, NULL, NULL, NULL, NULL
    FROM book b
    WHERE b.publisher = 'Kluwer'
    UNION ALL
    SELECT '1', b.bookID, NULL, p.name, p.address, NULL, NULL
    FROM book b, publisher p
    WHERE b.publisher = 'Kluwer' and b.publisher = p.name
    UNION ALL
    SELECT '2', b.bookID, NULL, NULL, NULL, a.first, a.last
    FROM book b, author a
    WHERE b.publisher = 'Kluwer' and b.bookID = a.bookID
  )
  SELECT * FROM OuterUnion ORDER BY bookID

XPERANTO Project Summary

Goal is to publish O-R data in XML form
  – Default XML views
  – XML-QL for defining useful views
  – “Look Ma, no SQL!”
Currently (re)building our prototype
  – View composition is our first stop
  – Updates in addition to queries
  – Queries over both data and metadata
  – Other needs for XML web sites...?

A Few Closing Remarks

DB community must ensure that the web will support real queries…!
  – XML Schema and XML Query standards need ongoing input from DB researchers
  – Large-scale technologies needed for XML indexing, caching, querying, etc.
DB community should also work on important underlying technologies
  – Publishing XML both from and to RDBMSs and ORDBMSs, for example!
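
Backup: Sketch of the Tagging Step

A minimal sketch, not the XPERANTO implementation, of how the tagger from the "SQL Generation and XML Document Tagging/Structuring" slide might turn sorted outer-union rows into nested XML by watching only for changes in bookID. Assumptions: rows arrive as 7-tuples in the OuterUnion column order, grouped by bookID; the function name tag_rows and the sample data are hypothetical; element order inside each book follows row order, so ordering by (bookID, type) would put the name row first.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_rows(rows):
    """Yield XML fragments for rows shaped like the OuterUnion result.

    Each row is (type, bookID, bookName, pubName, pubAddr, authFirst, authLast).
    Assumption: all rows for one bookID are adjacent (e.g. ORDER BY bookID).
    The only state kept is the bookID of the currently open <book>.
    """
    current = None  # bookID of the <book> element that is currently open
    for rtype, book_id, book_name, _pub_name, pub_addr, first, last in rows:
        if book_id != current:
            # Nesting-related change: close the previous book, open a new one.
            if current is not None:
                yield "</book>"
            yield f"<book id={quoteattr(book_id)}>"
            current = book_id
        if rtype == "0":    # book row
            yield f"<name>{escape(book_name)}</name>"
        elif rtype == "1":  # publisher row
            yield f"<publisher><address>{escape(pub_addr)}</address></publisher>"
        elif rtype == "2":  # author row
            yield f"<author first={quoteattr(first)} last={quoteattr(last)}/>"
    if current is not None:
        yield "</book>"


# Hypothetical sample rows, already sorted by (bookID, type).
rows = [
    ("0", "b1", "XML Basics", None, None, None, None),
    ("1", "b1", None, "Kluwer", "Norwell, MA", None, None),
    ("2", "b1", None, None, None, "Ann", "Lee"),
    ("0", "b2", "Views & Joins", None, None, None, None),
]
document = "".join(tag_rows(rows))
```

Because the generator emits fragments as rows arrive, the result can be streamed to the client without buffering the whole document.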