XML and Data Management

Mary Fernández, Michael Benedikt, Juliana Freire, Arnaud Sahuguet
AT&T Research (Large-Scale Programming Research)
Bell Labs - Lucent Technologies (Network Data and Services Research)
2002 by AT&T and Lucent
WWW2002 - Hawaii

Who We Are
• Researchers in industrial labs
• Our companies
  - Make critical use of XML & database technology
  - Do not sell XML products
• Members of W3C XQuery & XPath Groups
• XML users, just like you
• We are not here to sell you any products

Goals of Tutorial
• Help you understand issues related to management of XML
  - Querying and Data Access
  - Publishing
  - Storage
• Help you articulate questions/answers when you talk to vendor/customer
• Help understand issues if you decide to “roll your own” solution

More Goals of Tutorial
For each topic, we will try to answer the following questions:
• What would be ideal solution?
• What are issues you should be aware of?
• What are commercial offerings?
• What are emerging solutions from research?

What this Tutorial is NOT
• Not detailed study of commercial products
  - Commercial tools are immature, rapidly evolving
  - Product summary will be obsolete tomorrow
• Not research presentation
• Not about latest W3C proposals
  - See W3C track for detailed information

Roadmap
• Introduction (Mary for Arnaud)
  - Why XML?
  - Examples of XML in action
  - Application scenarios
  - What is the XML Data Management Problem?
• <Coffee-break/>
• Interfaces and APIs (Mary)
  - Key concepts and processing models
  - Programmatic interfaces: DOM, SAX
  - Query languages: XPath 2.0, XQuery 1.0
• <Lunch/>

Roadmap (continued)
• Publishing Relational Data in XML (Michael)
  - Goals & problems of publishing
  - Publishing languages
  - Exporting & querying documents
• <Coffee-break/>
• Storage (Juliana)
  - Storage strategies
  - Issues, systems, & techniques
• Question time = any time!

Thanks to Our Colleagues
• Sihem Amer-Yahia, AT&T Labs
• Jerome Simeon, Lucent – Bell Labs
• Philip Wadler, Avaya Labs
• W3C XSLT & XQuery Working Groups

Introduction

Why XML?
• Lingua franca of the Web
• Simple, open, widely accepted
• Web’s secret sauce
• Next silver bullet
• <Your favorite motto here/>

XML In Action

SwissProt flat-file record:

    ID   GRAA_HUMAN STANDARD; PRT; 262 AA.
    AC   P12544;
    DT   01-OCT-1989 (REL. 12, CREATED)
    DT   01-OCT-1989 (REL. 12, LAST SEQUENCE UPDATE)
    DT   15-DEC-1998 (REL. 37, LAST ANNOTATION UPDATE)
    DE   GRANZYME A PRECURSOR (EC 3.4.21.78) (CYTOTOXIC T-LYMPHOCYTE PROTEINASE
    DE   1) (HANUKKAH FACTOR) (H FACTOR) (HF) (GRANZYME 1) (CTL TRYPTASE)
    DE   (FRAGMENTIN 1).
    GN   GZMA OR CTLA3 OR HFSP.
    OS   HOMO SAPIENS (HUMAN).
    OC   Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
    OC   Primates; Catarrhini; Hominidae; Homo.
    RN   [1]
    RP   SEQUENCE FROM N.A.
    RC   TISSUE=T-CELL;
    RX   MEDLINE; 88125000.
    RA   GERSHENFELD H.K., HERSHBERGER R.J., SHOWS T.B., WEISSMAN I.L.;
    RT   "Cloning and chromosomal assignment of a human cDNA encoding a T
    RT   cell- and natural killer cell-specific trypsin-like serine
    RT   protease.";
    RL   PROC. NATL. ACAD. SCI. U.S.A. 85:1184-1188(1988).
    RN   [2]
    RP   SEQUENCE OF 29-53.
    RX   MEDLINE; 88330824.
    RA   POE M., BENNETT C.D., BIDDISON W.E., BLAKE J.T., NORTON G.P.,
    RA   RODKEY J.A., SIGAL N.H., TURNER R.V., WU J.K., ZWEERINK H.J.;
    RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
    RT   and the characterization of inhibitor and substrate specificity.";
    RL   J. BIOL. CHEM. 263:13215-13222(1988).
    RN   [3]
    RP   SEQUENCE OF 29-40, AND CHARACTERIZATION.
    RX   MEDLINE; 89009866.
    RA   HAMEED A., LOWREY D.M., LICHTENHELD M., PODACK E.R.;
    RT   "Characterization of three serine esterases isolated from human IL-2
    RT   activated killer cells.";
    RL   J. IMMUNOL. 141:3142-3147(1988).
    RN   [4]
    RP   SEQUENCE OF 29-39, AND CHARACTERIZATION.
    RX   MEDLINE; 89035468.
    RA   KRAEHENBUHL O., REY C., JENNE D.E., LANZAVECCHIA A., GROSCURTH P.,
    RA   CARREL S., TSCHOPP J.;
    […]

The same record in CML (several lines are cut off at the slide edge, as in the original):

    <?xml version ="1.0"?>
    <cml title="SwissProtTree">
     <molecule title="SwissProt file" scheme="SWISSPROT">
      <string title="ID" scheme="SWISSPROT">GRAA_HUMAN</string>
      <integer title="Length">262</integer>
      <string dictname="AC" title="Accession Number(s)" scheme="SWISSPROT">
      <list title="History">
       <list dictname="DT" title="Revision" scheme="SWISSPROT">
        <string title="Comments">REL. 12, CREATED</string>
        <date>1989-10-01</date>
       </list>
       <list dictname="DT" title="Revision" scheme="SWISSPROT">
        <string title="Comments">REL. 12, LAST SEQUENCE UPDATE</string>
        <date>1989-10-01</date>
       </list>
       <list dictname="DT" title="Revision" scheme="SWISSPROT">
        <string title="Comments">REL. 37, LAST ANNOTATION UPDATE</string>
        <date>1998-12-15</date>
       </list>
      </list>
      <string dictname="DE" title="Description" scheme="SWISSPROT"> GRANZYME
      <string dictname="GN" title="Gene Name(s)" scheme="SWISSPROT">GZMA OR
      <string dictname="OS" title="Organism Species" scheme="SWISSPROT"> HOMO
      <string dictname="OC" title="Organism Classification" scheme="SWISSPROT
      <citation number="1" title="SEQUENCE FROM N.A.">
       <list title="AUTHORS">
        <person> <initials>H.K.</initials> <surname>GERSHENFELD</surname> </person>
        <person> <initials>R.J.</initials> <surname>HERSHBERGER</surname> </person>
        <person> <initials>T.B.</initials> <surname>SHOWS</surname> </person>
        <person> <initials>I.L.</initials> <surname>WEISSMAN</surname> </person>
       </list>
    […]

Callout: the flat author line

    GERSHENFELD H.K., HERSHBERGER R.J., SHOWS T.B., WEISSMAN I.L.;

corresponds to

    <list title="AUTHORS">
     <person> <initials>H.K.</initials> <surname>GERSHENFELD</surname> </person>
     <person> <initials>R.J.</initials> <surname>HERSHBERGER</surname> </person>
     <person> <initials>T.B.</initials> <surname>SHOWS</surname> </person>
     …
    </list>

What Are Benefits?
• Tags
  - Easier for machine & human to parse
• Tree structure
  - Nodes with parent/child relationships
  - Easier to understand
  - Easier to enforce
  - Easier to navigate

What Is XML?
• Just a syntax
• But a standardized, extensible syntax
  - Allows specification of new dialects
• Power comes from related technologies
  - Parsers, schemas, query languages, protocols, application-specific dialects, etc.
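
The "easier to navigate" benefit can be illustrated with a small sketch. It assumes Python's standard `xml.etree.ElementTree` module, which is not part of the tutorial; the `CITATION` string is a hand-trimmed copy of the author list shown above, and `author_names` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed copy of the CML citation fragment (two authors only).
CITATION = """
<citation number="1" title="SEQUENCE FROM N.A.">
  <list title="AUTHORS">
    <person><initials>H.K.</initials><surname>GERSHENFELD</surname></person>
    <person><initials>R.J.</initials><surname>HERSHBERGER</surname></person>
  </list>
</citation>
"""


def author_names(xml_text):
    """Return 'SURNAME INITIALS' for every <person> element in the tree."""
    root = ET.fromstring(xml_text)
    names = []
    # iter() walks the tree in document order, so no line-oriented
    # parsing of the flat "RA" record is needed.
    for person in root.iter("person"):
        surname = person.findtext("surname")
        initials = person.findtext("initials")
        names.append(f"{surname} {initials}")
    return names


if __name__ == "__main__":
    print(author_names(CITATION))
```

With the flat record, the same task would require splitting the `RA` lines on commas and separating surnames from initials by position.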

XML Standards Landscape
• Schema languages
  - DTDs
  - XML Schemas
• Programming APIs
  - DOM
  - SAX, SAX2
• Query languages
  - XPath
  - XSL-T
  - XQuery ML
• Standard organizations
  - W3C
  - OASIS

Examples of Application Domains
Example documents & queries

XML Application Domains
http://xml.coverpages.org/gen-apps.html
http://www.xml.org/xml/industry_industrysectors.jsp

XML Dialect “pot-pourri”
Extensible Financial Reporting Markup Language (XFRML), eXtensible Business Reporting Language (XBRL), MusicXML, Spacecraft Markup Language (SML), Bank Internet Payment System (BIPS), Bioinformatic Sequence Markup Language (BSML), Biopolymer Markup Language (BIOML), Open Catalog Format (OCF), Chemical Markup Language (CML), Electronic Business XML Initiative (ebXML), Open Trading Protocol (OTP), FinXML, Financial Information eXchange protocol (FIX), RecipeML, CVML, XML Bookmark Exchange Language (XBEL), Scalable Vector Graphics (SVG), NewsML, DocBook, Real Estate Listing Markup Language (RELML), . . .

FpML (Finance)
• Complex nesting
• Transaction processing
• Example queries
  - Who are the parties involved in a given contract?
  - When does the contract expire?
  - What are the various components of a contract?
  - What is the total amount of a contract?

BioML (Bioinformatics)
• Large size + complex nesting
  - Nature of genetic sequences
• Annotations (meta-data) describe gene sequence (data)
• Free text for annotations
• Full-text queries for sequence matching
• Hierarchical queries for gene finding & visualization
• Example queries
  - Who sequenced the data?
  - Is there a motif similar to “GTTACCTGGCCAGT” in an intron of the sequence?

HL7 (Healthcare)
• Stream data, e.g., from medical devices
• Patient data in legacy systems
• Transaction processing
• Temporal queries
• Example queries
  - Who is the patient?
  - When did the measurement take place?
  - For the duration of the measurement, what were the max and min value for the following vital signs (…)?

SOAP
• XML envelope for messages
• Headers & bodies are XML documents

    <soap:Envelope xmlns:soap='http://www.w3.org/2001/10/soap-envelope'>
      <soap:Header>
        Header goes here
      </soap:Header>
      <soap:Body>
        Request goes here
      </soap:Body>
    </soap:Envelope>

• Example queries
  - Extract the header
  - Extract the body
  - Extract messages that expire before 07-May-2002

Bill (Library of Congress)
• Documents may have arbitrary nesting (section, subsection, etc.)
• Large text fragments
• Complex full-text search
  - Localized, order, proximity, fuzziness, stemming, synonyms, relevance (document scoring)
• Example queries
  - Find bills that contain “water quality” within 6 terms of “EPA”
  - Find bills that contain “homeland security” within <amendment> element

Importance of Schemas
• Standardized vocabulary for application domain
  - Requires deep understanding of application
• Necessary for validation
• Useful for human editing
• Useful for storage
• Useful for query optimization
• Useful for mapping to programming languages (e.g., Java, C#)

Where Does XML fit?
Some use-cases from application vendors and some XML technologies

Web Publishing
• Export content as XML
• XML to device-specific language
[from Oracle 9i brochures]

XML in Business Automation
• Export data as XML
• XML to XML transformation
• Transport data as XML
• Import XML data
[from Oracle 9i brochures]

What is the XML Data Management Problem?

XML Data Management
[Diagram: XML Producer, Legacy Database, and XML Consumer exchange XML Documents & Schemas. Three operations are labeled: Publish XML (from the Legacy Database), API or Query (XML Interfaces), and Store XML (into a Persistent Database).]

Challenges
• Flexibility, expressiveness of XML data model
• Diverse applications
  - No one-size-fits-all solution
• Immature technologies
  - No off-the-shelf solution

Problem Dimensions
[Diagram with three groups of labels.]
• Application factors: external applications, legacy data, legacy expertise, in house
• Data: structure, schema, size, extensibility
• Query: structure, access patterns, stored vs stream-based, transactional

<Coffee/>

XML & Data Management
Part I: XML Interfaces
APIs and Languages

IMDB Example : Data

    <!ENTITY hollywood "Hollywood">

    <imdb>
      <show year="1993"> <!-- Example Movie -->
        <title>Fugitive, The</title>
        <review>
          <suntimes>
            <reviewer>Roger Ebert</reviewer> gives
            <rating>two thumbs up</rating>!
            A fun action movie, Harrison Ford at his best.
          </suntimes>
        </review>
        <review>
          <nyt>The standard &hollywood; summer movie strikes back.</nyt>
        </review>
        <box_office>183,752,965</box_office>
      </show>
      <show year="1994"> <!-- Example Television Show -->
        <title>X Files,The</title>
        <seasons>4</seasons>
      </show>
      . . .
    </imdb>

IMDB Example : Schema

    <element name="show">
      <complexType>
        <sequence>
          <element name="title" type="xs:string"/>
          <sequence minoccurs="0" maxoccurs="unbounded">
            <element name="review" mixed="true"/>
          </sequence>
          <choice>
            <element name="box_office" type="xs:integer"/>
            <element name="seasons" type="xs:integer"/>
          </choice>
        </sequence>
        <attribute name="year" type="xs:integer" use="optional"/>
      </complexType>
    </element>

Key Concepts
• Data Model
  - What features of document are important?
  - Individual characters, synthesized strings
  - Entity references, CDATA sections
  - Comments, processing instructions, namespace nodes
  - Typed values, un-typed (well-formed) values, mixed content
• Schema (a.k.a. Type)
  - Specifies contract between data producers & consumers
  - Types of literal (terminal) data
  - Names of elements & attributes
  - “Vertical” & “horizontal” structure of elements
  - Regular expressions (a la Unix grep) over XML tags
  - Impacts querying

Interface Characteristics
• Expressiveness & Ease-of-use
  - What XML content is accessible?
  - Access method
    Navigational, streams, declarative
    Linguistic vs. programmatic
  - What are appropriate applications?
• Flexibility & Completeness
  - Support typed values, un-typed text, mixed content?
  - Support for update?
• Safety
  - Respect schema/type of input document(s)?
  - Guarantee/enforce expected schema/type of output?

Interface Landscape
• Document-Object Model (DOM)
• Simple API for XML (SAX)
• XPath 2.0
• XQuery 1.0
• Brief comparison of XSLT 2.0 with XQuery 1.0
• Focus: Impact of interfaces on data management

Relational Analogs

    XML                               Relational
    --------------------------------  ---------------------
    XSLT 2.0/XQuery 1.0, XPath 2.0    SQL
    XPath 2.0 Data Model              Relational Data Model
    DOM API, SAX API                  JDBC/ODBC
    XML Document                      Relational Database

Generic XML Processing Model
• XML Information Set: per-character, per-entity model of XML document
• Pipeline:
    XML Document
      => Document Parser (expand entity references, check well-formedness)
      => XML Infoset
      => Document Validator, using DTD or XML Schema (validate data, add type annotations, insert default values)
      => XML Infoset (+ Types)
      => Application / Storage System

Navigational Access: DOM
• Language-independent, programmatic API
• Un-typed, object model of document content
• Application requirements
  - Full navigational access to document
  - Dynamic update, add, & delete document content
  - Ex: Client-side browser apps; plumbing of Dynamic HTML
• Pipeline: XML Document + DTD or XML Schema => DOM Parser / Validator => DOM Instance => Application

DOM Example

    Document
      childNodes: Element("imdb")
        childNodes: Element("show")
          attributes: Attr("year", "1993")
          childNodes: Element("title")        (firstChild)
                        childNodes: Text("Fugitive,The")
                      Element("review")
                      Element("review")
                      Element("box_office")   (lastChild)
                        childNodes: Text("183,752,965")
    Links between nodes: next/previousSibling, parentNode

DOM Characteristics
• Data Model
  - No access to type information
  - Access to everything else, e.g., entity references, CDATA sections, all node kinds
• Query Access
  - Ex: Reviews of shows with title “Fugitive, The” in IMDB
  - Implement programmatically by hand:

        for s in documentElement.getElementsByTagName("show")
          if (some t in s.getElementsByTagName("title")
              satisfies (t.characters() = "Fugitive, The"))
          then s.getElementsByTagName("review")

  - Use DOM Level 3 (XPath interface):

        /imdb/show[title="Fugitive, The"]/review

More DOM Characteristics
• No access to schema information
• No type-safety guarantees
  - Atomic values (integers, dates, etc) modeled as un-typed text
  - No guarantee that processing produces valid output
• Validate input => Process un-typed objects => Validate output

Streams Access : SAX
• Language-independent, programmatic API
• Stream of un-typed elements, attributes, text
• Call-backs into application
• Applications
  - Content-based routing of XML messages
    Ex: filter stock quotes, network alerts, …
  - Read-once processing of large documents
    Ex: load XML document into storage system
• Pipeline: XML Document + DTD or XML Schema => SAX Parser / Validator => SAX Events => Application

SAX Example

    startElement("imdb", null)
    startElement("show", ("year", "1993"))
    comment("Example movie")
    startElement("title", null)
    characters("Fugitive, The")
    endElement("title")
    startElement("review", null)
    startElement("suntimes", null)
    startElement("reviewer", null)
    characters("Roger Ebert")
    endElement("reviewer")
    characters(" gives ")
    startElement("rating", null)
    characters("two thumbs up")
    endElement("rating")
    ...
    endElement("suntimes")
    endElement("review")
    ...
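
The event trace above can be reproduced with a real SAX parser. The following is a minimal sketch, assuming Python's standard `xml.sax` module (the tutorial itself is language-independent); `EventLogger`, `log_events`, and the trimmed `IMDB` string are illustrative names and data, not part of the original slides. XML comments are not delivered through `ContentHandler`, so no `comment` event appears here, and a parser is free to split one run of text into several `characters` calls.

```python
import xml.sax

# Trimmed version of the IMDB example document.
IMDB = """<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review><suntimes><reviewer>Roger Ebert</reviewer> gives
      <rating>two thumbs up</rating>!</suntimes></review>
  </show>
</imdb>"""


class EventLogger(xml.sax.ContentHandler):
    """Records each call-back as a tuple, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs is an un-typed name -> string mapping.
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the trace readable.
        if content.strip():
            self.events.append(("characters", content))


def log_events(xml_text):
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in log_events(IMDB):
        print(event)
```

The handler sees each node once and keeps no tree, which is what makes the approach suitable for read-once processing of large documents.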

SAX Characteristics
• Data Model
  - Supports “document order” access to content
  - Read-only access to un-typed nodes
  - No update-in-place
  - Stream transformation
• Query Access
  - XPath expressions in descendant/following-sibling axes
  - Use automata to preserve state

        startElement(tag, attributes){
          if (tag = "imdb") push("imdb")
          else if (peek() = "imdb" && tag = "show" &&
                   attributes["title"] = "Fugitive, The") then push("show")
          else if (peek() = "show" && tag = "review") then
            writeElement("review", attributes)...

More SAX Characteristics
• No access to schema information
• No type-safety guarantees
  - Atomic values (e.g., integers) modeled as un-typed text
  - No guarantee that processing produces valid output
• Validate input => Stream of un-typed node events => Validate output

Common Querying Tasks
• Filter, select XML values
  - Navigation, selection, extraction
• Merge, integrate values from multiple XML sources
  - Joins, aggregation
• Transform XML values from one schema to another
  - XML construction
• Programmatic interfaces specify how
• Query languages specify what, not how
  - Provide abstractions for common tasks
  - Easier than programmatic interfaces

Query Languages
• XPath 2.0
  - Common language for navigation, selection, extraction
  - Used in XSLT, XQuery, XPointer, XML Schema, XForms, et al
• XSLT 2.0: XML ⇒ XML, HTML, Text
  - Loosely-typed scripting language
  - Format XML in HTML for display in browser
  - Must be highly tolerant of variability/errors in data
• XQuery 1.0: XML ⇒ XML
  - Strongly-typed query language
  - Large-scale database access
  - Must guarantee safety/correctness of operations on data
• Over time, XSLT & XQuery may both serve needs of many application domains

Query Processing Model
• Pipeline:
    XML Document(s) + XML Schema(ta)
      => Parser / Validator
      => XPath 2.0 Data Model instance
    Query + Data Model Instance
      => Query Evaluator ((may) type check query; evaluates query on data model instance)
      => Data Model Instance
      => Application
• Other models possible

XPath 2.0
Functionality & Features

XPath 2.0 Functionality
• Language
  - Uniform semantics & syntax in XSLT & XQuery
  - Guarantees same syntactic expression has same semantics
  - Navigation, selection, value extraction
  - Arithmetic, logical, comparison expressions
• Data Model
  - Minimal interface necessary to express semantics
  - Sequences of element, attribute, comment, PI, & text nodes & atomic values
  - Access to type information

XPath 2.0 Data Model

    Document
      children: Element("imdb")
        children: Element("show")
          attributes: Attribute("year", "1993")
          children: Element("title")
                      children: xs:string("Fugitive,The")
                    Element("review")
                    Element("review")
                    Element("box_office")
                      children: xs:integer(183,752,965)
    Each node also has a parent link.

Filtering
• Simple (forward) navigation & extraction
• Constraints only on self, children, descendants
• Return all titles (at any level) in IMDB

        /imdb//title

  - Syntactic sugar for:
        root()/child::imdb/descendant-or-self::node()/child::title
• Selection

        //show[year >= 2000]

Complex Filtering
• Navigate & select siblings, ancestors

        //show/reviewer[following-sibling::rating]

• Document order

        //surgery[//anesthesia[1] before //incision[1]]

• Constraints on following/preceding siblings, ancestors
  - Used to identify context of a value
  - Often used in document-processing applications

Text Processing
• Full-text operators

        //*[xf:text-contains(., "Russell Crowe")]

• Span structured content
• Myriad other text functions proposed:
  - Phrase search
  - Support for stop words
  - Proximity searching
    Ex: “Tom Cruise” within two words of “Penelope Cruz”
  - Boolean combinations
  - Ranking, relevance

Variability in XML Data
• Replication, absence of XML values
• Demands flexible semantics for selection
• Selection

        //show[year >= 2000]

  - Explicit expression:
        //show[some $v in ./child::year satisfies data($v) ge 2000]
• Existence/absence of value

        //show/reviewer[following-sibling::rating]

  - Explicit expression:
        //show/reviewer[not empty(./following-sibling::rating)]

Variability in Schemas
• Documents may contain fragments with strongly typed values & un-typed text
• Demands flexible, but consistent, semantics

        <book isbn="ISBN 10-111">
          <price>45.50</price>
        </book>

• Un-typed text
  - Permissive conversion from PCDATA to typed values
        /book/price * 0.07
• Typed values
  - Strict interpretation of typed values
  - Type error is fatal
        /book/@isbn * 0.07

Beyond XPath 2.0
• Limitations
  - Constructing new XML
  - Recursive processing of recursive XML data
  - Supported by XSLT & XQuery
• Differences between XSLT & XQuery
  - Safety: XQuery enforces input & output types
  - Compositionality: XQuery maps XML to XML, XSLT maps XML to anything
    Important feature for XML publishing
• Focus on XQuery
  - XSLT covered elsewhere

XQuery 1.0
Functionality & Features

XQuery 1.0
• Functional, strongly typed query language
• XQuery 1.0 = XPath 2.0 + …
  - A few more expressions
    for-let-where-return (FLWR) ~ SQL’s SELECT-FROM-WHERE
    Sort-by
    XML construction (Transformation)
    Operators on types (Compile & run-time type tests)
  - + User-defined functions
    Modularize large queries
    Process recursive data
  - + Strong typing
    Guarantees result value conforms to output type
    Enforced statically or dynamically

Joins & XML Construction
• Arbitrary nesting of expressions & literal XML
• For each actor, return box office receipts of films in which they starred in past 2 years

        let $imdb := document("www.imdb.com/imdb.xml")
        for $actor in $imdb//actor                          <-- Iteration
        let $films :=
          $imdb//show[box_office and @year >= 2000
            and $actor/name = .//actor[@role="star"]/name]  <-- Join
        return
          <receipts>                                        <-- XML Construction
            { $actor }
            <total> { sum($films/box_office) } </total>     <-- Aggregation
          </receipts>

XML Transformation
• User-defined functions
  - Same expressiveness as XSLT templates + parameters
  - Signatures specify types of arguments & return values
  - Types enforced statically or dynamically

        define function show2movie(element show $show) returns element movie? {
          // Convert a show (that is a movie) to a movie
          if ($show/box_office) then <movie> { $show/* } </movie>
          else ()
        }

        let $imdb := document("www.imdb.com/imdb.xml")
        return
          <movies>
            { for $show in $imdb/show return show2movie($show) }
          </movies>

Recursive XML Data
• Recursive functions support recursive data

    Input:
        <Part id="001">
          <Part id="002">
            <Part id="003"/>
          </Part>
          <Part id="004"/>
        </Part>

    Output:
        <PartCt count="2" id="001">
          <PartCt count="1" id="002">
            <PartCt count="0" id="003"/>
          </PartCt>
          <PartCt count="0" id="004"/>
        </PartCt>

    Function:
        define function partCount(element Part $p1) returns element PartCt {
          <PartCt count="{ count($p1/Part) }" { $p1/@id }>
            { for $p2 in $p1/Part return partCount($p2) }
          </PartCt>
        }

Safety
• Shared schema (S_shared) is contract between producers & consumers
• Producer writes query to transform input data into output data

        D_input : S_input ⇒ Q_producer ⇒ D_output : S_output

• Static Type Checking takes S_input & Q_producer
  - Infers S_output : schema of output data
  - Checks that S_output is “subtype” of S_shared
  - Guarantees D_output : S_shared

Inferring Type of Expression
• Expression

        <titles> { $imdb//title } </titles>

• Static type derived from expression

        <element name="titles">
          <complexType>
            <sequence minoccurs="0" maxoccurs="unbounded">
              <element name="title" type="xs:string"/>
            </sequence>
          </complexType>
        </element>

• Value

        <titles>
          <title>Fugitive, The</title>
          <title>X Files, The</title>
        </titles>

Inferring Type of Expression (continued)
• Expression

        //show[contains(title, "Fugitive")]

• Cannot determine statically that constraint is satisfied
• Inferred type is conservative

        <sequence minoccurs="0" maxoccurs="unbounded">
          <elementref name="show"/>
        </sequence>

• Value

        <show year="1993"> <!-- Example Movie -->
          <title>Fugitive, The</title>
          <review>
            <suntimes>
              <reviewer>Roger Ebert</reviewer> gives
              <rating>two thumbs up</rating>!
              A fun action movie, Harrison Ford at his best.
            </suntimes>
          </review>
          ...
        </show>

Type-Safe Composition
• Expression

        //show[contains(title, "Fugitive")]

• Inferred type

        <sequence minoccurs="0" maxoccurs="unbounded">
          <elementref name="show"/>
        </sequence>

• Required type

        define function show(element show+ $show) returns …

        <sequence minoccurs="1" maxoccurs="unbounded">
          <elementref name="show"/>
        </sequence>

• Type mismatch raises an error
  - If static typing, error raised when analyzing query
  - If dynamic typing, error raised when evaluating query

Feature Summary

                 XML Content                                        Safety
                 What                      How           Update     Input          Output
    ----------   ------------------------  ------------  ---------  -------------  ------------
    DOM          Entity refs, String data  Navigational  In-place   Not Preserved  Not Enforced
    SAX          Entity refs, String data  Streams       Transform  Not Preserved  Not Enforced
    XPath 2.0    Typed values              Declarative   -          Preserved      -
    XSLT 2.0     Typed values              Declarative   Transform  Preserved      Not Enforced
    XQuery 1.0   Typed values              Declarative   Transform  Preserved      Enforced
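
The "How" column of the summary can be made concrete by running the earlier "reviews of Fugitive, The" query both ways. This is a sketch only, assuming Python's standard `xml.dom.minidom` for navigational access and the limited path syntax of `xml.etree.ElementTree` as a stand-in for a declarative language (it is not XPath 2.0). The function names and the trimmed `IMDB` document are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Trimmed version of the IMDB example document.
IMDB = """<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review><suntimes>two thumbs up</suntimes></review>
    <review><nyt>summer movie</nyt></review>
  </show>
  <show year="1994">
    <title>X Files,The</title>
  </show>
</imdb>"""


def _text(node):
    """Concatenate the direct text children of a DOM node."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def reviews_dom(xml_text, wanted_title):
    """Navigational style: the program spells out how to walk the tree."""
    doc = minidom.parseString(xml_text)
    found = []
    for show in doc.documentElement.getElementsByTagName("show"):
        titles = show.getElementsByTagName("title")
        if any(_text(t) == wanted_title for t in titles):
            found.extend(show.getElementsByTagName("review"))
    return found


def reviews_path(xml_text, wanted_title):
    """Declarative style: one path expression states what is wanted."""
    root = ET.fromstring(xml_text)  # root is the <imdb> element
    # Assumes the title contains no single quote.
    return root.findall(f"./show[title='{wanted_title}']/review")


if __name__ == "__main__":
    print(len(reviews_dom(IMDB, "Fugitive, The")))
    print(len(reviews_path(IMDB, "Fugitive, The")))
```

Both functions return un-typed nodes, which matches the "Not Preserved" entries for the programmatic interfaces in the table.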

Implementor’s Perspective
• Interface : multiple implementation strategies
[Diagram: interface layers (XSLT 2.0/XQuery 1.0, XPath 2.0, XPath Data Model, DOM API, SAX API, XML Parser, XML Information Set, XML Document) annotated with implementation strategies.]
• Strategies shown:
  - Implement from scratch
  - Build on existing storage system
  - Special-purpose streams processor
  - Translate into SQL/OQL/LDAP
  - Custom query engine

User’s Perspective
• Appropriate interface depends on
  - Processing behavior of application
  - Requirements of application
    Safety, update, …
  - Capabilities of underlying storage & query system
• Solutions for Publishing & Storing XML data
  - APIs & Languages complete, yet complex
  - Vendors implement features that make “80/20” for their user base
    Ex: Support read-only subset of DOM API
  - As customer, must ask whether vendor’s choices appropriate for your applications

References
• DOM
  http://www.w3.org/TR/REC-DOM-Level-1/
• SAX
  http://www.saxproject.org/
• XPath 2.0
  http://www.w3.org/TR/query-datamodel/
  http://www.w3.org/TR/xpath20/
  http://www.w3.org/TR/query-operators/
• XQuery 1.0
  http://www.w3.org/TR/xquery/

XPath 2.0 vs. XPath 1.0
• Consistent semantics for XPath 2.0
  - XML documents with schema
  - Expression raises error or evaluates to unique value
• Ordered sequences of typed node & atomic values
  - Expr op Expr, where op is one of:  ,  union  |  intersect  except
  - for Var in Expr return Expr
• Conditional expression
  - if Expr then Expr else Expr
• Quantified expressions
  - some/every Var in Expr satisfies Expr

<Lunch/>

Extra Slides

XML Test Drive

    CARS                 XML
    -------------------  ----------------
    Speed                Performance
    Color                User Interface
    Comfort              Flexibility
    Gas consumption      Storage overhead
    Safety               Maturity
    Fashion/Hip factor   Hype factor
    Price                Price

• Just like for cars, it’s hard to get all the features
• There are trade-offs: Tutorial will help you make informed decisions about trade-offs

XML & Data Management
Part II: Relational Publishing in XML

Agenda:
• Goals and problems of publishing
• Publishing languages
• Exporting documents
• Querying documents

XML - it’s only an interchange format
[Diagram: several boxes labeled "Data base" feed a "Magical Publishing Box!", which emits XML!, HTML, and Data base!]
• Most data is stored in pre-existing databases
• Will continue to be updated through these interfaces
• Need to provide XML wrappers to export data
• Focus here on relational databases

The “easy” publishing problem
• Export relational data in a canonical format
• Useful for
  - web publishing
  - data integration

    Actors
    LastName    FirstName
    Viterelli   Joe
    ....

    ...
    <Row>
      <LastName> Viterelli </LastName>
      <FirstName> Joe</FirstName>
    </Row>
    ...
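
The canonical export is simple enough to sketch in a few lines. The example below assumes Python's standard `sqlite3` and `xml.etree.ElementTree` modules as stand-ins for a real RDBMS and its vendor export utility; `export_rows` and `demo` are hypothetical names, the `rowset`/`row` element names are one possible canonical format, and the second actor row is added only so that the output has more than one `<row>`.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_rows(conn, query):
    """Run one SQL query and emit the result as flat, canonical XML:
    one <row> per tuple, one child element per column."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    rowset = ET.Element("rowset")
    for record in cursor:
        row = ET.SubElement(rowset, "row")
        for column, value in zip(columns, record):
            # NULLs become empty elements; everything else is un-typed text.
            ET.SubElement(row, column).text = "" if value is None else str(value)
    return ET.tostring(rowset, encoding="unicode")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Actors (LastName TEXT, FirstName TEXT)")
    conn.executemany(
        "INSERT INTO Actors VALUES (?, ?)",
        [("Viterelli", "Joe"), ("Winter", "Alex")],
    )
    return export_rows(
        conn, "SELECT LastName, FirstName FROM Actors ORDER BY LastName"
    )


if __name__ == "__main__":
    print(demo())
```

The mapping is fixed by the query's column list, which is why this case is "easy": the shape of the XML follows the shape of the relation, with no nesting or renaming.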
WWW2002 - Hawaii 2002 by AT&T and Lucent 81 Example: Oracle 9i XML SQL Utility Standardization of format proceeding as part of SQLX Export Embed single SQL query in XSL stylesheet Emit result in canonical, flat XML <rowset> <row> <lastname> Viterelli </lastname> <firstname> Alex </firstname> </row> ... </rowset> Similar facility available in DB2, SQL Server SQL server: SELECT Lname, Fname FROM Actor FOR XML Auto 2002 by AT&T and Lucent WWW2002 - Hawaii XML and Data Management <firstname> Joe </firstname> </row> <row> <lastname> Winter </lastname> 82 Harder problem: Publish This! LastName FirstName Viterelli Joe .... Typical Situation: publish in predefined format <actor> WWW2002 - Hawaii <actor> 2002 by AT&T and Lucent XML and Data Management <familyname> Viterelli </familyname> <firstname> Joseph </firstname> <roletype> slow-witted gangsters who are knowledgeable about veal</roletype> <movies> <movie title=“Analyze This”> <character name=“Jelly”/> </movie> <movie title=“See Spot Run”> <character name=“Gino Valente” /> </movie> </movies> 83 Publishing XML to a fixed interface Compared to storing XML data in a relational database: z z easier: just worry about read access to XML harder: one less degree of freedom Process XML requests (DOM, etc.) Vendor Extension or Middleware Create Schema Supplier RDBMs 2002 by AT&T and Lucent WWW2002 - Hawaii XML and Data Management Store/Update 84 Publishing XML to a fixed interface Compared to storing XML data in a relational database: z z easier: just worry about read access to XML view harder: one less degree of freedom Process XML requests Vendor Extension or Middleware Create Schema Supplier RDBMs WWW2002 - Hawaii 2002 by AT&T and Lucent XML and Data Management Store/Update 85 Publishing XML to a Fixed Interface Compared to storing XML data in a relational database: z z easier: just worry about read access to XML view harder: one less degree of freedom. 
[Diagram: XML requests are processed by a vendor extension or middleware layer on top of the supplier RDBMS]

Perspectives on Publishing
- Application developer:
  - How to describe the map from relational data to XML
  - How much transparency: how much additional effort does querying the document require?
- Relational vendor implementer:
  - Performance of document retrieval
  - Performance of queries
  - Exploitation of the underlying relational engine
- Middleware developer perspective (e.g., you):
  - Above, plus how to remain vendor-independent

Ultimate Goal
- Maintain the "common illusion" of storing an XML document
[Diagram: the application sends DOM calls and XQuery requests to a vendor extension or middleware layer over the supplier RDBMS, and sees a <Movies> ... </Movies> document]

View Definition Languages
- Specification: how the user describes the desired translation
  - XML view = mapping from tables to XML
- No standard yet: several languages, even for one vendor!
- Main challenge: accommodate enormous variation in mappings
- In this tutorial, we give:
  - A flavor of commercial languages
  - A general framework based on XQuery

Describing Views
Relational instance:
    Actor(aid, lname, fname) : { <001, "Viterelli", "Joe">, … }
    Movie(mid, title, year) : { <011, "Analyze This", 1999>, <032, "See Spot Run", 2001> }
    Appearance(aid, mid) : { <001, 011>, <001, 032> }

Desired XML view:
    <Actor>
      <Lname>Viterelli</Lname>
      <Fname>Joe</Fname>
      <Movie year="1999">Analyze This</Movie>
      <Movie year="2001">See Spot Run</Movie>
    </Actor>
    <Actor>
      <Lname>Winter</Lname>
      <Fname>Alex</Fname>
      <Movie year="1988">Bill and Ted's Excellent Adventure</Movie>
    </Actor>

Approach 1: Universal Relation (IBM DB2)
- A SQL statement gives one big relation:

    <SQL_stmt>
      SELECT A.lname, A.fname, M.year, M.title
      FROM Movie M, Actor A, Appearance Ap
      WHERE M.mid = Ap.mid AND Ap.aid = A.aid
      ORDER BY A.aid
    </SQL_stmt>

    A.lname      A.fname    M.year    M.title
    Viterelli    Joe        1999      Analyze This
    Viterelli    Joe        2001      See Spot Run

- Plus a formatting template annotated with columns of the universal relation:

    <element_node name="Actor">
      <element_node name="Lname">
        <text_node> <Column name="A.lname"/> </text_node>
      …

- Result:

    <Actor>
      <Lname>Viterelli</Lname>
      <Fname>Joe</Fname>
      <Movie year="1999">Analyze This</Movie>
      <Movie year="2001">See Spot Run</Movie>
    </Actor>
    <Actor>
      <Lname>Winter</Lname>
      <Fname>Alex</Fname>
      …

Universal Relation
- Vendors:
  - IBM DB2 XML Extender
  - SQL Server 2000 Universal XML
- Variability in XML documents -> complex universal relations
  - Show actors with no movies -> add outer join
  - Merge television actors with movie actors -> outer union
- Verbose and cumbersome
  - small documents -> tedious
  - large documents -> unthinkable
- Close to a particular implementation
- No safety guarantees on output

Approach 2: Annotated Schema
- Start with the desired output schema, and sprinkle it with annotations saying where the data comes from

    <xs:element name="Actor">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="Lname" type="xs:string" />
          <xs:element name="Fname" type="xs:string" />
          <xs:element ref="Movie" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    <xs:element name="Movie">
      <xs:attribute name="Year" type="xs:dateTime" />
    </xs:element>

Schema-driven mapping in SQL Server 2000

    <xs:element name="Actor" sql:relation="Actor">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="Lname" type="xs:string" sql:field="lname" />
          <xs:element name="Fname" type="xs:string" sql:field="fname" />
          <xs:element ref="Movie" sql:relationship="ActorAppear AppearMovie" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    <xs:element name="Movie" sql:relation="Movie" sql:field="title">
      <xs:attribute name="Year" type="xs:dateTime" sql:field="year" />
    </xs:element>
    <xs:annotation>
      <xs:appinfo>
        <sql:relationship name="ActorAppear"
                          parent="Actor" parent-key="aid"
                          child="Appearance" child-key="aid" />
        <sql:relationship name="AppearMovie"
                          parent="Appearance" parent-key="mid"
                          child="Movie" child-key="mid" />
      </xs:appinfo>
    </xs:annotation>

- The Actor element is associated with an Actor tuple (sql:relation)
- The text content of Lname and Fname is filled in from the lname and fname fields of that tuple (sql:field)
- The relationships named on the Movie reference, defined in the annotation, determine how many Movie elements appear inside this Actor
- The Movie element takes its text content from the table and field named on it, for the associated tuple

Annotated Schemas
- Variations from several vendors:
  - SQL Server 2000 (shown previously)
  - IBM DB2 DAD RDB_node format
- Pros and cons:
  - Compared to the universal relation approach:
    - Much more modular
    - Enable integration of validation and publishing
  - Current versions are limited in expressive power:
    - Relationships are key/foreign key, not arbitrary SQL
    - Lack of support for unions and full outer joins

Emerging Approach
- Transforming tables to XML is similar to transforming XML to XML -> should be easier, not harder!
- Why learn two languages?
  - Use canonical XML
  - Plus XQuery

    <DB>
      <Actor aid="01" lname="Viterelli" fname="Joe"/>
      <Actor aid="02" lname="Winter" fname="Alex"/>
      …
      <Appearance aid="01" mid="011"/>
      …
      <Movie mid="011" title="Analyze This" year="1999"/>
      …
    </DB>

Describing Views with XQuery
- Leverage the power of XQuery:
  - built-in functions
  - dynamic attribute creation
  - ordering
- Does not imply a naïve implementation!

    for $actor in //Actor
    return
      <Actor>
        <Fname>{ string($actor/@fname) }</Fname>
        <Lname>{ string($actor/@lname) }</Lname>
        {
          for $actorappearance in //Appearance[@aid = $actor/@aid]
          return
            for $movie in //Movie[@mid = $actorappearance/@mid]
            return <Movie year="{ $movie/@year }">{ string($movie/@title) }</Movie>
        }
      </Actor>

Remember the Dream!
- Maintain the "common illusion" of storing an XML document
[Diagram: the application sends DOM calls and XQuery requests to a vendor extension or middleware layer over the supplier RDBMS, and sees a <Movies> ... </Movies> document]

Uncoupled Approach
- Use vendor-provided canonical XML plus application-developer-provided XSLT/XQuery
- Issue: performance
[Diagram: canonical XML -> XQuery -> final XML]

Fully-Coupled Approach
- Use a vendor-provided XML template or postprocessing language
- Vendor languages currently lack flexibility
- Support for querying the document is limited or absent
[Table: Flex. Mapping, Export View and Query View support for IBM DB2, MS SQL Server and Oracle 9i; extracted cell values: Flex. Mapping N, Y-, Y-; Export View Y, Y, Y; Query View N, N, XPath]

Middleware Approach
- Write a general wrapper layer that responds to requests by:
  - Generating SQL queries at runtime
  - Tagging results
- Focus of the rest of this section
- Gives insight into vendor implementations as well
[Diagram: the middleware layer consists of a query generator and a tagger. A request and the view definition feed the query generator, which uses query-cost estimates and source capabilities to emit SQL queries; the tagger turns the results into XML]

Middleware Systems
- Research systems:
  - SilkRoute (AT&T Research)
  - XPERANTO (IBM Research)
  - PRATA (Bell Labs)
  - ROLEX (Bell Labs)
- Component within data integration systems:
  - E-XMLMedia
  - Enosys
  - Tibco

Generating SQL
- Running example: find actors with last name beginning L-Z, their movies, their agent, and awards
- Goal: push work inside the relational engine
- Avoid: a wrapper that repeatedly interleaves querying and merging of result sets:

    Find $A = SELECT * FROM Actor WHERE lname > 'M'

    Then for each $a in $A find:
      SELECT M.year, M.title
      FROM Movie M, Appear AP
      WHERE AP.aid = $a.aid AND M.mid = AP.mid

    And for each $a in $A find:
      SELECT AW.name, AWD.date
      FROM Awards AW, Awarded AWD
      WHERE AWD.aid = $a.aid AND AWD.awdid = AW.awdid

Exploring Solution Space
- Write down "building block" queries
- Show dependencies: attach the queries to a tree based on the output schema

    <Actor>
      <Agent>    (1)
      <Award>    (*)   awardname
      <Movie>    (*)   mtitle

    Q1(actorid, fname, lname) =
      SELECT actorid, fname, lname FROM Actor WHERE lname > 'M'
    Q2(agentid, name, actorid) = …
    Q3(awardyear, awardname, actorid) =
      SELECT AWD.awardyear, AW.awardname, A.actorid
      FROM Awarded AWD, Award AW, Actor A WHERE …
    Q4(mtitle, actorid) =
      SELECT … FROM Actor, Appearance, Movie WHERE …
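
The contrast drawn on the "Generating SQL" slide can be sketched as follows. This is a minimal illustration, not the algorithm of any of the systems named above: it assumes an in-memory SQLite database with the Actor/Movie/Appearance tables from the running example, and the function names `publish_interleaved` and `publish_merged` are hypothetical. The first function issues one extra query per actor; the second issues one sorted query per building block and merges the sorted results in a single pass while tagging.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_db():
    """Create the running-example tables (sample rows are illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Actor (aid INTEGER PRIMARY KEY, lname TEXT, fname TEXT);
        CREATE TABLE Movie (mid INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE Appearance (aid INTEGER, mid INTEGER);
        INSERT INTO Actor VALUES (1, 'Viterelli', 'Joe'), (2, 'Winter', 'Alex'),
                                 (3, 'Crystal', 'Billy');
        INSERT INTO Movie VALUES (11, 'Analyze This', 1999),
                                 (32, 'See Spot Run', 2001),
                                 (40, 'Bill and Ted''s Excellent Adventure', 1988);
        INSERT INTO Appearance VALUES (1, 11), (1, 32), (2, 40), (3, 11);
    """)
    return conn


def new_actor(root, lname, fname):
    actor = ET.SubElement(root, "Actor")
    ET.SubElement(actor, "Lname").text = lname
    ET.SubElement(actor, "Fname").text = fname
    return actor


def publish_interleaved(conn):
    """The pattern to avoid: one additional SQL query per actor."""
    root, queries = ET.Element("Actors"), 1
    actors = conn.execute(
        "SELECT aid, lname, fname FROM Actor WHERE lname > 'M' ORDER BY aid"
    ).fetchall()
    for aid, lname, fname in actors:
        actor = new_actor(root, lname, fname)
        queries += 1
        for year, title in conn.execute(
            "SELECT M.year, M.title FROM Movie M, Appearance AP "
            "WHERE AP.aid = ? AND M.mid = AP.mid ORDER BY M.mid", (aid,)
        ):
            ET.SubElement(actor, "Movie", year=str(year)).text = title
    return root, queries


def publish_merged(conn):
    """Two sorted queries, merged in one pass while tagging."""
    root = ET.Element("Actors")
    movies = conn.execute(
        "SELECT A.aid, M.year, M.title FROM Actor A, Appearance AP, Movie M "
        "WHERE A.lname > 'M' AND AP.aid = A.aid AND M.mid = AP.mid "
        "ORDER BY A.aid, M.mid"
    ).fetchall()
    pos = 0
    for aid, lname, fname in conn.execute(
        "SELECT aid, lname, fname FROM Actor WHERE lname > 'M' ORDER BY aid"
    ):
        actor = new_actor(root, lname, fname)
        # Both inputs are sorted on aid, so a single forward scan suffices.
        while pos < len(movies) and movies[pos][0] == aid:
            _, year, title = movies[pos]
            ET.SubElement(actor, "Movie", year=str(year)).text = title
            pos += 1
    return root, 2


conn = build_db()
slow_xml, slow_queries = publish_interleaved(conn)
fast_xml, fast_queries = publish_merged(conn)
print(slow_queries, fast_queries)
print(ET.tostring(fast_xml, encoding="unicode"))
```

Both functions produce the same document; the difference is that the number of SQL statements in the first grows with the number of actors, while the second stays fixed at one statement per building-block query.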
Example Solution
- Building-block queries Q1 to Q4 attached to the <Actor> view tree, as on the previous slide
- "Unified Strategy" = (Q1 leftjoin Q2) ∪ (Q1 leftjoin Q3) ∪ (Q1 leftjoin Q4), where ∪ is an outer union

Unified Strategies
[Diagram: view tree with Q1(u) at the root and children Q2(u,v) (1), Q3(u,w) (*), Q4(u,x) (*); the single result table has columns u, v, w, x]
- Universal relation views (e.g. SQL Server Universal XML) translate naturally to this
- Trivial work to tag into XML in one pass down the table
- Maximum leverage of the query optimizer
- Output table is wide, deep, and sparse
- Depending on the DB implementation, may be costly in space and time

Example Solution
- Same building-block queries Q1 to Q4
- "Fully Partitioned Strategy" = do Q1 to Q4 separately, merge while tagging

Fully-Partitioned Strategies
[Diagram: same view tree; four separate result tables with columns (u), (u, v), (u, w), (u, x)]
- No outer joins or unions
- Merge-join in one pass during XML generation

Searching for Solutions
- Identify a solution with a partition of the view tree.
- Find the "best" partition: the number of partitionings is exponential
[View tree: <Actor> with children <Manager>, <Award>, <Movie>; <Movie> with children <Director>, <Gross>; <Gross> with children <VideoSales>, <TheatreSales>; edges labeled 1 or *]

Optimization Algorithms
- Research systems attempt to heuristically solve the optimization problem of finding the best partition
  - SilkRoute
  - PRATA
- Use the RDBMS as an "oracle" of query cost when evaluating a partition
- Schema impact:
  - Shallow, no recursion: fixed set of SQL queries to be produced
  - With recursion or large nesting: optimization is an open research problem

Querying a View
- The wrapper layer must respond to requests by generating SQL requests at runtime and tagging the results
[Diagram: a request (XQuery/XPath) and the view definition enter the query composer, which produces a composed XQuery; the SQL generator, using query-cost estimates and source capabilities, emits SQL queries; the tagger turns the result tables into XML]

Querying a View
- Advantage of XQuery view specs: easy to compose a query with the view
[Diagram: request (XQuery/XPath) -> query composer -> query generator]

Query Composition: query ∘ view = ?

View: Movies by Viterelli

    for $aid in DB//ActorRow[@lname="Viterelli"]/@aid
    return
      <Actor>
        <Fname>Joe</Fname>
        <Lname>Viterelli</Lname>
        {
          for $actapp in DB//AppearRow[@aid = $aid]
          for $movie in DB//MovieRow[@mid = $actapp/@mid]
          return <Movie year="{ $movie/@year }">{ string($movie/@title) }</Movie>
        }
      </Actor>

Query: Get each Actor + Movies in 1999

    for $act in //Actor
    return
      <Actor>
        { $act/Lname }
        {
          for $movie in $act/Movie[@year = 1999]
          return <Movie>{ $movie/text() }</Movie>
        }
      </Actor>

Composed Query: Movies by Viterelli in 1999

Query Composition
Composed Query on Canonical XML:

    for $aid in DB//ActorRow[@lname="Viterelli"]/@aid
    return
      <Actor>
        <Lname>Viterelli</Lname>
        {
          for $actapp in DB//AppearRow[@aid = $aid]
          for $movie in DB//MovieRow[@mid = $actapp/@mid and @year = 1999]
          return <Movie>{ string($movie/@title) }</Movie>
        }
      </Actor>

- Efficient query composition involves:
  - substitution
  - filtering
  - pattern matching

Issues
- Query composition can generate much smaller queries than a pure export
  - Applications where users generally want only a small portion of the document
- Query composition can generate even larger queries than a pure export
  - Complex documents -> enormous SQL queries
  - DB optimizers crash
- For XSLT it is not clear how to do composition

Publishing as DOM
- Focus of most publishing systems:
  - Fulfill queries from a persistent relational store to serialized XML output
- One research system, ROLEX, fulfills DOM or query requests against the view by returning a DOM
[Diagram: the application sends a user query or DOM call to the ROLEX middleware, which sits over the DB and returns a virtual DOM]

ROLEX
- Needn't return the entire tree:
  - Optimize based on the user navigation profile
[Diagram: as above, with only the local part of the result DOM materialized behind the virtual DOM]

Publishing Summary
- Goal: transparent access
- Huge variety of mappings: need a flexible view definition language
- Application dependence:
  - How much transparency do you need?
  - What interfaces do you need supported?
  - Vendor support is still evolving
  - Hand wrapping is still common
- Schema dependence:
  - Consider the size of the SQL queries generated by the wrapper/vendor
  - Beware: deep XML views over normalized tables tend to require large joins

XML & Data Management
Part III: Storage

XML Data Management
[Diagram: XML producers and consumers exchange XML documents and schemas through XML interfaces (API or query). Legacy (non-XML) databases publish XML; XML with its schema is stored in an XML store or a persistent (non-XML) database]

XML Storage Architecture
- Logical layer: XPath, XQuery, DOM, XML
- Physical layer: relational database, OO database, LDAP, file system, native storage
- XML is storage agnostic

XML Storage Issues
- Data layout
  - Many alternatives!
  - Flexible and dynamic: both values and structure may change
  - Schema may not be available
- Query support
  - FLWR queries, keyword-based searches, SELECT/PROJECT/JOIN, recursion and document construction
- Indexing
  - Support fast access for full-text, value-based and navigation queries
  - Need full-text, value and structural indexes
- General requirements: scalability, recovery, concurrency control, etc.

Storing XML
- XML is flexible: used in different applications
- There is no one-size-fits-all solution
- The best choice depends on the application!
- Important questions:
  - What is the data like?
    Flat vs. structured vs. mixed; large vs. small; schema vs. schemaless; ordered vs. unordered
  - What are the queries like?
    Read-only vs. updates; full-text vs. relational vs. navigation
  - What are the application requirements?
    Support for transactions; concurrency control; replication; etc.
Agenda
- Storage choices: Overview
- Native XML Databases
  - Issues
  - Systems and Techniques
- Colonial Strategies
  - Issues
  - Systems and Techniques
- XML storage in commercial relational databases
- Summary and remarks

Storage Choices
- Flat streams
- Native
- Colonial

Flat Streams
- Store XML documents as is in text files or CLOBs
- Fast for storing and retrieving whole documents
- Query support: limited
  - Navigational queries require parsing
  - Full-text queries require indexes
- No localized updates

Colonial
- Re-use existing storage systems
  - Leverage mature systems
  - Simple integration with legacy data
- Map the XML document into underlying structures
  - E.g., shred the document into flat tables
- Slow reconstruction of the textual representation
- Query language mismatch
- Mapping overheads

Native
- New databases designed specifically for XML
- XML documents stored as is
- Efficient support for XML queries
- May need to build new systems from the ground up or adapt existing systems
  - Re-design features for XML (isolation, recovery, etc.)
  - May have incomplete support for some general data management tasks

Native Storage
- Goal: build high-performance systems specifically designed to manage XML data
[Diagram: XML documents and XML queries/APIs at the top; access methods and indexes in the middle; physical design mapping to disk pages at the bottom]
Native Approaches
- Re-think the data management problem in light of XML
  - Retool existing systems to handle XML
  - Build systems from scratch
- Problems addressed:
  - XML-specific: data layout, query support and indexing
  - General: access control; transactions; recovery; …

Native Issues: Data Layout
- Requirements
  - Concise representation of documents
  - Efficient support for XML APIs and query languages
  - Ability to update values and structure
- Map trees into physical disk pages
  - Lots of choices
- Cluster subtrees
[Figure: imdb tree with two show subtrees. The show with title "Fugitive, The", year 1993 and box_office 183,752,965 sits on one page; the show with title "Seinfeld", year 1993 and seasons 13 sits on another]

Data Layout (cont.)
- Cluster similar elements
[Figure: the same imdb tree, with all title values ("Fugitive, The", "Seinfeld", …) clustered on one page and all year values clustered on another]

Native Issues: Indexing
- A physical layout cannot be optimal for all possible access patterns
  - List the title and year of shows:
    - Clustering shows is best
    - Clustering elements: too many disk accesses
  - List the titles of shows released after 1994:
    - Neither strategy is optimal
- Create additional structures to provide fast access to data
- XML requires different kinds of indexes:
  - Values, structure (navigation), full-text (keyword-based)

Full-Text Indexing

    <imdb>
      <show year="1993">
        <title>Fugitive, The</title>
        <review>…</review>
        …
      </show>
      …
    </imdb>

    Term        Refs
    1993        …
    Fugitive    …

- Find all documents where "Fugitive" occurs
- Find the shows where title contains "Fugitive"

XML-Aware Full-Text Indexing

    Structure index             Value index
    Element    Child of         Term        Element (refs)
    show       imdb             Fugitive    &t1
    title      show             1993        &y1
    year       show

[Figure: imdb tree with node references. show &s1 has title &t1 "Fugitive, The", year &y1 1993 and box_office 183,752,965; show &s2 has title &t2 "Seinfeld", year &y2 1994 and seasons 13; subtrees are laid out on separate pages]

Native Systems: Summary

                                                           Query support
    System     Origin                APIs                    Full Text    XPath    XQuery
    Xyleme     Built from scratch    ?                       Yes          Yes      Yes
    Natix      Built from scratch    Low-level primitives    N/a          N/a      N/a
    Xindice    Built from scratch    XML:DB, XML-RPC         No           Yes      No
    eXcelon    OODB                  DOM/XSLT                Yes          Yes      Yes
    Tamino     Adabas                DOM/SAX                 Yes          Yes      Partial
    GoXML      Built from scratch    ?                       Yes          Yes      Yes

- Wide variation in supported features

NatiX
- Focus: data layout
  - Efficient storage for trees
  - Minimize I/O for direct access and scanning
  - Support updates
- Low-level storage primitives
  - Primitives to control the layout of related elements on disk
  - Support for read/write/insert/delete operations on elements

Xyleme
- Data layout: based on NatiX
- Indexing: sophisticated indexing of text and elements
- Query support: XPath, XQuery, updates
- More than just storage: a data warehouse for XML content
  - Document classification
  - Data/schema integration
  - Web crawling
  - Document monitoring

eXcelon XIS
- Extends ObjectStore, an object-oriented database
- Data layout: stores parsed nodes (accessible via DOM)
- Indexing: value, text, structural
- Query support: DOM, XSLT, XPath, XQuery, updates
- Other features:
  - Arbitrary XML documents: no schema or DTD required
  - Can enforce schemas
  - Triggers; transactions; distributed caching mechanism

Software AG Tamino
- Extends Adabas (nested relations)
- Indexing: full-text, value, structure
- Query support:
  - Full-text search operators
  - Queries return the entire document or a projection of the document: no construction of new XML values
  - DOM and SAX
- Other features:
  - Transactions; triggers; backup/restore; compression
  - Multimedia documents, e.g., graphics, video

Other Native Systems
- Xindice (http://xml.apache.org/xindice/)
  - Query support: XPath for its query language and XML:DB XUpdate for its update language
  - APIs: XML:DB API for Java development; other languages using an available XML-RPC plugin
- GoXML
  - Query support: XQuery, full-text searching
  - Updates: tree insert, replace and delete

Colonial Storage
[Diagram: XML documents and XML queries pass through a mapping layer that turns them into data and colonial queries for an RDBMS, an object-oriented database, or an LDAP directory]

Colonial Issues
- Storage design: map the XML data model onto the storage model
  - XML data model -> graph, relations, objects
- Data loading: load the XML document into the mapped structure
  - XML document -> edges, tuples, objects
- Query translation: turn queries over the XML document into queries over the mapped document
  - XQuery, XPath -> SQL, LDAP, OQL
- Result translation: turn results into XML

Storing XML in RDBMSs
[Diagram: a translation layer over a commercial RDBMS. Storage design maps the XML Schema to a relational schema; data loading turns XML documents into tuples; query translation turns an XQuery query into a SQL query and the relational result into XML results]
Example: Storage Design
[Schema graph: imdb -> show*, actor*; show -> title, year?, reviews* (wildcard content, shown as tilde), and box_office | seasons]

    TABLE Movies            TABLE TVShows          TABLE Reviews
    (show1_id    INT,       (show2_id  INT,        (review_id    INT,
     title       STRING,     title     STRING,      tilde        STRING,
     year        INT,        year      INT,         review       STRING,
     box_office  INT)        seasons   INT)         parent_Show  INT)

Example: Data Loading

    <imdb>
      <show year="1993">
        <!-- Example Movie -->
        <title>Fugitive, The</title>
        <review>
          <suntimes>
            <reviewer>Roger Ebert</reviewer> gives
            <rating>two thumbs up</rating>! A fun action movie,
            Harrison Ford at his best.
          </suntimes>
        </review>
        <review>
          <nyt>The standard Hollywood summer movie strikes back.</nyt>
        </review>
        <box_office>183,752,965</box_office>
      </show>
      ...
    </imdb>

    INSERT INTO Movies (year, title, show1_id, box_office)
    VALUES (1993, 'Fugitive, The', 10927, 183752965)

    INSERT INTO Reviews (tilde, reviewer, rating, review, parent_Show)
    VALUES ('suntimes', 'Roger Ebert', 'two thumbs up',
            'A fun action movie, Harrison Ford at his best.', 10927)

    INSERT INTO Reviews (tilde, review, parent_Show)
    VALUES ('nyt', 'The standard Hollywood summer movie strikes back.', 10927)

Example: Query Translation
- Find the title, year, box office proceeds and reviews for all 2001 movies

    XQuery:
      for $v in document("imdbdata")/imdb/show
      where $v/year = 2001
      return ($v/title, $v/year, $v/box_office, $v/reviews)

    SQL:
      SELECT title, year, box_office, review
      FROM Movies, Reviews
      WHERE show1_id = Reviews.parent_Show AND year = 2001
      ORDER BY show1_id

Relational Storage Design
- There are different classes of mappings:
  - User-defined: the user specifies the mapping
  - Generic: fixed
  - Data-driven: mapping inferred from data
  - Schema/DTD-driven: mapping inferred from DTD or schema
  - Cost-based: mapping inferred from schema, query workload and data
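
The data-loading and query-translation examples above can be tried end to end with a small script. This is a sketch under stated assumptions rather than any system's actual loader: it uses an in-memory SQLite database, a single `show_id` counter shared by both show tables, a simplified Reviews table without reviewer and rating columns, and the hypothetical helper names `load` and `movies_with_reviews`. The Seinfeld row is sample data borrowed from the index slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

IMDB_XML = """
<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review><suntimes><reviewer>Roger Ebert</reviewer> gives
      <rating>two thumbs up</rating>! A fun action movie, Harrison Ford at his best.</suntimes></review>
    <review><nyt>The standard Hollywood summer movie strikes back.</nyt></review>
    <box_office>183,752,965</box_office>
  </show>
  <show year="1994">
    <title>Seinfeld</title>
    <seasons>13</seasons>
  </show>
</imdb>
"""

SCHEMA = """
CREATE TABLE Movies  (show_id INTEGER PRIMARY KEY, title TEXT, year INTEGER, box_office INTEGER);
CREATE TABLE TVShows (show_id INTEGER PRIMARY KEY, title TEXT, year INTEGER, seasons INTEGER);
CREATE TABLE Reviews (review_id INTEGER PRIMARY KEY, tilde TEXT, review TEXT, parent_show INTEGER);
"""


def load(conn, xml_text):
    """Shred each <show> into Movies or TVShows, and its reviews into Reviews."""
    root = ET.fromstring(xml_text)
    for show_id, show in enumerate(root.findall("show"), start=1):
        title = show.findtext("title")
        year = int(show.get("year"))
        if show.find("box_office") is not None:
            # A show with box office proceeds is treated as a movie.
            box_office = int(show.findtext("box_office").replace(",", ""))
            conn.execute("INSERT INTO Movies VALUES (?, ?, ?, ?)",
                         (show_id, title, year, box_office))
        else:
            conn.execute("INSERT INTO TVShows VALUES (?, ?, ?, ?)",
                         (show_id, title, year, int(show.findtext("seasons"))))
        for review in show.findall("review"):
            wrapper = review[0]  # the wildcard element, e.g. <suntimes> or <nyt>
            text = " ".join("".join(wrapper.itertext()).split())
            conn.execute(
                "INSERT INTO Reviews (tilde, review, parent_show) VALUES (?, ?, ?)",
                (wrapper.tag, text, show_id))


def movies_with_reviews(conn, year):
    """SQL counterpart of the XQuery: title, year, box office, reviews for one year."""
    return conn.execute(
        "SELECT M.title, M.year, M.box_office, R.review "
        "FROM Movies M LEFT JOIN Reviews R ON R.parent_show = M.show_id "
        "WHERE M.year = ? ORDER BY M.show_id, R.review_id", (year,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load(conn, IMDB_XML)
rows = movies_with_reviews(conn, 1993)
for row in rows:
    print(row)
```

The result has one row per review, with the movie columns repeated; a tagger would still have to group those rows back into one element per movie.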
User-Defined Mappings

- Supported by most commercial RDBMS
- User specifies how to map elements to tables
- Flexible mapping but… there are drawbacks:
  - Requires knowledge of XML and relational technology
  - Many different mappings: hard to choose the best for an application
  - Data changes → need to update mapping

Generic Mapping: Edge

[Diagram: data graph. Root &0 has two show children, &1 and &2. Node &1 has children @year (&3), title (&4), two review nodes (&5, &6), and box office. The reviews contain suntimes (&7) and nytimes nodes, with rating nodes beneath.]

Edge Table

    source  child no.  tag       target
    &0      1          show      &1
    &0      2          show      &2
    &1      1          year      &3
    &1      2          title     &4
    &1      3          review    &5
    &1      4          review    &6
    &5      1          suntimes  &7

Value Table

    node  value
    &3    1994
    &4    Fugitive, The

Find titles for all shows:

    SELECT Value.value
    FROM Value, Edge AS E1, Edge AS E2
    WHERE E1.tag = 'show'
      AND E1.target = E2.source
      AND E2.tag = 'title'
      AND E2.target = Value.node

Generic Mapping: Tag-Based

[Diagram: same data graph as on the Edge slide.]

Show Table

    source  ordinal  target
    &0      1        &1
    &0      2        &2

Title Table

    source  ordinal  target
    &1      2        Fugitive, The

Review Table

    source  ordinal  target
    &1      3        &5
    &1      4        &6

Find titles for all shows:

    SELECT Title.target
    FROM Title, Show
    WHERE Show.target = Title.source

Generic Mappings: Summary

- Ignore regularity in structure
- Canonical relational schema
  - Edge: store all edges in one table
  - Attribute: horizontal partition of Edge relation on element tag
- Querying: requires multi-table joins or self joins for element reconstruction

Schema-Driven: Shared Inlining

    <!ELEMENT imdb (show*, …)>
    <!ELEMENT show (title, year?, reviews*, (box_office | (episode*, seasons)))>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT year (#PCDATA)>
    <!ELEMENT review (#PCDATA)>
    …

[Diagram: DTD graph. show → title, year?, reviews*, (box_office | (episode*, seasons)); episode → title.]

    Show     (ID: Int, year: Str, box_office: Str, seasons: Str)
    Reviews  (ID: Int, parentID: Int, parentCODE: Str, review: Str)
    Episode  (ID: Int, parentID: Int, parentCODE: Str)
    Title    (ID: Int, parentID: Int, parentCODE: Str, title: Str)

Find titles for all shows:

    SELECT title
    FROM Show, Title
    WHERE Title.parentID = Show.ID

Schema-Driven: Hybrid Inlining

    <!ELEMENT imdb (show*, …)>
    <!ELEMENT show (title, year?, reviews*, (box_office | (episode*, seasons)))>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT year (#PCDATA)>
    <!ELEMENT review (#PCDATA)>
    …

[Diagram: same DTD graph as on the Shared Inlining slide.]

    Show     (ID: Int, title: Str, year: Str, box_office: Str, seasons: Str)
    Reviews  (ID: Int, parentID: Int, parentCODE: Str, review: Str)
    Episode  (ID: Int, parentID: Int, parentCODE: Str, title: Str)

Find titles for all shows:

    SELECT title
    FROM Show

Schema-Driven: Summary

- Use DTD/XML Schema to decompose document
- Shared/Hybrid
  - Rule of thumb: inline as much as possible to minimize number of joins
  - Shared: do not inline if shared, set-valued, or recursive
  - Hybrid: also inline if shared but not set-valued or recursive
- Querying:
  + Fast lookup and reconstruction of inlined elements
  - Reconstruction may require multi-table joins and unions

Data-Driven: STORED

- Schemaless data
- Analyze data and try to infer a schema graph: "mine" data for common (regular) patterns with high support
- Example:
  - Discover from IMDB data that every show has year and title
  - Create a table for show that contains year and title
- Use generic mapping for patterns that are irregular and have low support
- Querying: use derived mapping definition to automatically translate queries

More mappings…

[Schema diagram: show → title, year?, reviews* (→ tilde), (box_office | seasons).]

There are many alternative mappings!

(I) Inline as many elements as possible

    TABLE Show (show_id INT, title STRING, year INT, box_office INT, seasons INT)
    TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT)

(II) Partition reviews table: one for NYT, one for rest

    TABLE Show (show_id INT, title STRING, year INT, box_office INT, seasons INT)
    TABLE NYTReview (review_id INT, review STRING, parent_Show INT)
    TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT)

(III) Split Show table into TV and Movies

    TABLE Show1 (show1_id INT, title STRING, year INT, box_office INT)
    TABLE Show2 (show2_id INT, title STRING, year INT, seasons INT)
    TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT)

Mappings and Performance Implications

[Chart: mappings (I), (II), and (III) compared on queries Q1, Q2, Q3, Q4 and workloads W1, W2; vertical axis from 0 to 1.4.]

- Performance depends on data, schema and query workload
- A fixed mapping is unlikely to be the best for all applications

The LegoDB Storage Mapping Engine

- Application-driven shredding
- Automatically generates and explores a space of possible mappings
  - Uses information from schema, data statistics and query workload
  - Uses a standard relational optimizer to evaluate cost of mappings
- Selects the mapping which has the lowest cost for a given application
- XQuery automatically translated at runtime

Colonial Techniques: Summary

    Strategy                Mapping                   DTD/Schema  Data  Query workload
    User defined            Manual                    no          no    no
    Generic                 Automatic/fixed           no          no    no
    STORED                  Automatic/data-oriented   no          yes   no
    Shared/Hybrid Inlining  Automatic/DTD-based       yes         no    no
    LegoDB                  Automatic/cost-based      yes         yes   yes
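
Of the strategies summarized above, the generic Edge mapping is the easiest to try out, since it needs no schema. The following is a minimal sketch, assuming Python with the standard sqlite3 and xml.etree.ElementTree modules. The function names load_edges and show_titles, the sample document, and the second show title ("Example Show") are hypothetical and used only for illustration. Node ids are assigned in document order, so they differ from the &n labels on the slide, and text inside mixed-content elements is ignored for brevity.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; only the first title comes from the slides.
DOC = """<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review>The standard Hollywood summer movie strikes back.</review>
  </show>
  <show>
    <title>Example Show</title>
  </show>
</imdb>"""


def load_edges(doc, conn):
    """Store every parent/child edge in Edge and every leaf text in Value."""
    conn.execute("CREATE TABLE Edge "
                 "(source INT, ordinal INT, tag TEXT, target INT)")
    conn.execute("CREATE TABLE Value (node INT, value TEXT)")
    ids = itertools.count(1)  # the root element is node 0

    def visit(elem, node_id):
        ordinal = 0
        # Attributes become edges tagged '@name' that point at a value node.
        for name, val in elem.attrib.items():
            ordinal += 1
            target = next(ids)
            conn.execute("INSERT INTO Edge VALUES (?, ?, ?, ?)",
                         (node_id, ordinal, "@" + name, target))
            conn.execute("INSERT INTO Value VALUES (?, ?)", (target, val))
        # Child elements become edges tagged with the element name.
        for child in elem:
            ordinal += 1
            target = next(ids)
            conn.execute("INSERT INTO Edge VALUES (?, ?, ?, ?)",
                         (node_id, ordinal, child.tag, target))
            visit(child, target)
        # Leaf elements keep their text in the Value table.
        if len(elem) == 0 and elem.text and elem.text.strip():
            conn.execute("INSERT INTO Value VALUES (?, ?)",
                         (node_id, elem.text.strip()))

    visit(ET.fromstring(doc), 0)


def show_titles(conn):
    """'Find titles for all shows': one self join on Edge plus a Value join."""
    rows = conn.execute(
        "SELECT Value.value FROM Value, Edge AS E1, Edge AS E2 "
        "WHERE E1.tag = 'show' AND E1.target = E2.source "
        "AND E2.tag = 'title' AND E2.target = Value.node "
        "ORDER BY E1.ordinal").fetchall()
    return [value for (value,) in rows]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    load_edges(DOC, db)
    print(show_titles(db))
```

Even this one-step path query needs a self join on Edge; each additional step in a path adds another join, which is the query cost that the inlining and cost-based strategies try to reduce.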
Agenda

- Storage choices: Overview
- Native XML Databases
  - Issues
  - Systems and Techniques
- Colonial Strategies
  - Issues
  - Systems and Techniques
- XML storage in commercial relational databases
- Summary and remarks

Commercial Systems

- Oracle 9i
- IBM DB2
- Microsoft SQL Server

Not a comprehensive survey!

Oracle 9i: Schema Design

- Store XML documents in CLOBs (character large objects) or BFILEs
  - Indexing: full-text
- Canonical mapping into object-relational tables
  - Tag names are mapped to column names
  - Elements with text only map to scalar columns
  - Elements with sub-elements map to object types
  - Lists of elements map to collections
  - Indexing: standard relational
- Hybrid: user-defined
  - Canonical for structured fragments; CLOBs/BFILEs for unstructured fragments

Oracle 9i (cont.)

- Data loading: multiple mechanisms
  - PL/SQL, custom code (Java, C++), SQL*Loader, …
- Query support:
  - CLOBs: SQL + Oracle Text; XPath
  - Canonical: SQL
  - XSU for publishing results in XML format
  - No support for XQuery

IBM DB2

- User-defined mapping through DAD (Document Access Definition)
- XML Collections: declarative decomposition of XML into multiple tables
  - Data loading: follows DAD mapping
  - Query support: SQL
- XML Columns: CLOBs + side tables for indexing individual elements
  - Documents stored in XML columns
  - Side tables used for hot elements and attributes
  - Query support: full-text search (DB2 Text Extender); extract, search, update elements and attributes

MS SQL Server

- Edge Table: Edge + inlined scalar values
- Shredded Rowset: programmatic decomposition of XML into multiple tables
- Annotated schema
- CLOB
- Data loading:
  - Combine INSERT and OPENXML
  - OPENXML: access to XML data as a relational rowset
- Query support:
  - SQL
  - SQL extensions for publishing results as XML (FOR XML clause)

Commercial Systems: Summary

- Data design:
  - CLOBs
  - Fixed canonical mappings
  - User-defined mappings
- Querying:
  - SQL as the main access method to XML documents; no support for XQuery
  - "XML-aware" extensions to SQL
    - E.g., limited XPath navigation syntax
    - Publish results in XML

Commercial Systems: Summary

    System      Data design                        Loading                       Query support
    Oracle 9i   CLOB /                             PL/SQL,                       Full-text, XPath
                Canonical OR (object-relational) / Java, C++,                    SQL
                User-defined hybrid                SQL*Loader
    DB2         CLOB + side tables /               DAD-driven                    SQL + full text
                User-defined (DAD)                                               SQL
    SQL Server  Edge /                             OPENXML                       SQL + full text
                User-defined shredded rowset /     Annotated schema: bulk load,  XPath
                Annotated schema                   updategrams

No support for XQuery, or updates via DOM

Agenda

- Storage choices: Overview
- Native XML Databases
  - Issues
  - Systems and Techniques
- Colonial Strategies
  - Issues
  - Systems and Techniques
- Commercial Solutions
- Summary and remarks

Update Support

- XQuery does not support updates (yet…)
- How to update?
  - Flat streams: overwrite document
  - Colonial: SQL
  - Native: DOM, proprietary APIs
- But how do you know you have not violated schema?
  - Flat streams: re-parse document
  - Colonial: need to understand the mapping and maintain integrity constraints
  - Native: supported in some systems (e.g., eXcelon)

Summary of Storage Techniques

    Native                                     Colonial
    Support XQuery                             SQL + extension
    Performance?                               Overheads for translation (not in commercial systems);
                                               supporting order can be expensive
    Store and query docs as is                 Data/query conversion layer is required
    Built from the ground up, less mature      Mature systems
    Extensible: no schema or DTD needed        May require changes to schema in order to support new tags
    Need translation to interoperate           Easy to interoperate with legacy data

How do I choose the best storage solution for my XML application?

Match application requirements with vendors' supported features:
- Updates via DOM?
- XPath/XQuery support?
- Relational interfaces?
- Distribution, concurrency control, archiving, application-development tools…

References

Native
- Xindice - http://xml.apache.org/xindice
- http://www.xmldb.org/index.html
- Natix - http://www.dataexmachina.de/natix.html
  Carl-Christian Kanne, Guido Moerkotte: Efficient Storage of XML Data. ICDE 2000: 198
- Xyleme - http://www.xyleme.com
- Excelon - http://www.exceloncorp.com
- Tamino - http://www.softwareag.com/tamino
- http://www.xyzfind.com
- GoXML - http://www.xmlglobal.com/prod/db
- XUpdate - http://www.xmldb.org/xupdate/
- XML:DB - http://www.xmldb.org/xapi/

Colonial
- TOX - The Toronto XML engine - http://www.cs.toronto.edu/tox
- Daniela Florescu, Donald Kossmann: Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin 22(3): 27-34 (1999)
- Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314
- Alin Deutsch, Mary F. Fernandez, Dan Suciu: Storing Semistructured Data with STORED. SIGMOD Conference 1999: 431-442
- Philip Bohannon, Juliana Freire, Prasan Roy, Jérôme Siméon: From XML Schema to Relations: A Cost-based Approach to XML Storage. ICDE 2002
- IBM DB2 XML Extender - http://www3.ibm.com/software/data/db2/extenders/xmlext/library.html
- Oracle XML DB - http://technet.oracle.com/tech/xml/content.html
- Informix - http://www.informix.com/xml/
- SQL Server - http://www.microsoft.com/sql/techinfo/xml/default.asp