XML: The Big Picture and Some Gory Details
(A brief tutorial with an eye towards e-records and archival)

Bertram Ludaescher
ludaesch@sdsc.edu
Data Intensive Computing Environments (DICE) Group
San Diego Supercomputer Center, UCSD

DICE Members

Staff
• Reagan Moore
• Chaitan Baru
• Amarnath Gupta
• Bertram Ludäscher
• Richard Marciano
• Arcot Rajasekar
• Wayne Schroeder
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu
• + NN * 4

Students
• Pratik Mukhopadhyay
• Azra Mulic
• Kevin Munroe
• Paul Nguyen
• Michail Petropolis
• Nicholas Puz
• Pavel Velikhov
• +/- NN

XML Tutorial, Bertram Ludäscher

Tutorial Outline
• Roadmap & Overview
• What about XML vs. E-records and Archives? (or: why it’s good to be here ;-)
• XML 101
• XML 232
• Querying & Transforming XML
• Mediation of Information using XML (MIX)
• Other Projects...

Some History (or: from fat via lean…
• SGML (Standard Generalized Markup Language)
  – ISO Standard, 1986, for data storage & exchange
  – Metalanguage for defining languages (through DTDs)
  – A famous SGML language: HTML!!
  – Separation of content and display
  – Used in U.S. government & contractors, large manufacturing companies, technical information publishers, ...
  – SGML reference is 600 pages long
• XML (eXtensible Markup Language)
  – W3C (World Wide Web Consortium, http://www.w3.org/XML/) recommendation in 1998
  – Simple subset (80/20 rule) of SGML: “ASCII of the Web”, “Semantic Web”
  – XML specification is 26 pages long

… to skinny and back!)
• Canonical XML
  – “normalization”, equivalence testing of XML documents
• SML (Simple Markup Language)
  – “Reduce to the max”: No Attributes / No Processing Instructions (PI) / No DTD / No non-character entity-references / No CDATA marked sections / Support for only UTF-8 character encoding / No optional features
• XML Schema
  – XML Schema definition language
  – Back to complex:
    • Part I (Structures), Part II (Data Types), Part III, um, 0 (Primer)
• X-Zoo (Xoo?), “Brave New X-World”
  • Specifications: CSS • Digital Signatures • ebxml Project Teams • ebXML • IETF Specifications • Internationalization • IOTP (Internet Open Trading Protocol) • OASIS • Requirements Documents • SMIL • SVG (Scalable Vector Graphics) • Topic Maps • W3C Activity Pages • W3C Notes • W3C Standards • W3C Standards-in-progress • WAP • WebDAV • XHTML • XLink • XPath • XSLT
  • Vocabularies: DTDs • Music • P3P • RDF • RSS • SMIL • W3C Standards • W3C Standards-in-progress • WML • XHTML • XSL FO's • XSLT • XUL
  • Vertical Industries: Advertising • Commerce • Consortiums • Construction • Food • Insurance • Legal • Medical • Music • OASIS • Real Estate • Science • Space Exploration • Telecommunications • Travel • Weather

… but … FEAR NOT!

Back to the Future (or Archival for the Past...)
A time traveler sends a message in the virtual bottle, containing parts of the universal library of human and post-human mankind, back into the last third of the 20th century...
• ... when the Web, XML, WAP, B2B, and Petabytes were unheard of
• ... RAM was so precious that it was ok to deal with nibbles
• ... MS-DOS was still called CP/M
• ... and in fact Bill hadn’t moved into the garage yet but worked on a homework assignment by Christos, trying to sort pancakes faster (Gates, W.H. and Papadimitriou, C. "Bounds for Sorting by Prefix Reversal." Discr. Math. 27, 47-57, 1979.)
• Task: make sense out of the futuristic message in the past!
XML Tutorial, Bertram Ludäscher 7 Our past futurist’s (future archeologist’s?) supercomputer looked like this … 62k CP/M VER 2.23 (Z80/DJDMA/VT100) A>dir A: ARK COM : ASM A: CPM2 HLP : CBIOS A: DDTZ COM : DUMP A: ERAQ COM : FORMAT A: HELP HLP : LIB A: LOAD COM : LS A: LU HLP : MAC A: MOVCPM COM : PIP A: PUTCPM ASM : PUTCPM A: STAT COM : SUBMIT A: THISSIM HLP : UNARK A: UNZIP COM : USQ A: MBASIC HLP : MBASIC A>mbasic BASIC-80 Rev. 5.22 [CP/M Version] 32783 Bytes free Ok COM ASM COM ASM COM COM COM COM COM COM COM COM COM : : : : : : : : : : : : : CLS CBOOT ED FORMAT LINK LT MAC PTRDSK SAP SURVEY UNCR VDE WS COM ASM COM COM COM COM HLP ASM COM COM COM COM HLP : : : : : : : : : : : : COPY DDT EDFILE HELP LINK LU MOUNT PTRDSK SQ SYSGEN UNERASE XSUB ASM COM COM COM HLP COM ASM COM COM SUB COM COM Ever wondered where the 8 letter filenames, 3 letter extensions came from? ;-) XML Tutorial, Bertram Ludäscher 8 Message in the bottle: 1 • • • ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ ^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@ ^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@ q^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@ ^@ ^@Some Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd summer's lease hath all too short a date.^MSometime too hot the eye of heaven shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal summer 
shall not fade,^MNor lose possession of that fair thou owest,^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this gives life to thee.^M1.2 By William II^M[1, p.265]^M\223The obvious mathematical breakthrough would be development of^Man easy way to factor large prime numbers."^MReferences^M[1] W. H. Gates. The Road Ahead. Viking Penguin, 1995.^M[2] W. Shakespeare. The Sonnets of Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^A^@þÿ^C^@^@ÿÿÿÿ^F^B^@^@^@^@^@À^@^@^ @^@^@^@F^X^@^@^@Microsoft Word Document^@^@^@^@MSWordDoc^@^P^@^@^@Word.Document.8^@ô9²q^@^@^@^@^@^@^@^@^ @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^@^@^@^ XML Tutorial, Bertram Ludäscher 9 Message in the bottle: 2 • • • • • • • • • • • • • • • • • • • • • {\rtf1\ansi\ansicpg1252\uc1 \deff0\deflang1033\deflangfe1033{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose02020603050405020304}Times New Roman;}\ {\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}^M {\f17\froman\fcharset238\fprq2 Times New Roman CE;}{\f18\froman\fcharset204\fprq2 Times New Roman Cyr;}{\f20\froman\fcharset161\fprq2 Times New R\ oman Greek;}{\f21\froman\fcharset162\fprq2 Times New Roman Tur;}^M … Some Quotations from the Universal Library^M \par }\pard\plain \s2\sb240\sa60\keepn\widctlpar\outlinelevel1\adjustright \b\i\f1\cgrid {\cgrid0 1 Famous Quotes^M \par }\pard\plain \s3\sb240\sa60\keepn\widctlpar\outlinelevel2\adjustright \f1\cgrid {\cgrid0 1.1 By William I^M \par }\pard\plain \s4\sb240\sa60\keepn\widctlpar\outlinelevel3\adjustright \b\f1\cgrid {\cgrid0 [2, Sonnet XVIII]^M \par }\pard\plain \widctlpar\adjustright \fs20\cgrid {\f1\fs24\cgrid0 Shall I compare thee to a summer's day?^M \par Thou art more lovely and more temperate.^M \par Rough winds do shake the darling buds of May,^M … \par }\pard\plain 
\s3\sb240\sa60\keepn\widctlpar\outlinelevel2\adjustright \f1\cgrid {\cgrid0 1.2 By William II^M \par }\pard\plain \s4\sb240\sa60\keepn\widctlpar\outlinelevel3\adjustright \b\f1\cgrid {\cgrid0 [1, p.265]^M \par }\pard\plain \widctlpar\adjustright \fs20\cgrid {\f1\fs24\cgrid0 \ldblquote The obvious mathematical breakthrough would be development of^M \par an easy way to factor large prime numbers."^M \par }\pard\plain \s2\sb240\sa60\keepn\widctlpar\outlinelevel1\adjustright \b\i\f1\cgrid {\cgrid0 References^M \par }\pard\plain \widctlpar\adjustright \fs20\cgrid {\f1\fs24\cgrid0 [1] W. H. Gates. The Road Ahead. Viking Penguin, 1995.^M \par [2] W. Shakespeare. The Sonnets of Shakespeare. 1609.}{\fs28 ^M \par }} XML Tutorial, Bertram Ludäscher 10 Message in the bottle: 3 %!PS-Adobe-2.0 %%Creator: dvipsk 5.58f Copyright 1986, 1994 Radical Eye Software %%Title: msg.dvi %%Pages: 1 … /X{S N}B /TR{translate}N /isls false N /vsize 11 72 mul N /hsize 8.5 72 mul N /landplus90{false}def /@rigin{isls{[0 landplus90{1 -1}{-1 1} ifelse 0 0 0]concat}if 72 Resolution div 72 VResolution div neg scale … TeXDict begin 39158280 55380996 1000 600 600 (msg.dvi) @start /Fa 16 117 df<0000000001C0000000000003C0000000000003C00000000000 07C000000000000FC000000000000FC000000000001FC000000000001FE000000000003F E000000000003FE000000000007FE00000000000FFE00000000000EFE00000000001EFE0 0000000001CFE000000000038FE000000000038FE000000000070FE000000000070FE0 … %%EndSetup 1 0 bop 659 872 a Ff(Some)44 b(Quotations)f(from)f(the)i(Univ)l(ersal)h (Library)515 1470 y Fe(1)134 b(F)-11 b(amous)45 b(Quotes)515 1669 y Fd(1.1)112 b(By)37 b(William)d(I)515 1822 y Fc([2)o(,)d(Sonnet)h (XVI)s(I)s(I])722 2004 y Fb(Shall)c(I)g(compare)e(thee)i(to)f(a)g (summer's)g(da)n(y?)722 2104 y(Thou)h(art)f(more)f(lo)n(v)n(ely)h(and)g (more)g(temp)r(erate.)722 2204 y(Rough)g(winds)h(do)f(shak)n(e)g(the)h (darling)e(buds)i(of)g(Ma)n(y)-7 b(,)722 2303 y(And)28 b(summer's)g(lease)e(hath)i(all)f(to)r(o)h(short)f(a)g(date.)722 2403 
y(Sometime)h(to)r(o)f(hot)h(the)g(ey)n(e)f(of)h(hea)n(v)n(en)e (shines,)722 2503 y(And)i(often)g(is)g(his)f(gold)g(complexion)g XML Tutorial, Bertram Ludäscher 11 Message in the bottle: 4 • • • • • • • • • • • • • • • • • • • • • • • \documentclass{article} \begin{document} \title{Some Quotations from the Universal Library} ... \section{Famous Quotes} \subsection{By William I} \textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}} \begin{verse} Shall I compare thee to a summer's day?\\ Thou art more lovely and more temperate. \\ Rough winds do shake the darling buds of May, \\ And summer's lease hath all too short a date. \\ Sometime too hot the eye of heaven shines, \\ And often is his gold complexion dimmed. \\ … \qquad So long as men can breathe, or eyes can see,\\ \qquad So long live this, and this gives life to thee. \\ \end{verse} ... \bibliographystyle{abbrv} \bibliography{msg} \end{document} XML Tutorial, Bertram Ludäscher 12 Message in the bottle: 5 • • • • • <HTML> <HEAD> <TITLE>Some Quotations from the Universal Library</TITLE> </HEAD> <BODY> • • • • • • • • • • • • • • • • • • • • • • • <B><FONT FACE="Arial" SIZE=5><P>Some Quotations from the Universal Library</P> </FONT><I><FONT FACE="Arial"><P>1 Famous Quotes</P> </B></I><P>1.1 By William I</P> <B><P>[2, Sonnet XVIII]</P></B> <P>Shall I compare thee to a summer's day?</P> <P>Thou art more lovely and more temperate.</P> <P>Rough winds do shake the darling buds of May,</P> <P>And summer's lease hath all too short a date.</P> <P>Sometime too hot the eye of heaven shines,</P> <P>And often is his gold complexion dimmed.</P> ... <P>So long as men can breathe, or eyes can see,</P> <P>So long live this, and this gives life to thee.</P> <P>1.2 By William II</P> <B><P>[1, p.265]</P> </B><P>&quot;The obvious mathematical breakthrough would be development of</P> <P>an easy way to factor large prime numbers."</P> <B><I><P>References</P> </B></I><P>[1] W. H. Gates. The Road Ahead. 
Viking Penguin, 1995.</P> <P>[2] W. Shakespeare. The Sonnets of Shakespeare. 1609.</P></FONT></BODY> </HTML>

Message in the bottle: 6

<?xml version="1.0"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate. </line>
              <line>Rough winds do shake the darling buds of May, </line>
            </verse>
            …
        <subsection>
          <title>By William II</title>
          <quote bibref="gates-road-ahead-1995">
            <title>Page 265</title>
            <line>``The obvious mathematical breakthrough would be development of an easy way to factor large prime numbers.’’</line>
          </quote>
        </subsection>
      </section>
    </book>
    …
  </books>
</universal_library>

XML as a Self-Describing Format
• can be “understood” using any (archaic CP/M) editor
• can be parsed easily
• contains its own structure (=parse tree) in the data
  => allows the e-archeologist to rediscover schema and content (=semantics!?)
• may also include an explicit schema description (DTD)
  => “meta-model”: definition of a language w.r.t. which it is valid
• allows separation of marked-up content from presentation (=> style sheets)
• as a self-describing format good for “archival into the past”
  => not bad for archival into the future

Some thoughts on how XML can help with e-record management...
• Assumption: represent e-records in XML
  => self-describing format (good for archival)
  => get a semistructured data model (flexible: encode regular tables, nested structures, objects, or even (cleaned up) HTML)
  => many tools (and many more to come, so code can be (re)used): parsers, validators, query languages, storage
  => standards (good for interoperation, integration, etc.):
     generic standards (XML, DTDs, XML Schema, XPath, ...)
     community/industry standards (=specific markup languages)

...thoughts continued
• “E-Record Quality Assurance”:
  – by “subscribing” to a certain XML DTD/XML Schema/XML ???, you can make sure that “the same language is spoken”
  – validation using DTDs provides a first simple quality control:
    • are the right tags used?
    • is the nesting of elements ok w.r.t. the DTD?
    • are the order and multiplicity of elements ok?
  – if you need more => use validation w.r.t. an XML Schema
    • now: check also data types
    • use specialization and other mechanisms from object-oriented modeling
    • more integrity checking possible (cardinalities, …)
  – still want more integrity checks (ICs) or even “policies”?
    => use a declarative rule language for specifying the constraints and policies at design time.
       Implement them at run time, e.g., by adding the ICs to the XML DTD/Schema/…
    => checking ICs and policies is similar to issuing specific queries against the data
    => use query processors (relational DBs, XML DBs, XML tools) for integrity checking when possible
    => for evolution of records, look at versioning models for databases and at temporal database models and query languages

Back to XML: Different Perspectives
• Document (SGML) Community
  – data = linear text documents
  – mark up (annotate) text pieces to describe context, structure, semantics of the marked text
• Database Community
  – XML as a (most prominent) example of the semistructured data model
    => captures the whole spectrum from highly structured, regular data to unstructured data

More Perspectives on XML
• "XML is the cure for your data exchange, information integration, e-commerce, [x-2-y, U name it] problems” (“snake oil/silver bullet theory”)
• "XML is nothing but (another) syntax (for Lisp, trees, …)” (“nothing new under the sun”)

  (books
    (book
      (author “Shakespeare”)
      (title “Sonnets”)
      (verse
        (line “Shall I compare…”)
        (line …)
        …)))

So what is XML (all about)?
Executive Summary:
• XML = HTML – idiosyncrasies (simplified syntax) + user-definable ("semantic") tags
• Separation of data and its presentation
  => simple, very flexible data exchange format: semistructured data model
  => new applications:
    • Information exchange (B2B), sharing (diglib), integration ("mediation"), archival, ...
    • Web site management (XML+XSL stylesheets), ...

Many X-cellent(?) Acronyms...
• XML (Extensible Markup Language)
• XML Namespaces
• XML DTDs, XML Schema
• RDF (Resource Description Framework)
• XSL (Extensible Stylesheet Language)
• XPath (the part shared by XSLT and XPointer), XLink
• XQL, XML-QL (XML Query Language)
• XMAS (XML Matching And Structuring language)
• eXcelon, ...
=> XML++ (i.e.
+= X-tensions) is much more than just syntax
=> a family of technologies (XML extensions, tools, ...)
=> generic standards and industry/community standards

XML Applications & Industry Initiatives
http://www.oasis-open.org/cover/xml.html#applications
• Advertising (adXML): place an ad onto an ad network or to a single vendor
• Literature (Gutenberg): convert the world’s great literature into XML
• Directories (dirXML): Novell’s Directory Services Markup Language (DSML)
• Web Servers (apacheXML): parsers, XSL, web publishing
• Travel (openTravel): information for airlines, hotels, and car rental places
• News (NewsML): creation, transfer and delivery of news
• Human Resources (XML-HR): standardization of HR/electronic recruiting XML definitions
• International Development (IDML): improve the management and exchange of information for sustainable development
• Voice (VoxML): markup language for voice applications
• Wireless (WAP, Wireless Application Protocol): wireless devices on the World Wide Web
• Weather (OMF): Weather Observation Markup Format (simulation)
• Geospatial (ANZMETA): distributed national directory for land information
• Banking (MBA, Mortgage Bankers Association of America): credit report, loan file, underwriting…
• Healthcare (HL7): DTDs for prescriptions, policies & procedures, clinical trials
• Math: MathML (Mathematical Markup Language)
• Surveys (DDI, Data Documentation Initiative): “codebooks” in the social and behavioral sciences

XML E-commerce Initiatives
• CommerceNet
  – eCo Framework: XML specs to support interoperability among e-businesses
  – Commerce One Common Business Library (CBL): set of business components, docs in DTD, XDR, SOX
  – BizTalk: Microsoft spec.
    based on XML schemas
  – cXML (Commerce XML): tag-sets for e-procurement into BizTalk
• Electronic Data Interchange (EDI)
  – RosettaNet: common format for online ordering
  – FpML (Financial products Markup Language): sharing of financial data (interest rate & foreign exchange products)
• Open Buying on the Internet (OBI)
  – OBI
    • high volume b2b purchasing transactions over the Internet (Office Depot, Lockheed, barnesandnoble, AX...

E-commerce and XML
  – VISA Invoices: The Visa Extensible Markup Language (XML) Invoice Specification provides a comprehensive list of data elements contained in most invoices, including: Buyer/Supplier, Shipping, Tax, Payment, Currency, Discount, and Line Item Detail.
• B2B Integration
  – code360: XML-Broker is middleware software that manages XML-based transactions
  – Bluestone XML Suite: enables developers to build and deploy e-commerce, electronic data interchange, application integration and supply chain management applications. Bluestone XML Suite products include: XML-Server, VisualXML, XML-Contact and XwingML.
  – webMethods: provides companies with integrated direct links to buyers and suppliers

What’s Wrong with HTML?
Rendered entry:
  Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina. “Object Fusion in Mediator Systems”. In VLDB 96.
HTML confuses presentation with content:

  <DT> <IMG SRC="greenball.gif" >&nbsp;
  <A NAME="object-fusion"></A>
  Y.Papakonstantinou, S. Abiteboul, H. Garcia-Molina.
  <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">
  "ObjectFusion in Mediator Systems".</A>
  In <I>VLDB 96.</I> </DT>

...What’s Wrong with HTML...
No Explicit Structure, Semantics, or Object-Orientation
(Same HTML fragment as on the previous slide. The callouts Author, Title, and Conference point at the name list, the linked text, and the italic text: the reader can tell them apart, the markup cannot.)

...
And Some Repercussions
• Lack of schema/semantics when querying the Web (HTML):
  – "find documents (books, papers, ...) where author = Michael Jackson" (... and learn how software engineering meets the moon walker ...)
  – "create a list of M. Jackson's books and (if available) their prices"
=> HTML is inappropriate for data exchange and for automation of information management (retrieval, manipulation, integration)

XML is Based on Markup
Markup indicates structure and semantics and is decoupled from presentation:

  <bibliography>
    <paper ID="object-fusion">
      <authors>
        <author>Y.Papakonstantinou</author>
        <author>S. Abiteboul</author>
        <author>H. Garcia-Molina</author>
      </authors>
      <fullPaper source="fusion"/>
      <title>Object Fusion in Mediator Systems</title>
      <booktitle>VLDB 96</booktitle>
    </paper>
  </bibliography>

Elements and their Content
(Same document as on the previous slide. Callouts: element name, element, element content, empty element (<fullPaper source="fusion"/>), character content.)

Element Attributes
(Same document again. Callouts: attribute name, attribute value.)

XML = Labeled Ordered Trees
[Figure: tree rooted at bibliography with paper children; a paper has the children authors, fullPaper, and title; authors has author children with leaves Yannis, Serge, ...; title has the leaf "Object Fusion".]
semistructured data = labeled trees/graphs
can also represent
• relational and
• object-oriented data

  <bibliography>
    <paper ...>
      <authors>
        <author>Yannis</author>
        <author>Serge</author>
        ...
      </authors>
      <title>Object Fusion</title>
      ...
    </paper>
  </bibliography>

In Search of the Lost Structure & Semantics
• How do I share structure and metadata/semantics with my community?
• How do I learn and use the element structure of a document?
• How to make all this automatable?

Adding Structure and Semantics
• XML Document Type Definitions (DTDs):
  • define the structure of "allowed" documents (i.e., valid w.r.t. a DTD)
  • database schema => improve query formulation, execution, ...
• XML Schema
  – defines structure and data types
  – allows developers to build their own libraries of interchanged data types
• XML Namespaces
  – identify your vocabulary

XML DTDs as Extended CFGs
XML DTD:
  <!ELEMENT bibliography (paper*)>
  <!ELEMENT paper (authors, fullPaper?, title, booktitle)>
  <!ELEMENT authors (author+)>
Grammar:
  bibliography → paper*
  paper        → authors fullPaper? title booktitle
  authors      → author+
lhs = element (name)
rhs = regular expression over elements + strings (PCDATA)

Document Type Definitions (DTDs)
Define and Constrain Element Names & Structure
Element type declarations:
  <!ELEMENT bibliography (paper*)>
  <!ELEMENT paper (authors, fullPaper?, title, booktitle)>
  <!ELEMENT authors (author+)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT fullPaper EMPTY>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT booktitle (#PCDATA)>
Attribute list declarations:
  <!ATTLIST fullPaper source ENTITY #REQUIRED>
  <!ATTLIST paper ID ID #REQUIRED>

Element Declarations
  <!ELEMENT bibliography (paper*)>      <!-- sequence of 0 or more paper -->
  <!ELEMENT paper (authors, fullPaper?, title, booktitle)>
                                        <!-- authors followed by optional fullPaper, followed by title, followed by booktitle -->
  <!ELEMENT authors (author+)>          <!-- sequence of 1 or more author -->
  <!ELEMENT author (#PCDATA)>           <!-- character content -->
  <!ELEMENT fullPaper EMPTY>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT booktitle (#PCDATA)>
  <!ATTLIST fullPaper source ENTITY #REQUIRED>
  <!ATTLIST paper ID ID #REQUIRED>

Element Content Declarations
  Declaration        Meaning
  <element 2>        Exactly one <element 2>
  cardinality:
    R?               Zero or one instances of R
    R*               Zero or more instances of R
    R+               One or more instances of R
  R1|R2|…|Rn         One instance of R1 or R2 or … Rn
  R1, R2, …, Rn      Sequence of R’s, order matters
  #PCDATA            Character content
  EMPTY              Empty element
  (#PCDATA | e)*     Mixed content
  ANY                Anything goes

Attributes
  <person ID="yannis"> Yannis’ info </person>
  <bibliography>
    <paper ID="object-fusion" ROLE="publication">
      <authors>
        <author authorRef="yannis">Y.Papakonstantinou</author>
      </authors>
      <fullPaper source="fusion"/>
      <title>Object Fusion in Mediator Systems</title>
      <related papers="semistructured-data mediators"/>
    </paper>
  </bibliography>
Callouts:
• ID: object identity attribute
• ROLE: CDATA (character data)
• authorRef: IDREF, an intradocument reference
• source: reference to an external ENTITY

Attribute Types
  Type              Meaning
  ID                Token unique within the document
  IDREF             Reference to an ID token
  IDREFS            Reference to multiple ID tokens
  ENTITY            External entity (image, video, …)
  ENTITIES          External entities
  CDATA             Character data
  NMTOKEN           Name token
  NMTOKENS          Name tokens
  NOTATION          Data other than XML
  Enumeration       Choices
  Conditional Sec   INCLUDE & IGNORE declarations
Attributes may be: REQUIRED, IMPLIED (optional)
can have: default values, which may be FIXED

Uses of XML Entities
• Physical partition
  – size, reuse, "modularity", … (both XML docs & DTDs)
• Non-XML data
  – unparsed entities: binary data
• Non-standard characters
  – character entities
• Shorthand for phrases & markup

Entities & Physical Structure
A logical element can be split into multiple physical entities:
  Mylife.xml:  DTD... <mylife> ... </mylife>
  Chap1.xml:   <teen>yada yada </teen>
  Chap2.xml:   <adult>blah blah.. </adult>

External Text Entities
External text entity declaration (the system identifier is a URL):
  <!ENTITY chap1 SYSTEM "chap1.xml">
Entity reference:
  <mylife> &chap1; &chap2;</mylife>
Logically equivalent to inlining the file contents:
  <mylife> <teen>yada yada</teen> <adult> blah blah</adult> </mylife>

Types of Entities
• Internal (to a doc) vs. External (use URI)
• General (in XML doc) vs. Parameter (in DTD)
• Parsed (XML) vs. Unparsed (non-XML)

Internal Text Entities
Internal text entity declaration:
  <!ENTITY WWW "World Wide Web">
Entity reference:
  <p>We all use the &WWW;.</p>
Logically equivalent to the text actually appearing:
  <p>We all use the World Wide Web.</p>

Unparsed (& "Binary") Entities
Declare an external and unparsed entity:
  <!ENTITY fusion SYSTEM "fusion.ps" NDATA ps>
Declare the attribute type to be ENTITY:
  <!ATTLIST fullPaper source ENTITY #REQUIRED>
Element with ENTITY attribute:
  <fullPaper source="fusion"/>
NOTATION declaration (helper app):
  <!NOTATION ps SYSTEM "ghostview.exe">

From Docs to Data: XML Schema
• XML DTDs (part of the XML spec.)
  – flexible, semistructured data model (nesting, ANY, ?, *, |, ...)
  – but document-oriented (SGML heritage)
  – no support for namespaces, datatypes, inheritance (e.g., type of book.title may be different from poem.title)
• XML Schema (W3C working draft)
  – schema definition language in XML
  – data-oriented: data types
  – extends capabilities of DTD

XML Schema: Example
Define an order "record" with (mandatory) fields and an (optional) attribute:

  <type name="Order" >
    <element name="name" type="string" />
    <element name="street" type="string" />
    <element name="zip" type="integer" />
    <...>
    <attribute name="orderDate" type="date" />
  </type>

XML Schema: Example
New types can be derived by extension or restriction:

  <type name="personName">
    <element name="title" minOccurs="0"/>
    <element name="forename" minOccurs="0" maxOccurs="*"/>
    <element name="surname"/>
  </type>

  <type name="extendedName" source="personName" derivedBy="extension">
    <element name="generation" minOccurs="0"/>
  </type>

  <type name="simpleName" source="personName" derivedBy="restriction">
    <restrictions>
      <element name="title" maxOccurs="0"/>
      <element name="forename" minOccurs="1" maxOccurs="1"/>
    </restrictions>
  </type>

W3C Work on XML Schemas
• Structures:
  – Specify complex element structure and
  – Set constraints on the permitted values of the content of those elements
• Datatypes:
  – Sets forth a standard of content datatypes and
  – Sets rules for generating new types from them

Further Approaches
• RELAX (REgular LAnguage description for XML)
  – Standardized by INSTAC XML SWG of Japan.
  – Compared with DTD, RELAX has new features:
    • RELAX grammars are represented in the XML instance syntax
    • RELAX borrows rich data types of XML Schema Part 2
    • RELAX is namespace-aware
• many others
  – XML-Data, XML-DR, DCD, SOX, DDML, DSD, Schematron...
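
The contrast drawn in the schema slides above (a DTD constrains element structure, while a schema language also constrains datatypes) can be sketched in a few lines of Python. This is only an illustration built on the standard library's xml.etree.ElementTree, not a schema processor: the field names follow the Order example from the slides, while the FIELD_TYPES table, the check_order helper, and the two sample records are hypothetical names and data introduced here.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical, hand-written stand-in for the datatype part of a schema:
# element name -> converter that raises ValueError on a bad value.
FIELD_TYPES = {
    "name": str,
    "street": str,
    "zip": int,
}

def check_order(xml_text):
    """Return a list of problems found in one <Order> record."""
    problems = []
    order = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    for field, convert in FIELD_TYPES.items():
        child = order.find(field)
        if child is None:
            # Structural check: a DTD could express this as well.
            problems.append(f"missing element: {field}")
            continue
        try:
            convert((child.text or "").strip())
        except ValueError:
            # Datatype check: this goes beyond what a DTD can express.
            problems.append(f"bad value for {field}: {child.text!r}")
    order_date = order.get("orderDate")  # optional attribute
    if order_date is not None:
        try:
            date.fromisoformat(order_date)
        except ValueError:
            problems.append(f"bad value for orderDate: {order_date!r}")
    return problems

good = ('<Order orderDate="2000-05-01"><name>Ann</name>'
        '<street>Main St</street><zip>92093</zip></Order>')
bad = '<Order orderDate="May 1"><name>Ann</name><zip>9209x</zip></Order>'

print(check_order(good))
print(check_order(bad))
```

The first record passes both kinds of check; the second is well-formed XML but fails one structural check (no street) and two datatype checks (zip and orderDate), which is the kind of error that only shows up once types are part of the schema.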

Normalized Data/Metadata Representation
• Resource Description Framework (RDF)
  – Metadata model
  – The designer can describe objects, add properties to define and describe them, and also make complicated statements about the objects (statements about relationships between resources).
  – The specification comes in two sections:
    • Model & Syntax (viewed as directed, labeled graphs)
    • RDF Schemas (using an XML vocabulary)

Resource Description Framework (RDF)
• Metadata is useful for information retrieval (esp. if no other schema info or semantics is available)
• Idea: representation-independent encoding of metadata as triples (Resource, PropertyType, Value):
  – (uri1, DC:creator, uri2), (uri2, vCard:name, smith), ...
• "Semantic Net"
[Figure: uri1 --DC:creator--> uri2 --vCard:name--> smith]

Identifying Vocabularies
• My element may not be your element:
  – geometry context: <element>line</element>
  – chemistry context: <element>oxygen</element>
  – SGML/XML context: ....
=> use XML namespaces to identify the vocabulary

XML Namespaces
• mechanism for globally unique tag names:

  <h:html xmlns:xdc="http://www.xml.com/books"
          xmlns:h="http://www.w3.org/HTML/1998/html4">
    <h:head><h:title>Book Review</h:title></h:head>
    ...
    <xdc:bookreview>
      <xdc:title>XML: A Primer</xdc:title>
      ...
  </h:html>

  => mix of different tag vocabularies without confusion
• namespaces only identify the vocabulary; additional mechanisms required for structure and meaning of tags

Processing XML
• Non-validating parser:
  – checks that XML doc is syntactically well-formed
• Validating parser:
  – checks that XML doc is also valid w.r.t.
    a given DTD
• Parsing yields tree/object representation:
  – Document Object Model (DOM) API
• Or a stream of events (open/close tag, data):
  – Simple API for XML (SAX)

DOM Structure Model and API
• hierarchy of Node objects:
  – document, element, attribute, text, comment, ...
• language-independent programming DOM API:
  – get... first/last child, prev/next sibling, childNodes
  – insertBefore, replace
  – getElementsByTagName
  – ...
• alternative event-based SAX API (Simple API for XML)
  – does not build a parse tree (reports events when encountering begin/end tags)
  – for (partially) parsing large documents

DOM Summary
• Object-Oriented approach to traverse the XML node tree
• Automatic processing of XML docs
• Manipulation & Updating of XML on client & server
• Database interoperability mechanism
• Memory-intensive

SAX Event-Based API
• Pros:
  – The whole file doesn’t need to be loaded into memory
  – XML stream processing
  – Simple and fast
  – Allows you to ignore less interesting data
• Cons:
  – limited expressive power (query/update) when working on streams
    => application needs to build (some) parse-tree when necessary

Querying XML
• What can be done to XML so far:
  – generation: from HTML, DBs, manually, …
  – parsing: with/without DTD (valid/well-formed XML)
  – accessing: APIs for XML applications:
    • DOM (in memory, tree-based), SAX (event-based)
• Now: Query languages for XML
  – XML-QL, XMAS, XPath, XSL(T), XQL, ...

Querying XML
• Why not just query XML with SAX or DOM?
  – SAX: very simple “event-based” queries: ok
  – DOM: simple navigational queries (getChildNodes, getNextSibling, getElementsByTagName, …): ok
• But: these are “low-level” APIs
  – comparable to an iterator/cursor API for RDBs (but more powerful!)
  – used to write XML applications
  – “high-level” querying, restructuring and transformation (and updates??)
is tedious
  – => analogous to high-level relational query languages (SQL, QBE, Logic (Datalog), …)
=> Query languages for XML

Querying XML
• No "official" W3C XML QL yet (but bits and pieces)
• numerous quite different XML QLs are popping up
• some XML QL overviews, comparisons, and resources:
  – XML Query Languages: Experiences and Exemplars (co-authored by several XML QL gurus)
  – XML and Query Languages (Oasis Cover Pages)
  – Comparative Analysis of Five XML Query Languages (A. Bonifati, S. Ceri)
  – A Data Model and Algebra for XML Query (Philip Wadler et al., “functional (Haskell) perspective”)
  – XML-QL vs XSLT queries (Geert Jan Bex and Frank Neven; for (future) XSLT experts only ;-)
  – Introduction to XMAS (the XML QL of the MIX project)

Querying XML
• Different XML QL paradigms depending on the community:
  – (relational, oo, semistructured) database perspective
    • Lorel, YaTL, XML-QL, XMAS, FLORA/FLORID, ...
  – document processing perspective
    • XQL, XSL(T), XPath, ...
  – functional programming perspective
    • QLs with structural recursion, …

Important QL Features (DB Perspective)
– typical parts of a query:
  • (match) pattern (selects parts of the source XML tree without looking at data)
  • filter condition (selects further, now looking at the data)
  • answer construction (putting the results together, possibly reordered, grouped, etc.)
– reordering based on nested queries, grouping, sorting, or Skolem functions
– tag variables, path expressions for defining the patterns without requiring knowledge of the DTD

Sample Queries with XQL/XPath
• Find the root element (bookstore) of this document:
    /bookstore
• Find all author elements anywhere within the current document:
    //author
• Find all books where the value of the style attribute on the book is equal to the value of the specialty attribute of the bookstore element at the root of the document:
    //book[/bookstore/@specialty = @style]
• Find all books with author/first-name equal to 'Bob' and all magazines with price less than 10:
    //(book[author/first-name = 'Bob'] $union$ magazine[price $lt$ 10])

Presenting XML: Extensible Stylesheet Language (XSL)
• Why Stylesheets?
  – separation of content (XML) from presentation (XSL)
• Why not just CSS for XML?
  – XSL is far more powerful:
    • selecting elements
    • transforming the XML tree
    • content based display (result may depend on data)

XSL Overview
• XSL stylesheets are denoted in XML syntax
• XSL components:
  1. a language for transforming XML documents (XSLT: integral part of the XSL specification)
  2.
an XML formatting vocabulary (Formatting Objects: >90% of the formatting properties inherited from CSS)

XSLT Processing Model
[Figure: XML source tree + XSLT stylesheet => Transformation => result tree (XML, HTML, csv, text, …)]

XSLT Processing Model
• XSL stylesheet: collection of template rules
• template rule: (pattern => template)
• main steps:
  – match pattern against source tree
  – instantiate template (replace current node “.” by the template in the result tree)
  – select further nodes for processing
• control can be a mix of
  – recursive processing ("push": <xsl:apply-templates> ...)
  – program-driven ("pull": <xsl:for-each> ...)

But first: some syntactic sugar, PLEASE...
• instead of something complicated like y=f(x)
• in the brave new XSLT world you can “simply” write this as:
    <xsl:variable name="y">
      <xsl:call-template name="f">
        <xsl:with-param name="x"/>
      </xsl:call-template>
    </xsl:variable>

Template Rule: Example
    <xsl:template match="product">
      <table>
        <xsl:apply-templates select="sales/domestic"/>
      </table>
      <table>
        <xsl:apply-templates select="sales/foreign"/>
      </table>
    </xsl:template>
(pattern: the match attribute; template: the content of <xsl:template>)
(i) match pattern: process <product> elements
(ii) instantiate template: replace each <product> with two HTML tables
(iii) select the <product> grandchildren (“sales/domestic”, “sales/foreign”) for further processing

Match/Select Patterns
• match patterns ⊆ select patterns = XPath expressions, defined in http://w3.org/TR/xpath
• Examples:
  – /mybook/chapter[2]/section/*
  – chapter|appendix
  – chapter//para
  – div[@class="appendix" and position() mod 2 = 1]//para
  – ../@lang

XSLT Processing Flavors: Recursive Descent Processing
• take some XML file on books: books.xml
• now prepare it with style: books.xsl
• and enjoy the result: books.html
• the recipe for cooking this was: java
com.icl.saxon.StyleSheet books.xml books.xsl > books.html
• and now some different flavors: books2.xsl books3.xsl

Creating the Result Tree...
• Literal result elements: non-XSL elements (e.g., HTML) appear “literally” in the result tree
• Constructing elements:
    <xsl:element name = "…">
      attribute & children definition
    </xsl:element>
  (similar for xsl:attribute, xsl:text, xsl:comment,…)
• Generating text:
    <xsl:template match="person">
      <p>
        <xsl:value-of select="@first-name"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="@surname"/>
      </p>
    </xsl:template>

Creating the Result Tree...
• Further XSL elements for ...
  – Numbering
    • <xsl:number value="position()" format="1 "/>
  – Conditions
    • <xsl:if test="position() mod 2 = 0">
  – Repetition...

Creating the Result Tree: Repetition
    <xsl:template match="/">
      <html>
        <head> <title>customers</title> </head>
        <body>
          <table>
            <tbody>
              <xsl:for-each select="customers/customer">
                <tr>
                  <th> <xsl:apply-templates select="name"/> </th>
                  <xsl:for-each select="order">
                    <td> <xsl:apply-templates/> </td>
                    ...
      </html>
    </xsl:template>

Creating the Result Tree: Sorting
    <xsl:template match="employees">
      <ul>
        <xsl:apply-templates select="employee">
          <xsl:sort select="name/last"/>
          <xsl:sort select="name/first"/>
        </xsl:apply-templates>
      </ul>
    </xsl:template>
    <xsl:template match="employee">
      <li>
        <xsl:value-of select="name/first"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="name/last"/>
      </li>
    </xsl:template>

More on XSL
• XSL(T):
  – Conflict resolution for multiple applicable rules
  – Modularization <xsl:include> <xsl:import>
  – …
• XSL Formatting Objects
  – a la CSS
• XPath (navigation syntax + functions) = XSLT ∩ XPointer
• ...
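
To make the effect of the "Creating the Result Tree: Sorting" stylesheet concrete, here is a minimal Python sketch that produces the same kind of output with the standard library's ElementTree instead of an XSLT processor. It is an emulation, not XSLT: the sample document, the names in it, and the function name `render_sorted` are assumptions made for illustration (the slides do not show the input file), and Python compares strings by code point, whereas `xsl:sort` may use language-dependent collation.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; the slides do not show the actual source file.
EMPLOYEES_XML = """
<employees>
  <employee><name><first>Ada</first><last>Smith</last></name></employee>
  <employee><name><first>Carl</first><last>Jones</last></name></employee>
  <employee><name><first>Bob</first><last>Smith</last></name></employee>
</employees>
"""


def render_sorted(xml_text: str) -> str:
    """Return a <ul> with one 'first last' <li> per employee,
    ordered by last name, then by first name."""
    root = ET.fromstring(xml_text)

    # The relative paths play the role of select="name/last" and
    # select="name/first" in the two <xsl:sort> elements.
    employees = sorted(
        root.findall("employee"),
        key=lambda e: (e.findtext("name/last", ""), e.findtext("name/first", "")),
    )

    # Build the result tree: literal <ul>, one <li> per matched employee.
    ul = ET.Element("ul")
    for e in employees:
        li = ET.SubElement(ul, "li")
        li.text = f"{e.findtext('name/first', '')} {e.findtext('name/last', '')}"
    return ET.tostring(ul, encoding="unicode")


if __name__ == "__main__":
    print(render_sorted(EMPLOYEES_XML))
```

The sort key is a tuple, so the first component (last name) is the primary key and the second (first name) breaks ties, which mirrors the order of the two `xsl:sort` elements. The hand-written loop also shows what the stylesheet saves: matching, ordering, and result construction are declared there rather than programmed.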

The MIX Project: Mediation of Information using XML
Joint effort between SDSC and the UCSD CSE Department

Mediation of Information using XML (MIX)
[Figure labels: XML Query; XML; XML View Document(s) (three); Wrapper (two); Data Source (e.g., home ads); Native XML Database; Legacy Source]
Export:
• Schema & Metadata (DTD, RDF,…)
• Capabilities

Integrated / Mediated views
[Figure labels: Integrated XML View; View Definition in XML Query Lang; Mediator; XML View Document(s) (three); Wrapper (two); Data Source (two); XML Data Source]

A Typical Mediation Scenario
[Figure labels: User Interface; Query; Results; Mediator (integrated views over heterogeneous sources); Query “fragment” (two); Wrapper (three; convert incoming query and outgoing data); SQL Database; GIS; HTML]

MIX Components
• MIXm Mediator tool-kit
  – allows definition of views across multiple resources
  – views are expressed in a declarative query language
  – query engine to execute queries on views
• XML Matching And Structuring (XMAS) query language
  – operates on a given set of XML documents to produce a new XML document, using XMAS algebra

An XML Query (XMAS)
Answer template:
    <folder> $C  $S for $S </folder> for $C
Patterns and sources:
    $C:<*.condo> <address zip=$Z/> </condo> AT www.condo.com
    AND
    $S:<*.school type=elementary> <address zip=$Z/> </school> AT schools.org
Sample source fragment:
    ...
    <RealEstateAgent>
      <name>J. Smith</name>
      <condos>
        <condo>
          <address ... zip=92037>
          <price>$170k OBO</price>
          <bedrooms>2</bedrooms>
        </condo>
      </condos>
    </RealEstateAgent>
Sample answer fragment:
    <condosAndSchools>
      <folder>
        <condo>
          <address ... zip=92037>
          <price>$170k OBO</price>
          <bedrooms>2</bedrooms>
        </condo>
        <school>
          <name>La Jolla High</name>
          <address … zip=92037>
        </school>
        <school>…</school>
      </folder>

MIX components...
• DOM-VXD: DOM Virtual XML Document extension
  – a “lazy” implementation of DOM. Supports browsing/navigation of XML documents with a server-side, “compute as you go” model
• Blended Browsing and Querying (BBQ) interface
  – supports navigation and querying of XML documents
  – generates XMAS queries on mediator views
  – generates XMAS queries modified by DOM-VXD operations to incrementally evaluate the result set, to support navigation of XML documents

Navigation driven evaluation
[Figure labels: client; navigation commands; result; view definition q( s1 … sn ); Lazy Mediator; source navigation commands; XML sources s1 ... sn]

Blended Browsing and Querying UI (BBQ)

Another MIX Example: CDL/AMICO Mediator Prototype
[Figure labels: BBQ Interface; Request for image (X.509); XMAS query; HPSS; tif file; SRB/MCAT; XML doc of paintings; MIXm View based on AMICO DTD; Q1: Find title, type, and image ID of paintings; Q2: Find creator and related metadata; Wrapper; MARC Database; AMICO XML Database (two); AMICO/XML Demo]

XSL Stylesheet for AMICO Answer Docs

... and the Result (+BBQ)
[Figure labels: BBQ query composition; XSL rendered output; XML answer document]

Projects at DICE/SDSC
• National Archives and Records Administration, NARA
  – Persistent Archives and Electronic Records
• NHPRC/NARA
• XML and GIS
  – aXioMap
• I2T: An Information Integration Testbed for Digital Government

Projects at SDSC (… cont)
• AMICO
  – In conjunction with the California Digital Library (CDL)
  – Part of the NSF DLI-2 project
• ESRI
• Community of Science, Inc.
• Networked Earthquake Engineering Simulation (NEES)
  – NSF program

Information Based Computing
[Figure labels: Data Storage; Applications; Collection Building; Information Management; Archival Storage; Digital Library]
• Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science
• Digital Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL

Integrating Data Set Management
• Model-Based Information Management
  – Rule-based ontology mapping, conceptual-level mediation - CMIX
• Data Grid
  – Data federation across multiple libraries - MIX
• Digital Library
  – Interoperable services for information discovery and presentation - SDLIP
• Data Collection
  – Tools for managing data set collections on databases - MCAT
• Data Handling
  – Systems for data retrieval from remote storage - SRB
• Persistent Archives
  – Storage of data collections for 30 years

Model-Based Mediation
• Knowledge-based mediation
  – conceptual-level integration
• Rule-based ontology maps
  – map source XML to CM to FL (ontologies, views)
• Models for exporting
  – rules
  – integrity constraints
  – query capabilities
  – data & schema (XML/DTDs)

Federation of Brain Data
[Figure labels: Result (XML/XSLT); Result (VML); PROTLOC; ANATOM; MODEL-BASED Mediation; CCB, Montana SU; Surface atlas, Van Essen Lab; stereotaxic atlas LONI; MCell, CNL, Salk; NCMIR, UCSD]

Further Information
• xml.com
• w3.org
• xml.org
• ibm.com/xml
• ...
• Mediation of Information using XML (MIX):
  – www.npaci.edu/DICE/MIX/
  – www.db.ucsd.edu/Projects/MIX/