XML and Databases 1 Outline (ambitious) • Background: documents (SGML/HTML) and databases (structured and semistructured data) • XML Basics and Document Type Descriptors • XML query languages: XML-QL and XSL. • XML additions: Xlink, Xpointer, RDF, SOX, XMLData • Document Object Model (XML API's) 2 Some Useful Articles XML, Java, and the future of the web http://webreview.com/wr/pub/97/12/19/xml/index.html XML and the Second-Generation Web http://www.sciam.com/1999/0599issue/0599bosak.html Articles/standards for XML, XSL, XML-QL http://www.w3c.org/ http://www.w3.org/TR/REC-xml 3 Background What’s the difference between the world of documents and information retrieval and databases and query interfaces? 4 Documents vs Databases Document world > plenty of small documents > usually static Database world > a few large databases > usually dynamic > implicit structure > explicit structure (schema) > tagging > records > human friendly > machine friendly > content > content section, paragraph, toc, form/layout, annotation > Paradigms “Save as”, wysiwyg > meta-data author name, date, subject schema, data, methods > Paradigms Atomicity, Concurrency, Isolation, Durability > meta-data schema description 5 What to do with them Documents • editing Database • updating • printing • spell-checking • counting words • cleaning • retrieving (IR) • querying • searching • composing/transforming 6 HTML • Publishing hypertext on the World Wide Web • Designed to describe how a Web browser should arrange text, images and push-buttons on a page. • Easy to learn, but does not convey structure. • Fixed tag set. Text (PCDATA) Opening tag <HTML> <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD> <BODY> <H1>Introduction</H1> <IMG SRC=”dragon.jpeg" WIDTH="200" HEIGHT="150” > Closing tag </BODY> </HTML> “Bachelor” tag Attribute name Attribute value 7 The Structure of XML • XML consists of tags and text • Tags come in pairs <date> ...</date> • They must be properly nested <date> <day> ... </day> ... </date> --- good <date> <day> ... </date>... </day> --- bad (You can’t do <i> ... <b> ... </i> ...</b> in HTML) 8 XML text XML has only one “basic” type -- text. It is bounded by tags e.g. <title> The Big Sleep </title> <year> 1935 </ year> --- 1935 is still text XML text is called PCDATA (for parsed character data). It uses a 16-bit encoding, e.g. \&\#x0152 for the Hebrew letter Mem Later we shall see how new types are specified by XML-data 9 XML structure Nesting tags can be used to express various structures. E.g. A tuple (record) : <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person> 10 XML structure (cont.) • We can represent a list by using the same tag repeatedly: <addresses> <person> ... </person> <person> ... </person> <person> ... </person> ... </addresses> 11 Terminology The segment of an XML document between an opening and a corresponding closing tag is called an element. element <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person> element, a sub-element of not an element 12 XML is tree-like person name tel tel email Malcolm Atchison (215) 898 4321 (215) 898 4321 mp@dcs.gla.ac.sc Semistructured data models typically put the labels on the edges 13 Mixed Content An element may contain a mixture of sub-elements and PCDATA <airline> <name> British Airways </name> <motto> World’s <dubious> favorite</dubious> airline </motto> </airline> Data of this form is not typically generated from databases. It is needed for consistency with HTML 14 A Complete XML Document <?xml version="1.0"?> <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person> 15 Two ways of representing a DB projects: title employees: name budget ssn managedBy age 16 Project and Employee relations in XML Projects and employees are intermixed <db> <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project> <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 < /age> </employee> <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee> <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project> : </db> 17 Project and Employee relations in XML (cont’d) Employees follow projects <db> <employees> <projects> <employee> <project> <name> Joe </name> <title> Pattern recognition </title> <ssn> 344556 </ssn> <budget> 10000 </budget> <age> 34 </age> <managedBy> Joe </managedBy> </employee> </project> <employee> <project> <name> Sandra </name> <title> Auto guided vehicles </title> <ssn> 2234 </ssn> <budget> 70000 </budget> <age>35 </age> <managedBy> Sandra </managedBy> </employee> </project> : : <employees> </projects> </db> 18 Project and Employee relations in XML (cont’d) Or without “separator” tags … <db> <projects> <employees> <title> Pattern recognition </title> <name> Joe </name> <ssn> 344556 </ssn> <budget> 10000 </budget> <age> 34 </age> <managedBy> Joe </managedBy> <name> Sandra </name> <title> Auto guided vehicles </title> <ssn> 2234 </ssn> <budget> 70000 </budget> <age> 35 </age> <managedBy> Sandra </managedBy> : </employees> : </db> </projects> 19 Attributes An (opening) tag may contain attributes. These are typically used to describe the content of an element <entry> <word language = “en”> cheese </word> <word language = “fr”> fromage </word> <word language = “ro”> branza </word> <meaning> A food made … </meaning> </entry> Order of attributes in an element does not matter XML elements are ordered 20 Attributes (cont’d) Another common use for attributes is to express dimension or type <picture> <height dim= “cm”> 2400 </height> <width dim= “in”> 96 </width> <data encoding = “gif” compression = “zip”> M05-.+C$@02!G96YE<FEC ... </data> </picture> A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed . 21 When to use attributes It’s not always clear when to use attributes <person ssno= “123 45 6789”> <name> F. MacNiel </name> <email> fmacn@dcs.barra.ac.sc </email> ... </person> <person> <ssno> 123 45 6789 </ssno> <name> F. MacNiel </name> <email> fmacn@dcs.barra.ac.sc </email> ... </person> 22 XML Misc. Apart from elements and attributes, XML allows processing instructions and comments. A processing instruction is a statement of the form: <?xml version="1.0"?> <?XML ENCODING="UTF-8" VERSION="1.0"?> A comment takes the following form: enclose comments between <!- - and - -> <!– - A Comment --> 23 Document Type Descriptors Imposing structure on XML documents 24 Document Type Descriptors • Document Type Descriptors (DTDs) impose structure on an XML document. • There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems. • The DTD is a syntactic specification. 25 Example: The Address Book <person> <name> MacNiel, John </name> <greet> Dr. John MacNiel </greet> Exactly one name At most one greeting <addr>1234 Huron Street </addr> As many address lines <addr> Rome, OH 98765 </addr> <tel> (321) 786 2543 </tel> <fax> (321) 786 2543 </fax> <tel> (321) 786 2543 </tel> <email> jm@abc.com </email> </person> as needed (in order) Mixed telephones and faxes As many as needed 26 Specifying the structure • name to specify a name element • greet? to specify an optional (0 or 1) greet elements • name,greet? to specify a name followed by an optional greet 27 Specifying the structure (cont) • addr* • tel | fax to specify 0 or more address lines a tel or a fax element • (tel | fax)* 0 or more repeats of tel or fax • email* 0 or more email elements 28 Specifying the structure (cont) So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, email* This is known as a regular expression. Why is it important? 29 Regular Expressions Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example: name, addr*, email This suggests a simple parsing program addr name email 30 Another example name,address*,(tel | fax)*,email* address name email tel tel email fax fax email Adding in the optional greet further complicates things 31 A DTD for the address book <!DOCTYPE addressbook [ <!ELEMENT addressbook (person*)> <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)> <!ELEMENT name (#PCDATA)> <!ELEMENT greet (#PCDATA)> <!ELEMENT address (#PCDATA)> <!ELEMENT tel (#PCDATA)> <!ELEMENT fax (#PCDATA)> <!ELEMENT email (#PCDATA)> ]> 32 Our relational DB revisited projects: title employees: name budget ssn managedBy age 33 Two DTDs for the relational DB <!DOCTYPE db [ <!ELEMENT db (projects,employees)> <!ELEMENT projects (project*)> <!ELEMENT employees (employee*)> <!ELEMENT project (title, budget, managedBy)> <!ELEMENT employee (name, ssn, age)> ... ]> <!DOCTYPE db [ <!ELEMENT db (project | employee)*> <!ELEMENT project (title, budget, managedBy)> <!ELEMENT employee (name, ssn, age)> ... ]> 34 Some things are hard to specify Each employee element is to contain name, age and ssn elements in some order. <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )> Suppose there were many more fields ! 35 Summary of XML regular expressions • • • • • • • A e1,e2 e* e? e+ e1 | e2 (e) The tag A occurs The expression e1 followed by e2 0 or more occurrences of e Optional -- 0 or 1 occurrences 1 or more occurrences either e1 or e2 grouping 36 Specifying attributes in the DTD <!ELEMENT height (#PCDATA)> <!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED > The dimension attribute is required; the accuracy attribute is optional. CDATA is the “type” of the attribute -- it means string. 37 The DTD Language • Default modifiers in DTD attributes: Modifier Description #REQUIRED The attributes value must be specified with the element. The attribute value can remain unspecified. The attribute value is fixed and cannot be changed by the user. #IMPLIED #FIXED 38 The DTD Language • Datatypes in DTD attributes: Type Description CDATA enumerated ENTITY ENTITIES Character data A series of values of which only 1 can be chosen An entity declared in the DTD Multiple whitespace separated entities declared in the DTD ID A unique element identifier IDREF The value of a unique ID type attribute IDREFS Multiple whitespace separated IDREFs of elements NMTOKEN An XML name token NMTOKENS Multiple whitespace separated XML name tokens NOTATION A notation declared in the DTD 39 Consistency of ID and IDREF attribute values • If an attribute is declared as ID – the associated values must all be distinct (no confusion) – Id is a poor cousin of a key in relational databases. • If an attribute is declared as IDREF – the associated value must exist as the value of some ID attribute (no dangling “pointers”) – IDREF is a poor cousin of foreign key in relational databases. • Similarly for all the values of an IDREFS attribute – An attribute of type IDREFS represent a spaceseparated list of strings of references to valid IDs. • ID and IDREF attributes are not typed 40 Specifying ID and IDREF attributes <!DOCTYPE family [ <!ELEMENT family (person)*> <!ELEMENT person (name)> <!ELEMENT name (#PCDATA)> <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]> 41 Some conforming data <family> <person id="jane" mother="mary" father="john"> <name> Jane Doe </name> </person> <person id="john" children="jane jack"> <name> John Doe </name> </person> <person id="mary" children="jane jack"> <name> Mary Doe </name> </person> <person id="jack" mother=”mary" father="john"> <name> Jack Doe </name> </person> </family> 42 An alternative specification <!DOCTYPE family [ <!ELEMENT family (person)*> <!ELEMENT person (name, mother?, father?, children?)> <!ATTLIST person id ID #REQUIRED> <!ELEMENT name (#PCDATA)> <!ELEMENT mother EMPTY> <!ATTLIST mother idref IDREF #REQUIRED> <!ELEMENT father EMPTY> <!ATTLIST father idref IDREF #REQUIRED> <!ELEMENT children EMPTY> <!ATTLIST children idrefs IDREFS #REQUIRED> ]> 43 The revised data <family> <person id = "jane”> <name> Jane Doe </name> <mother idref = "mary”></mother> <father idref = "john"></father> </person> <person id = "john”> <name> John Doe </name> <children idrefs = "jane jack"> </children> </person> ... </family> 44 The DTD Language • Example: Sales Order Document “An order document is comprised of several sales orders. Each individual order has a number and it contains the customer information, the date when the order was received, and the items ordered. Each customer has a number, a name, street, city, state, and ZIP code. Each item has an item number, parts information and a quantity. The parts information contains a number, a description of the product and its unit price. The numbers should be treated as attributes.” 45 The DTD Language • Example: Sales Order Document DTD <!-- DTD for example sales order document --> <!ELEMENT Orders (SalesOrder+)> <!ELEMENT SalesOrder (Customer,OrderDate,Item+)> <!ELEMENT Customer (CustName,Street,City,State,ZIP)> <!ELEMENT <!ELEMENT <!ELEMENT <!ELEMENT <!ELEMENT <!ELEMENT <!ATTLIST <!ATTLIST <!ATTLIST <!ATTLIST OrderDate (#PCDATA)> Item (Part,Quantity)> Part (Description,Price)> CustName (#PCDATA)> Street (#PCDATA)> ... (#PCDATA)> SalesOrder SONumber CDATA #REQUIRED> Customer CustNumber CDATA #REQUIRED> Part PartNumber CDATA #REQUIRED> Item ItemNumber CDATA #REQUIRED> 46 The DTD Language • Example: Sales Order XML Document <Orders> <SalesOrder SONumber=“12345”> <Customer CustNumber=“543”> <CustName>ABC Industries</CustName> <Street>123 Main St.</Street> <City>Chicago</City> <State>IL</State> <ZIP>60609</ZIP> </Customer> <OrderDate>10222000</OrderDate> <Item ItemNumber=“1”> <Part PartNumber=“234”> <Description>Turkey wrench</Description> <Price>9.95</Price> </Part> <Quantity>10</Quantity> </Item> </SalesOrder> </Orders> 47 A useful abbreviation When an element has empty content we can use <tag blahblahbla/> for <tag blahblahbla></tag> For example: <family> <person id = "jane”> <name> Jane Doe </name> <mother idref = "mary”/> <father idref = "john”/> </person> ... </family> 48 Schema.dtd <!DOCTYPE db <!ELEMENT <!ELEMENT <!ATTLIST <!ELEMENT <!ELEMENT <!ELEMENT <!ATTLIST <!ELEMENT [ db (movie+, actor+)> movie (title,director,cast,budget)> movie id ID #REQUIRED> title (#PCDATA)> director (#PCDATA)> casts EMPTY> casts idrefs IDREFS #REQUIRED> budget (#PCDATA)> 49 Schema.dtd (cont’d) <!ELEMENT <!ATTLIST <!ELEMENT <!ELEMENT <!ATTLIST <!ELEMENT <!ELEMENT ]> actor actor name acted_In acted_In age directed (name, acted_In,age?, directed*)> id ID #REQUIRED> (#PCDATA)> EMPTY> idrefs IDREFS #REQUIRED> (#PCDATA)> (#PCDATA)> 50 Connecting the document with its DTD In line: <?xml version="1.0"?> <!DOCTYPE db [<!ELEMENT ...> … ]> <db> ... </db> Another file: <!DOCTYPE db SYSTEM "schema.dtd"> A URL: <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd"> 51 Well-formed and Valid Documents • Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes • Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied 52 DTDs v.s Schemas (or Types) • By database (or programming language) standards DTDs are rather weak specifications. – Only one base type -- PCDATA – No useful “abstractions” e.g., sets – IDREFs are untyped. You point to something, but you don’t know what! – No constraints e.g., child is inverse of parent – No methods – Tag definitions are global • Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later 53 Lots of possibilities for schemas • • • • • • XML Schema (under W3C’s spotlight) XDR (Microsoft’s BizTalk) SOX (Schema for Object-Oriented XML) Schematron DSD (AT&T Labs and BRICS) and more. 54 Some tools • XML Authority http://www.extensibility.com/tibco/solutions/xml _authority/index.htm • XML Spy http://www.xmlspy.com/download.html 55 Summary • XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema) • DTDs provide some useful syntactic constraints on documents. As schemas they are weak • How to store large XML documents? • How to query them? • How to map between XML and other representations? 56