Introduction to Semistructured Data and XML Chapter 27 Database Management Systems, R. Ramakrishnan 1 How the Web is Today HTML documents • often generated by applications • consumed by humans only • easy access: across platforms, across organizations No application interoperability: • HTML not understood by applications • Database technology: client-server Database Management Systems, R. Ramakrishnan 2 New Universal Data Exchange Format: XML A recommendation from the W3C XML = data XML generated by applications XML consumed by applications Easy access: across platforms, organizations Database Management Systems, R. Ramakrishnan 3 Paradigm Shift on the Web From documents (HTML) to data (XML) From information retrieval to data management For databases, also a paradigm shift: • from relational model to semistructured data • from data processing to data/query translation • from storage to transport Database Management Systems, R. Ramakrishnan 4 HTML HTML is widely used for formatting and structuring Web documents. Designed to describe how a Web browser should arrange text, images and push-buttons on a page. Easy to learn, but does not convey structure and meaning of data in the Web pages. Fixed tag set. Text (PCDATA) Opening tag <HTML> <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD> <BODY> <H1>Introduction</H1> <IMG SRC=”dragon.jpeg" WIDTH="200" HEIGHT="150” > Closing tag </BODY> </HTML> “Bachelor” tag Attribute name Database Management Systems, R. Ramakrishnan Attribute value 5 Semistructure data 1. 2. 3. 4. Information integration: important new application that motivates what follows. Semistructured data: a new data model designed to cope with problems of information integration. XML (Extensible Markup Language) : a new Web standard that is essentially semistructured data. XQUERY: an emerging standard query language for XML data. Database Management Systems, R. Ramakrishnan 6 Information Integration Problem: related data exists in many places. They talk about the same things, but differ in model, schema, conventions (e.g., terminology). Example: In the real world, every bar has its own database. Some may have relations like beer-price; others have an Microsoft Word file from which the menu is printed. Some keep phones of manufacturers but not addresses. Some distinguish beers and ales; others do not. Database Management Systems, R. Ramakrishnan 7 The Semistructured Data Model Bib Object Exchange Model (OEM) &o1 complex object paper paper book references &o12 &o24 references author title year &o29 references author http page author title publisher title author author author &o43 &25 &96 1997 last firstname firstname lastname &243 “Serge” “Abiteboul” “Victor” lastname first &206 “Vianu” 122 133 atomic object Database Management Systems, R. Ramakrishnan 8 Characteristics of Semistructured Data Missing or additional attributes Multiple attributes Different types in different objects Heterogeneous collections Self-describing, irregular data, no a priori structure Database Management Systems, R. Ramakrishnan 9 Comparison with Relational Data row nam e phone John 3634 Sue 6343 D ic k 6363 Database Management Systems, R. Ramakrishnan row row name phone name phone name phone “John” 3634 “Sue” 6343 “Dick” 6363 { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } 10 XML (Extensible Markup Language) A W3C standard to complement HTML Origins: Structured text SGML • Large-scale electronic publishing • Data exchange on the web Motivation: • HTML describes presentation • XML describes content Database Management Systems, R. Ramakrishnan 11 From HTML to XML HTML describes the presentation Database Management Systems, R. Ramakrishnan 12 HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999 Database Management Systems, R. Ramakrishnan 13 XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content Database Management Systems, R. Ramakrishnan 14 Why are we DB’ers interested? It’s data. That’s us. Database issues: • How are we going to model XML? (graphs). • How are we going to query XML? (XQuery) • How are we going to store XML (in a relational database? object-oriented? native?) • How are we going to process XML efficiently? (many interesting research questions!) Database Management Systems, R. Ramakrishnan 15 XML Terminology Tags: book, title, author, … • start tag: <book>, end tag: </book> Elements: <book>…<book>,<author>…</author> • elements can be nested • empty element: <red></red> (Can be abbrv. <red/>) XML document: Has a single root element Well-formed XML document: Has matching tags Valid XML document: conforms to a schema Database Management Systems, R. Ramakrishnan 16 Well-Formed XML 1. Declaration = <? ... ?> . • Normal declaration is <? XML VERSION = "1.0" STANDALONE = "yes" ?> • “Standalone” means that there is no DTD specified. 2. Root tag surrounds the entire balance of the document. <FOO> is balanced by </FOO>, as in HTML. 3. Any balanced structure of tags OK. • Option of tags that don’t require balance, like <P> in HTML. Database Management Systems, R. Ramakrishnan 17 XML: An Example <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <BOOKLIST> <BOOK genre="Science" format="Hardcover"> <AUTHOR> <FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME> </AUTHOR> <TITLE>The Character of Physical Law</TITLE> <PUBLISHED>1980</PUBLISHED> </BOOK> <BOOK genre="Fiction"> <AUTHOR> <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME> </AUTHOR> <TITLE>Waiting for the Mahatma</TITLE> <PUBLISHED>1981</PUBLISHED> </BOOK> <BOOK genre="Fiction"> <AUTHOR> <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME> </AUTHOR> <TITLE>The English Teacher</TITLE> <PUBLISHED>1980</PUBLISHED> </BOOK> </BOOKLIST> Database Management Systems, R. Ramakrishnan 18 XML – Elements <BOOK genre="Science" format="Hardcover">…</BOOK> attribute open tag element name attribute value data closing tag Xml is case and space sensitive Element opening and closing tag names must be identical Opening tags: “<” + element name + “>” Closing tags: “</” + element name + “>” Empty Elements have no data and no closing tag: • They begin with a “<“ and end with a “/>” <BOOK/> Database Management Systems, R. Ramakrishnan 19 XML – Attributes <BOOK genre="Science" format="Hardcover">…</BOOK> attribute open tag attribute value element name data closing tag Attributes provide additional information for element tags. There can be zero or more attributes in every element; each one has the the form: attribute_name=‘attribute_value’ - There is no space between the name and the “=‘” - Attribute values must be surrounded by “ or ‘ characters Multiple attributes are separated by white space (one or more spaces or tabs). Database Management Systems, R. Ramakrishnan 20 Elements The segment of an XML document between an opening and a corresponding closing tag is called an element. element <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person> element, a sub-element of Database Management Systems, R. Ramakrishnan not an element 21 XML – Data and Comments <BOOK genre="Science" format="Hardcover">…</BOOK> attribute open tag attribute value element name closing tag data Xml data is any information between an opening and closing tag Xml data must not contain the ‘<‘ or ‘>’ characters Comments: <!- comment -> Database Management Systems, R. Ramakrishnan 22 XML text XML has only one “basic” type -- text. It is bounded by tags, e.g. <title> The Big Sleep </title> <year> 1935 </ year> --- 1935 is still text XML text is called PCDATA (for parsed character data). It uses a 16-bit encoding. Database Management Systems, R. Ramakrishnan 23 XML – Nesting & Hierarchy Xml tags can be nested in a tree hierarchy Xml documents can have only one root tag Between an opening and closing tag you can insert: 1. Data 2. More Elements 3. A combination of data and elements <root> <tag1> Some Text <tag2>More</tag2> </tag1> </root> Database Management Systems, R. Ramakrishnan 24 Representing relational DBs: Two ways projects: title employees: name budget ssn Database Management Systems, R. Ramakrishnan managedBy age 25 Project and Employee relations in XML Projects and employees are intermixed <db> <project> <employee> <name> Sandra </name> <title> Pattern recognition </title> <ssn> 2234 </ssn> <budget> 10000 </budget> <age> 35 </age> <managedBy> Joe</managedBy> </employee> </project> <project> <employee> <title> Auto guided vehicle </title> <budget> 70000 </budget> <name> Joe </name> <managedBy> Sandra </managedBy> <ssn> 344556 </ssn> </project> <age> 34 < /age> : </employee> </db> Database Management Systems, R. Ramakrishnan 26 Project and Employee relations in XML (cont’d) Employees follows projects <db> <projects> <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy>Joe </managedBy> </project> <project> <title>Auto guided vehicles</title> <budget> 70000 </budget> <managedBy>Sandra</managedBy> </project> : </projects> Database Management Systems, R. Ramakrishnan <employees> <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee> <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age>35 </age> </employee> : <employees> </db> 27 More XML: Oids and References <person id=“o555”> <name> Jane </name> </person> <person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/> </person> <person id=“o123” mother=“o456”><name>John</name> </person> oids and references in XML are just syntax Database Management Systems, R. Ramakrishnan 28 XML Data Model (Graph) db #0 book book publisher b1 b2 pub title #1 pcdata mkp author #2 pcdata title #3 pcdata pub author #5 #4 pcdata pcdata Complete... Chamberlin Principles... Bernstein Database Management Systems, R. Ramakrishnan author Newcomer name #6 pcdata state #7 pcdata Morgan... CA 29 Document Type Descriptors Sort of like a schema but not really. <!ELEMENT Book (title, author*) > <!ELEMENT title #PCDATA> <!ELEMENT author (name, address,age?)> <!ATTLIST Book id ID #REQUIRED> <!ATTLIST Book pub IDREF #IMPLIED> Inherited from SGML DTD standard BNF grammar establishing constraints on element structure and content Definitions of entities Database Management Systems, R. Ramakrishnan 30 DTD – An Example <?xml version='1.0'?> <!ELEMENT Basket (Cherry+, (Apple | Orange)*) > <!ELEMENT Cherry EMPTY> <!ATTLIST Cherry flavor CDATA #REQUIRED> <!ELEMENT Apple EMPTY> <!ATTLIST Apple color CDATA #REQUIRED> <!ELEMENT Orange EMPTY> <!ATTLIST Orange location ‘Florida’> -------------------------------------------------------------------------------- <Basket> <Cherry flavor=‘good’/> <Apple color=‘red’/> <Apple color=‘green’/> </Basket> Database Management Systems, R. Ramakrishnan <Basket> <Apple/> <Cherry flavor=‘good’/> <Orange/> </Basket> 31 DTD - !ELEMENT <!ELEMENT Basket (Cherry+, (Apple | Orange)*) > Name Children !ELEMENT declares an element name, and what children elements it should have Content types: • • • • • Other elements #PCDATA (parsed character data) EMPTY (no content) ANY (no checking inside this structure) A regular expression Database Management Systems, R. Ramakrishnan 32 DTD - !ELEMENT (Contd.) A regular expression has the following structure: • exp1, exp2, exp3, …, expk: A list of regular expressions • exp*: An optional expression with zero or more occurrences • exp+: An optional expression with one or more occurrences • exp1 | exp2 | … | expk: A disjunction of expressions Database Management Systems, R. Ramakrishnan 33 DTD - !ATTLIST <!ATTLIST Cherry flavor CDATA #REQUIRED> Element Attribute Type Flag <!ATTLIST Orange location CDATA #REQUIRED color ‘orange’> !ATTLIST defines a list of attributes for an element Attributes can be of different types, can be required or not required, and they can have default values. Database Management Systems, R. Ramakrishnan 34 DTD – Well-Formed and Valid <?xml version='1.0'?> <!ELEMENT Basket (Cherry+)> <!ELEMENT Cherry EMPTY> <!ATTLIST Cherry flavor CDATA #REQUIRED> -------------------------------------------------------------------------------- Not Well-Formed Well-Formed but Invalid <basket> <Job> <Cherry flavor=good> <Location>Home</Location> </Basket> </Job> Well-Formed and Valid <Basket> <Cherry flavor=‘good’/> </Basket> Database Management Systems, R. Ramakrishnan 35 Example: An Address Book <person> <name> MacNiel, John </name> <greet> Dr. John MacNiel </greet> <addr>1234 Huron Street </addr> <addr> Rome, OH 98765 </addr> <tel> (321) 786 2543 </tel> <fax> (321) 786 2543 </fax> <tel> (321) 786 2543 </tel> <email> jm@abc.com </email> </person> Database Management Systems, R. Ramakrishnan Exactly one name At most one greeting As many address lines as needed (in order) Mixed telephones and faxes As many as needed 36 Specifying the structure name to specify a name element greet? to specify an optional (0 or 1) greet elements name,greet? to specify a name followed by an optional greet Database Management Systems, R. Ramakrishnan 37 Specifying the structure (cont) addr* to specify 0 or more address lines tel | fax a tel or a fax element (tel | fax)* 0 or more repeats of tel or fax email* 0 or more email elements Database Management Systems, R. Ramakrishnan 38 A DTD for the address book <!DOCTYPE addressbook [ <!ELEMENT addressbook (person*)> <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)> <!ELEMENT name (#PCDATA)> <!ELEMENT greet (#PCDATA)> <!ELEMENT address (#PCDATA)> <!ELEMENT tel (#PCDATA)> <!ELEMENT fax (#PCDATA)> <!ELEMENT email (#PCDATA)> ]> Database Management Systems, R. Ramakrishnan 39 DTD for the example relational DB <!DOCTYPE db [ <!ELEMENT db <!ELEMENT projects <!ELEMENT employees <!ELEMENT project <!ELEMENT employee ... ]> Database Management Systems, R. Ramakrishnan (projects,employees)> (project*)> (employee*)> (title, budget, managedBy)> (name, ssn, age)> 40 Summary of XML regular expressions Each element name is a tag. Its components are the tags that appear nested within, in the order specified. A The tag A occurs e1,e2 The expression e1 followed by e2 e* 0 or more occurrences of e e? Optional -- 0 or 1 occurrences e+ 1 or more occurrences e1 | e2 either e1 or e2 (e) grouping Database Management Systems, R. Ramakrishnan 41 XML Querying Path Expressions : Bib.paper Bib.book.publisher Bib.paper.author.lastname Given an OEM instance, the value of a path expression p is a set of objects Database Management Systems, R. Ramakrishnan 42 Path Expressions Bib &o1 Examples: paper paper book references &o12 &o24 references DB = author title &o43 year &o44 &o29 references author http author title publisher title author author author &o45 &o46 &o52 page &25 &96 1997 firstname lastname &o70 “Serge” &o47 &o48 &o49 &o50 &o51 firstname &o71 “Abiteboul” Bib.paper={&o12,&o29} Bib.book.publisher={&o51} Bib.paper.author.lastname={&o71,&206} Database Management Systems, R. Ramakrishnan &243 “Victor” last lastname first &206 “Vianu” 122 133 43 XQuery Emerging standard for querying XML documents. Basic form: FOR <variables ranging over sets of elements> WHERE <condition> RETURN <set of elements>; Sets of elements described by paths, consisting of: 1. URL, if necessary. 2. Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = “start at any BAR node and go to a NAME child.” 3. Ending condition of the form [<condition about subelements, @attributes, and values>] Database Management Systems, R. Ramakrishnan 44 XQuery Overview: FOR-LET-WHERE-ORDERBY-RETURN = FLWOR FOR/LET Clauses List of tuples WHERE Clause List of tuples ORDERBY/RETURN Clause Database Management Systems, R. Ramakrishnan Instance of Xquery data model 45 XQuery FOR $x in expr -- binds $x to each value in the list expr LET $x = expr -- binds $x to the entire list expr • Useful for common subexpressions and for aggregations Database Management Systems, R. Ramakrishnan 46 FOR v.s. LET FOR $x IN document("bib.xml")/bib/book RETURN <result> $x </result> LET $x IN document("bib.xml")/bib/book RETURN <result> $x </result> Database Management Systems, R. Ramakrishnan Returns: <result> <book>...</book></result> <result> <book>...</book></result> <result> <book>...</book></result> ... Returns: <result> <book>...</book> <book>...</book> <book>...</book> ... </result> 47 XQuery Find all book titles published after 1995: FOR $x IN document("bib.xml")/bib/book WHERE $x/year > 1995 RETURN $x/title Result: <title> abc </title> <title> def </title> <title> ghi </title> Database Management Systems, R. Ramakrishnan 48 XQuery For each author of a book by Morgan Kaufmann, list all books s/he published: FOR $a IN distinct(document("bib.xml") /bib/book[publisher=“Morgan Kaufmann”]/author) RETURN <result> $a, FOR $t IN /bib/book[author=$a]/title RETURN $t </result> distinct = a function that eliminates duplicates Database Management Systems, R. Ramakrishnan 49 XQuery Result: <result> <author>Jones</author> <title> abc </title> <title> def </title> </result> <result> <author> Smith </author> <title> ghi </title> </result> Database Management Systems, R. Ramakrishnan 50 XQuery <big_publishers> FOR $p IN distinct(document("bib.xml")//publisher) LET $b := document("bib.xml")/book[publisher = $p] WHERE count($b) > 100 RETURN $p </big_publishers> count = a (aggregate) function that returns the number of elms Database Management Systems, R. Ramakrishnan 51 XQuery Find books whose price is larger than average: LET $a=avg(document("bib.xml")/bib/book/price) FOR $b in document("bib.xml")/bib/book WHERE $b/price > $a RETURN $b Database Management Systems, R. Ramakrishnan 52 Examples for XQuery queries FOR $x IN doc(www.company.com/info.xml) //employee [employeeSalary gt 70000]/employeeName RETURN <res> $x/firstName, $x/lastName </res> FOR $x IN doc(www.company.com/info.xml)/company/employee WHERE $x/employeeSalary gt 70000 RETURN <res> $x/employeeName/firstName, $x/employeeName/lastName </res> FOR $x IN doc(www.company.com/info.xml)/company /project [projectNumber = 5]/projectWorker, $y IN doc(www.company.com/info.xml)/company/employee WHERE $x/hours gt 20.0 AND $y.ssn = $x.ssn RETURN <res> $x/EmployeeName/firstName, $y/employeeName/lastName, $x/hours </res> Database Management Systems, R. Ramakrishnan 53