XML
SNU OOPSLA Lab.
October 2005

Contents
  Semistructured Data
  Introduction
  History
  XML Application
  DTD & XML Schema
  DOM & SAX
  Summary
  Online Resources

Semistructured Data(1/3)
  Semistructured Data and XML
    Integration of heterogeneous sources
    Data sources with non-rigid structure
      Biological data
      Web data
  Characteristics of Semistructured Data
    Missing or additional attributes
    Multiple attributes
    Different types in different objects
    Heterogeneous collections
  Self-describing, irregular data, no a priori structure

Semistructured Data(2/3)
  Data Model: Object Exchange Model (OEM)
  [Figure: OEM graph. The root Bib (&o1) is a complex object whose paper and book edges lead to &o12, &o24 and &o29; further edges (references, author, title, year, publisher, firstname, lastname, first, last) lead down to atomic objects such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122 and 133.]

Semistructured Data(3/3)
  Syntax for Semistructured Data

    Bib: &o1 {
      paper: &o12 { … },
      book: &o24 { … },
      paper: &o29 {
        author: &o52 "Abiteboul",
        author: &o96 { firstname: &243 "Victor", lastname: &o206 "Vianu" },
        title: &o93 "Regular path queries with constraints",
        references: &o12,
        references: &o24,
        pages: &o25 { first: &o64 122, last: &o92 133 }
      }
    }

Introduction(1/4)
  XML
    An acronym for 'eXtensible Markup Language'
    A meta-language that describes other languages
    A data format for storing structured and semi-structured text for dissemination and ultimate publication, perhaps on a variety of media

Introduction(2/4)
  Properties
    Tags enclose identifiable parts of the document
    Self-describing
    Physical/logical structure
      Physical structure: allows components of the document, called entities, to be named and stored separately
      Logical structure: allows a document to be divided into named units and sub-units, called elements

Introduction(3/4)
  [Figure: logical structure (a document divided into units and sub-units, the elements) beside physical structure (internal and separate entities).]

Introduction(4/4)
  XML markup

    <warning>
      <para>This substance is hazardous to health</para>
      <para>See procedure 12A.7 for information on protective clothing required.</para>
      <logo …/>
    </warning>

    <transaction>
      <time date="19980509"/>
      <amount>123</amount>
      <currency type="pounds"/>
      <from id="x98765">J. Smith</from>
      <to id="x56565">M. Jones</to>
    </transaction>

  XML document

History(1/2)
  [Figure: markup-language timeline, with the Internet and the WWW as context]
    1960  GM
    1986  SGML
    1992  HTML (WWW)
    1997  XML
  GM = Generalized Markup

History(2/2)
  1960s: IBM GML (Generalized Markup Language)
  1980s: ISO 8879, SGML (Standard Generalized Markup Language)
  Early 1990s: HTML (HyperText Markup Language)
  1996: W3C's XML
  1998: XML 1.0
  1999: RDF (Resource Description Framework)

Application
  [Figure: data flow among XML, DTD, Parser, SAX events, DOM tree and DOM API, XSL Processor, HTML, Browser, DBMS, and applications (ASP, Java, VB)]
  DOM (Document Object Model)
  SAX (Simple API for XML)
  XSL (eXtensible Stylesheet Language)
  ASP (Active Server Pages)
  Data exchange applications

An XML Document

    <?xml version="1.0"?>
    <!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
    <sigmodRecord>
      <issue>
        <volume>1</volume>
        <number>1</number>
        <articles>
          <article>
            <title>XML Research Issues</title>
            <initPage>1</initPage>
            <endPage>5</endPage>
            <authors>
              <author AuthorPosition="00">Tom Hanks</author>
              …
            </authors>
          </article>
        </articles>
      </issue>
    </sigmodRecord>

DTD(1/2)
  DTD (Document Type Definition)
    An optional but powerful feature of XML
    Comprises a set of declarations that define a document structure tree
    Some XML processors read the DTD and use it to build the document model in memory
    Establishes formal document structure rules
    Defines the elements and dictates where they may be applied in relation to each other

DTD(2/2)
  Declare vs. Define
    Declare: "This document is a concert poster"
    Define: "A concert poster must have the following features"
    A DTD defines element types, attributes, and entities
  Valid vs. Invalid
    Valid: conforms to the DTD
    Invalid: fails to conform to the DTD
  [Figure: well-formed XML documents and, within them, valid XML documents]

XML Schema
  Schema
    W3C standard: specifies the structure of XML documents
    Data types for elements/attributes
      String, int, float
    Unordered sets are also allowed
    Derivation of types is allowed
  Replaces DTDs
    Removes the syntactic distinction between DTD and XML (a schema is itself written in XML)
    Richer types compared to DTD

XML Schema Example

    <xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="initPage" type="xsd:string"/>
          <xsd:element name="endPage" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>

  DTD

    <!ELEMENT article (title,initPage,endPage,author)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT initPage (#PCDATA)>
    <!ELEMENT endPage (#PCDATA)>
    <!ELEMENT author (#PCDATA)>

DOM(1)
  Characteristics
    Hierarchical (tree) object model for XML documents
    Associates a list of children with every node
    Preserves the sequence of the elements in the XML document
  [Figure: DOM tree for the XML document, with nodes sigmodRecord, issue, volume, number, articles, title, initPage, endPage]

DOM(2)
  DOM interfaces
    Node: the base data type of the DOM.
    Element: the vast majority of the objects you'll deal with are Elements.
    Attr: represents an attribute of an element.
    Text: the actual content of an Element or Attr.
    Document: represents the entire XML document.

SAX(1)
  DOM: expensive to materialize for a large XML collection
  Characteristics
    Event-driven: fires an event for every open tag/end tag
    Does not require full parsing
    Enables custom object model building
  [Figure: the application creates a DocumentHandler; as the parser reads the document it gives event-driven feedback by calling startDocument(), startElement(), characters(), endElement() and endDocument() on the handler.]

SAX(2)
  The SAX API actually defines four interfaces for handling events
    EntityResolver
    DTDHandler
    DocumentHandler
    ErrorHandler
  All of these interfaces are implemented by HandlerBase.

DOM vs SAX(1/3)
  Why use DOM?
    Need to know a lot about the structure of a document
    Need to move parts of the document around
    Need to use the information in the document more than once
  Why use SAX?
    Only need to extract a few elements from an XML document

DOM vs SAX(2/3)

    <book id="1">
      <verse>
        Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans. Many a brave soul did it send hurrying down to Hades, and many a hero did it yield a prey to dogs and vultures, for so were the counsels of Jove fulfilled from the day on which the son of Atreus, king of men, and great Achilles, first fell out with one another.
      </verse>
      <verse>
        And which of the gods was it that set them on to quarrel? It was the son of Jove and Leto; for he was angry with the king and sent a pestilence upon ...

  • Doing this with the DOM would take a lot of memory
  • SAX API would be much more efficient

DOM vs SAX(3/3)

    ...
    <address>
      <name>
        <first-name>Mary</first-name>
        <last-name>McGoon</last-name>
      </name>
      <street>1401 Main Street</street>
      <city>Anytown</city>
      <state>NC</state>
      <zip>34829</zip>
    </address>
    <address> <name>….. <street> ….. </address>
    <address> <name>….. <street> ….. </address>

  What if we were parsing an XML document containing 10,000 addresses and wanted to sort them by last name?
    DOM would automatically store all of the data.
    We could use DOM functions to move the nodes in the DOM tree.

Summary
  XML
    eXtensible Markup Language
    A data format for storing structured and semi-structured text
    Physical/logical structure
  DTD & XML Schema
    Establish formal document structure rules
  DOM & SAX API
    DOM: need to know a lot about the structure of a document
    SAX: need to extract a few elements from an XML document

Online Resources
  XML tutorial
    http://www.xml.com
    http://www.w3c.org
    http://www.w3schools.com/
    http://www.xmltraining.com/course-searchxml+online+tutorials
    http://xmlfiles.com/
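
Worked example: DOM vs SAX
  As a companion to the DOM vs SAX comparison in this deck, the sketch below contrasts the two styles on the address data. It is an illustration under stated assumptions, not the deck's own code: it uses Python's standard-library xml.dom.minidom and xml.sax (a SAX2-style ContentHandler rather than the SAX 1.0 DocumentHandler named on the slides), and the <addresses> wrapper element and the second address are hypothetical sample data. The DOM function loads the whole tree and moves nodes to sort by last name; the SAX handler keeps only the few values it is asked to extract.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample: the <addresses> wrapper and the second entry are
# assumptions added for illustration; only Mary McGoon comes from the slides.
ADDRESSES = """<addresses>
  <address>
    <name><first-name>Mary</first-name><last-name>McGoon</last-name></name>
    <city>Anytown</city>
  </address>
  <address>
    <name><first-name>Ann</first-name><last-name>Abel</last-name></name>
    <city>Otherville</city>
  </address>
</addresses>"""


def sort_addresses_with_dom(xml_text):
    """DOM style: hold the whole tree in memory, then move nodes around."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    addresses = [node for node in root.childNodes
                 if node.nodeType == node.ELEMENT_NODE
                 and node.tagName == "address"]

    def last_name(address):
        return address.getElementsByTagName("last-name")[0].firstChild.data

    # appendChild() detaches a node from its current position before
    # re-attaching it, so appending in sorted order reorders the tree.
    for address in sorted(addresses, key=last_name):
        root.appendChild(address)
    return [last_name(a) for a in root.getElementsByTagName("address")]


class LastNameHandler(xml.sax.ContentHandler):
    """SAX style: react to events and keep only the text we care about."""

    def __init__(self):
        super().__init__()
        self.last_names = []
        self._inside = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "last-name":
            self._inside = True
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times for one text run.
        if self._inside:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "last-name":
            self.last_names.append("".join(self._buffer))
            self._inside = False


def extract_last_names_with_sax(xml_text):
    handler = LastNameHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.last_names


if __name__ == "__main__":
    print(sort_addresses_with_dom(ADDRESSES))      # ['Abel', 'McGoon']
    print(extract_last_names_with_sax(ADDRESSES))  # ['McGoon', 'Abel']
```

  The SAX version never builds a tree, so its memory use depends on what the handler stores rather than on document size; sorting still needs every last name, which is why the deck points to DOM for that task.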