COMS E6125 Web-enHanced Information Management (WHIM)
Prof. Gail Kaiser
Spring 2011
1 February 2011 Kaiser: COMS E6125 1

Today’s Topics
• History of markup languages
  – HTML = HyperText Markup Language
  – XML = eXtensible Markup Language
• Document Structure Definition
  – Document Type Definition (DTD)
  – XML Schema (XSD)
• Processing XML

What is Markup?
• Special text (“mark”) that is added to the regular text of a document in order to convey some information about it
• A markup language is a formalized way of providing markup, and specifies:
  – what markup is allowed (the lexicon)
  – what markup is required
  – how markup is distinguished from content text
  – what the markup “means”

Specific Coding
• Historically, electronic manuscripts contained procedural control codes (markup) that caused the text to be formatted in a particular way
  – tj6
  – troff
  – TeX

Procedural Markup
• Advantages:
  – Instructs agent how to process text
  – Generally concerned with formatting and presentation
  – Is “efficient” because it requires little further interpretation
• Disadvantages:
  – Often specific to one proprietary processing system
  – Usually ties a document to a single purpose
    • printing on paper
    • viewing on a screen
  – Provides no information on “meaning”

Example Specific Coding
.SK 1
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:
.TB 4    (TaB stop)
.OF 4    (OFfset)
.SK 1
1.#Separating the logical elements of the document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be performed on those elements.
.OF 0
.SK 1    (SKipping vertical space)

Generic Coding
• In contrast, generic (or generalized, or descriptive) coding uses descriptive tags (e.g., “heading”)
  – Scribe
  – LaTeX
  – HTML

Descriptive Markup
• Advantages:
  – Is (usually) human and machine readable
  – Identifies information content, the logical components of a document
  – Is not directed towards a particular purpose or rendition of the document, therefore can be non-proprietary
• Disadvantages:
  – Generally concerned with what text is
  – Does not specify what procedures are to be applied to text
  – Therefore requires that other process(es) supply formatting and presentation

Example Generic Coding
<p>
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called <em>markup</em>, serves two purposes:
<ol>
<li>Separating the logical elements of the document; and
<li>Specifying the processing functions to be performed on those elements.
</ol>

Who Invented Markup?
• Specialized markup: ???
• Generalized markup:
  – Many credit William Tunnicliffe, chairman of the Graphic Communications Association Composition Committee, who presented a talk on the separation of information content of documents from their format during a meeting at the Canadian Government Printing Office, September 1967
  – Others credit Stanley Rice, a New York book designer, who proposed the idea of a universal catalog of parameterized editorial structure macros in several articles, e.g., "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, ANSI, March 17, 1970

An Early Implementation
• At IBM in 1969, Charles Goldfarb, Ed Mosher and Ray Lorie invented Generalized Markup Language (GML) as part of an office automation project integrating text editing with information retrieval and page composition
• Instead of a simple tagging scheme, GML introduced the concept of a formally-defined document type (DTD = Document Type Definition) with an explicit nested element structure
• By 1971 they had developed the first DTD, for the manuals for IBM's “Telecommunications Access Method”, which enabled all the headings of a given head level to be automatically formatted identically
• Productized in 1973 in IBM’s Document Composition Facility (DCF)

Example GML
:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
as well as simple structures.
:p.Markup minimization (later generalized and formalized in SGML), allowed the end-tags to be omitted for the "h1" and "p" elements.

SGML = Standard GML
• Standardization effort started in 1978, when ANSI (American National Standards Institute) created the Computer Languages for the Processing of Text committee
• Series of draft standards 1980-1986 (1983 version adopted by IRS and DoD)
• ISO (International Organization for Standardization) joined the ANSI effort in 1984
• International standard in 1986, based in part on an SGML system developed by Anders Berglund, then of the European Particle Physics Laboratory (CERN)
• Hmm… isn’t CERN where Tim Berners-Lee invented the “World Wide Web” in 1989?

SGML
• A metalanguage (grammar)
• How to write tags, how to define the document structure
• Structural paradigm is that of
  – an inverted tree structure, a root component branching out into leaves
  – or a series of nested containers
• Defines three kinds of objects
  – Elements are the basic structural components
  – Attributes are qualities of elements
  – Entities are a short representation of special characters

SGML Pro and Con
• Advantages:
  – Documents held in a standards-based, non-proprietary, platform-independent storage format
  – Scope for document re-use and re-presentation, enhancement of retrieval possibilities
  – Easy to process
  – Can (optionally) validate against DTDs
• Disadvantages:
  – Remained a niche market in the 1980s
  – Not well supported by the major document processing vendors, tools expensive

Then Came the Web…
• HyperText Markup Language (HTML) is derived from SGML
• As an SGML-compliant language, it has a DTD with a fixed set of tags
• Initially, the number of tags was very limited (~10), and they were very easy to remember and to use

HTML Example
• From original IETF Internet Draft (1993) for HTML
See <A HREF="http://info.cern.ch/">CERN</A>'s information for more details.
A <A NAME=serious>serious</A> crime is one which is associated with imprisonment.
The Organization may refuse employment to anyone convicted of a <a href="#serious">serious</A> crime.
Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This must be done by a qualified technician.
<A HREF="Go"><IMG SRC="Button"> Press to start</A>

HTML Pro and Con
• Advantages
  – Simple to learn and to use
  – Easy to create from scratch or by converting legacy text files
  – Easy to parse and render
• Drawbacks
  – Syntaxless
  – Much more a presentation language than a structural language
  – Too limited, not a good substitute for a word processor

HTML History
• 1990: First implementation by Tim Berners-Lee on a NeXT computer at CERN, using SGML tools to create the original HTML language (DTD, parser)
• 1991-1992: Various text-only and graphical browsers developed, the latter usually platform-specific
• 1993: NCSA Mosaic
  – First widely available graphical WWW browser (Unix X-Windows and Mac)
  – Developed primarily by UIUC undergraduate Marc Andreessen
  – The killer app of the Internet is born and the number of Web servers explodes
• 1994: Competition
  – Mosaic team leaves NCSA to found Netscape
  – Microsoft adopts the Web (Internet Explorer bundled with Windows 95)
  – Divergence of supported HTML tags between Internet Explorer and Netscape -> browser wars
• 1994-1995: HTML 2.0 adds image maps, forms
• 1995 and beyond: Commercial websites
  – Java development started (as “Oak”) for programming set-top boxes in 1991, BIG FAILURE, but launched on Web in March 1995 (in HotJava) and May 1995 (in Netscape), BIG SUCCESS
  – Amazon.com opens in July 1995
  – “dot com” era begins (and soon ends)
• 1997: HTML 3.2 and HTML 4.0 add tables, applets, text flow around images, superscripts and subscripts, frames, cascading style sheets, more multimedia
options, scripting languages, web accessibility conventions, internationalization, …

XHTML = eXtensible HyperText Markup Language
• Jan 2000: XHTML 1.0 W3C Recommendation
• Made element and attribute names case-sensitive (in particular, use lowercase)
• Include end tags, e.g., <p> … </p>
• Add a “/” to empty elements, e.g., <br/> and <hr/>
• Quote all attribute values, e.g., <img src="duck.jpg" alt="A Duck"/>
• Most browsers still work fine with older HTML

Where did the “X” come from?
• XML = eXtensible Markup Language
• XHTML is a reformulation of HTML 4.x in XML
• XHTML can be used in conjunction with other XML vocabularies
  – SMIL (Synchronized Multimedia Integration Language)
  – SVG (Scalable Vector Graphics)
  – MathML (Mathematical Markup Language)
  – Plus hundreds dedicated to specific applications (the extensible part)

What is XML for?
• The “universal” markup format for structured documents and data on the Web
• For data exchange (messages) and persistent data
• Syntax, Data Modeling, Data Processing
• Conceptually an SGML descendant
• Unlike SGML, it quickly became widespread

SGML->XML
• Like SGML, XML is a grammar (or a metalanguage), NOT a specific language
• Relatively simple specification
• Parsing made simpler through two-level mechanism
  – Well-formed
  – Valid

Well-Formed
• (Optionally) starts with XML declaration
    <?xml version="1.0"?>
• Rest of document inside the root element
    <myroot>…</myroot>
• All text contained in some element
    <someelement>text text text</someelement>
• Explicit “empty” elements
    <anotherelement></anotherelement>
    <anotherelement/>
• Element tags must be properly nested (no crossing tags)
    NO <i><b>blah blah blah</i></b>
• Start and end tags must match exactly (same case)
• Quotes placed around all attribute values
    <a href="stuff.html">stuff</a>
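
The well-formedness rules listed above are exactly what a non-validating XML parser checks. As a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module (a choice made here for illustration, not something the lecture prescribes), the hypothetical helper `is_well_formed` below simply reports whether a string parses. The sample documents are taken from, or modelled on, the examples on the slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML (hypothetical helper for illustration)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser rejects crossing tags, case mismatches, unquoted attributes, etc.
        return False


# Sample documents modelled on the rules above (names are illustrative).
samples = {
    "nested": "<myroot><i><b>blah blah blah</b></i></myroot>",
    "crossing": "<myroot><i><b>blah blah blah</i></b></myroot>",
    "empty": "<anotherelement/>",
    "case_mismatch": "<P>text</p>",
    "unquoted_attr": "<a href=stuff.html>stuff</a>",
}

results = {name: is_well_formed(doc) for name, doc in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'NOT well-formed'}")
```

Note that this checks well-formedness only; `ElementTree` does not validate a document against a DTD or schema.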

Valid
• Well-formed, plus
• Conforms to a DTD or Schema
  – tags and attributes are all declared
  – tags and attributes are used correctly
• XML browsers and editors usually require validity
• Other tools might not (e.g., search engines)

XML Goes Beyond Document Processing
• XML more oriented to distributed computing than to document markup
• Thus complements rather than replaces HTML (or XHTML)
• DOM = Document Object Model
• SAX = Simple API for XML
• SOAP = Simple Object Access Protocol
• Web Services

XML Anatomy
<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra</author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>
[Figure callouts: element, element name, attribute name, attribute value (attributes cannot contain elements), element content, character content, number content, empty element]

Perspectives on XML
• Document (SGML) Community
  – data = linear text documents
  – markup (annotate) text to describe context, structure, semantics
• Database Community
  – prominent example of the semi-structured data model
  – captures the whole spectrum from highly structured, regular data to unstructured data
“XML is the cure for your data exchange, information integration, e-commerce, … problems” (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)

Identifying Vocabularies
• My element may not be your element:
  – geometry context: <element>line</element>
  – chemistry context: <element>oxygen</element>
• An XML Schema defines a vocabulary of names of type definitions, element and attribute declarations
• Use XML Namespaces to identify which vocabulary
  – Simple method for qualifying element
and attribute names used in XML documents
  – Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules

Namespace Scoping
• XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace
• The declaration is in scope for the element containing the attribute and all its descendants
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
  <html:head>
    <html:title>Frobnostication</html:title>
  </html:head>
  <html:body>
    <html:p>Moved to <html:a href='http://frob.example.com'>here.</html:a></html:p>
  </html:body>
</html:html>

Namespace Defaulting
<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>Frobnostication</title>
  </head>
  <body>
    <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
  </body>
</html>

Multiple Namespaces
All element types are prefixed
<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>

Nested Scoping
<?xml version="1.1"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace for some commentary -->
    <p xmlns='urn:w3-org-ns:HTML'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>

How to Define the Actual Namespace
• W3C namespace specification doesn’t say (!)
• A namespace doesn’t actually have to exist as a physical or conceptual entity
• All that is needed is a qualifier (the XML namespace URI) that, in combination with an element type or attribute name, creates a universal (and universally unique) name
• In other words, there doesn’t actually have to be a definition or anything else at that URI

Pure XML - Instance Model
• XML 1.0 implicit data model:
  – nested containers ("boxes within boxes")
  – labeled ordered trees (= semistructured data model)
  – Relational or object-oriented data are easy to encode
<A>
  <B>foo</B>
  <C>bar</C>
  <C>psl</C>
</A>
[Figure: the same data drawn as nested boxes (A: B: "foo", C: "bar", C: "psl") and as a tree with root A and children B "foo", C "bar", C "psl"; children are ordered]

XML + Namespaces
• Allows mixing of different tag vocabularies
• Only identifies the vocabulary (lexicon)
• Additional mechanisms required for structure and meaning (or at least metadata) of tags
  – explicit data model

From Documents to Data
• We want to be able to
  – Extract the element structure of a document
  – Re-use this structure for other similar documents
  – Share structure and metadata with others
  – Automate processing of this structure and metadata
<memo importance='medium' date='2011-01-30'>
  <from>Gail Kaiser</from>
  <to>Pranith Ramamurthy</to>
  <subject>whim this week</subject>
  <body>Bring blue books for a surprise quiz!</body>
</memo>
<invoice>
  <orderDate>2010-12-01</orderDate>
  <shipDate>2010-12-26</shipDate>
  <billingAddress>
    <name>Gail Kaiser</name>
    <street>500 West 120th Street</street>
    <city>New York</city>
    <state>NY</state>
    <zip>10027</zip>
  </billingAddress>
  <voice>212-555-1234</voice>
  <fax>212-555-4321</fax>
</invoice>

Adding Structure and Semantics
• A Document Structure Description (DSD) defines the syntax of XML documents for a particular application domain
• Defines the grammar for an XML-based markup language
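
Two points from the namespace and instance-model slides above can be seen directly in a parser's output: children of an element are ordered, and a namespace URI combined with a local name yields one universal name. The sketch below assumes Python's standard `xml.etree.ElementTree` module (an illustrative choice, not the lecture's), which spells universal names as `{namespace-URI}local-name`. The sample documents are shortened copies of the `<A>` and `book` examples from the slides.

```python
import xml.etree.ElementTree as ET

# Instance model: the children of an element form an ordered list.
tree = ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>")
children = [(child.tag, child.text) for child in tree]
print(children)

# Universal names: the default namespace applies to <book> and <title>,
# while the isbn: prefix selects a second namespace for <number>.
book = ET.fromstring(
    "<book xmlns='urn:loc.gov:books' xmlns:isbn='urn:ISBN:0-395-36341-6'>"
    "<title>Cheaper by the Dozen</title>"
    "<isbn:number>1568491379</isbn:number>"
    "</book>"
)
names = [child.tag for child in book]
print(book.tag)
print(names)
```

The parser never tries to fetch anything from the namespace URIs; they act purely as qualifiers, which matches the point that nothing has to exist at that URI.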

Processing XML
• Non-validating parser:
  – checks that XML doc is syntactically well-formed, e.g., all open-tags have matching close-tags and they are properly nested, attributes only appear once in an element, etc.
• Validating parser:
  – checks that XML doc is also valid with respect to a given DSD (usually XML Schema)

Using DSD Validators
• A DSD processor can be useful both on the server side (when writing XML documents) and on the client side (when processing XML documents):
  – Checking validity (conformance) of XML documents
  – Performing default insertion (inserts missing fragments)

DSD Processing
[Figure]

Several Proposed DSDs
• XML Document Type Definitions (DTDs):
  – Define the structure of “allowed” documents
  – Database schema
  – Non-XML syntax
• XML Schemas (XSDs):
  – Define structure and data types
  – Allow developers to build their own libraries of interchangeable data types
  – Written in an XML vocabulary
• Others (e.g., RELAX NG, DSD)

XML Schema Design Principles
1. More expressive than DTDs (from SGML)
2. Notation is itself an XML vocabulary
3. Self-describing
4. Usable by a wide variety of applications that employ XML
5. Straightforwardly usable on the Internet
6. Optimized for interoperability
7. Simple enough to be implemented with modest design and runtime resources
8. Coordinated with relevant W3C specs

Purpose of an XML Schema
• Defines a class of XML instances
• Neither instances nor schemas need exist as documents per se; they may exist as:
  – Byte stream sent between applications
  – Fields in a database record
  – Collection of XML “infoset” information items

What is an XML “infoset”?
• XML Information Set, 2nd edition, W3C Recommendation February 2004
• For use by other specs that need to refer to the information in a well-formed XML document [or PSVI = post-schema-validation infoset]
• Defines abstract data set generated by parser or by other means, conceptually a tree of items each with several properties

(Some) Information Items
• Document (root of infoset)
  – properties include base URI, XML version, character encoding, etc.
• One root element and its children
• Attributes of elements
• Namespace scoping for elements
• Processing instructions
• Unexpanded entities (processor may or may not expand all entities)

Example Instance Document (file po.xml)
<?xml version="1.0"?>
<purchaseOrder orderDate="2008-08-20">
  <shipTo country="US">
    <name>Robert Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Alice Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      . . .
    </item>
  </items>
</purchaseOrder>

Where is the Schema?
• The instance document may reference a schema explicitly, or a processor may obtain a schema separately without reference from the instance
• Schema defines elements and attributes, and their complex and simple types
• Determines the appearance of elements and their content in instance documents

Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    . . .
  </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType">
    . . .
  </xsd:complexType>
</xsd:schema>
• The schema consists of a schema element and various subelements, e.g., element, complexType
• The prefix xsd: associates names with the XML Schema namespace specified in the xmlns:xsd declaration
• Same prefix, and hence same association, also appears on names of built-in types, e.g., xsd:string
• Identifies elements and simple types as belonging to XML Schema language vocabulary rather than vocabulary of schema author

Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    . . .
  </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType">
    . . .
  </xsd:complexType>
</xsd:schema>
• An annotation element may appear at the beginning of most schema constructions
• Contains two subelements
  – documentation: Human readable material
  – appinfo: For tools and applications

Complex Type Definitions
<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>
• New complex types are defined using the complexType element; it contains element declarations, attribute declarations and element references
• This example says elements of type USAddress must have
  – 5 subelements that must be called name, street, city, state and zip (in this order), each having the corresponding type declared above
  – 1 attribute called country may appear with
the element; NMTOKEN represents an atomic indivisible value
• All element declarations within USAddress involve simple types

Complex Type Definitions
<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>
• An attribute may be specified as fixed or default.
• Default attribute values apply when attributes are missing.
• For fixed attributes, if a value appears, it must equal the declared fixed value.
• The schema processor will provide the value for missing attributes.

Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• A declaration may reference an existing element, e.g., comment; the value of the ref attribute must reference a global element (i.e., declared under schema)
• Every element of type PurchaseOrderType must consist of subelements shipTo and billTo, each containing the five subelements declared as part of USAddress, items and (optionally) comment; it may have one attribute called orderDate

Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Occurrence constraint may specify
minOccurs and/or maxOccurs

Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Attributes may appear once or not at all (the default), but no more than once
• use may be specified as optional, required, or prohibited

Simple Built-in Types
• string, normalizedString, token
• byte, unsignedByte
• integer, positiveInteger, etc
• long, short, etc
• decimal, float, double
• boolean
• time, dateTime, duration, date, etc
• anyURI
• etc
• Types that should only be used in attributes (to retain compatibility with XML 1.0 DTDs):
  – ID
  – IDREF, IDREFS
  – ENTITY, ENTITIES
  – NMTOKEN, NMTOKENS

Simple Derived Types
<xsd:simpleType name="myInteger">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="10000"/>
    <xsd:maxInclusive value="99999"/>
  </xsd:restriction>
</xsd:simpleType>
• The simpleType element is used to define and name a new simple type
• The restriction element indicates the base type and identifies the “facets” that constrain the range of values (here minInclusive and maxInclusive)

Simple Derived Types (pattern facet)
<!-- Stock Keeping Unit, a code for identifying products -->
<xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>
• Constrain the values of SKU using the pattern facet in conjunction with the regular expression "\d{3}-[A-Z]{2}" (3 digits followed by a hyphen followed by 2 upper-case ASCII letters)

Simple Derived Types (enumeration facet)
<xsd:simpleType name="USState">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="AK"/>
    <xsd:enumeration value="AL"/>
    <xsd:enumeration value="AR"/>
    <!-- and so on ... -->
  </xsd:restriction>
</xsd:simpleType>
• The enumeration facet limits a simple type to a set of distinct values
• Enables a better definition of USAddress type
<xsd:complexType name="USAddress">
  . . .
  <xsd:element name="state" type="USState"/>
  . . .
</xsd:complexType>

Anonymous Type Definitions
<xsd:complexType name="Items">
  <xsd:sequence>
    <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="productName" type="xsd:string"/>
          <xsd:element name="quantity">
            <xsd:simpleType>
              <xsd:restriction base="xsd:positiveInteger">
                <xsd:maxExclusive value="100"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:element>
          <xsd:element name="USPrice" type="xsd:decimal"/>
          <xsd:element ref="comment" minOccurs="0"/>
          <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
        </xsd:sequence>
        <xsd:attribute name="partNum" type="SKU" use="required"/>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>

Recap Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    . . .
  </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:complexType name="USAddress">
    . . .
  </xsd:complexType>
  <xsd:complexType name="Items">
    . . .
  </xsd:complexType>
</xsd:schema>

XML Schema Data Types
• Complex types
• Built-in simple types
• Derived simple types
• Also derived complex types, lists and unions of simple types
Define structure – what about the content?

Element Content: Simple content
• Declare an element that has an attribute and contains a simple value
<internationalPrice currency="EUR">423.46</internationalPrice>
<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>

Element Content: Empty content
• Declare an element with attributes only, no content at all
<internationalPrice currency="EUR" value="423.46"/>
<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:attribute name="currency" type="xsd:string"/>
    <xsd:attribute name="value" type="xsd:decimal"/>
  </xsd:complexType>
</xsd:element>

Element Content: Entire element omitted
• The absence of an element does not carry any particular meaning; it could be
  – Information unknown
  – Information not applicable
  – I just forgot to enter the information
• Absence does/should not imply some value like zero, empty string, empty list, etc.
• Database systems faced with similar problems have introduced “null” values
• XML does not provide a null value representation that actually appears in element content; instead, there is an attribute to indicate content is nil
<xsd:element name="shipDate" type="shipDateType" nillable="true"/>
<shipDate xsi:nil="true"></shipDate>

Element Content: Mixed content
<letterBody>
  <salutation>Dear Mr.<name>Stanley Steamer</name>.</salutation>
  Your order of <quantity>1</quantity> <productName>Baby Monitor</productName>
  shipped from our warehouse on <shipDate>2008-12-26</shipDate>. ....
</letterBody>
• Text appears between the elements salutation, quantity, productName, and shipDate (all children of letterBody)
• To allow this, the mixed attribute of the parent’s complexType must be set to true

Element Content: Mixed content
<xsd:element name="letterBody">
  <xsd:complexType mixed="true">
    <xsd:sequence>
      <xsd:element name="salutation">
        <xsd:complexType mixed="true">
          <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="quantity" type="xsd:positiveInteger"/>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
      <!-- etc. -->
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
• The order and number of child elements appearing in an instance must agree with order/number of child elements specified in the content model

Element Content: anyType
• The anyType type does not constrain its content in any way
    <xsd:element name="anything" type="xsd:anyType"/>
• When no type is defined, anyType is the default, so could be written as
    <xsd:element name="anything"/>

Other XML Schema Topics
• There’s lots more…

Drawbacks of XML Schemas
• Another vocabulary to learn
• Verbose (like XML itself)
• Many constraints cannot be expressed (without adding separate stylesheet or code)
<Demo xmlns="http://www.demo.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.demo.org demo.xsd">
  <A>10</A>
  <B>20</B>
</Demo>
• Can constrain: the Demo element contains a sequence of elements A followed by B; the A element contains an integer; the B element contains an integer
• Can’t constrain: A>B

Processing XML
• Tree representation:
  – Document Object Model (DOM) API
  – Cursor APIs, e.g., .NET’s XPathNavigator, Java StAX
• Stream of events representation:
  – Push Model,
e.g., Simple API for XML (SAX)
  – Pull Model, e.g., Common API for XML Pull Parsing (XmlPull)
• Others

Document Object Model
• Object-oriented approach to traversing the XML document as a tree
• Typically loads the entire XML document into memory (random access but memory intensive)
• Provides mechanisms for loading, saving, accessing, querying, modifying and deleting nodes from an XML document

DOM API
• Hierarchy of Node objects mapping to XML concepts: document, element, attribute, processing instruction, comment, …
• Language-independent API:
  – get first/last child, previous/next sibling, set of nodes
  – insert before/after, replace
  – getElementsByTagName
• W3C DOM offers fairly limited functionality, so implementations often add helper method extensions

Push Model
• XML producer (typically an XML parser) controls the pace of the application and informs the XML consumer when certain events occur (e.g., reports events when encountering begin/end tags)
• XML consumer registers callbacks with the producer, which invokes the callbacks as various parts of the XML document are seen (as events are reported)
• Does not necessarily build a parse tree

Push Model Pro
• The entire XML document does not need to be stored in memory, only the information about the node currently being processed
• This makes it possible to process large XML documents without incurring massive memory costs
• Can also process XML streams whose contents arrive over time
• Allows consumer to ignore less interesting data

Push Model Con
• Certain context and state information such as the parents of the current node or its depth in the XML tree must be tracked by the programmer
• Limited expressive power (query/update) when working on streams
• To register callbacks one needs to create a class devoted to handling
events from the producer
• Many developers find callbacks to be an unintuitive way to control program flow

Pull Model
• XML consumer controls the program flow by requesting events from the XML producer as needed
• Operates in a forward-only, streaming fashion while only showing information about a single node at any given time
• Programmer creates a loop that continually reads from the XML document until the end of the document is reached, but acts solely on items of interest as they are seen

Pull Model Comparison
• As memory efficient as push model processing but with a more familiar programming model
• Does not require a specialized XML-handling class that implements specific interfaces or subclasses certain classes in order to register callbacks
• The need to explicitly track application states using boolean flags and similar variables is significantly reduced

XML Cursors
• Cursor acts like a lens that focuses on one XML node at a time, but, unlike pull-based or push-based APIs, the cursor can be positioned anywhere along the XML document at any given time
• Allows one to navigate, query, and manipulate an XML document loaded in memory
• Does not require the heavyweight interface of a traditional tree model API, where every significant token in the underlying XML must map to an object
• Can create XML views of non-XML data

Other Alternatives
• Object to XML Mapping APIs
  – Represent nodes and text as classes and programming language primitives
  – Cannot represent all XML information with full fidelity, e.g., lose processing instructions and comments, element ordering
  – Impedance mismatches between XML Schema and object-oriented concepts
• XML-specific languages
  – XQuery, XSLT, …

Summary
• Content intended for humans usually formatted using HTML
• Content intended for machine
processing (other than rendering) usually formatted using XML
• Humans (and some browsers) are forgiving, other machine processing is not

Second Assignment: Paper Outline
• Plan your paper
• Pretend your reader is another student in this class, rather than the teaching staff
• It is “ok” to switch topics from your original proposal, but clarify that you’re doing so

Second Assignment: It’s All In The Details
• Each full paper must have a title, abstract (approx. 100 words), introduction, several body sections, conclusion, references list
• Figure out what will be in the body sections: drill down to subsections (or possibly even subsubsections)
• Consider the point of view you will portray in your introduction and conclusion
• Motivate your reading

Second Assignment: Logistics
• Due Tuesday February 15th by 10am
• Maximum four pages (not including optional figures and required reference list)
• Submit by posting in Paper Outlines folder on CourseWorks – contents of this folder are visible only to teaching staff
• Must be in a format I can read, which means PDF, Word, HTML, or plain ASCII text (with all figures embedded or viewable in a browser without special “plugins”)

Upcoming Assignments: Paper
• Full paper due Tuesday March 9th

Second’ Assignment: Student Presentations
• Individual ~10 minute talk in class (plus ~5 minutes Q&A, discussion)
• Schedule has been assigned (see Syllabus)
• One paragraph Presentation Proposal, due Tuesday February 15th
• May be based on paper, project, or some other topic
• Post in Presentation Proposals folder on CourseWorks

Heads Up on Project
• Project Proposal due Tuesday March 9th
• Optionally work in teams (see http://bank.cs.columbia.edu/classes/cs6125/team_advice.htm)
• Build a new system or extend an
existing system
• OR evaluate/compare one or more existing system(s)
• You may "continue" your paper topic towards the project, or do something entirely different

COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2011
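
Supplementary example for the Processing XML slides: a minimal Python sketch, assuming the standard-library modules xml.dom.minidom, xml.sax, and xml.dom.pulldom as stand-ins for the DOM, push, and pull models respectively. The slides do not name these modules; the sample document and the names DOC, NameHandler, and names_with_* are hypothetical and chosen only for illustration. Each function extracts the same productName values so the three programming styles can be compared side by side.

```python
import xml.dom.minidom
import xml.sax
from xml.dom import pulldom

# Hypothetical sample document (not from the slides).
DOC = (
    "<order>"
    "<item><productName>Baby Monitor</productName><quantity>1</quantity></item>"
    "<item><productName>Lawnmower</productName><quantity>2</quantity></item>"
    "</order>"
)


def _text(node):
    """Concatenate the text children of a DOM element."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def names_with_dom(text):
    """Tree model: load the whole document, then query it with random access."""
    document = xml.dom.minidom.parseString(text)
    return [_text(n) for n in document.getElementsByTagName("productName")]


class NameHandler(xml.sax.ContentHandler):
    """Push model: the parser invokes these callbacks; state is tracked by hand."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # a list only while inside <productName>

    def startElement(self, name, attrs):
        if name == "productName":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "productName":
            self.names.append("".join(self._buffer))
            self._buffer = None


def names_with_sax(text):
    handler = NameHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def names_with_pull(text):
    """Pull model: the consumer loops over events and acts only on some."""
    names = []
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "productName":
            events.expandNode(node)  # materialize just this element's subtree
            names.append(_text(node))
    return names
```

Note how the push version needs a dedicated handler class and an explicit buffer to remember whether the parser is inside productName, while the pull version is an ordinary loop and the DOM version is a single query over an in-memory tree. This mirrors the trade-offs listed on the Push Model Con and Pull Model Comparison slides.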