Structuring Web Content - Programming Systems Lab

COMS E6125 Web-enHanced
Information Management
(WHIM)
Prof. Gail Kaiser
Spring 2011
1 February 2011
Kaiser: COMS E6125
1
Today’s Topics
• History of markup languages
– HTML = HyperText Markup Language
– XML = eXtensible Markup Language
• Document Structure Definition
– Document Type Definition (DTD)
– XML Schema (XSD)
• Processing XML
What is Markup?
• Special text (“mark”) that is added to the
regular text of a document in order to
convey some information about it
• A markup language is a formalized way of
providing markup, and specifies:
– what markup is allowed (the lexicon)
– what markup is required
– how markup is distinguished from content text
– what the markup “means”
Specific Coding
• Historically, electronic manuscripts
contained procedural control codes
(markup) that caused the text to be
formatted in a particular way
– tj6
– troff
– TeX
Procedural Markup
• Advantages:
– Instructs agent how to process text
– Generally concerned with formatting and
presentation
– Is “efficient” because it requires little further
interpretation
• Disadvantages:
– Often specific to one proprietary processing
system
– Usually ties a document to a single purpose
• printing on paper
• viewing on a screen
– Provides no information on “meaning”
Example Specific Coding
.SK 1 Text processing and word processing systems typically
require additional information to be interspersed among the
natural text of the document being processed. This added
information, called "markup", serves two purposes:
.TB 4
.OF 4
.SK 1
1.#Separating the logical elements of the document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be performed on those
elements.
.OF 0
.SK 1
(Control codes: .SK = SKip vertical space, .TB = TaB stop, .OF = OFfset)
Generic Coding
• In contrast, generic (or generalized,
or descriptive) coding uses
descriptive tags (e.g., “heading”)
– Scribe
– LaTeX
– HTML
Descriptive Markup
• Advantages
– Is (usually) human and machine readable
– Identifies information content, the logical
components of a document
– Is not directed towards a particular purpose or
rendition of the document, therefore can be
non-proprietary
• Disadvantages:
– Generally concerned with what text is
– Does not specify what procedures are to be
applied to text
– Therefore requires that other process(es)
supply formatting and presentation
Example Generic Coding
<p> Text processing and word processing systems
typically require additional information to be
interspersed among the natural text of the
document being processed. This added
information, called <em>markup</em>, serves
two purposes:
<ol>
<li>Separating the logical elements of the
document; and
<li>Specifying the processing functions to be
performed on those elements.
</ol>
Who Invented Markup?
• Specialized markup: ???
• Generalized markup:
– Many credit William Tunnicliffe, chairman of
the Graphic Communications Association
Composition Committee, who presented a talk
on the separation of information content of
documents from their format during a meeting
at the Canadian Government Printing Office,
September 1967
– Others credit Stanley Rice, a New York book
designer, who proposed the idea of a universal
catalog of parameterized editorial structure
macros in several articles, e.g., "Editorial Text
Structures," Memorandum to Standards
Planning and Requirements Committee, ANSI,
March 17, 1970
An Early Implementation
• At IBM in 1969, Charles Goldfarb, Ed Mosher and
Ray Lorie invented Generalized Markup Language
(GML) as part of an office automation project
integrating text editing with information retrieval
and page composition
• Instead of a simple tagging scheme, GML
introduced the concept of a formally-defined
document type (DTD = Document Type Definition)
with an explicit nested element structure
• By 1971 developed first DTD, for the manuals for
IBM's “Telecommunications Access Method”,
which enabled all the headings of a given head-level
to be automatically formatted identically
• Productized in 1973 in IBM’s Document
Composition Facility (DCF)
Example GML
:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
as well as simple structures.
:p.Markup minimization (later generalized and
formalized in SGML), allowed the end-tags to be
omitted for the "h1" and "p" elements.
SGML = Standard GML
• Standardization effort started in 1978, when
ANSI (American National Standards Institute)
created the Computer Languages for the
Processing of Text Committee
• Series of draft standards 1980-1986 (1983
version adopted by IRS and DoD)
• ISO (International Organization for
Standardization) joined ANSI effort in 1984
• International standard in 1986 based in part on an
SGML system developed by Anders Berglund, then
of the European Particle Physics Laboratory
(CERN)
• Hmm… isn’t CERN where Tim Berners-Lee
invented the “World Wide Web” in 1989?
SGML
• A metalanguage (grammar)
• How to write tags, how to define the document
structure
• Structural paradigm is that of
– an inverted tree structure, a root component
branching out into leaves
– or a series of nested containers
• Defines three kinds of objects
– Elements are the basic structural components
– Attributes are qualities of elements
– Entities are a short representation of special
characters
SGML Pro and Con
• Advantages:
– Documents held in a standards-based, non-proprietary,
platform-independent storage format
– Scope for document re-use and re-presentation,
enhancement of retrieval possibilities
– Easy to process
– Can (optionally) validate against DTDs
• Disadvantages:
– Remained a niche market in the 1980s
– Not well supported by the major document processing
vendors, tools expensive
Then Came the Web…
• HyperText Markup Language (HTML)
is derived from SGML
• As an SGML-compliant language, it
has a DTD with a fixed set of tags
• Initially, the number of tags was
very limited (~10), so they were very easy to
remember and to use
HTML Example
• From original IETF Internet Draft
(1993) for HTML
See <A HREF="http://info.cern.ch/">CERN</A>'s
information for more details.
A <A NAME=serious>serious</A> crime is one which is
associated with imprisonment.
The Organization may refuse employment to anyone
convicted of a <a href="#serious">serious</A> crime.
Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This
must be done by a qualified technician.
<A HREF="Go"><IMG SRC="Button"> Press to start</A>
HTML Pro and Con
• Advantages
– Simple to learn and to use
– Easy to create from scratch or by
converting legacy text files
– Easy to parse and render
• Drawbacks
– Syntaxless
– Much more a presentation language than
a structural language
– Too limited, not a good substitute for a
word processor
HTML History
• 1990: First implementation by Tim Berners-Lee (TBL) on a NeXT
computer at CERN, using SGML tools to create
original HTML language (DTD, parser)
• 1991-1992: Various text-only and graphical browsers
developed, latter usually platform-specific
• 1993: NCSA Mosaic
– First widely available graphical WWW browser
(Unix X-Windows and Mac)
– Developed primarily by UIUC undergraduate
Marc Andreessen
– The killer app of the Internet is born and the
number of Web servers explodes
HTML History
• 1994-1995: Competition
– Mosaic team leaves NCSA to found Netscape
– Microsoft adopts the Web (Internet Explorer
bundled with Windows 95)
– Divergence of supported HTML tags between
Internet Explorer and Netscape –> browser
wars
• 1994-1995: HTML 2.0 adds image maps,
forms
HTML History
• 1995 and beyond: Commercial websites
– Java development started (as “Oak”) for
programming set-top boxes in 1991, BIG
FAILURE - but launched on Web in March 1995
(in HotJava) and May 1995 (in Netscape), BIG
SUCCESS
– Amazon.com opens in July 1995
– “dot com” era begins (and soon ends)
• 1997: HTML 3.2 and HTML 4.0 add tables,
applets, text flow around images, superscripts and
subscripts, frames, cascading style sheets, more
multimedia options, scripting languages, web
accessibility conventions, internationalization, …
XHTML = eXtensible
HyperText Markup Language
• Jan 2000: XHTML 1.0 W3C Recommendation
• Made element and attribute names case-sensitive
(in particular, use lowercase)
• Include end tags, e.g., <p> … </p>
• Add a “/” to empty elements, e.g., <br/> and <hr/>
• Quote all attribute values, e.g.,
<img src="duck.jpg" alt="A Duck"/>
• Most browsers still work fine with older HTML
Where did the “X” come from?
• XML = eXtensible Markup Language
• XHTML is a reformulation of HTML 4.x in
XML
• XHTML can be used in conjunction with
other XML vocabularies
– SMIL (Synchronized Multimedia Integration
Language)
– SVG (Scalable Vector Graphics)
– MathML (Mathematical Markup Language)
– Plus hundreds dedicated to specific
applications (the extensible part)
What is XML for?
• The “universal” markup format for
structured documents and data on the
Web
• For data exchange (messages) and
persistent data
• Syntax, Data Modeling, Data Processing
• Conceptually an SGML descendant
• Unlike SGML, it quickly became widespread
SGML->XML
• Like SGML, XML is a grammar (or a
metalanguage), NOT a specific language
• Relatively simple specification
• Parsing made simpler through two-level
mechanism
– Well-formed
– Valid
Well-Formed
• (Optionally) starts with XML declaration
<?xml version="1.0"?>
• Rest of document inside the root element
<myroot>…</myroot>
• All text contained in some element
<someelement>text text text</someelement>
• Explicit “empty” elements
<anotherelement></anotherelement>
<anotherelement/>
• Element tags must be properly nested (no crossing tags)
NO <i><b>blah blah blah</i></b>
• Start and end tags must match exactly (same case)
• Quotes placed around all attribute values
<a href="stuff.html">stuff</a>
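The well-formedness rules above can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (a non-validating parser); the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the lecture.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags inside a single root element.
print(is_well_formed("<myroot><i><b>blah</b></i></myroot>"))
# Crossing tags violate the nesting rule.
print(is_well_formed("<myroot><i><b>blah</i></b></myroot>"))
# Attribute values must be quoted.
print(is_well_formed("<a href=stuff.html>stuff</a>"))
# Start and end tags must match exactly, including case.
print(is_well_formed("<P>text</p>"))
```

Only the first sample is accepted; each of the others breaks one of the rules listed on the slide.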
Valid
• Well-formed, plus
• Conforms to a DTD or Schema
– tags and attributes are all declared
– tags and attributes are used correctly
• XML browsers and editors usually require
validity
• Other tools might not (e.g., search
engines)
XML Goes Beyond
Document Processing
• XML more oriented to distributed computing than to
document markup
• Thus complements rather than replaces HTML (or XHTML)
• DOM = Document Object Model
• SAX = Simple API for XML
• SOAP = Simple Object Access Protocol
• Web Services
XML Anatomy
<bibliography>
<paper ID="goto">
<authors>
<author>Edsger W. Dijkstra</author>
</authors>
<title>Go To Statement Considered Harmful</title>
<booktitle>Communications of the ACM</booktitle>
<year>1968</year>
<fullPaper source="harmful"/>
</paper>
</bibliography>
[Figure labels: element name (e.g., bibliography); element (e.g., paper);
attribute name (ID); attribute value ("goto"; attributes cannot contain
elements); element content (e.g., the children of paper); character content
(e.g., the title text); number content (1968); empty element (fullPaper)]
Perspectives on XML
• Document (SGML) Community
– data = linear text documents
– markup (annotate) text to describe context, structure,
semantics
• Database Community
– prominent example of the semi-structured data model
– captures the whole spectrum from highly structured,
regular data to unstructured data
• “XML is the cure for your data exchange,
information integration, e-commerce, …
problems” (also cures baldness, lose 28 pounds in
14 days, get rich quick, …)
Identifying Vocabularies
• My element may not be your element:
– geometry context:
<element>line</element>
– chemistry context:
<element>oxygen</element>
Identifying Vocabularies
• An XML Schema defines a vocabulary of
names of type definitions, element and
attribute declarations
• Use XML Namespaces to identify which
vocabulary
– Simple method for qualifying element and
attribute names used in XML documents
– Useful when a single XML document contains
elements and attributes that are defined for
and used by multiple software modules
Namespace Scoping
• XML namespaces are declared with an xmlns attribute,
which can associate a prefix with the namespace
• The declaration is in scope for the element containing
the attribute and all its descendants
<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
<html:head>
<html:title>Frobnostication</html:title>
</html:head>
<html:body>
<html:p>Moved to
<html:a href='http://frob.example.com'>here.</html:a>
</html:p>
</html:body>
</html:html>
Namespace Defaulting
<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this
case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>Frobnostication</title>
</head>
<body>
<p>Moved to
<a href='http://frob.example.com'>here</a>.</p>
</body>
</html>
Multiple Namespaces
All element types are prefixed
<bk:book xmlns:bk='urn:loc.gov:books'
xmlns:isbn='urn:ISBN:0-395-36341-6'
xmlns:money='urn:Finance:AllAboutMoney'>
<bk:title>Cheaper by the Dozen</bk:title>
<isbn:number>1568491379</isbn:number>
<bk:price
money:currencySymbol="$">99.99</bk:price>
</bk:book>
Nested Scoping
<?xml version="1.1"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
xmlns:isbn='urn:ISBN:0-395-36341-6'>
<title>Cheaper by the Dozen</title>
<isbn:number>1568491379</isbn:number>
<notes>
<!-- make HTML the default namespace for
some commentary -->
<p xmlns='urn:w3-org-ns:HTML'>
This is a <i>funny</i> book!
</p>
</notes>
</book>
How to Define the Actual
Namespace
• W3C namespace specification doesn’t say (!)
• A namespace doesn’t actually have to exist
as a physical or conceptual entity
• All that is needed is a qualifier — the XML
namespace URI — that, in combination with
an element type or attribute name, creates
a universal (and universally unique) name
• In other words, there doesn’t actually have
to be a definition or anything else at that
URI
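To illustrate how a namespace URI combines with a local name to form a universal name, here is a small sketch using Python's standard xml.etree.ElementTree, which reports each element as {namespace-URI}local-name. The function name universal_names and the second document with renamed prefixes are assumptions made for illustration; nothing is fetched from the URIs.

```python
import xml.etree.ElementTree as ET

# The book example from the slides, trimmed to two namespaces.
DOC = """\
<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
</bk:book>"""

# Same content, but with different (hypothetical) prefixes.
OTHER_PREFIXES = """\
<b:book xmlns:b='urn:loc.gov:books'
        xmlns:i='urn:ISBN:0-395-36341-6'>
  <b:title>Cheaper by the Dozen</b:title>
  <i:number>1568491379</i:number>
</b:book>"""


def universal_names(text: str) -> list:
    """List every element's expanded name in document order."""
    return [elem.tag for elem in ET.fromstring(text).iter()]


for name in universal_names(DOC):
    print(name)

# Prefixes are only local shorthand: both documents use the same names.
print(universal_names(DOC) == universal_names(OTHER_PREFIXES))
```

The parser never dereferences the URIs; they serve purely as qualifiers, which is why nothing needs to exist at them.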
Pure XML - Instance Model
• XML 1.0 implicit data model:
– nested containers ("boxes within boxes")
– labeled ordered trees (= semistructured data
model)
– Relational or object-oriented data is easy to encode
<A>
<B>foo</B>
<C>bar</C>
<C>psl</C>
</A>
[Figure: the same instance drawn as nested boxes (A: containing B: "foo",
C: "bar", C: "psl") and as a labeled tree (root A with children B, C, C and
leaves "foo", "bar", "psl"); children are ordered]
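The labeled ordered tree view can be made concrete with a short sketch. It uses Python's standard xml.etree.ElementTree; the function to_tree and its (label, children) tuple representation are illustrative assumptions rather than a standard data model.

```python
import xml.etree.ElementTree as ET


def to_tree(elem):
    """Convert an element into a (label, children) pair.

    Leaf elements become (label, text); children keep document order.
    """
    children = [to_tree(child) for child in elem]
    if not children:
        return (elem.tag, elem.text)
    return (elem.tag, children)


tree = to_tree(ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>"))
print(tree)
```

Because the children are held in a list rather than a set or dictionary, both the order and the repeated C label are preserved.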
XML + Namespaces
• Allows mixing of different tag
vocabularies
• Only identifies the vocabulary (lexicon)
• Additional mechanisms required for
structure and meaning (or at least
metadata) of tags – explicit data model
From Documents to Data
• We want to be able to
– Extract the element structure of a document
– Re-use this structure for other similar documents
– Share structure and metadata with others
– Automate processing of this structure and metadata
<memo importance='medium' date='2011-01-30'>
<from>Gail Kaiser</from> <to>Pranith Ramamurthy</to>
<subject>whim this week</subject>
<body>Bring blue books for a surprise quiz!</body>
</memo>
<invoice>
<orderDate>2010-12-01</orderDate>
<shipDate>2010-12-26</shipDate>
<billingAddress>
<name>Gail Kaiser</name>
<street>500 West 120th Street</street>
<city>New York</city> <state>NY</state>
<zip>10027</zip>
</billingAddress>
<voice>212-555-1234</voice>
<fax>212-555-4321</fax>
</invoice>
Adding Structure and
Semantics
• A Document Structure Description
(DSD) defines the syntax of XML
documents for a particular
application domain
• Defines the grammar for an XML-based
markup language
Processing XML
• Non-validating parser:
– checks that XML doc is syntactically well-formed,
e.g., all open-tags have matching
close-tags and they are properly nested,
attributes only appear once in an element,
etc.
• Validating parser:
– checks that XML doc is also valid with respect to a
given DSD (usually XML Schema)
Using DSD Validators
• A DSD processor can be useful both on the
server side (when writing XML documents)
and on the client side (when processing
XML documents):
– Checking validity (conformance) of XML
documents
– Performing default insertion (inserts
missing fragments)
DSD Processing
Several Proposed DSDs
• XML Document Type Definitions (DTDs):
– Define the structure of “allowed” documents
– Roughly analogous to a database schema
– Non-XML syntax
• XML Schemas (XSDs)
– Define structure and data types
– Allows developers to build their own libraries
of interchangeable data types
– Written in an XML vocabulary
• Others (e.g., RELAX NG, DSD)
XML Schema
Design Principles
1. More expressive than DTDs (from SGML)
2. Notation is itself an XML vocabulary
3. Self-describing
4. Usable by a wide variety of applications that
employ XML
5. Straightforwardly usable on the Internet
6. Optimized for interoperability
7. Simple enough to be implemented with modest
design and runtime resources
8. Coordinated with relevant W3C specs
Purpose of an XML Schema
• Defines a class of XML instances
• Neither instances nor schemas need
exist as documents, per se, may exist
as:
– Byte stream sent between
applications
– Fields in a database record
– Collection of XML “infoset”
information items
What is an XML
“infoset”?
• XML Information Set, 2nd edition, W3C
Recommendation February 2004
• For use by other specs that need to refer
to the information in a well-formed XML
document [or PSVI = post schema
validated infoset]
• Defines abstract data set generated by
parser or by other means, conceptually
tree of items each with several properties
(Some)
Information Items
• Document (root of infoset) – properties
include base URI, XML version, character
encoding, etc.
• One root element - and its children
• Attributes of elements
• Namespace scoping for elements
• Processing instructions
• Unexpanded entities (processor may or
may not expand all entities)
file po.xml
Example Instance Document
<?xml version="1.0"?>
<purchaseOrder orderDate="2008-08-20">
<shipTo country="US">
<name>Robert Smith</name>
<street>123 Maple Street</street>
<city>Mill Valley</city>
<state>CA</state> <zip>90952</zip>
</shipTo>
<billTo country="US">
<name>Alice Smith</name>
<street>8 Oak Avenue</street>
<city>Old Town</city>
<state>PA</state> <zip>95819</zip>
</billTo>
<comment>Hurry, my lawn is going wild!</comment>
<items>
<item partNum="872-AA">
<productName>Lawnmower</productName>
<quantity>1</quantity>
<USPrice>148.95</USPrice>
<comment>Confirm this is electric</comment>
</item>
<item partNum="926-AA"> . . . </item>
</items>
</purchaseOrder>
Where is the Schema?
• The instance document may reference a
schema explicitly, or a processor may
obtain a schema separately without
reference from the instance
• Schema defines elements and attributes,
and their complex and simple types
• Determines the appearance of elements
and their content in instance documents
file po.xsd
Example Schema
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation> . . . </xsd:annotation>
<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
<xsd:element name="comment" type="xsd:string"/>
<xsd:complexType name="PurchaseOrderType">
. . .
</xsd:complexType>
</xsd:schema>
• The schema consists of a schema element and various
subelements, e.g., element, complexType
• The prefix xsd: associates names with the XML Schema
namespace specified in the xmlns:xsd declaration
• Same prefix, and hence same association, also appears on
names of built-in types, e.g., xsd:string
• Identifies elements and simple types as belonging to XML
Schema language vocabulary rather than vocabulary of
schema author
file po.xsd
Example Schema
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation> . . . </xsd:annotation>
<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
<xsd:element name="comment" type="xsd:string"/>
<xsd:complexType name="PurchaseOrderType">
. . .
</xsd:complexType>
</xsd:schema>
• An annotation element may appear at the
beginning of most schema constructions
• May contain two kinds of subelements
– documentation: Human readable material
– appinfo: For tools and applications
Complex Type Definitions
<xsd:complexType name="USAddress">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:decimal"/>
</xsd:sequence>
<xsd:attribute name="country" type="xsd:NMTOKEN"
fixed="US"/>
</xsd:complexType>
• New complex types are defined using the complexType element; it
contains element declarations, attribute declarations and element
references
• This example says elements of type USAddress must have
– 5 subelements that must be called name, street, city,
state and zip (in this order), each having the corresponding
type declared above
– 1 attribute called country may appear with the element;
NMTOKEN represents an atomic indivisible value
• All element declarations within USAddress involve simple types
Complex Type Definitions
<xsd:complexType name="USAddress">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:decimal"/>
</xsd:sequence>
<xsd:attribute name="country" type="xsd:NMTOKEN"
fixed="US"/>
</xsd:complexType>
• An attribute may be specified as fixed or default.
• Default attribute values apply when attributes are missing.
• For fixed attributes, if a value appears, it must equal the
declared fixed value.
• The schema processor will provide the value for missing attributes.
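The fixed-value rule can be pictured as the following hand-written check. This is only a sketch of what a schema processor does for the country attribute of USAddress, written with Python's standard xml.etree.ElementTree; the function name effective_country and the constant FIXED_COUNTRY are illustrative assumptions, and a real validator would derive the rule from the schema instead of hard-coding it.

```python
import xml.etree.ElementTree as ET

# Mirrors fixed="US" on the country attribute declaration (hard-coded here).
FIXED_COUNTRY = "US"


def effective_country(address: ET.Element) -> str:
    """Return the country value the way a schema processor would see it."""
    value = address.get("country")
    if value is None:
        # Attribute missing: the processor supplies the fixed value.
        return FIXED_COUNTRY
    if value != FIXED_COUNTRY:
        # Attribute present: it must equal the fixed value.
        raise ValueError(f"country must be {FIXED_COUNTRY!r}, got {value!r}")
    return value


print(effective_country(ET.fromstring("<shipTo country='US'/>")))
print(effective_country(ET.fromstring("<shipTo/>")))
```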
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• A declaration may reference an existing element, e.g.,
comment; the value of the ref attribute must reference a
global element (i.e., declared under schema)
• Every element of type PurchaseOrderType must consist
of subelements shipTo and billTo, each containing the
five subelements declared as part of USAddress, items
and (optionally) comment; it may have one attribute called
orderDate
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Occurrence constraints may specify
minOccurs and/or maxOccurs
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Attributes may appear once or not at all (the default), but
no more than once
• use may be specified as optional, required, or
prohibited
Simple Built-in Types
• string, normalizedString,
token
• byte, unsignedByte
• integer, positiveInteger, etc
• long, short, etc
• decimal, float, double
• boolean
• time, dateTime, duration,
date, etc
• anyURI
• etc
• ID
• IDREF, IDREFS
• ENTITY, ENTITIES
• NMTOKEN, NMTOKENS
• These last four groups of types (ID through NMTOKENS)
should only be used in attributes (to retain
compatibility with XML 1.0 DTDs)
Simple Derived Types
<xsd:simpleType name="myInteger">
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="10000"/>
<xsd:maxInclusive value="99999"/>
</xsd:restriction>
</xsd:simpleType>
• The simpleType element is used to define and name a new
simple type
• The restriction element indicates the base type and
identifies the “facets” that constrain the range of values
(here minInclusive and maxInclusive)
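As a rough picture of what the two facets mean, the sketch below re-implements the myInteger restriction by hand in Python. It is an approximation written for illustration (the function name is_my_integer is an assumption); a real schema processor applies the full XML Schema lexical rules for xsd:integer.

```python
import re

# Simplified lexical form of xsd:integer: optional sign, then ASCII digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def is_my_integer(lexical: str) -> bool:
    """Check a string against the myInteger restriction (10000..99999)."""
    text = lexical.strip()  # surrounding whitespace is ignored
    if INTEGER_LEXICAL.fullmatch(text) is None:
        return False
    # minInclusive and maxInclusive both include their bounds.
    return 10000 <= int(text) <= 99999


print(is_my_integer("10000"))
print(is_my_integer("99999"))
print(is_my_integer("9999"))
print(is_my_integer("100000"))
```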
Simple Derived Types
(pattern facet)
<!-- Stock Keeping Unit, a code for identifying
products -->
<xsd:simpleType name="SKU">
<xsd:restriction base="xsd:string">
<xsd:pattern value="\d{3}-[A-Z]{2}"/>
</xsd:restriction>
</xsd:simpleType>
• Constrain the values of SKU using the pattern
facet in conjunction with the regular expression
"\d{3}-[A-Z]{2}" (3 digits followed by a hyphen
followed by 2 upper-case ASCII letters)
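The same regular expression can be tried out directly. This sketch uses Python's standard re module as a stand-in for a schema processor; the helper name is_valid_sku is an assumption. XML Schema patterns must match the whole value, which fullmatch reproduces here.

```python
import re

# Same expression as the pattern facet in the SKU type.
SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")


def is_valid_sku(value: str) -> bool:
    """Return True if the whole value matches the SKU pattern."""
    return SKU_PATTERN.fullmatch(value) is not None


print(is_valid_sku("872-AA"))    # part number from po.xml
print(is_valid_sku("872-aa"))    # lower-case letters are rejected
print(is_valid_sku("0872-AA"))   # too many digits
```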
Simple Derived Types
(enumeration facet)
<xsd:simpleType name="USState">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="AK"/>
<xsd:enumeration value="AL"/>
<xsd:enumeration value="AR"/>
<!-- and so on ... -->
</xsd:restriction>
</xsd:simpleType>
• The enumeration facet limits a simple type to a set
of distinct values
• Enables a better definition of USAddress type
<xsd:complexType name="USAddress">
. . .
<xsd:element name="state" type="USState"/>
. . .
</xsd:complexType>
Anonymous Type Definitions
<xsd:complexType name="Items">
<xsd:sequence>
<xsd:element name="item" minOccurs="0"
maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="productName" type="xsd:string"/>
<xsd:element name="quantity">
<xsd:simpleType>
<xsd:restriction base="xsd:positiveInteger">
<xsd:maxExclusive value="100"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="USPrice" type="xsd:decimal"/>
<xsd:element ref="comment"
minOccurs="0"/>
<xsd:element name="shipDate" type="xsd:date"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="partNum" type="SKU" use="required"/>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
Recap Example Schema
file po.xsd
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation> . . . </xsd:annotation>
<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
<xsd:element name="comment" type="xsd:string"/>
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
<xsd:complexType name="USAddress"> . . . </xsd:complexType>
<xsd:complexType name="Items"> . . . </xsd:complexType>
</xsd:schema>
XML Schema Data Types
• Complex types
• Built-in simple types
• Derived simple types
• Also derived complex types, lists and
unions of simple types
Define structure – what about
the content?
Element Content:
Simple content
• Declare an element that has an attribute
and contains a simple value
<internationalPrice currency="EUR">423.46</internationalPrice>
<xsd:element name="internationalPrice">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base="xsd:decimal">
<xsd:attribute name="currency" type="xsd:string"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
Element Content:
Empty content
• Declare an element with attributes only, no content at all
<internationalPrice currency="EUR" value="423.46"/>
<xsd:element name="internationalPrice">
<xsd:complexType>
<xsd:attribute name="currency" type="xsd:string"/>
<xsd:attribute name="value"
type="xsd:decimal"/>
</xsd:complexType>
</xsd:element>
Element Content:
Entire element omitted
• The absence of an element does not carry any
particular meaning; it could be
– Information unknown
– Information not applicable
– I just forgot to enter the information
• Absence does/should not imply some value like
zero, empty string, empty list, etc.
• Database systems faced with similar problems
have introduced “null” values
• XML does not provide a null value representation
that actually appears in element content; instead,
there is an attribute to indicate content is nil
<xsd:element name="shipDate" type="shipDateType"
nillable="true">
<shipDate xsi:nil="true"></shipDate>
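On the consuming side, an application has to look at the xsi:nil attribute itself, since the element content carries no null marker. A minimal sketch with Python's standard xml.etree.ElementTree follows; the helper is_nil and the enclosing order element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema instance attributes (xsi:nil, xsi:schemaLocation).
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical instance fragment: one nil shipDate and one with a real value.
DOC = f"""\
<order xmlns:xsi="{XSI}">
  <shipDate xsi:nil="true"></shipDate>
  <shipDate>2008-12-26</shipDate>
</order>"""


def is_nil(elem: ET.Element) -> bool:
    """True if the element is explicitly marked nil via xsi:nil."""
    # xsd:boolean accepts both "true" and "1" as true values.
    return elem.get(f"{{{XSI}}}nil") in ("true", "1")


root = ET.fromstring(DOC)
for ship_date in root.findall("shipDate"):
    print(is_nil(ship_date), repr(ship_date.text))
```

An element that is simply absent is a third case: it is neither nil nor valued, matching the point above that absence carries no particular meaning.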
Element Content:
Mixed content
<letterBody>
<salutation>Dear Mr.<name>Stanley
Steamer</name>.</salutation>
Your order of <quantity>1</quantity> <productName>Baby
Monitor</productName> shipped from our warehouse on
<shipDate>2008-12-26</shipDate>. ....
</letterBody>
• Text appears between the elements salutation,
quantity, productName, and shipDate (all
children of letterBody)
• To allow this, the mixed attribute of the parent’s
complexType must be set to true
Element Content: Mixed content
<xsd:element name="letterBody">
<xsd:complexType mixed="true">
<xsd:sequence>
<xsd:element name="salutation">
<xsd:complexType mixed="true">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="quantity"
type="xsd:positiveInteger"/>
<xsd:element name="productName" type="xsd:string"/>
<xsd:element name="shipDate"
type="xsd:date"
minOccurs="0"/>
<!-- etc. -->
</xsd:sequence>
</xsd:complexType>
</xsd:element>
• The order and number of child elements appearing
in an instance must agree with order/number of
child elements specified in the content model
Element Content:
anyType
• The anyType type does not constrain
its content in any way
<xsd:element name="anything" type="xsd:anyType"/>
• When no type is defined, anyType is
the default, so could be written as
<xsd:element name="anything"/>
Other XML Schema Topics
• There’s lots more…
Drawbacks of XML
Schemas
• Another vocabulary to learn
• Verbose (like XML itself)
• Many constraints cannot be expressed
(without adding separate stylesheet or
code)
<Demo
xmlns="http://www.demo.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.demo.org demo.xsd">
<A>10</A> <B>20</B> </Demo>
• Can constrain: the Demo element
contains a sequence of elements
A followed by B; the A element
contains an integer; the B
element contains an integer
• Can’t constrain: A>B
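A constraint such as A>B therefore has to live in separate code that runs after (or alongside) schema validation. The sketch below shows one way to do that with Python's standard xml.etree.ElementTree; the function name a_greater_than_b and the prefix d are illustrative assumptions, and the schema file demo.xsd is not read at all.

```python
import xml.etree.ElementTree as ET

# The Demo instance from the slide.
DEMO = """\
<Demo xmlns="http://www.demo.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.demo.org demo.xsd">
  <A>10</A> <B>20</B>
</Demo>"""

# Prefix chosen locally for lookups; it need not match the document.
NS = {"d": "http://www.demo.org"}


def a_greater_than_b(text: str) -> bool:
    """Application-level check for the co-occurrence constraint A > B."""
    root = ET.fromstring(text)
    a = int(root.findtext("d:A", namespaces=NS))
    b = int(root.findtext("d:B", namespaces=NS))
    return a > b


print(a_greater_than_b(DEMO))
```

The slide's instance has A=10 and B=20, so it would pass the schema's structural and type checks while failing this extra rule.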
Processing XML
• Tree representation:
– Document Object Model (DOM) API
– Cursor APIs, e.g., .NET’s XPathNavigator
• Stream of events representation:
– Push Model, e.g., Simple API for XML
(SAX)
– Pull Model, e.g., Common API for XML Pull
Parsing (XmlPull), Java StAX
• Others
Document Object Model
• Object-oriented approach to
traversing the XML document as a
tree
• Typically loads the entire XML
document into memory (random
access but memory intensive)
• Provides mechanisms for loading,
saving, accessing, querying, modifying
and deleting nodes from an XML
document
DOM API
• Hierarchy of Node objects mapping to XML
concepts: document, element, attribute,
processing instruction, comment, …
• Language-independent API:
– get first/last child, previous/next sibling, set
of nodes
– insert before/after, replace
– getElementsByTagName
• W3C DOM offers fairly limited functionality, so
implementations often add helper method
extensions
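As an illustration of the operations listed above, here is a small sketch using Python's standard xml.dom.minidom, one implementation of the W3C DOM. The order document is a made-up example, not taken from the lecture.

```python
from xml.dom import minidom

# Hypothetical document; the whole tree is loaded into memory.
doc = minidom.parseString(
    "<order><quantity>1</quantity><shipDate>2008-12-26</shipDate></order>"
)
root = doc.documentElement

first = root.firstChild        # <quantity>
last = root.lastChild          # <shipDate>
sibling = first.nextSibling    # the same node as last in this two-child document

# Build a new element and insert it before <shipDate>.
product = doc.createElement("productName")
product.appendChild(doc.createTextNode("Baby Monitor"))
root.insertBefore(product, last)

names = [child.tagName for child in root.childNodes]
dates = [node.firstChild.data for node in doc.getElementsByTagName("shipDate")]
print(names, dates)
```

Even this small task takes several calls, which is one reason implementations tend to add helper methods on top of the core API.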
Push Model
• XML producer (typically an XML parser)
controls the pace of the application and
informs the XML consumer when certain
events occur (e.g., reports events when
encountering begin/end tags)
• XML consumer registers callbacks with the
producer, which invokes the callbacks as
various parts of the XML document are seen
(as events are reported)
• Does not necessarily build a parse tree
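A minimal push-model sketch with Python's standard xml.sax module. The class name EventRecorder and the sample document are assumptions for illustration. The parser (producer) drives the loop and calls back into the handler (consumer):

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Consumer: the parser invokes these callbacks as it encounters tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventRecorder()
# The producer controls the pace; no parse tree is built.
xml.sax.parseString(b"<order><quantity>1</quantity></order>", handler)
print(handler.events)
```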
Push Model Pro
• The entire XML document does not need to
be stored in memory, only the information
about the node currently being processed
• This makes it possible to process large XML
documents without incurring massive memory
costs
• Can also process XML streams whose
contents arrive over time
• Allows consumer to ignore less interesting
data
Push Model Con
• Certain context and state information such as the
parents of the current node or its depth in the
XML tree must be tracked by the programmer
• Limited expressive power (query/update) when
working on streams
• To register callbacks one needs to create a class
devoted to handling events from the producer
• Many developers find callbacks to be an unintuitive
way to control program flow
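To make the first and third points concrete, here is a sketch (again with Python's xml.sax) in which a dedicated handler class keeps its own stack of open elements, because the parser reports only the current event. PathTracker and the sample letter are hypothetical.

```python
import xml.sax

class PathTracker(xml.sax.ContentHandler):
    """Tracks ancestors by hand, since the parser supplies no parent or depth."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # names of currently open elements, outermost first
        self.found = []   # (depth, path) for each occurrence of the target

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name == self.target:
            self.found.append((len(self.stack), "/".join(self.stack)))

    def endElement(self, name):
        self.stack.pop()

tracker = PathTracker("name")
xml.sax.parseString(
    b"<letterBody><salutation>Dear <name>Robert</name></salutation></letterBody>",
    tracker,
)
print(tracker.found)
```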
Pull Model
• XML consumer controls the program flow by
requesting events from the XML producer as
needed
• Operates in a forward-only, streaming
fashion while only showing information about
a single node at any given time
• Programmer creates a loop that continually
reads from the XML document until the end
of the document is reached, but acts solely
on items of interest as they are seen
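The loop described above might look like the following sketch, which uses Python's standard xml.dom.pulldom as a stand-in for a pull parser (it is not XmlPull itself). The order document is a made-up example.

```python
from xml.dom import pulldom

ORDER = (
    "<order><quantity>1</quantity><productName>Baby Monitor</productName>"
    "<shipDate>2008-12-26</shipDate></order>"
)

events = pulldom.parseString(ORDER)
ship_date = None
# The consumer drives the loop and requests one event at a time.
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "shipDate":
        events.expandNode(node)  # read just this element's content
        ship_date = "".join(child.data for child in node.childNodes)
        break                    # stop early; the rest is never examined

print(ship_date)
```

The other elements are skipped simply by not acting on their events, and no callback class is needed.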
Pull Model Comparison
• As memory efficient as push model
processing but with a more familiar
programming model
• Does not require a specialized class for
handling XML processing to implement
specific interfaces or subclass certain
classes to register callbacks
• The need to explicitly track application
states using boolean flags and similar
variables is significantly reduced
XML Cursors
• Cursor acts like a lens that focuses on one XML
node at a time, but, unlike pull-based or push-based
APIs, the cursor can be positioned
anywhere along the XML document at any given
time
• Allows one to navigate, query, and manipulate an
XML document loaded in memory
• Does not require the heavyweight interface of a
traditional tree model API, where every
significant token in the underlying XML must map
to an object
• Can create XML views of non-XML data
Other Alternatives
• Object to XML Mapping APIs
– Represent nodes and text as classes and
programming language primitives
– Cannot represent all XML information
with full fidelity, e.g., lose processing
instructions and comments, element
ordering
– Impedance mismatches between XML
Schema and object-oriented concepts
• XML-specific languages – XQuery, XSLT, …
Summary
• Content intended for humans usually
formatted using HTML
• Content intended for machine processing
(other than rendering) usually formatted
using XML
• Humans (and some browsers) are forgiving;
other machine processing is not
Second Assignment:
Paper Outline
• Plan your paper
• Pretend your reader is another student
in this class, rather than the teaching
staff
• It is “ok” to switch topics from your
original proposal, but clarify that you’re
doing so
Second Assignment:
It’s All in the Details
• Each full paper must have a title, abstract
(approx. 100 words), introduction, several
body sections, conclusion, references list
• Figure out what will be in the body
sections: drill down to subsections (or
possibly even subsubsections)
• Consider the point of view you will portray
in your introduction and conclusion
• Motivate your reading
Second Assignment:
Logistics
• Due Tuesday February 15th by 10am
• Maximum four pages (not including optional
figures and required reference list)
• Submit by posting in Paper Outlines folder
on CourseWorks – contents of this folder
are visible only to teaching staff
• Must be in a format I can read, which
means PDF, Word, HTML, or plain ASCII text
(with all figures embedded or viewable in a
browser without special “plugins”)
Upcoming Assignments:
Paper
• Full paper due Tuesday March 9th
Second’ Assignment:
Student Presentations
• Individual ~10 minute talk in class (plus ~5
minutes Q&A, discussion)
• Schedule has been assigned (see Syllabus)
• One paragraph Presentation Proposal, due
Tuesday February 15th
• May be based on paper, project, or some
other topic
• Post in Presentation Proposals folder on
CourseWorks
Heads Up on Project
• Project Proposal due Tuesday March 9th
• Optionally work in teams (see
http://bank.cs.columbia.edu/classes/cs6125/team_advice.htm)
• Build a new system or extend an existing system
• OR evaluate/compare one or more existing
system(s)
• You may "continue" your paper topic towards the
project, or do something entirely different