04.XML-XSD.ppt

advertisement

XML & XML Schema

Semantic Web - Fall 2005

Computer Engineering Department

Sharif University of Technology

Outline

• Markup Languages

– SGML, HTML, XML

• XML Building Blocks

• XML Applications

• Namespaces

• XML Schema

Semantic web - Computer Engineering Dept. - Spring 2006

2

SGML(ISO 8879)

S tandard G eneralized M arkup L anguage

• The international standard for defining descriptions of structure and content in text documents

• Interchangeable: device-independent, system-independent

• tags are not predefined

• Using DTD to validate the structure of the document

• Large, powerful, and very complex

• Heavily used in industrial and commercial usages for over a decade

3

Semantic web - Computer Engineering Dept. - Spring 2006

HTML(RFC 1866)

H yper T ext M arkup L anguage

• A small SGML application used on web (a DTD and a set of processing conventions)

• Only uses a predefined set of tags

Semantic web - Computer Engineering Dept. - Spring 2006

4

What is XML?

• e

X tensible M arkup L anguage

• A simplified version of SGML

• Maintains the most useful parts of SGML

• Designed so that SGML can be delivered over the

Web

• More flexible and adaptable than HTML

• XHTML : a reformulation of HTML 4 in XML 1.0

5

Semantic web - Computer Engineering Dept. - Spring 2006

HTML vs. XML

HTML is used to mark up text so it can be displayed to users .

HTML describes both structure (e.g. <p> , <h2>

<em> ) and appearance

(e.g. <br> , <font> , <i> )

,

XML is used to mark up data so it can be processed by computers .

XML describes only content, or “meaning”

HTML uses a fixed, unchangeable set of tags.

In XML, you make up your own tags.

Semantic web - Computer Engineering Dept. - Spring 2006

6

HTML vs. XML (2)

• HTML is for humans

– HTML describes web pages

– You don’t want to see error messages about the web pages you visit

– Browsers ignore and/or correct as many HTML errors as they can, so HTML is often sloppy

• XML is for computers

– XML describes data

– The rules are strict and errors are not allowed

• In this way, XML is like a programming language

– Current versions of most browsers can display XML

• However, browser support of XML is spotty at best

7

Semantic web - Computer Engineering Dept. - Spring 2006

XML-related technologies

DTD (Document Type Definition) and XML

Schemas are used to define legal XML tags and their attributes for particular purposes

• XSLT (eXtensible Stylesheet Language

Transformations) and XPath are used to translate from one form of XML to another

• SAX (Simple API for XML)

Semantic web - Computer Engineering Dept. - Spring 2006

8

XML Building blocks - Elements

• Delimited by angle brackets

• Identify the nature of the content they surround

• General format: <element> … </element>

• Empty element:

<empty-Element />

• XML Elements have Relationships

– Elements are related as parents and children

• Elements have Content

– Elements can have different content types:

• Element, mixed, Simple, empty

Semantic web - Computer Engineering Dept. - Spring 2006

9

XML Building blocks - Attributes

Name-value pairs that occur inside start-tags after element name, like:

<element attribute=“value” />

• Provide additional information about elements that often is not a part of data.

• Attributes and elements are somewhat interchangeable

• Should I use an element or an attribute?

• Example using just elements:

<name>

<first>David</first>

<last>Matuszek</last>

</name>

• Example using attributes:

<name first="David" last="Matuszek"></name> metadata (data about data) should be stored as attributes, and that data itself should be stored as elements

Semantic web - Computer Engineering Dept. - Spring 2006

10

XML Building blocks - Entities

Five special characters must be written as entities:

&amp; for

&

(almost always necessary)

&lt; for

<

(almost always necessary)

&gt; for

>

(not usually necessary)

&quot; for

"

(necessary inside double quotes)

&apos; for ' (necessary inside single quotes)

These entities can be used even in places where they are not absolutely required.

These are the only predefined entities in XML.

Semantic web - Computer Engineering Dept. - Spring 2006

11

XML Building blocks - Declaration

The XML declaration looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

– The XML declaration is not required by browsers, but is required by most XML processors (so include it!)

– If present, the XML declaration must be first-not even whitespace should precede it

– Note that the brackets are <?

and

?>

– version="1.0" is required (this is the only version so far)

– encoding can be

"UTF-8"

(ASCII) or

"UTF-16"

(Unicode), or something else, or it can be omitted

– standalone tells whether there is a separate DTD

Semantic web - Computer Engineering Dept. - Spring 2006

12

XML Building blocks - Processing instructions

• PIs (Processing Instructions) may occur anywhere in the

XML document (but usually first)

• A PI is a command to the program processing the XML document to handle it in a certain way

• XML documents are typically processed by more than one program

• Programs that do not recognize a given PI should just ignore it

• General format of a PI: <?

target instructions ?>

• Example: <?xml-stylesheet type="text/css" href="mySheet.css"?>

Semantic web - Computer Engineering Dept. - Spring 2006

13

XML Building blocks - Comments

• <!-- This is a comment in both HTML and XML -->

• Comments can be put anywhere in an XML document

• Comments are useful for:

– Explaining the structure of an XML document

– Commenting out parts of the XML during development and testing

• The character sequence -cannot occur in the comment

• Comments are not displayed by browsers, but can be seen by anyone who looks at the source code

14

Semantic web - Computer Engineering Dept. - Spring 2006

CDATA

• By default, all text inside an XML document is parsed

• You can force text to be treated as unparsed character data by enclosing it in

<![CDATA[ ... ]]>

• Any characters, even & and

<

, can occur inside a CDATA

• Whitespace inside a CDATA is (usually) preserved

• The only real restriction is that the character sequence ]]> cannot occur inside a CDATA

• CDATA is useful when your text has a lot of illegal characters (for example, if your XML document contains some HTML text)

15

Semantic web - Computer Engineering Dept. - Spring 2006

XML Syntax

• All XML elements must have a closing tag

• XML tags are case sensitive

• All XML elements must be properly nested

• All XML documents must have a root tag

• Attribute values must always be quoted

• With XML, white space is preserved

• With XML, a new line is always stored as LF

• Comments in XML:

<!-- This is a comment -->

Semantic web - Computer Engineering Dept. - Spring 2006

16

Well-formed XML

• Every element must have both a start tag and an end tag, e.g.

<name> ... </name>

– But empty elements can be abbreviated: <break /> .

– XML tags are case sensitive

– XML tags may not begin with the letters xml

, in any combination of cases

• Elements must be properly nested, e.g. not

<b><i>bold and italic</b></i>

• Every XML document must have one and only one root element

• The values of attributes must be enclosed in single or double quotes, e.g. <time unit="days">

• Character data cannot contain < or

&

Semantic web - Computer Engineering Dept. - Spring 2006

17

Displaying XML

• XML documents do not carry information about how to display the data

• We can add display information to XML with

– CSS (Cascading Style Sheets)

– XSL (eXtensible Stylesheet Language) --- preferred

Semantic web - Computer Engineering Dept. - Spring 2006

18

XML Applications (1)

Separate data

XML can Separate Data from HTML

• Store data in separate XML files

• Using HTML for layout and display

• Using Data Islands

• Data Islands can be bound to HTML elements

Benefits:

Changes in the underlying data will not require any changes to your

HTML

Semantic web - Computer Engineering Dept. - Spring 2006

19

XML Applications (2)

Exchange data

XML is used to Exchange Data

• Text format

• Software-independent, hardware-independent

• Exchange data between incompatible systems, given that they agree on the same tag definition.

• Can be read by many different types of applications

Benefits:

• Reduce the complexity of interpreting data

• Easier to expand and upgrade a system

Semantic web - Computer Engineering Dept. - Spring 2006

20

XML Application (3)

Store Data

XML can be used to Store Data

• Plain text file

• Store data in files or databases

• Application can be written to store and retrieve information from the store

• Other clients and applications can access your XML files as data sources

Benefits:

Accessible to more applications

21

Semantic web - Computer Engineering Dept. - Spring 2006

XML Applications (4)

Create new language

XML can be used to Create new Languages, e.g. :

• WML (Wireless Markup Language) used to markup Internet applications for handheld devices like mobile phones (WAP)

• MusicXML used to publishing musical scores

Semantic web - Computer Engineering Dept. - Spring 2006

22

Names in XML

• Names (as used for tags and attributes) must begin with a letter or underscore, and can consist of:

– Letters, both Roman (English) and foreign

– Digits, both Roman and foreign

.

(dot)

-

(hyphen)

_

(underscore)

:

(colon) should be used only for namespaces

– Combining characters and extenders (not used in English)

23

Semantic web - Computer Engineering Dept. - Spring 2006

Namespaces

• Namespaces are a simple mechanism for creating globally unique names for the elements and attributes of your markup language.

• Benefits:

– De-conflicts the meaning of identical names in different markup languages.

– Allows different markup languages to be mixed together without ambiguity.

• Namespaces are implemented by requiring every XML name to consist of two parts: a prefix and a local part:

<xsd:integer>

24

Semantic web - Computer Engineering Dept. - Spring 2006

Namespaces and URIs

• A namespace is defined as a unique string

– To guarantee uniqueness, typically a URI (Uniform

Resource Indicator) is used, because the author

“owns” the domain

– It doesn't have to be a “real” URI; it just has to be a unique string

– Example: http://ce.sharif.edu/sw

• There are two ways to use namespaces:

– Declare a default namespace

– Associate a prefix with a namespace, then use the prefix in the XML to refer to the namespace

Semantic web - Computer Engineering Dept. - Spring 2006

25

Namespace syntax

• In any start tag you can use the reserved attribute name xmlns

:

<book xmlns= "http://ce.sharif.edu/sw">

– This namespace will be used as the default for all elements up to the corresponding end tag

– You can override it with a specific prefix

• You can use almost this same form to declare a prefix:

<book xmlns:dave= "http://ce.sharif.edu/sw">

– Use this prefix on every tag and attribute you want to use from this namespace, including end tags--it is not a default prefix

< dave: chapter dave: number="1">To Begin</ dave: chapter>

• You can use the prefix in the start tag in which it is defined:

< dave: book xmlns:dave =“http://ce.sharif.edu/sw">

Semantic web - Computer Engineering Dept. - Spring 2006

26

Review of XML rules

• Start with <?xml version="1"?>

• XML is case sensitive

• You must have exactly one root element that encloses all the rest of the XML

• Every element must have a closing tag

• Elements must be properly nested

• Attribute values must be enclosed in double or single quotation marks

• There are only five pre-declared entities

Semantic web - Computer Engineering Dept. - Spring 2006

27

XML as a tree

• An XML document represents a hierarchy; a hierarchy is a tree novel foreword chapter number="1" paragraph paragraph paragraph

This is the great

American novel.

It was a dark and stormy night.

Suddenly, a shot rang out!

Semantic web - Computer Engineering Dept. - Spring 2006

28

Extended document standards

• You can define your own XML tag sets, but here are some already available:

– XHTML: HTML redefined in XML

– SMIL: Synchronized Multimedia Integration Language

– MathML: Mathematical Markup Language

– SVG: Scalable Vector Graphics

– DrawML: Drawing MetaLanguage

– ICE: Information and Content Exchange

– ebXML: Electronic Business with XML

– cxml: Commerce XML

– CBL: Common Business Library

29

Semantic web - Computer Engineering Dept. - Spring 2006

XML Schema

XML Validation

• "Well Formed" XML document

– correct XML syntax

• "Valid" XML document

– “well formed”

– Conforms to the rules of a DTD

• XML DTD

– defines the legal building blocks of an XML document

– Can be inline in XML or as an external reference

• XML Schema

– an XML based alternative to DTD, more powerful

– Support namespace and data types

Semantic web - Computer Engineering Dept. - Spring 2006

31

An Example XML with DTD

<?xml version="1.0"?>

<!DOCTYPE note [

<!ELEMENT note (to,from,heading,body)>

<!ELEMENT to (#PCDATA)>

<!ELEMENT from (#PCDATA)>

<!ELEMENT heading (#PCDATA)>

<!ELEMENT body (#PCDATA)>

]>

<note>

<to>Tove</to>

<from>Jani</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend</body>

</note>

Semantic web - Computer Engineering Dept. - Spring 2006

32

XML Schemas

• “Schema” is a general term

– DTDs are a form of XML schemas

• When we say “XML Schemas,” we usually mean the W3C XML Schema Language

– This is also known as “XML Schema Definition” language, or XSD .

Semantic web - Computer Engineering Dept. - Spring 2006

33

XSD vs. DTD

• DTDs provide a very weak specification language

– You can’t put any restrictions on text content

– You have very little control over mixed content (text plus elements)

– You have little control over ordering of elements

• DTDs are written in a strange (non-XML) format

– You need separate parsers for DTDs and XML

• The XML Schema Definition language solves these problems

– XSD gives you much more control over structure and content

– XSD is written in XML

34

Semantic web - Computer Engineering Dept. - Spring 2006

Referring to a schema

• To refer to a DTD in an XML document, the reference goes before the root element:

– <?xml version="1.0"?>

<!DOCTYPE rootElement SYSTEM " url

">

<rootElement> ... </rootElement>

• To refer to an XML Schema in an XML document, the reference goes in the root element:

– <?xml version="1.0"?>

<rootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

(The XML Schema Instance reference is required) xsi:noNamespaceSchemaLocation=" url

.xsd">

(This is where your XML Schema definition can be found)

...

</rootElement>

Semantic web - Computer Engineering Dept. - Spring 2006

35

The XSD document

• Since the XSD is written in XML, it can get confusing which we are talking about.

• The file extension is .xsd

• The root element is <schema>

• The XSD starts like this:

• <?xml version="1.0"?>

<xs:schema xmlns:xs="http://www.w3.rg/2001/XMLSchema">

Semantic web - Computer Engineering Dept. - Spring 2006

36

<schema>

• The <schema> element may have attributes:

– xmlns:xs="http://www.w3.org/2001/XMLSchema"

• This is necessary to specify where all our XSD tags are defined

– elementFormDefault="qualified"

• This means that all XML elements must be qualified (use a namespace)

• It is highly desirable to qualify all elements, or problems will arise when another schema is added

37

Semantic web - Computer Engineering Dept. - Spring 2006

“Simple” and “complex” elements

• A “simple” element is one that contains text and nothing else

– A simple element cannot have attributes

– A simple element cannot contain other elements

– A simple element cannot be empty

– However, the text can be of many different types, and may have various restrictions applied to it

• If an element isn’t simple, it’s “complex”

– A complex element may have attributes

– A complex element may be empty, or it may contain text, other elements, or both text and other elements

Semantic web - Computer Engineering Dept. - Spring 2006

38

Defining a simple element

• A simple element is defined as

<xs:element name=" name

" type=" type

" /> where:

– name is the name of the element

– the most common values for type are xs:boolean xs:integer xs:date xs:decimal xs:string xs:time

• Other attributes a simple element may have:

– default=" default value

" if no other value is specified

– fixed=" value

" no other value may be specified

Semantic web - Computer Engineering Dept. - Spring 2006

39

Defining an attribute

• Attributes themselves are always declared as simple types

• An attribute is defined as

<xs:attribute name=" name

" type=" type

" /> where:

– name and type are the same as for xs:element

• Other attributes a simple element may have:

– default=" default value

" if no other value is specified

– fixed=" value

" no other value may be specified

– use="optional" the attribute is not required (default)

– use="required" the attribute must be present

40

Semantic web - Computer Engineering Dept. - Spring 2006

Restrictions, or “facets”

• The general form for putting a restriction on a text value is:

(or xs:attribute

) – <xs:element name="name">

<xs:restriction base="type">

... the restrictions ...

</xs:restriction>

</xs:element>

• For example:

– <xs:element name="age">

<xs:restriction base="xs:integer">

<xs:minInclusive value="0">

<xs:maxInclusive value="140">

</xs:restriction>

</xs:element>

Semantic web - Computer Engineering Dept. - Spring 2006

41

Restrictions on numbers

• minInclusive

-number must be ≥ the given value

• minExclusive

-- number must be > the given value

• maxInclusive

-number must be ≤ the given value

• maxExclusive

-- number must be < the given value

• totalDigits

-- number must have exactly value digits

• fractionDigits

-- number must have no more than value digits after the decimal point

Semantic web - Computer Engineering Dept. - Spring 2006

42

Restrictions on strings

• length

-- the string must contain exactly value characters

• minLength

-- the string must contain at least value characters

• maxLength

-- the string must contain no more than value characters

• pattern

-- the value is a regular expression that the string must match

• whiteSpace

-not really a “restriction”--tells what to do with whitespace

– value="preserve"

Keep all whitespace

– value="replace"

Change all whitespace characters to spaces

– value="collapse"

Remove leading and trailing whitespace, and replace all sequences of whitespace with a single space

43

Semantic web - Computer Engineering Dept. - Spring 2006

Enumeration

• An enumeration restricts the value to be one of a fixed set of values

• Example:

– <xs:element name="season">

<xs:simpleType>

<xs:restriction base="xs:string">

<xs:enumeration value="Spring"/>

<xs:enumeration value="Summer"/>

<xs:enumeration value="Autumn"/>

<xs:enumeration value="Fall"/>

<xs:enumeration value="Winter"/>

</xs:restriction>

</xs:simpleType>

</xs:element>

Semantic web - Computer Engineering Dept. - Spring 2006

44

Complex elements

• A complex element is defined as

<xs:element name="name">

<xs:complexType>

... information about the complex type...

</xs:complexType>

</xs:element>

• Example:

<xs:element name="person">

<xs:complexType>

<xs:sequence>

<xs:element name="firstName" type="xs:string" />

<xs:element name="lastName" type="xs:string" />

</xs:sequence>

</xs:complexType>

</xs:element>

• <xs:sequence> says that elements must occur in this order

• Remember that attributes are always simple types

Semantic web - Computer Engineering Dept. - Spring 2006

45

Declaration and use

• So far we’ve been talking about how to declare types, not how to use them

• To use a type we have declared, use it as the value of type="..."

– Examples:

• <xs:element name="student" type="person"/>

• <xs:element name="professor" type="person"/>

– Scope is important: you cannot use a type if is local to some other type

Semantic web - Computer Engineering Dept. - Spring 2006

46

xs:sequence

• We’ve already seen an example of a complex type whose elements must occur in a specific order:

• <xs:element name="person">

<xs:complexType>

<xs:sequence>

<xs:element name="firstName" type="xs:string" />

<xs:element name="lastName" type="xs:string" />

</xs:sequence>

</xs:complexType>

</xs:element>

47

Semantic web - Computer Engineering Dept. - Spring 2006

xs:all

• xs:all allows elements to appear in any order

• <xs:element name="person">

<xs:complexType>

<xs:all>

<xs:element name="firstName" type="xs:string" />

<xs:element name="lastName" type="xs:string" />

</xs:all>

</xs:complexType>

</xs:element>

• Despite the name, the members of an xs:all group can occur once or not at all

• You can use minOccurs="0" to specify that an element is optional (default value is

1

)

– In this context, maxOccurs is always

1

Semantic web - Computer Engineering Dept. - Spring 2006

48

Empty elements

• Empty elements are (ridiculously) complex

• <xs:complexType name="counter">

<xs:complexContent>

<xs:extension base="xs:anyType"/>

<xs:attribute name="count" type="xs:integer"/>

</xs:complexContent>

</xs:complexType>

Semantic web - Computer Engineering Dept. - Spring 2006

49

Mixed elements

• Mixed elements may contain both text and elements

• We add mixed="true" to the xs:complexType element

• The text itself is not mentioned in the element, and may go anywhere (it is basically ignored)

• <xs:complexType name="paragraph" mixed="true" >

<xs:sequence>

<xs:element name="someName” type="xs:anyType"/>

</xs:sequence>

</xs:complexType>

Semantic web - Computer Engineering Dept. - Spring 2006

50

Extensions

• You can base a complex type on another complex type

• <xs:complexType name="newType">

<xs:complexContent>

<xs:extension base="otherType">

...new stuff...

</xs:extension>

</xs:complexContent>

</xs:complexType>

51

Semantic web - Computer Engineering Dept. - Spring 2006

Predefined string types

• Recall that a simple element is defined as:

<xs:element name=" name

" type=" type

" />

• Here are a few of the possible string types:

– xs:string

-- a string

– xs:normalizedString

-a string that doesn’t contain tabs, newlines, or carriage returns

– xs:token

-a string that doesn’t contain any whitespace other than single spaces

• Allowable restrictions on strings:

– enumeration

, length

, maxLength

, minLength

, pattern

, whiteSpace

Semantic web - Computer Engineering Dept. - Spring 2006

52

Predefined date and time types

• xs:date -- A date in the format CCYY MM DD , for example, 2002-11-05

• xs:time

-- A date in the format hh : mm : ss (hours, minutes, seconds)

• xs:dateTime -- Format is CCYY MM -

DD T hh : mm : ss

– The T is part of the syntax

• Allowable restrictions on dates and times:

– enumeration

, minInclusive

, minExclusive

, maxInclusive

, maxExclusive

, pattern

, whiteSpace

53

Semantic web - Computer Engineering Dept. - Spring 2006

Predefined numeric types

• Here are some of the predefined numeric types: xs:decimal xs:positiveInteger xs:byte xs:short xs:int xs:long xs:negativeInteger xs:nonPositiveInteger xs:nonNegativeInteger

• Allowable restrictions on numeric types:

– enumeration

, minInclusive

, minExclusive

, maxInclusive

, maxExclusive

, fractionDigits

, totalDigits

, pattern

, whiteSpace

Semantic web - Computer Engineering Dept. - Spring 2006

54

Questions?

References

• http://www.w3.org/XML/

• http://www.w3.org/XML/Schema

Semantic web - Computer Engineering Dept. - Spring 2006

56

Download