Semantic Web - Fall 2005
Computer Engineering Department
Sharif University of Technology
• Markup Languages
– SGML, HTML, XML
• XML Building Blocks
• XML Applications
• Namespaces
• XML Schema
Semantic web - Computer Engineering Dept. - Spring 2006
2
•
S tandard G eneralized M arkup L anguage
• The international standard for defining descriptions of structure and content in text documents
• Interchangeable: device-independent, system-independent
• tags are not predefined
• Using DTD to validate the structure of the document
• Large, powerful, and very complex
• Heavily used in industrial and commercial usages for over a decade
3
Semantic web - Computer Engineering Dept. - Spring 2006
H yper T ext M arkup L anguage
• A small SGML application used on web (a DTD and a set of processing conventions)
• Only uses a predefined set of tags
Semantic web - Computer Engineering Dept. - Spring 2006
4
• e
X tensible M arkup L anguage
• A simplified version of SGML
• Maintains the most useful parts of SGML
• Designed so that SGML can be delivered over the
Web
• More flexible and adaptable than HTML
• XHTML : a reformulation of HTML 4 in XML 1.0
5
Semantic web - Computer Engineering Dept. - Spring 2006
HTML is used to mark up text so it can be displayed to users .
HTML describes both structure (e.g. <p> , <h2>
<em> ) and appearance
(e.g. <br> , <font> , <i> )
,
XML is used to mark up data so it can be processed by computers .
XML describes only content, or “meaning”
HTML uses a fixed, unchangeable set of tags.
In XML, you make up your own tags.
Semantic web - Computer Engineering Dept. - Spring 2006
6
• HTML is for humans
– HTML describes web pages
– You don’t want to see error messages about the web pages you visit
– Browsers ignore and/or correct as many HTML errors as they can, so HTML is often sloppy
• XML is for computers
– XML describes data
– The rules are strict and errors are not allowed
• In this way, XML is like a programming language
– Current versions of most browsers can display XML
• However, browser support of XML is spotty at best
7
Semantic web - Computer Engineering Dept. - Spring 2006
•
DTD (Document Type Definition) and XML
Schemas are used to define legal XML tags and their attributes for particular purposes
• XSLT (eXtensible Stylesheet Language
Transformations) and XPath are used to translate from one form of XML to another
• SAX (Simple API for XML)
Semantic web - Computer Engineering Dept. - Spring 2006
8
• Delimited by angle brackets
• Identify the nature of the content they surround
• General format: <element> … </element>
• Empty element:
<empty-Element />
– Elements are related as parents and children
– Elements can have different content types:
• Element, mixed, Simple, empty
Semantic web - Computer Engineering Dept. - Spring 2006
9
Name-value pairs that occur inside start-tags after element name, like:
<element attribute=“value” />
• Provide additional information about elements that often is not a part of data.
• Attributes and elements are somewhat interchangeable
• Should I use an element or an attribute?
• Example using just elements:
<name>
<first>David</first>
<last>Matuszek</last>
</name>
• Example using attributes:
<name first="David" last="Matuszek"></name> metadata (data about data) should be stored as attributes, and that data itself should be stored as elements
Semantic web - Computer Engineering Dept. - Spring 2006
10
Five special characters must be written as entities:
& for
&
(almost always necessary)
< for
<
(almost always necessary)
> for
>
(not usually necessary)
" for
"
(necessary inside double quotes)
' for ' (necessary inside single quotes)
These entities can be used even in places where they are not absolutely required.
These are the only predefined entities in XML.
Semantic web - Computer Engineering Dept. - Spring 2006
11
The XML declaration looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
– The XML declaration is not required by browsers, but is required by most XML processors (so include it!)
– If present, the XML declaration must be first-not even whitespace should precede it
– Note that the brackets are <?
and
?>
– version="1.0" is required (this is the only version so far)
– encoding can be
"UTF-8"
(ASCII) or
"UTF-16"
(Unicode), or something else, or it can be omitted
– standalone tells whether there is a separate DTD
Semantic web - Computer Engineering Dept. - Spring 2006
12
• PIs (Processing Instructions) may occur anywhere in the
XML document (but usually first)
• A PI is a command to the program processing the XML document to handle it in a certain way
• XML documents are typically processed by more than one program
• Programs that do not recognize a given PI should just ignore it
• General format of a PI: <?
target instructions ?>
• Example: <?xml-stylesheet type="text/css" href="mySheet.css"?>
Semantic web - Computer Engineering Dept. - Spring 2006
13
• <!-- This is a comment in both HTML and XML -->
• Comments can be put anywhere in an XML document
• Comments are useful for:
– Explaining the structure of an XML document
– Commenting out parts of the XML during development and testing
• The character sequence -cannot occur in the comment
• Comments are not displayed by browsers, but can be seen by anyone who looks at the source code
14
Semantic web - Computer Engineering Dept. - Spring 2006
• By default, all text inside an XML document is parsed
• You can force text to be treated as unparsed character data by enclosing it in
<![CDATA[ ... ]]>
• Any characters, even & and
<
, can occur inside a CDATA
• Whitespace inside a CDATA is (usually) preserved
• The only real restriction is that the character sequence ]]> cannot occur inside a CDATA
• CDATA is useful when your text has a lot of illegal characters (for example, if your XML document contains some HTML text)
15
Semantic web - Computer Engineering Dept. - Spring 2006
• All XML elements must have a closing tag
• XML tags are case sensitive
• All XML elements must be properly nested
• All XML documents must have a root tag
• Attribute values must always be quoted
• With XML, white space is preserved
• With XML, a new line is always stored as LF
• Comments in XML:
<!-- This is a comment -->
Semantic web - Computer Engineering Dept. - Spring 2006
16
• Every element must have both a start tag and an end tag, e.g.
<name> ... </name>
– But empty elements can be abbreviated: <break /> .
– XML tags are case sensitive
– XML tags may not begin with the letters xml
, in any combination of cases
• Elements must be properly nested, e.g. not
<b><i>bold and italic</b></i>
• Every XML document must have one and only one root element
• The values of attributes must be enclosed in single or double quotes, e.g. <time unit="days">
• Character data cannot contain < or
&
Semantic web - Computer Engineering Dept. - Spring 2006
17
• XML documents do not carry information about how to display the data
• We can add display information to XML with
– CSS (Cascading Style Sheets)
– XSL (eXtensible Stylesheet Language) --- preferred
Semantic web - Computer Engineering Dept. - Spring 2006
18
Separate data
XML can Separate Data from HTML
• Store data in separate XML files
• Using HTML for layout and display
• Using Data Islands
• Data Islands can be bound to HTML elements
Benefits:
Changes in the underlying data will not require any changes to your
HTML
Semantic web - Computer Engineering Dept. - Spring 2006
19
Exchange data
XML is used to Exchange Data
• Text format
• Software-independent, hardware-independent
• Exchange data between incompatible systems, given that they agree on the same tag definition.
• Can be read by many different types of applications
Benefits:
• Reduce the complexity of interpreting data
• Easier to expand and upgrade a system
Semantic web - Computer Engineering Dept. - Spring 2006
20
Store Data
XML can be used to Store Data
• Plain text file
• Store data in files or databases
• Application can be written to store and retrieve information from the store
• Other clients and applications can access your XML files as data sources
Benefits:
Accessible to more applications
21
Semantic web - Computer Engineering Dept. - Spring 2006
Create new language
XML can be used to Create new Languages, e.g. :
• WML (Wireless Markup Language) used to markup Internet applications for handheld devices like mobile phones (WAP)
• MusicXML used to publishing musical scores
Semantic web - Computer Engineering Dept. - Spring 2006
22
• Names (as used for tags and attributes) must begin with a letter or underscore, and can consist of:
– Letters, both Roman (English) and foreign
– Digits, both Roman and foreign
.
(dot)
-
(hyphen)
_
(underscore)
:
(colon) should be used only for namespaces
– Combining characters and extenders (not used in English)
23
Semantic web - Computer Engineering Dept. - Spring 2006
• Namespaces are a simple mechanism for creating globally unique names for the elements and attributes of your markup language.
• Benefits:
– De-conflicts the meaning of identical names in different markup languages.
– Allows different markup languages to be mixed together without ambiguity.
• Namespaces are implemented by requiring every XML name to consist of two parts: a prefix and a local part:
<xsd:integer>
24
Semantic web - Computer Engineering Dept. - Spring 2006
• A namespace is defined as a unique string
– To guarantee uniqueness, typically a URI (Uniform
Resource Indicator) is used, because the author
“owns” the domain
– It doesn't have to be a “real” URI; it just has to be a unique string
– Example: http://ce.sharif.edu/sw
• There are two ways to use namespaces:
– Declare a default namespace
– Associate a prefix with a namespace, then use the prefix in the XML to refer to the namespace
Semantic web - Computer Engineering Dept. - Spring 2006
25
• In any start tag you can use the reserved attribute name xmlns
:
<book xmlns= "http://ce.sharif.edu/sw">
– This namespace will be used as the default for all elements up to the corresponding end tag
– You can override it with a specific prefix
• You can use almost this same form to declare a prefix:
<book xmlns:dave= "http://ce.sharif.edu/sw">
– Use this prefix on every tag and attribute you want to use from this namespace, including end tags--it is not a default prefix
< dave: chapter dave: number="1">To Begin</ dave: chapter>
• You can use the prefix in the start tag in which it is defined:
< dave: book xmlns:dave =“http://ce.sharif.edu/sw">
Semantic web - Computer Engineering Dept. - Spring 2006
26
• Start with <?xml version="1"?>
• XML is case sensitive
• You must have exactly one root element that encloses all the rest of the XML
• Every element must have a closing tag
• Elements must be properly nested
• Attribute values must be enclosed in double or single quotation marks
• There are only five pre-declared entities
Semantic web - Computer Engineering Dept. - Spring 2006
27
• An XML document represents a hierarchy; a hierarchy is a tree novel foreword chapter number="1" paragraph paragraph paragraph
This is the great
American novel.
It was a dark and stormy night.
Suddenly, a shot rang out!
Semantic web - Computer Engineering Dept. - Spring 2006
28
• You can define your own XML tag sets, but here are some already available:
– XHTML: HTML redefined in XML
– SMIL: Synchronized Multimedia Integration Language
– MathML: Mathematical Markup Language
– SVG: Scalable Vector Graphics
– DrawML: Drawing MetaLanguage
– ICE: Information and Content Exchange
– ebXML: Electronic Business with XML
– cxml: Commerce XML
– CBL: Common Business Library
29
Semantic web - Computer Engineering Dept. - Spring 2006
• "Well Formed" XML document
– correct XML syntax
• "Valid" XML document
– “well formed”
– Conforms to the rules of a DTD
• XML DTD
– defines the legal building blocks of an XML document
– Can be inline in XML or as an external reference
• XML Schema
– an XML based alternative to DTD, more powerful
– Support namespace and data types
Semantic web - Computer Engineering Dept. - Spring 2006
31
<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
Semantic web - Computer Engineering Dept. - Spring 2006
32
• “Schema” is a general term
– DTDs are a form of XML schemas
• When we say “XML Schemas,” we usually mean the W3C XML Schema Language
– This is also known as “XML Schema Definition” language, or XSD .
Semantic web - Computer Engineering Dept. - Spring 2006
33
• DTDs provide a very weak specification language
– You can’t put any restrictions on text content
– You have very little control over mixed content (text plus elements)
– You have little control over ordering of elements
• DTDs are written in a strange (non-XML) format
– You need separate parsers for DTDs and XML
• The XML Schema Definition language solves these problems
– XSD gives you much more control over structure and content
– XSD is written in XML
34
Semantic web - Computer Engineering Dept. - Spring 2006
• To refer to a DTD in an XML document, the reference goes before the root element:
– <?xml version="1.0"?>
<!DOCTYPE rootElement SYSTEM " url
">
<rootElement> ... </rootElement>
• To refer to an XML Schema in an XML document, the reference goes in the root element:
– <?xml version="1.0"?>
<rootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
(The XML Schema Instance reference is required) xsi:noNamespaceSchemaLocation=" url
.xsd">
(This is where your XML Schema definition can be found)
...
</rootElement>
Semantic web - Computer Engineering Dept. - Spring 2006
35
• Since the XSD is written in XML, it can get confusing which we are talking about.
• The file extension is .xsd
• The root element is <schema>
• The XSD starts like this:
• <?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.rg/2001/XMLSchema">
Semantic web - Computer Engineering Dept. - Spring 2006
36
• The <schema> element may have attributes:
– xmlns:xs="http://www.w3.org/2001/XMLSchema"
• This is necessary to specify where all our XSD tags are defined
– elementFormDefault="qualified"
• This means that all XML elements must be qualified (use a namespace)
• It is highly desirable to qualify all elements, or problems will arise when another schema is added
37
Semantic web - Computer Engineering Dept. - Spring 2006
• A “simple” element is one that contains text and nothing else
– A simple element cannot have attributes
– A simple element cannot contain other elements
– A simple element cannot be empty
– However, the text can be of many different types, and may have various restrictions applied to it
• If an element isn’t simple, it’s “complex”
– A complex element may have attributes
– A complex element may be empty, or it may contain text, other elements, or both text and other elements
Semantic web - Computer Engineering Dept. - Spring 2006
38
• A simple element is defined as
<xs:element name=" name
" type=" type
" /> where:
– name is the name of the element
– the most common values for type are xs:boolean xs:integer xs:date xs:decimal xs:string xs:time
• Other attributes a simple element may have:
– default=" default value
" if no other value is specified
– fixed=" value
" no other value may be specified
Semantic web - Computer Engineering Dept. - Spring 2006
39
• Attributes themselves are always declared as simple types
• An attribute is defined as
<xs:attribute name=" name
" type=" type
" /> where:
– name and type are the same as for xs:element
• Other attributes a simple element may have:
– default=" default value
" if no other value is specified
– fixed=" value
" no other value may be specified
– use="optional" the attribute is not required (default)
– use="required" the attribute must be present
40
Semantic web - Computer Engineering Dept. - Spring 2006
• The general form for putting a restriction on a text value is:
(or xs:attribute
) – <xs:element name="name">
<xs:restriction base="type">
... the restrictions ...
</xs:restriction>
</xs:element>
• For example:
– <xs:element name="age">
<xs:restriction base="xs:integer">
<xs:minInclusive value="0">
<xs:maxInclusive value="140">
</xs:restriction>
</xs:element>
Semantic web - Computer Engineering Dept. - Spring 2006
41
• minInclusive
-number must be ≥ the given value
• minExclusive
-- number must be > the given value
• maxInclusive
-number must be ≤ the given value
• maxExclusive
-- number must be < the given value
• totalDigits
-- number must have exactly value digits
• fractionDigits
-- number must have no more than value digits after the decimal point
Semantic web - Computer Engineering Dept. - Spring 2006
42
• length
-- the string must contain exactly value characters
• minLength
-- the string must contain at least value characters
• maxLength
-- the string must contain no more than value characters
• pattern
-- the value is a regular expression that the string must match
• whiteSpace
-not really a “restriction”--tells what to do with whitespace
– value="preserve"
Keep all whitespace
– value="replace"
Change all whitespace characters to spaces
– value="collapse"
Remove leading and trailing whitespace, and replace all sequences of whitespace with a single space
43
Semantic web - Computer Engineering Dept. - Spring 2006
• An enumeration restricts the value to be one of a fixed set of values
• Example:
– <xs:element name="season">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="Spring"/>
<xs:enumeration value="Summer"/>
<xs:enumeration value="Autumn"/>
<xs:enumeration value="Fall"/>
<xs:enumeration value="Winter"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
Semantic web - Computer Engineering Dept. - Spring 2006
44
• A complex element is defined as
<xs:element name="name">
<xs:complexType>
... information about the complex type...
</xs:complexType>
</xs:element>
• Example:
<xs:element name="person">
<xs:complexType>
<xs:sequence>
<xs:element name="firstName" type="xs:string" />
<xs:element name="lastName" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
• <xs:sequence> says that elements must occur in this order
• Remember that attributes are always simple types
Semantic web - Computer Engineering Dept. - Spring 2006
45
– Examples:
• <xs:element name="student" type="person"/>
• <xs:element name="professor" type="person"/>
– Scope is important: you cannot use a type if is local to some other type
Semantic web - Computer Engineering Dept. - Spring 2006
46
• <xs:element name="person">
<xs:complexType>
<xs:sequence>
<xs:element name="firstName" type="xs:string" />
<xs:element name="lastName" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
47
Semantic web - Computer Engineering Dept. - Spring 2006
• xs:all allows elements to appear in any order
• <xs:element name="person">
<xs:complexType>
<xs:all>
<xs:element name="firstName" type="xs:string" />
<xs:element name="lastName" type="xs:string" />
</xs:all>
</xs:complexType>
</xs:element>
• Despite the name, the members of an xs:all group can occur once or not at all
• You can use minOccurs="0" to specify that an element is optional (default value is
1
)
– In this context, maxOccurs is always
1
Semantic web - Computer Engineering Dept. - Spring 2006
48
• <xs:complexType name="counter">
<xs:complexContent>
<xs:extension base="xs:anyType"/>
<xs:attribute name="count" type="xs:integer"/>
</xs:complexContent>
</xs:complexType>
Semantic web - Computer Engineering Dept. - Spring 2006
49
• Mixed elements may contain both text and elements
• We add mixed="true" to the xs:complexType element
• The text itself is not mentioned in the element, and may go anywhere (it is basically ignored)
• <xs:complexType name="paragraph" mixed="true" >
<xs:sequence>
<xs:element name="someName” type="xs:anyType"/>
</xs:sequence>
</xs:complexType>
Semantic web - Computer Engineering Dept. - Spring 2006
50
• You can base a complex type on another complex type
• <xs:complexType name="newType">
<xs:complexContent>
<xs:extension base="otherType">
...new stuff...
</xs:extension>
</xs:complexContent>
</xs:complexType>
51
Semantic web - Computer Engineering Dept. - Spring 2006
• Recall that a simple element is defined as:
<xs:element name=" name
" type=" type
" />
• Here are a few of the possible string types:
– xs:string
-- a string
– xs:normalizedString
-a string that doesn’t contain tabs, newlines, or carriage returns
– xs:token
-a string that doesn’t contain any whitespace other than single spaces
• Allowable restrictions on strings:
– enumeration
, length
, maxLength
, minLength
, pattern
, whiteSpace
Semantic web - Computer Engineering Dept. - Spring 2006
52
• xs:date -- A date in the format CCYY MM DD , for example, 2002-11-05
• xs:time
-- A date in the format hh : mm : ss (hours, minutes, seconds)
• xs:dateTime -- Format is CCYY MM -
DD T hh : mm : ss
– The T is part of the syntax
• Allowable restrictions on dates and times:
– enumeration
, minInclusive
, minExclusive
, maxInclusive
, maxExclusive
, pattern
, whiteSpace
53
Semantic web - Computer Engineering Dept. - Spring 2006
• Here are some of the predefined numeric types: xs:decimal xs:positiveInteger xs:byte xs:short xs:int xs:long xs:negativeInteger xs:nonPositiveInteger xs:nonNegativeInteger
• Allowable restrictions on numeric types:
– enumeration
, minInclusive
, minExclusive
, maxInclusive
, maxExclusive
, fractionDigits
, totalDigits
, pattern
, whiteSpace
Semantic web - Computer Engineering Dept. - Spring 2006
54
• http://www.w3.org/XML/
• http://www.w3.org/XML/Schema
Semantic web - Computer Engineering Dept. - Spring 2006
56