XML document

Knowledge-Based Systems (Sistemi basati su conoscenza)
XML
Prof. M.T. PAZIENZA
Academic year 2004-2005
Introduction to XML
HTML (1990) was designed to display data
(documents), and to focus on how data
looks
XML (1996) was designed to describe data
(documents), and to focus on what data is
HTML is about displaying information,
XML is about describing information
Both derive from SGML (1986)
XML is a standard for describing content in addition
to presentation aspects.
HTML
HTML is a markup language: it augments regular
text with “marks” that hold special meaning for
the Web browser handling the document.
Commands in the language are called tags (start –
end),
<TAG>, </TAG>
and have a fixed meaning.
HTML is adequate to represent the structure of
documents only for display purposes.
XML
(EXtensible Markup Language)
XML tags are not predefined in XML. The author must
define his own tags and his own document structure.
XML can use a DTD (Document Type Definition) to describe
any type of data (document).
XML with a DTD is designed to be self-descriptive.
XML is free and extensible
XML is a cross-platform, software- and hardware-independent
tool for transmitting information.
XML does not DO anything
XML was not designed to DO anything.
It is just “pure information” wrapped in XML tags.
Someone must write a piece of software to send it,
receive it or display it.
XML is not a language; it is a syntax supporting
creation of personalized markup languages.
XML does not DO anything
Ex:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to> <from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The “note” has a header, and a message body. It also has
sender and receiver information. But still, this XML
document does not DO anything. Someone must write
a piece of software to send it, receive it or display it.
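As an illustration of the "piece of software" mentioned above, here is a minimal sketch that reads the note and displays it as text. It uses Python's standard xml.etree.ElementTree module; the function name display_note and the output layout are assumptions made for this example, not part of the original material.

```python
import xml.etree.ElementTree as ET

# The note from the slide, as bytes so the declared encoding is honoured.
NOTE_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to> <from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def display_note(xml_bytes):
    """Return a plain-text rendering of a <note> document (hypothetical helper)."""
    note = ET.fromstring(xml_bytes)
    return (
        f"To: {note.findtext('to')}\n"
        f"From: {note.findtext('from')}\n"
        f"Subject: {note.findtext('heading')}\n"
        f"\n{note.findtext('body')}"
    )


print(display_note(NOTE_XML))
```

The XML itself stays unchanged; all behaviour (here, formatting for display) lives in the program that consumes it.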
XML to Exchange Data
XML was designed to store, carry and exchange
data. With XML, data can be exchanged between
incompatible systems.
In the real world, computer systems and databases
contain data in incompatible formats.
Converting the data to XML can greatly reduce the
complexity of data exchange and create data that
can be read by many different types of
applications.
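A minimal sketch of such a conversion, assuming the data starts as a simple Python dictionary of text fields. The helper names record_to_xml and xml_to_record are hypothetical and chosen only for this example; real systems would need rules for nested data, data types and escaping of names.

```python
import xml.etree.ElementTree as ET


def record_to_xml(tag, record):
    """Serialise a flat dict as one XML element with one child per field."""
    elem = ET.Element(tag)
    for name, value in record.items():
        ET.SubElement(elem, name).text = str(value)
    return ET.tostring(elem, encoding="unicode")


def xml_to_record(text):
    """Read the XML back into a flat dict (the receiving side)."""
    return {child.tag: child.text for child in ET.fromstring(text)}


record = {"to": "Tove", "from": "Jani"}
xml_text = record_to_xml("note", record)
print(xml_text)
print(xml_to_record(xml_text))
```

The sender and the receiver only need to agree on the XML vocabulary, not on each other's internal data format.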
XML used to Share Data
With XML, plain text files can be used to share
data.
Since XML data is stored in plain text format,
XML provides a software- and hardware-independent way of sharing data.
This makes it much easier to create data that
different applications can work with. It also
makes it easier to expand or upgrade a system
to new operating systems, servers, applications,
and new browsers.
XML used to Store Data
With XML, plain text files can be used to store
data.
XML can also be used to store data in files or in
databases. Applications can be written to store
and retrieve information from the store, and
generic applications can be used to display
the data.
XML Syntax
The syntax rules of XML are very simple and very
strict. The rules are very easy to learn, and very
easy to use.
Creating software that can read and manipulate
XML is very easy to do.
XML Syntax
An element is the primary building block of an XML document; it is
delimited by a start tag and an end tag. XML element names are case
sensitive, and elements must be properly nested. Elements are related
as parents and children.
With XML, the tag <Letter> is different from the tag <letter>.
Opening and closing tags must therefore be written with the same
case.
Attributes provide additional information about the element. Their
values (enclosed in quotes) are inside the start tag of an element.
An attribute is a name-value pair separated by an equal sign (=)
Entities are shortcuts for portions of common text (entity reference
starts with “&” and ends with “;”)
…
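The rules on this slide can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the element name Letter comes from the slide, while the attribute kind, the text content and the helper parses are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# One element with an attribute and an entity reference in its text.
letter = ET.fromstring('<Letter kind="formal">Tom &amp; Jerry</Letter>')
print(letter.tag)     # element name, case preserved
print(letter.attrib)  # attributes as name/value pairs
print(letter.text)    # &amp; has been replaced by the character it stands for


def parses(text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses("<Letter></Letter>"))  # matching case
print(parses("<Letter></letter>"))  # case mismatch between start and end tag
print(parses("<a><b></a></b>"))     # improperly nested elements
```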
XML Syntax
Comments (arbitrary text) may be inserted almost anywhere in an XML
document outside other markup (a comment starts with "<!--" and ends with "-->")
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
An example of a comment:
<!-- declarations for <head> & <body> -->
Note that the grammar does not allow a comment ending in --->.
The following example is not well-formed
<!-- B+, B, or B--->
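A quick way to try the two comments above is to hand them to a parser. This is a sketch using Python's standard xml.etree.ElementTree; the wrapper element doc and the function name comment_is_accepted are assumptions for the example.

```python
import xml.etree.ElementTree as ET


def comment_is_accepted(comment):
    """Wrap the comment in a dummy <doc> element and try to parse it."""
    try:
        ET.fromstring(f"<doc>{comment}</doc>")
        return True
    except ET.ParseError:
        return False


# "<" and "&" are allowed inside a comment.
print(comment_is_accepted("<!-- declarations for <head> & <body> -->"))
# "--" may not appear inside a comment, so "--->" is rejected.
print(comment_is_accepted("<!-- B+, B, or B--->"))
```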
XML Syntax
A document type definition (DTD) is the set of rules that lets
authors specify their own set of elements, attributes and entities.
A DTD specifies which elements can be used and the constraints on
those elements.
A DTD defines the legal elements of an XML document, that is,
the legal building blocks of an XML document. It defines the
document structure with a list of legal elements.
XML Schema is an XML based alternative to DTD.
Why use a DTD?
XML provides an application independent way of
sharing data.
With a DTD, independent groups of people can
agree to use a common DTD for interchanging
data.
Any application can use a standard DTD to verify
that data received from the outside world is
valid.
XML Syntax
All XML elements must have a closing tag (an empty element may
instead be written as a single self-closing tag, e.g. <prod/>).
The XML declaration
<?xml version="1.0" encoding="ISO-8859-1"?>
is the only exception: it is not an XML element, so it
does not have a closing tag.
The XML declaration defines the XML version and the
character encoding used in the document.
XML Syntax
All XML documents must have a root tag
The first tag in an XML document is the root tag.
All XML documents must contain a single tag pair to define the root
element (ex.<note> ).
All other elements must be nested within the root element.
All elements can have sub-elements (children). Sub-elements must
be correctly nested within their parent element:
<root> <child> <subchild>.....</subchild> </child> </root>
In the previous note example, the root has 4 child elements (to, from,
heading, body)
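The root and its children can be inspected programmatically. The following sketch uses Python's standard xml.etree.ElementTree on the note example; the helper has_single_root is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<note><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)
child_names = [child.tag for child in root]
print(root.tag, child_names)


def has_single_root(text):
    """Return True if the text parses, which requires exactly one root element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(has_single_root("<a><b/></a>"))  # one root with a nested child
print(has_single_root("<a/><b/>"))     # two top-level elements
```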
XML Syntax
Attribute values must always be quoted
With XML, it is illegal to omit quotation marks
around attribute values.
XML elements can have attributes in name/value pairs
just like in HTML.
In XML the attribute value must always be quoted.
XML Syntax
<?xml version="1.0"?> <note date=12/11/99>
<to>Tove</to> <from>Jani</from>
<heading>Reminder</heading> <body>Don't forget
me this weekend!</body> </note>
Incorrect
<?xml version="1.0"?> <note date="12/11/99">
<to>Tove</to> <from>Jani</from>
<heading>Reminder</heading> <body>Don't forget
me this weekend!</body> </note>
Correct
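The difference between the two versions shows up as soon as a parser reads them. This sketch uses Python's standard xml.etree.ElementTree; the function read_date is a hypothetical helper, and the note is shortened to its start and end tags for brevity.

```python
import xml.etree.ElementTree as ET


def read_date(note_xml):
    """Return the value of the date attribute, or None if the text is not well formed."""
    try:
        return ET.fromstring(note_xml).get("date")
    except ET.ParseError:
        return None


print(read_date('<note date=12/11/99></note>'))    # unquoted value: rejected
print(read_date('<note date="12/11/99"></note>'))  # double quotes: accepted
print(read_date("<note date='12/11/99'></note>"))  # single quotes are also allowed
```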
XML Syntax
White Space is Preserved
CR / LF is Converted to LF
A new line is always stored as LF
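A small sketch of both rules, using Python's standard xml.etree.ElementTree (the element name body and the sample text are assumptions for the example): leading and trailing spaces survive parsing, while the CR/LF pair in the input reaches the application as a single LF.

```python
import xml.etree.ElementTree as ET

# The input contains a Windows-style line break (CR LF) and padding spaces.
elem = ET.fromstring("<body>  line one\r\nline two  </body>")

print(repr(elem.text))
```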
XML Syntax
There is nothing special about XML. It is just plain text
with the addition of some XML tags enclosed in angle
brackets.
Software that can handle plain text can also handle XML.
In a simple text editor, the XML tags will be visible
and will not be handled specially.
In an XML-aware application however, the XML tags can
be handled specially. The tags may or may not be
visible, or have a functional meaning, depending on
the nature of the application.
XML Elements
XML Elements are Extensible
XML documents can be extended to carry more
information.
XML Elements have Relationships
Elements are related as parents and children
XML Elements
Book Title: My First XML
Chapter 1: Introduction to XML
What is HTML
What is XML
Chapter 2: XML Syntax
Elements must have a closing tag
Elements must be properly nested
XML element (book description)
<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>
XML element (book description)
book is the root element.
title, prod, and chapter are child elements of book.
book is the parent element of title, prod, and chapter.
title, prod, and chapter are siblings (or sister elements)
because they have the same parent.
Elements have Content
Elements can have different content types.
An XML element is everything from (including)
the element's start tag to (including) the
element's end tag.
An element can have element content, mixed
content, simple content, or empty content. An
element can also have attributes.
Elements have Content
In the book description:
book has element content, because it contains other
elements;
chapter has mixed content because it contains both text
and other elements;
para has simple content (or text content) because it
contains only text;
prod has empty content because it carries no
information.
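The four content types can be told apart mechanically. The sketch below is one possible classification, written with Python's standard xml.etree.ElementTree; the function content_type is hypothetical, and it assumes that whitespace between child elements does not count as text.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>"""


def content_type(elem):
    """Classify an element as having element, mixed, simple or empty content."""
    has_children = len(elem) > 0
    # Text can appear before the first child (elem.text) or after any child (child.tail).
    pieces = [elem.text] + [child.tail for child in elem]
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"


book = ET.fromstring(BOOK_XML)
for elem in (book, book.find("chapter"), book.find("chapter/para"), book.find("prod")):
    print(elem.tag, content_type(elem))
```

Note that prod still carries two attributes; "empty" refers only to what lies between its start and end tags.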
Element Naming
Names can contain letters, numbers, and other
characters
Names must not start with a number or punctuation
character
Names should not start with the letters xml (or XML
or Xml ..), which are reserved
Names cannot contain spaces
Element Naming
XML documents often have a corresponding database, in
which fields exist corresponding to elements in the
XML document. A good practice is to use the naming
rules of the database for the elements in the XML
documents
Ex. XML News document
<?xml version="1.0"?>
<nitf> <head>
<title>Colombia Earthquake</title> </head>
<body> <body.head> <headline> <hl1>143 Dead in
Colombia Earthquake</hl1> </headline>
<byline> <bytag>By Jared Kotler, Associated Press
Writer</bytag> </byline>
<dateline> <location>Bogota,Colombia</location>
<story.date>Monday January 25 1999 7:28
ET</story.date> </dateline> </body.head> </body>
</nitf>
DTD
A DTD is enclosed in
<!DOCTYPE name [DTD declaration ]>
where name is the name of the outermost enclosing tag,
and [DTD declaration ] is the text of the rules of the
DTD
The DTD starts with the outermost element, called the root
element
Internal DTD
<?xml version="1.0"?> <!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)> ]>
<note> <to>Tove</to> <from>Jani</from>
<heading>Reminder</heading> <body>Don't forget me this
weekend!</body> </note>
The DTD is interpreted like this:
!ELEMENT note defines the element "note" as having four child elements,
in this order: "to,from,heading,body".
!ELEMENT to defines the "to" element to be of the type "#PCDATA"
(parsed character data).
!ELEMENT from defines the "from" element to be of the type "#PCDATA"
and so on.....
CDATA Sections
CDATA sections may occur anywhere character data
may occur; they are used to escape blocks of text
containing characters which would otherwise be
recognized as markup.
CDATA sections begin with the string "<![CDATA["
and end with the string
"]]>"
CDATA Sections
CDSect ::= CDStart CData CDEnd
CDStart ::= '<![CDATA['
CData ::= (Char* - (Char* ']]>' Char*))
CDEnd ::= ']]>'
Within a CDATA section, only the CDEnd string is
recognized as markup, so that left angle brackets
and ampersands may occur in their literal form;
they need not (and cannot) be escaped using "&lt;"
and "&amp;". CDATA sections cannot nest.
CDATA Sections
An example of a CDATA section, in which
"<greeting>" and "</greeting>" are recognized
as character data, not markup:
<![CDATA[<greeting>Hello,world!</greeting>]]>
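To see the effect, the CDATA section can be placed inside an element and parsed. This is a sketch with Python's standard xml.etree.ElementTree; the wrapper element example is an assumption for the illustration.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<example><![CDATA[<greeting>Hello,world!</greeting>]]></example>"
)
# The same text written with entity references instead of a CDATA section.
with_escapes = ET.fromstring(
    "<example>&lt;greeting&gt;Hello,world!&lt;/greeting&gt;</example>"
)

print(with_cdata.text)       # the tags arrive as plain characters
print(len(with_cdata))       # number of child elements
print(with_cdata.text == with_escapes.text)
```

Both forms give the application the same character data; CDATA only changes how the text is written in the file.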
External DTD
This is the same XML document with an external DTD:
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
External DTD
This is a copy of the file "note.dtd" containing the Document
Type Definition
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
XML document (with DTD)
An example of an XML document with a document type
declaration
<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello,
world!</greeting>
The system identifier "hello.dtd" gives the address (a URI
reference) of a DTD for the document
XML document (with DTD)
The declarations can also be given locally, as in this
example:
<?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE
greeting [ <!ELEMENT greeting (#PCDATA)> ]>
<greeting>Hello, world!</greeting>
XML document (with DTD)
If both the external and internal subsets are used, the
internal subset is considered to occur before the
external subset.
This has the effect that entity and attribute-list
declarations in the internal subset take precedence
over those in the external subset.
Language identification
In document processing, it is often useful to identify the
natural or formal language in which the content is
written.
A special attribute named xml:lang may be inserted in
documents to specify the language used in the
contents and attribute values of any element in an
XML document.
In valid documents, this attribute, like any other, must
be declared if it is used.
Language identification
A simple declaration for xml:lang might take the form
xml:lang NMTOKEN #IMPLIED
The intent declared with xml:lang is considered to apply to all
attributes and content of the element where it is specified, unless
overridden with an instance of xml:lang on another element within
that content.
Specific default values may also be given, if appropriate. In a
collection of French poems for English students, with glosses and
notes in English, the xml:lang attribute might be declared this
way:
<!ATTLIST poem xml:lang NMTOKEN 'fr'>
<!ATTLIST gloss xml:lang NMTOKEN 'en'>
<!ATTLIST note xml:lang NMTOKEN 'en'>
Language identification
<p xml:lang="en">The quick brown fox jumps over
the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
<l>Habe nun, ach! Philosophie,</l> <l>Juristerei,
und Medizin</l> <l>und leider auch Theologie</l>
<l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
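The inheritance rule for xml:lang can be sketched as a walk over the element tree. The example uses Python's standard xml.etree.ElementTree, which exposes xml:lang under the XML namespace URI; the function languages and the wrapper element div are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the namespace bound to the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(elem, inherited=None):
    """Yield (tag, effective language) for elem and all its descendants."""
    lang = elem.get(XML_LANG, inherited)  # own value overrides the inherited one
    yield elem.tag, lang
    for child in elem:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<div xml:lang="en">'
    "<p>What colour is it?</p>"
    '<sp who="Faust" xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>'
    "</div>"
)
print(list(languages(doc)))
```

Here p inherits "en" from div, while l inherits "de" from sp, which overrides the outer value.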
“Well formed” XML documents
A “well formed” XML document has correct
XML syntax (i.e. it is a document that conforms
to the XML syntax rules).
A “valid” XML document is a “well formed”
XML document which also conforms to the
rules of a DTD (Document Type Definition).
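The distinction can be illustrated with a non-validating parser. The sketch below uses Python's standard xml.etree.ElementTree, which checks well-formedness only and does not validate against a DTD; the function is_well_formed and the sample documents are assumptions for the example. Checking validity would need a validating parser, which is outside this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text obeys the XML syntax rules."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Well formed, but NOT valid: the DTD requires to, from, heading and body.
WELL_FORMED_BUT_INVALID = """<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
]>
<note><to>Tove</to></note>"""

# Not well formed: the end tags are in the wrong order.
NOT_WELL_FORMED = "<note><to>Tove</note></to>"

print(is_well_formed(WELL_FORMED_BUT_INVALID))
print(is_well_formed(NOT_WELL_FORMED))
```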
http://www.xml.com/pub/a/98/10/guide0.html
http://xmlfiles.com/xml/default.asp
http://www.brics.dk/~amoeller/XML/index.html
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/xmlsdk30/htm/xmtutxmltutorial.asp
http://www.w3.org/TR/2000/REC-xml-20001006#sec-well-formed