Well Formed XML Documents

CPAN 330 XML
Lecture #2: Well Formed XML Documents
XML Document Components
• XML Declaration
Identifies the document as XML. It is not required, but including it in
the XML document is recommended.
The XML declaration starts with <?xml and ends with ?>. When present, it
must be the very first thing in the document; nothing, not even white
space, may come before it.
Three attributes can be used inside the XML declaration:
1. version: the XML version of the document, normally 1.0.
2. encoding: tells the parser which character encoding to use when
reading this document, for example UTF-8, UTF-16, ISO-8859-1,
windows-1252, or EBCDIC. Every parser must support UTF-8 and
UTF-16; support for other encodings depends on the parser.
A character encoding is the method used to represent characters
as bytes. ISO-8859-1 covers English and most other Western
European languages and is widely supported.
Unicode was designed to cover the characters of all human
languages; UTF-8 and UTF-16 are two encodings of Unicode.
If the encoding is not specified in the declaration, then the parser
assumes either UTF-8 or UTF-16.
3. standalone: specifies whether the document stands entirely on its own or
depends on an external Document Type Definition (DTD). It must be set
to either yes or no. A DTD is a document that defines the rules
(vocabulary) for the XML document. An XML document is considered
valid if it complies with the rules defined in its DTD.
Only the version attribute is required. If the other attributes are used, they must
appear in the order shown above.
For example:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
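To see the declaration at work, here is a minimal sketch using Python's standard library parser (xml.etree.ElementTree). Python is not part of this lecture; it is used here only as a convenient parser, and the sample documents are made up for illustration. The first document declares ISO-8859-1, so the single byte 0xE9 is read as the letter é. The second has a new line before the declaration and is rejected.

```python
import xml.etree.ElementTree as ET

# The declaration names ISO-8859-1, so the byte 0xE9 is decoded as "é".
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Jos\xe9</name>'
name = ET.fromstring(latin1_doc).text

# A declaration that is not at the very start of the document is an error.
misplaced_doc = b'\n<?xml version="1.0"?><name>John</name>'
try:
    ET.fromstring(misplaced_doc)
    misplaced_ok = True
except ET.ParseError:
    misplaced_ok = False

print(name, misplaced_ok)
```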
• Processing Instructions
A Processing Instruction (PI) makes it possible to embed application-specific
instructions in the XML document. A PI is enclosed between <?
and ?>. For example, we can use a PI to associate the XML document with
a style sheet document:
<?xml-stylesheet type="text/css" href="style.css" ?>
If you are familiar with HTML, you may notice that HTML uses a similar
mechanism (the link element with type and href attributes) to associate
an HTML document with an external Cascading Style Sheet document.
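A parser hands a processing instruction to the application as a target name plus the remaining text. The sketch below assumes Python's xml.dom.minidom as the parser (an assumption of this example, not something the lecture uses) and reads back the style sheet PI shown above.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    '<name>John Smith</name>'
)

# The XML declaration is not a node, so the first child is the PI.
pi = doc.firstChild
target = pi.target   # the name right after "<?"
data = pi.data       # everything after the target, passed on as plain text

print(target)
print(data)
```

Note that the parser does not interpret type and href as attributes; it is up to the application (for example a browser) to make sense of the text.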
• Tags
A tag is the markup between < and >. Tags are used to
mark up the document and make it possible to distinguish the information
from the markup. Tags come in pairs: every opening tag has a matching
closing tag. For example, <name> is an opening tag and </name> is a
closing tag. A tag can also be self-closing: if there is no
content between <name> and </name>, we may write the element as <name />.
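A quick way to convince yourself that the two spellings of an empty element are equivalent is to parse both and compare the results. This is a small sketch that assumes Python's xml.etree.ElementTree as the parser.

```python
import xml.etree.ElementTree as ET

paired = ET.fromstring('<name></name>')
self_closing = ET.fromstring('<name />')

# Both forms give an element called "name" with no text and no children.
same = (paired.tag, paired.text, len(paired)) == (
    self_closing.tag, self_closing.text, len(self_closing))

print(same)
```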
• Elements
An element is everything from the beginning of the start tag to
the end of the end tag. For example:
<name> John Smith </name>
• The Root Element
The top-level element in the document is called the root element. For
example, in HTML, the root element is html.
• Element content
Whatever lies between the start tag and the end tag is called the element
content. It can be text (referred to as Parsed Character DATA, or PCDATA),
other elements, or a mix of both.
The characters < and & cannot appear literally in PCDATA, because they
have a special meaning in an XML document. Instead we can escape
them or place the text in a CDATA section.
To escape these two characters, we use entity references: named
sequences that stand in for characters that cannot be written literally. The
following table summarizes the entity references predefined by XML:
Entity Reference    Corresponding Character
&amp;               &
&lt;                <
&gt;                >
&apos;              '
&quot;              "
The references &gt;, &apos; and &quot; are rarely required in element
content; they are provided mainly for consistency and for use inside
quoted attribute values.
Any character can also be written as a character reference of the form
&#nnn;, where nnn is the character's decimal code. For example, to
include the character © we write &#169;
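The escaping rules above can be checked with a short sketch. It assumes Python's standard library (xml.etree.ElementTree as the parser, and xml.sax.saxutils.escape to produce escaped text); the report text is a made-up sample.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Entity and character references are turned back into characters on parsing.
report = ET.fromstring(
    '<report>income &lt; 1000 &amp; sales &gt; 20 &#169;</report>')
text = report.text

# Going the other way: escape() replaces &, < and > with entity references.
escaped = escape('income < 1000 & sales > 20')

# An unescaped < inside PCDATA makes the document not well formed.
try:
    ET.fromstring('<report>income < 1000</report>')
    raw_ok = True
except ET.ParseError:
    raw_ok = False

print(text)
print(escaped)
print(raw_ok)
```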
To use a Character DATA (CDATA) section, the text must be enclosed between
<![CDATA[ and ]]>. For example:
<report> <![CDATA[ income is < 1000 & sales > 20 items ]]>
</report>
A CDATA section instructs the XML parser to treat the enclosed text as
plain character data rather than markup. A common use of a CDATA section is
to include client-side script, such as JavaScript, in the document.
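The effect of a CDATA section can be seen by parsing one. The sketch below assumes Python's xml.etree.ElementTree as the parser; the < and & characters come through untouched, and the CDATA delimiters themselves are not part of the text.

```python
import xml.etree.ElementTree as ET

report = ET.fromstring(
    '<report><![CDATA[ income is < 1000 & sales > 20 items ]]></report>')

# The application receives the enclosed characters exactly as written.
cdata_text = report.text

print(repr(cdata_text))
```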
• Attributes
Attributes are name/value pairs that are associated with an element. They
are added to the start tag of the element. For example:
<emp sin="123456789"> Dona Blond </emp>
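As an illustration, the following sketch reads the attribute back from the example element. It assumes Python's xml.etree.ElementTree as the parser, which exposes the attributes of an element as a dictionary of name/value pairs.

```python
import xml.etree.ElementTree as ET

emp = ET.fromstring('<emp sin="123456789"> Dona Blond </emp>')

attributes = emp.attrib   # all name/value pairs of the start tag
sin = emp.get('sin')      # one attribute value, always a string
content = emp.text        # the element content is kept separate

print(attributes, sin, repr(content))
```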
• Comments
Comments in XML documents are enclosed within <!-- and -->. For
example:
<!-- This is the student records document -->
Comments are used to give readers additional information about
the XML document.
Rules for Well Formed XML Documents:
Well-formed XML is XML that meets the grammatical rules laid out in
the language specification. Any XML parser can read the information
contained in a well-formed XML document. For an
XML document to be well formed, it must meet the following rules:
• Every start tag must have a matching end tag, or it must be a self-closing (empty) tag.
Either of the following forms is well formed:
<note no="1"></note>
or
<note no="1" />
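This rule can be tried out with a small helper. The sketch below assumes Python's xml.etree.ElementTree as the parser; is_well_formed is a hypothetical helper name introduced only for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

paired_ok = is_well_formed('<note no="1"></note>')
empty_ok = is_well_formed('<note no="1" />')
unclosed_ok = is_well_formed('<note no="1">')   # the end tag is missing

print(paired_ok, empty_ok, unclosed_ok)
```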
• Tags can’t overlap.
Tags must be properly nested. This means that we must close a
child element before we close its parent element. For example, the
following is not a well-formed XML document:
<student>
<name> John
</student>
</name>
To make it well formed, we write the document as follows:
<student>
<name> John
</name>
</student>
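The two documents above can be fed to a parser to confirm the difference. This is a sketch assuming Python's xml.etree.ElementTree; is_well_formed is a hypothetical helper defined here for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

overlapping = '<student>\n<name> John\n</student>\n</name>'
nested = '<student>\n<name> John\n</name>\n</student>'

overlapping_ok = is_well_formed(overlapping)
nested_ok = is_well_formed(nested)

print(overlapping_ok, nested_ok)
```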
• An XML document can have only one root element.
The following is not a well-formed document because it has two root
elements:
<student>
<name> John
</name>
</student>
<student>
<name> John
</name>
</student>
To make it well formed, we can write it as follows:
<students>
<student>
<name> John
</name>
</student>
<student>
<name> John
</name>
</student>
</students>
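The single-root rule can be checked the same way. The sketch assumes Python's xml.etree.ElementTree as the parser; is_well_formed is a hypothetical helper defined for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

student = '<student><name> John </name></student>'

two_roots_ok = is_well_formed(student + student)
wrapped_ok = is_well_formed('<students>' + student + student + '</students>')

# With a single root, the students become its children.
student_count = len(ET.fromstring('<students>' + student + student + '</students>'))

print(two_roots_ok, wrapped_ok, student_count)
```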
• Element names must obey the XML naming conventions.
Names may start with a letter or ‘_’, but not with a digit or other
punctuation character. Names that start with the letters ‘xml’ (in any
mix of upper and lower case) are reserved and should not be used.
There can’t be a space after the opening character ‘<’ of a tag, but the
closing character ‘>’ may be preceded by a space if desired.
• XML is case sensitive.
Unlike HTML, XML element names are case sensitive, so <Name> and
</name> do not match.
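The naming and case-sensitivity rules above can be illustrated with a few one-line documents. The sketch assumes Python's xml.etree.ElementTree as the parser; is_well_formed is a hypothetical helper, and the element names are made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

results = {
    'underscore start': is_well_formed('<_name/>'),
    'digit start': is_well_formed('<1name/>'),
    'space after <': is_well_formed('< name/>'),
    'space before >': is_well_formed('<name ></name >'),
    'same case': is_well_formed('<Name>John</Name>'),
    'mixed case': is_well_formed('<Name>John</name>'),
}

for rule, ok in results.items():
    print(rule, ok)
```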
• XML keeps white space in the text.
White space characters are the space character, the new line
character, and the tab character.
If you are familiar with HTML, you know that a browser collapses
insignificant white space when it displays the document. To retain the
white space we have to use the pre element.
Unlike HTML, all white space in element content is considered significant
in XML. For example, the white space in the following XML document will
not be stripped out:
<note> memo 1
there will be a meeting on          Oct. 10
All paper work should be done
one day before
</note>
However, Internet Explorer uses default style sheet rules to display
XML documents. This means that the browser transforms the
document into HTML before displaying it, so it will seem that the
white space has been stripped out of the above XML document.
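To look at the white space without a browser in the way, the note can be parsed directly. This sketch assumes Python's xml.etree.ElementTree as the parser and uses a shortened version of the memo; the last line imitates, by hand, the kind of collapsing a browser does when it displays the text.

```python
import xml.etree.ElementTree as ET

note_xml = '<note> memo 1\nthere will be a meeting on     Oct. 10\n</note>'
note = ET.fromstring(note_xml)

# The parser passes every space and new line on to the application.
kept = note.text

# Collapsing runs of white space is a display decision, not part of parsing.
collapsed = ' '.join(kept.split())

print(repr(kept))
print(repr(collapsed))
```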
• All attributes must have values, and these values must be quoted
using single or double quotes. If an attribute has no value, then it
should be assigned an empty string.
The following is an example of a well-formed XML document:
<bank>
<account type="saving"> 12345 </account>
<account type=""> 45678 </account>
</bank>
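The attribute rule can be tested as follows. The sketch assumes Python's xml.etree.ElementTree as the parser; is_well_formed is a hypothetical helper defined for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

unquoted_ok = is_well_formed('<account type=saving> 12345 </account>')
no_value_ok = is_well_formed('<account type> 12345 </account>')
single_ok = is_well_formed("<account type='saving'> 12345 </account>")
double_ok = is_well_formed('<account type="saving"> 12345 </account>')

# An empty value is allowed and is read back as an empty string.
empty_value = ET.fromstring('<account type=""> 45678 </account>').get('type')

print(unquoted_ok, no_value_ok, single_ok, double_ok, repr(empty_value))
```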
• Comments cannot be inserted inside a tag, and the string -- cannot
be used inside a comment.
For example, the following is not well formed:
<customer <!-- customer record --> > John Smith</customer>
To make it well formed, we write:
<!-- customer record -->
<customer>
John Smith
</customer>
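Both parts of the comment rule can be checked with a parser. The sketch assumes Python's xml.etree.ElementTree; is_well_formed is a hypothetical helper defined for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

inside_tag_ok = is_well_formed(
    '<customer <!-- customer record --> > John Smith</customer>')
double_hyphen_ok = is_well_formed(
    '<!-- customer -- record --><customer>John Smith</customer>')
before_element_ok = is_well_formed(
    '<!-- customer record --><customer>John Smith</customer>')

print(inside_tag_ok, double_hyphen_ok, before_element_ok)
```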
Parsing XML using Internet Explorer
If you have Internet Explorer 5 or higher, then it should have the Microsoft
XML Parser (MSXML) built in. You can download the latest version
from
http://msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp
If an XML document is not well formed, Internet Explorer will not
display it and will instead report the location of the error(s). This
gives you a tool to check that your document is well formed.
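Other parsers report the location of an error in a similar way. As a sketch of the same idea outside the browser, the following assumes Python's xml.etree.ElementTree, whose ParseError carries the line and column at which parsing stopped. The broken document is the overlapping student example from the rules above.

```python
import xml.etree.ElementTree as ET

broken = '<student>\n<name> John\n</student>\n</name>'

error_line = error_column = message = None
try:
    ET.fromstring(broken)
except ET.ParseError as err:
    # position is a (line, column) pair; the line count starts at 1.
    error_line, error_column = err.position
    message = str(err)

print(error_line, error_column)
print(message)
```

Here the parser stops on the third line, where </student> appears while name is still open.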