CPAN 330 XML Lecture #2: Well Formed XML Documents

XML Document Components

XML Declaration

The XML declaration identifies the document as XML. It is not required, but including it is recommended. The declaration starts with <?xml and ends with ?>. If it is present, it must be the first thing in the document, with nothing before it.

Three attributes can be used inside the XML declaration:

1. version: the XML version the document conforms to. The version normally used is 1.0.

2. encoding: tells the parser which character encoding to use when reading the document, for example UTF-8, UTF-16, ISO-8859-1, windows-1252, or EBCDIC. A character encoding is the method used to represent characters as bytes. For documents in English and most other Western European languages, the widely supported encoding ISO-8859-1 is typically used. Unicode (UTF-8 and UTF-16) was designed to encode all characters in all human languages. If the encoding is not specified in the declaration, the parser assumes either UTF-8 or UTF-16.

3. standalone: specifies whether the document stands entirely on its own or depends on an external Document Type Definition (DTD). It must be set to either yes or no. A DTD is a document that defines the rules (vocabulary) for the XML document. An XML document is considered valid if it complies with the rules defined in its DTD.

Only the version attribute is required. If the other attributes are used, they must appear in the order shown above. For example:

    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

Processing Instructions

A processing instruction (PI) embeds application-specific instructions in the XML document. A PI is enclosed between <? and ?>. For example, we can use a PI to associate the XML document with a style sheet document:

    <?xml-stylesheet type="text/css" href="style.css" ?>

If you are familiar with HTML, you may notice that HTML uses similar syntax to associate an HTML document with an external Cascading Style Sheet document.

Tags

A tag is any sequence of characters between < and >.
Tags are used to mark up the document and make it possible to distinguish the information from the markup. Tags come in pairs: every opening tag has a closing tag. For example, <name> is an opening tag and </name> is a closing tag. Tags can also be self-closing: if there is no content between <name> and </name>, we may write the pair as <name />.

Elements

An element is everything from the beginning of the start tag to the end of the end tag. For example:

    <name> John Smith </name>

The Root Element

The top-level element in the document is called the root element. For example, in HTML the root element is html.

Element Content

The text between the start tag and the end tag is called the element content. It can be either data (referred to as Parsed Character Data, or PCDATA) or other elements. The characters < and & cannot be used in PCDATA, because they have a specific meaning in an XML document. Instead, we can either escape them or use a CDATA section.

To escape these two characters we use entity references, which are special sequences that stand in for the illegal characters. The following table summarizes the entity references defined by XML:

    Entity Reference    Corresponding Character
    &amp;               &
    &lt;                <
    &gt;                >
    &apos;              '
    &quot;              "

The extra entity references (&gt;, &apos; and &quot;) are included for consistency.

Special characters can also be written using character references, which have the format &#nnn; where nnn is the decimal code of the character. For example, to include the character © we use &#169;.

To use a character data (CDATA) section, the text must be enclosed between <![CDATA[ and ]]>. For example:

    <report>
    <![CDATA[ income is < 1000 & sales > 20 items ]]>
    </report>

A CDATA section instructs the XML parser not to parse the text it contains. A common use of a CDATA section is to include client-side script, such as JavaScript, in the document.

Attributes

Attributes are name/value pairs associated with an element. Attributes are added to the start tag of an element.
For example:

    <emp sin="123456789"> Dona Blond </emp>

Comments

Comments in XML documents are enclosed within <!-- and -->. For example:

    <!-- This is students records document -->

Comments provide readers with additional information about the XML document.

Rules for Well Formed XML Documents

Well-formed XML is XML that meets the grammatical rules outlined in the language specification. A well-formed XML document can be read by any XML parser. To be well formed, an XML document must meet the following rules:

Every start tag must have a matching end tag, or it must be a self-closing (empty) tag. Either of the following is valid:

    <note no="1"></note>
    <note no="1" />

Tags cannot overlap; they must be properly nested. This means that a child element must be closed before its parent element is closed. For example, the following is not a well-formed XML document:

    <student> <name> John </student> </name>

To make it well formed, we write the document as follows:

    <student> <name> John </name> </student>

An XML document can have only one root element. The following is not a well-formed document because it has two root elements:

    <student> <name> John </name> </student>
    <student> <name> John </name> </student>

To make it well formed, we can wrap both elements in a single root element:

    <students>
    <student> <name> John </name> </student>
    <student> <name> John </name> </student>
    </students>

Element names must obey the XML naming convention. Names may start with a letter or '_', but not with a digit or other punctuation character. Names also cannot start with the letters 'xml'. There cannot be a space after the opening character '<' of a tag, but the closing character '>' may be preceded by a space if desired.

XML is case sensitive. Unlike HTML, XML element names are case sensitive, so <Name> and <name> are different elements.

XML keeps white space in the text. White space characters include the space character, the new line character, and the tab character.
If you are familiar with HTML, you know that insignificant white space is stripped from the document; to retain it we have to use the pre element. Unlike HTML, all white space is considered significant in XML. For example, the white space in the following XML document will not be stripped out:

    <note> memo 1
    there will be a meeting on          Oct. 10
    All paper work should be done one day before
    </note>

However, Internet Explorer uses default style sheet rules to display XML documents. This means that the browser transforms the document into HTML before displaying it, so it will seem that the white space has been stripped from the XML document above.

All attributes must have values, and these values must be quoted using single or double quotes. If an attribute has no value, it should be assigned an empty string. The following is an example of a well-formed XML document:

    <bank>
    <account type="saving"> 12345 </account>
    <account type=""> 45678 </account>
    </bank>

Comments cannot be placed inside a tag, and the string -- cannot be used inside a comment. For example, the following is not well formed:

    <customer <!-- customer record --> > John Smith</customer>

To write it correctly, we code:

    <!-- customer record -->
    <customer> John Smith </customer>

Parsing XML using Internet Explorer

If you have Internet Explorer 5 or higher, it should have the Microsoft XML Parser (MSXML) built in. You can download the latest version from http://msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp

If an XML document is not well formed, Internet Explorer will not display it and will report the location of the error. This gives you a tool for checking that your document is well formed.
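
The same kind of check can be tried outside a browser. The following is only a minimal sketch, assuming Python 3 and its standard xml.etree.ElementTree parser, neither of which is covered in this lecture. The function name check_well_formed and the sample strings are made up for illustration; the samples mirror the rules listed earlier. Like Internet Explorer, this parser checks only well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The message names the problem and the line/column where parsing stopped.
        return False, str(err)


# Hypothetical samples modelled on the well-formedness rules in this lecture.
samples = {
    "nested correctly": "<student><name>John</name></student>",
    "overlapping tags": "<student><name>John</student></name>",
    "two root elements": "<student/><student/>",
    "attribute without quotes": "<note no=1></note>",
    "unescaped ampersand": "<report>income & sales</report>",
    "escaped ampersand": "<report>income &amp; sales</report>",
    "CDATA section": "<report><![CDATA[income is < 1000 & sales > 20]]></report>",
}

results = {label: check_well_formed(text) for label, text in samples.items()}

for label, (ok, message) in results.items():
    status = "well formed" if ok else "NOT well formed: " + message
    print(label + ": " + status)

# White space inside element content is kept by the parser.
note = ET.fromstring("<note> memo 1\n   Oct. 10 </note>")
print(repr(note.text))
```

Running the script prints one line per sample. The three samples that follow the rules (nested correctly, escaped ampersand, CDATA section) are reported as well formed, and the other four are reported with the parser's error message and position.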