Chapter 1 GETTING STARTED
SYS-ED/ COMPUTER EDUCATION TECHNIQUES, INC.
XML: Extensible Markup Language
Getting Started

Objectives

You will learn:

• SGML and SGML document components.
• XML: What is it.
• XML as compared to SGML and HTML.
• XML format.
• XML specifications.
• XML architecture.
• Data structure namespaces.
• Data delivery, manipulation.
• Parsing XML.
• Manipulating and editing data using the Document Object Model.
• Displaying XML-based data in HTML.
• XSL: Extensible Stylesheet Language.
• XML: Transforming and querying.

SYS-ED®\COMPUTER EDUCATION TECHNIQUES, INC. (XML: 1.01) Ch 1: Page i

1 SGML

SGML stands for the Standard Generalized Markup Language; it is a method of marking up parts of a document with tags. SGML is a language used for the definition of device-independent and system-independent methods of representing electronic texts. It is used for describing the terms and rules for marking up texts in a particular application.

By design, SGML is flexible in order to utilize descriptive markup, which allows different processing applications to operate on the same text. It also recognizes different document types, which allows parser programs to verify that each uniquely marked-up document is correctly tagged and organized.

SGML uses ASCII characters, rather than specialized character sets. The use of ASCII serves to prevent incompatibility between programming languages and hardware or software environments, and is recognizable by any user regardless of programming experience. It is also possible to create non-ASCII characters with certain SGML tags.

By defining the elements that mark up a text, and the attributes that qualify and describe those elements, SGML-tagged documents can be organized into indexed databases. This means that the texts can be retrieved according to user-specified parameters, and individual parts of separate texts can also be accessed.

1.1. SGML Document Components

Each SGML document consists of an SGML prolog and a document instance.

SGML Prolog

The prolog contains an SGML declaration and a document type declaration (DTD).

• The SGML declaration is made up of character sets and codes specific to the version of SGML being used. It is typically kept in compiled tables by the SGML processor and will not normally be seen by the user.
• The DTD contains the list of elements, attributes, entities and other declarations by which the tags that mark up the text are distinguished. The DTD may exist separately from the texts that refer to it, or may be contained within the texts.

Document Instance

The document instance is the text itself. An important distinction between the document instance and the DTD is that the document instance cannot contain any tag definitions. Any tag within a text that is not defined by the text's DTD will not be recognized.

2 SGML: How it is Used

SGML can be used for creating, validating and processing electronic documents. Most SGML applications use a piece of software called a parser, which takes a DTD and uses it to verify whether or not a marked-up text conforms to the standards of that DTD. Many parsers will produce a new version of the document in which all tags are translated according to the DTD specifications.

These new versions of marked-up text can be used in conjunction with other software packages in a variety of ways, such as:

• structured editors, which can determine for users when and where different word-processing tags should be inserted automatically;
• formatting, or producing typographically distinct printed versions of documents; and
• text-based database systems, where users can search a DTD archive for occurrences of words or word patterns.
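The relationship between a DTD and a document instance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an XML document with an internal DTD rather than full SGML, the memo vocabulary and the `undeclared_tags` helper are hypothetical, and the standard-library expat parser it relies on does not validate. The sketch therefore only compares the element names declared in the DTD with the tags used in the instance, which is a small part of what a validating parser checks.

```python
import xml.parsers.expat

# Hypothetical memo document: an internal DTD followed by the document
# instance. The <cc> tag is deliberately not declared in the DTD.
MEMO = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<memo>
  <to>Staff</to>
  <cc>Archive</cc>
  <body>Meeting at noon.</body>
</memo>
"""


def undeclared_tags(document):
    """Return the set of tags used in the instance but not declared in the DTD."""
    declared = set()
    used = set()

    parser = xml.parsers.expat.ParserCreate()
    # Called once for each <!ELEMENT ...> declaration in the DTD.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once for each start tag in the document instance.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(document, True)

    return used - declared


print(undeclared_tags(MEMO))
```

A validating parser would go further and also check the order and nesting of elements against each content model, such as `(to, body)` above.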
3 HTML

HTML stands for HyperText Markup Language and it is the standard for publishing hypertext on the World Wide Web. It is a non-proprietary format based upon SGML, and can be created and processed by simple plain text editors or sophisticated WYSIWYG authoring tools. HTML uses tags to structure text into headings, paragraphs, lists, hypertext links, etc.

HTML gives authors the means to:

• Publish online documents with headings, text, tables, lists, photos, etc.
• Retrieve online information via hypertext links, at the click of a button.
• Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.
• Include spreadsheets, video clips, sound clips, and other applications directly in their documents.

3.1. HTML: Evolution

HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA. During the 1990s its use expanded with the explosive growth of the World Wide Web. HTML has been extended in a number of ways.

The Web depends on Web page authors and vendors sharing the same conventions for HTML. It is important that HTML documents work well across different browsers and platforms. Achieving interoperability lowers costs to content providers, since they must develop only one version of a document.

Each version of HTML has attempted to reflect greater consensus among industry players, so that the investment made by content providers will not be wasted and their documents will not become unreadable in a short period of time.

HTML 4

HTML 4 extends HTML with mechanisms for style sheets, scripting, frames, embedding objects, improved support for right-to-left and mixed-direction text, richer tables, and enhancements to forms, offering improved accessibility for people with disabilities.
HTML 4.01 is a revision of HTML 4.0 that corrects errors and makes some changes since the previous revision.

4 XML: What is it

XML stands for Extensible Markup Language; it is a markup language for documents containing structured information. All documents have some structure. Structured information contains both content, such as words and pictures, and some indication of the role that content plays. A markup language is a mechanism for identifying structures in a document. The XML specification defines a standard way of adding markup to documents.

4.1. XML versus SGML

The Extensible Markup Language shares fundamental goals and methods with SGML, but it is in fact a restricted subset of SGML. This restriction is fundamental to XML; it is motivated largely by the demand for generalized markup that can be consumed on the Web and by the aim of promoting software production.

SGML users can use XML as a replacement for SGML, but in many situations they will use it as a distribution medium for displaying documents on the World Wide Web. Prior to XML this typically meant converting documents to the HyperText Markup Language. The shortcoming of this approach was a reduction in the structural intelligence of SGML documents, rendering them nearly useless for high-quality formatting and processing. The advantage to SGML users of using XML instead of HTML is the retention of the structural integrity of the documents.

4.2. XML versus HTML

In HTML, both the tag semantics and the tag set are fixed. The W3C, in conjunction with browser vendors and the WWW community, is constantly working to extend the definition of HTML to allow new tags to keep pace with changing technology and to bring variations in presentation to the Web.

XML specifies neither semantics nor a tag set; XML is really a meta-language for describing markup languages.
XML provides a facility to define tags and the structural relationships between them. Since there is no predefined tag set, there cannot be any preconceived semantics. All of the semantics of an XML document will be defined either by the applications that process it or by stylesheets.

5 XML: Purpose and Function

XML was created so that richly structured documents can be used over the Web. The only viable alternatives, HTML and SGML, are not practical for this purpose.

HTML comes bound with a set of semantics and does not provide arbitrary structure.

SGML provides arbitrary structure, but is too difficult to implement just for a Web browser. Full SGML systems solve large, complex problems that justify their expense. Viewing structured documents sent over the Web rarely carries such justification.

It is not expected that XML will completely replace SGML. In many organizations, filtering SGML to XML will be the standard procedure for Web delivery.

6 XML Format

There are several benefits to writing data as a much longer XML document:

• The data is self-describing.
• The data can be manipulated with standard tools.
• The data can be viewed with standard tools.
• Different views of the same data can be created with style sheets.

The first major benefit of the XML format is that the data is self-describing. The meaning of each number is clearly associated with the number itself.

The second benefit of providing the data in XML is that the data can be manipulated in a wide range of XML-enabled tools. The data will probably be bigger, but the extra redundancy allows more tools to process it.

In some ways, XML is just another data format. However, XML has several advantages over other formats when storing information.
• XML allows developers to create their own labeled structures for storing information.
• XML parsing is well-defined and widely implemented, making it possible to retrieve information from XML documents in a variety of environments.
• XML is built on a Unicode foundation, making it easier to create internationalized documents.
• Applications can rely on XML parsers to do some structural validation, as well as data type checking when schemas are used.
• XML formats are text-based, making them more readable, easier to document, and sometimes easier to debug.
• Tools are available for XML processing on different platforms, making it simpler to use XML instead of binary formats to exchange complex information streams.
• XML documents can use much of the infrastructure already built for HTML, including the HTTP protocol and some browsers.

7 XML Development Goals

The XML specification sets out the following goals for XML:

• It shall be straightforward to use XML over the Internet. Users must be able to view XML documents as quickly and easily as HTML documents.
• XML shall support a wide variety of applications: authoring, browsing, content analysis, etc. Although the initial focus is on serving structured documents over the Web, it is not meant to narrowly define XML.
• XML shall be compatible with SGML. XML has been designed to be compatible with existing standards while solving the relatively new problem of sending richly structured documents over the Web.
• It shall be easy to write programs that process XML documents.
• The number of optional features in XML is to be kept to an absolute minimum. Optional features inevitably raise compatibility problems when users want to share documents and sometimes lead to confusion and frustration.
• XML documents should be human-legible and reasonably clear.
• The XML design should be prepared quickly; standards efforts are notoriously slow.
• The design of XML shall be formal and concise.
• XML documents shall be easy to create. There will eventually be sophisticated editors for creating and editing XML content. In the interim, it must be possible to create XML documents in other ways, such as directly in a text editor or with simple shell and Perl scripts.

Several SGML language features were designed to minimize the amount of typing required to manually key in SGML documents. These features are not supported in XML. Most modern editors offer better facilities for defining shortcuts when entering text.

8 XML Specifications

XML is defined by a number of related specifications:

• Extensible Markup Language (XML) 1.0
  Defines the syntax of XML. The XML specification is the primary focus of this article.
• XML Pointer Language (XPointer) and XML Linking Language (XLink)
  Define a standard way to represent links between resources. In addition to simple links, like HTML's <A> tag, XML has mechanisms for links between multiple resources and links between read-only resources. XPointer describes how to address a resource; XLink describes how to associate two or more resources.
• Extensible Stylesheet Language (XSL)
  Defines the standard stylesheet language for XML.

For the most part, reading and understanding the XML specifications does not require extensive knowledge of SGML or any of the related technologies.

9 XML Features

Feature: Keeping Data Separated From HTML
Explanation: HTML pages are used to display data. Data is often stored inside HTML pages. With XML this data can be stored in a separate XML file.
Feature: XML Can Store Data Inside HTML Documents
Explanation: XML data can also be stored inside HTML pages as data islands. HTML can still be used for formatting and displaying the data.

Feature: XML Can Be Used to Exchange Data
Explanation: Computer systems and databases contain data in incompatible formats. One of the most time-consuming challenges has been to exchange data between such systems over the Internet. Converting the data to XML can greatly reduce this complexity and create data that can be read by different types of applications.

Feature: XML Can Be Used to Store Data
Explanation: XML can also be used to store data in files or in databases. Applications can be written to store and retrieve information from the store, and generic applications can be used to display the data.

10 XML Architecture

The XML language, XML namespaces, and the DOM are W3C recommendations, the final stage in the W3C development and approval process. Because the specifications are stable, developers can start tagging and exchanging their data in the XML format.

XML offers a solution as the underlying architecture for data in three-tier architectures. XML can be generated from existing databases using a scalable three-tier model. With XML, structured data is maintained separately from the business rules and the display. Data integration, delivery, manipulation, and display are the steps in the underlying process, as summarized in the following diagram.

[Figure: Three-tier Web architecture for flexible Web applications]

11 Data Structure Namespaces

XML namespaces provide developers with the ability to qualify element names in a recognizable manner to avoid conflicts between elements with the same name. Elements referenced in one document can be defined in different schemas on the Web.
Namespaces ensure that element names do not conflict and clarify their origins, but do not determine how to process elements. Parsers must know what elements mean and how to process them. Tags from multiple namespaces can be mixed, which is essential with data coming from multiple sources across the Web. With namespaces, two elements with the same name can exist in the same XML-based document instance but refer back to two different schemas, uniquely qualifying their semantics.

The W3C has released XML namespaces as a recommendation, allowing elements to be subordinate to a URL. This ensures that names remain unambiguous even when chosen by multiple authors. In the same way that anyone can publish their own Web page or view those of others, the namespace facility allows users to define private dictionaries of terms, or use a public namespace of common terms.

Example:

    <orders xmlns:person="http://www.schemas.org/people"
            xmlns:dsig="http://dsig.org">
      <order>
        <sold-to>
          <person:name>
            <person:last-name>Layman</person:last-name>
            <person:first-name>Andrew</person:first-name>
          </person:name>
        </sold-to>
        <sold-on>1997-03-17</sold-on>
        <dsig:digital-signature>1234567890</dsig:digital-signature>
      </order>
    </orders>

This code informs any reader that when a name begins with "dsig:" its meaning is defined by whoever owns the "http://dsig.org" namespace. Similarly, elements beginning with the "person:" prefix have meanings defined by the "http://www.schemas.org/people" namespace.

Namespaces ensure that element names do not conflict, and clarify who defined which term. They do not give instructions on how to process the elements. Readers still need to know what the elements mean and decide how to process them. Namespaces keep the names straight.

12 Data Delivery, Manipulation

Since XML is an open, text-based format, it can be delivered through HTTP in the same way HTML can today.
Data currently on the desktop can be manipulated using the DOM. Agents will also support the ability to generate XML updates, which can be sent in both directions: to inform clients of changes made to data on the middle tier or database server, and vice versa. Consequently, the agents will be able to receive updates from the client and send them to a storage server.

12.1. Parsing XML

The XML parser in the browser can read a string of XML data, process it, generate a structured tree, and expose all data elements as objects using the DOM. The data can then be displayed using a CSS or XSL style sheet, made available for further manipulation by script, or handed off to other applications or objects for further processing. Namespaces, data types, queries, and XSL transformations are supported with extended methods available in the DOM.

12.2. Manipulating and Editing Data Using the Document Object Model

The DOM is an Application Programming Interface (API) that defines a standard way in which developers can interact with the elements of the XML structured tree. The object model controls how users communicate with trees and exposes all tree elements as objects, which can be accessed programmatically without any return trips to the server.

13 Displaying XML-Based Data in HTML

An XML document does not by itself specify whether or how its information should be displayed; the XML data contains the facts. HTML is an ideal display language for presenting this data to an end user. The mechanisms of data binding and style sheets can be used to arrange XML data into a visual presentation, and to add interactivity.

Data binding is an aspect of Dynamic HTML (DHTML) that moves individual items of data from an information source, such as an XML document, into an HTML display, allowing HTML to be used as a template for displaying XML data.
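The parse, manipulate, and display steps described in sections 12 and 13 can be sketched with Python's standard-library `xml.dom.minidom`, used here as a stand-in for the browser's parser and DOM. The sketch reuses the orders document from the namespaces example; the helper names `text_of` and `render_rows` and the HTML table layout are hypothetical, chosen only for illustration, and real DHTML data binding works differently from this hand-written loop.

```python
from html import escape
from xml.dom import minidom

PERSON_NS = "http://www.schemas.org/people"

ORDERS = """<orders xmlns:person="http://www.schemas.org/people"
        xmlns:dsig="http://dsig.org">
  <order>
    <sold-to>
      <person:name>
        <person:last-name>Layman</person:last-name>
        <person:first-name>Andrew</person:first-name>
      </person:name>
    </sold-to>
    <sold-on>1997-03-17</sold-on>
    <dsig:digital-signature>1234567890</dsig:digital-signature>
  </order>
</orders>"""


def text_of(node):
    """Concatenate the text-node children of a DOM element."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def render_rows(doc):
    """Place each <order> into one row of a hypothetical HTML table."""
    rows = []
    for order in doc.getElementsByTagName("order"):
        # Namespace-aware lookup: match on namespace URI plus local name.
        last = order.getElementsByTagNameNS(PERSON_NS, "last-name")[0]
        first = order.getElementsByTagNameNS(PERSON_NS, "first-name")[0]
        sold_on = order.getElementsByTagName("sold-on")[0]
        name = escape(text_of(first) + " " + text_of(last))
        rows.append(f"<tr><td>{name}</td><td>{escape(text_of(sold_on))}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"


# Parsing: the XML string becomes a tree of objects.
doc = minidom.parseString(ORDERS)

# Manipulation: edit the sale date in memory, with no trip to a server.
doc.getElementsByTagName("sold-on")[0].firstChild.data = "1997-03-18"

# Display: HTML serves as the template that receives the data.
page = render_rows(doc)
print(page)
```

The XML holds only the facts; the choice of a table, the column order, and the combined name cell all live in the display code, so the same data could be shown differently without being changed.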
14 XSL: Extensible Stylesheet Language

XSL stands for Extensible Stylesheet Language. An XSL style sheet contains instructions for how to pull information out of an XML document and transform it into another format, such as HTML.

The transformation of XML into other formats is done in a declarative way, which often makes it easier and more accessible than scripting. XSL uses XML as its syntax, freeing XML authors from having to learn another markup language.

CSS can still be used for structured XML data. However, CSS does not provide a display structure that deviates from the structure of the data source. With XSL, it is possible to generate presentation structures that are very different from the original XML data structures.

15 Augmenting HTML

Adding semantic information to HTML pages is not easy. Historically, various programs have attempted to deal with this problem by using nonstandard tricks, such as hiding data inside HTML comments. These comments can be awkward and, unlike XML, are not exposed to the object model.

The W3C has defined a format for putting XML-based data inside HTML pages. Extending HTML through the use of data islands will allow a wide range of applications to use HTML as the primary document or display format and also use XML embedded within these documents to hold data. An HTML page could therefore include, among other things, specific data about the subject of the page.
16 XML: Transforming and Querying

With the utilization of XML as a standard way of interchanging data on the Web, there will be a need for mechanisms to query XML, shape the extracted data (including sorting and filtering), and transform one XML grammar into another. XSL, together with the XSL Pattern language that is part of it, provides a measure of this capability.

XSL Patterns are a simple and concise syntax for identifying nodes in an XML document, based on the node's type, name, content, and context in relation to other nodes in the tree.

XSL provides a grammar in which the results of XSL Pattern queries are associated with templates that describe how data in the XML source document is materialized as a new XML document. While this forms the basis for transforming data to display formats such as HTML, any XML grammar can be output, providing for sorting and filtering within a single XML grammar, or translating data from one schema to another.
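The query, filter, sort, and transform sequence described above can be sketched in Python. This sketch does not use XSL: the Python standard library has no XSL processor, so the path expressions supported by `xml.etree.ElementTree` stand in for XSL Patterns, and ordinary code stands in for XSL templates. The catalog and price-list vocabularies and the `books_by_price` function are hypothetical, invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source grammar: a catalog of items.
SOURCE = """<catalog>
  <item category="book"><title>XML Basics</title><price>30</price></item>
  <item category="disk"><title>Clip Art</title><price>15</price></item>
  <item category="book"><title>DOM Guide</title><price>25</price></item>
</catalog>"""


def books_by_price(source):
    """Select the book items, sort them by price, and emit a new grammar."""
    root = ET.fromstring(source)

    # Query and filter: a pattern on the node's name and an attribute value.
    books = root.findall("./item[@category='book']")

    # Sort the selected nodes by their numeric price.
    books.sort(key=lambda item: float(item.findtext("price")))

    # Transform: materialize the result in a different (hypothetical) grammar.
    out = ET.Element("price-list")
    for item in books:
        entry = ET.SubElement(out, "entry", cost=item.findtext("price"))
        entry.text = item.findtext("title")
    return ET.tostring(out, encoding="unicode")


print(books_by_price(SOURCE))
```

In XSL the same result would be expressed declaratively, with a pattern selecting the book items and a template describing each output entry, rather than with explicit loops.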