Chapter 1 - Sys-Ed

Chapter 1: Getting Started
SYS-ED/COMPUTER EDUCATION TECHNIQUES, INC.
XML: Extensible Markup Language
Getting Started
Objectives
You will learn:
- SGML and SGML document components.
- XML: What is it.
- XML as compared to SGML and HTML.
- XML format.
- XML specifications.
- XML architecture.
- Data structure namespaces.
- Data delivery, manipulation.
- Parsing XML.
- Manipulating and editing data using the Document Object Model.
- Displaying XML-based data in HTML.
- XSL: Extensible Stylesheet Language.
- XML: Transforming and querying.
SYS-ED®\COMPUTER EDUCATION TECHNIQUES, INC. (XML: 1.01)
1. SGML
SGML stands for the Standard Generalized Markup Language; it is the method of marking up parts of a
document with tags.
SGML is a language used for the definition of device and system independent methods of representing
electronic texts. SGML is used for describing the terms and rules for marking up texts in a particular
application.
By design, SGML is flexible in order to utilize descriptive mark-up, which allows different processing
applications to operate on the same text. It also recognizes different document types, which allow parser
programs to verify that each uniquely marked up document is correctly tagged and organized.
SGML uses ASCII characters, rather than specialized character sets. The use of ASCII serves to prevent
incompatibility between programming language and hardware or software environments, and is
recognizable by any user regardless of programming experience. It is also possible to create non-ASCII
characters with certain SGML tags.
By defining the elements that mark up a text, and the attributes that qualify and describe those elements,
SGML-tagged documents can be organized into indexed databases. This means that texts can be
retrieved according to user-specified parameters, and that individual parts of separate texts can also be
accessed.
1.1. SGML Document Components
Each SGML document consists of an SGML prolog and a document instance.
SGML Prolog
The prolog contains an SGML declaration and a document type definition (DTD).
- The SGML declaration is made up of character sets and codes specific to the version of SGML
  being used. It is typically kept in compiled tables by the SGML processor and will not normally be
  seen by the user.
- The DTD contains the list of elements, attributes, entities and other declarations by which the tags
  that mark up the text are distinguished. The DTD may exist separately from the texts that refer to
  it, or may be contained within the texts.
Document Instance
The Document Instance is the text itself. An important distinction between the Document Instance and the DTD is
that the Document Instance cannot contain any tag definitions. Any tag within a text that is not defined by the
text's DTD will not be recognized.
2. SGML: How it is Used
SGML can be used for creating, validating and processing electronic documents.
Most applications rely on a piece of software called a parser, which takes a DTD and uses it to verify
whether a marked up text conforms to that DTD. Many parsers also produce a new version of the
document in which all tags have been translated according to the DTD specifications.
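The checking step can be illustrated with a deliberately small sketch. It is not an SGML parser: Python's standard library has none, so the example assumes XML syntax for both the DTD and the document, and it only checks that every tag used in the document instance is declared in the DTD. The memo, to, body and cc names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD for a tiny memo document type (names are illustrative).
MEMO_DTD = """
<!ELEMENT memo (to, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT body (#PCDATA)>
"""

def declared_elements(dtd_text):
    """Return the set of element names declared in the DTD text."""
    return set(re.findall(r"<!ELEMENT\s+([\w.-]+)", dtd_text))

def undeclared_tags(document_text, dtd_text):
    """List tags used in the document instance that the DTD never declares."""
    declared = declared_elements(dtd_text)
    root = ET.fromstring(document_text)
    return sorted({el.tag for el in root.iter() if el.tag not in declared})

good = "<memo><to>Staff</to><body>Meeting at noon.</body></memo>"
bad = "<memo><to>Staff</to><cc>Archive</cc><body>Meeting at noon.</body></memo>"

print(undeclared_tags(good, MEMO_DTD))  # no undeclared tags
print(undeclared_tags(bad, MEMO_DTD))   # reports the undeclared <cc> tag
```

A real validating parser also enforces the content models, such as the order of to and body inside memo. This toy check looks at names only.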
These new versions of marked up text can be used in conjunction with other software packages in a
variety of ways, such as: structured editors, which can determine for users when and where different
word-processing tags should be inserted automatically; formatters, which produce typographically distinct
printed versions of documents; and text-based database systems, in which users can search an archive of
DTD-tagged documents for occurrences of words or word patterns.
3. HTML
HTML stands for HyperText Markup Language and it is the standard for publishing hypertext on the World
Wide Web. It is a non-proprietary format based upon SGML, and can be created and processed by simple
plain text editors or sophisticated WYSIWYG authoring tools.
HTML uses tags to structure text into headings, paragraphs, lists, hypertext links, etc.
HTML gives authors the means to:
- Publish online documents with headings, text, tables, lists, photos, etc.
- Retrieve online information via hypertext links, at the click of a button.
- Design forms for conducting transactions with remote services, for use in searching for
  information, making reservations, ordering products, etc.
- Include spreadsheets, video clips, sound clips, and other applications directly in their documents.
3.1. HTML: Evolution
HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic
browser developed at NCSA. During the 1990s its use expanded with the explosive growth of the
World Wide Web.
HTML has been extended in a number of ways. The Web depends on Web page authors and vendors
sharing the same conventions for HTML. It is important that HTML documents work well across different
browsers and platforms. Naturally, achieving interoperability lowers costs to content providers since they
must develop only one version of a document.
Each version of HTML has attempted to reflect greater consensus among industry players in order that the
investment made by content providers will not be wasted and that their documents will not become
unreadable in a short period of time.
HTML 4
HTML 4 extends HTML with mechanisms for style sheets, scripting, frames, embedding objects, improved
support for right to left and mixed direction text, richer tables, and enhancements to forms, offering
improved accessibility for people with disabilities.
HTML 4.01 is a revision of HTML 4.0 that corrects errors and makes some changes since the previous
revision.
4. XML: What is it
XML stands for Extensible Markup Language; it is a markup language for documents containing structured
information. All documents have some structure. The structured information contains both content such
as words, pictures, etc. and some indication as to the role that content plays.
A markup language is a mechanism for identifying structures in a document. The XML specification
defines a standard way for adding markup to documents.
4.1. XML versus SGML
The Extensible Markup Language shares fundamental goals and methods with SGML, but it is in fact a
restricted subset of SGML. This balance between generality and simplicity is fundamental to XML. It is
driven largely by the demand for generalized markup on the web and by the aim of making XML software
easier to produce.
SGML users can use XML as a replacement for SGML, but in many situations will use it as a distribution
medium for displaying documents on the World Wide Web. Prior to XML this typically meant converting
documents to the HyperText Markup Language. The shortcoming of this approach was the loss of the
structural information in SGML documents, which rendered them nearly useless for high-quality formatting
and processing.
The advantage to SGML users in using XML instead of HTML is the retention of the structural integrity of
the documents.
4.2. XML versus HTML
In HTML, both the tag semantics and the tag set are fixed. The W3C, in conjunction with browser vendors
and the WWW community, is constantly working to extend the definition of HTML to allow new tags to
keep pace with changing technology and to bring variations in presentation to the Web.
XML specifies neither semantics nor a tag set; XML is really a meta-language for describing markup
languages. XML provides a facility to define tags and the structural relationships between them.
Since there is no predefined tag set, there cannot be any preconceived semantics. All of the semantics of
an XML document are defined either by the applications that process it or by stylesheets.
5. XML: Purpose and Function
XML was created so that richly structured documents can be used over the web. The only existing
alternatives, HTML and SGML, are not practical for this purpose.
HTML comes bound with a set of semantics and does not provide arbitrary structure.
SGML provides arbitrary structure, but is too difficult to implement just for a web browser. Full SGML
systems solve large, complex problems that justify their expense. Viewing structured documents sent
over the web rarely carries such justification.
It is not expected that XML will completely replace SGML. In many organizations, filtering SGML to XML
will be the standard procedure for web delivery.
6. XML Format
An XML document is usually much longer than the same data in a compact format, but the extra length
brings several benefits:
- The data is self-describing.
- The data can be manipulated with standard tools.
- The data can be viewed with standard tools.
- Different views of the same data can be created with style sheets.
The first major benefit of the XML format is that the data is self-describing. The meaning of each value
is given by the tag that surrounds it, so it stays clearly associated with the value itself.
The second benefit to providing the data in XML is that it enables the data to be manipulated in a wide
range of XML-enabled tools. The data probably will be bigger, but the extra redundancy allows more tools
to process it.
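As a small illustration of these two benefits, the sketch below stores the same reading once as a compact positional row and once as XML, then reads the XML version with a general-purpose parser from the Python standard library. The reading, temperature and humidity names and the sample values are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# The same reading in a compact positional format and in XML.
CSV_ROW = "KJFK,21.5,40"
XML_RECORD = """
<reading station="KJFK">
  <temperature unit="celsius">21.5</temperature>
  <humidity unit="percent">40</humidity>
</reading>
"""

def load_reading(xml_text):
    """Map each measurement name to a (value, unit) pair."""
    root = ET.fromstring(xml_text)
    return {child.tag: (float(child.text), child.get("unit")) for child in root}

reading = load_reading(XML_RECORD)
print(reading["temperature"])                  # value and unit travel together
print(len(XML_RECORD.strip()) > len(CSV_ROW))  # the XML form is longer
```

The positional row only makes sense to a program that already knows which column is which. The XML record carries that knowledge in its own tags, at the cost of extra size.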
In some ways, XML is just another data format. However, XML has several advantages over other
formats when storing information.
- XML allows developers to create their own labeled structures for storing information.
- XML parsing is well-defined and widely implemented, making it possible to retrieve information
  from XML documents in a variety of environments.
- XML is built on a Unicode foundation, making it easier to create internationalized documents.
- Applications can rely on XML parsers to do some structural validation, as well as data type
  checking when schemas are used.
- XML formats are text-based, making them more readable, easier to document, and sometimes
  easier to debug.
- Tools are available for XML processing on different platforms, making it simpler to use XML
  instead of binary formats to exchange complex information streams.
- XML documents can use much of the infrastructure already built for HTML, including the HTTP
  protocol and some browsers.
7. XML Development Goals
The XML specification sets out the following goals for XML:
- It shall be straightforward to use XML over the Internet. Users must be able to view XML
  documents as quickly and easily as HTML documents.
- XML shall support a wide variety of applications: authoring, browsing, content analysis, etc.
  Although the initial focus is on serving structured documents over the web, it is not meant to
  narrowly define XML.
- XML shall be compatible with SGML. XML has been designed to be compatible with existing
  standards while solving the relatively new problem of sending richly structured documents over
  the web.
- It shall be easy to write programs that process XML documents.
- The number of optional features in XML is to be kept to an absolute minimum. Optional features
  inevitably raise compatibility problems when users want to share documents and sometimes lead
  to confusion and frustration.
- XML documents should be human-legible and reasonably clear.
- The XML design should be prepared quickly; standards efforts are notoriously slow.
- The design of XML shall be formal and concise.
- XML documents shall be easy to create. There will eventually be sophisticated editors for creating
  and editing XML content. In the interim, it must be possible to create XML documents in other
  ways, such as directly in a text editor, with simple shell and Perl scripts, etc.
Several SGML language features were designed to minimize the amount of typing required to manually
key in SGML documents. These features are not supported in XML. Most modern editors offer better
facilities to define shortcuts when entering text.
8. XML Specifications
XML is defined by a number of related specifications:
- Extensible Markup Language (XML) 1.0
  Defines the syntax of XML. The XML specification is the primary focus of this article.
- XML Pointer Language (XPointer) and XML Linking Language (XLink)
  Define a standard way to represent links between resources. In addition to simple links, like
  HTML's <A> tag, XML has mechanisms for links between multiple resources and links between
  read-only resources. XPointer describes how to address a resource; XLink describes how to
  associate two or more resources.
- Extensible Stylesheet Language (XSL)
  Defines the standard stylesheet language for XML.
For the most part, reading and understanding the XML specifications does not require extensive
knowledge of SGML or any of the related technologies.
9. XML Features
Feature: Keeping Data Separated From HTML
Explanation: HTML pages are used to display data. Data is often stored inside HTML pages. With XML
this data can be stored in a separate XML file.

Feature: XML Can Store Data Inside HTML Documents
Explanation: XML data can also be stored inside HTML pages as data islands. HTML can still be used
for formatting and displaying the data.

Feature: XML Can Be Used to Exchange Data
Explanation: Computer systems and databases contain data in incompatible formats. One of the most
time consuming challenges has been to exchange data between such systems over the Internet.
Converting the data to XML can greatly reduce this complexity and create data that can be read by
different types of applications.

Feature: XML Can Be Used to Store Data
Explanation: XML can also be used to store data in files or in databases. Applications can be written
to store and retrieve information from the store, and generic applications can be used to display
the data.
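The exchange and storage features can be sketched as a round trip: one system writes its records out as XML text, and another reads that text back into its own structures. This is only an illustration using Python's standard library. The customers, customer, id and name names are hypothetical, and a real exchange format would normally be agreed on in a DTD or schema.

```python
import xml.etree.ElementTree as ET

def records_to_xml(records, root_tag="customers", row_tag="customer"):
    """Serialize a list of flat dictionaries as one XML document string."""
    root = ET.Element(root_tag)
    for record in records:
        row = ET.SubElement(root, row_tag)
        for field, value in record.items():
            ET.SubElement(row, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

def xml_to_records(xml_text):
    """Read the document back; every value arrives as text."""
    root = ET.fromstring(xml_text)
    return [{field.tag: field.text for field in row} for row in root]

source = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
wire = records_to_xml(source)   # the text that would be sent or stored
print(wire)
print(xml_to_records(wire))
```

Note that the numeric id comes back as a string. Plain XML carries text only, which is why data type checking is left to schemas or to the receiving application.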
10. XML Architecture
The XML language, XML namespaces, and the DOM are W3C Recommendations, the final stage in the
W3C development and approval process.
As a result of stable specifications, developers can start tagging and exchanging their data in the XML
format. XML offers a solution as the underlying architecture for data in three-tier architectures.
XML can be generated from existing databases using a scalable three-tier model. With XML, structured
data is maintained separately from the business rules and the display. Data integration, delivery,
manipulation, and display are the steps in the underlying process as summarized in the following diagram.
[Figure: Three-tier Web architecture for flexible Web applications]
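The middle-tier step of generating XML from an existing database might look like the following sketch. It assumes an in-memory SQLite database as a stand-in for the data tier, and the orders table, its columns and the sample rows are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

def orders_as_xml(connection):
    """Middle-tier step: turn database rows into an XML document string."""
    root = ET.Element("orders")
    cursor = connection.execute(
        "SELECT id, customer, total FROM orders ORDER BY id")
    for order_id, customer, total in cursor:
        order = ET.SubElement(root, "order", id=str(order_id))
        ET.SubElement(order, "customer").text = customer
        ET.SubElement(order, "total").text = f"{total:.2f}"
    return ET.tostring(root, encoding="unicode")

# Data tier: an in-memory database stands in for an existing system.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
               [("Layman", 19.99), ("Smith", 5.0)])

# The resulting text is what would be delivered to the client tier.
print(orders_as_xml(db))
```

The function contains no display logic and no business rules. It only reshapes structured data, which is the separation the three-tier model relies on.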
11. Data Structure Namespaces
XML namespaces provide developers with the ability to qualify element names in a recognizable manner
to avoid conflicts between elements with the same name.
Elements referenced in one document can be defined in different schemas on the Web. Namespaces
ensure that element names do not conflict and clarify their origins, but do not determine how to process
elements. Parsers must know what elements mean and how to process them.
Tags from multiple namespaces can be mixed, which is essential with data coming from multiple sources
across the Web. With namespaces, two elements that share a name could exist in the same XML-based
document instance but refer back to two different schemas, uniquely qualifying their semantics.
The W3C has released XML namespaces as a recommendation, allowing element names to be qualified by a
URI. This ensures that names remain unambiguous even when chosen by multiple authors. In the same
way anyone can publish their own Web page or view those of others, the namespace facility allows users
to define private dictionaries of terms, or use a public namespace of common terms.
Example:
<orders xmlns:person="http://www.schemas.org/people"
        xmlns:dsig="http://dsig.org">
  <order>
    <sold-to>
      <person:name>
        <person:last-name>Layman</person:last-name>
        <person:first-name>Andrew</person:first-name>
      </person:name>
    </sold-to>
    <sold-on>1997-03-17</sold-on>
    <dsig:digital-signature>1234567890</dsig:digital-signature>
  </order>
</orders>
This code informs any reader that when a name begins with "dsig:" its meaning is defined by whoever
owns the "http://dsig.org" namespace.
Similarly, elements beginning with the "person:" prefix have meanings defined by the
"http://www.schemas.org/people" namespace.
Namespaces ensure that element names do not conflict, and clarify who defined which term. They do not
give instructions on how to process the elements. Readers still need to know what the elements mean and
decide how to process them. Namespaces keep the names straight.
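A program reading the example document can be sketched as follows, using the ElementTree module from the Python standard library. The namespace URIs are copied from the xmlns declarations in the example. The short prefixes p and sig are chosen by the reading program and are an assumption of this sketch; they deliberately differ from the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

ORDERS = """
<orders xmlns:person="http://www.schemas.org/people"
        xmlns:dsig="http://dsig.org">
  <order>
    <sold-to>
      <person:name>
        <person:last-name>Layman</person:last-name>
        <person:first-name>Andrew</person:first-name>
      </person:name>
    </sold-to>
    <sold-on>1997-03-17</sold-on>
    <dsig:digital-signature>1234567890</dsig:digital-signature>
  </order>
</orders>
"""

# Prefixes used by this program; only the URIs have to match the document.
NS = {"p": "http://www.schemas.org/people", "sig": "http://dsig.org"}

root = ET.fromstring(ORDERS)
last_name = root.find("./order/sold-to/p:name/p:last-name", NS)
signature = root.find(".//sig:digital-signature", NS)

print(last_name.tag)                    # the parser stores the URI, not the prefix
print(last_name.text, signature.text)
```

The lookup succeeds even though the program never uses the person: or dsig: prefixes. A prefix is only a local abbreviation; the URI is what identifies the namespace.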
12. Data Delivery, Manipulation
Since XML is an open, text-based format, it can be delivered through HTTP in the same way HTML can
today. Data currently on the desktop can be manipulated using the DOM. Agents will also support the
ability to generate XML updates, which can be sent in both directions to inform clients of changes made to
data on the middle tier or database server and vice versa.
Consequently, the agents will be able to receive updates from the client and send them to a storage
server.
12.1. Parsing XML
The XML parser in the browser can read a string of XML data, process it, generate a structured tree, and
expose all data elements as objects using the DOM.
The browser can then display this data using a CSS or XSL style sheet, make the data available for
further manipulation by script, or hand it off to other applications or objects for further processing.
Namespaces, data types, queries, and XSL transformations are supported with extended methods
available in the DOM.
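The same sequence (read a string of XML, build a tree, expose the elements as objects) can be tried outside a browser. The sketch below uses minidom, the small DOM implementation in the Python standard library, as a stand-in for the browser's parser. The library, book and title names are hypothetical.

```python
from xml.dom import minidom

XML_TEXT = (
    "<library>"
    "<book id='b1'><title>XML Basics</title></book>"
    "<book id='b2'><title>DOM in Practice</title></book>"
    "</library>"
)

def book_titles(xml_text):
    """Parse the string into a DOM tree and collect (id, title) pairs."""
    document = minidom.parseString(xml_text)
    titles = []
    for book in document.getElementsByTagName("book"):
        title_node = book.getElementsByTagName("title")[0]
        # The text of an element is itself a child node of that element.
        titles.append((book.getAttribute("id"), title_node.firstChild.data))
    return titles

print(book_titles(XML_TEXT))
```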
12.2. Manipulating and Editing Data Using the Document Object Model
The DOM is an Application Programming Interface (API) that defines a standard way in which developers can
interact with the elements of the XML structured tree. The object model controls how users communicate
with trees and exposes all tree elements as objects, which can be accessed programmatically without any
return trips to the server.
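Editing through the DOM can be sketched in the same way. The example below again assumes Python's minidom as the DOM implementation and an invented order document. It changes existing text, adds an element and sets an attribute, all in memory.

```python
from xml.dom import minidom

def edit_order(xml_text):
    """Parse an order, change it in memory, and return the edited root element."""
    document = minidom.parseString(xml_text)
    root = document.documentElement

    # Replace the text of the first <item> element.
    item = root.getElementsByTagName("item")[0]
    item.firstChild.data = "pencil"

    # Create a new <note> element and attach it to the tree.
    note = document.createElement("note")
    note.appendChild(document.createTextNode("gift wrap"))
    root.appendChild(note)

    # Attributes are edited through the same object model.
    root.setAttribute("status", "edited")
    return root

edited = edit_order("<order><item>pen</item></order>")
print(edited.toxml())
```

Every change is made on the tree held by the program. Nothing is written back or sent anywhere until the program chooses to serialize the tree, here with toxml().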
13. Displaying XML-Based Data in HTML
An XML document does not by itself specify whether or how its information should be displayed; the XML
data contains the facts. HTML is an ideal display language for presenting this data to an end user.
The mechanisms of data binding and style sheets can be used to arrange XML data into a visual
presentation, and to add interactivity. Data binding is an aspect of Dynamic HTML (DHTML) that moves
individual items of data from an information source such as an XML document into an HTML display,
allowing HTML to be used as a template for displaying XML data.
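The idea of HTML acting as a template for XML data can be approximated without a browser. The sketch below is not DHTML data binding; it is a rough analogue in Python, in which a hypothetical row template is filled once for each item element of an invented catalog document.

```python
import html
import xml.etree.ElementTree as ET
from string import Template

# The HTML "template": one table row with two placeholders.
ROW_TEMPLATE = Template("<tr><td>$name</td><td>$price</td></tr>")

def render_table(xml_text):
    """Fill the row template once per <item> and wrap the rows in a table."""
    root = ET.fromstring(xml_text)
    rows = [
        ROW_TEMPLATE.substitute(
            name=html.escape(item.findtext("name", default="")),
            price=html.escape(item.findtext("price", default="")),
        )
        for item in root.findall("item")
    ]
    return "<table>" + "".join(rows) + "</table>"

CATALOG = (
    "<catalog>"
    "<item><name>Pen &amp; ink</name><price>2.50</price></item>"
    "<item><name>Paper</name><price>4.00</price></item>"
    "</catalog>"
)
print(render_table(CATALOG))
```

The XML holds only the facts, and the template decides how they look. Changing the presentation means editing the template, not the data.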
14. XSL: Extensible Stylesheet Language
XSL stands for Extensible Stylesheet Language. An XSL style sheet contains instructions for how to
pull information out of an XML document and transform it into another format, such as HTML. The
transformation of XML into other formats is done in a declarative way, which often makes it easier and
more accessible than scripting.
XSL uses XML as its syntax, freeing XML authors from having to learn another markup language.
CSS can still be used for structured XML data. However, CSS does not provide a display structure that
deviates from the structure of the data source. With XSL, it is possible to generate presentation structures
that are very different from the original XML data structures.
15. Augmenting HTML
Adding semantic information to HTML pages is not easy. Historically, various programs have attempted to
deal with this problem by using nonstandard tricks, such as hiding data inside HTML comments.
These comments can be awkward and, unlike XML, are not exposed to the object model.
The W3C has defined a format for putting XML-based data inside HTML pages. Extending HTML through
the use of data islands will allow a wide range of applications to use HTML as the primary document or
display format and also use XML embedded within these documents to hold data.
An HTML page could therefore include, among other things, specific data about the subject of the page.
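How an application might pull such embedded data out of a page can be sketched as follows. The example assumes that a data island is written as an xml element with an id attribute inside the HTML, which is the convention Internet Explorer used; the page content and the orderData identifier are invented. A regular expression is enough for this illustration but would not be robust against arbitrary real-world HTML.

```python
import re
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<h1>Order status</h1>
<xml id="orderData">
  <order><sold-on>1997-03-17</sold-on><status>shipped</status></order>
</xml>
</body></html>"""

# Assumed island syntax: <xml ...> ... </xml> wrapped around one XML document.
ISLAND = re.compile(r"<xml\b[^>]*>(.*?)</xml>", re.IGNORECASE | re.DOTALL)

def extract_islands(page_text):
    """Return the root element of every data island found in the page."""
    return [ET.fromstring(body.strip()) for body in ISLAND.findall(page_text)]

islands = extract_islands(PAGE)
print(islands[0].tag, islands[0].findtext("status"))
```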
16. XML: Transforming and Querying
With the utilization of XML as a standard way of interchanging data on the Web, there will be a need for
mechanisms to query XML, to shape extracted data (including sorting and filtering), and to transform one
XML grammar into another. XSL and the XSL Pattern language that is part of XSL provide a measure of
this capability.
XSL Patterns are a simple and concise syntax for identifying nodes in an XML document, based on the
node's type, name, content, and context in relation to other nodes in the tree.
XSL provides a grammar in which the results of XSL Pattern queries are associated with templates to
describe the materialization of data in the XML source document as a new XML document. While this
forms the basis for transforming data to display formats such as HTML, any XML grammar can be output,
providing for sorting and filtering within a single XML grammar, or translating data from one schema to
another.
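The query, sort, filter and transform steps can be made concrete with a short sketch. It does not use XSL or XSL Patterns: Python's standard library has no XSL processor, so the example swaps in the limited XPath subset of ElementTree for the query and ordinary code for the template step. The orders, report and line grammars are both invented.

```python
import xml.etree.ElementTree as ET

SOURCE = """
<orders>
  <order region="east"><customer>Layman</customer><total>120.00</total></order>
  <order region="west"><customer>Smith</customer><total>75.50</total></order>
  <order region="east"><customer>Jones</customer><total>310.25</total></order>
</orders>
"""

def east_report(xml_text):
    """Filter, sort, and re-express orders in a different XML grammar."""
    root = ET.fromstring(xml_text)

    # Query and filter: select nodes by name and by attribute value.
    matches = root.findall("./order[@region='east']")

    # Sort: largest total first.
    matches.sort(key=lambda order: float(order.findtext("total")), reverse=True)

    # Transform: emit a <report> of <line> elements instead of <order> elements.
    report = ET.Element("report")
    for order in matches:
        ET.SubElement(report, "line",
                      name=order.findtext("customer"),
                      amount=order.findtext("total"))
    return report

report = east_report(SOURCE)
print(ET.tostring(report, encoding="unicode"))
```

In XSL the selection would be written as a pattern and the output as a template, so the whole transformation stays declarative. The procedural version above only shows what such a style sheet has to accomplish.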