WORLD METEOROLOGICAL ORGANIZATION
COMMISSION FOR BASIC SYSTEMS
OPAG ON INFORMATION SYSTEMS & SERVICES
EXPERT TEAM ON ASSESSMENT OF DATA REPRESENTATION SYSTEMS
FIRST MEETING
SILVER SPRING, MD, USA, 23 TO 25 APRIL 2008

ET-ADRS-1/ Doc. 2.5(2)
(14.04.2008)
ITEM 2.5
ENGLISH only

REVIEW OF DATA REPRESENTATION SYSTEMS

Interoperability Standards at Levels of Syntax and Semantics

(Submitted by the Secretariat)

Summary and purpose of document

The document discusses the need to distinguish between interoperability standards that operate at the level of syntax (e.g., ASN.1, Abstract Syntax Notation One, and XML, eXtensible Markup Language) and standards that address semantics (e.g., ISO/IEC 11179, Metadata Registries).

ACTION PROPOSED

The ET-ADRS is invited to consider:

(a) Syntactic interoperability should be achieved using automated transformations based on precise structural and data type definitions as expressed through syntax description languages such as ASN.1 specifications and XML schemas, and standards in these areas should be promoted through WMO policies and procedures;

(b) In the absence of a semantic relationship between elements of interest (i.e., precise definitions of the data or information elements), interoperability at the syntactic level cannot assure that the resulting integrated or transformed information is meaningful;

(c) The standard ISO/IEC 11179, Metadata Registries, provides useful guidance for documenting the meanings of data or information elements, thereby allowing schema elements to be related to each other within and among data dictionaries;

(d) A policy promoting broader WMO use of guidance available in ISO/IEC 11179 would align well with existing WMO procedures, and would complement broader use of the ISO 191xx series of standards, especially ISO 19115, which uses ISO/IEC 11179 in defining metadata elements.
Interoperability Standards at Levels of Syntax and Semantics

1. Distinguishing Syntax from Semantics

1.1. Semantics deals with the meaning of a symbol in some language, whereas syntax deals with the handling of symbols independent of their meaning. Information and communications technology has always supported many diverse automated mechanisms for manipulating the syntax of data and information. Semantics, on the other hand, has mostly required the active involvement of human analysts.

1.2. An example of the distinction between semantics and syntax is seen in the architecture section of the GEOSS 10-Year Implementation Plan Reference Document [1]:

    Systems interoperating in GEOSS agree to avoid non-standard data syntaxes in favor of well-known and precisely defined syntaxes for data traversing system interfaces. The international standard ASN.1 (Abstract Syntax Notation) and the industry standard XML (Extensible Markup Language) are examples of robust and generalized data syntaxes, and these are themselves inter-convertible. It is also important to register the semantics of shared data elements so that any system designer can determine in a precise way the exact meaning of data occurring at service interfaces between components. The standard ISO/IEC 11179, Information Technology--Metadata Registries, provides guidance on representing data semantics in a common registry.

[1] GEOSS (the Global Earth Observation System of Systems) is directed by the Group on Earth Observations, an international partnership (see http://earthobservations.org/ ).

1.3. Global organizations such as WMO require policies and procedures that accommodate wide diversity in data and information syntax. Open standards mechanisms for syntax transformation should be promoted, including the transformation between ASN.1 and XML described herein. WMO policies and procedures should also address the needs of analysts seeking semantic interoperability in specific applications. As explained below, it is important to distinguish between interoperability standards that address syntax (e.g., ASN.1, XML) and standards that address semantics (e.g., ISO/IEC 11179).

2. Standards-based Interoperability at the Level of Syntax

2.1. Automated mechanisms for manipulating the syntax of data and information include ways to define structure (e.g., elements, element groups, records) as well as data type (e.g., numbers, dates, text). There are various conventions for associating such structure and type descriptions with particular instances of data, including external descriptions as well as internal descriptions (sometimes called "self-describing file formats"). There are also various methods for conveying structural information within a data instance, including start-stop tag "mark-up" approaches and "start position, length" approaches. Some mechanisms can distinguish individual bits as data elements, whereas others have whole characters as their smallest distinguishable data element.

2.2. Examples of the mechanism characteristics described above can be seen with ASN.1 and XML. Both use external descriptions to associate a given data instance with its syntax definitions; ASN.1 calls this syntax description a "specification", whereas XML calls it a "schema". ASN.1 encodings typically convey structural information through length-prefixed fields (the "start position, length" approach), whereas XML uses start-stop mark-up for the same purpose. ASN.1 can describe data elements to the bit level, whereas XML is character-oriented (with Unicode as its character repertoire).

2.3. Although information and communications technology has diverse mechanisms for manipulating syntax, the interoperability challenge here is fairly straightforward.
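The two structural conventions contrasted in 2.1 and 2.2 can be illustrated with a minimal Python sketch. It is not ASN.1: the field names, tag numbers, and the simplified tag-length-value layout are assumptions made only for illustration, and real encoding rules are considerably richer. The same hypothetical observation is written once with start-stop mark-up and once as length-prefixed binary fields whose meaning comes from an external description.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical observation record (illustrative names and values only).
RECORD = {"station": "72403", "temp_tenths_c": 215, "pressure_hpa": 1013}

# External description shared by sender and receiver (an assumption):
# tag number -> (field name, kind of value).
SPEC = {1: ("station", "text"),
        2: ("temp_tenths_c", "int16"),
        3: ("pressure_hpa", "int16")}


def to_markup(record):
    """Start-stop mark-up: structure is carried by named tags in the text."""
    root = ET.Element("obs")
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode").encode("utf-8")


def to_tlv(record):
    """Length-prefixed fields: one tag byte, one length byte, then content."""
    out = bytearray()
    for tag, (name, kind) in SPEC.items():
        if kind == "text":
            payload = record[name].encode("ascii")
        else:
            payload = struct.pack(">h", record[name])  # 16-bit big-endian
        out += bytes([tag, len(payload)]) + payload
    return bytes(out)


def from_tlv(data):
    """Walk the byte string field by field using the length prefixes."""
    record, pos = {}, 0
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        payload = data[pos + 2:pos + 2 + length]
        name, kind = SPEC[tag]
        if kind == "text":
            record[name] = payload.decode("ascii")
        else:
            record[name] = struct.unpack(">h", payload)[0]
        pos += 2 + length
    return record


markup = to_markup(RECORD)
tlv = to_tlv(RECORD)
print(len(markup), "bytes as mark-up;", len(tlv), "bytes as length-prefixed fields")
```

In this sketch the mark-up form names every element inside the message itself, while the length-prefixed form relies on the shared description to give the tag numbers their names and types.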
In general, it is not necessary that interoperating systems come to a joint decision on a single syntax. Rather, it is only necessary that among the syntaxes in use there exists a standardized transformation with acceptable "lossiness". For example, two systems can have syntactic interoperability even though one describes syntax with ASN.1 specifications while the other uses XML schemas. This is the case between ASN.1 and XML because a standardized transformation exists: the XML Encoding Rules.

2.4. An ASN.1 specification defines what elements of data are allowed, working up from single bits. It is likely that ASN.1 could be used to describe currently used WMO formats such as BUFR and NetCDF, as it was designed to describe just about every type of data as implemented on every type of computing device, including all types of binary data. Of course, one could design an XML representation of formats such as BUFR and NetCDF without first constructing an ASN.1 specification. But the resulting character-based XML messages would be much larger (perhaps several times larger or more). This is not only because ASN.1 is binary-oriented whereas XML is character-oriented. Efficiency comes primarily from the ASN.1 techniques for encoding abstract specifications into real messages.

2.5. In ASN.1, the concrete syntax ("on-the-wire encoding") is provided separately from the abstract syntax. The concrete syntax is realized through standard "encoding rules", which are applied to the actual message instances exchanged over a communications session. Commonly used ASN.1 encoding rules are: Basic Encoding Rules (BER), which encode each value as a tag, a length, and the content bytes; Distinguished Encoding Rules (DER), a restricted form of BER that allows exactly one encoding for each value; Packed Encoding Rules (PER), which use the constraints stated in the specification to omit tags and minimize the number of bits transmitted; and XML Encoding Rules (XER), which translate between ASN.1 elements and their XML equivalents.

2.6. Given an ASN.1 specification, an XML Encoding Rules (XER) tool can automatically and losslessly generate the corresponding XML schema. (This is greatly simplified because an ASN.1 specification provides element names that are legal for XML elements as well.) The reverse is also true: XER tools can automatically generate an ASN.1 specification from an XML schema. This means that an application designer can define any data using ASN.1, including all manner of binary data down to assigning meanings to individual bits, and then generate an XML schema. Or, the designer can take an XML stream through XER and have it automatically re-encoded in a compact binary form using, for example, the Packed Encoding Rules.

2.7. Using XER, applications can have the best of both worlds: high transmission efficiency and the broadest range of data type support through ASN.1, as well as the design-time simplicity of XML representation. XER can be applied at the instance level rather than the schema level if the applications themselves are to remain unaffected. In other words, particular channels can put XER converters at both ends to enhance transmission efficiency "on-the-fly". This is especially useful for bandwidth-constrained channels such as cell phone networks, as well as for channels that must carry very high traffic volumes. In many WMO applications, high data volume and/or narrow bandwidth are major issues, and the efficient packing of binary data must be of primary interest.

2.8. ASN.1 is among the most common ways of describing data that moves over networks. Almost all of the world's digital voice telephone traffic is described via ASN.1, and most Internet network control traffic (Simple Network Management Protocol) is described by ASN.1 as well. The recently approved ITU (International Telecommunication Union) Recommendation X.1303, Common Alerting Protocol (CAP), provides a good example of supplementing XML with ASN.1.
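The instance-level arrangement described in 2.7, with converters at both ends of a channel, can be sketched in Python as follows. This is an analogy only: the standard-library struct module stands in for a compact binary encoding and is not an implementation of XER or PER, and the element names, field widths, and sample values are hypothetical.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description agreed by both ends of the channel:
# (element name, struct format code). Not taken from any WMO format.
SCHEMA = [("latitude_e4", "i"),     # signed 32-bit, degrees * 10^4
          ("longitude_e4", "i"),    # signed 32-bit, degrees * 10^4
          ("wind_tenths_ms", "H"),  # unsigned 16-bit
          ("quality_flag", "B")]    # unsigned 8-bit
PACKER = struct.Struct(">" + "".join(code for _, code in SCHEMA))


def xml_to_packed(xml_text):
    """Sending end: read the XML instance and emit fixed-width binary."""
    root = ET.fromstring(xml_text)
    values = [int(root.findtext(name)) for name, _ in SCHEMA]
    return PACKER.pack(*values)


def packed_to_xml(blob):
    """Receiving end: rebuild the XML instance from the binary message."""
    root = ET.Element("report")
    for (name, _), value in zip(SCHEMA, PACKER.unpack(blob)):
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


original_xml = ("<report><latitude_e4>389907</latitude_e4>"
                "<longitude_e4>-770261</longitude_e4>"
                "<wind_tenths_ms>75</wind_tenths_ms>"
                "<quality_flag>1</quality_flag></report>")

wire = xml_to_packed(original_xml)   # what actually crosses the channel
restored = packed_to_xml(wire)       # what the receiving application sees
print(len(original_xml.encode("utf-8")), "bytes of XML ->", len(wire), "bytes on the wire")
print("round trip lossless:", restored == original_xml)
```

The applications on either side handle only XML; the saving on the channel comes from the shared description, which lets the binary message omit the element names.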
The use of XER to achieve ASN.1 and XML interoperability in the case of the standard X.1303, Common Alerting Protocol, is featured in another paper submitted to this meeting. CAP should become a widely used standard across WMO Members, as it standardizes all-hazards, all-media public warning.

2.9. CAP was originally developed in the OASIS standards body, which works primarily in the realm of e-business and Electronic Data Interchange. Message elements of the CAP standard were originally expressed using XML, and many applications had already deployed XML for their CAP messages. On the other hand, the ITU standards body is concerned with telecommunications efficiency. The ITU insisted that CAP messages be represented in ASN.1. Because of XER, the addition of ASN.1 for CAP messages was a simple add-on, and ASN.1 was readily accepted as an additional transmission format in the CAP standard.

2.10. XER transformations are lossless in both directions. However, there are syntax pairs for which a standardized syntax transformation is not lossless in practice, and perhaps not even in principle. The challenge then is to determine whether the degree of lossiness is acceptable in the context of the given application that requires interoperability. Some applications, such as those involving archiving, require syntax mechanisms to be interoperable over very long time periods. Designers of those applications must be especially mindful of the robustness of standardized syntax transformations.

3. Standards-based Interoperability at the Level of Semantics

3.1. As discussed above, syntactic interoperability is a straightforward task that can be automated in an application as long as there is a standardized syntax transformation with acceptable lossiness.
Yet, even though two components in an application have interoperable or identical syntax mechanisms (e.g., both use XML or both use ASN.1), the application designer must still consider issues of semantic interoperability.

3.2. For this discussion, data exchanged between components is considered to be interoperable at the semantic level if its meaning is understood well enough to justify whatever integration or transformation operations are being performed. In the simplest case, the respective data definitions for an element exchanged between two components may be identical. For example, both components may define a data element for a point location on the earth. As the definitions are identical, semantic interoperability is established. Of course, a syntax transformation may be necessary if one component represents the point using "latitude/longitude" and units of "degrees/minutes/seconds", while the other component represents the point using "x/y coordinates" and units of "decimal degrees".

3.3. Semantic interoperability is often constrained in practice by the lack of available data definitions that are good enough to justify the integration or transformation operations envisioned for the application. All too often, the definition of a data or information element is not written out in a data dictionary, but is only understood in the context of the computer programs that create or use the element. A somewhat better situation exists among systems that operate in a single "universe of discourse", such as the community of practice that focuses on Geographic Information Systems. To the extent that data element meanings are strongly constrained throughout such a community, system designers might focus primarily on syntactic interoperability. In general, however, to integrate or transform any data or information element without an explicit definition is an inherently risky undertaking.

3.4. The standard ISO/IEC 11179, Metadata Registries, provides useful guidance for documenting the meanings of data or information elements [2]. Because it is not tied to any particular information and communications technology, it provides great flexibility in relating elements one to the other within and among data dictionaries. As a tool for regularizing semantic expressions, it may also be helpful in establishing the base for automated inference in the case of complex semantic issues. Yet, it is important to note that the objective here is to have good definitions for data and information elements. Additional standards from the discipline of formal logic would need to be applied if the objective is to fully automate inferencing across collections of data dictionaries or metadata registries.

[2] A copy of the ISO/IEC 11179 standard is available at http://standards.iso.org/ittf/PubliclyAvailableStandards/c035343_ISO_IEC_11179-1_2004(E).zip

3.5. A policy promoting broader WMO use of guidance available in ISO/IEC 11179 would align well with existing WMO procedures, and would complement broader use of the ISO 191xx series of standards, especially ISO 19115, which uses ISO/IEC 11179 in defining metadata elements.

3.6. The following table gives examples of data definitions written in compliance with ISO/IEC 11179, as found in the Data Dictionary section of the standard X.1303, Common Alerting Protocol, referenced above.

Element Name: alert
Context. Class. Attribute. Representation: cap.alert.group
Definition and (Optionality): The container for all component parts of the alert message (REQUIRED)
Notes or Value Domain:
(1) Surrounds CAP alert message sub-elements.
(2) MUST include the xmlns attribute referencing the CAP URI as the namespace, e.g.:
    <cap:alert xmlns:cap="urn:oasis:names:tc:emergency:cap:1.1">
    [sub-elements]
    </cap:alert>
(3) In addition to the specified sub-elements, MAY contain one or more <info> blocks.

Element Name: identifier
Context. Class. Attribute. Representation: cap.alert.identifier
Definition and (Optionality): The identifier of the alert message (REQUIRED)
Notes or Value Domain:
(1) A number or string uniquely identifying this message, assigned by the sender
(2) MUST NOT include spaces, commas, or restricted characters (< and &)

Element Name: sender
Context. Class. Attribute. Representation: cap.alert.sender.identifier
Definition and (Optionality): The identifier of the sender of the alert message (REQUIRED)
Notes or Value Domain:
(1) Identifies the originator of this alert. Guaranteed by assigner to be unique globally; e.g., may be based on an Internet domain name
(2) MUST NOT include spaces, commas, or restricted characters (< and &)

Element Name: sent
Context. Class. Attribute. Representation: cap.alert.sent.time
Definition and (Optionality): The time and date of the origination of the alert message (REQUIRED)
Notes or Value Domain:
The date and time is represented in [dateTime] format (e.g., "2002-05-24T16:49:00-07:00" for 24 May 2002 at 16:49 PDT).

Element Name: status
Context. Class. Attribute. Representation: cap.alert.status.code
Definition and (Optionality): The code denoting the appropriate handling of the alert message (REQUIRED)
Notes or Value Domain:
Code Values:
"Actual" - Actionable by all targeted recipients
"Exercise" - Actionable only by designated exercise participants; exercise identifier should appear in <note>
"System" - For messages that support alert network internal functions.
"Test" - Technical testing only, all recipients disregard
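Written definitions of this kind can be turned into simple automated checks. The Python sketch below is an illustration under stated assumptions, not part of X.1303: the function name check_alert, the sample identifier and sender values, and the use of datetime.fromisoformat as a rough stand-in for the [dateTime] format are all choices made here. It tests only the rules listed in the table for the four required sub-elements.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

CAP_NS = "urn:oasis:names:tc:emergency:cap:1.1"
STATUS_CODES = {"Actual", "Exercise", "System", "Test"}  # from the table
RESTRICTED = re.compile(r"[ ,<&]")  # spaces, commas, < and &


def check_alert(xml_text):
    """Return a list of rule violations, keyed by registry-style name."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}alert" % CAP_NS:
        return ["cap.alert.group: root is not <alert> in the CAP namespace"]

    def value(name):
        return root.findtext("{%s}%s" % (CAP_NS, name))

    problems = []
    for name, registry_name in (("identifier", "cap.alert.identifier"),
                                ("sender", "cap.alert.sender.identifier")):
        text = value(name)
        if not text:
            problems.append(registry_name + ": missing (REQUIRED)")
        elif RESTRICTED.search(text):
            problems.append(registry_name + ": contains a restricted character")

    sent = value("sent")
    if not sent:
        problems.append("cap.alert.sent.time: missing (REQUIRED)")
    else:
        try:
            # More permissive than [dateTime]; a rough check only.
            datetime.fromisoformat(sent)
        except ValueError:
            problems.append("cap.alert.sent.time: not a dateTime value")

    status = value("status")
    if not status:
        problems.append("cap.alert.status.code: missing (REQUIRED)")
    elif status not in STATUS_CODES:
        problems.append("cap.alert.status.code: value outside the code list")
    return problems


# Hypothetical messages: one that follows the table, one that breaks two rules.
GOOD = """<cap:alert xmlns:cap="urn:oasis:names:tc:emergency:cap:1.1">
  <cap:identifier>EXAMPLE-0001</cap:identifier>
  <cap:sender>warnings@example.org</cap:sender>
  <cap:sent>2002-05-24T16:49:00-07:00</cap:sent>
  <cap:status>Actual</cap:status>
</cap:alert>"""
BAD = GOOD.replace("EXAMPLE-0001", "EXAMPLE 0001").replace("Actual", "Drill")

print(check_alert(GOOD))
print(check_alert(BAD))
```

Such checks confirm only that a value lies within its documented value domain. Whether two systems mean the same thing by "sender" or "status" still depends on the definitions themselves, which is the point made in section 3.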