Efficient XML Report
Version 1.0
3/15/2010
National Center for Atmospheric Research

DOCUMENT REVISION REGISTER

Version | Date       | Content Changes
1.0     | 03/15/2010 | Preliminary draft on compactness

Please direct comments or questions to:

Aaron Braeckel
National Center for Atmospheric Research
Research Applications Laboratory
3450 Mitchell Lane
Boulder, CO 80301
braeckel@ucar.edu
(303)497-2806

Editors
Aaron Braeckel

Contributors
Aaron Braeckel

Terms of Use – NNEW Documentation

The following Terms of Use applies to the NNEW Documentation.

1. Use. The User may use NNEW Documentation for any lawful purpose without any fee or cost. Any modification, reproduction and redistribution may take place without further permission so long as proper copyright notices and acknowledgements are retained.

2. Acknowledgement. The NNEW Documentation was developed through the sponsorship of the Federal Aviation Administration.

3. Copyright. Any copyright notice contained in this Terms of Use, the NNEW Documentation, any software code, or any part of the website shall remain intact and unaltered and shall be affixed to any use, distribution or copy. Except as specifically permitted herein, the user is not granted any express or implied right under any patents, copyrights, trademarks, or other intellectual property rights with respect to the NNEW Documentation.

4. No Endorsements. The names, MIT, Lincoln Labs, UCAR and NCAR, may not be used in any advertising or publicity to endorse or promote any program, project, product or commercial entity.

5. Limitation of Liability. The NNEW Documentation, including all content and materials, is provided "as is." There are no warranties of use, fitness for any purpose, service or goods, implied or direct, associated with the NNEW Documentation and MIT and UCAR expressly disclaim any warranties.
In no event shall MIT or UCAR be liable for any damages of any nature suffered by any user, or any third party resulting in whole or in part from use of the NNEW Documentation.

Table of Contents

OVERVIEW
    Binary and Efficient XML
    Memory Usage
    Processing
    Compactness
        Increased network bandwidth requirements
        Increased storage requirements
        Increased data latencies
EXISTING WORK
    W3C XML Binary Characterization Working Group (XBC WG)
    W3C XML Efficient XML Working Group (EXI WG)
    MIT Lincoln Labs FastInfoset and EXI weather comparison
    NCAR Preliminary Sun’s FastInfoset Evaluation
    NCAR Exificient (EXI) Library Compactness Assessment
SOLUTION CLASSES
    Data-agnostic compression
    Hardware
    XML-Wrapped Binary
    Efficient/Binary XML Formats
ASSESSMENT
    Environment
        Software
        Hardware
        Configuration
    Data
    Output
    Analysis
FUTURE WORK
RECOMMENDATIONS
APPENDIX A - ACRONYMS
APPENDIX B - DATA EXAMPLES
    Aircraft Reports
    AIR/SIGMETs
    METARs
    TAFs
APPENDIX C - DEFINITIONS AND TERMS
APPENDIX D - REFERENCES

OVERVIEW

The eXtensible Markup Language (XML) has become ubiquitous in software systems and data exchange since its release in 1998 by the W3C.
XML is now the de facto standard data format across most domains, including Service Oriented Architectures (SOA). This is due to a number of advantages:

• Human-readable
• Self-describing
• Hardware, software, and platform-independent
• Expressive data model (trees, graphs, etc.)
• Extensible
• Validatable
• Namespaces

However, these benefits can come with a performance cost as compared to many legacy formats. This includes increased processing, less compactness, and increased memory usage during common operations such as data parsing, storage, and regular data exchange. For example, a DoD study noted a 10x to 100x file size increase when moving from “legacy” data formats to XML (1).

This report analyzes the efficiency cost and alternatives for XML usage in the weather domain. This analysis may also be highly relevant to other scientific domains dealing with large data volumes. As XML becomes a mission-critical component of modern data systems and as data volumes in the weather domain increase, it is essential to understand the efficiency characteristics of XML.

Binary and Efficient XML

The terms “binary XML” and “efficient XML” are often used conjointly. For the purposes of this report, efficient XML is considered a superset of binary XML: binary XML approaches are one strategy for solving the more general efficient XML problem. This report analyzes the broader set of techniques for efficient XML (parsing techniques, hardware, alternative XML encodings, etc.).

PROBLEM DESCRIPTION

For the purposes of this report, the following solution criteria are considered:

• Open standard
• Minimal impacts on existing XML characteristics, such as platform-independence
• Minimal impacts on the XML family of functionality, such as:
    o XPath
    o XQuery
    o XSLT (transformation)
    o XML Schema
• Minimal impacts to developers and users

One of the efficiency characteristics of XML comes from the representation of numeric data as character data.
For example, the integer value -12345 fits in two octets (bytes) when encoded directly as a 16-bit signed integer. Represented in XML/UTF-8 it takes six characters (“-12345”), each requiring one octet. In XML/UTF-16 each character requires two octets, for twelve octets in total.

For human readability, many XML documents contain a significant amount of whitespace that is not used for machine-readable purposes. This particular issue is a good example of the tradeoff between usability and efficiency that can be made with XML. Whitespace is critical for humans working with XML data but significantly impacts file sizes and automated data transfer.

Memory Usage

XML decoding and encoding can be more memory-intensive than with binary equivalents. This impact can be lessened by choosing an appropriate technique for encoding and decoding XML. Generally, most memory problems with XML can be addressed by making use of event-driven techniques. While object model techniques such as DOM are very natural for many developers, storing the entire XML model in memory while operating on it can have a significant memory impact. In-memory representations of XML objects can be many times the size of the original XML document. Several alternative parsing techniques are described below as discrete examples.

DOM (Document Object Model) Parsing

With DOM decoding, an XML library is used to build a set of XML-specific representations in memory that are then used to construct domain objects. In object-oriented languages, a DOM library typically includes objects representing the fundamental XML concepts such as Elements, Attributes, and Documents. The decoding software would typically take appropriate action (translate to a domain-specific object, perform an action, etc.) based on this in-memory XML model. Figure 1 shows a simple example of DOM parsing in Java.
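The figure itself is not reproduced in this text. As a stand-in, the following is a minimal sketch of DOM parsing using the standard JAXP API (javax.xml.parsers and org.w3c.dom). The class name DomParsingSketch, the helper stationIds, and the element names response, METAR, and station_id are illustrative assumptions and are not taken from the original figure.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class DomParsingSketch {

    // Hypothetical helper: returns the text of every <station_id> element.
    static List<String> stationIds(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Step 1: the parser builds the complete XML object model in memory.
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Step 2: the object model is queried for its contents.
        NodeList nodes = doc.getElementsByTagName("station_id");

        // Step 3: DOM nodes are translated into domain values, so the data
        // now exists twice in memory (once as DOM, once as domain objects).
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent());
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<response>"
                + "<METAR><station_id>KDEN</station_id></METAR>"
                + "<METAR><station_id>KBOS</station_id></METAR>"
                + "</response>";
        System.out.println(stationIds(xml)); // prints [KDEN, KBOS]
    }
}
```

The whole document is held in memory between steps 1 and 3, which is the memory cost this section describes.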
Note the sequence: the parser is first asked to build an XML object model, and this object model is then queried for its contents. In many cases this results in duplicate in-memory representations as the DOM objects are translated to domain-specific objects.

Figure 1: DOM Parsing Example

Simple API for XML (SAX) Parsing

SAX parsing is event-driven. Rather than the parser building an in-memory representation of the XML document and passing it to the developer, events are fired whenever the parser encounters the start of an element, a new attribute, or any other significant parsing event. Here is an example of Java code to parse XML using SAX. For clarity, this example only handles the event fired when the opening tag of an element is detected:

Figure 2: SAX Parsing Example

Streaming API for XML (StAX) Parsing

StAX parsing is similar to SAX parsing in that it is also event-driven. However, instead of the parser pushing events to interested parties (as in SAX), these parties query the StAX parser for the next event. This parsing model tends to give the developer more control over when events are handled, and retains the streaming/event-driven nature of SAX parsing.

Figure 3: StAX Parsing Example

Processing

Processing efficiency can have a broad impact on both high-end server installations and mobile devices. Processing efficiency impacts can be broken into:

• Encoding time: the amount of processing required to encode data files to be passed to another system component. In most cases this is the result of a data producer sending data to a data consumer.
• Decoding time: the amount of processing required to decode or parse the data contents. In many cases this takes place when a data consumer is parsing the result from a data producer.

Mobile devices, in particular, are sensitive to processing efficiency. Increased processing work on mobile devices can have a significant impact on battery life.
However, XML data is relatively infrequently processed on mobile devices; instead, the XML is processed into derived products such as images for consumption on mobile devices.

Even for data systems with few constraints on hardware, processing efficiency can have a cumulative impact on the time taken to pass data through the system. This is most notable in cases where a series of system components exchange data before it reaches its final destination.

Compactness

In many cases data compactness can have an even greater importance than processing. There are several specific consequences of poor data compactness.

Increased network bandwidth requirements

In high-end data systems, mobile devices, and dedicated aircraft devices, network bandwidth is of critical importance. Wide-area network (WAN) bandwidth can often be prohibitively expensive, and in some cases can drive fundamental system design decisions. The cost of WAN bandwidth can be one of the more significant ongoing expenses for data systems. Relative to processing impacts, it is notable that processing (CPU) improvements have historically far outstripped improvements in WAN speeds.

Increased storage requirements

Data storage is a fundamental driver for data staging and archiving use cases. Increased file sizes can also impact the processing work required to find and deliver data to downstream customers. Generally speaking, increased storage requirements are not a major cost or performance driver. In many cases increased storage may be offset by the minimal cost and good performance of storage devices, but it is useful to consider when analyzing efficiency.

Increased data latencies

There are many scenarios where the delay in delivering data to consumers is a critical system consideration. This becomes particularly important when system components are chained together.
In this case the time taken to pass data across the network can become cumulatively significant and the increased data latency is multiplied by the number of systems participating in the data exchanges.

EXISTING WORK

Many analyses have taken place on how to overcome the efficiency problems with XML. Most of these have studied the general characteristics of XML across all domains, but there are several weather-specific studies of note.

W3C XML Binary Characterization Working Group (XBC WG)

The W3C convened the Binary Characterization Working Group (2) to collect use cases and gather requirements (3) for a more efficient XML encoding. This working group concluded that it is possible to address these requirements with an alternative XML encoding and that it is critically important that the W3C do so. The critical properties identified by the XBC WG include:

• Directly Readable & Writable
• Transport Independence
• Compactness
• Human Language Neutral
• Platform Neutrality
• Integratable into XML Stack
• Royalty Free
• Fragmentable
• Streamable
• Roundtrip Support
• Generality
• Schema Extensions and Deviations
• Format Version Identifier
• Content Type Management
• Self-Contained

These properties are defined and explained in detail in the XBC WG’s final report (4).

W3C XML Efficient XML Working Group (EXI WG)

The W3C convened the Efficient XML Working Group (5) as a follow-on activity to define and measure the benefits of an alternative encoding of the XML information set (data model) to provide more efficient XML. This encoding must also meet the requirements defined by the Binary Characterization Working Group. The Efficient XML Working Group analyzed a number of solutions for efficient XML based on a broad set of use cases (as defined by the Binary Characterization WG) with a large and varied set of sample files. The WG subsequently published their test results and testing framework. The EXI WG evaluated (6) nine different encoding alternatives.
Based on the necessary and desirable properties, the EXI WG evaluated which formats met the minimum requirements to be a candidate format. A summary of the findings is reproduced here:

Format                           | Meets Minimum Requirements?
XML + GZIP                       | No
Fast Infoset                     | No
FXDI (Fujitsu Binary)            | No
Efficient XML (AgileDelta)       | Yes
Xebu                             | No
X.694 with BER                   | No
X.694 with PER                   | No
X.694 with PER + Fast Infoset    | Yes
esXML                            | No

Figure 4: EXI WG Candidates

By way of example, XML + GZIP did not meet either the Compactness or Generality properties. Definitions of the property types and explanations of the process and conclusions may be found in the EXI WG’s measurements note (6).

Based on their testing results, the EXI WG defined an alternative XML encoding called EXI which was largely based upon AgileDelta’s EfficientXML format. The EXI format specification entered Candidate Recommendation status in late 2009, and is expected to produce a Final Recommendation.

MIT Lincoln Labs FastInfoset and EXI weather comparison

MIT Lincoln Labs performed a comparison of Sun’s implementation of Fast Infoset and AgileDelta’s EfficientXML. Note that EfficientXML is closely related to the W3C’s EXI format (as described in the preceding section) but does not include several features eventually included in the final EXI format specification. This trial was performed with 134 XML cases. These files were of two types: NCML-GML and polar radar data. These weather-specific data files were placed within the W3C’s EXI Test Framework and a series of in-memory round-trip encode/decode operations were performed.

MIT LL concluded that Sun’s Fast Infoset and EfficientXML were comparable formats. EfficientXML produced more compact results (83.8% compactness vs 75% compactness), and Fast Infoset had better processing characteristics (86ms vs 207ms per run). It was judged that EfficientXML’s compactness advantage was the more important factor and that EfficientXML had the overall advantage.
NCAR Preliminary Sun’s FastInfoset Evaluation

The National Center for Atmospheric Research (NCAR) performed a weather-specific evaluation of compactness and processing characteristics of four common weather products:

• Decoded AIR/SIGMETs
• Decoded METARs
• Decoded PIREPs
• Decoded TAFs

Product     | Count (# of records) | XML File Size | Fast Infoset File Size | XML Parsing Time | Fast Infoset Parsing Time
AIR/SIGMETs |                    5 |          7 KB |           3 KB (0.43)  |            18 ms |             13 ms (0.72)
METARs      |                 1481 |       1167 KB |         373 KB (0.32)  |            84 ms |             56 ms (0.667)
PIREPs      |                  158 |        155 KB |          51 KB (0.33)  |            29 ms |             29 ms (1.0)
TAFs        |                  177 |        471 KB |          98 KB (0.208) |            57 ms |             39 ms (0.684)

Figure 5: FastInfoset Weather Analysis (values in parentheses are the ratio of the Fast Infoset value to the corresponding XML value)

This evaluation used the Japex framework, on which the EXI WG testing framework is based. These assessments were performed without schema information. Averaged over all products, and consistent with the ratios in Figure 5, Sun’s Fast Infoset was found to reduce file size to roughly 33% of the original file size, and to reduce parsing (processing) time to roughly 75% of raw XML parsing time.

This evaluation concluded that there were considerable efficiency gains in both compactness and processing time when using Sun’s Fast Infoset. There was no case in which performance was worse. It was expected that further trials that provided Fast Infoset with schema information could improve compactness significantly.

NCAR Exificient (EXI) Library Compactness Assessment

Once it became clear that the W3C EXI WG was favoring EXI over Fast Infoset, NCAR did an assessment of compactness for four common products:

• Decoded AIR/SIGMETs
• Decoded METARs
• Decoded PIREPs
• Decoded TAFs

This evaluation used the 0.2 version of the Exificient library for writing EXI data files. At the time of the assessment Exificient did not implement the full set of EXI features, and as such this assessment was intended as a measurement of an early version of Exificient rather than a set of EXI guidance metrics. The generated EXI files were compared to their original XML versions, and GZIPed versions of the original XML and of the EXI-encoded files were measured.
The conclusion of this assessment was that over all four products the Exificient/EXI files averaged 0.13 of the original file sizes. By comparison, the GZIPed XML files averaged 0.07 of the original XML file sizes. GZIPing the EXI files achieved about the same level of compactness as GZIPing the original XML files. Both schema-informed and schema-less EXI files were generated. All schema-informed versions were larger than their schema-less equivalents, which was attributed to limitations of the 0.2 version of Exificient. The Exificient library was found to correctly preserve the complete XML data model (including whitespace) in a roundtrip from XML -> EXI -> XML.

SOLUTION CLASSES

Data-agnostic compression

Data-agnostic compression includes techniques such as GZIP, ZIP, BZIP2, and LZMA. These utilities are used to compress the stream or file. This technique does not preserve the native XML document model, and as such the XML must be decompressed prior to being transformed, parsed, or worked with as an XML document.

Different data-agnostic compression techniques have differing tradeoffs between the compression achieved and the processing time required. GZIP is often considered a reasonable middle-ground solution that balances good compression with a minimal processing impact. BZIP2 often requires more processing time than GZIP, but offers better compression. Generally speaking, data-agnostic compression addresses the compactness problem but requires additional processing time to compress and decompress the contents.

Hardware

Not long after XML was standardized, a family of hardware solutions emerged to address XML efficiency issues. These are typically referred to as “XML appliances”. In some cases the appliances are transparent to developers, and in other cases they require the use of custom libraries and drivers. Most modern XML appliances are rack-mountable and may be purchased from a vendor.
XML-Wrapped Binary

Another alternative for efficient XML is to encode metadata as XML, and wrap this around more efficient binary contents. The advantages of this approach are that XML may be used where it is needed for its flexibility, and can describe the contents of embedded raw data values encoded in a more efficient binary form.

This approach has several significant limitations, however. The mixed model can be difficult to integrate into XML libraries and utilities and segregates the data model into two separate components. The wrapped binary contents are opaque to XML libraries and tools, and as such cannot be transformed or acted upon by the XML technology stack (XSLT, XPath, etc.).

Binary XML Formats

Efficient/binary XML formats attempt to address the problem of efficiency by finding alternate, non-textual encodings. Some solutions in this space may simply be considered an XML-specific compression scheme, whereas others preserve the complete XML data model and may be considered a binary encoding of XML. This latter category of format can be applied at a fairly low level, so that libraries and XML-related technologies may still be used. However, all solutions in this category lose the human-readability characteristic in favor of efficiency.

Many alternatives have emerged among binary XML data formats. Those not sponsored and/or developed by a standards body were not considered. If standardization were not a criterion, there are numerous other binary XML formats that could have been included, such as XMill.

Whenever binary and/or efficient XML formats are discussed, inevitably the question of human readability is raised. Almost all binary XML solutions discard the text encoding for efficiency, which removes the human readability and transparency that has been one of the cornerstones of XML’s success. This is analogous to the issue of whether to include whitespace in XML for human readability and debugging.
For the purposes of weather data exchange, the efficiency gains of lossless binary XML approaches are considered worthwhile as long as sufficient tooling exists to conveniently support human-readable translation scenarios. Each system must be evaluated independently, but the essential characteristic in using binary XML is that it be losslessly (and easily) translatable between its binary and its human-readable XML form.

Data Format  | Standards Bodies            | Notes | W3C EXI Characteristics (7)
Fast Infoset | ITU-T, ISO                  |       | Was not considered to satisfy: Compactness, Generality
EXI          | W3C*                        | In the W3C’s Candidate Recommendation phase. The W3C EXI WG appears to be making recent progress. | Meets all W3C EXI/XBC WG requirements
BiM          | ISO (MPEG WG)               |       | Not measured
BXML         | OGC                         | Offered as a best practice paper rather than an official standard. The OGC has stated that it will not move BXML to a supported standard but will support BXML until an international standard for compressed XML emerges (8) | Not measured
WBXML        | Open Mobile Alliance, W3C*  | WBXML is a proposed W3C standard. It does not appear to be active in the W3C any longer. Originally developed as a compact wireless mobile format. | Not measured

In the opinion of a developer involved in both the Fast Infoset and EXI standardization groups, EXI seems to be in a leadership position with Fast Infoset and BiM slightly behind. Simple search engine popularity analyses seem to support this position. Within the wireless and mobile communities WBXML is in use, but significant adoption outside of these communities has not been observed. BXML has been used within the OGC avionics community and among some OGC members.

Based on community interest levels and standardization maturity, EXI, Fast Infoset, and BiM are the leading candidates. Because BiM was not a candidate that was evaluated by the EXI WG, it is not clear which of the W3C’s requirements BiM does or does not meet.
ASSESSMENT

There were several goals for this assessment:

• Compare the compactness of XML with several binary XML solutions, legacy binary formats, and GZIPed equivalents
• Compare GML-based weather files (WXXM 1.1.1) to simple (ADDS Dataserver) weather files to ascertain the effectiveness of the compactness techniques across different original data structures

The EXI WG Test Framework was a convenient and relatively mature starting point for efficient XML analysis. This test framework is based upon Japex (9), a performance micro-benchmarking framework for Java. Japex executes performance trials and produces HTML output (such as statistics and graphs) of the results. The EXI test framework includes Japex configurations for all the formats evaluated by the EXI WG.

Environment

For configuration and logic re-use, software versions were kept as similar to the EXI WG test framework as possible.

Software

• Java 1.6.0_19-b04 (32 bit): Java 1.6 was used as the basis for the analysis.
• JAPEX 1.1 (https://japex.dev.java.net/): Japex is a library used by the W3C EXI WG for their compactness and processing evaluation. Japex was also used as a foundational component for measuring compactness in this assessment.
• Sun’s Fast Infoset 1.2.2: 1.2.2 is not the most recent version of the library, but it was the version used in the EXI WG Test Framework.
• Exificient 0.4: The Exificient library was used for EXI format generation. As of version 0.4, there are a number of EXI features not yet implemented.

Host and Hardware

Dual core 64-bit 3 GHz CPU running 32-bit Debian Linux.

Configuration

All trials were run with the following Java command line options:

-server -mx2650M -Djapex.numberOfThreads=1

Data

The original data files were collected from the ADDS Dataserver version 1.2.
For each product in the assessment (METARs, TAFs, AIR/SIGMETs, and aircraft reports) a file was retrieved that contained a single record, and another file was retrieved that contained 24 hours of data for that product:

• Aircraft reports (31057 records for a 24 hour period)
• AIR/SIGMET reports (317 records for a 24 hour period)
• METAR reports (160761 records for a 24 hour period)
• TAF reports (24783 records for a 24 hour period)

These data files were then converted to WXXM 1.1.1 format. This process was not perfect, and is estimated to have converted 90-95% of the ADDS Dataserver contents to WXXM 1.1.1 equivalents. In some cases default enumerated values were inserted. WXXM was considered a relatively realistic and non-trivial schema for weather data exchange scenarios. The ADDS Dataserver data is mature, but it does not encode a unified model for data exchange and in some ways represents a best-case, straightforward scenario for XML data exchange. Examples of ADDS Dataserver and WXXM 1.1.1 data are shown in Appendix B.

For METAR and TAF data, octet-wise legacy binary formats were available. For example, each binary METAR record started with the following:

Figure 6: Binary METAR

The binary formats for TAF and METAR are straightforward binary representations of the decoded products. The binary formats were not tuned for extreme compactness, and decoded fields were not encoded at the sub-octet (bit) level. Typical byte lengths for each product are included in the results.

Output

The WXXM 1.1.1 data was used as the baseline in all graphs and comparisons.

Aircraft Reports

Figure 7: Aircraft Reports Report

AIR/SIGMETs

Figure 8: AIR/SIGMETs Report

METARs

Figure 9: METARs Report

TAFs

Figure 10: TAFs Report

24 Hours Product Summary

Figure 11 lists the raw octet counts for every product in each format. All sizes are in bytes/octets for 24 hours of records. "Avg. Size" is the average size across all products, and "Avg. Compaction" is relative to Baseline WXXM.

Format                                  | Aircraft Reports | AIR/SIGMETs |    METARs |      TAFs | Avg. Size | Avg. Compaction
Baseline WXXM                           |         39608698 |      737248 | 301227422 | 125747574 | 116830236 | 1.00
Formatted WXXM                          |         53319755 |      932382 | 409222080 | 171613035 | 158771813 | 1.36
Baseline ADDS                           |         16013190 |      617833 | 113930448 |  45817011 |  44094621 | 0.38
Formatted ADDS                          |         19305370 |      869289 | 140365356 |  56260159 |  54200044 | 0.46
Exificient WXXM (with schema)           |          4934941 |      269321 |  33557667 |  12128631 |  12722640 | 0.11
Exificient WXXM (without schema)        |          4461832 |      244939 |  24078397 |   8462731 |   9311975 | 0.08
Exificient ADDS (without schema)        |          3327079 |      203275 |  19312665 |   6393763 |   7309196 | 0.06
Sun’s FastInfoset WXXM (with schema)    |          7001802 |      381684 |  45577140 |  16982830 |  17485864 | 0.15
Sun’s FastInfoset WXXM (without schema) |          6694989 |      375845 |  43365472 |  15601237 |  16509386 | 0.14
Sun’s FastInfoset ADDS (without schema) |          3901922 |      234078 |  23006964 |   8184989 | 8831988.3 | 0.08
GZIP WXXM                               |          1744655 |       69561 |  13241820 |   4270138 |   4831544 | 0.04
GZIP ADDS                               |          1598625 |       54343 |  10546794 |   3030971 |   3807683 | 0.03
GZIP Exificient WXXM (with schema)      |          2090570 |      139817 |  13243397 |   3975517 |   4862325 | 0.04
GZIP Sun’s FastInfoset (without schema) |          1320964 |       59635 |   8759934 |   2554030 |   3173641 | 0.03
Legacy Binary (TAFs and METARs only)    |              N/A |         N/A |  25468818 |  10434486 |  17951652 | 0.15

Figure 11: Compaction Assessment Results (24 hrs of records)

Single Product Summary

Figure 12 lists the raw octet counts for every product in each format. All sizes are in bytes/octets for a single record. "Avg. Size" is the average size across all products, and "Avg. Compaction" is relative to Baseline WXXM.

Format                                  | Aircraft Reports | AIR/SIGMETs |    METARs |      TAFs | Avg. Size | Avg. Compaction
Baseline WXXM                           |             1969 |        3371 |      2927 |      2823 |    2772.5 | 1.00
Formatted WXXM                          |             2357 |        4002 |      3736 |      3436 |      3383 | 1.22
Baseline ADDS                           |              909 |        3345 |      1361 |      1351 |      1742 | 0.63
Formatted ADDS                          |             1045 |        5033 |      1595 |      1547 |      2305 | 0.83
Exificient WXXM (with schema)           |              276 |         921 |       486 |       596 |       570 | 0.21
Exificient WXXM (without schema)        |              680 |        1993 |      1105 |      1212 |      1248 | 0.45
Exificient ADDS (without schema)        |              582 |        1086 |       843 |       949 |       865 | 0.31
Sun’s FastInfoset WXXM (with schema)    |             1169 |        2497 |      1644 |      1752 |      1766 | 0.64
Sun’s FastInfoset WXXM (without schema) |             1133 |        2449 |      1602 |      1709 |      1723 | 0.62
Sun’s FastInfoset ADDS (without schema) |              665 |        1351 |       941 |      1039 |       999 | 0.36
GZIP WXXM                               |              718 |        1246 |       998 |      1022 |       996 | 0.36
GZIP ADDS                               |              506 |         789 |       725 |       682 |       676 | 0.24
GZIP Exificient WXXM (with schema)      |              299 |         944 |       509 |       502 |       564 | 0.20
GZIP Sun’s FastInfoset (without schema) |              728 |        1265 |      1018 |      1022 |      1008 | 0.36
Legacy Binary (TAFs and METARs only)    |              N/A |         N/A |       220 |       387 |       304 | 0.11

Figure 12: Compaction Assessment Results (single record)

Analysis

A comparison of the single record results and the 24-hour results indicates a noteworthy per-file overhead for all compacted data formats except EXI with a schema. As a percentage of bytes, the single record data files showed worse compaction than the 24-hour results. This is not unexpected, but it is worth considering that if single records are transferred or stored independently, almost all of the approaches will compact the data significantly less than when records are batched. The remainder of the analysis focuses upon the 24-hour data files as they are considered more statistically significant.

Formatted data (with human-readable whitespace) was 25-35% larger than its unformatted equivalent. The removal of human-generated XML formatting is a lossy process (specialized formatting created by a human cannot be restored), but in cases of automated data transfer/storage the formatting typically does not carry any meaningful information.

For all compression techniques/data formats, the benefits were very obvious.
GZIP compression was able to compress data to 4% of the original file sizes, and Exificient (the best efficient XML technique) was able to compress data to 8% of the original file sizes. In data systems where compaction efficiency is a concern, these compression ratios will be very noticeable.

One of the primary goals of this assessment was to compare EXI and FastInfoset and assess their effectiveness in compressing weather data. Based on the implementations used in this study (Exificient 0.4 and Sun’s FastInfoset 1.2.2) it appears that EXI provides better compaction for weather data. As shown in Figure 13, Exificient produced the smaller output in every case.

| 24 hours of data records | Exificient Average Size (bytes) | Sun’s FastInfoset Average Size (bytes) | Exificient to Sun’s FastInfoset compactness |
|---|---|---|---|
| WXXM (schema-informed) | 12722640 | 17485864 | 0.73 |
| WXXM (no schema) | 9311975 | 16509386 | 0.56 |
| ADDS (no schema) | 7309196 | 8831988 | 0.83 |

Figure 13: EXI and FastInfoset Comparison

A puzzling aspect of the results is that schema-informed WXXM provided worse compactness than schema-less techniques for both EXI and FastInfoset. It was expected that schema information (such as the numeric schema types) would result in greater compactness than when EXI and FastInfoset were given no type information. This could be the result of WXXM’s method of representing numeric data, or could indicate that neither of the libraries made complete use of the schemas. Schema-informed ADDS data might have provided some insight into this behavior.

While EXI and FastInfoset did not compress to GZIP levels, EXI + GZIP and FastInfoset + GZIP appear to give levels of compression very similar to those of the original XML + GZIP. This characteristic could be useful in scenarios where data is staged in EXI for fast access, and could then be archived in GZIPed (or another data-agnostic compression scheme) EXI for greater compactness.
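As a rough illustration of how a data-agnostic compressor such as GZIP behaves on a single small XML record compared with a large batch of records, the sketch below uses only Python's standard gzip module. The record layout, the station_id field, the response wrapper element, and the batch size are assumptions made for illustration; they are not taken from the ADDS or WXXM data used in this assessment, and the ratios it prints are not comparable to the figures reported here.

```python
import gzip


def make_record(i: int) -> bytes:
    """Build one hypothetical METAR-like record (illustrative fields only)."""
    temp_c = (i % 400) / 10.0
    station = f"K{i % 1000:03d}"  # made-up station identifier
    return (
        f"<METAR><station_id>{station}</station_id>"
        f"<temp_c>{temp_c}</temp_c></METAR>"
    ).encode("utf-8")


def compaction(raw: bytes) -> float:
    """Compressed size divided by original size (lower is more compact)."""
    return len(gzip.compress(raw)) / len(raw)


# One record on its own versus many records in a single document.
single = make_record(0)
batch = (
    b"<response>"
    + b"".join(make_record(i) for i in range(5000))
    + b"</response>"
)

single_ratio = compaction(single)
batch_ratio = compaction(batch)

print(f"single record: {len(single)} octets, ratio {single_ratio:.2f}")
print(f"5000 records:  {len(batch)} octets, ratio {batch_ratio:.2f}")
```

The fixed GZIP header and trailer, together with the lack of repeated structure to exploit, dominate for the single record, which is consistent with the single-record overhead noted in the analysis.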
If WXXM and ADDS data are compared, it is clear that the complexity of the model has a significant effect on data compactness. As stated earlier, not all ADDS data elements were completely converted to WXXM, yet the baseline ADDS data was still only 34% (formatted) and 38% (unformatted) of the size of the converted WXXM. This trait may be at least partially explained by the GML Object-Property-Value model, which requires that GML application schemas represent every relationship between two objects as a separate XML element. WXXM also has a much deeper level of nesting on average for similar concepts. For example, in the ADDS METARs format air temperature is represented as:

    <METAR>
      <temp_c>1.1</temp_c>
    </METAR>

Whereas WXXM represents the same information in a structure similar to:

    <ns3:METAR>
      <ns3:aerodromeWxObservation>
        <ns8:Observation>
          <ns8:result>
            <ns3:airTemperature uom="C">33.97999954223633</ns3:airTemperature>
          </ns8:result>
        </ns8:Observation>
      </ns3:aerodromeWxObservation>
    </ns3:METAR>

It is anticipated that the automated conversion process could be improved. Note the additional attributes in the example below, such as ns2:type, xsi:type, and xsi:nil:

    <ns3:aerodromeWxObservation ns2:type="simple">
      <ns8:Observation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="ns4:ObservationType">
        <ns1:boundedBy xsi:nil="true"/>
        …..

It is possible that some of these extra attributes could be removed from the WXXM data, which might make the file sizes more comparable.

FUTURE WORK

The latest round of weather-specific assessments does not include any processing measurements. The EXI test framework includes two additional components of interest: processing and network round-trip. Future assessments should include more information on the tradeoff between processing and compactness, and should measure how these efficient XML techniques influence transactions per second in low and high bandwidth scenarios.
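To make the nesting comparison between the ADDS and WXXM temperature examples in the Analysis more concrete, the following sketch counts elements and measures the maximum nesting depth of the two fragments using Python's standard xml.etree.ElementTree module. The namespace URIs bound to ns3 and ns8 are placeholders (assumptions) added only so that the fragment parses; the real WXXM and GML namespace URIs are not reproduced here, so the octet count printed for the WXXM fragment is illustrative only.

```python
import xml.etree.ElementTree as ET

# ADDS-style fragment, as shown in the analysis.
ADDS = "<METAR><temp_c>1.1</temp_c></METAR>"

# WXXM-style fragment. The xmlns declarations use placeholder URIs
# (assumptions) so the prefixes resolve; they are not the real schema URIs.
WXXM = (
    '<ns3:METAR xmlns:ns3="urn:example:ns3" xmlns:ns8="urn:example:ns8">'
    "<ns3:aerodromeWxObservation><ns8:Observation><ns8:result>"
    '<ns3:airTemperature uom="C">33.97999954223633</ns3:airTemperature>'
    "</ns8:result></ns8:Observation></ns3:aerodromeWxObservation>"
    "</ns3:METAR>"
)


def max_depth(elem, depth=1):
    """Return the deepest element nesting level, counting the root as 1."""
    return max([depth] + [max_depth(child, depth + 1) for child in elem])


def count_elements(elem):
    """Return the number of elements in the tree, including the root."""
    return sum(1 for _ in elem.iter())


for name, doc in (("ADDS", ADDS), ("WXXM", WXXM)):
    root = ET.fromstring(doc)
    print(
        name,
        "octets:", len(doc.encode("utf-8")),
        "elements:", count_elements(root),
        "max depth:", max_depth(root),
    )
```

For these two fragments the ADDS form has two elements nested two levels deep, while the WXXM form has five elements nested five levels deep to carry one temperature value.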
An area that should be investigated further is why schema-informed EXI and FastInfoset output was larger than the schema-less versions. As mentioned in the analysis, it may be helpful to include schema-informed ADDS data in the next trial to identify whether it is a data model/schema problem or whether the testing framework is misconfigured. It is also possible that slight modifications to WXXM schemas could result in much better compactness, for example if the numeric values do not currently carry XML schema type information and there is not sufficient information in the schemas to allow for full compaction.

For a number of reasons it would be desirable to improve the conversion process (as described above). This would reduce uncertainty in future assessments.

Because of the early version of the Exificient library, it would be very beneficial to include other EXI implementations in future trials. Exificient does not yet implement a number of EXI features, and has not yet gone through an in-depth performance tuning process. AgileDelta (10) has a commercial, full implementation of the current EXI specification that could provide much more accurate estimates of EXI compactness and overall efficiency.

It would be useful to identify how well formatted XML compresses in EXI/FastInfoset. All the compressed scenarios of this assessment used unformatted data, and it is unclear whether removing formatting has a significant impact on efficiency, and therefore whether it is worth removing when not needed.

As one of the three leading candidates in the standardization realm, BiM would be helpful to add to future assessments.

A study of the relationship between different measurements of XML complexity and file sizes might be beneficial for estimating compactness characteristics without direct experiments. Complexity metrics might also be useful in understanding some of the more basic properties of weather data.
Two obvious measures of complexity are the maximum depth/nesting of elements and attributes and instance document entropy. An analysis of the entropy of various products or product types (such as that described in (11)) might give guidance on the best-case scenario for data representations. It is possible that data entropy and/or complexity could be used to roughly estimate data compactness.

The performance improvements and other impacts of hardware XML accelerators should be evaluated. They could be a viable (and relatively simple) technology solution for some scenarios, but it is not yet clear whether this class of solution can provide a transparent, cross-platform, efficient solution for common weather scenarios.

RECOMMENDATIONS

In general, efficient XML techniques show clear benefits in both compactness and processing. While processing was not measured by this assessment, the assessments noted in the EXISTING WORK section clearly demonstrate both benefits.

If network bandwidth is a significant cost driver or bandwidth is in any way an important characteristic of the system, data-agnostic compression or efficient XML formats should be used.

In cases where bandwidth and/or data storage sizes are important and processing time is not an issue, data-agnostic compression techniques are the most straightforward approach and would offer the greatest compactness.

UTF-8 should be used rather than UTF-16 when working with XML documents primarily written in English. Other encoding schemes (ISO-8859, UTF-16, etc.) may be used for other languages, but in each case an encoding should be chosen that matches the language well.

SAX or StAX decoding should be utilized to minimize memory consumption. DOM-based parsing is often most natural for developers, but can require significantly more memory because the entire XML model is kept in memory.
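The SAX recommendation above can be illustrated with a minimal sketch using Python's standard xml.sax module. The handler keeps only the values it is interested in, so memory use does not grow with the size of the document tree as it would with a DOM. The response wrapper element and the sample values are assumptions for illustration; only the temp_c element name comes from the ADDS example in this report.

```python
import io
import xml.sax


class TempCollector(xml.sax.ContentHandler):
    """Collect <temp_c> values while streaming; no document tree is built."""

    def __init__(self):
        super().__init__()
        self.temps = []
        self._buf = None  # holds text chunks only while inside <temp_c>

    def startElement(self, name, attrs):
        if name == "temp_c":
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "temp_c":
            self.temps.append(float("".join(self._buf)))
            self._buf = None


# Hypothetical two-record document; a real feed would be read from a
# file or socket stream instead of an in-memory buffer.
doc = (
    b"<response>"
    b"<METAR><temp_c>1.1</temp_c></METAR>"
    b"<METAR><temp_c>-3.0</temp_c></METAR>"
    b"</response>"
)

handler = TempCollector()
xml.sax.parse(io.BytesIO(doc), handler)
print(handler.temps)
```

The same pattern applies to StAX-style pull parsers on other platforms; the choice between push (SAX) and pull (StAX) is mostly a matter of which control flow suits the application.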
APPENDIX A - ACRONYMS

AIRMET
SIGMET
API – Application Programming Interface
DOM – Document Object Model
GML – Geography Markup Language
METAR
MIME – Multipurpose Internet Mail Extensions
NCAR – National Center for Atmospheric Research
NNEW – NextGen Network Enabled Weather
OGC – Open Geospatial Consortium
SAX – Simple API for XML
StAX – Streaming API for XML
SOAP – The technology formerly known as Simple Object Access Protocol
TAF – Terminal Aerodrome Forecasts
UOM – Unit(s) of measure
URI – Uniform Resource Identifiers
URL – Uniform Resource Locators
URN – Uniform Resource Names
XML – Extensible Markup Language

APPENDIX B – DATA EXAMPLES

This section contains examples of the WXXM 1.1.1 and ADDS Dataserver 1.2 data files used in the assessment. Note that formatting (whitespace) has been added for readability, and that for several files some redundant text was omitted and replaced with “…” or a comment.

Aircraft Reports
The aircraft reports dataset included PIREPs, AIREPs, and AMDARs.
ADDS Dataserver 1.2:
WXXM 1.1.1:

AIR/SIGMETs
ADDS Dataserver 1.2:
WXXM 1.1.1:

METARs
ADDS Dataserver 1.2:
WXXM 1.1.1:

TAFs
ADDS Dataserver 1.2:
WXXM 1.1.1:

APPENDIX C - DEFINITIONS AND TERMS

Octet – 8 bits. While often used synonymously with byte, the term byte is overloaded and does not always indicate exactly 8 bits.

APPENDIX D - REFERENCES

1 Efficient XML – Taking Net-Centric Operations to the Edge. John Schneider
2 W3C XML Binary Characterization Working Group. http://www.w3.org/XML/Binary/
3 W3C XML Binary Characterization Working Group Minimum Binary XML Requirements. http://www.w3.org/TR/xbc-characterization/#N102EC
4 W3C Binary Characterization Working Group Analysis. http://www.w3.org/TR/xbc-characterization/
5 W3C Efficient XML Interchange Working Group (EXI). http://www.w3.org/XML/EXI/
6 W3C Efficient XML Interchange Working Group Measurements Note. http://www.w3.org/TR/2007/WD-exi-measurements-20070725/
7 W3C Efficient XML Working Group Measurements Note – Requirements. http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#contributions-assessment
8 BXML/OGC Clarification on the OGC Forum. http://feature.opengeospatial.org/forumbb/viewtopic.php?t=1193
9 Japex, Java Micro-benchmarking framework. https://japex.dev.java.net/
10 AgileDelta. http://www.agiledelta.com
11 An Analysis of XML Compression Efficiency. C. Augeri, B. Mullins, et al. http://www.usenix.org/events/expcs07/papers/7-augeri.pdf