Binary and Efficient XML

Efficient XML Report
Version 1.0
3/15/2010
National Center for Atmospheric Research
DOCUMENT REVISION REGISTER
Version   Date         Content Changes
1.0       03/15/2010   Preliminary draft on compactness
Please direct comments or questions to:
Aaron Braeckel
National Center for Atmospheric Research
Research Applications Laboratory
3450 Mitchell Lane
Boulder, CO 80301
braeckel@ucar.edu
(303)497-2806
Editors
Aaron Braeckel
Contributors
Aaron Braeckel
Terms of Use – NNEW Documentation
The following Terms of Use applies to the NNEW Documentation.
1. Use. The User may use NNEW Documentation for any lawful purpose without any fee or cost.
Any modification, reproduction and redistribution may take place without further permission so
long as proper copyright notices and acknowledgements are retained.
2. Acknowledgement. The NNEW Documentation was developed through the sponsorship of the
Federal Aviation Administration.
3. Copyright. Any copyright notice contained in this Terms of Use, the NNEW Documentation, any
software code, or any part of the website shall remain intact and unaltered and shall be affixed to
any use, distribution or copy. Except as specifically permitted herein, the user is not granted any
express or implied right under any patents, copyrights, trademarks, or other intellectual property
rights with respect to the NNEW Documentation.
4. No Endorsements. The names, MIT, Lincoln Labs, UCAR and NCAR, may not be used in any
advertising or publicity to endorse or promote any program, project, product or commercial
entity.
5. Limitation of Liability. The NNEW Documentation, including all content and materials, is
provided "as is." There are no warranties of use, fitness for any purpose, service or goods,
implied or direct, associated with the NNEW Documentation and MIT and UCAR expressly
disclaim any warranties. In no event shall MIT or UCAR be liable for any damages of any nature
suffered by any user, or any third party resulting in whole or in part from use of the NNEW
Documentation.
Table of Contents
OVERVIEW
Binary and Efficient XML
Memory Usage
Processing
Compactness
Increased network bandwidth requirements
Increased storage requirements
Increased data latencies
EXISTING WORK
W3C XML Binary Characterization Working Group (XBC WG)
W3C XML Efficient XML Working Group (EXI WG)
MIT Lincoln Labs FastInfoset and EXI weather comparison
NCAR Preliminary Sun’s FastInfoset Evaluation
NCAR Exificient (EXI) Library Compactness Assessment
SOLUTION CLASSES
Data-agnostic compression
Hardware
XML-Wrapped Binary
Efficient/Binary XML Formats
ASSESSMENT
Environment
Software
Hardware
Configuration
Data
Output
Analysis
FUTURE WORK
RECOMMENDATIONS
APPENDIX A - ACRONYMS
APPENDIX B – DATA EXAMPLES
Aircraft Reports
AIR/SIGMETs
METARs
TAFs
APPENDIX C - DEFINITIONS AND TERMS
APPENDIX D - REFERENCES
OVERVIEW
The eXtensible Markup Language (XML) has become ubiquitous in software systems and data exchange
since its release in 1998 by the W3C. XML is now the de facto standard data format across most
domains, including Service Oriented Architectures (SOA). This is due to a number of advantages:

• Human-readable
• Self-describing
• Hardware, software, and platform-independent
• Expressive data model (trees, graphs, etc.)
• Extensible
• Validatable
• Namespaces
However, these benefits can come with a performance cost as compared to many legacy formats. This
includes increased processing, less compactness, and increased memory usage during common
operations such as data parsing, storage, and regular data exchange. For example, a DoD study noted a
10x to 100x file size increase when moving from “legacy” data formats to XML (1).
This report analyzes the efficiency cost and alternatives for XML usage in the weather domain. This
analysis may also be highly relevant to other scientific domains dealing with large data volumes. As XML
becomes a mission-critical component of modern data systems and as data volumes in the weather
domain increase, it is essential to understand the efficiency characteristics of XML.
Binary and Efficient XML
The terms “binary XML” and “efficient XML” are often used interchangeably. For the purposes of this report,
efficient XML is considered a superset of binary XML: binary XML approaches are one strategy for solving
the more general efficient XML problem. This report analyzes the broader set of techniques for efficient
XML (parsing techniques, hardware, alternative XML encodings, etc.).
PROBLEM DESCRIPTION
For the purposes of this report, the following solution criteria are considered:

• Open standard
• Minimal impacts on existing XML characteristics, such as platform-independence
• Minimal impacts on the XML family of functionality, such as:
  o XPath
  o XQuery
  o XSLT (transformation)
  o XML Schema
• Minimal impacts to developers and users
One source of XML’s inefficiency is the representation of numeric data as character data. For example,
the integer value -12345 can be encoded in two bytes (octets) if encoded directly as a 16-bit integer
value. When this value is represented in XML/UTF-8 it is encoded as 6 characters (“-12345”), each
requiring one octet, for a total of 6 octets. In XML/UTF-16 each character requires two octets, for a
total of 12 octets.
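The octet counts above can be checked with a few lines of Java. This is a minimal sketch, assuming Java 8 or
later; the class and method names are hypothetical, and UTF-16BE is chosen so that no byte-order mark is
counted in the UTF-16 figure.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class NumericEncodingSize {
    /** Octets needed to store the value directly as a 16-bit signed integer. */
    static int binaryOctets(short value) {
        return ByteBuffer.allocate(Short.BYTES).putShort(value).array().length;
    }

    /** Octets needed for the decimal text form of the value in UTF-8. */
    static int utf8Octets(short value) {
        return Short.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Octets needed for the decimal text form in UTF-16 (big-endian, no byte-order mark). */
    static int utf16Octets(short value) {
        return Short.toString(value).getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        short value = -12345;
        System.out.println("binary: " + binaryOctets(value)); // binary: 2
        System.out.println("UTF-8:  " + utf8Octets(value));   // UTF-8:  6
        System.out.println("UTF-16: " + utf16Octets(value));  // UTF-16: 12
    }
}
```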
For human readability, many XML documents contain a significant amount of whitespace that is not
used for machine-readable purposes. This particular issue is a good example of the tradeoff between
usability and efficiency that can be made with XML. Whitespace is critical for humans working with XML
data but significantly impacts file sizes and automated data transfer.
Memory Usage
XML decoding and encoding can be more memory-intensive than with binary equivalents. This
particular impact can be lessened by using appropriate techniques to encode and decode XML.
There are several techniques by which XML can be encoded and decoded. Generally most memory
problems with XML can be addressed by making use of event-driven techniques. While object model
techniques such as DOM are very natural for many developers, it can have a significant memory impact
to store the entire XML model in memory while operating on it. In-memory representations of XML
objects can be many times the size of the original XML document. Several alternative parsing
techniques are described below as discrete examples.
DOM (Document Object Model) Parsing
With DOM decoding, an XML library is used to build an XML-specific representation in memory that
is then used to construct domain objects. In object-oriented languages, a DOM library typically includes
objects representing the fundamental XML concepts such as Elements, Attributes, and Documents. The
decoding software would typically take appropriate action (translate to a domain-specific object,
perform an action, etc.) based on this in-memory XML model.
Figure 1 shows a simple example of DOM parsing in Java. The parser is first asked to build an XML object
model, and this object model is then queried for its contents. In many cases this results in duplicate
in-memory representations as the DOM objects are translated to domain-specific objects.
Figure 1 DOM Parsing Example
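As an illustration of the build-then-query pattern, the following is a minimal sketch using the standard
javax.xml.parsers and org.w3c.dom APIs. The class name, the sample document, and the "metar" element and
"station" attribute are hypothetical and are not taken from the ADDS or WXXM schemas.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DomParsingSketch {
    /** Builds the complete DOM tree in memory, then queries it for station identifiers. */
    static List<String> stationIds(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Step 1: the parser builds the whole XML object model.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Step 2: the object model is queried and copied into domain-level values.
            List<String> ids = new ArrayList<>();
            NodeList metars = doc.getElementsByTagName("metar"); // hypothetical element name
            for (int i = 0; i < metars.getLength(); i++) {
                Element metar = (Element) metars.item(i);
                ids.add(metar.getAttribute("station"));          // hypothetical attribute name
            }
            return ids;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<metars><metar station=\"KDEN\"/><metar station=\"KBOS\"/></metars>";
        System.out.println(stationIds(xml)); // [KDEN, KBOS]
    }
}
```

Both the DOM tree and the returned list are held in memory at the same time, which is the duplication
described above.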
Simple API for XML (SAX) Parsing
SAX parsing is event-driven. Rather than the parser building an in-memory representation of the XML
document which is passed to the developer, events are fired whenever the parser encounters the start
of an element, a new attribute, or any other significant parsing event.
Here is an example of Java code to parse XML using SAX. This example only includes event handling for
when the opening tag of an element is detected for clarity:
Figure 2 SAX Parsing Example
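A minimal sketch of this push model, using the standard javax.xml.parsers and org.xml.sax APIs, might look
as follows. The class name and the sample document are hypothetical. Only the startElement event is
handled, and no document tree is retained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxParsingSketch {
    /** Collects the name of every element, in the order its opening tag is encountered. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                // Called by the parser each time an opening tag is read.
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<metars><metar station=\"KDEN\"/><metar station=\"KBOS\"/></metars>";
        System.out.println(elementNames(xml)); // [metars, metar, metar]
    }
}
```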
Streaming API for XML (StAX) Parsing
StAX parsing is similar to SAX parsing in that it is also event-driven. However, instead of the parser
pushing events to interested parties (as in SAX), these parties query the StAX parser for the next event.
This parsing model tends to give the developer more control over when events are handled, and retains
the streaming/event-driven nature of SAX parsing.
Figure 3 StAX Parsing Example
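A minimal sketch of this pull model, using the standard javax.xml.stream API, is given here. The class
name, the sample document, and the "metar" and "station" names are hypothetical. The calling code asks
for each event with next() and decides what to do with it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxParsingSketch {
    /** Pulls events from the parser one at a time and keeps only the station identifiers. */
    static List<String> stationIds(String xml) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // the caller requests the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && "metar".equals(reader.getLocalName())) {    // hypothetical element
                    ids.add(reader.getAttributeValue(null, "station")); // hypothetical attribute
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<metars><metar station=\"KDEN\"/><metar station=\"KBOS\"/></metars>";
        System.out.println(stationIds(xml)); // [KDEN, KBOS]
    }
}
```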
Processing
Processing efficiency can have a broad impact on both high-end server installations and mobile devices.
Processing efficiency impacts can be broken into:

• Encoding time – the amount of processing required to encode data files to be passed to another
  system component. In most cases this is the result of a data producer sending data to a data
  consumer
• Decoding time – the amount of processing required to decode or parse the data contents. In
  many cases this takes place when a data consumer is parsing the result from a data producer
Mobile devices, in particular, are sensitive to processing efficiency. Increased processing work on
mobile devices can have a significant impact on battery life. However, it is relatively infrequent for XML
data to be processed on mobile devices, and instead the XML is processed into derived products such as
images for consumption on mobile devices.
Even for data systems with few constraints on hardware, processing efficiency can have a cumulative
impact on the time taken to pass data through the system. This is most notable in cases where a series
of system components exchange data before it reaches its final destination.
Compactness
In many cases data compactness can have an even greater importance than processing. There are
several specific consequences of poor data compactness.
Increased network bandwidth requirements
In high-end data systems, mobile devices, and dedicated aircraft devices, network bandwidth is of critical
importance. Wide-area network bandwidth can often be prohibitively expensive, and in some cases can
drive fundamental system design decisions. The costs of WAN bandwidth can be one of the more
significant ongoing expenses for data systems. Relative to processing impacts, it is notable that
processing improvements (CPU) have historically far outstripped improvements in WAN speeds.
Increased storage requirements
Data storage is a fundamental driver for data staging and archiving use cases. Increased file sizes can
also impact the processing work required to find and deliver data to downstream customers. Generally
speaking, increased storage requirements are not a major cost or performance driver, because they are
often offset by the low cost and good performance of storage devices, but they are still useful to
consider when analyzing efficiency.
Increased data latencies
There are many scenarios where the delay in delivering data to consumers is a critical system
consideration. This becomes particularly important when system components are chained together. In
this case the time taken to pass data across the network can become cumulatively significant and the
increased data latency is multiplied by the number of systems participating in the data exchanges.
EXISTING WORK
Many analyses have examined how to overcome the efficiency problems with XML. Most of these
have studied the general characteristics of XML across all domains, but there are several
weather-specific studies of note.
W3C XML Binary Characterization Working Group (XBC WG)
The W3C convened the Binary Characterization Working Group (2) to collect use cases and gather
requirements (3) for a more efficient XML encoding. This working group concluded that it is possible to
address these requirements with an alternative XML encoding and that it is critically important that the
W3C do so.
The critical properties identified by the XBC WG include:

• Directly Readable & Writable
• Transport Independence
• Compactness
• Human Language Neutral
• Platform Neutrality
• Integratable into XML Stack
• Royalty Free
• Fragmentable
• Streamable
• Roundtrip Support
• Generality
• Schema Extensions and Deviations
• Format Version Identifier
• Content Type Management
• Self-Contained
These properties are defined and explained in detail in the XBC WG’s final report (4).
W3C XML Efficient XML Working Group (EXI WG)
The W3C convened the Efficient XML Working Group (5) as a follow-on activity to define and measure
the benefits of an alternative encoding of the XML information set (data model) to provide more
efficient XML. This encoding must also meet the requirements defined by the Binary Characterization
Working Group.
The Efficient XML Working Group analyzed a number of solutions for efficient XML based on a broad set
of use cases (as defined by the Binary Characterization WG) with a large and varied set of sample files.
The WG subsequently published their test results and testing framework.
The EXI WG evaluated (6) nine different encoding alternatives. Based on the necessary and desirable
properties, the EXI WG evaluated which formats met the minimum requirements to be a candidate
format. A summary of the findings is reproduced here:
Format                           Meets Minimum Requirements?
XML + GZIP                       No
Fast Infoset                     No
FXDI (Fujitsu Binary)            No
Efficient XML (AgileDelta)       Yes
Xebu                             No
X.694 with BER                   No
X.694 with PER                   No
X.694 with PER + Fast Infoset    Yes
esXML                            No
Figure 4: EXI WG Candidates
By way of example, XML + GZIP did not meet either the Compactness or Generality properties.
Definitions of the property types and explanations of the process and conclusions may be found in the
EXI WG’s measurements note (6).
Based on their testing results, the EXI WG defined an alternative XML encoding called EXI, which was
largely based upon AgileDelta’s EfficientXML format. The EXI format specification entered Candidate
Recommendation status in late 2009 and is expected to proceed to a final Recommendation.
MIT Lincoln Labs FastInfoset and EXI weather comparison
MIT Lincoln Labs performed a comparison of Sun’s implementation of Fast Infoset and AgileDelta’s
EfficientXML. Note that EfficientXML is closely related to the W3C’s EXI format (as described in the
preceding section on the EXI WG) but does not include several features eventually included in the final
EXI format specification. This trial was performed with 134 XML files of two types: NCML-GML and polar
radar data. These weather-specific data files were placed within the W3C’s EXI Test Framework and a
series of in-memory round-trip encode/decode operations were performed.
MIT LL concluded that Sun’s Fast Infoset and EfficientXML were comparable formats. EfficientXML
produced more compact results (83.8% compactness vs 75% compactness), and Fast Infoset had better
processing characteristics (86ms vs 207ms per run). It was judged that EfficientXML’s compactness
advantage was the more important factor and that EfficientXML had the overall advantage.
NCAR Preliminary Sun’s FastInfoset Evaluation
The National Center for Atmospheric Research (NCAR) performed a weather-specific evaluation of the
compactness and processing characteristics of Sun’s Fast Infoset for four common weather products:

• Decoded AIR/SIGMETs
• Decoded METARs
• Decoded PIREPs
• Decoded TAFs
Product       Count            XML File Size   Fast Infoset   XML Parsing   Fast Infoset
              (# of records)                   File Size      Time          Parsing Time
AIR/SIGMETs   5                7kb             3kb (0.43)     18ms          13ms (0.72)
METARs        1481             1167kb          373kb (0.32)   84ms          56ms (0.667)
PIREPs        158              155kb           51kb (0.33)    29ms          29ms (1.0)
TAFs          177              471kb           98kb (0.208)   57ms          39ms (0.684)
Figure 5: FastInfoset Weather Analysis
This evaluation used the Japex framework, on which the EXI WG testing framework is based. These
assessments were performed without schema information. Averaged over all products (the unweighted
mean of the ratios in Figure 5), Sun’s Fast Infoset reduced file size to roughly 32% of the original
file size and reduced parsing (processing) time to roughly 77% of raw XML parsing time.
This evaluation concluded that there were considerable efficiency gains in both compactness and
processing time when using Sun’s Fast Infoset. There was no case in which performance was worse. It
was expected that further trials that provided Fast Infoset with schema information could improve
compactness significantly.
NCAR Exificient (EXI) Library Compactness Assessment
Once it became clear that the W3C EXI WG was favoring EXI over Fast Infoset, NCAR did an assessment
of compactness for four common products:

• Decoded AIR/SIGMETs
• Decoded METARs
• Decoded PIREPs
• Decoded TAFs
This evaluation used the 0.2 version of the Exificient library for writing EXI data files. At the time of the
assessment Exificient did not implement the full set of EXI features, and as such this assessment was
intended as a measurement for an early version of Exificient rather than a set of EXI guidance metrics.
The generated EXI files were compared to their original XML versions, and GZIPed versions of the
original XML and of the EXI-encoded files were measured.
The conclusion of this assessment was that over all four products the average Exificient/EXI compression
was 0.13 of the original file sizes. By comparison, the GZIPed XML files averaged 0.07 of the original
XML file sizes. GZIPing the EXI files achieved about the same level of compactness as GZIPing the
original XML files.
Both schema-informed and schema-less EXI files were generated. All schema-informed
versions were larger than their schema-less equivalents, which was attributed to limitations of the
0.2 version of Exificient. The Exificient library was found to correctly preserve the complete XML data
model (including whitespace) in a round trip from XML -> EXI -> XML.
SOLUTION CLASSES
Data-agnostic compression
Data-agnostic compression includes techniques such as GZIP, ZIP, BZIP2, and LZMA. These utilities are
used to compress the stream or file. This technique does not preserve the native XML document model,
and as such the XML must be decompressed prior to being transformed, parsed, or worked with as an
XML document.
Different data-agnostic compression techniques have differing tradeoffs between the compression
achieved and the processing time required. GZIP is often considered a reasonable middle-ground
solution that balances good compression with a minimal processing impact. BZIP2 often requires more
processing time than GZIP, but offers better compression.
Generally speaking, data-agnostic compression addresses the compactness problem but requires
additional processing time to compress and decompress the contents.
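The compress-before-transfer, decompress-before-parse cycle can be sketched with the GZIP streams in the
standard java.util.zip package. This is an illustrative sketch only: the class and method names are
hypothetical, and it is not the procedure used to produce the measurements in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipXmlSketch {
    /** Compresses the UTF-8 bytes of an XML document with GZIP. */
    static byte[] compress(String xml) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so the bytes are read only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the XML text; this step is required before any XML tool can read the data. */
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder xml = new StringBuilder("<metars>");
        for (int i = 0; i < 500; i++) {
            xml.append("<metar station=\"KDEN\"><temp_c>12.0</temp_c></metar>");
        }
        xml.append("</metars>");
        byte[] packed = compress(xml.toString());
        System.out.println("original octets:   " + xml.length());
        System.out.println("compressed octets: " + packed.length);
        System.out.println("round trip equal:  " + decompress(packed).equals(xml.toString()));
    }
}
```

The compressed bytes are opaque to XML tooling, which is why decompression has to happen before parsing,
transformation, or validation.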
Hardware
Not long after XML was standardized a family of hardware solutions emerged to address XML efficiency
issues. These are typically referred to as “XML appliances”. In some cases the appliances are
transparent to developers, and in other cases they require the use of custom libraries and drivers.
Most modern XML appliances are rack-mountable and may be purchased from a vendor.
XML-Wrapped Binary
Another alternative for efficient XML is to encode metadata as XML, and wrap this around more efficient
binary contents. The advantages of this approach are that XML may be used where it is needed for its
flexibility, and can describe the contents of embedded raw data values encoded in a more efficient
binary form.
This approach has several significant limitations, however. The mixed model can be difficult to integrate
into XML libraries and utilities and segregates the data model into two separate components. The
wrapped binary contents are opaque to XML libraries and tools, and as such cannot be transformed or
acted upon by the XML technology stack (XSLT, XPATH, etc.).
Binary XML Formats
Efficient/binary XML formats attempt to address the problem of efficiency by finding alternate, non-textual encodings. Some solutions in this space may simply be considered an XML-specific compression
scheme, whereas others preserve the complete XML data model and may be considered a binary
encoding of XML. This latter category of format allows for the format to be applied at a fairly low level,
and libraries and XML-related technologies may still be used. However, all solutions in this category
lose the human-readability characteristic in favor of efficiency.
Many alternatives have emerged among binary XML data formats. Those not sponsored and/or
developed by a standards body were not considered. If standardization were not a criterion, there are
numerous other binary XML formats that could have been included such as XMill.
Whenever binary and/or efficient XML formats are discussed, inevitably the question of human
readability is raised. Almost all binary XML solutions discard the text encoding for efficiency, which
removes the human readability and transparency that has been one of the cornerstones of XML’s
success. This is analogous to the issue of whether to include whitespace in XML for human readability and
debugging. For the purposes of weather data exchange, the efficiency gains of lossless binary XML
approaches are considered worthwhile as long as sufficient tooling exists to conveniently support
human-readable translation scenarios. Each system must be evaluated independently, but the essential
characteristic in using binary XML is that it be losslessly (and easily) translatable between its binary and
its human-readable XML form.
Data Format: Fast Infoset
  Standards Bodies: ITU-T, ISO
  W3C EXI Characteristics (7): Was not considered to satisfy:
    • Compactness
    • Generality

Data Format: EXI
  Standards Bodies: W3C*
  Notes: In the W3C’s Candidate Recommendation phase. The W3C EXI WG appears to be making recent
  progress.
  W3C EXI Characteristics (7): Meets all W3C EXI/XBC WG requirements

Data Format: BiM
  Standards Bodies: ISO (MPEG WG)
  W3C EXI Characteristics (7): Not measured

Data Format: BXML
  Standards Bodies: OGC
  Notes: Offered as a best practice paper rather than an official standard. The OGC has stated that it
  will not move BXML to a supported standard but will support BXML until an international standard for
  compressed XML emerges (8)
  W3C EXI Characteristics (7): Not measured

Data Format: WBXML
  Standards Bodies: Open Mobile Alliance, W3C*
  Notes: WBXML is a proposed W3C standard. It does not appear to be active in the W3C any longer.
  Originally developed as a compact wireless mobile format.
  W3C EXI Characteristics (7): Not measured
In the opinion of a developer involved in both the Fast Infoset and EXI standardization groups, EXI seems
to be in a leadership position with Fast Infoset and BiM slightly behind. Simple search engine popularity
analyses seem to support this position. Within the wireless and mobile communities WBXML is in use,
but significant adoption outside of these communities has not been observed. BXML has been used
within the OGC avionics community and among some OGC members.
Based on community interest levels and standardization maturity, EXI, Fast Infoset, and BiM are the
leading candidates. Because BiM was not among the candidates evaluated by the EXI WG, it is
not clear which of the W3C’s requirements BiM does or does not meet.
ASSESSMENT
There were several goals for this assessment:

• Compare the compactness characteristics of XML with several binary XML solutions, legacy
  binary formats, and GZIPed equivalents
• Compare GML-based weather files (WXXM 1.1.1) to simple (ADDS Dataserver) weather files to
  ascertain the effectiveness of the compactness techniques across different original data
  structures
The EXI WG Test Framework was a convenient and relatively mature starting point for efficient XML
analysis. This test framework is based upon Japex (9), a performance micro-benchmarking framework
for Java. Japex executes performance trials and produces HTML output (such as statistics and graphs) of
the results. The EXI test framework includes Japex configurations for all the formats evaluated by the
EXI WG.
Environment
For configuration and logic re-use, software versions were kept as similar to the EXI WG test framework
as possible.
Software
Java 1.6.0_19-b04 (32 bit)
Java 1.6 was used as the basis for the analysis.
JAPEX 1.1 (https://japex.dev.java.net/)
Japex is a library used by the W3C EXI WG for their compactness and processing evaluation. Japex was also
used as a foundational component for measuring compactness in this assessment.
Sun’s Fast Infoset 1.2.2
1.2.2 is not the most recent version of the library, but it was the version used in the EXI WG Test
Framework.
Exificient 0.4
The Exificient library was used for EXI format generation. As of version 0.4, there are a number of EXI
features not yet implemented.
Host and Hardware
Dual-core 64-bit 3 GHz CPU running 32-bit Debian Linux.
Configuration
All trials were run with the following Java command line options:
-server -mx2650M -Djapex.numberOfThreads=1
Data
The original data files were collected from the ADDS Dataserver version 1.2. For each product in the
assessment (METARs, TAFs, AIR/SIGMETs, and aircraft reports) a file was retrieved that contained a
single record, and another file was retrieved that contained 24 hours of data for that product:

• Aircraft reports (31057 records for a 24-hour period)
• AIR/SIGMET reports (317 records for a 24-hour period)
• METAR reports (160761 records for a 24-hour period)
• TAF reports (24783 records for a 24-hour period)
These data files were then converted to WXXM 1.1.1 format. This process was not perfect, and is
estimated to have converted 90-95% of the ADDS Dataserver contents to WXXM 1.1.1 equivalents. In
some cases default enumerated values were inserted. WXXM was considered a relatively realistic and
non-trivial schema for weather data exchange scenarios. The ADDS Dataserver data is mature, but it
does not encode a unified model for data exchange and in some ways represents a best-case,
straightforward scenario for XML data exchange. Examples of ADDS Dataserver and WXXM 1.1.1 data
are shown in Appendix B.
For METAR and TAF data, octet-wise legacy binary formats were available. For example, each binary
METAR record started with the following:
Figure 6 Binary METAR
The binary formats for TAF and METAR are straightforward binary representations of the decoded
products. The binary formats were not tuned for extreme compactness, and decoded fields were not
encoded at the sub-octet (bit) level. Typical byte lengths for each product are included in the results.
Output
The WXXM 1.1.1 data was used as the baseline in all graphs and comparisons.
Aircraft Reports
Figure 7: Aircraft Reports Report
AIR/SIGMETs
Figure 8: AIR/SIGMETs Report
METARs
Figure 9: METARs Report
TAFs
Figure 10: TAFs Report
24 Hours Product Summary
Figure 11 lists the raw octet counts for every product in each format.
Byte/octet sizes                          Aircraft     AIR/SIGMETs  METARs       TAFs         Avg. Size    Avg. Compaction
                                          Reports                                             Across All   (relative to
                                          (24 hours)   (24 hours)   (24 hours)   (24 hours)   Products     Baseline WXXM)
Baseline WXXM                             39608698     737248       301227422    125747574    116830236    1.00
Formatted WXXM                            53319755     932382       409222080    171613035    158771813    1.36
Baseline ADDS                             16013190     617833       113930448    45817011     44094621     0.38
Formatted ADDS                            19305370     869289       140365356    56260159     54200044     0.46
Exificient WXXM (with schema)             4934941      269321       33557667     12128631     12722640     0.11
Exificient WXXM (without schema)          4461832      244939       24078397     8462731      9311975      0.08
Exificient ADDS (without schema)          3327079      203275       19312665     6393763      7309196      0.06
Sun’s FastInfoset WXXM (with schema)      7001802      381684       45577140     16982830     17485864     0.15
Sun’s FastInfoset WXXM (without schema)   6694989      375845       43365472     15601237     16509386     0.14
Sun’s FastInfoset ADDS (without schema)   3901922      234078       23006964     8184989      8831988.3    0.08
GZIP WXXM                                 1744655      69561        13241820     4270138      4831544      0.04
GZIP ADDS                                 1598625      54343        10546794     3030971      3807683      0.03
GZIP Exificient WXXM (with schema)        2090570      139817       13243397     3975517      4862325      0.04
GZIP Sun’s FastInfoset (without schema)   1320964      59635        8759934      2554030      3173641      0.03
Legacy Binary (TAFs and METARs only)      N/A          N/A          25468818     10434486     17951652     0.15
Figure 11: Compaction Assessment Results (24 hrs of records)
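The two derived columns in Figure 11 appear to be the unweighted mean of the four per-product sizes and that mean divided by the Baseline WXXM mean. This is inferred from the published numbers rather than stated in the report, so the following Python sketch should be read as a plausible reconstruction of the arithmetic. The function and variable names are illustrative only.

```python
# Per-product octet counts copied from Figure 11, in the order
# Aircraft Reports, AIR/SIGMETs, METARs, TAFs (24 hours each).
SIZES = {
    "Baseline WXXM": [39608698, 737248, 301227422, 125747574],
    "Baseline ADDS": [16013190, 617833, 113930448, 45817011],
    "Exificient WXXM (without schema)": [4461832, 244939, 24078397, 8462731],
    "GZIP WXXM": [1744655, 69561, 13241820, 4270138],
}


def average_size(sizes):
    """Unweighted mean of the per-product octet counts."""
    return sum(sizes) / len(sizes)


def compaction(sizes, baseline):
    """Average size relative to the baseline average (1.00 means same size)."""
    return average_size(sizes) / average_size(baseline)


baseline = SIZES["Baseline WXXM"]
for name, sizes in SIZES.items():
    print(f"{name}: avg={average_size(sizes):.1f} "
          f"compaction={compaction(sizes, baseline):.2f}")
```

Because the mean is unweighted, the large METAR files dominate the average, which is worth keeping in mind when reading the compaction column.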
Single Product Summary
Figure 12 lists the raw octet counts for every product in each format.
| Byte/octet sizes | Aircraft Reports (1 record) | AIR/SIGMETs (1 record) | METARs (1 record) | TAFs (1 record) | Avg. Size Across All Products | Avg. Compaction (relative to Baseline WXXM) |
|---|---|---|---|---|---|---|
| Baseline WXXM | 1969 | 3371 | 2927 | 2823 | 2772.5 | 1.00 |
| Formatted WXXM | 2357 | 4002 | 3736 | 3436 | 3383 | 1.22 |
| Baseline ADDS | 909 | 3345 | 1361 | 1351 | 1742 | 0.63 |
| Formatted ADDS | 1045 | 5033 | 1595 | 1547 | 2305 | 0.83 |
| Exificient WXXM (with schema) | 276 | 921 | 486 | 596 | 570 | 0.21 |
| Exificient WXXM (without schema) | 680 | 1993 | 1105 | 1212 | 1248 | 0.45 |
| Exificient ADDS (without schema) | 582 | 1086 | 843 | 949 | 865 | 0.31 |
| Sun’s FastInfoset WXXM (with schema) | 1169 | 2497 | 1644 | 1752 | 1766 | 0.64 |
| Sun’s FastInfoset WXXM (without schema) | 1133 | 2449 | 1602 | 1709 | 1723 | 0.62 |
| Sun’s FastInfoset ADDS (without schema) | 665 | 1351 | 941 | 1039 | 999 | 0.36 |
| GZIP WXXM | 718 | 1246 | 998 | 1022 | 996 | 0.36 |
| GZIP ADDS | 506 | 789 | 725 | 682 | 676 | 0.24 |
| GZIP Exificient WXXM (with schema) | 299 | 944 | 509 | 502 | 564 | 0.20 |
| GZIP Sun’s FastInfoset (without schema) | 728 | 1265 | 1018 | 1022 | 1008 | 0.36 |
| Legacy Binary (TAFs and METARs only) | N/A | N/A | 220 | 387 | 304 | 0.11 |
Figure 12: Compaction Assessment Results (single record)
Analysis
A comparison of the single-record results and the 24-hour results indicates a noteworthy overhead for
all compacted data formats except EXI with a schema. As a percentage of bytes, the single-record data
files showed worse compaction than the 24-hour results. This is not unexpected, but it is worth
noting that if single records are transferred or stored independently, almost all of the approaches
will yield significantly larger data sizes than when records are batched.
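The per-record overhead can be illustrated with GZIP alone. The sketch below uses Python's standard gzip module on a hypothetical, simplified METAR-like record (the element names and values are invented for illustration and are not the assessment data) and compares the compression ratio of one record against a batch of 1000.

```python
import gzip


def make_record(i):
    """Build a hypothetical, simplified METAR-like record (illustrative only)."""
    return (
        f"<METAR><station_id>K{i:03d}</station_id>"
        f"<temp_c>{(i % 40) - 5}.{i % 10}</temp_c>"
        f"<wind_speed_kt>{i % 30}</wind_speed_kt></METAR>"
    ).encode("utf-8")


def gzip_ratio(payload):
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(payload)) / len(payload)


single = make_record(1)
batch = b"<response>" + b"".join(make_record(i) for i in range(1000)) + b"</response>"

print(f"single record: {gzip_ratio(single):.2f}")
print(f"1000 records:  {gzip_ratio(batch):.2f}")
```

A single short record gives the compressor few repeated strings to exploit and still pays the fixed GZIP header and trailer, so its ratio is much worse than that of the batch.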
The remainder of the analysis focuses upon the 24-hour data files as they are considered more
statistically significant.
Formatted data (with human-readable whitespace) was roughly 25-35% larger than its unformatted
equivalent. The removal of human-generated XML formatting is a lossy process (specialized formatting
created by a human cannot be restored), but in cases of automated data transfer/storage the formatting
typically does not carry any meaningful information.
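As a minimal sketch of what removing formatting involves, the following Python example drops whitespace-only text nodes with the standard xml.etree.ElementTree module and reports the size difference. The sample document is hypothetical (only the temp_c element comes from the ADDS example in this report), and the approach assumes that whitespace-only nodes carry no meaning, which would not hold for mixed-content documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical formatted document used only for illustration.
FORMATTED = """<response>
    <METAR>
        <station_id>KDEN</station_id>
        <temp_c>1.1</temp_c>
    </METAR>
</response>"""


def strip_formatting(xml_text):
    """Drop whitespace-only text and tail nodes, then re-serialize."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    return ET.tostring(root, encoding="unicode")


compact = strip_formatting(FORMATTED)
growth = len(FORMATTED) / len(compact) - 1
print(compact)
print(f"formatted form is {growth:.0%} larger")
```

The growth figure for this toy document is not comparable to the 24-hour results; it only shows how the measurement could be reproduced.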
For all compression techniques/data formats, the benefits were clear. GZIP compression was
able to compress data to 4% of the baseline WXXM file sizes, and Exificient (the best efficient XML
technique) was able to compress data to 8% of the baseline WXXM file sizes. In data systems where
compaction efficiency is a concern, these compression ratios will be very noticeable.
One of the primary goals of this assessment was to compare EXI and FastInfoset and assess their
effectiveness in compressing weather data. Based on the implementations used in this study (Exificient
0.4 and Sun’s FastInfoset 1.2.2) it appears that EXI provides better compaction for weather data. As
shown in Figure 13, in all cases Exificient came out on top.
| 24 hours of data records | Exificient Average Size (bytes) | Sun’s FastInfoset Average Size (bytes) | Exificient to Sun’s FastInfoset compactness |
|---|---|---|---|
| WXXM (schema-informed) | 12722640 | 17485864 | 0.73 |
| WXXM (no schema) | 9311975 | 16509386 | 0.56 |
| ADDS (no schema) | 7309196 | 8831988 | 0.83 |
Figure 13: EXI and FastInfoset Comparison
A puzzling aspect of the results is that schema-informed WXXM provided worse compactness than
schema-less techniques for both EXI and FastInfoset. It was expected that schema information (such as
the numeric schema types) would result in greater compactness than when EXI and FastInfoset were
given no type information. This could be the result of WXXM’s method of representing numeric data, or
could indicate that neither of the libraries made complete use of the schemas. Schema-informed ADDS
data might have provided some insight into this behavior.
While EXI and FastInfoset on their own did not compress to GZIP levels, EXI + GZIP and FastInfoset + GZIP
appear to give levels of compression very similar to the original XML + GZIP. This characteristic could
be useful in scenarios where data is staged in EXI for fast access and then archived as GZIPed EXI
(or EXI under another data-agnostic compression scheme) for greater compactness.
If WXXM and ADDS data are compared, it is clear that the complexity of the model has a significant effect
on data compactness. As stated earlier in this report, not all ADDS data elements were completely
converted to WXXM, and yet the baseline ADDS data was still only 34% (formatted) and 38% (unformatted) of the size of the
converted WXXM. This trait may be at least partially explained by the GML Object-Property-Value
model, which requires that GML application schemas represent every relationship between two objects
as a separate XML element. WXXM also has a much deeper level of nesting on average for similar
concepts. For example, in the ADDS METARs format air temperature is represented as:
<METAR>
  <temp_c>1.1</temp_c>
</METAR>
Whereas WXXM represents the same information in a structure similar to:
<ns3:METAR>
  <ns3:aerodromeWxObservation>
    <ns8:Observation>
      <ns8:result>
        <ns3:airTemperature uom="C">33.97999954223633</ns3:airTemperature>
      </ns8:result>
    </ns8:Observation>
  </ns3:aerodromeWxObservation>
</ns3:METAR>
It is anticipated that the automated conversion process could be improved. Note the additional
attributes in the example below, such as ns2:type, xsi:type, and xsi:nil:
<ns3:aerodromeWxObservation ns2:type="simple">
<ns8:Observation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="ns4:ObservationType">
<ns1:boundedBy xsi:nil="true"/>
…..
It is possible that some of these extra attributes could be removed from the WXXM data and the file
sizes might be more comparable.
FUTURE WORK
The latest round of weather-specific assessments does not include any processing measurements. The
EXI test framework includes two additional components of interest: processing and network round-trip.
Future assessments should include more information on the tradeoff between processing and
compactness, and measure information on how these efficient XML techniques influence transactions
per second in low and high bandwidth scenarios.
An area that should be investigated further is why schema-informed EXI and FastInfoset output was larger
than the schema-less output. As mentioned in the analysis, it may be helpful to include schema-informed
ADDS data in the next trial to identify whether it is a data model/schema problem or whether the
testing framework is misconfigured. It is also possible that slight modifications to the WXXM schemas could
result in much better compactness, for example if the numeric values do not currently carry XML Schema
type information and the schemas therefore lack the information needed for full compaction.
For a number of reasons it would be desirable to improve the conversion process (as described above).
This would be useful to reduce uncertainty in future assessments.
Because of the early version of the Exificient library, it would be very beneficial to include other EXI
implementations in future trials. Exificient does not yet implement a number of EXI features, and has
not yet gone through an in-depth performance tuning process. AgileDelta (10) has a commercial, full
implementation of the current EXI specification that could provide much more accurate estimates of EXI
compactness and overall efficiency.
It would be useful to identify how well whitespace-formatted XML compresses in EXI/FastInfoset. All the
compressed scenarios of this assessment used unformatted data, and it is unclear whether removing
formatting has a significant impact on efficiency, and therefore whether it is worthwhile to remove
formatting when it is not needed.
As one of the three leading candidates in the standardization realm, BiM would be helpful to add to
future assessments.
A study of the relationship between different measurements of XML complexity and file sizes might be
beneficial for estimating compactness characteristics without direct experiments. Complexity metrics
might also be useful in understanding some of the more basic properties of weather data.
Two obvious measures of complexity are the maximum depth/nesting of elements and attributes and
instance document entropy. An analysis of the entropy of various products or product types (such as
that described in reference 11) might give guidance on the best-case scenario for data representations.
It is possible that data entropy and/or complexity could be used to roughly estimate data compactness.
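Both measures are simple to compute. The Python sketch below is one possible interpretation, not the method of reference 11: it measures the maximum element nesting depth and the zero-order Shannon entropy of the document's octets. Zero-order entropy ignores all context between octets, so it is only a rough indicator, and the function names are illustrative.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def max_depth(element, depth=1):
    """Deepest element nesting level, counting the root as level 1."""
    return max([depth] + [max_depth(child, depth + 1) for child in element])


def byte_entropy(data):
    """Zero-order Shannon entropy of a non-empty byte string, in bits per octet."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# ADDS METAR fragment taken from the analysis section of this report.
ADDS = "<METAR><temp_c>1.1</temp_c></METAR>"
doc = ADDS.encode("utf-8")

print("max depth:", max_depth(ET.fromstring(ADDS)))
print("entropy (bits/octet):", byte_entropy(doc))
# Size a coder that treats every octet independently could approach, in octets.
print("zero-order estimate:", byte_entropy(doc) * len(doc) / 8)
```

Context-aware techniques such as GZIP or schema-informed EXI can do better than the zero-order estimate, so it should not be read as a hard lower bound.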
The performance improvements and other impacts of hardware XML accelerators should be evaluated.
They could be a viable (and relatively simple) technology solution for some scenarios, but it is not yet
clear whether this class of solution can provide a transparent, cross-platform, efficient solution for
common weather scenarios.
RECOMMENDATIONS
In general, efficient XML techniques show clear benefits in both compactness and processing. Although
processing was not measured by this assessment, the assessments noted in the EXISTING WORK section
clearly demonstrate both benefits.
If network bandwidth is a significant cost driver or bandwidth is in any way an important characteristic
of the system, data-agnostic compression or efficient XML formats should be used. In cases where
bandwidth and/or data storage sizes are important and processing time is not an issue, data-agnostic
compression techniques are the most straightforward approach and would offer optimal compactness.
UTF-8 should be used rather than UTF-16 when working with XML documents primarily written using
the English language. Use other encoding schemes (ISO-8859, UTF-16, etc.) for other languages, but in
each case choose an encoding that matches well with the language.
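The size difference between encodings is easy to check. The following Python sketch encodes two sample strings, both hypothetical and chosen only for illustration, in UTF-8 and UTF-16 and reports the octet counts.

```python
# Hypothetical sample strings: one ASCII-heavy XML fragment and one short
# Japanese sentence. Neither is taken from the assessment data.
SAMPLES = {
    "ASCII-heavy XML": '<temp_c uom="C">1.1</temp_c>',
    "Japanese text": "気温は摂氏一度です",
}


def encoded_sizes(text):
    """Octet counts for the same text in UTF-8 and UTF-16 (little-endian, no BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in SAMPLES.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: UTF-8={utf8} octets, UTF-16={utf16} octets")
```

Markup and English text are mostly ASCII, which UTF-8 stores in one octet per character and UTF-16 in two. Many East Asian characters take three octets in UTF-8 but two in UTF-16.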
SAX or StAX decoding should be utilized to minimize memory consumption. DOM-based parsing is
often most natural for developers, but can require significantly more memory because the entire XML
document model is kept in memory.
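SAX and StAX are Java APIs. As a rough analogue, the Python sketch below uses the standard library's xml.etree.ElementTree.iterparse, a pull-style streaming parser, to process one record at a time. The document layout (a response element wrapping METAR records) is a hypothetical simplification.

```python
import io
import xml.etree.ElementTree as ET


def stream_temperatures(source):
    """Yield temp_c values one METAR at a time instead of loading the whole tree."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "METAR":
            yield float(element.findtext("temp_c"))
            # Discard the record's children once processed; the emptied
            # element itself stays attached to the root.
            element.clear()


# Hypothetical two-record document used only for illustration.
data = io.BytesIO(
    b"<response>"
    b"<METAR><temp_c>1.1</temp_c></METAR>"
    b"<METAR><temp_c>-3.0</temp_c></METAR>"
    b"</response>"
)
print(list(stream_temperatures(data)))
```

Memory use then depends mainly on the size of one record rather than the size of the whole file, which is the property the recommendation above relies on.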
APPENDIX A - ACRONYMS
AIRMET
SIGMET
API – Application Programming Interface
DOM
GML – Geography Markup Language
METAR
MIME – Multipurpose Internet Mail Extensions
NCAR – National Center for Atmospheric Research
NNEW – NextGen Network Enabled Weather
OGC – Open Geospatial Consortium
SAX – Simple API for XML
StAX
SOAP – The technology formerly known as Simple Object Access Protocol
TAF – Terminal Aerodrome Forecasts
UOM – Unit(s) of measure
URI – Uniform Resource Identifiers
URL – Uniform Resource Locators
URN – Uniform Resource Names
XML – Extensible Markup Language
APPENDIX B – DATA EXAMPLES
This section contains examples of the WXXM 1.1.1 and ADDS Dataserver 1.2 data files used in the
assessment. Note that formatting (whitespace) has been added for readability, and that for several files
some redundant text was omitted and replaced with “…” or a comment.
Aircraft Reports
The aircraft reports dataset included PIREPs, AIREPs, and AMDARs.
ADDS Dataserver 1.2:
WXXM 1.1.1:
AIR/SIGMETs
ADDS Dataserver 1.2:
WXXM 1.1.1:
METARs
ADDS Dataserver 1.2:
WXXM 1.1.1:
TAFs
ADDS Dataserver 1.2:
WXXM 1.1.1:
APPENDIX C - DEFINITIONS AND TERMS
Octet – 8 bits. While often used synonymously with byte, the term byte is overloaded and does not
always indicate exactly 8 bits.
APPENDIX D - REFERENCES
1 Efficient XML – Taking Net-Centric Operations to the Edge. John Schneider
2 W3C XML Binary Characterization Working Group. http://www.w3.org/XML/Binary/
3 W3C XML Binary Characterization Working Group Minimum Binary XML Requirements.
http://www.w3.org/TR/xbc-characterization/#N102EC
4 W3C Binary Characterization Working Group Analysis. http://www.w3.org/TR/xbc-characterization/
5 W3C Efficient XML Interchange Working Group (EXI). http://www.w3.org/XML/EXI/
6 W3C Efficient XML Interchange Working Group Measurements Note.
http://www.w3.org/TR/2007/WD-exi-measurements-20070725/
7 W3C Efficient XML Working Group Measurements Note – Requirements.
http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#contributions-assessment
8 BXML/OGC Clarification on the OGC Forum.
http://feature.opengeospatial.org/forumbb/viewtopic.php?t=1193
9 Japex, Java Micro-benchmarking framework. https://japex.dev.java.net/
10 AgileDelta. http://www.agiledelta.com
11 An Analysis of XML Compression Efficiency. C. Augeri, B. Mullins, et al.
http://www.usenix.org/events/expcs07/papers/7-augeri.pdf