Efficient XML Interchange Introduction - Open-DIS

Efficient XML Interchange
XML
Why is XML good?
• A widely accepted standard for data representation
• Fairly simple format
• Flexible
It’s not used by everyone, but it’s used by enough
people to make for a rich tools environment
It’s flexible enough to be used in lots of contexts
It’s text based and human readable, which makes it a
good archival format
XML
XML in 10 points
http://www.w3.org/XML/1999/XML-in-10-Points
Includes (3) “XML is meant to be read”, and (4)
“XML is verbose by design”
XML can be read by humans (though it should not have to be),
and it is not very compact
XML
These design principles also make it very
difficult to use XML in some environments
• Wireless military links: low bandwidth
• Mobile devices: battery life limitations
• Processing efficiency: it can take CPU cycles
to parse XML
• Data binding
Limitations
Lots of ships have 64 Kbit/sec at best. It is problematic
to ship XML across these links
CPUs are on Moore’s law curve, but battery power is
limited by the state of chemistry. We can’t assume
that faster processors will save us. Lots of
applications for hand held devices with limited
battery power (cell phones, etc.)
Cell phones don’t necessarily have strong CPUs, so
parsing XML can be expensive relative to other tasks
Data Binding
This is a more subtle problem.
<Point x="1.0" y="2.0"/>
How do you convert this to an object? You need
to parse the string “1.0”, then convert it to a
binary representation
It’s the difference between
string x;
And
float x;
Data Binding
Typically something comes in from the wire,
and you have to do the Java equivalent of
Float.parseFloat("1.0");
This is expensive when working with numeric-heavy documents
It is much more efficient to keep the value X in
a binary representation in the document, then
simply read it on the receiving side
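The difference between the two receiving paths can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the 4-byte IEEE 754 layout is an assumption chosen for simplicity, not a description of how EXI actually encodes floats.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; not part of EXI or any XML library.
final class DataBindingSketch {

    // Text path: the value arrives as characters and must be parsed.
    static float fromText(String attributeValue) {
        return Float.parseFloat(attributeValue);
    }

    // Binary path: the value arrives as 4 bytes (assumed IEEE 754, big-endian)
    // and is read directly, with no character parsing.
    static float fromBinary(byte[] wireBytes) {
        return ByteBuffer.wrap(wireBytes).getFloat();
    }

    public static void main(String[] args) {
        byte[] wire = ByteBuffer.allocate(4).putFloat(1.0f).array();
        System.out.println(fromText("1.0"));
        System.out.println(fromBinary(wire));
    }
}
```

Both calls yield the same float; the binary path simply skips the string-to-number conversion step.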
Efficient XML Interchange
EXI relaxes some of the requirements of XML in order to
be more compact, faster to parse, and have better
data binding characteristics
• Relax the “human readable” requirement
• Allow binary data
What you get is an alternate encoding of the XML
infoset that is more compact, faster to parse, and
allows deployment in new environments that XML
previously could not be deployed in
EXI
EXI is being developed by a W3C working group
and is on a standards track. The hope is that
this will become a W3C-blessed encoding of
the XML infoset
Working group draft now working its way to
approval.
Approval requires multiple implementations, the blessing
of the W3C Technical Architecture Group, and approval by
other W3C working groups (encryption,
processors, etc.)
EXI
• Represents the same data as an XML
document, only in a more efficient encoding
• Minimal impact on other XML technologies,
such as encryption
• More efficient to parse, better data binding
performance
EXI
http://www.w3.org/XML/EXI
Includes file format specification, primer on EXI,
best practices
Note that one thing that is NOT specified is an
API for accessing the data. This is an
important and significant omission
Lack of a standardized typed API means we still
have to go through string representations
Typed API
What is meant by a typed API?
DOM and SAX return string values:
Attr anAttribute;
…
// DOM returns a String attribute value here
String val = anAttribute.getValue();
And then we need to convert val into a float via
Float aFloat = Float.parseFloat(val);
Typed API
But what we often want is the value specified in
the schema:
Float aFloat = anAttribute.getFloat();
There are proposals for a generalized typed
API, but it is not part of this standard
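As a rough sketch of what such a typed API might look like: the interface and class below are entirely hypothetical (DOM's Attr has no getFloat() method, and EXI specifies no API), and they only show the idea of handing back a value that was never converted to text.

```java
// Hypothetical typed attribute interface; not part of DOM, SAX, or EXI.
interface TypedAttribute {
    String getName();
    float getFloat();
}

// Hypothetical implementation that keeps the value in binary form.
final class BinaryFloatAttribute implements TypedAttribute {
    private final String name;
    private final float value; // stored as a float, never as a String

    BinaryFloatAttribute(String name, float value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public String getName() {
        return name;
    }

    @Override
    public float getFloat() {
        return value; // no Float.parseFloat() call needed
    }
}

final class TypedApiDemo {
    public static void main(String[] args) {
        TypedAttribute x = new BinaryFloatAttribute("x", 1.0f);
        System.out.println(x.getName() + " = " + x.getFloat());
    }
}
```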
EXI
EXI has several options to handle different
situations.
• You have an XML document and a schema
• You have an XML document but no schema
• You have an XML document, and a schema
that almost, but not quite, matches the
document
Element and Attribute Names
Tag names take up a lot of space, and can be
somewhat expensive to parse
<Name first="James" last="Madison">
<State>Virginia</State>
</Name>
Count up the characters used for markup here: 31/55,
roughly 50-60% of file size for markup tags
If we replace the character tags with numeric stand-ins
we can get much more compact, and it will be faster
to parse
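A minimal sketch of the numeric stand-in idea, assuming a simple table that hands out small integers to element and attribute names in first-seen order. The class and method names are hypothetical, and the real EXI encoding is more involved than this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical name table; illustrates the idea only, not the EXI format.
final class NameTableSketch {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the ID already assigned to a name, or assigns the next free one.
    int idFor(String name) {
        Integer id = ids.get(name);
        if (id == null) {
            id = ids.size();
            ids.put(name, id);
        }
        return id;
    }

    public static void main(String[] args) {
        NameTableSketch names = new NameTableSketch();
        for (String n : new String[] {"Name", "first", "last", "State", "Name"}) {
            System.out.println(n + " -> " + names.idFor(n));
        }
    }
}
```

Once both sides agree on the table, a small integer can travel on the wire in place of each repeated tag or attribute name.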
Schema-Informed
If you have a schema, that gives you type
information about the XML document. You
know that <foo x="1.0"/> means the x is a
float value rather than a string, because the
schema tells you that.
That means you can store the “1.0” value in a
binary format, which is generally more
compact and has the potential to have better
data binding with a typed API
Schemaless
What if you don’t have a schema? This means
you can’t exploit type information. But EXI
should support this situation, because it
should be a general solution
EXI handles this by replacing repeating strings
with a compact identifier
Schemaless
<Address town="Monterey" zip="93943"/>
The strings “Monterey” and the zip code are likely to be
repeated many times in an XML document. We can
create a table of these values, and then use the table
ID rather than the whole string
String      ID
Monterey    1
93940       2
San Jose    3
98842       5
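The string table idea can be sketched like this: the first time a value appears it is written out in full and remembered, and every later occurrence is replaced by its table ID. The class name, the "#" prefix for IDs, and the sequential numbering starting at 1 are assumptions made for illustration, not the EXI wire format.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical string table encoder; illustrates the idea only.
final class StringTableSketch {
    private final Map<String, Integer> table = new HashMap<>();

    // First occurrence: emit the literal and remember it.
    // Repeat occurrence: emit only the table ID.
    String encode(String value) {
        Integer id = table.get(value);
        if (id != null) {
            return "#" + id;
        }
        table.put(value, table.size() + 1);
        return value;
    }

    public static void main(String[] args) {
        StringTableSketch t = new StringTableSketch();
        String[] values = {"Monterey", "93943", "Monterey", "San Jose", "93943"};
        for (String v : values) {
            System.out.println(t.encode(v));
        }
    }
}
```

The more often a value repeats in the document, the more the table saves, since each repeat costs only a short identifier.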
“Almost” Schemas
If you have a document that doesn’t quite
match the schema, EXI can take a forgiving
attitude. It uses the schema to encode the
types it knows about, and uses strings and
string table identifiers to handle the ones not
described by the schema
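The forgiving behavior might be sketched as below, assuming the schema is reduced to a set of attribute names known to be floats. Everything here is hypothetical (the class, the 4-byte float layout, the UTF-8 fallback); it only shows the "use the type when the schema knows it, fall back to a string otherwise" decision.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

// Hypothetical encoder; the "schema" is just a set of float-typed attribute names.
final class LenientEncoderSketch {
    private final Set<String> floatAttributes;

    LenientEncoderSketch(Set<String> floatAttributes) {
        this.floatAttributes = floatAttributes;
    }

    byte[] encode(String attributeName, String textValue) {
        if (floatAttributes.contains(attributeName)) {
            // Known to the schema: store as a 4-byte binary float (assumed layout).
            return ByteBuffer.allocate(4).putFloat(Float.parseFloat(textValue)).array();
        }
        // Not described by the schema: fall back to the string form.
        return textValue.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LenientEncoderSketch enc = new LenientEncoderSketch(Collections.singleton("x"));
        System.out.println(enc.encode("x", "1.0").length);
        System.out.println(enc.encode("comment", "hello").length);
    }
}
```

In a fuller version the string fallback would also go through a string table so that repeated unknown values stay compact.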
Implementations
As of now there is one implementation of the
draft spec, Efficient XML from Agile Delta
(http://www.agiledelta.com)
Other open source projects underway, and
some commercial projects
The standards process requires that multiple
independent implementations be available
before the standard is approved
Results
Example: Distributed Interactive Simulation
(DIS) is an IEEE standard for modeling and
simulation. It is a binary standard that
contains (x,y,z), velocity, acceleration, and
other numeric-heavy data
We did an XML representation of the binary DIS
standard
Results
Format               1 PDU    1000 PDUs
DIS Binary (bytes)   144      464,480
DIS XML              1167     3,924,680
EXI                  129      365,564
Results
• Somewhat better size than the original binary format.
The exact size varies somewhat depending on the
numeric data, while the original binary format is
always the same size. EXI seems to be consistently
better, though
• AND it is marked up in a way that makes it equivalent
to an XML file. This means we can easily access all
the tools of the XML ecosystem by simply converting
it to a text XML representation
Conclusions
Replace all text XML with EXI? No! EXI is intended to
expand the use of XML into use cases that XML could
not service. XML mostly does fine in its existing
environment
EXI can be used to XML-ify existing binary protocols and
get slightly better performance with greatly increased
interoperability (no one knows DIS binary, everyone
knows XML)
Next great frontier: typed XML APIs