[#JAXB-614] JAXB generates illegal XML characters

[JAXB-614] JAXB generates illegal XML characters Created: 18/Mar/09
Updated: 13/May/15
Status: Open
Project: jaxb
Component/s: runtime
Affects Version/s: 2.1
Fix Version/s: not determined
Type: Bug
Priority: Minor
Reporter: ranboii
Assignee: Martin Grebac
Resolution: Unresolved
Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:
    Operating System: All
    Platform: All
    URL: http://forums.java.net/jive/thread.jspa?threadID=59068
Issuezilla Id: 614
Description
Some characters (such as 0x1f) that are legal in Java strings are illegal in XML
(and XML does not provide a way to escape such characters to make them legal).
When JAXB marshals objects that contain these illegal characters in strings, it
currently includes those characters in the XML, thus generating invalid XML.
Later when it comes time to unmarshal the XML back into objects, an exception
is thrown due to the illegal character. This could spell disaster for a system
that, for example, writes objects as XML or Fast Infoset into a database and then
cannot read them back out later.
JAXB should not be allowed to ever generate invalid XML. If an exception is
going to be thrown, it should be thrown when generating the XML, not when trying
to decode it. So a minimum requirement should be that JAXB throw an exception
when attempting to generate invalid XML, or that it should at least strip out
the characters that would be invalid (or have a property on the marshaller that
allows this to be set).
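The "throw when generating" behaviour requested above can be sketched without touching JAXB itself. The following is a minimal illustration that assumes the XML 1.0 Char production as the definition of a legal character. The class and method names (XmlCharGuard, firstIllegalIndex, requireValid) are hypothetical and are not part of the JAXB API. A caller could run such a check on string values before handing an object to the marshaller.

```java
final class XmlCharGuard {
    private XmlCharGuard() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal character, or -1 if there is none. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which the predicate above rejects.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Fails fast, at write time, instead of producing unreadable XML. */
    static String requireValid(String s) {
        int i = firstIllegalIndex(s);
        if (i >= 0) {
            throw new IllegalArgumentException(String.format(
                    "Illegal XML 1.0 character U+%04X at index %d", s.codePointAt(i), i));
        }
        return s;
    }
}
```

With this sketch, requireValid applied to the string from the unit test (a single 0x1f character) throws an IllegalArgumentException, while ordinary text passes through unchanged.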
However, JAXB is also supposed to convert an object to XML and back
losslessly, so an even better solution would be to apply a consistent kind of
escaping to the offending characters, in such a way that when the strings are
unmarshalled, the original string can be reconstructed.
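A reversible escaping of this kind could look roughly like the sketch below. It is only an illustration under assumed conventions: the backslash is chosen as the escape character, a literal backslash is doubled, and each illegal UTF-16 code unit is written as a backslash, the letter u, and four hex digits. The class name ReversibleXmlEscaper is hypothetical and nothing here is part of JAXB.

```java
final class ReversibleXmlEscaper {
    private ReversibleXmlEscaper() {}

    /** True if the UTF-16 code unit at index i may appear in XML 1.0 as-is. */
    static boolean isLegalUnit(String s, int i) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            return i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1));
        }
        if (Character.isLowSurrogate(c)) {
            return i > 0 && Character.isHighSurrogate(s.charAt(i - 1));
        }
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    /** Doubles backslashes and replaces illegal code units with a hex escape. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.append("\\\\");
            } else if (isLegalUnit(s, i)) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Inverse of escape; rejects input that escape could not have produced. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("Dangling escape at index " + i);
            }
            char next = s.charAt(i + 1);
            if (next == '\\') {
                out.append('\\');
                i += 1;
            } else if (next == 'u' && i + 6 <= s.length()) {
                // Four hex digits follow the marker; decode them to one code unit.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;
            } else {
                throw new IllegalArgumentException("Malformed escape at index " + i);
            }
        }
        return out.toString();
    }
}
```

Because every literal backslash is doubled on the way out, text that already happens to contain a backslash followed by u survives the round trip unchanged. The remaining ambiguity is the one the description mentions: XML produced by some other process is not escaped this way, so it would have to be distinguished by other means.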
It should be straightforward to come up with an escaping scheme that
guarantees lossless translation from Strings to XML and back (e.g., convert 0x1f
to "\u001f" or "JAXB_UNICODE_001f" or something unlikely to appear by
accident). I don't know that it's possible to guarantee that XML generated
through some other process won't ever be accidentally interpreted as containing
"escaped" strings, but it can be made very unlikely.
Below is the simplest unit test I could come up with that exposes the problem.
public void testBinary() throws JAXBException
{
    JAXBContext jxbc = JAXBContext.newInstance(OneString.class);
    OneString orig = new OneString();
    orig.setString("\u001f");
    ByteArrayOutputStream s = new ByteArrayOutputStream();
    Marshaller m = jxbc.createMarshaller();
    m.marshal(orig, s);
    String xml = s.toString();
    OneString result = (OneString) jxbc.createUnmarshaller().unmarshal(
        new ByteArrayInputStream(xml.getBytes()));
    assertEquals("\u001f", result.getString());
}

@XmlRootElement(name = "oneString")
private static class OneString {
    String string;

    public String getString()
    { return string; }

    public void setString(String s)
    { this.string = s; }
}
There are workarounds for this issue, e.g., at http://tinyurl.com/cq9u58; but as
it currently exists, this is a dangerous bug that can make data unreadable.
Comments
Comment by ranboii [ 18/Mar/09 ]
Actually, I meant to post this URL demonstrating a workaround:
http://blog.lesc.se/2009/03/escape-illegal-characters-with-jaxb-xml.html
which at least illustrates I'm not the only one to have seen this issue. Of
course, a fix would be much better than a workaround.
Comment by Pavel Bucek [ 02/Apr/09 ]
Partially fixed in trunk.
An IllegalArgumentException should be thrown when you try to marshal a string with
invalid XML content. But there is a catch: invalid characters can still occur when
UTF-32 is used (this happens because its characters are encoded to UTF-16, which is
Java's native encoding).
Anyway, it is still far from perfect and needs some additional work.
Adjusting priority and assigning to myself.
Comment by Pavel Bucek [ 02/Apr/09 ]
reassigning
Comment by mnsam [ 11/May/12 ]
As per the workaround, is this the list of unsupported characters?
"\u0000\u0001\u0002\u0003\u0004\u0005" +
"\u0006\u0007\u0008\u000B\u000C\u000E\u000F\u0010\u0011\u0012" +
"\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C" +
"\u001D\u001E\u001F\uFFFE\uFFFF"
Comment by aldaranalton [ 13/May/15 ]
What about an option to automatically remove all non-printable characters?
str.replaceAll("\\P{Print}", "");
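One caveat with this suggestion: according to the java.util.regex.Pattern documentation, \p{Print} is a POSIX class that covers US-ASCII only by default, so the negated form also removes tabs, line breaks and all non-ASCII text. A narrower alternative, sketched below, removes only what the XML 1.0 Char production forbids. The class name XmlIllegalCharStripper is hypothetical, and stripping is lossy by nature, so this only suits cases where dropping the characters is acceptable.

```java
import java.util.regex.Pattern;

final class XmlIllegalCharStripper {
    // Negated character class built from the XML 1.0 Char production.
    // Pattern matches by code point, so a valid surrogate pair is seen as
    // one supplementary character and kept, while an unpaired surrogate
    // falls outside every listed range and is removed.
    private static final Pattern ILLEGAL = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    private XmlIllegalCharStripper() {}

    /** Returns the input with every XML 1.0-illegal character removed. */
    static String strip(String s) {
        return ILLEGAL.matcher(s).replaceAll("");
    }
}
```

Applied to a string that mixes accented letters, a tab and a 0x1f control character, this keeps the letters and the tab and drops only the control character.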
Generated at Sun Mar 06 10:04:44 UTC 2016 using JIRA 6.2.3#6260sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.