[JAXB-614] JAXB generates illegal XML characters

Created: 18/Mar/09  Updated: 13/May/15

Status: Open
Project: jaxb
Component/s: runtime
Affects Version/s: 2.1
Fix Version/s: not determined

Type: Bug
Priority: Minor
Reporter: ranboii
Assignee: Martin Grebac
Resolution: Unresolved
Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Operating System: All; Platform: All; URL: http://forums.java.net/jive/thread.jspa?threadID=59068
Issuezilla Id: 614

Description

Some characters (such as 0x1f) that are legal in Java strings are illegal in XML (and XML does not provide a way to escape such characters to make them legal). When JAXB marshals objects that contain these illegal characters in strings, it currently includes those characters in the XML, thus generating invalid XML. Later, when it comes time to unmarshal the XML back into objects, an exception is thrown due to the illegal character. This could spell disaster for a system that, for example, writes objects as XML or Fast Infoset into a database and then cannot read them back out later.

JAXB should never be allowed to generate invalid XML. If an exception is going to be thrown, it should be thrown when generating the XML, not when trying to decode it. So a minimum requirement should be that JAXB throws an exception when attempting to generate invalid XML, or that it at least strips out the characters that would be invalid (or has a property on the marshaller that allows this behavior to be set).

However, JAXB is also supposed to convert an object to XML and back losslessly, so an even better solution would be to escape the offending characters consistently, in such a way that when the strings are unmarshalled, the original string can be reconstructed.
It should be straightforward to come up with an escaping scheme that guarantees lossless translation from Strings to XML and back (e.g., convert 0x1f to "\u001f" or "JAXB_UNICODE_001f" or something unlikely to appear by accident). I don't know that it's possible to guarantee that XML generated through some other process won't ever be accidentally interpreted as containing "escaped" strings, but it can be made very unlikely.

Below is the simplest unit test I could come up with that exposes the problem.

    public void testBinary() throws JAXBException {
        JAXBContext jxbc = JAXBContext.newInstance(OneString.class);
        OneString orig = new OneString();
        orig.setString("\u001f");
        ByteArrayOutputStream s = new ByteArrayOutputStream();
        Marshaller m = jxbc.createMarshaller();
        m.marshal(orig, s);
        String xml = s.toString();
        OneString result = (OneString) jxbc.createUnmarshaller().unmarshal(new ByteArrayInputStream(xml.getBytes()));
        assertEquals("\u001f", result.getString());
    }

    @XmlRootElement(name = "oneString")
    private static class OneString {
        String string;

        public String getString() {
            return string;
        }

        public void setString(String s) {
            this.string = s;
        }
    }

There are workarounds for this issue, e.g., at http://tinyurl.com/cq9u58; but as it currently exists, this is a dangerous bug that can make data unreadable.

Comments

Comment by ranboii [ 18/Mar/09 ]
Actually, I meant to post this URL demonstrating a workaround: http://blog.lesc.se/2009/03/escape-illegal-characters-with-jaxb-xml.html which at least illustrates I'm not the only one to have seen this issue. Of course, a fix would be much better than a workaround.

Comment by Pavel Bucek [ 02/Apr/09 ]
Partially fixed in trunk. An IllegalArgumentException should be thrown whenever you try to marshal a string with invalid XML content. But there is a catch: invalid characters can still occur when UTF-32 is used (this happens because its characters are encoded to UTF-16, which is Java's native encoding).
Anyway, it is still far from perfect and needs some additional work. Adjusting priority and assigning to myself.

Comment by Pavel Bucek [ 02/Apr/09 ]
reassigning

Comment by mnsam [ 11/May/12 ]
As per the workaround, is this the list of unsupported characters?

    "\u0000\u0001\u0002\u0003\u0004\u0005" +
    "\u0006\u0007\u0008\u000B\u000C\u000E\u000F\u0010\u0011\u0012" +
    "\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C" +
    "\u001D\u001E\u001F\uFFFE\uFFFF"

Comment by aldaranalton [ 13/May/15 ]
What about an option to automatically remove all non-printable characters?

    str.replaceAll("\\P{Print}", "");

Generated at Sun Mar 06 10:04:44 UTC 2016 using JIRA 6.2.3#6260sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.
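
For reference, the character list and the replaceAll suggestion in the comments above can be compared against the XML 1.0 "Char" production, which allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The list from 11/May/12 matches the forbidden characters in the Basic Multilingual Plane, apart from unpaired surrogates. Note that in Java, \p{Print} is an ASCII-only class unless UNICODE_CHARACTER_CLASS is set, so replaceAll("\\P{Print}", "") would also remove tabs, newlines and every non-ASCII character. The sketch below is one possible narrower filter, plus a reversible escape along the lines the description proposes. The class XmlCharFilter and its method names are hypothetical and not part of JAXB; it would have to be applied by the caller before marshalling and after unmarshalling.

```java
/** Hypothetical helper (not a JAXB API): handles characters that XML 1.0 forbids. */
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Lossy option: drops every code point that XML 1.0 does not allow. */
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Reversible option: doubles backslashes and writes illegal characters as backslash, 'u', four hex digits. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                // All illegal code points lie in the BMP (controls, lone
                // surrogates, U+FFFE, U+FFFF), so four hex digits are enough.
                out.append(String.format("\\u%04x", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Inverse of escape(); rejects input that escape() could not have produced. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (s.startsWith("\\\\", i)) {
                out.append('\\');
                i += 1; // skip the second backslash
            } else if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5; // skip 'u' and the four hex digits
            } else {
                throw new IllegalArgumentException("Malformed escape at index " + i);
            }
        }
        return out.toString();
    }
}
```

With this sketch, XmlCharFilter.strip("a\u001fb") yields "ab", while unescape(escape(s)) returns the original string, including strings that already contain a literal backslash followed by 'u'. The trade-off is that every consumer of the XML must know about the escaping convention, which is the interoperability concern the description raises.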