XML encoding declaration and endianness



I'm tidying up some really old Java code of mine, written against the first edition of the XML spec before XML parsing was included in the JDK, and I'm trying to bring it up to date and write some tests for it. In particular, I'm (re)implementing XML character encoding autodetection like this:



  1. I read the BOM, if any.

  2. If there is no BOM, I "impute" a BOM based upon the expected <?xml start of the XML declaration.

  3. I now have enough information (number of bytes per character, endianness, etc.) to read my way over to the encoding= declaration, if any, which according to the XML spec may name a more specific or esoteric encoding.
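
To make the three steps concrete, here is a minimal sketch of roughly what I mean. It is not my actual code: the class and method names (`EncodingSniffer`, `bomLength`, `sniff`, `declaredEncoding`) are made up for illustration, and it only handles the UTF-8 and UTF-16 cases. UTF-32 and EBCDIC are left out, so a UTF-32LE BOM (FF FE 00 00) would be misread as UTF-16LE here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class EncodingSniffer {
    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    private EncodingSniffer() {}

    /** Step 1: length of an actual BOM at the start of the data, or 0 if none. */
    static int bomLength(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return 3; // UTF-8 BOM
        if (startsWith(head, 0xFE, 0xFF)) return 2;       // UTF-16 big-endian BOM
        if (startsWith(head, 0xFF, 0xFE)) return 2;       // UTF-16 little-endian BOM
        return 0;
    }

    /** Steps 1 and 2: pick a charset from a real BOM, or impute one from "<?". */
    static Charset sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // No BOM: look at how the bytes of "<?" are laid out.
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return StandardCharsets.UTF_16LE;
        // Otherwise assume an ASCII-compatible encoding and read it as UTF-8 for now.
        return StandardCharsets.UTF_8;
    }

    /** Step 3: decode past any BOM and return the declared encoding name, or null. */
    static String declaredEncoding(byte[] head) {
        int bom = bomLength(head);
        String text = new String(head, bom, head.length - bom, sniff(head));
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Given the first hundred or so bytes of a file in `head`, `EncodingSniffer.sniff(head)` gives the charset from steps 1 and 2, and `EncodingSniffer.declaredEncoding(head)` gives whatever name the encoding= declaration carries. What to do when those two disagree is exactly the part I'm unsure about.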


So let's say that the file has an actual BOM for UTF-16LE. What should be the value of the XML encoding attribute? Should it be encoding="UTF-16LE"? But the Unicode Byte Order Mark FAQ seems to indicate that, if a UTF-16 family BOM is present, I should "tag the text" as merely UTF-16. Does that mean I should use encoding="UTF-16" in my XML file? But then should my parser ignore the encoding value and go with the more specific charset it has determined from the BOM? I'm starting to confuse myself.
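
Part of what makes this confusing is that the JDK's own charsets treat the two labels differently when decoding. As far as I can tell from the `java.nio.charset.Charset` documentation, `UTF-16` uses a leading BOM to pick the byte order and consumes it, while `UTF-16LE` treats a leading BOM as an ordinary U+FEFF character. A small sketch of that difference (the class and method names are only for illustration):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; decodes the same bytes under both labels.
final class Utf16BomDemo {
    // A little-endian BOM (FF FE) followed by "<?" in little-endian UTF-16.
    static final byte[] BYTES = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x3F, 0x00};

    private Utf16BomDemo() {}

    /** Decodes as "UTF-16": the BOM selects little-endian and is consumed. */
    static String asUtf16() {
        return new String(BYTES, StandardCharsets.UTF_16);
    }

    /** Decodes as "UTF-16LE": the BOM bytes come back as a leading U+FEFF. */
    static String asUtf16le() {
        return new String(BYTES, StandardCharsets.UTF_16LE);
    }
}
```

Here `Utf16BomDemo.asUtf16()` yields the two characters `<?`, whereas `Utf16BomDemo.asUtf16le()` yields three characters, starting with U+FEFF. So if my parser settles on UTF-16LE from the BOM, it has to skip the BOM itself before handing the bytes to a decoder.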


If I use a UTF-16LE BOM with an XML file, 1) what value should I use in the encoding attribute, and 2) what charset should my parser autodetect as the encoding of the file?

