Re: UTF-16, the BOM, and media types

At 00/03/22 13:52 -0800, Tim Bray wrote:

At 04:34 PM 3/22/00 -0500, John Cowan wrote:
>> Section 4.3.3 of XML 1.0 says
>>  "Entities encoded in UTF-16 must begin with the Byte Order Mark described
>>   by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK
>>   SPACE character, #xFEFF)."
>
>That describes entities encoded in the charset called "UTF-16".  It says
>nothing about entities encoded in the charsets "UTF-16BE" and "UTF-16LE"
>or for that matter charset "x-focs".

Yep, if you hold your head at just the right angle, and don't think of
the word "rhinoceros", you can convince yourself that the 16[BL]E
encodings are really different things entirely, just happen to share
a few characters with That Other Encoding's name, just close personal
friends, etc...


Well, we know they are closely related, but what do processors do?
No XML processor is supposed, or even allowed, to assume that e.g.
iso-8859-1 and iso-8859-15 are closely related, or that e.g.
iso-8859-1 and windows-1252 are even more closely related.
There is no way for a processor to figure that out. Trying to guess
at that level in a non-interactive environment is doomed to
fail. Trying to guess from prefixes of names is of course crazy.

So if something comes in with a label of UTF-16BE, then an XML
processor can either say 'sorry, don't know UTF-16BE', or it
can know it and interpret it accordingly. Every XML processor
has to understand UTF-16, but supporting UTF-16LE is not
required. If you don't like UTF-16LE for XML, just don't
support it.
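
To make the point concrete, here is a rough sketch in Python of a
processor that matches labels exactly and never by prefix. The names
KNOWN_LABELS, UnsupportedEncoding and decode_entity are made up for
illustration, and the set of supported labels is just an assumption;
this particular sketch happens to know UTF-16BE but not UTF-16LE.

```python
import codecs

# Hypothetical table for this sketch: label (lowercased) -> Python codec name.
# Lookup is by exact match only; "utf-16le" is deliberately absent.
KNOWN_LABELS = {
    "utf-8": "utf-8",
    "utf-16": "utf-16",
    "utf-16be": "utf-16-be",
}


class UnsupportedEncoding(Exception):
    """Raised when the label is not one this sketch knows."""


def decode_entity(data: bytes, label: str) -> str:
    key = label.strip().lower()
    if key not in KNOWN_LABELS:
        # No guessing from name prefixes: an unknown label is simply unknown.
        raise UnsupportedEncoding("sorry, don't know " + label)
    if key == "utf-16":
        # Section 4.3.3: entities encoded in UTF-16 must begin with the BOM.
        if data[:2] not in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
            raise ValueError("UTF-16 entity does not begin with a Byte Order Mark")
    # Python's "utf-16" codec reads the BOM and strips it; "utf-16-be" does not
    # look for one.
    return data.decode(KNOWN_LABELS[key])
```

With this, an entity labelled UTF-16BE is decoded as big-endian without
requiring a BOM, one labelled UTF-16 is rejected if the BOM is missing,
and one labelled UTF-16LE gets the 'sorry, don't know' answer.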

And please note the following erratum to the XML spec:
http://www.w3.org/XML/xml-19980210-errata#E44

New:
      00 3C ## ##,
      00 25 ## ##,
      00 20 ## ##,
      00 09 ## ##,
      00 0D ## ## or
      00 0A ## ##: Big-endian UTF-16 or ISO-10646-UCS-2. Note that, absent
                   an encoding declaration, these cases are strictly
                   speaking in error.
      3C 00 ## ##,
      25 00 ## ##,
      20 00 ## ##,
      09 00 ## ##,
      0D 00 ## ## or
      0A 00 ## ##: Little-endian UTF-16 or ISO-10646-UCS-2. Note that, absent
                   an encoding declaration, these cases are strictly
                   speaking in error.

Old:
      00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus,
                   strictly speaking, in error)
      3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus,
                   strictly speaking, in error)

The new text is quite a bit clearer. But if it's not clear enough,
then we'll have to make it even clearer.
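
For what it's worth, the new table can be sketched in a few lines of
Python. The function name sniff_utf16_family is made up, and the sketch
assumes that '##' means 'any byte', which the text quoted above does not
actually define. A real detector would also have to check for a BOM and
rule out four-byte encodings first; this only covers the twelve
patterns listed.

```python
from typing import Optional

# First characters named in the new erratum text: '<', '%', space, tab, CR, LF.
FIRST_CHAR_BYTES = {0x3C, 0x25, 0x20, 0x09, 0x0D, 0x0A}


def sniff_utf16_family(prefix: bytes) -> Optional[str]:
    """Classify the first four bytes of an entity that has no BOM.

    Returns "big-endian" or "little-endian" (UTF-16 or ISO-10646-UCS-2),
    or None if none of the listed patterns match.
    Assumption: '##' in the table stands for any byte value.
    """
    if len(prefix) < 4:
        return None
    if prefix[0] == 0x00 and prefix[1] in FIRST_CHAR_BYTES:
        return "big-endian"
    if prefix[1] == 0x00 and prefix[0] in FIRST_CHAR_BYTES:
        return "little-endian"
    return None
```

Note that a match only tells the processor how to read the encoding
declaration; absent such a declaration, these cases are still strictly
speaking in error.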


Regards,   Martin.