Re: UTF-16, the BOM, and media types

I'm surprised this comes up again.

At 00/03/22 13:09 -0800, Tim Bray wrote:

> At 02:57 PM 3/22/00 -0500, John Cowan wrote:
>>> UTF-16le and UTF-16be cannot be used for XML.  XML mandates
>>> the BOM for utf-16.  Meanwhile, utf-16le and utf-16be cannot
>>> have the BOM.  More about this, see RFC 2781.
>>
>>I do not understand this from the text of XML 1.0.  Clause 4.3.3 only says
>>that if there is no encoding declaration, then either:
>
> Section 4.3.3 of XML 1.0 says
>  "Entities encoded in UTF-16 must begin with the Byte Order Mark described
>   by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK
>   SPACE character, #xFEFF)."
>
> Thus in my view the RFC is correct,


Sorry, which RFC? If you mean RFC 2781, then I just checked
and I didn't find the string 'XML' there.

And it's not supposed to show up. RFCs about character encodings
are not supposed to say things about various document formats,
and document formats, for the most part, are not supposed to
say anything about specific character encodings.

> and thus 16BE and 16LE are not useful
> for XML.  It is good practice, whenever you store anything in UTF-16, to
> put a BOM in, and XML makes that good practice compulsory, which is pretty
> painless since it seems that virtually all software that writes UTF-16 does
> so anyhow. The cost of a BOM is zilch.  The benefit in data survival in the
> face of stupid byte order tricks (yes, they still happen), is immense.
>
> Martin Duerst, a smart guy whom I respect, invested several hours in
> trying to convince me that the 16[BL]E variants with forbidden-BOM had
> some real-world justification,


Well, that was mainly because you insisted that they needed to be
forbidden unless there was some real-world justification. But that's
not the real issue. The real issue is that each spec sticks to its
own business.

XML does that most of the time, and requiring a BOM for UTF-16
in the XML spec makes sense in the context of requiring all
XML processors to accept UTF-8 and UTF-16 even without any
encoding information. Otherwise, it wouldn't always be possible
to distinguish them. Apart from that, it does not make sense
to use the XML spec to try to legislate on any character encodings.
It would be very surprising if e.g. the XML spec said that
EUC-JP is okay, but Shift_JIS is not okay, or Shift_JIS is
okay, except for half-width kana, and so on.
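
To illustrate why the BOM matters when there is no external encoding
information, here is a minimal sketch in Python of BOM-only detection.
The names sniff_bom and decode_entity are made up for this example, and
the fallback to UTF-8 when no BOM is found is an assumption of the
sketch; it ignores the encoding declaration, UTF-32, and the rest of
the autodetection hints in the XML spec.

```python
import codecs
from typing import Tuple


def sniff_bom(data: bytes) -> Tuple[str, int]:
    """Return (encoding, BOM length) based only on the leading bytes.

    Assumption for this sketch: anything that does not start with a
    UTF-16 BOM is treated as UTF-8.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return ("utf-16-be", 2)
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return ("utf-16-le", 2)
    return ("utf-8", 0)


def decode_entity(data: bytes) -> str:
    """Decode an entity, dropping the BOM if one was found."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)


# The same text, once as UTF-8 and once as UTF-16 with a BOM.
text = "<doc/>"
as_utf8 = text.encode("utf-8")
as_utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

for raw in (as_utf8, as_utf16):
    print(sniff_bom(raw), decode_entity(raw))
```

Without the BOM, the UTF-16 bytes would fall through to the UTF-8
branch here, which is the ambiguity the requirement is there to avoid.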

What you think is suitable for XML is another matter, which can
go into tutorials, books, and so on. Some people would claim
that only UTF-8 is suitable, others would claim whatever they
want. Some get it right, and others get it wrong. It's not
for the spec to judge.


Regards,   Martin.

P.S.: If you wonder where UTF-16BE/LE could be of use in the context
      of XML, there was recently a discussion in the XML Signature
      WG about the use of XPath. The BOM confused a lot of people
      on that group.