Re: BOMs

2013-11-19 04:14:10
----- Original Message -----
From: "Martin J. Dürst" <duerst(_at_)it(_dot_)aoyama(_dot_)ac(_dot_)jp>
To: "Henry S. Thompson" <ht(_at_)inf(_dot_)ed(_dot_)ac(_dot_)uk>
Cc: "John Cowan" <cowan(_at_)mercury(_dot_)ccil(_dot_)org>;
    "IETF Discussion" <ietf(_at_)ietf(_dot_)org>;
    "Pete Cordell" <petejson(_at_)codalogic(_dot_)com>;
    "JSON WG" <json(_at_)ietf(_dot_)org>;
    "Anne van Kesteren" <annevk(_at_)annevk(_dot_)nl>;
    <www-tag(_at_)w3(_dot_)org>;
    "es-discuss" <es-discuss(_at_)mozilla(_dot_)org>
Sent: Monday, November 18, 2013 11:26 AM

On 2013/11/18 20:11, Henry S. Thompson wrote:
Pete Cordell writes:

Given the history below, would it be sensible to accept BOMs for UTF-8
encoding, but not for UTF-16 and UTF-32?  In other words, are BOMs needed
and/or used in the wild for UTF-16 and UTF-32?

Maybe the text can say something like "SHOULD accept BOMs for UTF-8,
and MAY accept BOMs for UTF-16 and / or UTF-32"?

My sense is that you'll see more UTF-16 BOMs than anything else.

Yes indeed. BOM means Byte Order Mark. It's crucial for over-the-wire
UTF-16. (It's irrelevant for in-memory UTF-16, but that's not what we
are discussing.) To bring up the XML example again, XML actually
strictly requires a BOM for UTF-16. The IETF definition of UTF-16 does
not require a BOM for UTF-16. See http://tools.ietf.org/html/rfc2781, in
particular http://tools.ietf.org/html/rfc2781#section-3.2,
http://tools.ietf.org/html/rfc2781#section-3.3, and
http://tools.ietf.org/html/rfc2781#section-4.

For UTF-8, the BOM is not a Byte Order Mark, because such a mark isn't
necessary at all. It may serve as a signature, but is not necessary, and
in some circumstances counterproductive.
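
To make the distinction concrete, here is a minimal sketch of how a
receiver might sniff a leading BOM before decoding. The function names
sniff_bom and decode_with_bom are hypothetical, and falling back to UTF-8
when no BOM is found is an assumption, not something this thread has
agreed on. Note that the bytes FF FE 00 00 are ambiguous (UTF-32LE BOM,
or UTF-16LE BOM followed by U+0000); the sketch simply prefers the longer
match.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so the order of this table matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode accordingly.

    Assumption: input without a BOM is decoded with `default`.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

For UTF-16 the two leading bytes really do select the byte order; for
UTF-8 the three bytes EF BB BF only act as a signature and are dropped.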

Martin

We had a similar discussion with syslog back in 2005. The issue was that
UTF-8 was new and different, and how to tell whether it was being used or
not. What made it into RFC5424 was
"  If a syslog application encodes MSG in UTF-8, the string MUST start
   with the Unicode byte order mask (BOM), which for UTF-8 is ABNF
   %xEF.BB.BF.  "
which remains a MUST to this day.  There are no relevant Errata.
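
As an illustration of that rule only, a rough sketch of a sender prefixing
the three bytes and a receiver testing for them might look like the
following. The names encode_msg_utf8 and split_msg are hypothetical, and
the sketch handles just the MSG payload, not the rest of a syslog message.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # ABNF %xEF.BB.BF from the RFC5424 text above


def encode_msg_utf8(text: str) -> bytes:
    """Encode MSG as UTF-8, prefixed with the BOM as the quoted rule requires."""
    return UTF8_BOM + text.encode("utf-8")


def split_msg(msg: bytes):
    """Return (declared_utf8, payload) for a received MSG.

    declared_utf8 is True only when the payload started with the BOM;
    otherwise the bytes are returned untouched and the encoding is unknown.
    """
    if msg.startswith(UTF8_BOM):
        return True, msg[len(UTF8_BOM):]
    return False, msg
```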

Tom Petch

As for what to say about whether to accept BOMs or not, I'd really want
to know what the various existing parsers do. If they accept BOMs, then
we can say they should accept BOMs. If they don't accept BOMs, then we
should say that they don't.

Regards,   Martin.

UTF-32 support seems to be waning (at least in the browsers), but
UTF-16 is in pretty widespread use.  John, do you think you can fool
google into counting BOMs for us?




