Re: BOMs

2013-11-19 05:10:35
On 2013/11/19 19:10, t.p. wrote:
----- Original Message -----
From: "Martin J. Dürst"<duerst(_at_)it(_dot_)aoyama(_dot_)ac(_dot_)jp>

For UTF-8, the BOM is not a Byte Order Mark, because such a mark isn't
necessary at all. It may serve as a signature, but is not necessary,
and in some circumstances counterproductive.

Martin

We had a similar discussion with syslog back in 2005, the issue being
that UTF-8 was new and different and how to tell whether it was being
used or not, and what made it into RFC5424 was
"  If a syslog application encodes MSG in UTF-8, the string MUST start
    with the Unicode byte order mask (BOM), which for UTF-8 is ABNF
    %xEF.BB.BF.  "
which remains a MUST to this day.  There are no relevant Errata.

Tom Petch

This is something that seems to have made quite a lot of sense for syslog. I can understand that before 2005, syslog was used with legacy encodings (iso-8859-1, Shift_JIS and similar), and that there was otherwise no easy way to label the UTF-8 strings.

But another solution (for syslog, that is) would also have been possible. As John already pointed out, UTF-8 is very easy to detect heuristically: If a byte sequence follows the UTF-8 byte pattern, it's most definitely UTF-8 and not something else. For more background, please see http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf, where that idea came up first.
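To make the heuristic concrete, here is a minimal Python sketch, not anything taken from RFC 5424 or from a real syslog implementation. The function names looks_like_utf8 and classify_msg are made up for illustration. It assumes that "follows the UTF-8 byte pattern" can be approximated by a strict UTF-8 decode succeeding. Note that pure ASCII is valid in UTF-8 and in most legacy encodings alike, so the check only tells us something when non-ASCII bytes are present.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the %xEF.BB.BF signature quoted from RFC 5424


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def classify_msg(data: bytes) -> str:
    """Hypothetical helper: label a message body by signature or heuristic."""
    if data.startswith(UTF8_BOM):
        return "utf-8 (BOM)"
    if all(b < 0x80 for b in data):
        # Pure ASCII carries no evidence for or against UTF-8.
        return "ascii"
    if looks_like_utf8(data):
        return "utf-8 (heuristic)"
    return "legacy or unknown"
```

For example, classify_msg("Dürst".encode("utf-8")) takes the heuristic branch, while the same name encoded as iso-8859-1 fails the strict decode and is reported as legacy or unknown.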

As for JSON, it doesn't have the problem of legacy encodings. JSON by definition is encoded in a Unicode encoding form, and it's easy to distinguish the encoding forms from each other because of the restrictions on character sequences in JSON. And this can be done without a BOM (or with a BOM).
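As a rough illustration of how the distinction can work, the sketch below follows the pattern-of-zero-octets approach described in RFC 4627, section 3: because a JSON text there starts with two ASCII characters, the positions of the zero octets among the first four octets identify the encoding form. The function name detect_json_encoding is made up, and the BOM handling is my own addition to show the "with a BOM" case, not something RFC 4627 specifies. Inputs shorter than four octets simply fall through to UTF-8.

```python
import codecs

# UTF-32 BOMs are tested before UTF-16 ones, because the UTF-32LE BOM
# (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Which of the first four octets are zero, per RFC 4627, section 3.
_NULL_PATTERNS = {
    (True, True, True, False): "utf-32-be",    # 00 00 00 xx
    (False, True, True, True): "utf-32-le",    # xx 00 00 00
    (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
    (False, True, False, True): "utf-16-le",   # xx 00 xx 00
}


def detect_json_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the Unicode encoding form of a JSON text."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:4]
    if len(head) == 4:
        pattern = tuple(b == 0 for b in head)
        if pattern in _NULL_PATTERNS:
            return _NULL_PATTERNS[pattern]
    return "utf-8"  # xx xx xx xx, or too short to tell
```

Whether a receiver actually accepts a leading BOM is a separate question; this only shows that the encoding form can be identified either way.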

What's most important now is to know what receivers actually accept. We are not in a design phase; we are just updating the definition of JSON and making sure we fix problems where there are problems. For that, we have to use the installed base as the main guidance, not other protocols or formats.

Regards,   Martin.
