perl-unicode

Re: UTF-8 and the BOM

2015-09-01 08:10:39
On 2015/08/31 03:12, Rune Henssel wrote:
I have a very short question: Why is it that, when decoding text,
Encode::Unicode removes all other BOMs except the UTF-8?

I am sure that someone has a good explanation as to why Encode::Unicode
behaves this way, please enlighten me.

Regarding the BOM on UTF-8, there are two differing opinions.

Simplified, one opinion is that the BOM is very helpful on UTF-8 to distinguish it from other Unicode encoding forms (UTF-16, ...) and from legacy encodings (ISO-8859-1, ...). On this view the BOM is purely an encoding signature, so it should be removed when decoding: once the text has been decoded, there is no longer any need to indicate the encoding.

The other opinion is that the BOM on UTF-8 is harmful, because it creates problems in all kinds of situations where a program can run fine by only looking at 7-bit bytes (ASCII characters) and just passing through 8-bit bytes. On this view UTF-8 data never carries an encoding signature, so if U+FEFF ever appears at the start, it is not a signature but a ZERO WIDTH NO-BREAK SPACE (the official name of U+FEFF) and has to be kept as part of the data.

I have personally been a strong proponent of the second opinion. UTF-8 is easy to detect even without a leading BOM (see also http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf).
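To illustrate how simple BOM-less detection can be, here is a minimal sketch that just asks Encode whether the bytes are well-formed UTF-8. The helper name looks_like_utf8 is made up for this example and is not part of Encode. Note that pure ASCII input also passes, because ASCII is a subset of UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of Encode): returns 1 if $bytes is
# well-formed UTF-8, 0 otherwise. No BOM is required for this to work.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on the first malformed sequence;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

print looks_like_utf8("caf\xC3\xA9 au lait"), "\n";   # UTF-8 bytes for "café au lait"
print looks_like_utf8("caf\xE9 au lait"),     "\n";   # ISO-8859-1 bytes for the same text
```

The second call fails because in ISO-8859-1 the byte E9 is followed by a space, which is not a valid UTF-8 continuation byte. Legacy-encoded text with non-ASCII characters rarely happens to form valid UTF-8, which is why this kind of check tends to work well in practice.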

The second opinion also fits Perl's use (Unix/Linux, pipes, ...) quite well. However, starting with Microsoft Windows' Notepad, the number of software components that attach a BOM to UTF-8 files (optionally or, as in the case of Notepad, always), or that accept UTF-8 files with or without a leading BOM, has steadily increased. In other words, opinion one has essentially won "on the ground".

This has been acknowledged by the standard in that a new character, U+2060, WORD JOINER, was introduced to essentially replace the "ZERO WIDTH NO-BREAK SPACE" role of U+FEFF. [The BOM got merged with that role only because the relevant ISO committee didn't think that a BOM was a character, and therefore didn't want to encode it in a character standard.]

Nevertheless, this doesn't mean that all UTF-8 files should start with a BOM. As long as you can avoid it, just leave the BOM out. This applies even more to fields in data structures that are in UTF-8; the BOM is absolutely inappropriate there.

Coming back to the original question: I guess it's a result of Encode::Unicode dating from way back (when the influence of opinion one was still very low), and of Perl being oriented towards streaming processing (Unix pipes,...).
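For reference, the behaviour being asked about can be reproduced with a few lines. This is only a small demonstration using Encode's decode(); the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-16 with a big-endian BOM (FE FF) followed by "a":
# the BOM selects the byte order and is then dropped.
my $from_utf16 = decode('UTF-16', "\xFE\xFF\x00\x61");

# UTF-8 with a BOM (EF BB BF) followed by "a":
# the three BOM bytes decode to U+FEFF and stay in the string.
my $from_utf8 = decode('UTF-8', "\xEF\xBB\xBF\x61");

printf "UTF-16: %d character(s)\n", length($from_utf16);
printf "UTF-8:  %d character(s), first is U+%04X\n",
    length($from_utf8), ord($from_utf8);
```

So the UTF-16 result is just "a", while the UTF-8 result is U+FEFF followed by "a".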

For the future, I guess in true Perl fashion it might make sense to make this configurable (remove a leading BOM; accept a leading BOM as part of the data; reject input with a leading BOM). This actually applies not only to UTF-8 but also to UTF-16 and other Unicode encoding forms, because data in a data structure that's UTF-16(BE|LE) by definition doesn't need a BOM either.
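Until something like that exists, the three behaviours are easy to get with a thin wrapper. The following is only a sketch of the idea for UTF-8; the function name decode_utf8_with_bom_policy and the policy names 'strip', 'keep' and 'reject' are invented here and are not an Encode interface.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical wrapper (not part of Encode): decode UTF-8 bytes and treat
# a leading U+FEFF according to $policy: 'strip', 'keep' or 'reject'.
sub decode_utf8_with_bom_policy {
    my ($bytes, $policy) = @_;
    die "unknown BOM policy '$policy'\n"
        unless $policy =~ /\A(?:strip|keep|reject)\z/;

    # Die on malformed input rather than silently substituting U+FFFD.
    my $text = decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);

    if ($text =~ /\A\x{FEFF}/) {
        die "input starts with a BOM\n" if $policy eq 'reject';
        substr($text, 0, 1, '')         if $policy eq 'strip';
        # 'keep' leaves U+FEFF in place as part of the data
    }
    return $text;
}

my $bytes = "\xEF\xBB\xBFhello";
print length(decode_utf8_with_bom_policy($bytes, 'strip')), "\n";   # BOM removed
print length(decode_utf8_with_bom_policy($bytes, 'keep')),  "\n";   # BOM kept as U+FEFF
```

The same approach would work for UTF-16BE/LE by changing the encoding name, since those forms also pass a leading U+FEFF through as an ordinary character.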

Regards,   Martin.





Yours
Rune Henssel
.


  • Re: UTF-8 and the BOM, Martin J. Dürst <=