Re: UTF-8 and the BOM

On 2015/08/31 03:12, Rune Henssel wrote:

I have a very short question: Why is it that, when decoding text,
Encode::Unicode removes all other BOMs except the UTF-8?

I am sure that someone has a good explanation as to why Encode::Unicode
behaves this way, please enlighten me.


Regarding the BOM on UTF-8, there are two differing opinions.

Simplified, one opinion is that the BOM is very helpful on UTF-8 to distinguish it from other Unicode encoding forms (UTF-16,...) and from legacy encodings (iso-8859-1,...). As such, the BOM should be removed when decoding, because there is no need anymore to indicate the encoding.

The other opinion is that the BOM on UTF-8 is harmful, because it creates problems in all kinds of situations where a program can run fine by only looking at 7-bit bytes (ASCII characters) and just pass through 8-bit bytes. Therefore, if there's ever a BOM, it won't be an encoding signature, but a ZERO WIDTH NO-BREAK SPACE (the official name of U+FEFF).
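To make the difference between the two opinions concrete, here is a small sketch. It is in Python rather than Perl, and it only illustrates the two behaviours; it says nothing about what Encode::Unicode itself does. Python's "utf-8" codec follows the second opinion (a leading U+FEFF is kept as data), while its "utf-8-sig" codec follows the first (a leading U+FEFF is treated as a signature and dropped). The sample string is made up.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF, i.e. U+FEFF encoded in UTF-8.
data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# Opinion two: U+FEFF is an ordinary character and stays in the text.
kept = data.decode("utf-8")

# Opinion one: a leading U+FEFF is an encoding signature and is removed.
stripped = data.decode("utf-8-sig")

print(repr(kept))      # '\ufeffhéllo'
print(repr(stripped))  # 'héllo'

# For comparison, the generic "utf-16" codec uses the BOM to pick the
# byte order and then removes it, whereas "utf-16-le" keeps it as data.
utf16 = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(repr(utf16.decode("utf-16")))     # 'hi'
print(repr(utf16.decode("utf-16-le")))  # '\ufeffhi'
```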

I have personally been a strong proponent of the second opinion. UTF-8 is easy to detect even without a leading BOM (see also http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf).
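A rough way to see why detection works without a BOM: text in a legacy 8-bit encoding almost never happens to be well-formed UTF-8, so strict validation is already a strong signal. The sketch below is in Python, the function name looks_like_utf8 is made up for this example, and it is only a simple heuristic, not necessarily the method described in the paper linked above.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper).

    Pure ASCII also passes, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("café".encode("utf-8")))    # True
print(looks_like_utf8("café".encode("latin-1")))  # False: a lone 0xE9 byte is not valid UTF-8
print(looks_like_utf8(b"plain ascii"))            # True
```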

The second opinion also fits Perl's use (Unix/Linux, pipes,...) quite well. However, starting with Microsoft Windows' notepad, the number of software components that (optionally, or as in the case of notepad, always) attach a BOM to UTF-8 files or that accept files in UTF-8 with or without a leading BOM has steadily increased. In other words, opinion one has essentially won "on the ground".

This has been acknowledged by the standard in that a new character, U+2060, WORD JOINER, was introduced to essentially replace the "ZERO WIDTH NO-BREAK SPACE" role of U+FEFF. [The BOM got merged with that role only because the relevant ISO committee didn't think that a BOM was a character, and therefore didn't want to encode it in a character standard.]

Nevertheless, this doesn't mean that all UTF-8 files should start with a BOM. As long as you can avoid it, just leave the BOM out. This applies even more to fields in data structures that are in UTF-8; the BOM is absolutely inappropriate there.

Coming back to the original question: I guess it's a result of Encode::Unicode dating from way back (when the influence of opinion one was still very low), and of Perl being oriented towards streaming processing (Unix pipes,...).

For the future, I guess in true Perl fashion it might make sense to make this configurable (remove a leading BOM; accept a leading BOM as part of the data; reject input with a leading BOM). This actually applies not only to UTF-8 but also to UTF-16 and other Unicode encoding forms, because data in a data structure that's UTF-16(BE|LE) by definition doesn't need a BOM either.
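As a sketch of what such a configurable interface could look like, here are the three choices in Python. This is not an existing option of Encode::Unicode or of any other library; the function decode_utf8 and its bom parameter with the values "strip", "keep" and "reject" are invented for this illustration.

```python
import codecs


def decode_utf8(data: bytes, bom: str = "strip") -> str:
    """Decode UTF-8 with an explicit policy for a leading BOM (hypothetical API).

    bom="strip":  remove a leading BOM if there is one
    bom="keep":   accept a leading BOM as part of the data (U+FEFF)
    bom="reject": raise ValueError if the input starts with a BOM
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    if bom == "strip":
        if has_bom:
            data = data[len(codecs.BOM_UTF8):]
    elif bom == "reject":
        if has_bom:
            raise ValueError("input starts with a UTF-8 BOM")
    elif bom != "keep":
        raise ValueError(f"unknown BOM policy: {bom!r}")
    return data.decode("utf-8")


sample = codecs.BOM_UTF8 + b"abc"
print(repr(decode_utf8(sample)))              # 'abc'
print(repr(decode_utf8(sample, bom="keep")))  # '\ufeffabc'
```

With bom="reject" the same sample raises ValueError, which is the behaviour one would want for fields in a data structure, where a BOM should never appear.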


Regards,   Martin.



Yours
Rune Henssel