On 2015/08/31 03:12, Rune Henssel wrote:
I have a very short question: Why is it that, when decoding text,
Encode::Unicode removes every BOM except the UTF-8 one?
I am sure that someone has a good explanation as to why Encode::Unicode
behaves this way; please enlighten me.
Regarding the BOM on UTF-8, there are two differing opinions.
Simplified, one opinion is that the BOM is very helpful on UTF-8 to
distinguish it from other Unicode encoding forms (UTF-16,...) and from
legacy encodings (iso-8859-1,...). As such, the BOM should be removed
when decoding, because there is no need anymore to indicate the encoding.
The other opinion is that the BOM on UTF-8 is harmful, because it
creates problems in all kinds of situations where a program can run fine
by only looking at 7-bit bytes (ASCII characters) and just passing
through 8-bit bytes. Under this view, a leading U+FEFF is never an
encoding signature but a ZERO WIDTH NO-BREAK SPACE (the official name of
U+FEFF), and so it is kept as part of the data when decoding.
I have personally been a strong proponent of the second opinion. UTF-8
is easy to detect even without a leading BOM (see also
http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf).
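As a rough illustration of why no BOM is needed for detection, here is a
minimal sketch in Python (not Perl, and not taken from the paper above).
It assumes that "detecting UTF-8" means checking that the bytes are
well-formed UTF-8; looks_like_utf8 is a hypothetical helper name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        # Strict decoding fails on any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_sample = "caf\u00e9".encode("utf-8")         # no BOM
latin1_sample = "caf\u00e9".encode("iso-8859-1")  # same text, legacy encoding

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(latin1_sample))  # False
```

Pure ASCII input also passes this check, which is harmless because ASCII
is a subset of UTF-8. Very short legacy-encoded inputs can occasionally
pass by accident, so this is a heuristic rather than a proof.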
The second opinion also fits Perl's use (Unix/Linux, pipes,...) quite
well. However, starting with Microsoft Windows Notepad, the number of
software components that attach a BOM to UTF-8 files (optionally or, as
in the case of Notepad, always), or that accept UTF-8 files with or
without a leading BOM, has steadily increased. In other words, opinion
one has essentially won "on the ground".
This has been acknowledged by the standard in that a new character,
U+2060, WORD JOINER, was introduced to essentially replace the "ZERO
WIDTH NO-BREAK SPACE" role of U+FEFF. [The BOM got merged with that role
only because the relevant ISO committee didn't think that a BOM was a
character, and therefore didn't want to encode it in a character standard.]
Nevertheless, this doesn't mean that all UTF-8 files should start with a
BOM. As long as you can avoid it, just leave the BOM out. This applies
even more to fields in data structures that are in UTF-8; the BOM is
absolutely inappropriate there.
Coming back to the original question: I guess it's a result of
Encode::Unicode dating from way back (when the influence of opinion one
was still very low), and of Perl being oriented towards streaming
processing (Unix pipes,...).
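The same asymmetry is not unique to Perl. As an analogy only (this is
Python's standard codec behaviour, not Encode::Unicode), the plain
"utf-8" codec keeps a leading U+FEFF as data, the generic "utf-16" codec
consumes the BOM to pick the byte order, and stripping a UTF-8 BOM has to
be requested explicitly via "utf-8-sig":

```python
utf8_bytes = b"\xef\xbb\xbfabc"            # UTF-8 BOM followed by "abc"
utf16_bytes = b"\xfe\xff\x00a\x00b\x00c"   # big-endian BOM followed by "abc"

kept = utf8_bytes.decode("utf-8")        # BOM survives as U+FEFF
stripped = utf16_bytes.decode("utf-16")  # BOM is consumed by the decoder
sig = utf8_bytes.decode("utf-8-sig")     # opt-in variant that strips the BOM

print(kept[0] == "\ufeff")  # True
print(stripped)             # abc
print(sig)                  # abc
```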
For the future, I guess in true Perl fashion it might make sense to make
this configurable (remove a leading BOM; accept a leading BOM as part of
the data; reject input with a leading BOM). This actually applies not
only to UTF-8 but also to UTF-16 and other Unicode encoding forms,
because data in a data structure that's UTF-16(BE|LE) by definition
doesn't need a BOM either.
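To make the three options concrete, here is a minimal sketch, again in
Python for illustration. The function decode_utf8 and its bom parameter
with the values "strip", "keep" and "reject" are hypothetical; nothing
like this is claimed to exist in Encode today.

```python
import codecs


def decode_utf8(data: bytes, bom: str = "keep") -> str:
    """Decode UTF-8 with an explicit policy for a leading BOM.

    bom is a hypothetical option:
      "strip"  - remove a leading BOM if present
      "keep"   - accept a leading BOM as part of the data (U+FEFF)
      "reject" - raise an error if the input starts with a BOM
    """
    if bom not in ("strip", "keep", "reject"):
        raise ValueError(f"unknown BOM policy: {bom!r}")
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom and bom == "reject":
        raise ValueError("input starts with a UTF-8 BOM")
    if has_bom and bom == "strip":
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_bom = codecs.BOM_UTF8 + b"abc"
print(decode_utf8(with_bom, bom="strip"))       # abc
print(len(decode_utf8(with_bom, bom="keep")))   # 4
```

The same shape would carry over to UTF-16BE/LE by swapping in the
corresponding BOM bytes and codec name.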
Regards, Martin.
Yours
Rune Henssel