Re: byte order mark

Gurusamy Sarathy writes:
: John Dlugosz asked me if Perl will in future support a
: utf8 byte-order-mark in source files.

BOMs are an abomination.

That being said, we will in any event have a generalized text open
function that can recognize various encodings and translate on the
fly to utf8, so I don't see why we can't support Unicode even in
its abominable forms.

: I'd expect that
: the shebang line will be rendered useless on most Unix
: systems if someone did this to a script.  Anyone have
: any ideas?

Certainly it's not our job to fix the kernel if the kernel doesn't
recognize byte-order-marks, but we can at least make Perl ignore them,
and possibly switch to utf8 mode automatically.
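
A minimal sketch of the "ignore it and flip the switch" idea, written in
Python rather than Perl purely for illustration. The name strip_utf8_bom
and the returned flag are hypothetical; this is not how perl itself does
it, only one way the check could look:

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

def strip_utf8_bom(source: bytes):
    """Return (source minus any leading UTF-8 BOM, whether a BOM was seen).

    The boolean stands in for "switch to utf8 mode automatically"; what a
    real interpreter does with it is left open here.
    """
    if source.startswith(UTF8_BOM):
        return source[len(UTF8_BOM):], True
    return source, False

# A script saved with a BOM in front of the shebang line.
script = UTF8_BOM + b"#!/usr/bin/perl\nprint qq(hi\\n);\n"
body, utf8_mode = strip_utf8_bom(script)
```

After the call, body starts with "#!" again, so the interpreter's own
shebang handling can proceed as usual. The kernel still sees the raw file,
so this does nothing for exec() on the script itself.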

: John also said:
: >I can't get to newsgroups here at work, but here is a list of file
: >signatures for text files:
: >
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | File begins with     | Coding System                 |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | fe ff                | big-endian Unicode (UTF-16)   |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | ff fe                | little-endian Unicode         |
: >      |                      | (UTF-16)                      |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | 00 00 fe ff          | big-endian UCS-4              |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | ff fe 00 00          | little-endian UCS-4           |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | 0e fe ff             | UTR-6 (compressed Unicode)    |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >      |                      |                               |
: >      | ef bb bf             | UTF-8                         |
: >      |                      |                               |
: >      |----------------------+-------------------------------|
: >
: >
: >
: >(let me know if that came through unmolested)
: >
: >I would expect that any tool that accepts "text files" as input would skip
: >a U+FEFF character as whitespace or comment or whatnot -- throw it out
: >early during lexing.

That's easy enough.

: >So if I use -MUTF8 on the perl command line, it should happily put up with
: >them, no matter where they appear in the text (specifically, the first char
: >in each module).  That should be a conformance requirement.
: >
: >Having the main script's parser automatically recognise the ef bb bf bytes
: >and implicitly turn on UTF-8 mode would be my "wish".

I haven't thought about that one enough to know if I like it.  It's one
thing to use 0xfeff as a byte-order mark, but to start using it as a
magic number indicating utf8 seems a bit more bogus than even the
original abomination.  Nevertheless, it seems the best thing to do, given
the existence of a mark there.

: Part II of the wish
: >would be that if another pattern from the above table is seen, to report a
: >specific error ("unsupported source code format") rather than treating it
: >as if the signature was not present (e.g. read in "legacy" 8-bit character
: >mode).

No, we can do better than that.  We'd swap in a translator and the
lexer would never see anything but utf8.
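
Roughly, and only as a sketch in Python rather than Perl: sniff the
signature from the table above, pick a decoder, and hand the lexer utf8.
The names SIGNATURES and to_utf8 are made up for the example. The UTR-6
row is left out because Python's standard library has no codec for it,
and passing unsigned input through untouched is an assumption, not a
decision:

```python
# Longest signatures first: ff fe 00 00 (little-endian UCS-4) also begins
# with ff fe, so it has to be tested before little-endian UTF-16.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def to_utf8(raw: bytes) -> bytes:
    """Translate signed input to UTF-8 with the signature removed."""
    for signature, encoding in SIGNATURES:
        if raw.startswith(signature):
            text = raw[len(signature):].decode(encoding)
            return text.encode("utf-8")
    # No recognized signature: assume the bytes are already acceptable.
    return raw
```

With this, to_utf8(b"\xfe\xff" + "print 1;\n".encode("utf-16-be")) yields
the plain bytes b"print 1;\n", and the lexer never learns the file was
anything else.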

: >Also, does Perl consider a "line" of source to be delimited by U+2028 ?

Not at the moment.  You suppose the paragraph separator should make a
new line too, since you use it instead of a line separator?

: >That's the standard unambiguous line-break character, and that's what
: >Unipad saves.  OTOH, in UTF-8 you could argue that the traditional 0x0a is
: >just fine, being shorter (one reason for using UTF-8 instead of UTF-16) and
: >more intercompatible with ASCII (the other reason for using UTF-8).  So
: >what does Perl count for line numbers, and more specifically, how do you
: >specify which?  For <FH> it's clear (but what's the default), but what
: >about source files?

The situation is a little more complicated than that.  We already have
to be able to handle any of CR, LF, or CRLF as "\n".  If we have to add
a few more sequences to the list for Unicode, ah well.
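
As an illustration only (Python, not Perl, with hypothetical names), the
list of sequences might be handled like so. Whether U+2029 belongs on the
list is exactly the open question above; it is included here as an
assumption:

```python
import re

# CRLF comes first in the alternation so it counts as one break, not two.
# U+2028 is LINE SEPARATOR; U+2029 is PARAGRAPH SEPARATOR.
LINE_BREAK = re.compile("\r\n|\r|\n|\u2028|\u2029")

def split_lines(text: str) -> list:
    """Split decoded source text on any of the recognized line breaks."""
    lines = LINE_BREAK.split(text)
    # A trailing break terminates the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Line numbers for diagnostics would then just be positions in the result
of split_lines, whichever terminator the editor happened to save.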

Larry