Gurusamy Sarathy writes:
: John Dlugosz asked me if Perl will in future support a
: utf8 byte-order-mark in source files.
BOMs are an abomination.
That being said, we will in any event have a generalized text open
function that can recognize various encodings and translate on the
fly to utf8, so I don't see why we can't support Unicode even in
its abominable forms.
: I'd expect that
: the shebang line will be rendered useless on most Unix
: systems if someone did this to a script. Anyone have
: any ideas?
Certainly it's not our job to fix the kernel if the kernel doesn't
recognize byte-order-marks, but we can at least make Perl ignore them,
and possibly switch to utf8 mode automatically.
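
A minimal sketch of the "ignore them, and possibly switch to utf8 mode" idea, written in Python for illustration only and not a description of how Perl does or will do it. It assumes the source has been read as raw bytes; the name `strip_utf8_bom` is hypothetical.

```python
from typing import Tuple

UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def strip_utf8_bom(raw: bytes) -> Tuple[bytes, bool]:
    """Return (source without a leading UTF-8 BOM, whether one was present).

    A caller could use the boolean as a hint to switch to utf8 mode.
    """
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):], True
    return raw, False
```

With the mark removed, a "#!" line is again the first thing the interpreter's own parser sees, although the kernel still looks at the file's original bytes.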
: John also said:
: >I can't get to newsgroups here at work, but here is a list of file
: >signatures for text files:
: >
: > |------------------+--------------------------------|
: > | File begins with | Coding System                  |
: > |------------------+--------------------------------|
: > | fe ff            | big-endian Unicode (UTF-16)    |
: > | ff fe            | little-endian Unicode (UTF-16) |
: > | 00 00 fe ff      | big-endian UCS-4               |
: > | ff fe 00 00      | little-endian UCS-4            |
: > | 0f fe ff         | UTR-6 (compressed Unicode)     |
: > | ef bb bf         | UTF-8                          |
: > |------------------+--------------------------------|
: >
: >
: >
: >(let me know if that came through unmolested)
: >
: >I would expect that any tool that accepts "text files" as input would skip
: >a U+FEFF character as whitespace or comment or whatnot -- throw it out
: >early during lexing.
That's easy enough.
: >So if I use -MUTF8 on the perl command line, it should happily put up with
: >them, no matter where they appear in the text (specifically, the first char
: >in each module). That should be a conformance requirement.
: >
: >Having the main script's parser automatically recognise the ef bb bf bytes
: >and implicitly turn on UTF-8 mode would be my "wish".
I haven't thought about that one enough to know if I like it. It's one
thing to use 0xfeff as a byte-order mark, but to start using it as a
magic number indicating utf8 seems a bit more bogus than even the
original abomination. Nevertheless, it seems the best thing to do, given
the existence of a mark there.
: Part II of the wish
: >would be that if another pattern from the above table is seen, to report a
: >specific error ("unsupported source code format") rather than treating it
: >as if the signature was not present (e.g. read in "legacy" 8-bit character
: >mode).
No, we can do better than that. We'd swap in a translator and the
lexer would never see anything but utf8.
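
A rough sketch of such a translator, again in Python for illustration and not Perl's actual mechanism. It assumes the whole source is available as bytes, treats UCS-4 as Python's "utf-32" codecs, and leaves out the UTR-6 signature because Python's standard library has no codec for it. The names `SIGNATURES` and `to_utf8` are hypothetical.

```python
# Signatures from John's table. Longer ones come first because
# "ff fe 00 00" (little-endian UCS-4) begins with "ff fe" (UTF-16 LE).
# Note the inherent ambiguity: a UTF-16 LE file whose first character
# after the mark is U+0000 would be taken for UCS-4 here.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def to_utf8(raw: bytes) -> bytes:
    """Drop a recognized signature and re-encode the rest as UTF-8."""
    for signature, codec in SIGNATURES:
        if raw.startswith(signature):
            return raw[len(signature):].decode(codec).encode("utf-8")
    # No signature: hand the bytes through unchanged (legacy 8-bit source).
    return raw
```

Whatever consumes the result would then see only utf8, with the mark itself already removed.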
: >Also, does Perl consider a "line" of source to be delimited by U+2028 ?
Not at the moment. You suppose the paragraph separator should make a
new line too, since you use it instead of a line separator?
: >That's the standard unambiguous line-break character, and that's what
: >Unipad saves. OTOH, in UTF-8 you could argue that the traditional 0x0a is
: >just fine, being shorter (one reason for using UTF-8 instead of UTF-16) and
: >more intercompatible with ASCII (the other reason for using UTF-8). So
: >what does Perl count for line numbers, and more specifically, how do you
: >specify which? For <FH> it's clear (but what's the default), but what
: >about source files?
The situation is a little more complicated than that. We already have
to be able to handle any of CR, LF, or CRLF as "\n". If we have to add
a few more sequences to the list for Unicode, ah well.
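
One way to picture "adding a few more sequences to the list" is the Python sketch below. It is an illustration only: treating both U+2028 and U+2029 as line breaks is an assumption (the message leaves that question open), and `LINE_BREAK` and `count_lines` are hypothetical names.

```python
import re

# CRLF is listed first so that it counts as one break rather than two.
LINE_BREAK = re.compile("\r\n|\r|\n|\u2028|\u2029")


def count_lines(text: str) -> int:
    """Count lines, treating each break sequence as a single terminator."""
    if not text:
        return 0
    breaks = len(LINE_BREAK.findall(text))
    # A final line with no terminator still counts as a line.
    ends_with_break = LINE_BREAK.fullmatch(text[-1]) is not None
    return breaks if ends_with_break else breaks + 1
```

The same pattern could drive line numbering for error messages, since every sequence advances the count by exactly one.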
Larry