perl-unicode

byte order mark

1999-10-05 15:16:39
John Dlugosz asked me whether Perl will in future support a
UTF-8 byte-order mark in source files.  I'd expect the
shebang line to be rendered useless on most Unix systems if
someone did this to a script, since the file would no longer
begin with "#!".  Does anyone have any ideas?

John also said:
I can't get to newsgroups here at work, but here is a list of file
signatures for text files:

     |----------------------+--------------------------------|
     | File begins with     | Coding System                  |
     |----------------------+--------------------------------|
     | fe ff                | big-endian Unicode (UTF-16)    |
     | ff fe                | little-endian Unicode (UTF-16) |
     | 00 00 fe ff          | big-endian UCS-4               |
     | ff fe 00 00          | little-endian UCS-4            |
     | 0e fe ff             | UTR-6 (compressed Unicode)     |
     | ef bb bf             | UTF-8                          |
     |----------------------+--------------------------------|



(let me know if that came through unmolested)

I would expect that any tool that accepts "text files" as input would skip
a U+FEFF character as whitespace or comment or whatnot -- throw it out
early during lexing.
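
A minimal sketch of the "throw it out early" idea, in Perl.  It assumes the
text has already been decoded to characters, and it only handles a U+FEFF at
the very start of the text.  strip_bom is a hypothetical helper name used for
illustration, not something the perl core provides.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a single leading U+FEFF from decoded text.
# Text without a leading U+FEFF is returned unchanged.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}
```

A reader would call this once per file, before handing the text to whatever
does the lexing.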

So if I use -Mutf8 on the perl command line, it should happily put up with
them, no matter where they appear in the text (specifically, as the first
char in each module).  That should be a conformance requirement.

Having the main script's parser automatically recognise the ef bb bf bytes
and implicitly turn on UTF-8 mode would be my "wish".  Part II of the wish
would be that if another pattern from the above table is seen, to report a
specific error ("unsupported source code format") rather than treating it
as if the signature was not present (e.g. read in "legacy" 8-bit character
mode).
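
The two-part wish could be sketched roughly as follows.  This is only an
illustration of the decision logic, not how perl's parser works.
source_mode and the mode names 'UTF-8' and 'legacy' are hypothetical.  The
SCSU entry uses 0e fe ff, which is the signature UTR-6 gives.  Longer
signatures are tested first so that ff fe 00 00 is not mistaken for ff fe.

```perl
use strict;
use warnings;

# Signatures for text files, longest first.
my @SIGNATURES = (
    [ "\x00\x00\xFE\xFF", 'UCS-4BE'  ],
    [ "\xFF\xFE\x00\x00", 'UCS-4LE'  ],
    [ "\x0E\xFE\xFF",     'SCSU'     ],   # UTR-6 compressed Unicode
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: takes the first few raw bytes of a source file.
# Returns 'UTF-8' for the UTF-8 signature, 'legacy' when there is no
# signature, and dies with a specific error for any other signature.
sub source_mode {
    my ($head) = @_;
    for my $sig (@SIGNATURES) {
        my ($bytes, $name) = @$sig;
        next unless index($head, $bytes) == 0;
        return 'UTF-8' if $name eq 'UTF-8';
        die "unsupported source code format ($name)\n";
    }
    return 'legacy';
}
```

Passing the first four bytes of a file is enough, since no signature in the
list is longer than that.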

Also, does Perl consider a "line" of source to be delimited by U+2028?
That's the standard unambiguous line-break character, and that's what
Unipad saves.  OTOH, in UTF-8 you could argue that the traditional 0x0a is
just fine, being shorter (one reason for using UTF-8 instead of UTF-16) and
more intercompatible with ASCII (the other reason for using UTF-8).  So
what does Perl count for line numbers, and more specifically, how do you
specify which?  For <FH> it's clear (but what's the default?), but what
about source files?
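
For the <FH> half of the question, the knob is the input record separator
$/, which defaults to a newline.  The sketch below reads records from a
UTF-8 encoded byte string through an in-memory filehandle, splitting on
whatever separator is passed in.  read_records is a hypothetical helper, and
the sketch says nothing about how perl counts lines in source files.

```perl
use strict;
use warnings;

# Hypothetical helper: decode $bytes as UTF-8 and return its records,
# using $sep as the input record separator.
sub read_records {
    my ($bytes, $sep) = @_;
    open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
    local $/ = $sep;          # readline splits on this string
    my @records = <$fh>;
    chomp @records;           # chomp removes a trailing $/
    close $fh;
    return @records;
}
```

Calling read_records($bytes, "\x{2028}") splits on U+2028, while
read_records($bytes, "\n") gives the traditional 0x0a behaviour.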

I'm specifically interested in using Perl to implement some Win32
"resource" tools, to take the place of or augment the decrepit RC compiler.

--John



Sarathy
gsar(_at_)activestate(_dot_)com
