perl-unicode

Re: byte order mark

1999-10-06 09:23:05
Can Perl steal this:
http://www.php.net/manual/function.utf8-encode.php3

A. Darwish


From: Gurusamy Sarathy <gsar(_at_)ActiveState(_dot_)com>
To: perl5-porters(_at_)perl(_dot_)org, perl-unicode(_at_)perl(_dot_)org
CC: gsar(_at_)activestate(_dot_)com, John Dlugosz <jdlugosz(_at_)kodak(_dot_)com>
Subject: byte order mark
Date: Tue, 05 Oct 1999 15:20:11 -0700

John Dlugosz asked me if Perl will in future support a
utf8 byte-order-mark in source files.  I'd expect that
the shebang line will be rendered useless on most Unix
systems if someone did this to a script.  Anyone have
any ideas?

John also said:
>I can't get to newsgroups here at work, but here is a list of file
>signatures for text files:
>
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | File begins with     | Coding System                 |
>      |                      |                               |
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | fe ff                | big-endian Unicode (UTF-16)   |
>      |                      |                               |
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | ff fe                | little-endian Unicode         |
>      |                      | (UTF-16)                      |
>      |                      |                               |
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | 00 00 fe ff          | big-endian UCS-4              |
>      |                      |                               |
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | ff fe 00 00          | little-endian UCS-4           |
>      |                      |                               |
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | 0f fe ff             | UTR-6 (compressed Unicode)    |
>      |                      |                               |
>      |----------------------+-------------------------------|
>      |                      |                               |
>      | ef bb bf             | UTF-8                         |
>      |                      |                               |
>      |----------------------+-------------------------------|
>
>
>
>(let me know if that came through unmolested)
>
>I would expect that any tool that accepts "text files" as input would skip
>a U+FEFF character as whitespace or comment or whatnot -- throw it out
>early during lexing.
>
>So if I use -MUTF8 on the perl command line, it should happily put up with
>them, no matter where they appear in the text (specifically, the first char
>in each module).  That should be a conformance requirement.
>
>Having the main script's parser automatically recognise the ef bb bf bytes
>and implicitly turn on UTF-8 mode would be my "wish". Part II of the wish
>would be that if another pattern from the above table is seen, to report a
>specific error ("unsupported source code format") rather than treating it
>as if the signature was not present (e.g. read in "legacy" 8-bit character
>mode).
>
>Also, does Perl consider a "line" of source to be delimited by U+2028 ?
>That's the standard unambiguous line-break character, and that's what
>Unipad saves. OTOH, in UTF-8 you could argue that the traditional 0x0a is
>just fine, being shorter (one reason for using UTF-8 instead of UTF-16) and
>more intercompatible with ASCII (the other reason for using UTF-8).  So
>what does Perl count for line numbers, and more specifically, how do you
>specify which?  For <FH> it's clear (but what's the default), but what
>about source files?
>
>I'm specifically interested in using Perl to implement some Win32
>"resource" tools, to take the place of or augment the decrepit RC compiler.
>
>--John
>
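
For discussion, the behaviour John describes could be sketched roughly as
below. This is only an illustration under stated assumptions, not anything
perl does today: the helper names sniff_signature and check_source are
made up for this example, the signatures are taken from the UTF-16, UCS-4
and UTF-8 rows of the table above (the compressed-Unicode row is left
out), and the error text is the one John suggests.

```perl
use strict;
use warnings;

# Signatures from the table above, longest first, so that the 4-byte
# little-endian UCS-4 mark (ff fe 00 00) is tried before the 2-byte
# little-endian UTF-16 mark (ff fe), which is a prefix of it.
my @SIGNATURES = (
    [ "\x00\x00\xfe\xff", 'big-endian UCS-4'     ],
    [ "\xff\xfe\x00\x00", 'little-endian UCS-4'  ],
    [ "\xef\xbb\xbf",     'UTF-8'                ],
    [ "\xfe\xff",         'big-endian UTF-16'    ],
    [ "\xff\xfe",         'little-endian UTF-16' ],
);

# Hypothetical helper: given the leading bytes of a file, return
# (coding system name, signature length), or an empty list if no
# signature from the table matches.
sub sniff_signature {
    my ($bytes) = @_;
    for my $sig (@SIGNATURES) {
        my ($mark, $name) = @$sig;
        return ($name, length $mark)
            if substr($bytes, 0, length $mark) eq $mark;
    }
    return;
}

# Hypothetical helper implementing the two-part "wish":
#   - no signature: treat the source as legacy 8-bit text, unchanged
#   - UTF-8 signature: strip it and report UTF-8 mode
#   - any other signature: die with a specific error
sub check_source {
    my ($bytes) = @_;
    my ($name, $len) = sniff_signature($bytes);
    return ('legacy 8-bit', $bytes) unless defined $name;
    die "unsupported source code format ($name)\n"
        unless $name eq 'UTF-8';
    return ('UTF-8', substr($bytes, $len));
}
```

Called as check_source($first_bytes_of_script), this hands back the mode
and the source with any UTF-8 signature removed, so a "#!" line that
follows the mark would at least be visible to perl itself. It does nothing
for the kernel's own shebang handling, which is the concern raised at the
top of this message.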


Sarathy
gsar(_at_)activestate(_dot_)com
