Re: byte order mark

Can Perl steal this:
http://www.php.net/manual/function.utf8-encode.php3

A. Darwish

From: Gurusamy Sarathy <gsar(_at_)ActiveState(_dot_)com>
To: perl5-porters(_at_)perl(_dot_)org, perl-unicode(_at_)perl(_dot_)org
CC: gsar(_at_)activestate(_dot_)com, John Dlugosz <jdlugosz(_at_)kodak(_dot_)com>
Subject: byte order mark
Date: Tue, 05 Oct 1999 15:20:11 -0700

John Dlugosz asked me whether Perl will in future support a
UTF-8 byte-order mark in source files.  I'd expect the
shebang line to be rendered useless on most Unix systems
if someone did this to a script.  Anyone have any ideas?
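
To make the shebang concern concrete, here is a minimal Python sketch. It assumes the usual Unix behaviour of treating a file as a script only when "#!" occupies its first two bytes. The helper name has_usable_shebang is made up for illustration and is not part of Perl or any library.

```python
# The UTF-8 signature discussed in this thread.
UTF8_BOM = b"\xef\xbb\xbf"


def has_usable_shebang(data: bytes) -> bool:
    """Return True only when "#!" is at byte offset 0 of the file."""
    return data.startswith(b"#!")


script = b"#!/usr/bin/perl -w\nprint qq(hello\\n);\n"
with_bom = UTF8_BOM + script

print(has_usable_shebang(script))    # True
print(has_usable_shebang(with_bom))  # False: the file now starts with ef bb bf
```

With the signature prepended, "#!" sits at offset 3, so a loader that inspects only the first two bytes would not see it.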

John also said:
>I can't get to newsgroups here at work, but here is a list of file
>signatures for text files:
>
>      |----------------------+--------------------------------|
>      | File begins with     | Coding System                  |
>      |----------------------+--------------------------------|
>      | fe ff                | big-endian Unicode (UTF-16)    |
>      | ff fe                | little-endian Unicode (UTF-16) |
>      | 00 00 fe ff          | big-endian UCS-4               |
>      | ff fe 00 00          | little-endian UCS-4            |
>      | 0e fe ff             | UTR-6 (compressed Unicode)     |
>      | ef bb bf             | UTF-8                          |
>      |----------------------+--------------------------------|
>
>
>
>(let me know if that came through unmolested)
>
>I would expect that any tool that accepts "text files" as input would skip
>a U+FEFF character as whitespace or comment or whatnot -- throw it out
>early during lexing.
>
>So if I use -MUTF8 on the perl command line, it should happily put up with
>them, no matter where they appear in the text (specifically, the first char
>in each module).  That should be a conformance requirement.
>
>Having the main script's parser automatically recognise the ef bb bf bytes
>and implicitly turn on UTF-8 mode would be my "wish". Part II of the wish
>would be that if another pattern from the above table is seen, to report a
>specific error ("unsupported source code format") rather than treating it
>as if the signature was not present (e.g. read in "legacy" 8-bit character
>mode).
>
>Also, does Perl consider a "line" of source to be delimited by U+2028?
>That's the standard unambiguous line-break character, and that's what
>Unipad saves. OTOH, in UTF-8 you could argue that the traditional 0x0a is
>just fine, being shorter (one reason for using UTF-8 instead of UTF-16) and
>more intercompatible with ASCII (the other reason for using UTF-8).  So
>what does Perl count for line numbers, and more specifically, how do you
>specify which?  For <FH> it's clear (but what's the default?), but what
>about source files?
>
>I'm specifically interested in using Perl to implement some Win32
>"resource" tools, to take the place of or augment the decrepit RC compiler.
>
>--John
>
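
John's two-part wish could be sketched roughly as follows, in Python rather than Perl. This is only an illustration under stated assumptions, not how perl's parser works: the name sniff_source is hypothetical, "legacy" 8-bit mode is modelled as Latin-1, and only the UTF-8, UTF-16 and UCS-4 signatures are covered. Longer signatures are tested first so that ff fe 00 00 is not mistaken for ff fe.

```python
# Signatures for the UTF-8, UTF-16 and UCS-4 cases, longest first.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "big-endian UCS-4"),
    (b"\xff\xfe\x00\x00", "little-endian UCS-4"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "big-endian UTF-16"),
    (b"\xff\xfe", "little-endian UTF-16"),
]


def sniff_source(data: bytes) -> tuple[str, str]:
    """Return (mode, text) for the raw bytes of a source file.

    A UTF-8 signature is stripped and switches on UTF-8 mode.
    Any other known signature is rejected with a specific error.
    No signature at all falls back to an assumed 8-bit mode (Latin-1).
    """
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            if name != "UTF-8":
                raise ValueError(f"unsupported source code format: {name}")
            return "utf-8", data[len(signature):].decode("utf-8")
    return "legacy 8-bit", data.decode("latin-1")


print(sniff_source(b"\xef\xbb\xbfprint 1;\n"))  # ('utf-8', 'print 1;\n')
print(sniff_source(b"print 1;\n"))              # ('legacy 8-bit', 'print 1;\n')
```

A file beginning with ff fe 00 00 would raise ValueError naming little-endian UCS-4 instead of being read silently as 8-bit text.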


Sarathy
gsar(_at_)activestate(_dot_)com

