Can Perl steal this:
http://www.php.net/manual/function.utf8-encode.php3
A. Darwish
From: Gurusamy Sarathy <gsar(_at_)ActiveState(_dot_)com>
To: perl5-porters(_at_)perl(_dot_)org, perl-unicode(_at_)perl(_dot_)org
CC: gsar(_at_)activestate(_dot_)com, John Dlugosz <jdlugosz(_at_)kodak(_dot_)com>
Subject: byte order mark
Date: Tue, 05 Oct 1999 15:20:11 -0700
John Dlugosz asked me if Perl will in future support a
utf8 byte-order-mark in source files. I'd expect that
the shebang line will be rendered useless on most Unix
systems if someone did this to a script. Anyone have
any ideas?
John also said:
>I can't get to newsgroups here at work, but here is a list of file
>signatures for text files:
>
> |-------------------+--------------------------------|
> | File begins with  | Coding System                  |
> |-------------------+--------------------------------|
> | fe ff             | big-endian Unicode (UTF-16)    |
> | ff fe             | little-endian Unicode (UTF-16) |
> | 00 00 fe ff       | big-endian UCS-4               |
> | ff fe 00 00       | little-endian UCS-4            |
> | 0e fe ff          | UTR-6 (compressed Unicode)     |
> | ef bb bf          | UTF-8                          |
> |-------------------+--------------------------------|
>
>
>
>(let me know if that came through unmolested)
>
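For illustration, the signature table can be turned into a small lookup. This is only a sketch: detect_signature() is a made-up helper name, not anything in Perl itself, and the UTR-6 row is left out for brevity. The entries are checked longest first, because the little-endian UCS-4 mark begins with the same two bytes as the little-endian UTF-16 mark.

```perl
use strict;
use warnings;

# Signatures from the table, longest first so that the 4-byte UCS-4
# little-endian mark is not mistaken for the 2-byte UTF-16 one.
my @SIGNATURES = (
    [ "\x00\x00\xFE\xFF", 'UCS-4 big-endian'     ],
    [ "\xFF\xFE\x00\x00", 'UCS-4 little-endian'  ],
    [ "\xEF\xBB\xBF",     'UTF-8'                ],
    [ "\xFE\xFF",         'UTF-16 big-endian'    ],
    [ "\xFF\xFE",         'UTF-16 little-endian' ],
);

# detect_signature() is a hypothetical helper, not a Perl builtin.
# It takes the leading bytes of a file and returns the coding system
# name and the signature length, or an empty list if nothing matches.
sub detect_signature {
    my ($bytes) = @_;
    for my $sig (@SIGNATURES) {
        my ($magic, $name) = @$sig;
        return ($name, length $magic)
            if substr($bytes, 0, length $magic) eq $magic;
    }
    return;
}

my ($name, $len) = detect_signature("\xEF\xBB\xBFprint 1;\n");
print "$name ($len bytes)\n";
```

A file with no signature, such as one starting with "#!", yields an empty list, so the caller can fall back to whatever default it prefers.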
>I would expect that any tool that accepts "text files" as input would skip
>a U+FEFF character as whitespace or comment or whatnot -- throw it out
>early during lexing.
>
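As a rough sketch of "throw it out early", a tool that has already decoded its input to characters could drop a leading U+FEFF before it looks at anything else. strip_bom() is a hypothetical name used only for this example. Note that this covers the tokenizer side only; it does nothing for the kernel's own check of the first two bytes of a script for "#!".

```perl
use strict;
use warnings;

# strip_bom() is a hypothetical helper. It expects text that has already
# been decoded to characters and removes a single leading U+FEFF, if any.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

my $source = "\x{FEFF}#!/usr/bin/perl\nprint 1;\n";
my $clean  = strip_bom($source);
print "shebang is now first\n" if $clean =~ /\A#!/;
```

Treating U+FEFF as ignorable anywhere in the text, as suggested above, would be a different rule (for example deleting every occurrence) and is not what this sketch does.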
>So if I use -MUTF8 on the perl command line, it should happily put up with
>them, no matter where they appear in the text (specifically, the first char
>in each module). That should be a conformance requirement.
>
>Having the main script's parser automatically recognise the ef bb bf bytes
>and implicitly turn on UTF-8 mode would be my "wish". Part II of the wish
>would be that if another pattern from the above table is seen, to report a
>specific error ("unsupported source code format") rather than treating it
>as if the signature was not present (e.g. read in "legacy" 8-bit character
>mode).
>
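The two-part wish could look something like the following from the outside. This is a sketch of the requested policy, not of how perl reads source files; classify_source() and the 'utf8' and 'legacy' mode labels are invented for the example, and only the UTF-16 and UCS-4 marks are treated as "another pattern".

```perl
use strict;
use warnings;

# classify_source() is a hypothetical helper. Given the raw bytes of a
# source file it returns a mode label and the bytes to parse:
#   ef bb bf signature   -> 'utf8', with the signature removed
#   UTF-16 / UCS-4 mark  -> dies with a specific error
#   anything else        -> 'legacy', bytes unchanged
sub classify_source {
    my ($bytes) = @_;
    return ('utf8', substr($bytes, 3)) if $bytes =~ /\A\xEF\xBB\xBF/;
    die "unsupported source code format\n"
        if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE|\xFE\xFF)/;
    return ('legacy', $bytes);
}

my ($mode, $body) = classify_source("\xEF\xBB\xBFprint 1;\n");
print "mode=$mode\n";

# A UTF-16 file is rejected rather than silently read as 8-bit text.
my $ok = eval { classify_source("\xFF\xFEp\x00"); 1 };
print "rejected: $@" unless $ok;
```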
>Also, does Perl consider a "line" of source to be delimited by U+2028 ?
>That's the standard unambiguous line-break character, and that's what
>Unipad saves. OTOH, in UTF-8 you could argue that the traditional 0x0a is
>just fine, being shorter (one reason for using UTF-8 instead of UTF-16) and
>more intercompatible with ASCII (the other reason for using UTF-8). So
>what does Perl count for line numbers, and more specifically, how do you
>specify which? For <FH> it's clear (but what's the default), but what
>about source files?
>
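For the <FH> half of the question, the record separator is $/, which defaults to "\n". A sketch of reading U+2028-delimited records follows; it assumes a perl with PerlIO encoding layers and in-memory filehandles (5.8 or later), which postdates this message, and it says nothing about how the parser counts lines in source files.

```perl
use strict;
use warnings;

# Three records separated by U+2028, given as raw UTF-8 bytes
# (U+2028 encodes as e2 80 a8).
my $bytes = "one\xE2\x80\xA8two\xE2\x80\xA8three";

# Read from an in-memory filehandle through a UTF-8 decoding layer.
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "cannot open in-memory file: $!";

my @records;
{
    local $/ = "\x{2028}";    # input record separator for this block only
    @records = <$fh>;
    chomp @records;           # chomp removes the current value of $/
}
close $fh;

print scalar(@records), " records\n";
```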
>I'm specifically interested in using Perl to implement some Win32
>"resource" tools, to take the place of or augment the decrepit RC compiler.
>
>--John
>
Sarathy
gsar(_at_)activestate(_dot_)com