perl-unicode

Re: Make Encode.pm support the real UTF-8

2004-12-03 12:30:13
On Dec 02, 2004, at 23:25, Tim Bunce wrote:
On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
As you probably know perl's version of UTF-8 is not the real thing.  I
thought I would hack up a patch to support the encoding as defined by
Unicode.  That involves rejecting illegal chars (like surrogates,
"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
and such.

It's worth remembering that overlong sequences are a potential security risk.
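
To make the risk concrete, here is a minimal sketch. It is not taken from the proposed patch, and the helper names naive_decode2 and strict_decode2 are made up for illustration. It handles only two-byte sequences and assumes the lead and continuation bytes are otherwise well formed. The bytes C0 AF are an overlong spelling of "/", so a filter that only looks for the byte 2F never sees it, while a lax decoder further down the line still produces a slash.

```perl
use strict;
use warnings;

# Naive two-byte decoder: combines the payload bits, no overlong check.
sub naive_decode2 {
    my ($b1, $b2) = @_;
    return (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
}

# Stricter variant: a two-byte sequence must encode at least U+0080,
# otherwise the same code point has a shorter (one-byte) encoding.
sub strict_decode2 {
    my ($b1, $b2) = @_;
    my $cp = naive_decode2($b1, $b2);
    return $cp >= 0x80 ? $cp : undef;
}

my $cp = naive_decode2(0xC0, 0xAF);
printf "naive:  U+%04X (%s)\n", $cp, chr $cp;    # naive:  U+002F (/)
printf "strict: %s\n",
    defined strict_decode2(0xC0, 0xAF) ? "accepted" : "rejected";
```

A real decoder would also have to validate the lead byte, the continuation bytes and the longer sequence lengths; the sketch only shows why the minimum-value check matters.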

Before I do this I would like to get some feedback on the interface.
My preferred interface would be to make:

   encode("UTF-8", $string)

imply the official restricted form

I think that would be best.

But to what extent? Does it mean that code points inside the restricted range but currently unused (e.g. U+10F000) become illegal as well? Would we then have to verify and, if necessary, patch perl every time unicode.org updates Unicode?

While I agree that official UTF-8 should be supported separately from "Perl" UTF-8, I would like perl to stay independent of unicode.org. Remember that the perl community does not have a vote in unicode.org (or does it?). Making perl too compliant with the Unicode standard puts perl at its mercy.

and then have

   encode("UTF-8-Perl", $string)

be used as the name for Perl's relaxed and extended version of the
encoding.  The encode_utf8($string) function would continue to be the
same as encode("UTF-8-Perl", $string).

Isn't there a standard name for the 'unrestricted' encoding?
(Might be an IETF RFC rather than a unicode standard.)

To my knowledge there are at least 3 flavors of UTF-8:

* Official -- as defined by unicode.org; restricted, nothing above U+10FFFF
* RFC 2279 -- "unrestricted", U+0000 - U+7FFF_FFFF
* Perl     -- "unrestricted", U+0000 - U+7FFF_FFFF_FFFF_FFFF w/ 64bitint
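
As a rough sketch of how the three ranges differ, the following hypothetical helper (utf8_flavors is not an Encode function) reports which flavors would accept a given code point. For the official flavor it assumes the exclusions listed at the top of this thread: surrogates, noncharacters such as U+FDD0 and U+FFFF, and anything above 0x10FFFF. The upper bound for the Perl flavor depends on the integer size of the build, so it is not checked here.

```perl
use strict;
use warnings;

# Return the names of the UTF-8 flavors that accept code point $cp.
# Assumes $cp is a non-negative integer.
sub utf8_flavors {
    my ($cp) = @_;
    my $surrogate = $cp >= 0xD800 && $cp <= 0xDFFF;
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    my $nonchar = ($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) == 0xFFFE;

    my @ok;
    push @ok, 'official' if $cp <= 0x10FFFF && !$surrogate && !$nonchar;
    push @ok, 'rfc2279'  if $cp <= 0x7FFF_FFFF;
    push @ok, 'perl';    # upper bound depends on the perl build, not checked
    return @ok;
}

for my $cp (0x41, 0xD800, 0xFFFF, 0x110000) {
    printf "U+%04X: %s\n", $cp, join ", ", utf8_flavors($cp);
}
```

Only U+0041 is accepted by all three flavors here; the other sample values fall outside the official form while remaining valid for RFC 2279 and for Perl.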

This implies that encode("UTF-8", $string) can start failing while
previously it could not.

Anyone working with valid UTF-8 would not get failures.
Anyone who thinks they're using valid UTF-8 but aren't should be grateful!
Anyone not using valid UTF-8 (eg using it as a way to encode integers)
needs to be told in advance - but I doubt there are many and they're
likely to be clueful users who read release notes :)

There are many movements and implementations that "extend" Unicode by making use of code points beyond 0x10FFFF. Current perl accepts them; "real", official Unicode cannot.

I'd say "UTF-8" should mean the official restricted form for perl 5.10.

Perl is a language where "use strict" is not default. Why make its default encoding strict then?

Perl should be liberal, not official.

Why make it real when you already have something better than real?

So my proposal is the opposite: leave "utf8" and "UTF-8" as they are now and define "UTF-8-official" or "UTF-8-pedantic" or whatever for the strict form.

The only remaining issues are then what to do for 5.8.7
and what to call the unrestricted encoding.

I would like to keep calling that 'utf8'.

Dan the Encode Maintainer