Re: Make Encode.pm support the real UTF-8

On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:

As you probably know perl's version of UTF-8 is not the real thing.  I
thought I would hack up a patch to support the encoding as defined by
Unicode.  That involves rejecting illegal chars (like surrogates,
"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
and such.
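
To make the list above concrete, here is a minimal sketch of the code point checks it implies. The helper name is_strict_code_point is hypothetical and is not part of Encode; the ranges are the surrogates, the noncharacters (U+FDD0..U+FDEF plus every U+nFFFE and U+nFFFF) and anything above 0x10FFFF. Overlong sequences are a property of the byte form and are not covered here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: true if $cp is a code point
# that the restricted form described above would accept.
sub is_strict_code_point {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                    # beyond the Unicode range
    return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates
    return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # noncharacter block
    return 0 if ($cp & 0xFFFE) == 0xFFFE;          # U+nFFFE and U+nFFFF
    return 1;
}

# Report the verdict for a few sample code points.
for my $cp (0x41, 0xD800, 0xFDD0, 0xFFFF, 0x10FFFD, 0x110000) {
    printf "U+%04X %s\n", $cp, is_strict_code_point($cp) ? "ok" : "rejected";
}
```

Whether the noncharacters belong on the reject list is a policy choice taken from the proposal text; the sketch only mirrors that list.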


It's worth remembering that overlong sequences are a potential security risk.
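
A small illustration of why, as a sketch only: the bytes C0 AF are an overlong two-byte form of "/" (U+002F). A decoder that just combines the payload bits turns them into a slash, so a check for "/" done on the raw bytes beforehand never sees it. Both helper names below are hypothetical and the overlong test only looks at the first two bytes.

```perl
use strict;
use warnings;

# Naive two-byte decoder: combines the payload bits without checking
# that the value actually needed two bytes.
sub naive_decode2 {
    my ($bytes) = @_;
    my ($b0, $b1) = unpack("C2", $bytes);
    return (($b0 & 0x1F) << 6) | ($b1 & 0x3F);
}

# Hypothetical check: true if the byte string starts with an overlong
# two-, three- or four-byte sequence.
sub starts_overlong {
    my ($bytes) = @_;
    my ($b0, $b1) = unpack("C2", $bytes);
    return 0 unless defined $b1 && ($b1 & 0xC0) == 0x80;  # need a continuation byte
    return 1 if $b0 == 0xC0 || $b0 == 0xC1;    # 2-byte form of U+0000..U+007F
    return 1 if $b0 == 0xE0 && $b1 < 0xA0;     # 3-byte form of a value below U+0800
    return 1 if $b0 == 0xF0 && $b1 < 0x90;     # 4-byte form of a value below U+10000
    return 0;
}

my $sneaky = "\xC0\xAF";
printf "naive decode gives U+%04X, overlong: %s\n",
    naive_decode2($sneaky), starts_overlong($sneaky) ? "yes" : "no";
```

A strict decoder would reject the input at the starts_overlong stage instead of ever producing the slash.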

Before I do this I would like to get some feedback on the interface.
My preferred interface would be to make:

   encode("UTF-8", $string)

imply the official restricted form


I think that would be best.

and then have

   encode("UTF-8-Perl", $string)

be used as the name for Perl's relaxed and extended version of the
encoding.  The encode_utf8($string) function would continue to be the
same as encode("UTF-8-Perl", $string).


Isn't there a standard name for the 'unrestricted' encoding?
(Might be an IETF RFC rather than a unicode standard.)

This implies that encode("UTF-8", $string) can start failing while
previously it could not.


Anyone working with valid UTF-8 would not get failures.
Anyone who thinks they're using valid UTF-8 but aren't should be grateful!
Anyone not using valid UTF-8 (e.g. using it as a way to encode integers)
needs to be told in advance - but I doubt there are many, and they're
likely to be clueful users who read release notes :)

I'd say "UTF-8" should mean the official restricted form for perl 5.10.

The only remaining issues are then what to do for 5.8.7
and what to call the unrestricted encoding.

Tim.
