perl-unicode

Make Encode.pm support the real UTF-8

2004-12-02 03:30:14
As you probably know, Perl's version of UTF-8 is not the real thing.  I
thought I would hack up a patch to support the encoding as defined by
Unicode.  That involves rejecting illegal chars (like surrogates,
"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
and such.
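
To make the code point rules concrete, here is a minimal sketch of the
check in plain Perl.  The helper name is_strict_codepoint is made up for
illustration and is not part of Encode.  It treats the whole
U+FDD0..U+FDEF block and the last two code points of every plane as
illegal, which is my reading of the examples above.  Overlong sequences
are a decoding matter and are not covered here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode): true if $cp would be
# acceptable under the restrictions listed in this message.
sub is_strict_codepoint {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                    # beyond the Unicode range
    return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates
    return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # noncharacter block
    return 0 if ($cp & 0xFFFE) == 0xFFFE;          # U+xxFFFE and U+xxFFFF
    return 1;
}

# Report the verdict for a few sample code points.
printf "U+%04X %s\n", $_, is_strict_codepoint($_) ? "ok" : "rejected"
    for 0x41, 0xD800, 0xFDD0, 0xFFFF, 0x10FFFD, 0x110000;
```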

Before I do this I would like to get some feedback on the interface.
My preferred interface would be to make:

   encode("UTF-8", $string)

imply the official restricted form and then have

   encode("UTF-8-Perl", $string)

be used as the name for Perl's relaxed and extended version of the
encoding.  The encode_utf8($string) function would continue to be the
same as encode("UTF-8-Perl", $string).

This implies that encode("UTF-8", $string) can start failing while
previously it could not.
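
For callers, that would mean guarding the call.  A rough sketch of what
that might look like, assuming the strict encoder reports bad input
through the existing FB_CROAK check value (that is an assumption, nothing
is decided yet).  The wrapper name try_encode is made up for this
example; FB_CROAK and LEAVE_SRC are the constants Encode already has.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: returns (bytes, undef) on success and
# (undef, error message) if encode() croaks.
sub try_encode {
    my ($encoding, $string) = @_;
    my $bytes = eval {
        # LEAVE_SRC asks encode() not to modify $string in place.
        Encode::encode($encoding, $string,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $bytes ? ($bytes, undef) : (undef, $@ || "unknown error");
}

my ($bytes, $err) = try_encode("UTF-8", "caf\x{E9}");
print defined $bytes
    ? "encoded " . length($bytes) . " bytes\n"
    : "failed: $err";
```

Code that never passes a CHECK value would have to pick one of the
existing fallback behaviours, which is part of what needs deciding.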

Another approach would be to add a FB_STRICT flag that could be passed
with the CHECK argument.  I'm not sure this would make sense for any
encoding besides UTF-8 though.
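
To show the shape of the flag approach, here is a sketch that emulates
it outside Encode.  Everything named *_SKETCH or *_sketch is made up for
illustration: Encode defines no FB_STRICT today, and the real flag would
be checked inside the encoder rather than in a wrapper like this.  The
strict test follows the list of illegal chars given earlier.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode_utf8);

# Hypothetical stand-in for the proposed flag; not an Encode constant.
use constant FB_STRICT_SKETCH => 0x1;

sub encode_utf8_sketch {
    my ($string, $check) = @_;
    $check = 0 unless defined $check;
    if ($check & FB_STRICT_SKETCH) {
        # Strict mode: refuse anything the official UTF-8 would not allow.
        for my $cp (map { ord } split //, $string) {
            croak sprintf("U+%04X is not allowed in strict UTF-8", $cp)
                if $cp > 0x10FFFF
                || ($cp >= 0xD800 && $cp <= 0xDFFF)
                || ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                || ($cp & 0xFFFE) == 0xFFFE;
        }
    }
    return encode_utf8($string);    # Perl's relaxed encoder
}
```

Without the flag the call behaves like encode_utf8($string); with it,
the same call can croak.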

Other suggestions or comments?

Regards,
Gisle