Make Encode.pm support the real UTF-8

As you probably know, Perl's version of UTF-8 is not the real thing.  I
thought I would hack up a patch to support the encoding as defined by
Unicode.  That involves rejecting illegal chars (like surrogates,
"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
and such.
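
To make the list above concrete, here is a minimal sketch of the
code point checks it describes.  The function name is_strict_code_point
is hypothetical and not part of Encode.  The sketch works on code points
only; overlong sequences are a byte-level problem that a decoder would
have to catch separately.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: returns 1 if $cp is a code
# point the restricted form would accept, 0 otherwise.
sub is_strict_code_point {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                    # beyond the Unicode range
    return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates
    return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # noncharacters U+FDD0..U+FDEF
    return 0 if ($cp & 0xFFFE) == 0xFFFE;          # U+xxFFFE and U+xxFFFF in any plane
    return 1;
}
```

A caller could use it as
grep { !is_strict_code_point(ord) } split //, $string
to list the offending chars in a string.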

Before I do this I would like to get some feedback on the interface.
My preferred interface would be to make:

   encode("UTF-8", $string)

imply the official restricted form and then have

   encode("UTF-8-Perl", $string)

be used as the name for Perl's relaxed and extended version of the
encoding.  The encode_utf8($string) function would continue to be the
same as encode("UTF-8-Perl", $string).

This implies that encode("UTF-8", $string) can start failing while
previously it could not.
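
A caller that wants to cope with such failures could wrap the call,
roughly as sketched below.  The wrapper name try_encode is hypothetical;
FB_CROAK and LEAVE_SRC are existing Encode CHECK flags.  Which strings
actually croak depends on the installed Encode and on whether the
proposed change is in place, so this only shows the shape of the
error handling.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: returns ($bytes, undef) on success and
# (undef, $error) if encode() croaks on a char it cannot encode.
sub try_encode {
    my ($name, $string) = @_;
    my $bytes = eval {
        # FB_CROAK makes encode() die on unmappable chars;
        # LEAVE_SRC keeps it from modifying $string in place.
        Encode::encode($name, $string,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return (undef, $@) unless defined $bytes;
    return ($bytes, undef);
}
```

Typical use would be
my ($bytes, $err) = try_encode("UTF-8", $string);
followed by a check of $err before $bytes is written anywhere.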

Another approach would be to add a FB_STRICT flag that could be passed
with the CHECK argument.  I'm not sure this would make sense for any
encoding besides UTF-8 though.

Other suggestions or comments?

Regards,
Gisle
