perl-unicode

Re: Make Encode.pm support the real UTF-8

2004-12-02 05:30:07
* Gisle Aas wrote:
> As you probably know perl's version of UTF-8 is not the real thing.  I
> thought I would hack up a patch to support the encoding as defined by
> Unicode.  That involves rejecting illegal chars (like surrogates,
> "\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> and such.

I would very much like to have this functionality available in some
standard module. What exactly do you mean by "rejecting" here, though?
For example, by default, I would expect

  decode("UTF-8" => "Bj\xF6rn")

to return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would
this change (i.e., would it croak instead)?
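
To make the question concrete, here is a minimal sketch of the two
behaviours, assuming the optional CHECK argument described in
`perldoc Encode`. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "Bj\xF6rn";    # Latin-1 bytes, not a valid UTF-8 sequence

# Default CHECK: the malformed byte is replaced, no exception is raised.
my $lenient = decode("UTF-8", $octets);

# With FB_CROAK the same input makes decode() die instead. A copy is
# passed because decode() may modify its source when CHECK is set.
my $copy    = $octets;
my $strict  = eval { decode("UTF-8", $copy, Encode::FB_CROAK) };
my $croaked = !defined $strict;
```

The open point is which of these two outcomes the restricted "UTF-8"
would produce when no CHECK argument is given.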

> Before I do this I would like to get some feedback on the interface.
> My preferred interface would be to make:
>
>   encode("UTF-8", $string)
>
> imply the official restricted form and then have
>
>   encode("UTF-8-Perl", $string)
>
> be used as the name for Perl's relaxed and extended version of the
> encoding.  The encode_utf8($string) function would continue to be the
> same as encode("UTF-8-Perl", $string).

I would prefer that there be no semantic overloading of "UTF-8" at all.
I generally expect that anything called UTF-8 refers to UTF-8 as
defined in the Unicode standard or RFC 3629. I was, for example,
surprised that Encode::is_utf8(...) considers sequences to be UTF-8
that are not UTF-8 as defined in those specifications (the
documentation explicitly states "well-formed UTF-8").

Now that we have this problem, introducing more places where one needs
to carefully check the documentation for what is considered UTF-8 does
not seem like the best option. Having decode_utf8() and
decode(utf8 => ...) mean something different is likely to cause
confusion. Maybe this could go the other way round, i.e. introduce a
new encoding "UTF-8-Strict" or something.

> This implies that encode("UTF-8", $string) can start failing while
> previously it could not.

As above, by default I do not think it should fail but rather use a
replacement character instead of croaking. The result should be the
same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)

  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

where decode("RFC-3629-UTF-8") would always return an RFC-3629-UTF-8
string with no illegal sequences (and as that should not fail, the
above should not fail either). I.e.

  encode("RFC-3629-UTF-8" => $string) eq
  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

would always hold true (assuming that decode("RFC-3629-UTF-8") would
ignore that the UTF-8 flag on $string is already set and decode
"again").

> Other suggestions or comments?

There should be a corresponding is_foo function that checks whether
a sequence of octets (or a string with the UTF-8 flag set) is actually
UTF-8 as defined in the relevant specifications, maybe by adding one
more argument to Encode::is_utf8 like

  Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)
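
A rough sketch of what such a check could look like on raw octets.
is_rfc3629_utf8 is a hypothetical name, not an existing Encode
function. The pattern follows the well-formed byte sequences of
RFC 3629, so it rejects overlong forms, surrogates and anything above
0x10FFFF, but it does not reject noncharacters such as U+FFFF.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode): returns 1 if $octets is a
# well-formed UTF-8 byte sequence in the RFC 3629 sense, 0 otherwise.
sub is_rfc3629_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
    )*\z/x ? 1 : 0;
}
```

The function expects a byte string. A string containing characters
above 0xFF simply fails the match.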
-- 
Björn Höhrmann · mailto:bjoern(_at_)hoehrmann(_dot_)de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/