perl-unicode

Re: Make Encode.pm support the real UTF-8

2004-12-02 07:30:06
* Gisle Aas wrote:
More interesting is:

  decode("UTF8", "Bj\xEF\xBF\xBFrn")

where "\xEF\xBF\xBF" is not legal UTF-8 because "\x{FFFF}" is not
legal Unicode.  Either the whole sequence "\xEF\xBF\xBF" is replaced
by "\x{FFFD}", or each bad byte is replaced, giving us
"Bj\x{FFFD}\x{FFFD}\x{FFFD}rn".  I think the latter will be more sane,
especially when you hit perl's 64-bit extension to UTF-8.
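
To make the two options concrete, here is a minimal sketch. It is not
how Encode.pm behaves or is implemented; replace_ffff() is a
hypothetical helper that only recognises the three-byte form of U+FFFF
and assumes the rest of the input is plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: substitute U+FFFD for the
# UTF-8 byte sequence of U+FFFF ("\xEF\xBF\xBF") in a byte string.
# $per_byte false: one U+FFFD for the whole sequence.
# $per_byte true:  one U+FFFD for each of the three bytes.
sub replace_ffff {
    my ($bytes, $per_byte) = @_;
    my $count = $per_byte ? 3 : 1;
    (my $out = $bytes) =~ s/\xEF\xBF\xBF/"\x{FFFD}" x $count/ge;
    return $out;
}

# Render a string as a list of code points so the output stays ASCII.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $input = "Bj\xEF\xBF\xBFrn";
print codepoints(replace_ffff($input, 0)), "\n";
# U+0042 U+006A U+FFFD U+0072 U+006E
print codepoints(replace_ffff($input, 1)), "\n";
# U+0042 U+006A U+FFFD U+FFFD U+FFFD U+0072 U+006E
```

The per-byte variant needs no knowledge of how long the bad sequence
was meant to be, which is presumably why it looks saner once longer
extended sequences are in play.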

I think it should do whatever comes closest to the requirements or
suggestions in Unicode or RFC 3629; I am not sure what that would be
though.

Now that we have this problem, introducing more places where one needs
to check the documentation carefully to see what is considered UTF-8
does not seem like the best option; having decode_utf8() and
decode(utf8=>...) mean something different is likely to cause
confusion. Maybe this could go the other way round, i.e. introduce a
new encoding "UTF-8-Strict" or something.

This is certainly more backwards compatible, but do we really want
perl applications to exchange illegal UTF-8 by default?

Hmm, maybe I should ask why you proposed to keep the old behavior of
encode_utf8 in the first place? The change would make more sense to
me if both encode("UTF-8" => ...) and encode_utf8(...) were changed.
-- 
Björn Höhrmann · mailto:bjoern(_at_)hoehrmann(_dot_)de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/