Re: Make Encode.pm support the real UTF-8

Bjoern Hoehrmann <derhoermi(_at_)gmx(_dot_)net> writes:

* Gisle Aas wrote:

As you probably know perl's version of UTF-8 is not the real thing.  I
thought I would hack up a patch to support the encoding as defined by
Unicode.  That involves rejecting illegal chars (like surrogates,
"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
and such.


I would very much like to have this functionality available in some
standard module. Though, what do you mean here by rejecting exactly?


It would do the same as it currently does for illegal UTF-8-Perl.  It
falls back to what the CHECK argument asks for.

For example, by default, I would expect

  decode("UTF-8" => "Bj\xF6rn")

to return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would
this change (i.e., would it croak instead)?


It would be exactly the same.
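
To make the role of the CHECK argument concrete, here is a minimal
sketch against the Encode interface as it exists today.  The variable
names are only for illustration; the copy is there because decode()
may modify its argument when CHECK is set.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "Bj\xF6rn";    # Latin-1 bytes, not well-formed UTF-8

# Default CHECK (FB_DEFAULT): the bad byte is replaced by U+FFFD.
my $substituted = decode("UTF-8", $octets);

# CHECK = FB_CROAK: the same input raises an exception instead.
my $copy    = $octets;      # decode() may modify its argument when CHECK is set
my $croaked = eval { decode("UTF-8", $copy, Encode::FB_CROAK) };
my $error   = $@;

print "default CHECK gave U+FFFD\n" if $substituted eq "Bj\x{FFFD}rn";
print "FB_CROAK died: $error"       if !defined $croaked;
```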

More interesting is:

   decode("UTF8", "Bj\xEF\xBF\xBFrn")

where "\xEF\xBF\xBF" is not legal UTF-8 because "\x{FFFF}" is not
legal Unicode.  Either the whole sequence "\xEF\xBF\xBF" is replaced
by a single "\x{FFFD}", or each bad byte is replaced, giving us
"Bj\x{FFFD}\x{FFFD}\x{FFFD}rn".  I think the latter is more sane,
especially once you hit Perl's 64-bit extension to UTF-8.
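
The two candidate results can be written out by hand.  This is only a
sketch using plain substitutions to show the difference in output; it
does not call Encode and says nothing about which behaviour a given
Encode release implements.

```perl
use strict;
use warnings;

my $input = "Bj\xEF\xBF\xBFrn";    # octets that would decode to U+FFFF

# Option 1: the whole three-byte sequence becomes one U+FFFD.
(my $per_sequence = $input) =~ s/\xEF\xBF\xBF/\x{FFFD}/g;

# Option 2: every offending byte becomes its own U+FFFD.
(my $per_byte = $input) =~ s/[\x80-\xFF]/\x{FFFD}/g;

printf "per sequence: %d chars, per byte: %d chars\n",
    length $per_sequence, length $per_byte;
```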

Before I do this I would like to get some feedback on the interface.
My preferred interface would be to make:

  encode("UTF-8", $string)

imply the official restricted form and then have

  encode("UTF-8-Perl", $string)

be used as the name for Perl's relaxed and extended version of the
encoding.  The encode_utf8($string) function would continue to be the
same as encode("UTF-8-Perl", $string).


I would prefer there was no semantic overloading of "UTF-8" at all.
I generally expect that anything called UTF-8 refers to UTF-8 as
defined in the Unicode standard or RFC 3629. I was for example
surprised that Encode::is_utf8(...) considers sequences UTF-8 that are
not UTF-8 as defined in those specifications (the documentation
explicitly states "well-formed UTF-8").


This can be fixed by fixing the documentation.  It might be possible
to get away with making a distinction between 'utf8' and 'UTF-8', the
former being the Perl variant, while we reserve the uppercase form
with the dash for the real UTF-8.
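
For reference, Encode releases that adopted this kind of split resolve
the two spellings to different encodings.  The sketch below assumes
such a release; the surrogate is just one convenient example of input
that is acceptable to the Perl variant but not to RFC 3629.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Canonical names that Encode resolves each spelling to.
my $strict_name = find_encoding("UTF-8")->name;
my $lax_name    = find_encoding("utf8")->name;

# UTF-8-style octets for the surrogate U+D800, which RFC 3629 forbids.
my $octets = "\xED\xA0\x80";
my $lax    = decode("utf8",  $octets);   # expected to pass the surrogate through
my $strict = decode("UTF-8", $octets);   # falls back to the default CHECK handling

print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";
print "strict decode kept the surrogate\n" if index($strict, chr(0xD800)) >= 0;
```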

Now that we have this problem, introducing more places where one needs
to carefully check the documentation to see what is considered UTF-8
does not seem like the best option. Having decode_utf8() and
decode(utf8=>...) mean something different is likely to cause
confusion. Maybe this could go the other way round, i.e. introduce a
new encoding "UTF-8-Strict" or something.


This is certainly more backwards compatible, but do we really want
perl applications to exchange illegal UTF-8 by default?

This implies that encode("UTF-8", $string) can start failing while
previously it could not.


As above, by default I do not think it should fail but rather use a
replacement character instead of croaking.


Yes.  By failing I mean: handle the bad bytes as specified by the
CHECK argument.

The result should be the
same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)

  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

where decode("RFC-3629-UTF-8") would always return an RFC-3629-UTF-8
string with no illegal sequences (and as that should not fail, the
above should not fail either). I.e.

  encode("RFC-3629-UTF-8" => $string) eq
  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

would always hold true (assuming that decode("RFC-3629-UTF-8") would
ignore that the UTF-8 flag on $string is already set and decode
"again").

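The property above amounts to saying that sanitized output is a fixed
point of a decode/encode round trip.  A minimal sketch, using today's
"UTF-8" name as a stand-in for the hypothetical "RFC-3629-UTF-8"
encoding and starting from raw octets rather than a flagged string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "Bj\xF6rn";    # contains one byte that is not valid UTF-8

# First pass: the bad byte is replaced by U+FFFD, then re-encoded.
my $once = encode("UTF-8", decode("UTF-8", $octets));

# Second pass over the cleaned octets should change nothing.
my $twice = encode("UTF-8", decode("UTF-8", $once));

print "round trip is stable\n" if $once eq $twice;
```
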
Other suggestions or comments?


There should be a corresponding is_foo function that checks whether
a sequence of octets (or a string with the UTF-8 flag set) is actually
UTF-8 as defined in the relevant specifications, maybe by adding one
more argument to Encode::is_utf8 like

  Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)


Agree.
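
Until such a check exists, something close to it can be built on
decode() with FB_CROAK.  The helper name is_strict_utf8 is hypothetical
and not part of Encode; the sketch only handles octet strings and
assumes an Encode in which "UTF-8" means the strict form.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $octets is well-formed strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() may modify its argument when CHECK is set
    return eval { decode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $sample ("Bj\xC3\xB6rn", "Bj\xF6rn", "\xC0\xAF") {
    printf "%-14s %s\n",
        join("", map { sprintf "%02X", ord } split //, $sample),
        is_strict_utf8($sample) ? "ok" : "not UTF-8";
}
```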

Regards,
Gisle