On Tuesday, May 6, 2003, at 06:20 AM, debianbugs(_at_)j3e(_dot_)de (via RT)
wrote:
# New Ticket Created by debianbugs(_at_)j3e(_dot_)de
# Please include the string: [perl #22111]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=22111 >
This is a bug report for perl from debianbugs(_at_)j3e(_dot_)de,
generated with the help of perlbug 1.34 running under perl v5.8.0.
-----------------------------------------------------------------
[Please enter your report here]
The encode function of perl is not able to convert from UTF-8 that is
in normalization form D (NFD). Normalization is handled by
Unicode::Normalize. To use encode, one has to use the workaround
from_to(encode_utf8(NFC(decode_utf8($string))),"utf8","...")
but encode should treat NFD-encoded strings correctly.
Bjoern
If perl were an application like, say, a word processor, I would agree
that perl and Encode should handle normalization internally and
transparently so that "canonically-equivalent" strings compare as equal.
But perl is a PROGRAMMING LANGUAGE, so by default it has to be able to
treat different things (even ones that may be equivalent Unicode-wise)
differently. Otherwise you couldn't even implement a new normalization
in perl. So I do not consider this a bug, since perl 5.8 comes with both
Encode and Unicode::Normalize.
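As a minimal sketch of combining the two modules explicitly (the sample string and the choice of iso-8859-1 as the target are assumptions made only for illustration), decode the octets, normalize the characters, then encode:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# "e" followed by U+0301 COMBINING ACUTE ACCENT, as UTF-8 octets (NFD form).
my $nfd_octets = "e\xCC\x81";

# Decoding keeps the string exactly as it arrived: two characters.
my $chars = decode('UTF-8', $nfd_octets);

# Normalization is a separate, explicit step: NFC composes the pair
# into the single character U+00E9.
my $composed = NFC($chars);

# Only the composed form has a one-octet Latin-1 representation.
my $latin1 = encode('iso-8859-1', $composed);

print join(' ', map { sprintf 'U+%04X', ord } split //, $chars), "\n";
print join(' ', map { sprintf 'U+%04X', ord } split //, $composed), "\n";
```

The first line printed is the decomposed sequence and the second the composed one, which is why the two strings do not compare as equal unless you normalize them yourself.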
If you want to do it transparently, you can always use Encode::Encoding
to implement your own encoding. Here is an example.
package Encode::UTF8::NFD;
use strict;
use base qw(Encode::Encoding);
use Encode ();    # no imports: this package defines its own encode/decode
use Unicode::Normalize;

__PACKAGE__->Define('utf8-nfd');

# Call Encode::decode/Encode::encode by full name; a bare decode() or
# encode() here would call the methods below and recurse.
sub decode($$;$){
    my ($obj, $str, $chk) = @_;
    $str = NFD(Encode::decode('utf8' => $str));
    $_[1] = '' if $chk; # this is what in-place edit means
    return $str;
}

sub encode($$;$){
    my ($obj, $str, $chk) = @_;
    $str = Encode::encode('utf8' => NFC($str));
    $_[1] = '' if $chk; # this is what in-place edit means
    return $str;
}

1;
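For what it is worth, a rough sketch of how such an encoding would be used once defined. The package is repeated inline in compact form only so that the snippet runs on its own (normally it would live in Encode/UTF8/NFD.pm and be loaded with "use"), and the sample strings are assumptions:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Inline copy of the encoding so this snippet is self-contained.
{
    package Encode::UTF8::NFD;
    use base qw(Encode::Encoding);
    use Unicode::Normalize qw(NFC NFD);
    __PACKAGE__->Define('utf8-nfd');

    sub decode {
        my ($obj, $str, $chk) = @_;
        my $chars = NFD(Encode::decode('utf8' => $str));
        $_[1] = '' if $chk;
        return $chars;
    }

    sub encode {
        my ($obj, $str, $chk) = @_;
        my $octets = Encode::encode('utf8' => NFC($str));
        $_[1] = '' if $chk;
        return $octets;
    }
}

# "e" + U+0301 goes in decomposed; NFC composes it before UTF-8 encoding.
my $octets = encode('utf8-nfd', "e\x{301}");

# Precomposed U+00E9 (UTF-8 octets C3 A9) comes back decomposed.
my $chars = decode('utf8-nfd', "\xC3\xA9");

print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
print join(' ', map { sprintf 'U+%04X', ord } split //, $chars), "\n";
```

Once Define has run, the name 'utf8-nfd' works anywhere an encoding name is accepted by encode and decode, so the normalization stays out of the calling code.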
Normalization is not an "easy thing that should be done easily". It is
definitely a "hard thing that should be possible" and it is possible
already.
Dan the Encode Maintainer