perl-unicode

Re: [perl #22111] perl::Encode doesn't handle UTF-8 NFD strings

2003-05-06 05:30:06
On Tuesday, May 6, 2003, at 06:20  AM, debianbugs(_at_)j3e(_dot_)de (via RT) 
wrote:
# New Ticket Created by  debianbugs(_at_)j3e(_dot_)de
# Please include the string:  [perl #22111]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=22111 >



This is a bug report for perl from debianbugs(_at_)j3e(_dot_)de,
generated with the help of perlbug 1.34 running under perl v5.8.0.


-----------------------------------------------------------------
[Please enter your report here]

The encode function of perl is not able to convert from UTF-8 which is in normalization form D (NFD). Normalization is handled by Unicode::Normalize. To use encode one has to use the workaround
from_to(encode_utf8(NFC(decode_utf8($string))),"utf8","...")
but encode should correctly treat NFD encoded strings.

Bjoern

If perl were an application like, say, a word processor, I would agree that perl and Encode should handle normalization internally and transparently so that "canonically equivalent" strings compare as equal. But perl is a PROGRAMMING LANGUAGE, so you have to be able to treat different things (even if they are equivalent Unicode-wise) differently by default. Otherwise you couldn't even implement a new normalization in perl. So I do not consider this a bug, since perl 5.8 comes with both Encode and Unicode::Normalize.
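
To illustrate the point, here is a minimal sketch. The sample strings and variable names are my own choice for illustration, not taken from the report. It shows that two canonically equivalent strings are different as far as perl is concerned until you normalize them explicitly:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "e with acute" in two canonically equivalent forms (sample data)
my $nfc = "\x{E9}";     # U+00E9, precomposed
my $nfd = "e\x{301}";   # U+0065 followed by U+0301, decomposed

# Compared as-is, the two strings differ (1 character vs. 2 characters).
my $raw_equal  = ($nfc eq $nfd)      ? 1 : 0;
# After explicit normalization to NFC, they compare as equal.
my $norm_equal = (NFC($nfd) eq $nfc) ? 1 : 0;

print "raw: $raw_equal, normalized: $norm_equal\n";
```

Which of the two comparisons you want depends on the program, which is why perl leaves the choice to you.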

If you want to do it transparently, you can always use Encode::Encoding to implement your own encoding. Here is an example.

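package Encode::UTF8::NFD;
use strict;
use warnings;
use base qw(Encode::Encoding);
use Encode ();
use Unicode::Normalize qw(NFC NFD);
__PACKAGE__->Define('utf8-nfd');

# Octets to characters: decode as UTF-8, then normalize to NFD.
sub decode($$;$){
        my ($obj, $str, $chk) = @_;
        # Fully qualified: a bare decode() here would call this very sub.
        $str = NFD(Encode::decode('utf8' => $str));
        $_[1] = '' if $chk; # this is what in-place edit means
        return $str;
}

# Characters to octets: normalize to NFC, then encode as UTF-8.
sub encode($$;$){
        my ($obj, $str, $chk) = @_;
        # Fully qualified for the same reason as in decode() above.
        $str = Encode::encode('utf8' => NFC($str));
        $_[1] = '' if $chk; # this is what in-place edit means
        return $str;
}

1;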

Normalization is not an "easy thing that should be done easily". It is definitely a "hard thing that should be possible" and it is possible already.
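
For the case in the report, the explicit version can be sketched as follows. The target encoding is elided in the report, so iso-8859-1 here is only an assumption for the sake of the example, as are the sample octets and variable names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# UTF-8 octets for "e" + U+0301 COMBINING ACUTE ACCENT, i.e. NFD (sample data)
my $octets = "e\xCC\x81";

# Step 1: octets to characters. Encode does no normalization here.
my $chars = decode('UTF-8', $octets);

# Step 2: compose explicitly, then encode to the (assumed) target encoding.
my $latin1 = encode('iso-8859-1', NFC($chars));

print unpack('H*', $latin1), "\n";   # hex dump of the resulting octets
```

Without the NFC step the combining accent has no representation in iso-8859-1, so the normalization has to be requested by the programmer.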

Dan the Encode Maintainer