perl-unicode

Re: Comparing inputs with source strings

2016-05-10 23:48:56
On 05/09/2016 08:53 AM, Daniel Dehennin wrote:
Hello,

I tried to make my Perl5 code Unicode-compliant after reading a post on
Stack Overflow[1].

As suggested in the post:

     “always run incoming stuff through NFD and outbound stuff from NFC.”

I had a hard time figuring out why my Test::More test was failing while
displaying exactly the same strings for “got” and “expected”.

I finally checked how UTF-8 sources are handled and found that they are in
NFC form. I ran the following script:

#+begin_src perl
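#!/usr/bin/env perl

use strict;
use utf8;
use warnings;

use Test::More;
use Unicode::Normalize;

my $unistring = 'C’est une chaîne unicode';

# Map each form name to its function instead of calling by symbolic name,
# which is not allowed under "use strict".
my %normalize = (
        NFD  => \&NFD,
        NFC  => \&NFC,
        NFKD => \&NFKD,
        NFKC => \&NFKC,
);

for my $form (qw(NFD NFC NFKD NFKC)) {
        # A string equals its normalized copy only if it is already in that form.
        if ($unistring eq $normalize{$form}->($unistring)) {
                print "UTF-8 source is in form '$form'\n";
        }
}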
#+end_src

and got:

#+begin_src
UTF-8 source is in form 'NFC'
UTF-8 source is in form 'NFKC'
#+end_src

So Test::More::is_deeply was comparing an input in NFD with an expected
string in NFC.
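
A minimal sketch of why “got” and “expected” can look identical on screen and still differ: the two forms render the same, but they do not contain the same code points. The helper name `codepoints` is made up for illustration, and the word is written with an escape so the example does not depend on how an editor saved the file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

# "chaîne" with a precomposed î (U+00EE), i.e. already in NFC.
my $word = "cha\x{EE}ne";

my $nfc = NFC($word);
my $nfd = NFD($word);    # î becomes i (U+0069) + combining circumflex (U+0302)

binmode STDOUT, ':encoding(UTF-8)';
print "NFC: $nfc (", length($nfc), " chars) ", codepoints($nfc), "\n";
print "NFD: $nfd (", length($nfd), " chars) ", codepoints($nfd), "\n";
print "eq?  ", ($nfc eq $nfd ? "yes" : "no"), "\n";
```

The NFD string is one character longer than the NFC one, so `eq` (and therefore `is` or `is_deeply`) reports a mismatch even though a terminal draws both strings the same way.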

My code can use Unicode::Collate, but for all the code I did not write I
wonder if there is a way to handle it cleanly.

Or maybe I'm doing something wrong?

I'm afraid that when it comes to normalization in Perl5, you have to do it yourself. I hear that Perl6 is much friendlier in this regard, but I have no personal experience with it. Your $unistring is in whatever normalization form it had when you typed it into your editor, or whatever form your editor turned it into as you typed. You could have typed it in NFD, but the most natural way to enter text on a keyboard will most likely produce NFC underneath.
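
As a rough sketch of what "doing it yourself" can look like in a test, both sides can be brought to the same form before they are compared. The helper `is_nfc` is hypothetical (it is not part of Test::More), and the choice of NFC as the common form is only an assumption; NFD would work equally well as long as both sides use it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

use Test::More;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper (not part of Test::More): compare two strings
# after normalizing both of them to NFC.
sub is_nfc {
    my ($got, $expected, $name) = @_;
    # Report failures at the caller's line rather than inside this helper.
    local $Test::Builder::Level = $Test::Builder::Level + 1;
    return is(NFC($got), NFC($expected), $name);
}

# Expected string written with escapes so it is NFC regardless of the editor.
my $expected = "C\x{2019}est une cha\x{EE}ne unicode";
# Simulates input that was decomposed to NFD on the way in.
my $input = NFD($expected);

isnt($input, $expected, 'raw strings differ code point by code point');
is_nfc($input, $expected, 'strings match once both are normalized');

done_testing();
```

For nested data passed to is_deeply, the same idea would need a walk over the structure that normalizes every string before the comparison; that part is left out here.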

Normalization is tricky, and the Unicode Consortium has had to modify things years after they were first specified, because no one could reasonably implement what was expected. I may tackle making normalization more developer-friendly in future Perl5 versions, but not in the next couple of years.

Regards.

Footnotes:
[1] https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default


