Re: removing accents


On Fri, 2 Jan 2004 11:56:12 +0100
Eric Cholet <cholet(_at_)logilune(_dot_)com> wrote:

Thanks for your detailed reply. I looked into this and found that I
can use Unicode::Normalize to decompose a string into NFD form and then
remove the accents with a regex that removes \pM. I wonder if I overlooked
a shortcoming in this approach, since you didn't recommend it although
you are the author of Unicode::Normalize.


I'm afraid that taking NFD and then removing \pM characters
(remove_accent() as below) would remove too much: it strips every
combining mark, not only accents.

For example, it replaces '≠' (U+2260, <NOT EQUAL TO>) with '=' (U+003D,
<EQUALS SIGN>), since the mathematical "negation slash" is encoded as U+0338
 <COMBINING LONG SOLIDUS OVERLAY>, which is a mark and is therefore removed.
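
To make the decomposition concrete, here is a small sketch that lists the
code points NFD produces for U+2260 and checks that U+0338 matches \pM.
The variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Decompose U+2260 NOT EQUAL TO and list the resulting code points.
my $decomposed  = NFD("\x{2260}");
my @code_points = map { sprintf("U+%04X", ord($_)) } split //, $decomposed;

# U+0338 is a combining mark, so it matches \pM.
my $overlay_is_mark = ("\x{0338}" =~ /\pM/) ? 1 : 0;

print "@code_points\n";                            # U+003D U+0338
print "U+0338 matches \\pM: $overlay_is_mark\n";   # U+0338 matches \pM: 1
```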

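use Unicode::Normalize;    # exports NFD by default

sub remove_accent {
    my $s = NFD(shift);    # canonical decomposition: base characters + marks
    $s =~ s/\pM//g;        # delete every mark (\pM), not only accents
    return $s;
}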
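
If only a known set of accents has to go, one possible workaround is to
delete just those combining marks and recompose the rest with NFC. The
sketch below is an assumption on my part, not something Unicode::Normalize
provides: remove_listed_accents() is a hypothetical name, and the list of
marks is hand-picked for illustration and would need adjusting for real data.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper (not part of Unicode::Normalize): strips only a
# hand-picked set of combining marks and leaves all other marks alone.
sub remove_listed_accents {
    my $s = NFD(shift);
    # U+0300 grave, U+0301 acute, U+0302 circumflex, U+0303 tilde,
    # U+0308 diaeresis, U+030A ring above, U+030C caron, U+0327 cedilla
    $s =~ s/[\x{0300}-\x{0303}\x{0308}\x{030A}\x{030C}\x{0327}]//g;
    # Recompose what is left, e.g. "=" + U+0338 goes back to U+2260.
    return NFC($s);
}
```

With this list, remove_listed_accents("caf\x{e9} \x{2260}") returns
"cafe \x{2260}": the acute accent is dropped while the NOT EQUAL TO sign
survives. Note that the result is always in NFC, whatever form the input had.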

Regards,
SADAHIRO Tomoyuki
