perl-unicode

Re: removing accents

2004-01-02 04:30:04
On 28 Dec 03, at 04:45, SADAHIRO Tomoyuki wrote:

On Sat, 27 Dec 2003 13:30:19 +0100
Eric Cholet <cholet(_at_)logilune(_dot_)com> wrote:

Here's another naive question from a Unicode newbie:
Is there a way, using Perl's Unicode support, to remove
accents from a string? I looked at \pM but can't figure
out how it works; I wasn't able to match anything with it.

Thanks,
--
Eric Cholet

Hello.
There are some earlier threads on this issue.
The ones I found are as follows.

* http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2003-05/msg00016.html
* http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2001-12/msg00004.html

I hope something there can help you.

==
P.S. UTR #30, Character Foldings, describes two concepts for removing accents.
[cf. http://www.unicode.org/reports/tr30/ ]

One is "accent removal", and
the other is "diacritic removal (includes stroke, hook, descender)".

Accent removal relies on canonical decomposition, so characters that
have no canonical decomposition, including Eth ("Ð", U+00D0),
O with stroke ("Ø", U+00D8), c with curl (U+0255,
cf. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0255 ),
and d with hook (U+0257,
cf. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0257 ),
are not transformed.
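
As a rough illustration of the point above, the following sketch checks
whether a character changes under canonical decomposition. It assumes
Unicode::Normalize is installed; the helper name
has_canonical_decomposition is made up for this example and is not part
of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if the character changes under NFD,
# i.e. it has a canonical decomposition.
sub has_canonical_decomposition {
    my ($char) = @_;
    return NFD($char) ne $char;
}

# Code points are written as escapes so the source encoding does not matter.
my @samples = (
    [ 'e with acute (U+00E9)',  "\x{E9}"  ],
    [ 'Eth (U+00D0)',           "\x{D0}"  ],
    [ 'O with stroke (U+00D8)', "\x{D8}"  ],
    [ 'c with curl (U+0255)',   "\x{255}" ],
    [ 'd with hook (U+0257)',   "\x{257}" ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-24s %s\n", $name,
        has_canonical_decomposition($char) ? 'decomposes' : 'unchanged';
}
```

Only the first sample should be reported as decomposing; the other four
are the characters named above as non-decomposable.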

Though "diacritic removal" is provisional and its definition has not
been specified yet, I suppose it would map "Ø" to "O", and so on.
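
Since that folding is not yet defined, anything concrete here is a guess.
As a sketch only, a hand-written table could cover the letters that
decomposition leaves alone. The table contents and the name
fold_diacritics are assumptions for illustration, not taken from UTR #30.

```perl
use strict;
use warnings;

# Hypothetical mapping, NOT from UTR #30: guessed base letters for a few
# characters that have no canonical decomposition.
my %fold = (
    "\x{D8}"  => 'O',   # O with stroke
    "\x{F8}"  => 'o',   # o with stroke
    "\x{257}" => 'd',   # d with hook
);

# Replace each character found in the table; leave everything else as is.
sub fold_diacritics {
    my ($text) = @_;
    $text =~ s/(.)/exists $fold{$1} ? $fold{$1} : $1/gse;
    return $text;
}

print fold_diacritics("\x{D8}re"), "\n";   # prints "Ore"
```

A table like this has to be maintained by hand, which is presumably why
a standard folding would be preferable once one is specified.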

Thanks for your detailed reply. I looked into this and found that I
can use Unicode::Normalize to decompose a string into NFD and then
remove the accents with a regex that deletes \pM. I wonder whether I
overlooked a shortcoming in this approach, since you didn't recommend
it even though you are the author of Unicode::Normalize.
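
For reference, a minimal sketch of that approach, assuming
Unicode::Normalize is available. The function name strip_accents is
invented for this example, and the final NFC step is an optional extra
not mentioned above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Decompose, delete every mark (general category M), then recompose the rest.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\pM//g;
    return NFC($decomposed);
}

binmode STDOUT, ':encoding(UTF-8)';

# "deja vu" with e-acute and a-grave, written as escapes.
print strip_accents("d\x{E9}j\x{E0} vu"), "\n";

# \pM only matches once the string is decomposed: precomposed U+00E9 is
# a letter, not a mark.
print "precomposed matches \\pM: ", ("\x{E9}"      =~ /\pM/ ? 'yes' : 'no'), "\n";
print "decomposed matches \\pM:  ", (NFD("\x{E9}") =~ /\pM/ ? 'yes' : 'no'), "\n";
```

Two caveats seem worth noting. Characters with no canonical
decomposition, such as "Ø", pass through unchanged. Also, \pM covers all
marks, including those that are integral to some scripts, so narrowing
it to \p{Mn} or to a specific block may be safer depending on the data.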

Thanks,
--
Eric Cholet
