perl-i18n

Re: Stripping out Unicode combining characters (diacritics)

2008-05-08 04:16:02
Just to throw this out there: you may be interested in Text::Unidecode
(http://search.cpan.org/~sburke/Text-Unidecode-0.04/) if your ultimate
goal is to represent a Unicode character with its closest ASCII
(or perhaps I should say, "romanized") equivalent.

-- Brad

On Wed, May 7, 2008 at 9:51 AM, Doran, Michael D <doran(_at_)uta(_dot_)edu> wrote:

I received a number of helpful suggestions and solutions.  The approach I
decided to adopt in my larger script is to 'decode' all the incoming form
input as UTF-8, as well as the input from the database that I'll be matching
the form input against.  Decoding turns the raw bytes into Perl character
strings, which is what allows the '\p{M}' syntax to work as expected in a
Perl regexp.  In my test.cgi script for form input it would look like this:

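#!/usr/local/bin/perl
use strict;
use warnings;
use CGI;
use Encode;
my $query = CGI->new;
# Decode the raw UTF-8 bytes of the form field into a character string
my $search_term = decode('UTF-8', scalar $query->param('text'));
my $sans_diacritics  = $search_term;
# Remove every run of combining marks
$sans_diacritics =~ s/\pM+//g;
# Encode character strings back to UTF-8 bytes on output
binmode(STDOUT, ':encoding(UTF-8)');
print qq(Content-type: text/plain; charset=utf-8

search_term     is $search_term
sans_diacritics is $sans_diacritics
);
exit(0);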
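
To see why the decode step matters, here is a minimal sketch that runs
outside CGI.  It assumes the form delivers "Bartók" as the UTF-8 bytes for
"o" followed by U+0301 (0xCC 0x81); the variable names are only for
illustration and are not part of the script above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Bartok" with a combining acute accent, as raw UTF-8 bytes:
# U+0301 is encoded as the two bytes 0xCC 0x81.
my $bytes = "Barto\xCC\x81k";

# Undecoded: Perl sees two separate characters, U+00CC and U+0081.
# Neither is a mark, so \p{M} has nothing to match.
(my $from_bytes = $bytes) =~ s/\p{M}+//g;

# Decoded: the two bytes become the single character U+0301,
# which belongs to \p{M} and is removed.
my $chars = decode('UTF-8', $bytes);
(my $from_chars = $chars) =~ s/\p{M}+//g;

printf "bytes: length %d, after strip %d\n", length $bytes, length $from_bytes;
printf "chars: length %d, after strip %d\n", length $chars, length $from_chars;
```

The byte string stays at length 8, while the decoded string goes from 7
characters to 6 ("Bartok").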

I'm slowly figuring out how to work with Unicode in my web scripts, but
still have a lot to learn.  Thanks for all the help. :-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# doran(_at_)uta(_dot_)edu
# http://rocky.uta.edu/doran/


-----Original Message-----
From: Doran, Michael D [mailto:doran(_at_)uta(_dot_)edu]
Sent: Monday, May 05, 2008 7:27 PM
To: perl-i18n(_at_)perl(_dot_)org
Cc: Perl4lib
Subject: Stripping out Unicode combining characters (diacritics)

I'm trying to strip out combining diacritics from some form
input using this code:

<html>
<head>
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
  <form action="test.cgi" accept-charset="UTF-8" method="get">
    <input type="text" name="text" value="" size="10">
    <input type="submit" value="submit">
  </form>
</body>
</html>

#!/usr/local/bin/perl
use CGI;
$query = CGI::new();
$search_term = $query->param('text');
$sans_diacritics  = $search_term;
$sans_diacritics  =~ s/\p{M}*//g;
#$sans_diacritics  =~ s/o//g;
print qq(Content-type: text/plain; charset=utf-8

$sans_diacritics
);
exit(0);


In the form, I'm inputting the string "Bartók" with the
accented character being a base character (small Latin letter
"o") followed by a combining acute accent.  However, when I
print (to the web) $sans_diacritics, I get my input with no
change -- the combining diacritic is still there.  I know
that my input is not a precomposed accented character,
because I can strip out the base "o" and the combining accent
either stands alone or jumps to another character [2].
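
A less roundabout way to check whether input is precomposed or decomposed
is to list its code points.  The sketch below is only an illustration:
code_points and strip_marks are hypothetical helper names, and it assumes
the strings are already character data.  It also uses the core
Unicode::Normalize module to decompose first, so that a precomposed
"ó" (U+00F3) loses its accent too, which \p{M} alone would not achieve.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: list a string's code points as U+XXXX.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# Hypothetical helper: decompose, then drop all combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{M}+//g;
    return $decomposed;
}

my $decomposed  = "Barto\x{0301}k";   # "o" followed by combining acute
my $precomposed = "Bart\x{00F3}k";    # single precomposed o-acute

print code_points($decomposed),  "\n";
print code_points($precomposed), "\n";
print strip_marks($decomposed),  "\n";
print strip_marks($precomposed), "\n";
```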

The "\p{M}" is a Unicode class name for the character class
of Unicode 'marks', for example accent marks [1].  I've tried
these variations (and many others) and none seem to be doing
what I want:

       $sans_diacritics =~ s#[\p{Mark}]*##g;
       $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
       $sans_diacritics =~ tr#[\p{M}]##;
       $sans_diacritics =~ s/\p{M}*//g;
       $sans_diacritics =~ s#[\p{M}]##g;
       $sans_diacritics =~ s#\x{0301}##g;
       $sans_diacritics =~ s#\x{006F}\x{0301}##g;
       $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
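
For comparison, here is a sketch of how variations like these behave once
the input is a character string rather than raw bytes (the test string is
an assumption standing in for decoded form input).  Note that tr/// takes a
literal character list or range, not a regex class such as \p{M}, and only
deletes when given the /d modifier, so the tr attempts would need rewriting
in any case.

```perl
use strict;
use warnings;

# Stand-in for decoded input: "o" followed by combining acute (U+0301).
my $text = "Barto\x{0301}k";

# s/// with the Unicode property class.
(my $via_class = $text) =~ s/\p{M}+//g;

# s/// with the explicit Combining Diacritical Marks range.
(my $via_range = $text) =~ s/[\x{0300}-\x{036F}]+//g;

# s/// with the single code point.
(my $via_code = $text) =~ s/\x{0301}//g;

# tr/// needs a literal range plus /d to delete anything.
(my $via_tr = $text) =~ tr/\x{0300}-\x{036F}//d;

print "$_\n" for $via_class, $via_range, $via_code, $via_tr;
```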

I'm pulling my hair out on this... so any help would be
appreciated.  If there's any other info I can provide, let me know.

My Perl version is 5.8.8 and the script is running on a
server running Solaris 9.

-- Michael

[1] per http://perldoc.perl.org/perlretut.html and other documentation

[2] using $sans_diacritics  =~ s/o//g;
