Re: "Removing Accents" from unicode strings


agnarr(_at_)c2i(_dot_)net said:

I need to convert strings obtained from a mysql database in utf8 format
into a file format to be uploaded to specific hardware (specifically
GPS's).  Some of these formats may only allow unaccented characters, so I
need a way to convert accented characters into their respective base
characters, e.g. a unicode accented 'o' into ASCII 'o', an accented 'a'
into 'a' and so forth.

Is there an easy way to do this in Perl?


There's a prior thread on this list about this very topic:

http://www.mail-archive.com/perl-unicode@perl.org/msg02000.html

Also, I've posted a couple different approaches on www.perlmonks.org -- 
here's my favorite:

#!/usr/bin/perl -CDS

use strict;
require 5.008;

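# Pull the Latin letter names out of Perl's internal character-name table.
# This assumes each line starts with a four-digit hex code point and has a
# tab before the name; unicore/Name.pl is internal to Perl, so its layout
# may differ in other Perl versions.
my @charnames = grep /\tLATIN \S+ LETTER/, split( /^/, do 'unicore/Name.pl' );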

my %accents;

for my $c ( split //, qq/AEIOUCNYaeioucny/ ) {
    my $case = ( $c eq lc $c ) ?  'SMALL' : 'CAPITAL';
    $accents{$c} =
          join( '', map { chr hex( substr $_, 0, 4 ) }
                grep /\tLATIN $case LETTER \U$c WITH/, @charnames );
}

# now use each element of %accents as a character class:

while (<>) {
    for my $c ( keys %accents ) {
        s/[$accents{$c}]/$c/g;
    }
    print;
}

__END__

Another way would be to simply hard-code a set of "tr/..././" steps, one
for each lower-case and upper-case unaccented letter (placed on the right),
with all its accented variants on the left.  Tedious to code, but very fast
at run-time.
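
A minimal sketch of that hard-coded approach, assuming only the accented
letters in the Latin-1 range matter for your data.  The strip_accents name
and the particular code points are just for illustration (they are not
taken from the PerlMonks posts), and anything above U+00FF would need more
entries on the left-hand sides.  The accented letters are written as \x{..}
escapes so the script survives mail software that mangles non-ASCII text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one tr/// per unaccented base letter, with its
# accented Latin-1 variants on the left.  When the right-hand list is
# shorter, tr/// repeats its last character, so every variant on the
# left maps to the single base letter on the right.
sub strip_accents {
    my ($s) = @_;
    for ($s) {
        tr/\x{C0}-\x{C5}/A/;          # A with grave .. A with ring
        tr/\x{C7}/C/;                 # C with cedilla
        tr/\x{C8}-\x{CB}/E/;          # E with grave .. E with diaeresis
        tr/\x{CC}-\x{CF}/I/;          # I with grave .. I with diaeresis
        tr/\x{D1}/N/;                 # N with tilde
        tr/\x{D2}-\x{D6}\x{D8}/O/;    # O with grave .. diaeresis, O with stroke
        tr/\x{D9}-\x{DC}/U/;          # U with grave .. U with diaeresis
        tr/\x{DD}/Y/;                 # Y with acute
        tr/\x{E0}-\x{E5}/a/;          # lower-case counterparts follow
        tr/\x{E7}/c/;
        tr/\x{E8}-\x{EB}/e/;
        tr/\x{EC}-\x{EF}/i/;
        tr/\x{F1}/n/;
        tr/\x{F2}-\x{F6}\x{F8}/o/;
        tr/\x{F9}-\x{FC}/u/;
        tr/\x{FD}\x{FF}/y/;           # y with acute, y with diaeresis
    }
    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("\x{C5}ngstr\x{F6}m caf\x{E9}"), "\n";   # prints "Angstrom cafe"
```

The input has to be a decoded character string (for example, read through
a UTF-8 input layer) rather than raw UTF-8 bytes, since tr/// works one
character at a time.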

        Dave Graff