perl-unicode

Re: Matching encoded strings and file names

2005-12-23 12:25:08
At 10:46 am +0100 20/12/05, szpara_ga(_at_)tlen(_dot_)pl wrote:

...Let's say I have a txt file which contains a list of strings. Some of these strings contain characters encoded in this fashion:

R\xC3\xA9union (\xC3\xA9 is one character - e with an accent).

To be more specific, \xC3\xA9 is the escaped form of the two bytes that encode U+00E9 in UTF-8. U+00E9 is the _precomposed_ form of e with an acute accent, a single character.

  I also have a directory which contains many files, some of which
also contain these special characters. What I would like to do
is find any strings from the txt file that match file names in
the directory. I have tried the following. Assuming
'R\xC3\xA9union' is in $in and the current file name from the
directory is in $file:

$in =~ s/\\x(..)/chr(hex($1))/eg;

If you evaluate "\xC3\xA9" then you get _two_ characters, which, when converted internally to UTF-8, will become 4 bytes; not what you want at all! You need first of all to evaluate the whole string, eg:

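        #!/usr/bin/perl
        # Single quotes keep the backslash escapes as literal text.
        $ascii = 'R\xC3\xA9union';
        # Re-evaluate the text as a double-quoted string so that \xC3\xA9
        # becomes the two UTF-8 bytes. Only do this with trusted input.
        $utf8 =  eval qq~"$ascii"~;
        print $utf8;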
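
As a rough illustration of the byte counts involved (a sketch, not part of the original exchange: the variable names are my own, and it assumes the Encode module that ships with Perl 5.8 and later), the following expands the escapes with a substitution rather than eval, then compares the undecoded, decoded and double-encoded forms:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

my $escaped = 'R\xC3\xA9union';

# Turn each \xHH escape into the single byte it names (no eval needed).
(my $bytes = $escaped) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Decode the UTF-8 bytes into a string of characters.
my $chars = decode('UTF-8', $bytes);

# Encoding the undecoded string treats 0xC3 and 0xA9 as two separate
# characters, so each of them becomes two bytes (double encoding).
my $double = encode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, double-encoded bytes: %d\n",
    length($bytes), length($chars), length($double);
```

This should print "bytes: 8, characters: 7, double-encoded bytes: 10": the accented letter is two bytes before decoding, one character after it, and four bytes if it is encoded a second time by mistake.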

But that may not be the end of it because, at least on the Mac, file names do not use the precomposed form of accented characters, so a file named "Réunion" is (as Unicode code points) not "R\x{00E9}union" (where U+00E9 is 0xC3A9 in UTF-8) but "R\x{0065}\x{0301}union".

        #!/usr/bin/perl
        binmode STDOUT, ":utf8";
        # 1. precomposed (U+00E9); 2. decomposed (U+0065 U+0301)
        print "1. R\x{00E9}union" . $/ . "2. R\x{0065}\x{0301}union" . $/;
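
To make the difference between the two forms concrete, here is a small sketch (my own variable names; it assumes the Unicode::Normalize module bundled with Perl 5.8 and later) showing that the strings compare unequal as they stand and equal once both are normalised to the same form:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "R\x{00E9}union";          # e-acute as one code point
my $decomposed  = "R\x{0065}\x{0301}union";  # e followed by combining acute

# The two strings look identical on screen but differ code point by code point.
my $raw_equal = $precomposed eq $decomposed ? "yes" : "no";

# Normalising both sides to the same form makes them comparable.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? "yes" : "no";
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? "yes" : "no";

print "raw strings equal: $raw_equal\n";
print "equal after NFC:   $nfc_equal\n";
print "equal after NFD:   $nfd_equal\n";
printf "lengths: %d vs %d\n", length($precomposed), length($decomposed);
```

The raw comparison should report "no" and both normalised comparisons "yes", with lengths of 7 and 8 characters. It does not matter which form you pick for the comparison, as long as both sides use the same one.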

Others may have some magic solution, but to me it seems you have to convert the escaped original text to the UTF-8 bytes it intends, decode these bytes into characters, and then produce one file with the contents in precomposed form (NFC) and another in decomposed form (NFD). Which of these you use will depend on the normalisation used in the file system.


#!/usr/bin/perl
use strict;
use warnings;
use Encode 'decode';
use Unicode::Normalize;   # exports NFC and NFD

my $xtext = 'R\xC3\xA9union';
my $dir = $ENV{HOME};
my $junk_hex_escaped = "$dir/junk_hex_escaped.txt";
my $junk_precomposed = "$dir/junk_precomposed.txt";
my $junk_decomposed = "$dir/junk_decomposed.txt";

# Write the escaped text to a file to stand in for the list of strings.
open HEXESCAPED, ">", $junk_hex_escaped or die $!;
print HEXESCAPED $xtext;
close HEXESCAPED;

open HEXESCAPED, "<", $junk_hex_escaped or die $!;
open PRECOMPOSED, ">:utf8", $junk_precomposed or die $!;
open DECOMPOSED, ">:utf8", $junk_decomposed or die $!;
while (<HEXESCAPED>){
  # Expand the \xHH escapes into bytes; eval is only safe on trusted input.
  $_ = eval qq~"$_"~;
  # Decode the UTF-8 bytes into characters so that NFC and NFD
  # operate on code points rather than on raw bytes.
  $_ = decode("UTF-8", $_);
  my ($precomposed, $decomposed) = (NFC($_), NFD($_));
  print PRECOMPOSED $precomposed;
  print DECOMPOSED $decomposed;
}
close HEXESCAPED;
close PRECOMPOSED;
close DECOMPOSED;
# Mac only: open both files in TextEdit to compare them.
system "open", "-e", $junk_precomposed, $junk_decomposed;
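
For the matching itself, an alternative to writing out two files is to normalise both the list entries and the directory entries to one form before comparing. The following is only a sketch under stated assumptions: the three subroutine names are hypothetical, and it assumes that readdir returns file names as UTF-8 bytes, which holds on the Mac and on most Linux setups but is not guaranteed everywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: expand \xHH escapes, decode the UTF-8 bytes,
# and normalise the result to NFC.
sub canonical_from_escaped {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return NFC(decode('UTF-8', $bytes));
}

# Hypothetical helper: normalise a file name as returned by readdir,
# assuming the file system hands back UTF-8 bytes.
sub canonical_from_filename {
    my ($name) = @_;
    return NFC(decode('UTF-8', $name));
}

# Return the escaped strings from @$wanted that name a file in $dir.
sub matching_names {
    my ($dir, $wanted) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my %on_disk;
    while (defined(my $name = readdir $dh)) {
        $on_disk{ canonical_from_filename($name) } = 1;
    }
    closedir $dh;
    return grep { $on_disk{ canonical_from_escaped($_) } } @$wanted;
}
```

A call such as matching_names($ENV{HOME}, ['R\xC3\xA9union']) would then return the entry whether the file system stores the name precomposed or decomposed, because both sides are brought to NFC before the lookup.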