perl-unicode

Re: Converting string to UTF-16LE

2004-02-26 11:30:05
Sebastian Lehmann <news(_at_)surf2lemmer(_dot_)de> writes:
Hello,

I use a Perl script to search different files. The search values come
from an HTML page, and the results are displayed on that page too. The files
are saved in UTF-16LE format, so I open them with the following
open command:

   open(F, "<:raw:encoding(UTF-16LE)", $file) || die "Cannot read $file: $!\n";

This works fine and the data is read correctly after opening.

The search value is specified in the HTML page, the URL with the value will
look like the following:

   http://10.0.5.62/search.pl?value=73,98,97,241,101,122

The numbers are the character codes of the search value and are converted
back to a string variable in the Perl script:

   sub decodeString {
       my $sInput = shift;
       my $sOutput = "";
       my @arrChars = split(/,/, $sInput);
       foreach ( @arrChars )
       {
           my $iCharCode = ($_)*1;
           $sOutput .= chr($iCharCode);
       }
       return $sOutput;
   }

For this example the search value will be "Ibañez". Because the search
isn't case-sensitive, all letters should be uppercased using the uc function.
But uc returns different strings for the search value and for the line
read from the UTF-16LE file:

   $sValue = uc($sValue);        # $sValue is IBAñEZ after uc
   $sLine = uc($sLine);            # $sLine is IBAÑEZ after uc

So the search will not find the search value, although it should!
So (as mail tends to mangle this stuff too) the issue is that 

uc(chr(241)) ne  'Ñ' ?  (Upper case N with ~)?

This would seem to be a problem with the uc function.
Which perl version are you using?
Which locale are you in?


I played a lot with the decode and encode method, but with no success.

You should not really need that with perl5.8 - get the UTF16-LE into 
perl's internal form then just work on characters.

Does pattern match style work? ($sLine =~ /$sValue/i)
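
A minimal sketch of "get it into internal form, then work on characters"
(the sample data and variable names are made up for illustration, they are
not from the original script). Encode::decode is used here so the example
needs no file; the :encoding(UTF-16LE) layer on open does the same decoding
while reading:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Simulate the octets of one line of a UTF-16LE file (illustrative data only).
my $octets = encode('UTF-16LE', "IBA\x{D1}EZ");

# Decode the octets into perl's internal character form.
my $line = decode('UTF-16LE', $octets);

my $octet_count = length $octets;            # counts bytes
my $char_count  = length $line;              # counts characters
my $fourth_char = ord substr($line, 3, 1);   # code point of the 4th character

printf "%d octets, %d characters, 4th character is U+%04X\n",
    $octet_count, $char_count, $fourth_char;
```

After decoding, length, substr and ord all operate on characters, not on the
two-byte units stored in the file.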

Either the return string isn't valid or the uc method's result is the same.

Can anybody tell me how to work with UTF8 and UTF16 in the same script? 

The way this is meant to work is that everything gets converted into perl's
internal form (which happens to be UTF-8 in perl5, but that is none of
the user's business) and then you work in characters.

So what you have _should_ work - but doesn't.
(Attached is above converted to a script which fails.)

In the old latin1 world it was better to lc both things - and 
that does seem to work here too.
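
A possible workaround sketch, on the assumption that the difference comes
from chr(241) producing a string that is not held in the internal UTF-8 form,
so that uc leaves the ñ alone. decode_upgraded is a hypothetical helper, not
part of the original script; utf8::upgrade changes only the internal
representation, not the characters:

```perl
use strict;
use warnings;

# Hypothetical helper: build the string from comma-separated character
# codes, then upgrade it to perl's internal UTF-8 form.
sub decode_upgraded {
    my ($input) = @_;
    my $output = join '', map { chr($_ + 0) } split /,/, $input;
    utf8::upgrade($output);    # same characters, internal form changes
    return $output;
}

my $value = decode_upgraded("73,98,97,241,101,122");

# Stand-in for a line read through :encoding(UTF-16LE).
my $line = "IBA\x{D1}EZ";
utf8::upgrade($line);

# Both strings are now in the same form, so uc treats them alike.
my $match = (uc($line) eq uc($value)) ? 'Yes' : 'No';

binmode(STDOUT, ":utf8");
print uc($line), "/", uc($value), " $match\n";
```

With both strings upgraded the comparison no longer depends on which one
happened to come from a file and which from chr.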

lib/unicore/CaseFolding.txt has

00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE

UnicodeData.txt has

00D1;LATIN CAPITAL LETTER N WITH TILDE;Lu;0;L;004E 0303;;;;N;LATIN CAPITAL LETTER N TILDE;;;00F1;
00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1

But what does that mean?
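
One way to read those two records, assuming the standard semicolon-separated
UnicodeData.txt layout in which fields 12, 13 and 14 (counting from 0) are
the simple uppercase, lowercase and titlecase mappings. The variable names
are illustrative only:

```perl
use strict;
use warnings;

# The two UnicodeData.txt records quoted above, each on a single line.
my @records = (
    '00D1;LATIN CAPITAL LETTER N WITH TILDE;Lu;0;L;004E 0303;;;;N;LATIN CAPITAL LETTER N TILDE;;;00F1;',
    '00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1',
);

my %mapping;
for my $record (@records) {
    # A limit of -1 keeps the trailing empty fields.
    my @field = split /;/, $record, -1;
    $mapping{ $field[0] } = {
        upper => $field[12],
        lower => $field[13],
        title => $field[14],
    };
    printf "U+%s upper=%s lower=%s title=%s\n", $field[0],
        map { length $_ ? "U+$_" : '(none)' } @field[12, 13, 14];
}
```

Read this way, the tables say that U+00F1 uppercases to U+00D1 and U+00D1
lowercases to U+00F1, so the data itself does pair the two letters.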



Any help would be greatly appreciated.

Thanks in advance,

Sebastian




sub decodeString {
    my $sInput = shift;
    my $sOutput = "";
    my @arrChars = split(/,/, $sInput);
    foreach ( @arrChars )
    {
        my $iCharCode = ($_)*1;
        $sOutput .= chr($iCharCode);
    }
    return $sOutput;
}

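my $sLine = "IBA\xD1EZ";
# Append a character above 0xFF and remove it again, which leaves
# $sLine stored in perl's internal UTF-8 form (as if read via :encoding).
$sLine .= chr(0x100);
chop($sLine);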


my $sValue = decodeString("73,98,97,241,101,122");

binmode(STDOUT,":utf8");

my $match = ($sLine =~ /$sValue/i) ? 'Yes' : 'No';

print "$sLine/$sValue $match\n";

$sLine = uc($sLine);
$sValue = uc($sValue);

$match = ($sLine =~ /$sValue/) ? 'Yes' : 'No';


print "$sLine/$sValue $match\n";



$sLine = lc($sLine);
$sValue = lc($sValue);

$match = ($sLine =~ /$sValue/) ? 'Yes' : 'No';


print "$sLine/$sValue $match\n";

