John Delacour wrote:
At 12:31 am +0800 3/12/04, He Zhiqiang wrote:
Now I have encountered another problem: a few files contain not just
one charset but two or more. For example, file1 contains Japanese and
Chinese. If I use open() to load the data into memory, ord, length,
etc. don't work correctly! Perhaps I am missing something to encode or
decode the data?
code:
#!/usr/bin/perl -w
use utf8;
open(FD, "<", "file1") or die "Cannot open file1: $!";
while(<FD>) {
chomp;
print "length = ".length($_)."\n";
}
close FD;
----------
length() cannot count the non-ASCII characters correctly. :(
If the file is in UTF-8, then it may be in any number of _languages_ but
it uses only one character set -- Unicode. So far as I know "use utf8"
is now redundant and ineffectual in Perl. You will get the correct
character count (6 characters rather than 18 bytes) by opening the file
handle as utf-8 as below.
If I may add a comment to JD's reply for Zhiqiang: "use utf8" only
tells the Perl parser that the program source file is written in
UTF-8.
cf. <http://www.perldoc.com/perl5.8.4/lib/utf8.html>
That pragma has no other effect.
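As a rough sketch of what that means in practice (assuming the script itself is saved as UTF-8; the literal is made up for illustration), the pragma changes how a string literal in the source is counted, and it does nothing for data read from a file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this script file is saved as UTF-8.
# "use utf8" is lexically scoped, so each block below sees its own setting.

# Without the pragma the literal is read as 9 separate bytes.
my $without = do { no utf8; length("日本語") };

# With the pragma the parser decodes the literal into 3 characters.
my $with = do { use utf8; length("日本語") };

print "without use utf8: $without\n";
print "with use utf8:    $with\n";
```

This should print 9 for the first line and 3 for the second.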
Zhiqiang needs to tell Perl that the string is encoded in UTF-8.
length() counts characters only when it is given the string in the
so-called 'UTF8-flagged' form.
JD has already suggested how to enable the UTF8 flag via the ":utf8"
I/O layer.
cf. <http://www.perldoc.com/perl5.8.4/lib/PerlIO.html>
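To illustrate the I/O layer approach, here is a small sketch. It reads from an in-memory filehandle instead of a real file so that it runs on its own. The sample bytes and the helper name line_lengths are assumptions made for this example, and ":encoding(UTF-8)" is used here as the validating relative of the ":utf8" layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line: three Japanese and three Chinese characters,
# written as raw UTF-8 bytes (18 bytes, 6 characters) plus a newline.
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
          . "\xE4\xB8\xAD\xE6\x96\x87\xE5\xAD\x97\n";

# Read every line through the given I/O layer and return the length of each.
sub line_lengths {
    my ($layer, $data) = @_;
    open(my $fh, "<$layer", \$data)
        or die "Cannot open in-memory file: $!";
    my @lengths;
    while (my $line = <$fh>) {
        chomp $line;
        push @lengths, length($line);
    }
    close $fh;
    return @lengths;
}

# Raw bytes: length() counts bytes.
print "raw:     ", join(',', line_lengths(':raw', $bytes)), "\n";

# Decoded on input: length() counts characters.
print "decoded: ", join(',', line_lengths(':encoding(UTF-8)', $bytes)), "\n";
```

The first line should report 18 and the second 6. For a real file, the same layer goes in the mode argument, for example open(my $fh, '<:encoding(UTF-8)', 'file1').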
Another way to enable the flag is to use the "utf8::decode()" function,
which decodes the string in place.
My sample code is below:
#!/usr/local/bin/perl -w
use 5.008;
use strict;
use warnings;
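# Read the file as raw bytes (no I/O layer), so the UTF8 flag stays off.
open(TXT, '<', 'sample2.txt') or die "Cannot open sample2.txt: $!";
chomp(my @text = <TXT>);
close TXT;
print "utf8 flag disabled:\n";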
foreach my $text (@text) {
print length($text), "\n";
}
print "utf8 flag enabled:\n";
foreach my $text (@text) {
# utf8::decode() converts the UTF-8 bytes to characters in place.
utf8::decode($text);
print length($text), "\n";
}
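For anyone without a sample2.txt at hand, here is a self-contained sketch of the same idea. The sample bytes are made up for illustration. It also checks the return value of utf8::decode(), which is false when the bytes are not well-formed UTF-8, and uses utf8::is_utf8() to show the flag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: three Japanese characters as raw UTF-8 bytes (9 bytes).
my $text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

my $before_len  = length($text);
my $before_flag = utf8::is_utf8($text) ? 1 : 0;

# Decode in place; stop if the input is not valid UTF-8.
utf8::decode($text) or die "input is not valid UTF-8";

my $after_len  = length($text);
my $after_flag = utf8::is_utf8($text) ? 1 : 0;

print "before: length=$before_len flag=$before_flag\n";
print "after:  length=$after_len flag=$after_flag\n";

# A lone continuation byte is not valid UTF-8, so decoding reports failure.
my $bad    = "\x80";
my $bad_ok = utf8::decode($bad) ? 1 : 0;
print "malformed input decoded: $bad_ok\n";
```

The length should drop from 9 to 3 once the flag is on, and the malformed input should report 0.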
--
Masanori HATA
<lovewing(_at_)dream(_dot_)big(_dot_)or(_dot_)jp>
He's always with us!