Re: About HTML unicode

John Delacour wrote:

At 12:31 am +0800 3/12/04, He Zhiqiang wrote:
Now I have encountered another problem: a few files contain not just one charset but two or more. For example, file1 contains Japanese and Chinese. If I use open() to load the data into memory, ord, length, etc. can't work correctly! Perhaps I am missing something to encode or decode the data?
code:
#!/usr/bin/perl -w
use utf8;
open(FD, "< file1");
while(<FD>) {
chomp;
print "length = ".length($_);
}
close FD;
----------
length() cannot count the non-ASCII characters correctly. :(
If the file is in UTF-8, then it may be in any number of _languages_ but it uses only one character set -- Unicode. So far as I know "use utf8" is now redundant and ineffectual in Perl. You will get the correct character count (6 characters rather than 18 bytes) by opening the file handle as utf-8 as below.

If I may add a comment to JD's reply for Zhiqiang: "use utf8" only tells the Perl parser that the program source file is written in UTF-8.

cf. <http://www.perldoc.com/perl5.8.4/lib/utf8.html>

That pragma has no other effect.
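
To illustrate, here is a minimal sketch, assuming the script itself is saved as UTF-8 (the variable names are only for illustration). The same literal is counted as characters under "use utf8" and as bytes under "no utf8":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
my ($as_chars, $as_bytes);
{
    use utf8;               # the parser decodes literals in this block
    $as_chars = "日本語";
}
{
    no utf8;                # the parser leaves literals as raw bytes
    $as_bytes = "日本語";
}

print length($as_chars), "\n";   # 3 characters
print length($as_bytes), "\n";   # 9 bytes
```

Note that neither block reads a file; the pragma only changes how the literals in the source are parsed.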

Zhiqiang needs to tell Perl that the string is encoded in UTF-8. You should give length() a string in the so-called 'UTF8-flagged' form; otherwise it counts bytes rather than characters.


JD has already suggested how to enable the UTF8 flag via the ":utf8" I/O layer.
cf. <http://www.perldoc.com/perl5.8.4/lib/PerlIO.html>
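
A rough sketch of the I/O layer approach, under a few assumptions of mine: the sample data, the temporary file and the helper line_lengths() are hypothetical, and I use ":encoding(UTF-8)", which also validates the input, where ":utf8" should give the same counts for well-formed data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: "日本語" as raw UTF-8 bytes (3 characters, 9 bytes).
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Write the bytes to a temporary file so the sketch is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $bytes, "\n";
close $out or die "Cannot close $path: $!";

# Return the length of each line as read through the given layer.
sub line_lengths {
    my ($file, $layer) = @_;
    open(my $in, "<$layer", $file) or die "Cannot open $file: $!";
    chomp(my @lines = <$in>);
    close $in;
    return map { length } @lines;
}

print "raw bytes: ", join(",", line_lengths($path, ':raw')), "\n";
print "decoded:   ", join(",", line_lengths($path, ':encoding(UTF-8)')), "\n";
```

With the ':raw' layer the line is counted in bytes; with the decoding layer Perl hands length() a character string.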

Another way to enable the flag is to use the "utf8::decode()" function.
My sample code is below:

#!/usr/local/bin/perl -w
use 5.008;
use strict;
use warnings;

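# Read the file as raw bytes (no I/O layer), failing loudly if it is missing.
open(my $txt, '<', 'sample2.txt') or die "Cannot open sample2.txt: $!";
chomp(my @text = <$txt>);
close $txt;

# Without the flag, length() counts bytes.
print "utf8 flag disabled:\n";
foreach my $text (@text) {
    print length($text), "\n";
}

# utf8::decode() converts each string in place, so length() counts characters.
print "utf8 flag enabled:\n";
foreach my $text (@text) {
    utf8::decode($text);
    print length($text), "\n";
}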
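
If you want to see the flag itself, here is a small self-contained sketch. The input string is a hypothetical stand-in for one line read without any layer; utf8::is_utf8() reports the flag, and utf8::decode() returns false when the bytes are not valid UTF-8, so I check its result:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: "日本語" as raw UTF-8 bytes, as it would arrive
# from a file handle opened without any layer.
my $text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

my $flag_before   = utf8::is_utf8($text) ? 1 : 0;
my $length_before = length($text);

# utf8::decode() works in place and returns false when the bytes
# are not valid UTF-8, so the result is worth checking.
utf8::decode($text) or die "input is not valid UTF-8\n";

my $flag_after   = utf8::is_utf8($text) ? 1 : 0;
my $length_after = length($text);

print "before: flag=$flag_before length=$length_before\n";
print "after:  flag=$flag_after length=$length_after\n";
```

Before decoding, the flag is off and the length is 9 (bytes); afterwards the flag is on and the length is 3 (characters).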

--
Masanori HATA
<lovewing(_at_)dream(_dot_)big(_dot_)or(_dot_)jp>
He's always with us!