Quoth JD(_at_)BD8(_dot_)COM (John Delacour):
At 12:31 am +0800 3/12/04, He Zhiqiang wrote:
Now I have encountered another problem: a few files contain not
just one charset but two or more. For example, file1 contains
Japanese and Chinese. If I use open() to load the data into memory,
ord, length, etc. don't work correctly! Perhaps I am missing
something to encode or decode the data?
code:
#!/usr/bin/perl -w
use utf8;
open(FD, "< file1");
while(<FD>) {
chomp;
print "length = ".length($_);
}
close FD;
----------
length() does not count non-ASCII characters correctly. :(
If the file is in UTF-8, then it may be in any number of _languages_
but it uses only one character set -- Unicode. So far as I know "use
utf8" is now redundant and ineffectual in Perl.
Both utf8.pm and encoding.pm alter the encoding Perl considers your
*source file* to be in. This is different from what utf8.pm did under
5.6.
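To illustrate the source-file point, here is a minimal sketch. It assumes the script itself is saved as UTF-8; the string literal and the variable names are made up for the example and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
my ($chars, $bytes);
{
    use utf8;    # literals in this block are decoded into characters
    $chars = length("日本語");
}
{
    no utf8;     # literals in this block stay as raw octets
    $bytes = length("日本語");
}
print "with use utf8: $chars, without: $bytes\n";
```

The pragma only changes how Perl reads literals in the source; it does nothing to data read from a file handle, which is why it does not help with the original script.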
You will get the
correct character count (6 characters rather than 18 bytes) by
opening the file handle as utf-8 as below.
no warnings;
my $f = "/tmp/cjk.txt";
my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}\n";
open F, ">$f";
binmode F;
Use binmode here, both for portability and in case some environment
setting (PERLIO, the locale variables with 5.8.0, or -C) has applied
some other encoding to the data.
print F $text; # writes $text to $f as UTF-8
utf8::encode $text; # make sure $text is a sequence of octets, not
# characters
print F $text;
close F;
open F, "<:utf8", $f;
for (<F>) {
chomp;
print "$_ - Length = " . length() . $/;
}
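For comparison, a self-contained sketch of the same round trip that reads the file back both as raw octets and as decoded characters. The temporary file, the lexical file handles and the :encoding(UTF-8) layer are my assumptions for the example; the code above uses :utf8 and a fixed path.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The same six CJK characters as above, without the trailing newline.
my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}";

# Hypothetical scratch file, removed when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write through an explicit encoding layer so the characters become UTF-8.
open my $out, '>:encoding(UTF-8)', $path or die "write $path: $!";
print {$out} $text, "\n";
close $out or die "close $path: $!";

# Read the line back as raw octets: length() counts bytes.
open my $raw, '<:raw', $path or die "read $path: $!";
my $octets = <$raw>;
close $raw;
chomp $octets;

# Read it back through the decoding layer: length() counts characters.
open my $in, '<:encoding(UTF-8)', $path or die "read $path: $!";
my $decoded = <$in>;
close $in;
chomp $decoded;

my ($byte_len, $char_len) = (length $octets, length $decoded);
print "bytes = $byte_len, characters = $char_len\n";
```

Each of these characters takes three bytes in UTF-8, so the raw read reports 18 and the decoded read reports 6.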
Ben
--
Joy and Woe are woven fine,
A Clothing for the Soul divine William Blake
Under every grief and pine 'Auguries of Innocence'
Runs a joy with silken twine.
ben(_at_)morrow(_dot_)me(_dot_)uk