Hi all,
Thanks to Masanori for the reply and sample code! Before I saw your email, I
had already done the job, and it was indeed easy. The only trick is to say
"use utf8"; then ord, length, etc. handle multibyte characters correctly.
Now I have run into another problem: a few files contain not just one
charset but two or more. For example, file1 contains both Japanese and
Chinese. If I use open() to load the data into memory, ord, length, etc.
do not work correctly! Perhaps I am missing a step to encode or decode the
data?
code:
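#!/usr/bin/perl -w
use utf8;
# No I/O layer is given here, so each line is read as raw bytes.
open(FD, "< file1") or die "Cannot open file1: $!";
while(<FD>) {
    chomp;
    print "length = ".length($_)."\n";
}
close FD;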
----------
length() does not count the non-ASCII characters correctly. :(
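
One possible explanation, offered as a guess rather than a diagnosis: "use utf8" only tells perl that the script's own source text is UTF-8. It does not decode data read from a file, so without an I/O layer length() counts bytes. Below is a minimal sketch, assuming file1 is encoded in UTF-8 (which can hold Japanese and Chinese together) and perl 5.8 or later. The helper name char_lengths is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the character length of each line of a file.
# Assumption: the file is encoded in UTF-8.
sub char_lengths {
    my ($path) = @_;
    # The :encoding(UTF-8) layer decodes bytes into characters on read.
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    my @lengths;
    while (my $line = <$fh>) {
        chomp $line;
        push @lengths, length($line);   # characters, not bytes
    }
    close $fh;
    return @lengths;
}

# Usage: pass a file name on the command line, e.g. "perl count.pl file1".
if (@ARGV) {
    print "length = $_\n" for char_lengths($ARGV[0]);
}
```

If the file really mixes different byte encodings (say Shift_JIS and GB2312) rather than being UTF-8 throughout, no single layer can decode it. Each part would then have to be decoded separately, for example with decode() from the Encode module.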
Masanori HATA writes:
Calling it "HTML unicode" seems wrong; I think the proper term is
"numeric character references".
<http://www.w3.org/TR/html4/charset.html#entities>
Numeric character references are not the only way to display
multilingual text in an HTML document. In fact, I use 'raw' UTF-8
characters in some HTML documents. To do that, I edit the HTML source
file with a text editor that can handle the UTF-8 encoding.
Please take a look at the sample.html attached to this mail.
Do not only view it in a browser; look at the source of the file as well.
You may also want to learn more about Unicode and HTML.
About Unicode:
<http://www.unicode.org/standard/WhatIsUnicode.html>
About HTML:
<http://www.w3.org/MarkUp/>
BTW, when you use the numeric character references method, there is no need
to hunt for any modules. The built-in "unpack('U*', $string)" function
is enough.
Please inspect and evaluate my sample code, which is attached as sample.pl.
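
As a rough illustration of the unpack approach described above (an independent sketch, not the attached sample.pl), the following converts a character string into numeric character references. The function name to_ncr and the choice to pass ASCII through unchanged are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference such as &#26085;. ASCII is passed through
# unchanged, so markup characters like & and < are NOT escaped here.
sub to_ncr {
    my ($string) = @_;
    return join '',
        map { $_ < 128 ? chr($_) : "&#$_;" }
        unpack('U*', $string);   # one code point per character
}

# "\x{65E5}\x{672C}" is the two-character string U+65E5 U+672C.
print to_ncr("abc \x{65E5}\x{672C}"), "\n";   # prints "abc &#26085;&#26412;"
```

The input must already be a character string. Data read from a file as raw bytes would need to be decoded first.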
--
Masanori HATA
<lovewing(_at_)dream(_dot_)big(_dot_)or(_dot_)jp>
He's always with us!
----
He zhiqiang <hzqbbc(_at_)damail(_dot_)cn>
PM of PBIP lab and daYou IT Ltd.
Core member of anti-spam.org.cn