On Nov 10, 2004, at 01:30, Paul Bijnens wrote:
I have a program that reads and writes (among others) strings that
should be utf8 encoded. I say "should", because somewhere deep
inside the dark corners of that program, sometimes, the utf8 flag on
a string is lost. (I'm still investigating where, tips to attack
such a problem are welcome.)
Even when you try to set the UTF-8 flag on a string that consists entirely
of ASCII ( /^[\x00-\x7f]*$/ ), the UTF-8 flag will not be on. See "The UTF-8
flag" section of 'perldoc Encode'. Here is a short summary.
perldoc Encode
o When you decode, the resulting utf8 flag is on unless you can
  unambiguously represent data. Here is the definition of
  dis-ambiguity.
  After "$utf8 = decode('foo', $octet);",
  When $octet is...                 The utf8 flag in $utf8 is
  -----------------------------------------------------------
  In ASCII only (or EBCDIC only)    OFF
  In ISO-8859-1                     ON
  In any other Encoding             ON
  -----------------------------------------------------------
  As you see, there is one exception, In ASCII. That way you can
  assume Goal #1. And with Encode Goal #2 is assumed but you still
  have to be careful in such cases mentioned in CAVEAT paragraphs.
When writing the string, the program clears the utf8 flag
and writes a simple string of octets using:
$s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;
$n = length($s); # yes, we need length in bytes
...
print $s;
If what you need is the byte length, you can simply "use bytes" as follows.
The binmode call is only there for print().
use bytes (); # avoid imports
binmode STDOUT => ":utf8";
my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
# ...
my $n = length($s);        # length in characters
my $l = bytes::length($s); # length in bytes
# ...
print $s;
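One caveat with the snippet above: bytes::length() counts the bytes of perl's internal representation, so for a string that is not flagged (say "\xe9") it reports 1, not the 2 octets that UTF-8 output would need. If the number has to match what ends up on the wire, a more conservative sketch is to encode first and measure the result. The helper name char_and_octet_length is made up for this example; only encode() from Encode is a real API here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return (character count, UTF-8 octet count).
sub char_and_octet_length {
    my ($str) = @_;
    my $octets = encode("UTF-8", $str);   # byte string; does not modify $str
    return (length($str), length($octets));
}

my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
my ($chars, $octets) = char_and_octet_length($s);
print "chars=$chars octets=$octets\n";    # chars=4 octets=10
```

The three CJK characters take three octets each and the space takes one, hence 10.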
Why would someone test for pure 7-bit strings instead of:
$s = encode("utf8", $s) if Encode::is_utf8($s);
For most cases you don't have to and you should not have to (unless you
maintain Encode and/or perl :). Complex as it may be, the internal UTF-8
flag was the best way to harness UTF-8 while keeping legacy,
byte-oriented scripts compatible.
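To illustrate why the flag is a poor thing to branch on, here is a minimal sketch (to_octets is a hypothetical helper, not part of Encode). A string such as "caf\xe9" holds a non-ASCII character but is normally stored without the flag, so an is_utf8() guard would skip it even though it still needs encoding on output.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Hypothetical helper: always encode character strings on the way out,
# without looking at the internal flag.
sub to_octets {
    my ($str) = @_;
    return encode("UTF-8", $str);
}

my $latin = "caf\xe9";                    # 4 characters, one of them non-ASCII
my $flag  = is_utf8($latin) ? "on" : "off";
my $high  = $latin =~ /[^\x00-\x7f]/ ? "yes" : "no";

printf "flag=%s non-ascii=%s chars=%d octets=%d\n",
    $flag, $high, length($latin), length(to_octets($latin));
```

This only helps while every string in the program really is a character string. If some strings already hold UTF-8 octets, encoding them again double-encodes, and neither the flag nor the 7-bit test can tell the two cases apart.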
which seems superior to avoid double utf8 encodings,
should the utf8-flag be lost. And it's faster.
Or even simply: Encode::_utf8_off($s)
The problem is that I'm usually wrong. Am I this time?
Am I missing something? Or do I need more coffee?
I have to admit that Encode and the Perl 5.8 way of handling Unicode need
more recipes (Perl Cookbook 2nd Ed. does cover the issue in Ch. 8, but
that was hardly enough).
Dan the Encode Maintainer