perl-unicode

Re: clearing the utf8 flag

2004-11-09 11:30:08
On Nov 10, 2004, at 01:30, Paul Bijnens wrote:
I have a program that reads and writes (among others) strings that
should be utf8 encoded.  I say "should", because somewhere deep
inside the dark corners of that program, sometimes, the utf8 flag on
a string is lost. (I'm still investigating where, tips to attack
such a problem are welcome.)

Even if you try to set the UTF-8 flag on a string that consists entirely of ASCII ( /^[\x00-\x7f]*$/ ), the flag will not be turned on. See "The UTF-8 flag" section of 'perldoc Encode'. Here is a short summary.

perldoc Encode
o When you decode, the resulting utf8 flag is on unless you can unambiguously represent data. Here is the definition of dis-ambiguity.

         After "$utf8 = decode('foo', $octet);",

           When $octet is...   The utf8 flag in $utf8 is
           ---------------------------------------------
           In ASCII only (or EBCDIC only)            OFF
           In ISO-8859-1                              ON
           In any other Encoding                      ON
           ---------------------------------------------

As you see, there is one exception, In ASCII. That way you can assume Goal #1. And with Encode Goal #2 is assumed but you still have to be careful in such cases mentioned in CAVEAT paragraphs.
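To check what the table says on a given installation, something like the following sketch could be used. The sample octets and variable names are made up for illustration. Only the non-ASCII case is asserted in the comments; the ASCII case is merely printed, since the result may depend on the installed Encode version.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Octets as they might arrive from a file or socket (utf8 flag off).
my $ascii_octets = "plain ascii";
my $utf8_octets  = "\xe5\xb0\x8f";    # UTF-8 encoding of U+5C0F

my $from_ascii = decode('UTF-8', $ascii_octets);
my $from_utf8  = decode('UTF-8', $utf8_octets);

# Report the flag for both cases; only inspect it, never rely on it.
printf "ASCII input:     flag %s\n", is_utf8($from_ascii) ? 'ON' : 'OFF';
printf "non-ASCII input: flag %s\n", is_utf8($from_utf8)  ? 'ON' : 'OFF';

# Regardless of the flag, the decoded ASCII string still compares
# equal to the original, and the non-ASCII one is a single character.
print "ASCII round-trips unchanged\n" if $from_ascii eq $ascii_octets;
printf "non-ASCII result: %d character, U+%04X\n",
    length($from_utf8), ord($from_utf8);
```

Either way, the flag is an internal detail: string comparison and length() behave the same whether it is on or off for an ASCII-only string.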


When writing the string, the program clears the utf8 flag
and writes a simple string of octets using:

    $s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;
    $n = length($s);   # yes, we need length in bytes
    ...
    print $s;

If what you need is byte length, you can simply "use bytes" as follows. binmode is for print().

use bytes (); # avoid imports
binmode STDOUT => ":utf8";
my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
# ...
my $n = length($s);        # length in characters
my $l = bytes::length($s); # length in bytes
# ...
print $s;
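As an alternative sketch (not from the original post), the byte length can also be taken from an explicitly encoded copy, which avoids depending on the internal representation. It reuses the sample string from above; the variable names are only illustrative. Here the handle is left raw because the octets are printed directly.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "\x{5c0f}\x{98fc} \x{5f3e}";   # same sample string as above

my $chars  = length($s);               # counts characters: 4
my $octets = encode('UTF-8', $s);      # explicit octet string, flag off
my $bytes  = length($octets);          # counts octets: 3 + 3 + 1 + 3 = 10

binmode STDOUT;                        # raw handle, no :utf8 layer
print "$chars characters, $bytes bytes\n";
print $octets, "\n";
```

The trade-off is an extra copy of the string, in exchange for a length that matches exactly what gets written.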

Why would someone test for pure 7-bit strings instead of:

    $s = encode("utf8", $s) if Encode::is_utf8($s);

For most cases you don't have to, and you should not have to (unless you maintain Encode and/or perl :). Complex as it may be, the internal UTF-8 flag was the best way to harness UTF-8 while keeping legacy, byte-oriented scripts compatible.

which seems superior to avoid double utf8 encodings,
should the utf8-flag be lost.  And it's faster.

Or even simply:     Encode::_utf8_off($s)
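The difference between the three variants can be seen in a small sketch. It assumes that "losing the flag" means the string ends up holding UTF-8 octets with the flag off, which is simulated here with encode(); the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Simulate a string whose flag was lost: UTF-8 octets, flag off.
my $lost = encode('UTF-8', "\x{5c0f}");    # 3 octets: e5 b0 8f

# Variant 1: encode whenever a non-ASCII byte or character is present.
my $v1 = $lost;
$v1 = encode('utf8', $v1) if $v1 =~ /[^\x00-\x7f]/;

# Variant 2: encode only when the utf8 flag is set.
my $v2 = $lost;
$v2 = encode('utf8', $v2) if is_utf8($v2);

# Variant 3: just clear the flag (a no-op when it is already off).
my $v3 = $lost;
Encode::_utf8_off($v3);

printf "non-ASCII test: %d bytes\n", length($v1);  # 6, double-encoded
printf "is_utf8 test:   %d bytes\n", length($v2);  # 3, left alone
printf "_utf8_off:      %d bytes\n", length($v3);  # 3, left alone
```

In this scenario the non-ASCII test treats each octet as a Latin-1 character and encodes it again, while the other two variants leave the octets untouched.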

The problem is that I'm usually wrong.  Am I this time?
Am I missing something?  Or do I need more coffee?

I have to admit that Encode and the Perl 5.8 way of handling Unicode need more recipes (Perl Cookbook, 2nd Ed. does cover that issue in Ch. 8, but it was hardly enough).

Dan the Encode Maintainer
