perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 15:36:54
Marvin,

I can always count on you for a detailed explanation. Thanks. You ought to turn 
this into a blog post!

On Jun 17, 2010, at 4:06 PM, Marvin Humphrey wrote:

There are two valid states for Perl scalars containing string data.

 * SVf_UTF8 flag off.
 * SVf_UTF8 flag on, and string data which is a valid UTF-8 byte sequence.

In both cases, we define the logical content of the string as a series of
Unicode code points.  

If the UTF8 flag is off, then the scalar's data will be interpreted as
Latin-1.  (Except under "use locale" but let's ignore that for now.)  Each
byte will be interpreted as a single code point.  The 256 logical code points
in Latin-1 are identical to the first 256 logical code points in Unicode.
This is by design -- the Unicode consortium chose to overlap with Latin-1
because it was so common.  So any string content that consists solely of code
points 255 and under can be represented in Latin-1 without loss.
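
Just to make sure I follow, here's a quick sketch of that first state as I
understand it -- my own toy example, not from your code:

    use Devel::Peek;

    my $latin1 = "caf\xe9";   # \xe9 is U+00E9, LATIN SMALL LETTER E WITH ACUTE
    printf "length %d, last code point U+%04X\n",
        length($latin1), ord(substr($latin1, -1));   # length 4, U+00E9
    Dump $latin1;             # FLAGS don't include UTF8; PV holds the raw bytes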

Hrm. So am I safe in changing the CP1252 gremlin bytes to proper UTF-8 
characters in Encode::ZapCP1252 like so?

        $_[0] =~ s{([\x80-\x9f])}{
            $table->{$1} ? Encode::decode('UTF-8', $table->{$1}) : $1
        }emxsg

Where `$table` is the lookup table mapping hex values like \x80 to their UTF-8 
equivalents (€)? This is assuming that $_[0] has the UTF8 flag on, of course.

So is this safe? Are \x80-\x9f considered characters when the utf8 flag is on, 
or are they bytes that might break multibyte characters that use those bytes?

In a Perl scalar with the UTF8 flag on, you can get the code points by
decoding the variable width UTF-8 data, with each code point derived by
reading 1-5 bytes.  *Any* sequence of Unicode code points can be represented
without loss.

Right.
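
And, to answer my own question above, a quick experiment (just a sketch with a
made-up string) suggests that \x80-\x9f are ordinary code points once the flag
is on, not raw bytes that could collide with a multibyte character's encoding:

    use Devel::Peek;

    my $str = "\x{20ac}\x80";   # U+20AC forces the UTF8 flag on; \x80 is then U+0080
    printf "length %d, last code point U+%04X\n",
        length($str), ord(substr($str, -1));   # length 2, U+0080
    Dump $str;   # PV is e2 82 ac c2 80 -- the \x80 is stored as the two bytes c2 80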

Unfortunately, it is really, really easy to mess up string handling when
writing XS modules.  A common error is to strip the UTF8 flag accidentally.
This changes the scalar's logical content, as now its string data will be
interpreted as Latin-1 rather than UTF-8.  

A less common error is to turn on the UTF8 flag for a scalar which does not
contain a valid UTF-8 byte sequence.  This puts the scalar into what I'm
calling an "invalid state".  It will likely bring your program down with a
"panic" error message if you try to do something like run a regex on it.

Fortunately, I'm not writing XS modules. :-)
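
(Though, for what it's worth, pure Perl can apparently get a scalar into that
invalid state too -- a sketch strictly for illustration; don't do this for real:)

    use Encode ();

    my $bogus = "\xff\xfe\xfd";    # not a valid UTF-8 byte sequence
    Encode::_utf8_on($bogus);      # force the UTF8 flag on without any validation
    # $bogus is now in that "invalid state": character ops or regexes on it
    # may warn about malformed UTF-8 or bring the program down with a panic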

In your case, the Dump of the scalar demonstrated that it had the UTF8 flag
set and that it contained a valid UTF-8 byte sequence -- a "valid state".
However, it looks like it had invalid content.

Yes. I broke it with zap_cp1252 (applied before decoding). I just removed that 
and things became valid again. The character was still broken, as it is in the 
feed, but at least it was valid -- and the same as the source.

A scalar with the UTF8 flag off can never be in an "invalid state", because
any sequence of bytes is valid Latin-1.  However, it's easy to change the
string's logical content by accidentally stripping or forgetting to set the
UTF8 flag.  Unfortunately, this error leads to silent failure -- no error
message, but the content changes -- and it can be really hard to debug.

Yes, this is what happened to me by zapping the non-utf8 scalar with zap_cp1252 
before decoding it. Bad idea.
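
For the record, the silent change is easy to see with nothing more than
length() -- a toy sketch using the UTF-8 bytes for č:

    use 5.10.0;

    # UTF-8 bytes read in without an :encoding layer: the flag stays off
    my $raw = "Laurinavi\xc4\x8dius";
    say length($raw);     # 14 -- \xc4\x8d reads as two Latin-1 code points
    utf8::decode($raw);   # reinterpret the buffer as UTF-8
    say length($raw);     # 13 -- one code point for the č, and no warning either way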

This fellow's name, which you can see if you visit
<http://twitter.com/tomaslau>, contains Unicode code point 0x010d, "LATIN SMALL
LETTER C WITH CARON".  As that code point is greater than 255, any Perl string
containing his name *must* have the UTF8 flag turned on.  

I strongly suspect that at some point one of the following two things
happened:

   * The code was input from a UTF-8 source but the input filehandle was not
     set to UTF-8 -- open (my $fh, '<:encoding(utf8)', $file) or die;

Well, I was pulling it from HTTP::Response->content. I'm not using 
HTTP::Response->decoded_content because it's XML, which should be treated as 
binary (see http://juerd.nl/site.plp/perluniadvice).
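
In other words, something along these lines -- the URL is made up, but the
point is that the parser gets the raw, undecoded bytes:

    use LWP::UserAgent;
    use XML::LibXML;

    my $res = LWP::UserAgent->new->get('http://example.com/feed.xml');
    die $res->status_line unless $res->is_success;

    # Hand the undecoded bytes straight to the parser; libxml2 reads the
    # encoding from the <?xml ... encoding="..."?> declaration itself
    my $doc = XML::LibXML->new->parse_string( $res->content );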

   * The flag got stripped and subsequently the UTF-8 data was incorrectly
     reinterpreted as Latin-1.

You typically need Devel::Peek for hunting down the second kind of error.

I missed that one, fortunately.
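
For anyone following along at home, the Dump output is the giveaway. A small
sketch of a healthy scalar next to one whose characters have been turned back
into bytes:

    use Devel::Peek;

    my $good = "Laurinavi\x{010d}ius";
    Dump $good;           # FLAGS include UTF8; PV shows [UTF8 "Laurinavi\x{10d}ius"]

    my $bad = $good;
    utf8::encode($bad);   # turns the characters into UTF-8 bytes and drops the flag
    Dump $bad;            # no UTF8 flag; the same bytes now read as Latin-1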

It's more that getting UTF-8 support into Perl without breaking existing
programs was a truly awesome hack -- but one of the limitations of that hack
is that the implementation is prone to silent failure.

Right. It's an impressive achievement. And I can't wait until DBI 2 is built on 
Rakudo. ;-)

The output is still broken, however, in both cases, looking like this:

   Laurinavičius
   Laurinavičius

Let's double check something first.  Based on your mail client (Apple Mail) I
see you're (still) using OS X.  Check out Terminal -> Preferences -> Advanced
-> Character encoding. What's it set to?  If it's not "Unicode (UTF-8)", set
it to that now.

I always use UTF-8. Snow Leopard actually seems to allow multiple encodings 
(!), as the "Encoding" tab (no more advanced tab) has UTF-8, Mac OS Roman, 
Latin-1, and Latin-9 (wha?) checked, as well as a bunch of other encodings.

Then try this:

   use 5.10.0;
   use Devel::Peek;

   my $str = "<p>Tomas Laurinavi\x{010d}ius</p>";
   say $str;

   binmode STDOUT, ':utf8';
   say $str;

   Dump $str;
   utf8::upgrade($str); # no effect
   Dump $str;

For me, that prints his name correctly twice.  The first time, though, I get
a "wide character in print" warning.  That warning arises because Perl's
STDOUT is set to Latin-1 by default.  It wants to "downgrade" the UTF8 scalar
to Latin-1, but it can't do so without loss, so it warns and outputs the bytes
as is.  After we change STDOUT to 'utf8', the warning goes away.

Yep, same here.

The utf8::upgrade() call has no effect, because the scalar starts off as
UTF8.  Prior to the introduction of the UTF8 flag, there was no way to put
the code point \x{010d} into a Perl string because Latin-1 can't represent it.
For backwards compatibility reasons, \x escapes for code points 255 and under
have to be represented as Latin-1.  Since you asked for \x{010d}, though, Perl knows
that the backwards compat rules don't apply and it can use a UTF8 scalar.

Ah, I see. That's probably what happened inside Google Pipes: their code read 
the original feed into a Latin-1 variable somehow, and the \x{010d} got changed 
to \x{c4}\x{8d}, and it wasn't converted back before being output as UTF-8.
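
That theory is easy to reproduce in miniature, at least -- a sketch of the
round trip (assuming a UTF-8 terminal):

    use 5.10.0;
    use Encode qw(encode decode);

    binmode STDOUT, ':utf8';

    my $name  = "Laurinavi\x{010d}ius";
    my $bytes = encode('UTF-8', $name);         # č becomes the bytes \xc4\x8d
    my $wrong = decode('ISO-8859-1', $bytes);   # each byte misread as one Latin-1 character
    say $name;    # Laurinavičius
    say $wrong;   # Laurinavi followed by two mangled characters where the č was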

Best,

David

