Re: Variation In Decoding Between Encode and XML::LibXML

On Thu, Jun 17, 2010 at 10:17:52AM -0700, David E. Wheeler wrote:

The logical content of the string follows in the second quote:

[UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]


That's valid UTF-8.


In what sense? Legally perhaps, but I can make XML::LibXML choke on it.


There are two valid states for Perl scalars containing string data.

  * SVf_UTF8 flag off.
  * SVf_UTF8 flag on, and string data which is a valid UTF-8 byte sequence.

In both cases, we define the logical content of the string as a series of
Unicode code points.  

If the UTF8 flag is off, then the scalar's data will be interpreted as
Latin-1.  (Except under "use locale" but let's ignore that for now.)  Each
byte will be interpreted as a single code point.  The 256 logical code points
in Latin-1 are identical to the first 256 logical code points in Unicode.
This is by design -- the Unicode Consortium chose to overlap with Latin-1
because it was so common.  So any string content that consists solely of code
points 255 and under can be represented in Latin-1 without loss.

In a Perl scalar with the UTF8 flag on, you can get the code points by
decoding the variable width UTF-8 data, with each Unicode code point derived
by reading 1-4 bytes.  *Any* sequence of Unicode code points can be
represented without loss.
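
To make the two representations concrete, here is a minimal sketch (my own
illustration, not code from your program).  It holds the same single code
point both ways and shows that only the internal storage differs:

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without switching on byte semantics

# One code point, U+00E9, stored as Latin-1 (UTF8 flag off).
my $latin1 = "\x{e9}";

# A copy with the same logical content, stored as UTF-8 (UTF8 flag on).
my $utf8 = $latin1;
utf8::upgrade($utf8);

printf "flag off: is_utf8=%d chars=%d internal bytes=%d\n",
    (utf8::is_utf8($latin1) ? 1 : 0), length($latin1), bytes::length($latin1);
printf "flag on:  is_utf8=%d chars=%d internal bytes=%d\n",
    (utf8::is_utf8($utf8) ? 1 : 0), length($utf8), bytes::length($utf8);

# String comparison works on logical content, so the two are equal.
print "logically equal\n" if $latin1 eq $utf8;
```

The character count is 1 in both cases; only the internal byte count changes.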

Unfortunately, it is really, really easy to mess up string handling when
writing XS modules.  A common error is to strip the UTF8 flag accidentally.
This changes the scalar's logical content, as now its string data will be
interpreted as Latin-1 rather than UTF-8.  
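
Here is a rough sketch of what that stripping does.  I'm using the internal
function Encode::_utf8_off purely to simulate the XS bug; it is not something
to call in real code, and the variable names are made up for the example:

```perl
use strict;
use warnings;
use Encode ();

my $name   = "Laurinavi\x{010d}ius";   # contains U+010D, so the UTF8 flag is on
my $before = length $name;

# Simulate the bug: drop the flag while leaving the underlying bytes alone.
Encode::_utf8_off($name);
my $after = length $name;

# The single character U+010D has turned into two Latin-1 characters.
printf "length before: %d, after: %d\n", $before, $after;
printf "characters now at that position: U+%04X U+%04X\n",
    ord(substr($name, 9, 1)), ord(substr($name, 10, 1));
```

No warning is issued at any point, which is what makes this one so hard to
track down.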

A less common error is to turn on the UTF8 flag for a scalar which does not
contain a valid UTF-8 byte sequence.  This puts the scalar into what I'm
calling an "invalid state".  It will likely bring your program down with a
"panic" error message if you try to do something like run a regex on it.
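
A small sketch of how a scalar ends up in that state, and one way to detect
it.  Again the internal Encode::_utf8_on call only stands in for the buggy XS
code, and the example deliberately never prints or matches against the broken
scalar:

```perl
use strict;
use warnings;
use Encode ();

# A lone 0xE9 byte is fine as Latin-1 but is not a complete UTF-8 sequence.
my $bytes = "\x{e9}";
my $ok_before = utf8::valid($bytes) ? 1 : 0;

# Simulate the bug: claim the data is UTF-8 without checking it.
Encode::_utf8_on($bytes);
my $ok_after = utf8::valid($bytes) ? 1 : 0;

print "consistent before flag was forced on: $ok_before\n";
print "consistent after flag was forced on:  $ok_after\n";
```

utf8::valid() reports whether the flag and the data agree, so it is a cheap
sanity check to run on anything handed back from a suspect XS module.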

In your case, the Dump of the scalar demonstrated that it had the UTF8 flag
set and that it contained a valid UTF-8 byte sequence -- a "valid state".
However, it looks like it had the wrong logical content: two code points,
\x{c4} and \x{8d}, where there should be one.

A scalar with the UTF8 flag off can never be in an "invalid state", because
any sequence of bytes is valid Latin-1.  However, it's easy to change the
string's logical content by accidentally stripping or forgetting to set the
UTF8 flag.  Unfortunately, this error leads to silent failure -- no error
message, but the content changes -- and it can be really hard to debug.

This fellow's name, which you can see if you visit
<http://twitter.com/tomaslau>, contains Unicode code point 0x010d, "LATIN SMALL
LETTER C WITH CARON".  As that code point is greater than 255, any Perl string
containing his name *must* have the UTF8 flag turned on.  

I strongly suspect that at some point one of the following two things
happened:

    * The data was read from a UTF-8 source but the input filehandle was not
      set to UTF-8.  The fix for that is an explicit layer:
      open (my $fh, '<:encoding(utf8)', $file) or die;
    * The flag got stripped and subsequently the UTF-8 data was incorrectly
      reinterpreted as Latin-1.
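
Either route ends in the same place.  The following sketch (an illustration
using Encode, not a reconstruction of your actual code path) shows how
treating UTF-8 bytes as Latin-1 characters and encoding them a second time
yields a doubly encoded string, and how a single decode gives the right
answer:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{010d}";               # LATIN SMALL LETTER C WITH CARON
my $bytes = encode('UTF-8', $char);   # what a UTF-8 file or socket holds

# Mistake: the two bytes are taken for two Latin-1 characters and then
# encoded again on the way out.
my $double = encode('UTF-8', $bytes);

# Correct: decode the bytes exactly once.
my $fixed = decode('UTF-8', $bytes);

printf "utf-8 bytes:     %s\n", uc unpack('H*', $bytes);
printf "doubly encoded:  %s\n", uc unpack('H*', $double);
printf "decoded once:    U+%04X\n", ord $fixed;
```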

You typically need Devel::Peek for hunting down the second kind of error.

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand.  Perl tries to hide
the details, but there are too many ways for it to fail silently.  ("perl -C",
$YAML::Syck::ImplicitUnicode, etc.)
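
When a full Dump is more than you need, a small helper along these lines can
be sprinkled through the code to see where the content changes.  Note that
describe() is a hypothetical function I'm making up for this example; it is
not part of Devel::Peek or any other module:

```perl
use strict;
use warnings;

# Hypothetical debugging helper: reports the UTF8 flag and the logical
# code points of a string, without printing the string itself.
sub describe {
    my ($str) = @_;
    my $flag = utf8::is_utf8($str) ? 'on' : 'off';
    my $cps  = join ' ', map { sprintf 'U+%04X', ord } split //, $str;
    return "UTF8 flag $flag; " . length($str) . " code points: $cps";
}

print describe("\x{010d}"), "\n";       # the intended character
print describe("\x{c4}\x{8d}"), "\n";   # its UTF-8 bytes read as Latin-1
```

Calling it before and after each suspect module boundary narrows down where
the flag was lost or the data was re-encoded.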


Bleh. Such a PITA. I'd like not to have to think about this stuff, but I
must because other people haven't.


It's more that getting UTF-8 support into Perl without breaking existing
programs was a truly awesome hack -- but that one of the limitations of that
hack was that the implementation is prone to silent failure.

So here's my test:

    use 5.12.0;
    use Devel::Peek;

    my $str = "<p>Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius</p>";
    say $str;
    utf8::upgrade($str);
    binmode STDOUT, ':utf8';
    say $str;
    Dump $str;

The output is still broken, however, in both cases, looking like this:

    LaurinaviÄ?ius
    LaurinaviÃ?Â?ius


Let's double check something first.  Based on your mail client (Apple Mail) I
see you're (still) using OS X.  Check out Terminal -> Preferences -> Advanced
-> Character encoding. What's it set to?  If it's not "Unicode (UTF-8)", set
it to that now.

Then try this:

    use 5.10.0;
    use Devel::Peek;

    my $str = "<p>Tomas Laurinavi\x{010d}ius</p>";
    say $str;

    binmode STDOUT, ':utf8';
    say $str;

    Dump $str;
    utf8::upgrade($str); # no effect
    Dump $str;

For me, that prints his name correctly twice.  The first time, though, I get
a "wide character in print" warning.  That warning arises because Perl's
STDOUT is set to Latin-1 by default.  It wants to "downgrade" the UTF8 scalar
to Latin-1, but it can't do so without loss, so it warns and outputs the bytes
as is.  After we change STDOUT to 'utf8', the warning goes away.
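
If you want to see that behavior without depending on your terminal, here is
a sketch that uses in-memory filehandles in place of STDOUT (that
substitution is my assumption for the sake of a self-contained example) and
traps the warning:

```perl
use strict;
use warnings;

my $str = "\x{010d}";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer, like a default STDOUT: Perl warns and writes the
# scalar's internal UTF-8 bytes as is.
open(my $raw, '>', \my $raw_buf) or die "open failed: $!";
print {$raw} $str;
close $raw;

# With an encoding layer, like STDOUT after binmode: no warning.
open(my $enc, '>:encoding(UTF-8)', \my $enc_buf) or die "open failed: $!";
print {$enc} $str;
close $enc;

printf "warnings seen: %d\n", scalar @warnings;
printf "without layer: %s\n", unpack('H*', $raw_buf);
printf "with layer:    %s\n", unpack('H*', $enc_buf);
```

Both handles receive the same bytes, which is why the name looks right on
your terminal both times; the layer only makes the conversion explicit.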

The utf8::upgrade() call has no effect, because the scalar starts off as
UTF8.  Prior to the introduction of the UTF8 flag, there was no way to put
the code point \x{010d} into a Perl string because Latin-1 can't represent it.
For backwards compatibility reasons, \x escapes of 255 and below have to be
represented as Latin-1.   Since you asked for \x{010d}, though, Perl knows
that the backwards compat rules don't apply and it can use a UTF8 scalar.
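
A quick sketch of that rule, checking which representation Perl picks for a
few string literals (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $low   = "\x{e9}";            # 255 and below: stays Latin-1
my $high  = "\x{010d}";          # above 255: has to be UTF8
my $mixed = "\x{e9}\x{010d}";    # one high code point makes the whole string UTF8

printf "low:   flag %s\n", utf8::is_utf8($low)   ? 'on' : 'off';
printf "high:  flag %s\n", utf8::is_utf8($high)  ? 'on' : 'off';
printf "mixed: flag %s\n", utf8::is_utf8($mixed) ? 'on' : 'off';

# Upgrading a scalar that is already UTF8 changes nothing.
my $copy = $high;
utf8::upgrade($copy);
print "upgrade was a no-op\n" if $copy eq $high && length($copy) == 1;
```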

HTH,

Marvin Humphrey