perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 20:03:59
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than 127
and for control characters, so that it can output pure ASCII.  
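
To make the notation concrete, here is a rough sketch of that kind of
escaping.  octal_escape() is a hypothetical helper written for this
illustration, not what Dump() actually runs, and the real Dump() output differs
in some details (it has short forms for a few characters, for instance).

```perl
use strict;
use warnings;

# Hypothetical helper: escape control characters and any byte above 126
# as three-digit octal, leaving printable ASCII untouched.
sub octal_escape {
    my ($bytes) = @_;
    return join '', map {
        my $n = ord;
        ($n < 32 || $n > 126) ? sprintf('\\%03o', $n) : $_
    } split //, $bytes;
}

# The four bytes C3 84 C2 8D come out as the sequence quoted above.
my $rendered = octal_escape("\xc3\x84\xc2\x8d");
print "$rendered\n";    # \303\204\302\215
```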

That sequence is only four bytes: 
  
  marvin@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
  SV = PV(0x801038) at 0x80e880
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
    CUR = 4   <----------------------------------------------- four bytes
    LEN = 8
  marvin@smokey:~ $ 

The logical content of the string follows in the second quote:

 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]

That's valid UTF-8.
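
If you want to check that without reading Dump() output, something like the
following sketch decodes the four octets and prints the resulting code points.
The variable names are made up for the example.  The last two lines are only a
comparison: they show what the octets C4 8D decode to on their own, and do not
claim anything about where the original data came from.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four octets shown in the PV, written with the same octal escapes.
my $bytes = "\303\204\302\215";

# Decode them as UTF-8 to get the logical characters.
my $chars = decode('UTF-8', $bytes);
my @code_points = map { sprintf 'U+%04X', ord } split //, $chars;
print "@code_points\n";    # U+00C4 U+008D

# For comparison: the octets C4 8D on their own decode to a single character.
my $single = decode('UTF-8', "\xc4\x8d");
printf "U+%04X\n", ord $single;    # U+010D
```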

> my $str = '<p>Tomas Laurinavi????ius</p>';

In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
does.

  my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";

However, because both of those code points are representable as Latin-1, Perl
will store the string as Latin-1 internally, without the UTF8 flag.  If you
want to force its internal encoding to UTF-8, you need to do additional work,
such as calling utf8::upgrade():

  marvin@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x2012e0 "\304"\0
    CUR = 1
    LEN = 4
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
    CUR = 2
    LEN = 3
  marvin@smokey:~ $ 
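
The same thing can be checked from inside a program.  This is a minimal sketch
with made-up variable names: it upgrades a copy of the string and compares the
UTF8 flag, the byte length and the logical content before and after.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling the bytes pragma

my $latin1   = "\x{c4}";      # stored as one byte, UTF8 flag off
my $upgraded = $latin1;
utf8::upgrade($upgraded);     # same character, now stored as UTF-8

my $flag_before  = utf8::is_utf8($latin1)   ? 1 : 0;
my $flag_after   = utf8::is_utf8($upgraded) ? 1 : 0;
my $bytes_before = bytes::length($latin1);
my $bytes_after  = bytes::length($upgraded);
my $same         = $latin1 eq $upgraded ? 'yes' : 'no';

print "before: flag=$flag_before bytes=$bytes_before\n";   # flag=0 bytes=1
print "after:  flag=$flag_after bytes=$bytes_after\n";     # flag=1 bytes=2
print "equal as strings: $same\n";                          # yes
```

Only the internal representation changes.  The two strings still compare
equal and both have a length() of 1.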

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand.  Perl tries to hide
the details, but there are too many ways for it to fail silently.  ("perl -C",
$YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey