perl-unicode

Re: Problems with XML - What exactly does "Cannot decode string with wide characters" mean?

2002-11-12 01:30:10
On Mon, 11 Nov 2002 23:37:12 -0800, Daisuke Maki 
<daisuke(_at_)wafu(_dot_)ne(_dot_)jp> said:

  >        utf82euc( $xml->findvalue( 'foobar' ) );

  >     where utf82euc() is a convenience function that I wrote which does:

  >        my $octets = decode( 'utf8', $text );

decode() doesn't return octets; it returns a string in perl's internal
format.

  >        return encode( 'euc-jp', $octets );

encode() doesn't take octets as its second parameter; it takes a string
in perl's internal format.

  > The problem is that when I call decode(), I get the error

  >      "Cannot decode string with wide characters"

This is a typical beginner problem. I had to look up again and again
what decode and encode do. Think from perl's point of view: decode
transforms octets into perl's internal format, encode transforms from
perl's internal format into octets. Treat the internal format as a
black box and forget that you know that it has something to do with
UTF-8.

Decode only works if the octets you feed to it are really in the
format you specify.

Encode only works if the string is really in perl's internal format.
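The two directions can be sketched as follows. This is only a minimal
illustration, not code from the original thread: the sample octets and
the choice of UTF-16BE as the target encoding are assumptions made for
the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or socket:
# the UTF-8 byte sequence for "foo\x{100}bar" (assumed sample data).
my $octets = "foo\xC4\x80bar";

# decode: octets in a named encoding -> string in perl's internal format.
my $string = decode('UTF-8', $octets);

# encode: string in perl's internal format -> octets in a named encoding.
my $utf16 = encode('UTF-16BE', $string);

printf "octets: %d bytes, string: %d characters, UTF-16BE: %d bytes\n",
    length($octets), length($string), length($utf16);
```

The lengths differ because $octets and $utf16 count bytes, while
$string counts characters.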

Once that is clear, you know that you need to know exactly which
format you get from the modules you use. If the manpage doesn't help
you, use Devel::Peek to determine what you have:

% perl -le '
use XML::LibXML;
use Devel::Peek;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string(<<EOT);
<foobar>foo\x{100}bar</foobar>
EOT
my $v = $doc->findvalue("foobar");
Dump $v;
'
SV = PVMG(0x823dae8) at 0x81ab1f8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x81e4848 "foo\304\200bar"\0 [UTF8 "foo\x{100}bar"]
  CUR = 8
  LEN = 9

You see, XML::LibXML returns strings already in perl's internal
format (note the UTF8 flag). So you should not call decode() on this
string at all; it is ready for use, and only the encode() step is
needed.
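Applied to the convenience function, that could look like the sketch
below. The name string2euc() and the sample character are hypothetical,
chosen only for illustration; the point is that the function takes a
string and returns octets, with no decode() step.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical replacement for utf82euc(): takes a string in perl's
# internal format and returns EUC-JP octets.
sub string2euc {
    my ($string) = @_;
    return encode('euc-jp', $string);
}

# Stand-in for a value such as findvalue() returns:
# HIRAGANA LETTER A as a character string (assumed sample data).
my $string = "\x{3042}";
my $octets = string2euc($string);

# Show the resulting octets as hex digits.
print unpack("H*", $octets), "\n";
```

In the original program the call would then read
string2euc( $xml->findvalue( 'foobar' ) ).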

And rename your variables! The Encode manpage uses the variable names
$string and $octets for a reason:-)


Hope this helps,
-- 
andreas