perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 19:34:54
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
I think what I need is some code to strip non-utf8 characters from a string
-- even if that string has the utf8 bit switched on. I thought that Encode
would do that for me, but in this case apparently not. Anyone got an
example?

Try this:

   Encode::_utf8_off($string);
   $string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode
replacement character.  
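
A minimal, self-contained sketch of the two-step approach quoted above. The input string here is hypothetical (a lone 0x80 continuation byte, which is never valid UTF-8 on its own), and the variable names are illustrative only. It relies on Encode's default CHECK behaviour, which substitutes U+FFFD for malformed input instead of dying:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical input: ASCII text with a stray continuation byte (0x80)
# in the middle. This is an assumption for illustration, not the data
# from the thread.
my $string = "abc\x80def";

# Clear the UTF8 flag so the scalar is treated as raw octets.
# (A no-op here, since the flag was never set on this literal.)
Encode::_utf8_off($string);

# Decode with the default CHECK value: malformed sequences are replaced
# with U+FFFD rather than raising an error.
my $clean = Encode::decode('utf8', $string);

# Print code points instead of characters to avoid "Wide character" warnings.
print join(' ', map { sprintf 'U+%04X', ord } split //, $clean), "\n";
```

Note that this only helps when the octets are actually malformed. Octets that happen to be well-formed UTF-8 pass through decode() unchanged, whatever they were meant to represent.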

Yeah. Not working for me. See attached script. Devel::Peek says:

SV = PV(0x100801f18) at 0x10082f368
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
  CUR = 29
  LEN = 32

So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is 
that crap?
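
One possible reading of those four octets, sketched here as an assumption and not confirmed anywhere in this message: they are well-formed UTF-8 for the two characters U+00C4 and U+008D (as the dump's own [UTF8 ...] field shows), and those two code points are in turn the UTF-8 bytes of U+010D (c with caron) as if the text had been encoded twice. The variable names below are illustrative only:

```perl
use strict;
use warnings;
use Encode ();

# The four octets shown in the PV field of the Devel::Peek dump.
my $octets = "\303\204\302\215";

# First decode: the octets are valid UTF-8 for two characters.
my $once = Encode::decode('UTF-8', $octets);
printf "once:  %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $once;
# once:  U+00C4 U+008D

# Assumption: those two code points are really bytes from an earlier
# UTF-8 encoding. Map them back to octets via Latin-1, then decode again.
my $bytes = Encode::encode('iso-8859-1', $once);
my $twice = Encode::decode('UTF-8', $bytes);
printf "twice: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $twice;
# twice: U+010D
```

If that reading holds, the string is valid UTF-8 as far as Encode is concerned, which would explain why decoding it again inserts no replacement characters.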

Confused and frustrated,

David

Attachment: try.pl
Description: Text Data