perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 18:47:28
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> I think what I need is some code to strip non-utf8 characters from a string
> -- even if that string has the utf8 bit switched on. I thought that Encode
> would do that for me, but in this case apparently not. Anyone got an
> example?

Try this:

    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);

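That will replace any malformed UTF-8 byte sequences with the Unicode
replacement character, U+FFFD.  Note that 'utf8' is Encode's lax variant; use
'UTF-8' instead if you want Encode's strict checking, which also rejects
surrogates and code points beyond U+10FFFF.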
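
As a rough illustration of that substitution, here is a small self-contained
sketch.  The sample input is made up for the example: the UTF-8 bytes for
"caf" plus e-acute, followed by a stray continuation byte (0x80) that cannot
stand on its own.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical input: valid UTF-8 bytes for "caf\x{E9}", then a stray 0x80, then "!".
my $string = "caf\xC3\xA9\x80!";

Encode::_utf8_off($string);                   # make sure Perl sees raw bytes
$string = Encode::decode('utf8', $string);    # default CHECK substitutes U+FFFD

# Print the code points, dot-separated, so the substitution is visible.
printf "%vX\n", $string;
```

The valid two-byte sequence should come back as a single character, and the
stray byte should show up as FFFD in the printed list.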

If you want to guarantee that the UTF8 flag is on first, do this:

    utf8::upgrade($string);
    Encode::_utf8_off($string);
    $string = Encode::decode('utf8', $string);
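
For what it's worth, those three steps can be wrapped in a small helper.  This
is only a sketch: scrub_utf8 is a name invented here, not something Encode
provides, and both sample inputs are made up.  The first input shows why the
upgrade matters for a byte string holding Latin-1 data; the second mimics a
string whose flag was forced on over malformed bytes.

```perl
use strict;
use warnings;
use Encode ();

# scrub_utf8 is a hypothetical helper name, not part of Encode.
sub scrub_utf8 {
    my ($string) = @_;
    utf8::upgrade($string);                    # internal buffer becomes UTF-8, flag on
    Encode::_utf8_off($string);                # expose that buffer as raw bytes
    return Encode::decode('utf8', $string);    # malformed sequences become U+FFFD
}

# Unflagged Latin-1 byte string: the upgrade re-encodes 0xE9, so it survives.
my $latin1 = scrub_utf8("caf\xE9");

# Flag forced on over a stray continuation byte (assumed sample input).
my $broken = "abc\x80";
Encode::_utf8_on($broken);
my $fixed = scrub_utf8($broken);

printf "%vX\n%vX\n", $latin1, $fixed;
```

Because the helper works on a copy of its argument, the caller's original
string is left untouched.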

Devel::Peek's Dump() function will come in handy for checking results.
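
A minimal sketch of that kind of check, using a made-up sample string.
Dump() writes to STDERR; the thing to look at is whether UTF8 appears in the
FLAGS line and what the PV bytes are.

```perl
use strict;
use warnings;
use Encode ();
use Devel::Peek qw(Dump);

# Hypothetical input: UTF-8 bytes for "caf\x{E9}", UTF8 flag off.
my $string = "caf\xC3\xA9";
Dump($string);     # FLAGS should not include UTF8 here

my $decoded = Encode::decode('utf8', $string);
Dump($decoded);    # FLAGS should now include UTF8
```

If you only need the flag and not the full dump, utf8::is_utf8($decoded)
reports the same bit.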

Cheers,

Marvin Humphrey