perl-unicode

Re: Segfault using HTML::Entities

2004-06-30 06:30:08

On 30 Jun 2004, at 04:11, Edward Batutis wrote:


Malformed UTF-8 character (unexpected non-continuation byte 0x73,
immediately after start byte 0xe9) in substitution iterator at
/usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm
line 435, <DATA> line 1.
Segmentation fault


I think this is an internal 'UTF-8 flag' problem. I'm not a Perl internals expert, but there seem to be some funny goings-on internally where Latin-1 strings are sometimes stored as Latin-1 and sometimes as UTF-8. The magic UTF-8 flag has to match the internal representation of the string. In this case, the UTF-8 flag is set, but the internal representation is not UTF-8 but Latin-1. That causes the low-level string parser to barf.

It might be possible to find a workaround by converting explicitly to UTF-8 and maybe manually setting the flag. I'm not sure that'd help though. See the Encode docs for more info on how you would do this.
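
For what it's worth, an explicit conversion might look roughly like the sketch below. It assumes the input is raw Latin-1 octets (the sample string in $bytes is made up), and it uses utf8::upgrade rather than poking the flag directly, since that changes the internal representation and the flag together. Whether this actually avoids the segfault is untested.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: $bytes holds raw Latin-1 octets, e.g. as read from a feed.
my $bytes = "R\xE9sum\xE9";

# decode() turns the octets into a Perl character string.
my $chars = decode('ISO-8859-1', $bytes);

# utf8::upgrade() switches the internal storage to UTF-8 and sets the
# flag to match; the characters themselves do not change.
utf8::upgrade($chars);

# encode() produces UTF-8 octets suitable for writing out.
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes, flag %s\n",
    length($chars), length($utf8_bytes),
    utf8::is_utf8($chars) ? 'on' : 'off';
```

The point of going through decode() and utf8::upgrade() is that the flag and the stored bytes can never disagree, which is the mismatch suspected above.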

I tried doing explicit conversions, but didn't get very far. I did come up with a screwy workaround, however.

In my original mail the offending line was:

<title>The Modern R&amp;eacute;sum&amp;eacute;</title>

Now this is a bit off, because it is RSS, and therefore UTF-8, but it's got encoded Latin-1 entities (&eacute;) in there, with the & further encoded for XML safety.

After XML parsing the &amp; are resolved. Then I used decode_entities to resolve the &eacute;, then encoded again, and that's where it crashed. (I had to encode again because _most_ of the RSS feeds I'm dealing with don't start with encoded entities.)

Anyway, if I drop the initial decode and do s/&amp;([A-Za-z]+);/&$1;/g after the encoding, I solve the double-encoding problem without causing a segfault.

So it looks like:

# parse the XML first, then:
$title = $item->title;
# decode_entities( $title );  # removed this line
encode_entities( $title );
$title =~ s/&amp;([A-Za-z]+);/&$1;/g;  # added this one
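
To see what the substitution does on its own, here is a small sketch that needs only core Perl. The sample string is an assumption: it is what the title should look like after encode_entities has turned each & into &amp;, with a bare ampersand added to show that a real & is left alone. Note that the + has to sit inside the capture group; with ([A-Za-z])+ the capture holds only the last letter matched.

```perl
use strict;
use warnings;

# Assumption: the title as it looks after encode_entities() has run.
my $title = 'The Modern R&amp;eacute;sum&amp;eacute; &amp; more';

# Quantifier inside the group: $1 is the whole entity name.
(my $fixed = $title) =~ s/&amp;([A-Za-z]+);/&$1;/g;

# Quantifier outside the group: $1 is only the final letter.
(my $wrong = $title) =~ s/&amp;([A-Za-z])+;/&$1;/g;

print "$fixed\n";
print "$wrong\n";
```

The first line printed has &eacute; restored and the standalone &amp; untouched; the second shows the mangled &e; you get when the quantifier is outside the parentheses.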

Richard

=Ed Batutis


