Re: Segfault using HTML::Entities


On 30 Jun 2004, at 04:11, Edward Batutis wrote:

Malformed UTF-8 character (unexpected non-continuation byte 0x73,
immediately after start byte 0xe9) in substitution iterator at
/usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm
line 435, <DATA> line 1.
Segmentation fault
I think this is an internal 'utf-8 flag' problem. I'm not a Perl internals expert, but there seems to be some funny goings-on internally where Latin-1 strings are sometimes stored as Latin-1 and sometimes as UTF-8. The magic UTF-8 flag has to match the internal representation of the string. In this case, the UTF-8 flag is set, but the internal representation is not UTF-8 but Latin-1. That causes the low-level string parser to barf.
It might be possible to find a workaround by converting explicitly to UTF-8 and maybe manually setting the flag. I'm not sure that'd help though. See the Encode docs for more info on how you would do this.
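
For reference, a minimal sketch of what "converting explicitly to UTF-8" could look like with the core Encode module and the built-in utf8:: functions. The sample string and variable names are made up for illustration, and nothing here is confirmed to avoid the segfault; it only shows ways to get a string whose flag and internal representation agree, rather than forcing the flag on by hand.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Resume" with e-acute, as raw Latin-1 bytes (0xE9 is e-acute in ISO-8859-1).
# This sample is an assumption, not data from the failing feed.
my $latin1_bytes = "R\xE9sum\xE9";

# decode() converts the bytes to a Perl character string; Perl sets the
# UTF-8 flag itself, so flag and internal storage stay consistent.
my $chars = decode('iso-8859-1', $latin1_bytes);

# utf8::upgrade() is a core alternative: it switches the internal storage
# to UTF-8 in place without changing which characters the string holds.
my $copy = $latin1_bytes;
utf8::upgrade($copy);

printf "flag after decode:  %s\n", utf8::is_utf8($chars) ? 'on' : 'off';
printf "flag after upgrade: %s\n", utf8::is_utf8($copy)  ? 'on' : 'off';
printf "same characters:    %s\n", $chars eq $copy       ? 'yes' : 'no';
```

Both routes leave the characters unchanged; only the internal storage differs from the original byte string.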

I tried doing explicit conversions, but didn't get very far. I did come up with a screwy workaround, however.


In my original mail the offending line was:

<title>The Modern R&amp;eacute;sum&amp;eacute;</title>

Now this is a bit off, because it is RSS, therefore UTF-8, but it's got encoded Latin-1 entities (&eacute;) in there, with the & further encoded as &amp; for XML safety.

After XML parsing the &amp; are resolved to &. Then I used decode_entities to resolve the &eacute;, then encoded again, and that's where it crashed. (I had to encode again because _most_ of the RSS feeds I'm dealing with don't start with encoded entities.)

Anyway, if I drop the initial decode, and do s/&amp;([A-Za-z]+);/&$1;/g after the encoding, I solve the double-encoding problem without causing a segfault.


So it looks like:

# parse xml
$title = $item->title;
# decode_entities( $title );  # removed this line
encode_entities( $title );
$title =~ s/&amp;([A-Za-z]+);/&$1;/g;  # added this one
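
A small self-contained check of the substitution step on its own, using the title from the original mail as it would look once the ampersands have been re-escaped. The variable names are just for illustration. Note that the + sits inside the capture group, so $1 holds the whole entity name rather than only its last letter.

```perl
use strict;
use warnings;

# Title as it would look after the encoding step has turned the & of
# each existing entity into &amp; (sample from the original mail).
my $title = 'The Modern R&amp;eacute;sum&amp;eacute;';

# Undo the double encoding for named entities only. With the quantifier
# inside the parentheses, $1 captures the full name (e.g. "eacute").
(my $fixed = $title) =~ s/&amp;([A-Za-z]+);/&$1;/g;

print "$fixed\n";   # prints: The Modern R&eacute;sum&eacute;
```

This only handles named entities. Numeric ones such as &amp;#233; would stay double-encoded, and a title that really did contain the literal text "&amp;name;" would be changed as well.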

Richard

=Ed Batutis