On 30 Jun 2004, at 04:11, Edward Batutis wrote:
Malformed UTF-8 character (unexpected non-continuation byte 0x73,
immediately after start byte 0xe9) in substitution iterator at
/usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm line 435, <DATA> line 1.
Segmentation fault
I think this is an internal 'utf-8 flag' problem. I'm not a Perl internals
expert, but there seems to be some funny goings on internally where Latin-1
strings are sometimes stored as Latin-1 and sometimes as UTF-8. The magic
UTF-8 flag has to match the internal representation of the string. In this
case, the UTF-8 flag is set, but the internal representation is not UTF-8
but Latin-1. That causes the low-level string parser to barf.
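To make the flag-versus-bytes mismatch concrete, here is a minimal sketch using only core Perl. It assumes the failing string had the UTF-8 flag set over Latin-1 bytes, as described above. Encode::_utf8_on is an internal function, used here only to fake that state for illustration. This is not a claim about how HTML::Entities actually got there, and the variable names are made up.

```perl
use strict;
use warnings;
use Encode ();

# A Latin-1 byte string: e-acute is the single byte 0xE9, flag is off.
my $latin1 = "R\xe9sum\xe9";

# Healthy case: upgrade a copy to the internal UTF-8 form. The stored
# bytes change and the flag is set, but the characters are the same.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

# Broken case (illustration only): switch the flag on WITHOUT converting
# the bytes, so 0xE9 looks like a start byte followed by "s" (0x73).
my $broken = $latin1;
Encode::_utf8_on($broken);

printf "latin1:   flag=%s well-formed=%s\n",
    (utf8::is_utf8($latin1)   ? "on" : "off"),
    (utf8::valid($latin1)     ? "yes" : "no");
printf "upgraded: flag=%s well-formed=%s\n",
    (utf8::is_utf8($upgraded) ? "on" : "off"),
    (utf8::valid($upgraded)   ? "yes" : "no");
printf "broken:   flag=%s well-formed=%s\n",
    (utf8::is_utf8($broken)   ? "on" : "off"),
    (utf8::valid($broken)     ? "yes" : "no");
```

The third string is the kind of value that would produce a "non-continuation byte 0x73, immediately after start byte 0xe9" complaint once something tries to walk it character by character.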
It might be possible to find a workaround by converting explicitly to UTF-8
and maybe manually setting the flag. I'm not sure that'd help though. See
the Encode docs for more info on how you would do this.
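For reference, an explicit conversion with Encode might look roughly like the sketch below. It assumes the input really is Latin-1 octets; the sample string and variable names are only for illustration, and whether this avoids the crash in this particular case is untested.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Latin-1 octets as they might arrive from a feed (assumed input).
my $bytes = "R\xe9sum\xe9";

# Octets -> Perl characters. Encode takes care of the internal flag,
# so there is no need to set it by hand.
my $chars = decode('iso-8859-1', $bytes);

# Characters -> UTF-8 octets, ready to be written out.
my $utf8 = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 octets: %d\n", length($chars), length($utf8);
```

Going through decode and encode keeps the flag and the stored bytes in step, which is the property the manual flag-setting approach risks breaking.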
I tried doing explicit conversions, but didn't get very far. I did come
up with a screwy workaround, however.
In my original mail the offending line was:
<title>The Modern R&amp;eacute;sum&amp;eacute;</title>
Now this is a bit off, because it is RSS, and therefore UTF-8, but it's got
encoded Latin-1 entities (&eacute;) in there, with the & further encoded
as &amp; for XML safety.
After XML parsing the &amp; are resolved to &. Then I used decode_entities
to resolve the &eacute;, then encoded again - and that's where it
crashed. (I had to encode again because _most_ of the RSS feeds I'm
dealing with don't start with encoded entities.)
Anyway, if I drop the initial decode, and do
s/&amp;([A-Za-z]+);/&$1;/g after the encoding, I solve the double
encoding problem without causing a segfault.
So it looks like:
parse xml
$title = $item->title;
# decode_entities( $title );   # removed this line
encode_entities( $title );
$title =~ s/&amp;([A-Za-z]+);/&$1;/g;   # added this one
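The substitution step on its own can be tried with core Perl, as in the small sketch below. The input string is an assumption: it is what the title would be expected to look like after encode_entities has re-escaped the & of each entity that was already there. One thing to watch is where the + sits relative to the capture group, since that decides how much of the entity name ends up in $1.

```perl
use strict;
use warnings;

# Assumed state of the title after encode_entities(): the "&" of each
# existing entity has been escaped again, as has a genuine bare "&".
my $encoded = 'The Modern R&amp;eacute;sum&amp;eacute; &amp; more';

# Quantifier outside the group: $1 keeps only the last letter matched,
# so the entity name is lost.
(my $last_letter = $encoded) =~ s/&amp;([A-Za-z])+;/&$1;/g;

# Quantifier inside the group: $1 keeps the whole entity name.
(my $whole_name = $encoded) =~ s/&amp;([A-Za-z]+);/&$1;/g;

print "$last_letter\n";
print "$whole_name\n";
```

The genuine "&amp; more" is left alone in both cases because no entity name follows it. A title that really was meant to display the literal text "&lt;" would be collapsed as well, so this stays a workaround and not a general fix.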
Richard
=Ed Batutis