RE: Segfault using HTML::Entities

Malformed UTF-8 character (unexpected non-continuation byte 0x73,
immediately after start byte 0xe9) in substitution iterator at
/usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm
line 435, <DATA> line 1.
Segmentation fault


I think this is an internal UTF-8 flag problem. I'm not a Perl internals
expert, but Perl seems to store Latin-1 strings internally sometimes as
Latin-1 and sometimes as UTF-8, and the UTF-8 flag has to match whichever
representation is actually in use. In this case the flag appears to be set
while the internal bytes are still Latin-1: 0xe9 is "é" in Latin-1 but the
start of a three-byte sequence in UTF-8, so the 0x73 ("s") that follows is
not a valid continuation byte. That causes the low-level string parser to
barf.
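
To illustrate the mismatch, here is a minimal sketch. It is my own assumed reproduction, not the code from the original report, and the variable names are made up. It asks Encode to read the two bytes named in the error message first as strict UTF-8 and then as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes from the error message: 0xE9 ("é" in Latin-1), then 0x73 ("s").
my $latin1_bytes = "\xE9\x73";

# Decode a copy, because decode() may modify its input when a CHECK value is given.
my $copy = $latin1_bytes;
my $as_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

# Read as Latin-1, the same bytes are a perfectly good two-character string.
my $as_latin1 = decode('ISO-8859-1', $latin1_bytes);

print defined $as_utf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
print "characters when read as Latin-1: ", length($as_latin1), "\n";
```

If the bytes are fine as Latin-1 but rejected as UTF-8, that would be consistent with a string whose flag says UTF-8 while its contents are not.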

It might be possible to work around this by converting explicitly to UTF-8
with Encode, and maybe by setting the flag manually. I'm not sure that would
help, though. See the Encode docs for more info on how you would do this.
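
A rough sketch of the explicit conversion, assuming the input really is Latin-1 bytes (that is a guess about the original data, and the variable names are hypothetical). It uses Encode's decode() and encode() rather than touching the flag by hand, so Perl keeps the flag and the representation in step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw Latin-1 bytes for "és".
my $input_bytes = "\xE9s";

# Step 1: turn the bytes into a Perl character string.
my $chars = decode('ISO-8859-1', $input_bytes);

# Step 2: turn the characters into real UTF-8 octets for the outside world.
my $utf8_octets = encode('UTF-8', $chars);

# Show the octets in hex so the conversion is visible.
my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8_octets;
print "$hex\n";    # prints "C3 A9 73"
```

Whether this avoids the segfault in HTML::Entities is untested; it only shows the conversion itself.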

Internals experts: I find the "magic" Latin-1-ization of my strings to be a
pain in the neck sometimes. No doubt it works wonders for backwards
compatibility at times, but if I need to send UTF-8 out to external modules
I often need to make sure it is really UTF-8 I send. It is a pain to have to
check the string to see if it is Latin-1 or UTF-8. Is there a way to stop
the magic?
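
For what it's worth, here is a sketch of one way to avoid the manual check, assuming a Perl recent enough to provide utf8::is_utf8 and utf8::upgrade. The variable names are hypothetical. The idea is that encode() works on characters, so it should produce the same UTF-8 octets whichever internal representation the string happens to have.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same two characters, "é" and "s", in both internal representations.
my $native   = "\xE9s";          # stored internally as Latin-1, flag off
my $upgraded = $native;
utf8::upgrade($upgraded);        # stored internally as UTF-8, flag on

# The flag differs, but the strings compare equal as characters.
print "native flag:   ", (utf8::is_utf8($native)   ? "on" : "off"), "\n";
print "upgraded flag: ", (utf8::is_utf8($upgraded) ? "on" : "off"), "\n";
print "same characters\n" if $native eq $upgraded;

# Encoding either one yields identical UTF-8 octets.
my $octets_from_native   = encode('UTF-8', $native);
my $octets_from_upgraded = encode('UTF-8', $upgraded);
print "same octets\n" if $octets_from_native eq $octets_from_upgraded;
```

This does not stop the magic; it only makes the output independent of it.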

=Ed Batutis
