perl-unicode

RE: Segfault using HTML::Entities

2004-06-29 20:30:08

Malformed UTF-8 character (unexpected non-continuation byte 0x73,
immediately after start byte 0xe9) in substitution iterator at
/usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm
line 435, <DATA> line 1.
Segmentation fault


I think this is an internal UTF-8 flag problem. I'm not a Perl internals
expert, but something odd seems to be going on internally: Latin-1 strings
are sometimes stored as Latin-1 and sometimes as UTF-8. The UTF-8 flag has
to match the internal representation of the string. In this case the flag
is set, but the internal representation is Latin-1, not UTF-8. The low-level
string parser then reads the Latin-1 byte 0xe9 as the start of a multi-byte
UTF-8 sequence, finds 0x73 where a continuation byte should be, and barfs.
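
To make the flag/representation distinction concrete, here is a minimal
sketch using only core functions (utf8::is_utf8, utf8::upgrade, utf8::valid
and Encode::_utf8_on). The sample string is hypothetical; it was chosen only
because it puts 0x73 right after 0xe9, like the bytes in the warning above.
It is not the data from the original report.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: "\xe9" (e-acute in Latin-1) followed by "s" (0x73).
my $native = "caf\xe9s";          # stored as native 8-bit bytes, flag off

my $upgraded = $native;
utf8::upgrade($upgraded);         # same characters, stored as UTF-8, flag on

# Both forms are consistent and compare equal as character strings.
printf "native:   flag=%d length=%d\n",
    (utf8::is_utf8($native) ? 1 : 0), length($native);
printf "upgraded: flag=%d length=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length($upgraded);
print "equal as characters\n" if $native eq $upgraded;

# Deliberately recreate the mismatch: flag on, buffer still Latin-1.
# Do not do this in real code; it is here only to show the bad state.
my $broken = $native;
Encode::_utf8_on($broken);
print "broken string is ",
    (utf8::valid($broken) ? "consistent" : "inconsistent"), "\n";
```

The first two strings are the healthy cases: the flag differs, but it agrees
with the buffer each time. The third is the state I suspect HTML::Entities
is being handed.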

It might be possible to work around this by converting the string explicitly
to UTF-8, and maybe setting the flag manually. I'm not sure that would help,
though. See the Encode docs for more info on how you would do this.
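
A rough sketch of that kind of workaround, assuming the input arrives as raw
ISO-8859-1 bytes (that encoding, and the sample data, are my assumptions).
It uses Encode::decode rather than setting the flag by hand, since turning
the flag on manually is only correct when the buffer already holds valid
UTF-8. Whether this avoids the segfault in HTML::Entities is untested.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: raw Latin-1 bytes, e.g. as read from a file or socket.
my $bytes = "r\xe9sum\xe9s";

# decode() returns a character string. Perl manages the flag itself here,
# so the flag and the internal buffer cannot disagree.
my $chars = decode('ISO-8859-1', $bytes);

# Optionally force the internal UTF-8 representation. This changes how the
# string is stored, not which characters it contains.
utf8::upgrade($chars);

printf "flag=%d consistent=%d length=%d\n",
    (utf8::is_utf8($chars) ? 1 : 0),
    (utf8::valid($chars)   ? 1 : 0),
    length($chars);

# $chars could now be handed to the entity-encoding call.
```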

Internals experts: I find the "magic" Latin-1-ization of my strings to be a
pain in the neck sometimes. No doubt it works wonders for backwards
compatibility at times, but when I need to send UTF-8 out to external
modules, I often have to make sure that what I send really is UTF-8. It is a
pain to have to check each string to see whether it is Latin-1 or UTF-8. Is
there a way to stop the magic?
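
For the output side, one approach that sidesteps the check is to encode at
the boundary: Encode::encode produces the same UTF-8 octets whichever way
the string happens to be stored. A minimal sketch, where to_utf8_octets is a
hypothetical helper name and not part of any module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: always return UTF-8 octets for a character string,
# regardless of how Perl stores that string internally.
sub to_utf8_octets {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

my $native = "\xe9";          # e-acute, stored as one Latin-1 byte
my $upgraded = "\xe9";
utf8::upgrade($upgraded);     # same character, stored internally as UTF-8

my $out1 = to_utf8_octets($native);
my $out2 = to_utf8_octets($upgraded);

# Both results are the two octets 0xC3 0xA9.
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $out1;
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $out2;
```

This does not stop the magic; it just means the caller never has to look at
the flag before handing data to an external module.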

=Ed Batutis

