invalid UTF-8

2006-01-09 23:21:19

When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
ever produce invalid UTF-8? I ask because I'm taking some mhonarc
output, stripping the HTML, then feeding the results to a Perl-based text
analysis program, which occasionally complains bitterly. For example:

Malformed UTF-8 character (unexpected continuation byte 0x85, with no
preceding start byte)

Either I am corrupting the data when I strip the HTML markup, or
mhonarc is producing invalid UTF-8 once in a while. Note that the
source messages mhonarc has to work with could very well contain
invalid byte sequences themselves. If so, does mhonarc repair the
message to valid UTF-8, or just pass on the bad text?
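
One way to narrow this down would be to run the same validity check on
the raw mhonarc output and again on the text after the HTML has been
stripped, to see which step first contains bad bytes. The following is
only a minimal sketch of such a check, written in Python rather than
Perl; the function name find_invalid_utf8 is made up for illustration
and is not part of mhonarc or any library.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, offending_bytes) pairs, one per
    invalid UTF-8 sequence found in data. Hypothetical helper."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid sequence.
            data[pos:].decode("utf-8")
            break  # the remainder is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad bytes
    return problems
```

Called on the bytes of a file, this returns an empty list when the
content is valid UTF-8. A stray byte like the 0x85 in the warning above
would be reported together with its offset, which makes it easier to
look at the surrounding text and compare it with the original message.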


