invalid UTF-8


When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
ever produce invalid UTF-8? I ask because I'm taking some mhonarc
output, stripping the HTML, then feeding the results to a Perl-based text
analysis program, which occasionally complains bitterly. For example:

Malformed UTF-8 character (unexpected continuation byte 0x85, with no
preceding start byte)

Either I am corrupting the data when I strip the HTML markup, or maybe
mhonarc is producing invalid UTF-8 once in a while. Note that the
source messages mhonarc has to work with could very well contain
invalid byte sequences themselves. If so, does mhonarc repair the
message to valid UTF-8, or just pass on the bad text?
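
One way to narrow this down would be to run the same validity check on
the raw mhonarc output and on the text after the HTML has been stripped.
Below is a minimal sketch in Python, not anything mhonarc provides. The
function name find_invalid_utf8 and the idea of passing file names on
the command line are assumptions made for illustration only.

```python
import sys


def find_invalid_utf8(data: bytes):
    """Return (offset, byte_value) for each spot where strict UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first invalid sequence it meets.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            bad = pos + err.start
            problems.append((bad, data[bad]))
            # Skip the offending byte and keep scanning the rest.
            pos = bad + 1
    return problems


if __name__ == "__main__":
    # Hypothetical usage: pass the mhonarc output file and the stripped
    # text file as arguments and compare the two reports.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for offset, value in find_invalid_utf8(handle.read()):
                print(f"{path}: invalid byte 0x{value:02x} at offset {offset}")
```

If the mhonarc output comes back clean but the stripped text does not,
the stripping step is the likely culprit. If both report the same
offsets, the bad bytes were already in what mhonarc wrote. As a side
note, 0x85 is the ellipsis character in Windows-1252, so a stray 0x85
may point at text in that charset that was never converted.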

-Jeff

