On January 9, 2006 at 22:19, Jeff Breidenbach wrote:
> When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
> ever produce invalid UTF-8? I ask because I'm taking some mhonarc
> output, stripping the HTML, then feeding the results to a Perl based text
> analysis program. Which occasionally complains bitterly, for example:
>   Malformed UTF-8 character (unexpected continuation byte 0x85, with no
>   preceding start byte)
I've made attempts to deal with malformed UTF-8, but I will have
to look into it. With TEXTENCODE and perl >= 5.8, MHonArc uses
the Encode module to do the encoding, so that may be a factor. With
perl < 5.8, I've tried to deal with it as best I know how.
Taking a quick look at the code, if the input is formally tagged
as us-ascii or utf-8, mhonarc passes the data through as-is when
encoding to UTF-8. Therefore, if the source has bad sequences, the
final output will have them too. It may be worth considering
whether mhonarc should do a sanity check on the data even when the
source claims to be utf-8. There may be security implications.
If you can provide me with a sample message, I can check it out.
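For illustration, the kind of sanity check mentioned above could be sketched with the Encode module roughly as follows. This is only a sketch, not MHonArc's actual code: the helper names is_valid_utf8 and sanitize_utf8 are hypothetical, and it assumes perl >= 5.8 with an Encode that supports the strict 'UTF-8' encoding name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of MHonArc): returns 1 if $bytes is
# well-formed UTF-8, 0 otherwise. FB_CROAK makes decode() die on the
# first malformed sequence; LEAVE_SRC keeps the input untouched.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: decode with the default fallback, which
# substitutes U+FFFD for each malformed sequence, then re-encode so
# the result is always well-formed UTF-8.
sub sanitize_utf8 {
    my ($bytes) = @_;
    my $text = Encode::decode('UTF-8', $bytes);
    return Encode::encode('UTF-8', $text);
}
```

A stray continuation byte such as the 0x85 in the error above would fail is_valid_utf8, and sanitize_utf8 would replace it with the U+FFFD replacement character instead of passing it through. Whether replacing or rejecting is the right policy is a separate question.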
--ewh