Re: invalid UTF-8

On January 9, 2006 at 22:19, Jeff Breidenbach wrote:

> When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
> ever produce invalid UTF-8? I ask because I'm taking some mhonarc
> output, stripping the HTML, then feeding the results to a Perl based text
> analysis program. Which occasionally complains bitterly, for example:
>
> Malformed UTF-8 character (unexpected continuation byte 0x85, with no
> preceding start byte)


I've made attempts to deal with malformed UTF-8, but I will have
to look into this case.  With TEXTENCODE and perl >= 5.8, MHonArc uses
the Encode module to do the encoding, so that may be a factor.  With
perl < 5.8, I've tried to deal with it as best I know how.

Taking a quick look at the code: if the input is formally tagged
as us-ascii or utf-8 and the target encoding is UTF-8, mhonarc
passes the data through as-is.  Therefore, if the source has bad
sequences, the final output will have them too.  It may be worth
considering whether mhonarc should sanity-check the data even when
the source claims to be utf-8.  There may be security implications.
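
To illustrate the kind of sanity check meant here, a minimal sketch
follows. It is written in Python for brevity, not in mhonarc's Perl, and
it is not taken from the mhonarc source; the function name
`sanitize_utf8` is hypothetical. It assumes the policy is to replace
each malformed sequence with U+FFFD rather than reject the message.

```python
from typing import Tuple


def sanitize_utf8(data: bytes) -> Tuple[str, bool]:
    """Decode bytes that are labelled as UTF-8.

    Returns (text, was_valid). Hypothetical helper for illustration;
    the replace-with-U+FFFD policy is an assumption, not mhonarc's
    documented behavior.
    """
    try:
        # Strict decoding raises on any malformed sequence, such as a
        # continuation byte with no preceding start byte.
        return data.decode("utf-8", errors="strict"), True
    except UnicodeDecodeError:
        # Fall back to substituting U+FFFD for each malformed sequence
        # so that downstream consumers receive well-formed text.
        return data.decode("utf-8", errors="replace"), False


if __name__ == "__main__":
    well_formed = "caf\u00e9".encode("utf-8")
    stray_byte = b"wait\x85 what"  # lone 0x85, as in the reported error
    print(sanitize_utf8(well_formed))
    print(sanitize_utf8(stray_byte))
```

The returned flag lets the caller log or count messages whose declared
charset did not match their content, instead of silently passing the
bad bytes through.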

If you can provide me with a sample message, I can check it out.

--ewh
