Re: invalid UTF-8

2006-01-10 15:49:41
On January 9, 2006 at 22:19, Jeff Breidenbach wrote:

When mhonarc is producing UTF-8 using the TEXTENCODE resource, does it
ever produce invalid UTF-8? I ask because I'm taking some mhonarc
output, stripping the HTML, then feeding the results to a Perl-based text
analysis program, which occasionally complains bitterly. For example:

Malformed UTF-8 character (unexpected continuation byte 0x85, with no
preceding start byte)

I've made attempts to deal with malformed UTF-8, but I will have
to look into it.  With TEXTENCODE and perl >= 5.8, MHonArc uses
the Encode module to do the encoding, so that module's behavior may
be a factor.  With perl < 5.8, I've tried to deal with it as best I
know how.

Taking a quick look at the code, if the input is formally tagged
as us-ascii or utf-8, mhonarc passes the data through as-is when
encoding to UTF-8.  Therefore, if the source has bad sequences, the
final output will also have them.  It may be worth considering
whether mhonarc should do a sanity check on the data even if the
source claims to be utf-8.  There may be security implications.
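
For illustration, a sanity check of the kind described above could look
roughly like the following. This is only a minimal sketch, assuming
perl >= 5.8 with the Encode module available; the helper names
is_valid_utf8 and sanitize_utf8 are hypothetical and are not part of
mhonarc. It relies on the strict "UTF-8" encoding name rather than the
lax "utf8" one.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of mhonarc): true if $octets is
# well-formed UTF-8. The strict 'UTF-8' name is used on purpose;
# Encode's lax 'utf8' accepts more than the standard allows.
sub is_valid_utf8 {
    my ($octets) = @_;    # local copy; decode() may modify it under FB_CROAK
    return eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: return octets that are well-formed UTF-8, with
# each malformed sequence replaced by U+FFFD (REPLACEMENT CHARACTER).
sub sanitize_utf8 {
    my ($octets) = @_;
    # FB_DEFAULT makes decode() substitute U+FFFD instead of dying.
    my $chars = Encode::decode('UTF-8', $octets, Encode::FB_DEFAULT);
    return Encode::encode('UTF-8', $chars);
}

# Example: a stray 0x85 byte, as in the reported warning.
my $body = "abc\x85def";
$body = sanitize_utf8($body) unless is_valid_utf8($body);
```

Replacing bad sequences rather than rejecting the whole message is just
one possible policy; a caller could equally choose to treat the data as
some other charset when the check fails.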

If you can provide me with a sample message, I can check it out.


  • invalid UTF-8, Jeff Breidenbach
    • Re: invalid UTF-8, Earl Hood <=