Re: [Nmh-workers] bug in decode_rfc2047()

2013-01-03 12:32:04
> The root of all this is iconv's behavior that requires us to
> skip past the invalid character.  Looking at it now, I wonder if
> we can do better than the current special handling for UTF-8?
> It's the "fromutf8" block below:
> [...]

Hm.  I played around with this a bit, and I'm not sure what to do.

iconv() doesn't distinguish between "We can't convert this character to
the target character set" and "This multibyte sequence is invalid"; they
both get EILSEQ.  Even worse, we can't (portably) tell where the end of
a multibyte sequence is.
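
To make the ambiguity concrete, here is a minimal sketch (not nmh code) that runs one conversion and reports the errno iconv() leaves behind.  The helper name probe_errno() is made up for illustration, and I'm assuming the charset names "UTF-8" and "ASCII" are accepted by the local iconv_open().

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `len` bytes of `in` from charset `from`
 * to charset `to`.  Returns 0 on success, otherwise the errno value
 * set by iconv_open() or iconv(). */
static int
probe_errno(const char *from, const char *to, const char *in, size_t len)
{
    char out[64];
    char *inp = (char *) in;    /* iconv() wants a non-const char ** */
    char *outp = out;
    size_t inleft = len;
    size_t outleft = sizeof out;
    int err = 0;
    iconv_t cd;

    assert(from != NULL && to != NULL && in != NULL);

    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return errno;

    /* Save errno right away; iconv_close() may overwrite it. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        err = errno;

    iconv_close(cd);
    return err;
}
```

Calling probe_errno("UTF-8", "ASCII", "a\xff" "b", 3) hits an invalid byte and comes back with EILSEQ.  With glibc or GNU libiconv I'd expect a perfectly valid "\xc3\xa9" (e-acute) to give the same EILSEQ, since ASCII has no such character; other implementations may behave differently there.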

So, I see a couple of options.  We could go completely portable and put
in a "?" (or whatever) for every byte that's invalid.  That would have
us generate multiple "?" for multibyte character sets like UTF-8.  We could
suppress multiple invalid bytes in a row so there's just one "?", but
that seems kinda lousy to me.
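
A rough sketch of what the fully portable option might look like, with the "collapse runs of bad bytes" variant as a flag.  This is not what nmh does today; convert_subst() and its parameters are names I made up, and it treats EINVAL (truncated sequence at the end of input) the same as EILSEQ.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper.  Converts `inlen` bytes of `in` into `out` using
 * the already-open descriptor `cd`, writing '?' for every input byte
 * iconv() rejects.  If `collapse` is nonzero, a run of consecutive
 * rejected bytes produces a single '?'.  The result is NUL-terminated.
 * Returns the number of bytes written (not counting the NUL), or
 * (size_t) -1 if `out` is too small. */
static size_t
convert_subst(iconv_t cd, const char *in, size_t inlen,
              char *out, size_t outsize, int collapse)
{
    char *inp = (char *) in;    /* iconv() wants a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft;
    int in_bad_run = 0;

    assert(outsize > 0);
    outleft = outsize - 1;      /* keep one byte for the NUL */

    iconv(cd, NULL, NULL, NULL, NULL);  /* reset to the initial state */

    while (inleft > 0) {
        char *before = inp;
        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        int err = (r == (size_t) -1) ? errno : 0;

        if (inp != before)
            in_bad_run = 0;     /* something converted; the run ended */
        if (err == 0)
            break;              /* everything was consumed */
        if (err == E2BIG)
            return (size_t) -1;

        /* EILSEQ or EINVAL: we can't tell how long the bad sequence
         * is, so substitute and skip exactly one byte. */
        if (!(collapse && in_bad_run)) {
            if (outleft == 0)
                return (size_t) -1;
            *outp++ = '?';
            outleft--;
        }
        in_bad_run = 1;
        inp++;
        inleft--;
    }

    *outp = '\0';
    return (size_t) (outp - out);
}
```

With a UTF-8 to ASCII descriptor, the input "a\xff\xfe" "b" comes out as "a??b" without collapsing and "a?b" with it.  The downside of collapsing is that two separate bad characters in a row also shrink to one "?".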

GNU libiconv (which it seems a fair number of people use) has
an iconvctl() function, and it has an undocumented option that lets
you supply your own substitution function for invalid bytes/codepoints.
That function isn't part of POSIX.  The fact that it's undocumented and
nonstandard makes me think we shouldn't use it.

Unless we have a LOT of multibyte character sets to deal with, perhaps
the special case here for UTF-8 is the best alternative?  Any other thoughts
on this matter?
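
For reference, the kind of UTF-8 special case I have in mind can be sketched like this.  It is not the existing "fromutf8" block, just an illustration; utf8_skip_len() is a hypothetical name.  The idea is that for UTF-8 (and only UTF-8) the lead byte tells us how long the character claims to be, so we can skip the whole thing and emit one "?".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper.  `p` points at the byte where iconv() stopped
 * and `left` is the number of input bytes remaining.  Returns how many
 * bytes to skip to reach the start of the next UTF-8 character: the
 * byte at `p` plus any continuation bytes (10xxxxxx) that follow it,
 * capped at the length announced by the lead byte. */
static size_t
utf8_skip_len(const char *p, size_t left)
{
    unsigned char lead;
    size_t want, n;

    assert(p != NULL);
    if (left == 0)
        return 0;

    lead = (unsigned char) p[0];
    if ((lead & 0xe0) == 0xc0)
        want = 2;               /* 110xxxxx */
    else if ((lead & 0xf0) == 0xe0)
        want = 3;               /* 1110xxxx */
    else if ((lead & 0xf8) == 0xf0)
        want = 4;               /* 11110xxx */
    else
        want = 1;               /* ASCII, stray continuation, or junk */

    /* Stop early if the input ends or a non-continuation byte shows up. */
    for (n = 1; n < want && n < left; n++) {
        if (((unsigned char) p[n] & 0xc0) != 0x80)
            break;
    }
    return n;
}
```

So a valid two-byte character that the target charset lacks gets skipped as a unit (2 bytes, one "?"), while a lone 0xff is skipped as a single byte.  Nothing like this is possible for an arbitrary multibyte charset without per-charset knowledge.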

--Ken
