The root of all this is that iconv makes us skip past the invalid
character ourselves.  Looking at it now, I wonder if
we can do better than the current special handling for UTF-8?
It's the "fromutf8" block below:
[...]
Hm. I played around with this a bit, and I'm not sure what to do.
iconv() doesn't distinguish between "We can't convert this character to
the target character set" and "This multibyte sequence is invalid"; they
both get EILSEQ. Even worse, we can't (portably) tell where the end of
a multibyte sequence is.
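To make the ambiguity concrete, here's a tiny standalone sketch (made-up
strings, not nmh code): a perfectly valid UTF-8 character that just has no
ASCII equivalent and an outright invalid byte both come back from iconv()
as EILSEQ, so the caller can't tell which case it hit.

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

static void try_convert(iconv_t cd, char *in, size_t inlen)
{
    char out[64], *inp = in, *outp = out;
    size_t outleft = sizeof out;

    iconv(cd, NULL, NULL, NULL, NULL);          /* reset shift state */
    if (iconv(cd, &inp, &inlen, &outp, &outleft) == (size_t) -1 &&
        errno == EILSEQ)
        printf("EILSEQ after %zu byte(s)\n", (size_t) (inp - in));
}

int main(void)
{
    iconv_t cd = iconv_open("ASCII", "UTF-8");
    if (cd == (iconv_t) -1)
        return 1;

    char valid[]  = "caf\xc3\xa9";   /* valid UTF-8, but no ASCII mapping */
    char broken[] = "caf\xff";       /* 0xFF is never valid in UTF-8 */

    try_convert(cd, valid, strlen(valid));   /* EILSEQ: unconvertible */
    try_convert(cd, broken, strlen(broken)); /* EILSEQ: invalid sequence */

    iconv_close(cd);
    return 0;
}
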
So, I see a couple of options. We could go completely portable and put
in a "?" (or whatever) for every byte that's invalid. That would have
us generate multiple "?" for multibyte character sets like UTF8. We could
suppress multiple invalid bytes in a row so there's just one "?", but
that seems kinda lousy to me.
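For the record, the "completely portable" option I'm picturing looks
roughly like this (hypothetical helper, error handling trimmed): on
EILSEQ we emit one '?' and advance the input by a single byte, since one
byte is the only step size we can count on everywhere.  That's why a
single bad multibyte character can turn into several '?'s.

#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert inbuf -> outbuf, substituting '?' for bytes iconv rejects.
 * Returns the number of output bytes written, or (size_t) -1 if the
 * output buffer fills up. */
static size_t
convert_lossy(iconv_t cd, char *inbuf, size_t inleft,
              char *outbuf, size_t outleft)
{
    char *outp = outbuf;

    while (inleft > 0) {
        if (iconv(cd, &inbuf, &inleft, &outp, &outleft) != (size_t) -1)
            break;                  /* all remaining input converted */
        if (errno == E2BIG || outleft == 0)
            return (size_t) -1;     /* output buffer too small */
        /* EILSEQ (or EINVAL at end of input): substitute and move on */
        *outp++ = '?';
        outleft--;
        inbuf++;                    /* skip exactly one input byte */
        inleft--;
    }
    return (size_t) (outp - outbuf);
}
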
GNU libiconv (which it seems a fair number of people use) has an
iconvctl() function, and an undocumented part of it lets you install
your own substitution function for invalid bytes/codepoints.
That function isn't part of POSIX. The fact that it's undocumented and
nonstandard makes me think we shouldn't use it.
Unless we have a LOT of multibyte character sets to deal with, perhaps
the special case here for UTF-8 is the best alternative? Any other thoughts
on this matter?
--Ken