
Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 10:56:21
> Ken Hornstein wrote:
>> Even if it can, I am unsure we can maintain
>> the correct column position when dealing with things like combining
>> characters.
>
> That is possible.  wcwidth() returns 0 for combining characters.

As I learned the hard way, that is NOT necessarily true.  The problem
there was in older versions of Mac OS X; newer versions have fixed it,
so that's a problem which is going away.
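
(For the curious, here's a minimal sketch of the check in question.  The
combining acute accent, U+0301, should report zero width on a conforming
system; the broken systems I ran into reported otherwise.)

    /* Sketch: probe wcwidth() for a combining character. */
    #define _XOPEN_SOURCE 700   /* for wcwidth() on some systems */
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int
    main(void)
    {
        setlocale(LC_ALL, "");  /* needs a UTF-8 locale to be meaningful */

        /* U+0301 COMBINING ACUTE ACCENT: should be width 0. */
        printf("wcwidth(U+0301) = %d\n", wcwidth((wchar_t) 0x0301));
        /* A plain 'e' is width 1; "e" + U+0301 occupies one column. */
        printf("wcwidth('e')    = %d\n", wcwidth(L'e'));
        return 0;
    }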

But ... let's put that quote into context.  I was speaking of the case
where internal data representation is UTF-8, but the user has a non-UTF-8
locale (let's say ISO-8859-1).  You can't use wcwidth() in this context
because it wants to work on the current locale.  Okay, that's technically
not true; it works on whatever you give to setlocale().  But as I explained
before, it's not practically possible to pick a UTF-8 locale if you're not
already in one because a) the locale names are not standardized, and b)
you don't know if a particular locale supports UTF-8 or not.
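
(To make that concrete, here's a sketch of the guessing game you'd be
reduced to.  The candidate names are just examples; nothing guarantees
any of them exist on a given system, which is exactly the problem.)

    /* Sketch: try to find *some* UTF-8 locale by guessing names. */
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    static const char *candidates[] = {
        "C.UTF-8", "en_US.UTF-8", "en_US.utf8", NULL
    };

    int
    main(void)
    {
        const char **p;

        for (p = candidates; *p != NULL; p++) {
            /* Even the CODESET spelling ("UTF-8" vs "utf8")
             * varies between systems. */
            if (setlocale(LC_CTYPE, *p) != NULL &&
                strcmp(nl_langinfo(CODESET), "UTF-8") == 0) {
                printf("got lucky with %s\n", *p);
                return 0;
            }
        }
        printf("no UTF-8 locale found by guessing\n");
        return 1;
    }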

Also, it's even more confusing than that.  Let's assume that we could
use xlocale or change the whole process locale to a UTF-8 locale.  We
then calculate the complete character width.  But what happens when we
convert that with iconv() to the native character set?  If we do the
usual substitution of a '?' for invalid characters, then what happens
when we run into a combining character?  Does the combining character
end up as a '?'?  If so, that messes up the length calculation.
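
(A sketch of that failure mode, with error handling trimmed; the
substitution policy is the hypothetical part.  On glibc, at least,
iconv() stops with EILSEQ on a character the target set can't hold:)

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* "e" + U+0301 COMBINING ACUTE ACCENT: three bytes of UTF-8,
         * but it renders in a single column. */
        char in[] = "e\xcc\x81", out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;
        iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");

        while (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
               && inleft > 0) {
            /* U+0301 has no ISO-8859-1 equivalent, so substitute the
             * usual '?' and skip its two bytes.  The output becomes
             * "e?": two columns where the source rendered as one, so
             * any width computed before conversion is now wrong. */
            *outp++ = '?';
            outleft--;
            inp += 2;
            inleft -= 2;
        }
        iconv_close(cd);
        *outp = '\0';
        printf("converted: \"%s\"\n", out);
        return 0;
    }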

> Do we have any specific cases where forcing a UTF-8 assumption actually
> helps? The POSIX API is clumsy but the fact that it deals in the current
> locale rather than UTF-8 doesn't make much difference. The code needs an
> API to know stuff like how wide a string is. Knowing you have a UTF-8
> encoding doesn't really gain you anything.

Well, I think it helps in two cases:

1) If you have a UTF-8 locale.
2) If you don't have a UTF-8 locale, but you still want to output UTF-8.

For 1) it helps a little.  For 2) it helps more.  But ... I think people
in case 2) are wrong to have their systems set up that way.  I mean,
seriously ... you're telling your operating system that you only support
ASCII, but you want us to output UTF-8 anyway?  How does that even
make any sense?

> I think it'd be better to focus on real features. So if you want, for
> example, character substitution on conversion failure and libunistring
> helps then configure can check for it and disable the feature if it
> isn't found. As an aside, that particular feature only sounds useful if
> you're actually using a non-UTF-8 locale.

Well ... I am reluctant to make that optional.

At this point character conversion really isn't an optional feature in a
MUA.  I know, some people are foolishly disabling iconv support in nmh
(partially because of their lousy settings, partially because of our
bugs).  But really, you're expected to be able to handle different
character sets at this point; you need to be able to convert between them.

iconv is a POSIX API; it's not perfect by any means, but at least it
works and is widely supported.  Supporting two or three codepaths
(one without any character conversion, one with iconv, one with
libunistring) seems like a bad idea to me, especially when it's part
of core functionality; I'm fine with optional things like TLS and
SASL support (well, maybe those things aren't so optional anymore in
practice), but everyone needs character conversion nowadays.  I'd
rather pick one option and stick with it.  If it's iconv, great.  If
it's libicu/libunistring, great.

Now as for the idea of focusing on features: yes, completely agree
that's important!  But the decisions we make now in terms of internals
really do matter on how we implement those features.  I don't really
see broad disagreement on the features: it's just more along the lines
of 'how do we get there?'.

> Given that nmh is BSD licenced, I'd probably favour libicu over
> libunistring just for its licence. Checking on a Debian system, neither
> have vast numbers of reverse dependencies.

libicu/libunistring are great if you need to manipulate UTF-8 strings.
My issue is: I am not clear that's necessary for us.

So, what was the point of all this?  I guess for once rather than fumbling
around and glomming on some MIME support later, it would make sense to
sit down, figure out how we want nmh to work, and then make that happen.
Right now I have a SLIGHT lean toward having the format engine represent
stuff in the native character set.  But this isn't perfect, and let me
give you an example why.

Let's say someone sends you an email that contains UTF-8 in their real
name field, with a character that is only in Unicode.  This email is NOT
encoded using RFC-2047, but is simple bare UTF-8 (which is now permitted
by the new internationalized email RFCs, e.g. RFC 6532).  If your locale
is ISO-8859-1, or even worse, ASCII (seriously, WHAT THE HELL PEOPLE?!??
It's 2015!!!), then converting the name to the local character set means
you lose characters in their name ... and that seems terrible to me.  We
could do something like convert to RFC-2047 encoding in that case.  But
what if the email address itself contains UTF-8?  We can't do RFC-2047
encoding in that case.  Hm, I think I just talked myself into a slight
lean toward having the format engine be UTF-8 internally.
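
(To illustrate, with a made-up address: the first From: header below is
the RFC-2047 form, which only covers the display name; the second is the
bare UTF-8 form, where the address itself can't be downconverted:)

    From: =?UTF-8?Q?J=C3=B6rg?= <joerg@example.com>
    From: Jörg <jörg@example.com>

The first can always be rendered (or left encoded) in a legacy locale;
for the second there is simply no RFC-2047 escape hatch for the
addr-spec.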

Sigh, the bottom line is that there are no good answers.  It would be
helpful if people might suggest what they expect/want to happen if
a user received a message/global email and they're using a non-UTF-8
locale.  "Shit breaks" is an acceptable answer :-)

--Ken

_______________________________________________
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
