
Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 11:24:25
> I am in no way an expert on this.  But, I won't let that stop me.

Welcome to the club!  I think we're all in the same boat in that
regard.

> It seems to me that the only solution is to use Unicode internally.
> Disgusting as it seems to those of us who are old enough to hoard
> bytes, we might want to consider using something other than UTF-8
> for the internal representation.  Using UTF-16 wouldn't be horrible
> but I recall that the Unicode folks made a botch of things so that
> one really needs 24 bits now, which really means using 32 internally.

AFAICT ... there is probably no advantage in using UTF-16 or UTF-32
versus UTF-8.

People might think that you gain something because with UTF-16 two
bytes == 1 character.  But that's only true for things in the Basic
Multilingual Plane, and people are now telling us 🖕 because they want
to send emoji in email, which are NOT part of the BMP, which means we
have to start dealing with 💩 like surrogate pairs.  And really, even
with just the BMP, combining characters toss that idea out of the window.
UTF-32 lets you say 4 bytes == 1 character ... but do we care about
'characters' or 'column positions'?
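
To make that concrete, here's a minimal sketch (not nmh code) that
prints the byte count of a non-BMP emoji and the column width of a
BMP combining sequence.  It assumes a UTF-8 locale is installed under
the name used below and that wcswidth(3) is available (it's POSIX XSI):

#define _XOPEN_SOURCE 700       /* for wcswidth(3) */
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int
main(void)
{
    /* Assumes a UTF-8 locale is installed under this name. */
    setlocale(LC_ALL, "en_US.UTF-8");

    /* U+1F4A9 PILE OF POO, outside the BMP */
    const char *poo = "\xF0\x9F\x92\xA9";

    /* 'e' + U+0301 COMBINING ACUTE ACCENT: two code points, one column */
    const wchar_t eacute[] = { L'e', 0x0301, L'\0' };

    printf("U+1F4A9: %zu bytes in UTF-8, 2 UTF-16 code units "
           "(a surrogate pair), 1 UTF-32 code unit\n", strlen(poo));
    printf("e + U+0301: 2 code points, %d column(s) per wcswidth()\n",
           wcswidth(eacute, 2));
    return 0;
}

So no matter which encoding you pick, counting code units tells you
neither "characters" nor columns; you end up calling something like
wcswidth() anyway.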

So given that, I think sticking with UTF-8 is preferable; it has the
nice property that we can represent text as C strings and it's just
ASCII if you're living in a 7-bit world.
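
For what it's worth, the "C strings" property is the big one: no byte
of a UTF-8 multibyte sequence falls in the ASCII range, so code that
scans for ASCII delimiters keeps working unchanged.  A tiny sketch
(the header string below is made up, not real nmh data):

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* "Jörg Müller <jm@example.com>" encoded as UTF-8 */
    const char *hdr = "J\xC3\xB6rg M\xC3\xBCller <jm@example.com>";

    /* strchr() finds the ASCII '<' without tripping over the
     * multibyte "ö"/"ü" sequences; strlen() still counts bytes. */
    const char *addr = strchr(hdr, '<');

    printf("bytes: %zu, address part: %s\n", strlen(hdr), addr);
    return 0;
}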

> On the output side, we just have to do the best we can if characters in
> the input locale can't be represented in the output locale.  This is
> independent of the internal representation.

Well, this works great if your locale is UTF-8.  But ... what happens
if your email address contains UTF-8, and your locale setting is
ISO-8859-1?
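
To frame the discussion, here's a rough sketch of what "do the best we
can" might look like using iconv(3) to convert UTF-8 output to the
user's charset, substituting '?' for anything unrepresentable.  The
"//TRANSLIT" suffix is a glibc/libiconv extension, the input string is
invented, and the fallback policy is just one option, not what nmh
does today:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* "René " followed by U+1F44B WAVING HAND, in UTF-8 */
    char in[] = "Ren\xC3\xA9 \xF0\x9F\x91\x8B";
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("ISO-8859-1//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
            if (errno == EILSEQ) {
                /* No mapping: emit '?' and skip the whole UTF-8 sequence. */
                unsigned char lead = (unsigned char) *inp;
                size_t skip = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 :
                              lead >= 0xC0 ? 2 : 1;
                if (skip > inleft)
                    skip = inleft;
                *outp++ = '?';
                outleft--;
                inp += skip;
                inleft -= skip;
            } else {
                break;  /* E2BIG or truncated input; give up */
            }
        }
    }
    *outp = '\0';
    iconv_close(cd);
    printf("%s\n", out);    /* "René ?" in ISO-8859-1 */
    return 0;
}

The é survives; the emoji becomes '?'.  That's the "best we can" in an
ISO-8859-1 locale, and it's the same problem whatever we use internally.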

--Ken

