
[Nmh-workers] nmh architecture discussion: format engine character set

2015-08-09 20:48:43
Greetings all,

To kick off the EAI discussion, let's start with an nmh architecture discussion.
Specifically: how should i18n characters be represented in the format engine?

I decided to start with the format engine because it's used for a lot of
things inside of nmh, and deciding what to do with it really makes other
decisions clearer.  And also it gets at the fundamental question of how
we want to deal with i18n characters.

To state the problem more explicitly: right now stuff inside the format
engine is assumed to be a mix of ASCII and things in the local character
set.  We don't really have a way to tag stuff in format strings as being
a particular character set.  Either we assume the stuff is ASCII, or
we magically convert it into the native character set (%(decode) does
this, for example, and when we retrieve MIME parameters they get
magically converted into the native character set).
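
For concreteness, consider a header that arrives as pure ASCII with an
RFC 2047 encoded word in it:

    Subject: =?ISO-8859-1?Q?caf=E9?=

%(decode) turns that into the four characters "café", re-encoded into
whatever the native character set happens to be.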

So, in this mostly unspecified space, things kinda mostly sort of work.
But now with the existence of message/global, things get a bit more
complicated.  Specifically, before we could assume pretty much everything
in the format engine was ASCII, but now if we get 8-bit characters in
the format engine they might be in the native character set (output
from %(decode)) or UTF-8.  So what should we do?
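
The ambiguity is real: the same bytes are valid in more than one
character set, so an untagged string cannot be interpreted reliably.
A minimal, self-contained demo (illustrative only, not nmh code; it
assumes a UTF-8 terminal for the printed output):

    #include <stdio.h>
    #include <iconv.h>

    /* Decode the same two bytes as ISO-8859-1 and as UTF-8; both
     * readings are legal, and they mean different things. */
    static void show(const char *from, const char *buf, size_t len)
    {
        iconv_t cd = iconv_open("UTF-8", from);
        char out[32], *inp = (char *) buf, *outp = out;
        size_t inleft = len, outleft = sizeof out;

        if (cd == (iconv_t) -1) { perror("iconv_open"); return; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
            perror(from);
        else
            printf("%-10s -> \"%.*s\"\n", from, (int) (outp - out), out);
        iconv_close(cd);
    }

    int main(void)
    {
        const char bytes[] = { (char) 0xC3, (char) 0xA9 };

        show("ISO-8859-1", bytes, 2);   /* two characters: "Ã©" */
        show("UTF-8", bytes, 2);        /* one character:  "é"  */
        return 0;
    }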

One possible option: convert everything to the local character set when
text is input to the format engine.  This would basically continue existing
practice: strings output from the format engine could be directly output
to the user without any additional effort, as we do now.  This is relatively
simple to implement, as we're mostly doing this now.

The downside here is that if a message comes in with unencoded UTF-8 in
the headers (it's clear this is where the world is headed) and the user
is NOT using a UTF-8 locale, then you have to convert UTF-8 to something
else and potentially lose characters that the target character set
cannot represent.
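
Here is a minimal sketch of what that input-side conversion could look
like, using POSIX iconv(3).  The function name and substitution policy
are mine, not nmh's; the caller is assumed to have run
setlocale(LC_CTYPE, "") so that nl_langinfo(CODESET) names the locale's
character set:

    #include <errno.h>
    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Convert text into the native character set as it enters the
     * format engine, emitting '?' for anything the target set cannot
     * represent.  Returns the output length, or (size_t) -1 on error. */
    static size_t to_native(const char *from, const char *in,
                            size_t inlen, char *out, size_t outlen)
    {
        iconv_t cd = iconv_open(nl_langinfo(CODESET), from);
        char *inp = (char *) in, *outp = out;

        if (cd == (iconv_t) -1)
            return (size_t) -1;
        while (inlen > 0) {
            if (iconv(cd, &inp, &inlen, &outp, &outlen) != (size_t) -1)
                break;
            if (errno == EILSEQ && outlen > 0) {
                /* Unconvertible character: skip one input byte and
                 * emit a placeholder.  (Crude -- a real version would
                 * skip the whole multibyte sequence.)  This is exactly
                 * where information is lost. */
                inp++; inlen--;
                *outp++ = '?'; outlen--;
            } else {
                iconv_close(cd);
                return (size_t) -1;
            }
        }
        iconv_close(cd);
        return (size_t) (outp - out);
    }

    int main(void)
    {
        char out[64];
        size_t n;

        setlocale(LC_CTYPE, "");
        n = to_native("UTF-8", "caf\xc3\xa9", 5, out, sizeof out);
        if (n != (size_t) -1)
            printf("%.*s\n", (int) n, out);
        return 0;
    }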

Another option is to simply convert everything to UTF-8 as it gets read
into the format engine.  I am assuming at this point that Unicode is
a superset of all other character sets; if that is true, then no
information is lost when converting incoming text, no matter the source
character set.
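
(In iconv terms this is the mirror image of the to_native() sketch
above: the same loop with iconv_open("UTF-8", from), and because the
target is Unicode, the EILSEQ substitution path should never fire for
well-formed input.)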

However, while this SEEMS like it would be easier, it actually complicates
the code quite a bit.  The format engine would have to change its API;
since right now the text is in the native character set, we know how
many column positions we've consumed and we can stop when we reach the limit.
But if the format engine has UTF-8 internally we wouldn't know that the
output has reached the character limit (and we can't process this after
the fact, since we wouldn't know which characters don't count against the
character limit from things like zputlit).  We could change the mh-format
API to indicate if the resulting text is supposed to be for display and
convert it to the native character set, but then that makes me wonder
what the value of the UTF-8 conversion is in the first place.  Also,
this might result in the possibly undesirable state of someone with an
ISO-8859-1 locale sending out headers encoded in UTF-8 (although maybe
that's not so bad?  I am undecided here).  It would require some careful
work to get it right.
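
To make the column-counting problem concrete, here is a sketch of the
operation the engine needs when it enforces a width limit: walking a
multibyte string in the current locale and summing display columns.
This only works because mbrtowc(3) interprets the bytes in the
*locale's* character set; if the engine held UTF-8 internally while the
locale were ISO-8859-1, this walk would misparse the bytes, which is
the API problem described above.  (Illustrative code, not nmh's.)

    #define _XOPEN_SOURCE 700   /* for wcwidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Count the display columns consumed by a multibyte string in the
     * current locale.  Returns -1 on an invalid sequence. */
    static int display_columns(const char *s)
    {
        mbstate_t st;
        wchar_t wc;
        size_t n = strlen(s), len;
        int cols = 0, w;

        memset(&st, 0, sizeof st);
        while (n > 0) {
            len = mbrtowc(&wc, s, n, &st);
            if (len == (size_t) -1 || len == (size_t) -2)
                return -1;          /* invalid or incomplete sequence */
            if (len == 0)
                break;
            w = wcwidth(wc);
            cols += (w > 0) ? w : 0;  /* nonspacing chars count as 0 */
            s += len; n -= len;
        }
        return cols;
    }

    int main(void)
    {
        setlocale(LC_CTYPE, "");    /* adopt the user's locale */
        /* 4 columns in a UTF-8 locale ("café"); 2 in an ISO-8859-1
         * locale, where the same bytes parse as two characters. */
        printf("%d\n", display_columns("caf\xc3\xa9"));
        return 0;
    }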

Thinking more about it ... hell, I don't know which one is right.  I'm
open to suggestions here.  If you have a better idea, please share it!

One final note: Lyndon has suggested that the stdio libraries that are part
of Plan 9 might help; I did look at them before, and I do not believe that
they will.  Specifically, they assume all output is in UTF-8 (because
that's how Plan 9 works), but that's not a valid assumption for us.

--Ken
