nmh-workers

Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-12 10:08:40
On Tue, Aug 11, 2015 at 11:30 PM, Ken Hornstein wrote:

I confess that I am surprised the "UTF-8 or die" crowd has been so unanimous
so far.  No one dissents from this view?  Like I said, it simplifies a WHOLE
bunch of code (at the cost of adding a new library dependency), so I would
actually be fine with it.

Since I will likely not be doing any of the actual coding, I have no
real skin in the game, however...

I think it is questionable design-wise to take the "UTF-8 or die"
approach, especially when operations that are unavoidable anyway would
also support a more general-purpose design.

It appears the basic processing model is a pipeline:

  Raw -> [Encoder] -> UTF8 -> [Processor] -> UTF8 -> [Encoder] -> Output


An encoder is needed to deal with whatever character encodings may be
present in the original, raw data.  This is unavoidable if nmh is going
to properly support the various mail standards and the bulk of mail that
still goes out today in non-UTF-8 encodings.

The [Encoder] normalizes all character data into UTF8.  Nmh proper, the
[Processor], then does whatever it needs to do (like parsing addresses).
The immediate result of that is UTF8, which is then piped into the
[Encoder] to generate the final output based on locale settings.

The final [Encoder] may be a no-op if the output is to be in UTF8, but
if not (either due to environment locale setting or explicit
configuration setting), [Encoder] does its thing.

Since the need for an Encoder is unavoidable from the raw input reading
side, might as well reuse it on the output side, allowing nmh to be
friendly to any locale the end-user is using.
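
To make the pipeline concrete, here is a rough sketch in Python (not nmh
code, and not a proposal for the actual implementation).  All function
names are made up for illustration, and the [Processor] is just a
stand-in that trims whitespace:

```python
import codecs


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Input-side [Encoder]: normalize raw bytes in `charset` to UTF-8."""
    return raw.decode(charset).encode("utf-8")


def process(utf8: bytes) -> bytes:
    """Hypothetical [Processor]: sees UTF-8 only (here it just trims)."""
    return utf8.decode("utf-8").strip().encode("utf-8")


def from_utf8(utf8: bytes, charset: str) -> bytes:
    """Output-side [Encoder]: a no-op when the target is already UTF-8."""
    if codecs.lookup(charset).name == "utf-8":
        return utf8
    # Characters the target charset cannot represent are replaced.
    return utf8.decode("utf-8").encode(charset, errors="replace")


def pipeline(raw: bytes, in_charset: str, out_charset: str) -> bytes:
    """Raw -> [Encoder] -> UTF8 -> [Processor] -> UTF8 -> [Encoder] -> Output"""
    return from_utf8(process(to_utf8(raw, in_charset)), out_charset)
```

The point is only that the same conversion step sits on both ends, and
that the middle never has to care what the user's locale is.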

For maximum flexibility, the [Encoder] could be pluggable, i.e.,
provide a config option that allows one to register an external program
to do the encoding, where the data is provided via stdin and nmh gets
the results from stdout.  Such flexibility would allow folks to
evaluate/use other encoders and likely handle character data not
supported in Unicode (I know, a rare case, but it is theoretically
possible--Klingon is still not officially part of Unicode ;).
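
A minimal sketch of the pluggable idea, again in Python and again with
invented names (nmh has no such option today as far as I know): the
registered command is run as a filter, fed the data on stdin, and its
stdout is taken as the converted result.

```python
import subprocess


def run_external_encoder(command, data: bytes, timeout: float = 10.0) -> bytes:
    """Pipe `data` through an external encoder program and return its stdout.

    `command` is an argument list that would come from a hypothetical
    config entry, e.g. ["iconv", "-f", "ISO-8859-1", "-t", "UTF-8"].
    """
    result = subprocess.run(
        command,
        input=data,              # handed to the program on stdin
        stdout=subprocess.PIPE,  # converted bytes come back on stdout
        check=True,              # raise CalledProcessError on non-zero exit
        timeout=timeout,         # do not hang on a misbehaving encoder
    )
    return result.stdout
```

A real implementation would also have to decide what to do when the
external program fails or times out, e.g. fall back to a built-in
converter.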

--ewh

_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
