Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 11:43:26
On 8/11/2015 11:08 AM, Jon Steinhart wrote:

> It seems to me that the only solution is to use Unicode internally.
> Disgusting as it seems to those of us who are old enough to hoard
> bytes, we might want to consider using something other than UTF-8
> for the internal representation.  Using UTF-16 wouldn't be horrible
> but I recall that the Unicode folks made a botch of things so that
> one really needs 24 bits now, which really means using 32 internally.

UTF-8 should be sufficient, and it does not suffer from byte-ordering
and byte order mark problems that other encodings suffer from.
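
As a rough illustration of the byte-ordering point (sketched in Python
purely for brevity; nothing here is nmh code), the same character has one
UTF-8 byte sequence but two possible UTF-16 sequences, which is why UTF-16
data needs either a byte order mark or an out-of-band convention:

```python
import codecs

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point

utf8 = text.encode("utf-8")          # same bytes on every platform
utf16_le = text.encode("utf-16-le")  # little-endian code units
utf16_be = text.encode("utf-16-be")  # big-endian code units

print(utf8.hex())      # c3a9
print(utf16_le.hex())  # e900
print(utf16_be.hex())  # 00e9

# A byte order mark tells the reader which UTF-16 variant follows.
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())  # fffe feff

# Pure US-ASCII text is already valid UTF-8, byte for byte.
ascii_compatible = "Subject: hello".encode("utf-8") == b"Subject: hello"
print(ascii_compatible)  # True
```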

Some general comments about this discussion:

- Messages should be stored in their original form, i.e., character
  encoding transformation should only be done for display/access
  purposes.

  Main reasons:
    - Protects against bugs in encoding transformation code.  The
      original message is always left untouched.
    - Maintains compatibility with folks that use external tools and
      scripts to access nmh messages.
    - Preserves digital signatures, which any modification of the
      message would invalidate.

  If there is a desire to support transformation of incoming mail, so
  mail is stored in "normalized form", then it should be a configurable
  option.  I personally would not want all my stored email converted to
  UTF-8 due to reasons cited above.


- I think using a character encoding library is unavoidable.  Is iconv()
  sufficient?  If UTF-8 is to be used as the normalized encoding
  format, a library is needed that can transform the various encodings
  into it, and likely from it.  Maybe it is not as big an issue as it
  was in the past, but not everyone was sold on Unicode.  In my
  mail-related project, I had users that preferred their local character
  encoding formats over anything Unicode related.

  Character encoding choices can get quite political.

  If a library is adopted, then users have full control of what encoding
  they prefer.

  Side Note: My project is in Perl, so I can leverage Perl's Encode
  module for character encoding conversion.  However, I even had code
  that dealt with character encoding conversion pre-Encode days.


- As for parsing message headers, make the default character encoding
  a configurable option.  UTF-8 could be the default (which fortunately
  is US-ASCII compatible).

  Real-world note: I have encountered emails that actually use a
  non-ASCII default encoding for message header data, typically
  messages from non-English locales.  Technically, these messages do not
  conform to the RFCs, but such messages actually exist.  Hence, in my
  project, I support an option that specifies what the default encoding
  is.


- I think it is perfectly reasonable to leverage the current locale
  setting to determine defaults, but one should be able to explicitly
  override such defaults via .mh_profile and command-line options.
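
The points above can be sketched together as follows.  This is only an
illustration in Python, not a proposal for nmh's actual C code: the
function names resolve_charset() and decode_for_display() are made up, and
how nmh would read the values from .mh_profile or its command-line
switches is not shown, so they appear here as plain arguments.  The stored
bytes are never rewritten; decoding happens only on the way to the display.

```python
import locale

def resolve_charset(cli_value=None, profile_value=None):
    """Pick the default charset: command line, then profile, then locale."""
    if cli_value:
        return cli_value
    if profile_value:
        return profile_value
    # Fall back to whatever the current locale reports.
    return locale.getpreferredencoding(False) or "utf-8"

def decode_for_display(raw, default_charset):
    """Return display text for raw bytes; the raw bytes are not modified."""
    try:
        return raw.decode(default_charset)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset: show U+FFFD for undecodable bytes
        # instead of refusing to display the message.
        return raw.decode("utf-8", errors="replace")

# A non-conformant header carrying raw ISO-8859-1 bytes, as stored on disk.
stored = "Subject: Gr\u00fc\u00dfe".encode("iso-8859-1")

charset = resolve_charset(cli_value=None, profile_value="iso-8859-1")
shown = decode_for_display(stored, charset)
print(shown)

# With the wrong default, the same bytes still display, with replacements.
print(decode_for_display(stored, "utf-8"))
```

Keeping the override order in one small function makes it easy to see, and
to change, which source wins when the locale and the user disagree.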


> On the output side, we just have to do the best we can if characters in
> the input locale can't be represented in the output locale.  This is
> independent of the internal representation.

A '?' is commonly used when one cannot map a character in one encoding
to another, probably because '?' is representable in nearly all
encodings.  Some systems will use a special glyph, but I think that is
not possible for nmh to do.

Warning message(s) should be generated when character data is lost due
to conversion.
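
A minimal sketch of that behavior, again in Python and with a made-up
function name (convert_for_output); an iconv()-based version in C would
look different, and this is not meant to describe any existing nmh code.
Unmappable characters become '?', and a warning is written to stderr
whenever anything was lost:

```python
import sys

def convert_for_output(text, out_charset):
    """Encode text for out_charset, substituting '?' where needed.

    Returns (encoded bytes, number of characters that were lost).
    """
    # Count the characters the output charset cannot represent.
    lost = 0
    for ch in text:
        try:
            ch.encode(out_charset)
        except UnicodeEncodeError:
            lost += 1

    # For encoding, Python's "replace" handler substitutes '?'.
    encoded = text.encode(out_charset, errors="replace")

    if lost:
        print(
            f"warning: {lost} character(s) could not be represented "
            f"in {out_charset}",
            file=sys.stderr,
        )
    return encoded, lost

print(convert_for_output("caf\u00e9 \u2603", "ascii"))       # (b'caf? ?', 2)
print(convert_for_output("caf\u00e9 \u2603", "iso-8859-1"))  # (b'caf\xe9 ?', 1)
print(convert_for_output("caf\u00e9 \u2603", "utf-8")[1])    # 0
```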

--ewh
