nmh-workers

Re: [Nmh-workers] nmh architecture discussion: format engine character set

2015-08-11 13:03:21
- Messages should be stored in their original form, i.e., the
  character encoding transformation should only be done for
  display/access purposes.

Completely, 100% agree here.

- I think using a character encoding library is unavoidable.  Is iconv()
  sufficient?  If UTF-8 is to be used as the normalized encoding
  format, a library is needed that can transform the various encodings
  into it, and likely from it.  Maybe it is not as big an issue as it
  was in the past, but not everyone was sold on Unicode.  In my
  mail-related project, I had users that preferred their local character
  encoding formats over anything Unicode related.

Weeeeel .... not exactly.  It's not just a transformation issue; if it
was, iconv() would be fine.
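
For the pure transformation part, here is a minimal sketch of what an
iconv()-based conversion to UTF-8 could look like.  The helper name
to_utf8() and the fixed-size output buffer are assumptions made for
illustration; this is not code from nmh.  On some systems iconv lives in
a separate library and needs -liconv at link time.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert inlen bytes at `in` from the charset named
 * by `from` into UTF-8.  `out` has room for outsize bytes including the
 * terminating NUL.  Returns the number of bytes written (excluding the
 * NUL), or (size_t)-1 if the conversion could not be completed.
 */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() takes char **, but does not modify the input */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;      /* reserve space for the NUL */

    cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* unknown or unsupported charset name */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;      /* invalid input, or output buffer too small */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

A call such as to_utf8("ISO-8859-1", "caf\xE9", 4, buf, sizeof buf)
would be expected to leave the five-byte UTF-8 form of the word in buf,
assuming the platform's iconv knows the ISO-8859-1 charset name.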

The issue in the format engine is that we need to know character
properties, e.g. is ' ' a space? (The format engine does space
compression.) If the strings are UTF-8, we can't use isspace() on them.
We can't even use iswspace(), because that requires the locale to be set
to a UTF-8 locale.  So we need a library that can process UTF-8,
regardless of the locale setting.
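
To make the requirement concrete, here is a rough sketch of space
compression over UTF-8 that never consults the locale.  All names are
hypothetical, and the whitespace table is the Unicode White_Space set as
I recall it, so it should be checked against the Unicode Character
Database before anyone relies on it.  A real library would cover far
more than this one property.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s (n bytes available).  Stores the code
 * point in *cp and returns the number of bytes consumed, or 0 if the
 * sequence is malformed, truncated, overlong, or a surrogate.
 */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Hard-coded whitespace test; an assumption standing in for a real library. */
static int is_unicode_space(uint32_t cp)
{
    return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D) || cp == 0x85 ||
           cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

/*
 * Copy src (len bytes) to dst, collapsing every run of whitespace into a
 * single ASCII space.  dst needs room for len + 1 bytes.  Malformed bytes
 * are copied through unchanged.  Returns bytes written, excluding the NUL.
 */
size_t utf8_compress_spaces(const char *src, size_t len, char *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t in = 0, out = 0, i;
    int in_space = 0;

    while (in < len) {
        uint32_t cp;
        size_t n = utf8_decode(s + in, len - in, &cp);

        if (n == 0) {                   /* malformed: pass one byte through */
            dst[out++] = src[in++];
            in_space = 0;
            continue;
        }
        if (is_unicode_space(cp)) {
            if (!in_space)
                dst[out++] = ' ';
            in_space = 1;
        } else {
            for (i = 0; i < n; i++)
                dst[out++] = src[in + i];
            in_space = 0;
        }
        in += n;
    }
    dst[out] = '\0';
    return out;
}
```

The point of the sketch is that a multibyte space such as U+3000 gets
folded together with neighbouring ASCII spaces whether the process runs
in the C locale or a UTF-8 one, which is exactly what isspace() and
iswspace() cannot promise.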

  Character encoding choices can get quite political.

  If a library is adopted, then users have full control of what encoding
  they prefer.

Well, I was thinking that the locale would control the display/encoding
character set, like it does now.
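
As a sketch of what "the locale controls it" means in code: POSIX lets a
program ask for the character set of the current locale through
nl_langinfo(CODESET).  The wrapper name display_charset() is made up for
this example and is not meant to describe how nmh actually does it.

```c
#include <langinfo.h>
#include <locale.h>

/*
 * Hypothetical helper: return the character set name of the user's
 * locale, as selected by LC_ALL, LC_CTYPE or LANG.
 */
const char *display_charset(void)
{
    /* Adopt the environment's locale; on failure we remain in the "C" locale. */
    setlocale(LC_CTYPE, "");

    /* The exact spelling of the returned name is platform-dependent. */
    return nl_langinfo(CODESET);
}
```

The returned string could then be handed to something like iconv_open()
as the target charset for display.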

- As for parsing message headers, make it a configurable option
  what the default character encoding should be.  UTF-8 could be the
  default (which fortunately is US-ASCII compatible).

  Real-world note: I have encountered emails that actually use a
  non-ASCII default encoding for message header data, typically messages
  from non-English locales.  Technically, these messages are not
  conformant to the RFCs, but such messages actually exist.  Hence, in my
  project, I support an option that specifies what the default encoding
  is.

While I understand where you're coming from, back before EAI those
messages were invalid according to the RFCs.  Now the RFCs have changed
and those messages are defined as being UTF-8, full stop, no exceptions.
I understand the need to define a default character set for messages
which don't meet the RFCs, but it feels wrong to me to allow the user to
override the interpretation of a message which is now legal.  I welcome
discussion in this area.
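
One possible way to reconcile the two positions, offered only as a
sketch: treat raw 8-bit header data as UTF-8 whenever it is well-formed
UTF-8, and consult a configured fallback only when it is not.  The
function names and the idea of passing the fallback as a parameter are
assumptions for illustration, and the validity test is a heuristic, since
some legacy-encoded text can happen to be valid UTF-8.

```c
#include <stddef.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        size_t extra, k;
        unsigned long cp, min;

        if (b < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        }
        if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or invalid lead byte */
        }
        if (n - i <= extra)
            return 0;                   /* sequence runs past the end */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}

/*
 * Hypothetical policy: well-formed data is taken to be UTF-8, and only
 * non-conforming data is interpreted in the caller-supplied fallback.
 */
const char *header_charset(const char *raw, size_t len, const char *fallback)
{
    return is_valid_utf8((const unsigned char *)raw, len) ? "UTF-8" : fallback;
}
```

With this policy a user-configured default never overrides a header that
is legal under the current RFCs; it only applies to data that cannot be
UTF-8 in the first place.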

- I think it is perfectly reasonable to leverage the current locale
  setting to determine defaults, but one should be able to explicitly
  override such defaults via .mh_profile and command-line options.

Well, a user can already override that by changing locale environment
variables.  To me that seems like the right mechanism; you can do that
on the command line, with shell wrappers, whatever.

- Warning message(s) should be generated when character data is lost due
  to conversion.

It's unclear to me where those messages should go, and it doesn't seem like
anyone else does that.

--Ken

