Re: RFC 2047 and gatewaying


Leo Bicknell wrote:

In a message written on Fri, Jan 03, 2003 at 09:01:42PM -0800, Russ Allbery 
wrote:

* sendmail still is not fully 8-bit clean in the latest analysis that
  I've seen, which means that 7-bit issues continue to plague the mail
  system.



While I run sendmail, it is not the only mail system.  qmail,
vmmail, etc all are widely used.  Change the RFC, then complain
about the program.


While sendmail-bashing may be a popular pastime in some circles, sendmail
isn't the entirety of the issue.  There are a number of issues which have
affected the mail protocols:
1. the need for certain uniform conventions, e.g. end-of-line.
2. software constraints, e.g. use of ASCII NUL to terminate character
   strings in some library functions in some programming languages.
3. character set limitations, e.g. EBCDIC.
4. network limitations, e.g. Bitnet.

If you could wave a magic wand and make sendmail completely 8-bit
transparent, there would still be network and character set issues
that would prevent use of certain characters (many of them within
the US-ASCII set) without encoding.
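
As a rough illustration of the character set point (a sketch only: Python's bundled cp037 and cp500 codecs stand in for two EBCDIC variants, and the helper names are made up for this example), letters map to the same octets in both variants while a US-ASCII character such as "[" does not, so it cannot pass between the two unencoded:

```python
# Illustrative only: cp037 and cp500 are two EBCDIC code pages shipped with
# Python; the post itself does not name specific code pages.
def ebcdic_octets(text, codecs=("cp037", "cp500")):
    """Return the octets each EBCDIC variant assigns to `text`."""
    return {name: text.encode(name) for name in codecs}


def survives_gateway(text, sender="cp037", receiver="cp500"):
    """True if octets written under `sender` read back unchanged under `receiver`."""
    return text.encode(sender).decode(receiver) == text


letters = ebcdic_octets("A")    # same octet in both variants
brackets = ebcdic_octets("[")   # different octets in the two variants

print(letters, brackets)
print(survives_gateway("ABC"), survives_gateway("a[0]"))
```

Encoding the text into a subset that every variant agrees on (which is what the Q and B encodings do) sidesteps the mismatch.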

Unlike Usenet, mail is vital to many businesses and individuals.  One
cannot simply "change the RFC, then complain" when things stop
working en masse.  *Before* the protocols can be changed, it is
necessary to ensure that software *and* networks, *as affected
by their operating environments*, are compatible.  Before considering
the considerable amount of research required even to determine
what the various network and software constraints are, it is prudent
to consider the cost/benefit ratio.  Mail works fine now -- what
benefit is to be achieved by changing the protocol?  In the case
of the character set limitations, any benefits are likely to be
minimal and are constrained by a number of factors:
1. There is no universal character set (don't point to Unicode; there
   are a number of incompatible Unicode versions, and as noted in
   this thread, Unicode is unacceptable to some)
2. It is not possible to identify a character set reliably without
   some external indication (a.k.a. tagging).
3. Low-level protocols are based on octets, and any character set
   with any hope of being a universal character set will have
   "characters" wider than one octet (once upon a time, Unicode
   was guaranteed to be 16 bits, but that was when Unicode was for
   characters -- before it was extended to include things which it
   specifically excluded, such as musical notes).  Therefore, *some*
   encoding will be required to split the N bits comprising a
   character into some sequence of octets. E.g. there are several
   varieties of utf-8.  Of course, there's also utf-7, which avoids
   the network and software issues in the first place...
4. There is often a need for language tagging, as provided for by
   RFCs 2047 / 2231. E.g. =?iso-8859-1*en?q?boot?= is very different
   from =?iso-8859-1*de?q?boot?= even though the character set and
   encoding are invariant.  A very recent change to Unicode provides
   for language tagging (via an *encoded* set of tags), but that
   tagging feature is specifically disallowed with "higher-level protocols",
   specifically mentioning MIME.
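
Points 2 and 3 can be illustrated with a short sketch (Python's standard codecs are used here; the helper names and the sample octet are chosen for this example and are not from any RFC). The same untagged octet reads as a different character under each charset, and a character wider than one octet, such as a musical symbol, must be split into several octets by *some* encoding:

```python
def decode_as(octets, charsets):
    """Decode the same octets under each candidate charset."""
    return {cs: octets.decode(cs) for cs in charsets}


def octet_forms(char, encodings=("utf-8", "utf-16-be", "utf-7")):
    """Show how each encoding splits one character into octets."""
    return {enc: char.encode(enc) for enc in encodings}


# Point 2: without a charset tag, nothing says which reading is intended.
untagged = b"\xe9"
readings = decode_as(untagged, ["iso-8859-1", "iso-8859-7", "koi8-r"])

# Point 3: U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in 16 bits,
# let alone in one octet.
clef = "\U0001D11E"
forms = octet_forms(clef)

print(readings)
print({enc: len(octets) for enc, octets in forms.items()})
```

The utf-7 form uses only octets below 0x80, which is why it avoids the 8-bit transport question, at the cost of a longer encoded form.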
The key points are:
a. encoding will still be required
b. language tagging is still desirable in many situations
c. character set tagging will still be required.
Proponents of utf-8 and opponents of RFC 2047 / 2231 (often the same
individuals) focus on the encoding issue and lose sight of the charset
tagging and language tagging issues, which are equally important.
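
To make the three pieces concrete, here is a minimal sketch of pulling the charset, language, and text out of a single encoded-word of the form used above. It is not a full RFC 2047 / 2231 parser (no folding, no adjacent-word handling, no length checks), and the function and pattern names are invented for this example:

```python
import base64
import quopri
import re

# =?charset[*language]?encoding?payload?=
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*]+)(?:\*(?P<language>[^?]+))?"
    r"\?(?P<encoding>[QqBb])\?(?P<payload>[^?]*)\?="
)


def parse_encoded_word(word):
    """Split one encoded-word into charset tag, language tag, and decoded text."""
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        raise ValueError("not an encoded-word: %r" % word)
    raw = m.group("payload").encode("ascii")
    if m.group("encoding").lower() == "q":
        # header=True makes "_" decode to a space, as the Q encoding requires
        octets = quopri.decodestring(raw, header=True)
    else:
        octets = base64.b64decode(raw)
    return {
        "charset": m.group("charset"),
        "language": m.group("language"),  # None when no language tag is present
        "text": octets.decode(m.group("charset")),
    }


english = parse_encoded_word("=?iso-8859-1*en?q?boot?=")
german = parse_encoded_word("=?iso-8859-1*de?q?boot?=")
print(english)
print(german)
```

The two results carry identical charset and text and differ only in the language tag, which is exactly the information that would be lost if only the encoding were considered.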

The bottom line is that even if it were possible to eliminate the
encoding issue, 2047 / 2231 would still be needed for charset and
language tagging.  Since it isn't possible to avoid the encoding
issue (at least until such time as the "character" width stabilizes
and is no wider than the low-level protocol transmission width), we
will continue to need mechanisms like 2047 / 2231 [yes, Charles, even
20 or 30 years hence (though hopefully the RFCs will have been updated
to consolidate requirements, conform to current nomenclature, and
correct errors)].