
Re: RFC 2047 and gatewaying

2003-01-03 11:40:11

Charles Lindsey <chl@clw.cs.man.ac.uk> writes:

> 6. MIME is a set of protocols defined for the Email world, as is
> explicitly stated in RFC 2045. Officially speaking, there is no such
> thing as MIME within Netnews (which accounts for its extremely poor
> uptake within Usenet).

I doubt this.  I believe that the slow uptake of MIME within Usenet is due
to other issues entirely: primarily, many news servers have had to block
attachments of any kind because of space constraints and abuse, and MIME
makes it easier for people to attach documents; even more, there was some
horrible misuse of multipart/alternative back in the early days.

I really don't think the lack of official standardization for Usenet had
anything to do with it, and I believe that if you look, you'll find that
many, if not most, Usenet messages have Content-Type and MIME-Version
headers.  I don't think that's extremely poor uptake.

There continues to be a somewhat irrational bias against any sort of MIME
multipart, since (largely due to multipart/alternative) multiparts are
considered synonymous on Usenet with massive, unnecessary bloat that looks
horrible in news readers, even though most news readers cope fine now.

There were also some secondary problems with quoted-printable encoding and
clients that were too aggressive about going to an encoding rather than
just figuring out the right charset tag to use, since quoted-printable
pretty much destroys source code excerpts for a client that doesn't know
how to decode it.  Technical mailing lists have had this same problem.
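
To make that concrete, here's a rough Python sketch (the excerpt is made
up) of what a client that doesn't decode quoted-printable ends up showing
once a body gets encoded:

    import quopri

    # A made-up source code excerpt as it might appear in a message body.
    excerpt = 'if (a == b && total >= 100) return "done";\n'

    # What a non-decoding client displays once the body is QP-encoded:
    wire = quopri.encodestring(excerpt.encode("us-ascii"))
    print(wire.decode("ascii"))
    # if (a =3D=3D b && total >=3D 100) return "done";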

> 9. In fact, there are only two places where MIME in Netnews will differ
> from MIME in Email. One is in allowing full UTF-8 within quoted-strings
> within parameters in the Content-* headers (and probably in comments
> within those headers too).

I think this is a really unfortunate decision to try to push.  There are
already two encoding methods in widespread use for parameters in MIME
headers, namely the correct RFC 2231 encoding and the incorrect but widely
used and understood mangling of RFC 2047.  Adding yet a third encoding
that, on top of everything else, is unlabelled really isn't doing anyone
any favors.
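
To put the three side by side, here's a quick sketch (Python's stdlib can
produce the first two; the filename is a made-up example):

    from email.header import Header
    from email.utils import encode_rfc2231

    filename = "résumé.txt"   # hypothetical non-ASCII parameter value

    # RFC 2231 (correct):  filename*=utf-8''r%C3%A9sum%C3%A9.txt
    print("filename*=" + encode_rfc2231(filename, "utf-8"))

    # RFC 2047 mangling (wrong, but widely emitted and understood):
    # an encoded-word stuffed into the quoted-string, filename="=?utf-8?b?...?="
    print('filename="%s"' % Header(filename, "utf-8").encode())

    # The proposed third form: raw, unlabelled UTF-8 in the quoted-string.
    print('filename="%s"' % filename)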

> It is already the case that 2.5% [1] of Usenet articles use raw
> Non-ASCII characters in their headers. I have not heard of any damage or
> chaos arising from gateways that cannot cope.

Gateways pretty much universally pass those characters into e-mail
unencoded, which for the most part works because nearly all widely used
encodings avoid the no-man's-land of unassigned ISO 8859-1 codes that are
mangled by sendmail.  UTF-8 does not.
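
(I'm taking the no-man's-land here to be the C1 range 0x80-0x9F, which ISO
8859-1 leaves unassigned; whether a given UTF-8 sequence lands in it
depends on the character, but most non-Latin scripts do.  A quick sketch:)

    # Where the bytes of one Korean syllable land under two encodings,
    # relative to the C1 range 0x80-0x9F that ISO 8859-1 leaves unassigned.
    c1 = range(0x80, 0xA0)

    for charset in ("euc-kr", "utf-8"):
        data = "한".encode(charset)
        hits = [hex(b) for b in data if b in c1]
        print(charset, [hex(b) for b in data], "C1 bytes:", hits)

    # euc-kr ['0xc7', '0xd1'] C1 bytes: []
    # utf-8 ['0xed', '0x95', '0x9c'] C1 bytes: ['0x95', '0x9c']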

However, I think one key point that you're missing is that the arrival of
those messages at a destination mailbox without any changes made to the
bytes contained in the message isn't success in and of itself.  Those
bytes also have to be *interpreted* correctly.  If they're not correctly
interpreted by the end-user's mail client, the communication has failed.

I daresay that you're likely not seeing Korean characters in all those
8-bit e-mail messages that you receive, indicating that the encoding that
those messages are using is failing to actually communicate.  The fact
that you received the raw bytes as they were sent is really rather
uninteresting by comparison.  And the reason why you're not seeing Korean
characters is that your client doesn't have enough information to know
what character set to use, since the Korean spam is, pretty much without
exception, unlabelled 8-bit content.
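
The underlying problem is easy to demonstrate (a sketch; the sample text
is made up): the same bytes mean entirely different things depending on
which charset the client guesses.

    # Unlabelled 8-bit content: the same bytes, read two different ways.
    raw = "안녕".encode("euc-kr")      # Korean "hello", no charset label

    print(raw.decode("euc-kr"))        # 안녕  -- what the sender meant
    print(raw.decode("iso-8859-1"))    # ¾È³ç  -- what a Latin-1 client shows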

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>