ietf-822

Re: why it is a problem to transmit binary as binary in mail

1991-09-17 22:00:48
On Tue, 17 Sep 91 22:06:04 -0400, geoff@world.std.com wrote:
> To try to drag this discussion down to earth (and answer some nagging
> questions), what machines and mail systems strip the high bit of each
> incoming character?  I am led to believe that it's only PDP-10s using
> 7-bit bytes and old sendmails.

Please read the past several months' worth of prior discussion.  Most of us
have had quite enough of the stupid, obnoxious, and obstreperous behavior of
certain individuals in rehashing ancient topics.  If you do not wish to be
considered stupid, obnoxious, and obstreperous, please review prior discussion
before setting this WG's work back with another round of history-repeating.

To quickly summarize:
 1) There are other SMTP mailers in the world besides sendmail and the PDP-10
    mailer.
 2) There are other mailing technologies in the RFC-822 world besides SMTP.
 3) Character sizes other than 7 and 8 bits exist.  In particular, I
    know of 9-bit, 12-bit, 14-bit, 15-bit, 16-bit, 24-bit, and 32-bit
    character sets.  There are probably more.
 4) 8-bit transport is not necessary to convey characters larger than
    7 bits (see the sketch after this list).
 5) 8-bit SMTP is not sufficient to convey characters larger than
    7 bits.
 6) There is a large software base and infrastructure of presently-conforming
    e-mail software that knows about the 7-bit restriction and depends upon it
    in some way.  Contrary to popular belief, most of this infrastructure is
    not on PDP-10's and in fact is on `modern' Unix systems.  Horror stories
    about 8-bit were mentioned at the St Louis IETF by Unix sites, referring
    to software which cannot be changed.
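
To make points 4 and 5 concrete, here is a rough illustration in C.  It is
not any proposed encoding; the function name and sample values are invented
for the example.  It simply shows that wide character codes can ride over a
strictly 7-bit channel, and conversely that an 8-bit channel by itself
neither identifies nor carries them.

    /* Spell each 16-bit character code as four USASCII hex digits, so
     * every octet actually sent is printable 7-bit USASCII. */
    #include <stdio.h>

    static void emit_over_7bit_channel(const unsigned short *chars, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            printf("%04X", (unsigned) chars[i]);  /* digits and A-F only */
        putchar('\n');
    }

    int main(void)
    {
        unsigned short sample[] = { 0x3042, 0x3044, 0x3046 };  /* arbitrary codes */
        emit_over_7bit_channel(sample, 3);
        return 0;
    }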

The problem of fixing sendmail is much easier than
. the problem of installing the new sendmail everywhere which is much easier
  than
. the problem of fixing all the other transport programs which is much easier
  than
. the problem of installing the new versions of all the other transport
  programs everywhere which is much easier than
. the problem of fixing all the RFC-822 handling software everywhere which is
  much easier than
. the problem of installing the new versions of all the RFC-822 handling
  software everywhere.

And all this work is for a kludge which does not solve the general problem and
is not even necessary!

> toward bytes, the fundamental unit of networking

This is imprecise, inaccurate, and misleading.  Octets are a fundamental unit
of TCP/IP.  To express the difference in C terms:
        octet != byte           [although sizeof (octet) == sizeof (byte)]
        networking > TCP/IP
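
As a rough aside (mine, not anything from a standard): in ANSI C the
difference shows up as CHAR_BIT, the number of bits in the C byte, which is
only required to be at least 8.  An octet is exactly 8 bits by definition,
while on a 36-bit machine a C byte might well be 9 bits.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* A C "byte" is CHAR_BIT bits; an octet is always exactly 8. */
        printf("bits per C byte: %d\n", CHAR_BIT);
        printf("bits per octet:  8\n");
        return 0;
    }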

> yet the applications all feel constrained to whack off
> the high bit just in case some PDP-10 is still running somewhere on the
> Internet.

PDP-10 bashing is stupid and irrelevant.  I can easily arrange for almost all
the PDP-10's remaining on the Internet to use 8-bit characters if that were to
become necessary; unlike byte-oriented (fixed at 8) architectures, a PDP-10's
idea of a `byte' is any number of consecutive bits from 1 to 36.  I wrote most
of the PDP-10 mailer and am still reasonably familiar with how it works.  [Not
to mention that it's still more robust and reliable than sendmail.  `Unknown
mailer error 1', a 1000-character limit on alias expansions, `Host name
configuration error', indeed!]
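
For the curious, here is a rough C model (the names are invented by me, not
taken from any PDP-10 software) of what `any number of consecutive bits from
1 to 36' means in practice: a byte is just a field of some width at some
position within a 36-bit word.

    #include <stdint.h>
    #include <stdio.h>

    /* Extract the `width'-bit byte whose low-order bit sits `pos' bits
     * above the bottom of the 36-bit word `w'. */
    static uint64_t pdp10_byte(uint64_t w, unsigned pos, unsigned width)
    {
        return (w >> pos) & ((1ULL << width) - 1);
    }

    int main(void)
    {
        uint64_t word = 0765432101234ULL;       /* 12 octal digits = 36 bits */
        /* The classic text layout was five 7-bit bytes per word, packed
           from the high end; 8-bit or 9-bit bytes use the same machinery. */
        printf("%llo\n", (unsigned long long) pdp10_byte(word, 29, 7));
        return 0;
    }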

PDP-10's were never an issue.

> The real question is: how many bits are there in a byte?

This is not the `real' question.

The `real' question is: what is a character?

You will find almost universal agreement that a 7-bit USASCII value in the
low-order bits of an 8-bit octet is a character.  However, when it comes to
anything else you will find intense disagreement.

It isn't ISO Latin.  I would hazard a guess that most of the terminals in use
in North America do *not* support ISO Latin.  Many of them are 7-bit and think
the 8th bit is a parity bit.  Some have other (including manufacturer
proprietary) glyphs up there.  Terminals in Japan have JIS as the non-USASCII
character set.  Terminals in Taiwan have BIG5.  Terminals in mainland China
have GB.  So what if an 8-bit message is received?  What is it?

The point of the Internet is *interoperability*, which implies a common
platform.
If we do not have a common platform, then we'll end up like Europe with its
dozens of ridiculous little countries, would-be countries, and feuding tribes,
none of which talk to each other very well.  The only difference is that in
polite politically correct Internet terms we're going to call them `enclaves'.

What we have here is a West-Euro-centric view that wants ISO Latin handled as
simply as USASCII, and to hell with anyone else.  For example, none of these
proposals do a damn bit of good for Japanese, yet Japan is more important to
computing than the entire European continent.

If we are going to support the interchange of non-USASCII characters -- and it
appears to be undeniably a necessary thing -- then we *must* do it in an
interoperable manner, not in a half-assed fashion that makes it easy for a
vocal constituency of lazy programmers to offer non-interoperable domestic
character handling within a small set of countries.

> If there are 8, then we must not mangle data by
> stripping the high (8th) bit of each byte.

Read RFC-821.  Compatibility with published standards is important; otherwise
there will be no credibility in the standards.

If you transmit other than 7-bit data in an SMTP session you are in violation
of the current effective standard.
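
A rough sketch (mine, not from any particular mailer) of what a conforming
sender can do instead of silently shipping high-bit octets: scan the outgoing
data and refuse, or re-encode, anything that is not 7-bit clean.

    #include <stddef.h>

    /* Return nonzero if every octet in the buffer fits in 7 bits, as
     * RFC-821 requires of SMTP mail data. */
    static int seven_bit_clean(const unsigned char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if (buf[i] & 0x80)
                return 0;
        return 1;
    }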

> If there are 7, why are we
> shipping them around in 8-bit octets?

Read any introductory text on networking regarding separation of layers.

