RE: on character sets and encodings

My original posting  >
Ned, 13 Feb     +
Erik, 14 Feb    #
Ned, 14 Feb     !

The situation with MIME is, to put it mildly, perceived as a disaster
and a complete refutation of the hypothesis that MIME can operate
transparently, with no change to MTAs or gateways.


+ Whoa. There is no such hypothesis that I'm aware of. Moreover, anyone
+ who makes such claims is in serious need of a reality check.
    In case it wasn't clear, I'm trying to report here in the hope of
renewing a dialogue and indicating the level of emotion and/or concern. 
I'm not nearly as concerned, or people would have heard about it a long
time ago.
    The hypothesis is, however, out there.   I'm happy to recirculate
messages and identities privately, but I think we'll stay closer to the
technical issues if I don't.

+ And remember, we're talking
+ about strict MTAs here -- as I've just illustrated, once you enter the
+ world of gateways all bets are off.
+...
+ So, what we have here is that MIME, if properly used, always generates
+ messages that unconditionally fall within the set of messages a conformant
+ MTA is required to be able to transport. Conclusion: Any RFC821 MTA that
+ cannot carry such a message is either noncompliant with standards or
+ misclassified and hence
+...
+ But what about non-RFC821 MTAs, you ask? Theoretically, these fall into two
+ groups -- those that transfer RFC822 using something other than SMTP and those
+ that transfer messages in another format. Now, if the transport carries RFC822
+ ...
Ned's logic here is impeccable, as usual.


+ The BITNET issue boils down to one question that's really easy to
+ state: What's the message format that BITNET uses? If the message format
+ is RFC822 (and hence the use of EBCDIC is a transformation of convenience
+ that occurs at, say, the transport layer) then the BITNET-Internet
+ gateways are really not gateways at all; they are just MTAs. 
   I'd suggest that, while there has been (and continues to be)
confusion about BITNET envelope formats, the message format is pretty
clear.  It isn't RFC822, since RFC822 specifies that headers are to be
in ASCII.  But it is a format for which an exact definition would be
obtained by going through RFC822 and replacing "ASCII" with "EBCDIC"
every place the former appears either explicitly or implicitly.
   While we can debate the MTA/UA boundary and related things endlessly,
the BITNET <-> Internet boundary is clearly marked by gateways, not
MTAs.  Those gateways have an SMTP-over-TCP transport on one side and a
different one (usually BITNET-BSMTP over NJE) on the other, often have to
rewrite addresses that are valid on one side into address forms that are
valid on the other side, and have to convert from ASCII to EBCDIC or
vice versa (i.e., convert between RFC822 and its EBCDIC clone).
Any one of these classes of operation has traditionally been adequate for
us to identify something as a gateway.
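
   To make the character-code part of that gateway job concrete, here is
a minimal sketch in Python.  It is not what any real gateway runs: the
function names are hypothetical, and Python's "cp037" codec is only an
assumption standing in for "EBCDIC" (BITNET sites used several variants).
Transport changes and address rewriting are left out entirely.

```python
# Hypothetical sketch of the character-code step of a BITNET <-> Internet
# gateway. "cp037" is an assumed stand-in for whatever EBCDIC variant a
# given site actually uses.
EBCDIC = "cp037"


def internet_to_bitnet(message: bytes) -> bytes:
    """Transcode an ASCII RFC822 message into its EBCDIC clone."""
    return message.decode("ascii").encode(EBCDIC)


def bitnet_to_internet(message: bytes) -> bytes:
    """Transcode the EBCDIC clone back into ASCII RFC822."""
    return message.decode(EBCDIC).encode("ascii")


original = (
    b"From: user@example.org\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"Hello\r\n"
)

on_bitnet = internet_to_bitnet(original)    # every byte is rewritten,
round_trip = bitnet_to_internet(on_bitnet)  # including the charset label
```

   The point of the sketch is that the conversion is blind: the string
"us-ascii" is transcoded along with everything else, so on the BITNET
side the label is itself stored in EBCDIC and describes EBCDIC text.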
 
+ MIME should work transparently across all of BITNET if
+ this is the case. But if the message format isn't RFC822 then there was never
+ any sort of expectation that no modifications of the gateways would be needed.

  Philosophically, this may be the crux of the issue (operationally, it
may not be interesting).  Many people construed the "no transport
modification" discussion as implying that any gateway that could
properly handle RFC822 (regardless of what was on the other side)
without loss of headers would not have to be modified to adequately pass
MIME.  The BITNET gateways can properly handle RFC822 (at least in the
Internet->BITNET direction) but may need modification because the format
on the other side isn't RFC822 but a clone derived from an orderly
transformation.

+ This is why I strongly supported the development of consistent character set
+ standards for BITNET at the last CRENTAC meeting I attended. Such a standard
+ is a necessary first step in coming to grips with the problem. There are
+ several additional steps that have to be taken as well; the resulting
+...
  While I certainly don't disagree with this, the argument on the BITNET
side focuses on the somewhat sloppy definitions of "EBCDIC" now in use
there.  More complex character sets (e.g., EBCDIC clones of 8859-n sets)
certainly raise additional problems.  As we have seen with "ASCII" in
both the 822 and SMTP extensions efforts, we could live with relatively
sloppy definitions for years and needed to aggressively tighten those
definitions when we started doing new things.  But a different version
of "the first step" is that it is necessary to be able to handle a MIME
  Content-type: text/plain; charset=us-ascii
message at least as well as old-fashioned 822 messages are handled. 
Then one can move forward with the interesting problems.
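
   As a rough illustration of that "first step", the sketch below uses
Python's standard email package to decide whether a message can be
treated exactly like an old-fashioned 822 text message.  The function
name and the sample messages are hypothetical, and the rule it applies
(text/plain, us-ascii, 7bit, not multipart) is my assumption about what
"at least as well" would mean in practice.

```python
from email import message_from_string
from typing import Optional


def plain_ascii_body(raw: str) -> Optional[str]:
    """Return the body if the message can be handled like plain RFC822 text."""
    msg = message_from_string(raw)
    # With no Content-Type header, the email package defaults to text/plain.
    if msg.is_multipart() or msg.get_content_type() != "text/plain":
        return None
    # A missing charset parameter is treated as us-ascii.
    charset = msg.get_param("charset", "us-ascii")
    if not isinstance(charset, str) or charset.lower() != "us-ascii":
        return None
    encoding = msg.get("Content-Transfer-Encoding", "7bit")
    if str(encoding).strip().lower() != "7bit":
        return None
    return msg.get_payload()


old_style = "From: a@example.org\nSubject: test\n\nHello\n"
mime_labelled = (
    "From: a@example.org\nSubject: test\nMIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n\nHello\n"
)
html_message = (
    "From: a@example.org\nMIME-Version: 1.0\n"
    "Content-Type: text/html; charset=us-ascii\n\n<p>Hello</p>\n"
)
```

   Under this rule the unlabelled message and the labelled one yield the
same body, and anything else is set aside for the "interesting problems".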

+ I cannot speak to the political agenda, but I agree that flag-day switchover
+ isn't feasible. However, there is no need for such a flag-day; migratory
+ and staged approaches to solving the problem do exist.
   The concern has been that staged solutions would result in some
things being labelled ASCII that aren't, alongside things labelled
that way that actually are, with no way to tell the two apart.  Erik may
have provided a trick solution for this.

#Instead, the UAs should be altered (not necessarily all on the same
# day) so that, when MIME support is added, the program is also made to
# check what code the headers are in.  I.e. if the header says
# "us-ascii", but the "u" is 0xa4 (hex a4), then the program could guess
# that it was converted to EBCDIC.
   Needs a little refinement (e.g., to check for 0xE4 as well as 0xA4),
but has potential, I think.  At the risk of leaping headlong down the
slippery slope, five minutes of analysis suggests that, if we were
willing to impose one constraint on ourselves-- that character set names
must always contain "-" in one of the first four positions and may not
contain "`" (ASCII 0x60) there-- we could give BITNET (and Internet
implementations striving for robustness) some sensible, if odd, advice:
  (i) Don't make up "BITNET" names for character sets; use the
registered MIME ones, but assume that they represent their EBCDIC-based
clones in EBCDIC environments.
  (ii) For situations in which "EBCDIC environment" cannot be determined
with precision, note that either 0x60 (the EBCDIC "-") or 0x2D (the ASCII
"-") will appear within the first four characters.  If 0x60 appears, the
set is EBCDIC-derived.  If 0x2D appears without a preceding 0x60, then it
is ASCII-derived.
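
   Both checks above (Erik's first-letter test, extended to either case,
and the hyphen-position rule) can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not a proposal for the
actual code: the function names are hypothetical, "cp037" is an assumed
stand-in for EBCDIC, and the byte values are looked up from the codecs
rather than hard-coded.  Under cp037, "u" is 0xA4, "U" is 0xE4, and "-"
is 0x60.

```python
# Assumption: Python's cp037 codec stands in for "EBCDIC".
EBCDIC = "cp037"


def first_letter_guess(charset_name: bytes) -> str:
    """Guess the code of a 'us-ascii' label from its first byte."""
    first = charset_name[:1]
    if first in ("u".encode("ascii"), "U".encode("ascii")):
        return "ascii"
    if first in ("u".encode(EBCDIC), "U".encode(EBCDIC)):
        return "ebcdic"
    return "unknown"


def hyphen_guess(charset_name: bytes) -> str:
    """Guess the code from whichever hyphen shows up first in four bytes."""
    ascii_hyphen = "-".encode("ascii")[0]
    ebcdic_hyphen = "-".encode(EBCDIC)[0]
    for byte in charset_name[:4]:
        if byte == ebcdic_hyphen:
            return "ebcdic"
        if byte == ascii_hyphen:
            return "ascii"
    # Neither hyphen in the first four positions: the naming constraint
    # was not honoured, so no guess is made.
    return "unknown"


label_ascii = b"us-ascii"
label_ebcdic = "US-ASCII".encode(EBCDIC)
iso_ebcdic = "iso-8859-1".encode(EBCDIC)
```

   Note that the first-letter test only works for a label known to be
"us-ascii", while the hyphen test applies to any name that obeys the
constraint, e.g. hyphen_guess(iso_ebcdic) comes out as "ebcdic".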

Now, an alternative to this might be to use the first rule, but
encourage BITNET gateways to invent a Content-transfer-encoding
modification to denote when the "US-ASCII" really means "us-ascii" and
not "generic gateway EBCDIC".

Does this lead to progress?  Part of the political problem is that some
of the people on that side of the gateways feel that we did this "to
them" without thinking out the implications for them.  If we could make
a few specific constructive suggestions, rather than just suggesting
that solutions exist, it would probably help a lot.

!Problems arise when such a UA emits a MIME message; if it emits EBCDIC and
!marks it as such the result will be mislabelled if it goes out on the 
!Internet. (And once you start creating such things there's no way to guarantee
!that they'll never escape.)
    But if, as suggested above and as you said, they spell "EBCDIC" as
"us-ascii" (these are, after all, just codes), this problem does not
arise as long as every gateway to the Internet continues to do what
every one of them does today--which is to convert messages from EBCDIC
to ASCII without paying attention to anything beyond the headers.

! But there's something a bit unsettling about this approach
!in the long term.
   Well, we are encouraging them to lie in their own environment if one
thinks the character-code-designation-strings really mean anything. 
Unattractive (but I mean--and meant earlier--just that: not
"intolerable" or "likely to lead to short-term meltdown", just
"aesthetically unpleasing"), but it might be the best technical
solution.

! Whether BITNET can live with it is another matter.
   If it is a workable technical solution and doesn't cause other
problems, then it seems to me that IETF responsibility ends with
suggesting it.  Nothing prevents them from coming up with other
strategies that work better (if such things exist) or, for that matter,
from rejecting all MIME messages at the gateways (if their users will
let them get away with that, which they won't).  I think there is an
IETF responsibility to show that the problem and issues have been
considered and to demonstrate that plausible solutions (things like flag
day transitions and complete changes of transport don't fall in that
category) exist.

   --john