EBCDIC, uuencode, etc

  I decided to keep my mouth shut about this until I consulted an
expert and generally figured out what I was talking about.  I don't
want to suggest that others haven't done the same, but one or two of
the posted comments have more closely resembled religious debates than
they have rational discussion.
  First of all, let me echo Keld's and Stef's most recent comments:
While it is certainly the case that the vast majority of Internet hosts
(on a headcount basis) use internal storage forms that are very similar
to that-which-we-mail-down-the-wire (more or less ASCII, stream-
oriented text with line boundaries identified by embedded control
characters (CR, LF, or CRLF)), many don't.  We have Internet EBCDIC
machines; we have machines that use record markers or line length
counts to delimit lines; we have machines that store "binary" files in
strict left-to-right bit order but that store characters in NUXI or
XNIU order; we may, for all I know, have native BCD machines around
still (old CDC 6600s?).  Each of those machines must adapt to the
Internet protocols--what goes onto, or comes off of, the wire--to the
extent needed to convert those forms accurately to and from internal
forms.
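
  As an illustration only, here is a minimal Python sketch of one such
boundary conversion.  It assumes a hypothetical host that stores each
line as a two-byte, big-endian length count followed by the line's
bytes.  The record layout and the function names are assumptions made
for the example, not a description of any particular system.

```python
import struct


def wire_to_records(wire: bytes) -> bytes:
    """Convert CRLF-delimited wire text to length-counted records."""
    lines = wire.split(b"\r\n")
    # A trailing CRLF leaves one empty piece after the last line; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    return b"".join(struct.pack(">H", len(line)) + line for line in lines)


def records_to_wire(store: bytes) -> bytes:
    """Convert length-counted records back to CRLF-delimited wire text."""
    lines = []
    pos = 0
    while pos < len(store):
        (length,) = struct.unpack_from(">H", store, pos)
        pos += 2
        lines.append(store[pos:pos + length])
        pos += length
    return b"".join(line + b"\r\n" for line in lines)


message = b"Hello\r\n\r\nworld\r\n"
stored = wire_to_records(message)
print(stored)
print(records_to_wire(stored) == message)
```

  Nothing in the stored form contains a CR or an LF, which is the
point: the line boundaries are real on both sides, but only the wire
form expresses them as embedded control characters.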
  I responded to the "what do I do about line wrapping" question
privately because I didn't think it really belonged on the list.  I was
wrong, and appreciate Stef's response, which only partially duplicated
mine.  It is profoundly relevant because it illustrates the range of
conversions that have to be managed at the boundaries, without a single
hint of "gateway" or "adapting to the broken behavior of other
networks".  The fact is that, if we solve our own problems in a clear
and unambiguous way, then most of the technical gateway problems
disappear.  After that, it is, as we say, a small matter of programming
for the delivery MTA-UA relationships and for the gateway developers/
maintainers.  And that is how it should be.
  What does this have to do with EBCDIC?  The "EBCDIC problem" provides
a particularly nasty example of the reasons why, if we are going to
involve ourselves with encodings other than plain ASCII, we need to be
extremely careful about what we are doing, and careful about two
things: 

  The first is very clear, that RFC-XXXX has to have a sufficiently
clear header model that a gateway or host that knows about it will be
able to decode whatever coding has been created in an unambiguous way.
Given an absolutely clear and precise definition of network-uuencode
and agreement by everyone to use it, rather than local versions, this
is really not a problem.  The receiver gets the code, reverses the
encoding specification or applies an appropriate character mapping, and
presto.  Really no worse than having to adjust end-of-line conventions.
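
  A minimal sketch of what "reverses the encoding specification" might
look like at a receiver, in Python.  The header name X-Encoding and the
labels in the table are hypothetical placeholders, not anything that
RFC-XXXX defines.  The only point is that an exact label selects
exactly one decoder, and an unknown label is refused rather than
guessed at.

```python
import base64

# Hypothetical label-to-decoder table; the labels are placeholders.
DECODERS = {
    "7bit": lambda body: body,       # plain text, nothing to undo
    "base64": base64.b64decode,      # standard-library Base64 decoder
}


def decode_body(headers: dict, body: bytes) -> bytes:
    """Pick a decoder from the (hypothetical) X-Encoding header."""
    label = headers.get("X-Encoding", "7bit").strip().lower()
    try:
        decoder = DECODERS[label]
    except KeyError:
        # An unrecognized label is an error, never a guess.
        raise ValueError(f"unknown encoding label: {label!r}")
    return decoder(body)


print(decode_body({"X-Encoding": "Base64"}, b"aGVsbG8="))
print(decode_body({}, b"plain text"))
```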
  There are two problems: (i) Like Nathaniel, I'm still waiting to see
that specification of network-uuencode.  (ii) More important, and
strictly pragmatically, I think we have had enough experience with some
members of the U**X vendor and developer community that we can predict that
some will ignore the "network-uuencode" specification and assume that
whatever they do already is close enough to the network model that they
can label it "network-uuencode" and send it out.  That is, in itself,
an argument for avoiding uuencode in RFC-XXXX, just on a "why go
looking for trouble that we know will come" basis.

  Second, and more important, there is a fundamental philosophical
principle underlying the RFC-XXXX work and the "change the message
format, not the envelope" reasoning that preceded it.  And that
principle is that it is possible to use codings that will get through
MTAs that have never heard of RFC-XXXX without being destroyed, and
that can even pass unscathed through UAs that have never heard of
RFC-XXXX so that users can push messages into files and apply private
decoders.  I think that is a very worthwhile principle, even though I'm
spending energy on 8-bit transport.
  But, as soon as one starts ignoring it and assumes, in any way, that
all MTAs and UAs (including ones into machines with "funny" internal
storage or character models and gateways into strange networks) will,
to the degree necessary, understand RFC-XXXX and its new headers, one
is in trouble.  At the hosts that do not, the assumption that what is
coming is "ordinary 822 ASCII mail" takes over, and those files may not
be converted correctly so that the user programs can unpack them.
  Either we have to be frightfully careful and conservative about this,
or we need to go back and change the envelope to ask an "are you ready
to receive XXXX, rather than minimal 822" question.  The other
alternative bears an extremely close resemblance to "just declare the
old systems broken; sooner or later they will get themselves
straightened out and life will be much easier for the rest of us in the
interim".

Transparency between ASCII and "EBCDIC" (in quotes because there are,
as has been pointed out, several of them):
  I forwarded the extensive analysis that Keld posted to Ed Hart and
asked him to comment on the list of characters.  Ed heads the SHARE
committee on trying to rationalize and internationalize EBCDIC (this is
the "group within SHARE..." referred to in Roger Fajman's note from
yesterday).  Ed is also (although fairly recently) vice chair of the
U.S. committee that corresponds to ISO/IEC JTC1/SC2, so it is
reasonable to assume that he is competent to talk about ISO character
set issues.
  That committee, which has a few people who have been following this
list as participants, started as a "find a standard EBCDIC for use in
international networks" effort and has ended up working on three
problems that turn out to need to be solved together.  I would
characterize them as:
  -- identify the best "code page" for use in EBCDIC-based
     international networks
  -- identify standard mappings between 8859-1 and 10646 and EBCDIC in
     that code page.
  -- educate IBM about the real problems and issues in their character
     set strategies and "standards".
 The last of these should not be ignored--IBM has learned a lot in the
process.

Anyway, Ed's comments follow.  Lines marked ">" come out of the
original posting, clear lines are Ed's remarks, and lines marked % are
my annotations.

------------------------------

% ...


> # is missing in 12 sets.
> $ is missing in 8 sets
> @ is missing in 13 sets
> [] is missing in 20 sets
> \ is missing in 16 sets
> ^ is missing in 17 sets, but the not character is defined in all those.
> ` is missing in 10 sets
> {} is missing in 17 sets
> | is missing in 16 sets (broken bar).
> ~ is missing in 20 sets


 Because of the problem, I'm not sure which characters are identified
 above.

% Ed does his work on an EBCDIC machine that is located on BITNET.
%Everyone who sees Keld's name with a vertical bar in the middle should
%think carefully about the implications of the fact that, even in clear
%text, with no special encodings, we can't communicate ISO646 national
%variations across the extended mail internet, or even the TCP/IP
%Internet, without distortion that might obscure the meaning of what was
%said.

> Conclusion: These 14 characters should not be used in a 64-char encoding.
> With some good will you may be able to use !" and maybe also ^
> as invariant characters.


You cannot discuss any of this without naming the characters because
I cannot tell which code page is being used nor which gateway may have
translated ISO to EBCDIC in the middle.

% See my comment above this paragraph and then read that paragraph
%again.  Please.  Note that Ed says "ISO" and "EBCDIC".  He is an
%optimist. 

% If the first sentence below is not clear, see the comments about
%"educating IBM" above.

An IBM document contains the result of IBM analysis.  I discovered the
table by accident as I was browsing through it.  The only character here
but not in the above is the " (QUOTATION MARK).
In IBM C-H 3-3220-050, IBM Corporate Specification:  REGISTRY, Graphic
Character Sets and Code Pages, page 403 is Figure 5, Data Processing
Invariant Set, Syntactic Subset, 81 Characters Plus Space.  It includes:

A-Z,a-z,0-9 as above
       .<(+       FULL STOP (PERIOD), LESS THAN, LEFT PARENTHESIS, PLUS
&       *);       AMPERSAND, ASTERISK, RIGHT PARENTHESIS, SEMICOLON
-/     ,%_>?      MINUS, SOLIDUS (SLASH), COMMA, PER CENT, UNDERLINE,
                       GREATER THAN, QUESTION MARK
      :  '="      COLON, APOSTROPHE, EQUAL, QUOTATION MARK

and
SPace, EO (Eight Ones (X'FF'))

With SPace but not Eight Ones, the set contains 82 characters.

The IBM document is "internal use only".  Moreover, your IBM
representative cannot order the manual by using that number.  It will
take some extra work for your rep to order it for you.

I hope this helps you.  All other characters are dangerous.

-----------------

OK?  For whatever it is worth, I think that is definitive.

  It is also as close as one will get to a consensus answer from the
community that Randall describes as:

> I know of sites in the US that do use a fair number of IBM
> S/370-type systems with EBCDIC and they have conversion software at
> their Internet link to handle the problem of converting US ASCII
> to/from EBCDIC at that point.  It seems to me that such sites are in
> the best position to comment on that situation.
 [text omitted] 


OK, you've got it :-).
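
  To make the code-page dependence concrete, a minimal Python sketch
follows.  It assumes that two EBCDIC code pages which ship as Python
codecs, cp037 and cp500, stand in for a sending gateway and a receiving
host that disagree about which "EBCDIC" is in use.  The function name
is made up for the example.

```python
def as_seen_by_receiver(text: str, sender_page: str, receiver_page: str) -> str:
    """Encode text with one code page and decode it with another."""
    return text.encode(sender_page).decode(receiver_page)


# A few letters and digits plus the 19 specials from Ed's invariant list.
SAFE = "ABCabc0189" + ".<(+&*);-/,%_>?:'=\""
# Some of the characters the discussion warns about.
RISKY = "[]|^!"

for label, sample in (("safe", SAFE), ("risky", RISKY)):
    seen = as_seen_by_receiver(sample, "cp037", "cp500")
    status = "unchanged" if seen == sample else "altered"
    print(label, status, repr(seen))
```

  No bytes are lost in the risky case; every one of them arrives.  They
simply mean something else on the other side, which is why the
characters have to be named rather than shown.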

Nathaniel suggested yesterday that the RFC should provide a way to tag
"uuencoded compressed tar" files, even while encouraging other things
instead.  I think that, absent a clear and accessible specification of
what those words mean, this may be undesirable and may send the wrong
message.  The precise implications of the argument *for* supporting
uuencode--that "everyone" does it--have the nasty habit of creating de
facto standards in which bad, but established, approaches drive out
good ones, and we get into "your implementation isn't acceptable, even
though it meets all of the RFC requirements, because it doesn't support
*my* uuencode-compressed-tar model".  Let's not encourage it.
  I think it would be extremely useful to establish a model for private
types for use among consenting parties
(X-SuperWidget-uuencode-compress-3a ?), but let's not standardize them,
even as a discouraged form, unless we are very clear about what we are
standardizing and that they won't get trashed by 822-conforming hosts.

Keld wrote on the 9th:

> Another restricted character set that could be of interest
> in determining "problem" characters is ASN.1 PrintableString
> (ISO 8824:1987 p 21 table 5).
> From ASCII the following characters are missing in PrintableString:
> !"#$%&*;<>@[\]^_`{|}~
>
> This adds the problem chars in addition to ISO 646 and EBCDIC
> problem characters: %&*;<>_
>
> I agree with Randall that we should not restrict ourselves too much
> from these restricted character sets.


  Well, I'm not sure *I* agree.  As long as we need to have these
things pass transparently onto conforming, moral, upright hosts who
have not yet implemented support for RFC-XXXX, we need to be very
restrictive about these things.
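
  How restrictive is easy enough to check mechanically.  The Python
sketch below tests two encoding alphabets against Ed's invariant set
and against PrintableString.  It rests on three assumptions: that the
second row of Ed's table includes RIGHT PARENTHESIS (which is what
makes the count come to 81), that Base64 uses A-Z, a-z, 0-9, "+" and
"/" with "=" as padding, and that uuencode uses the characters from
space through underscore, with the grave accent as a common substitute
for space.  Local uuencode variants may differ.

```python
import string

ALNUM = string.ascii_uppercase + string.ascii_lowercase + string.digits

# Ed's list: 62 alphanumerics plus 19 specials (space counted separately).
INVARIANT = set(ALNUM + ".<(+&*);-/,%_>?:'=\"")
# ASCII minus the 21 characters Keld lists as missing from PrintableString.
PRINTABLE_STRING = set(ALNUM + " '()+,-./:=?")

# Assumed alphabets; see the hedges in the text above.
BASE64 = set(ALNUM + "+/=")
UUENCODE = {chr(code) for code in range(0x20, 0x60)} | {"`"}


def outside(alphabet: set, safe: set) -> str:
    """Return the characters of alphabet that are not in safe, sorted."""
    return "".join(sorted(alphabet - safe))


for name, alphabet in (("base64", BASE64), ("uuencode", UUENCODE)):
    print(name, "vs invariant set:", repr(outside(alphabet, INVARIANT | {" "})))
    print(name, "vs PrintableString:", repr(outside(alphabet, PRINTABLE_STRING)))
```

  Under those assumptions the Base64 alphabet falls entirely inside
both restricted sets, and the uuencode alphabet does not.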

I believe that Roger's comments, Keld's comments, Stef's most recent
comments, and the position outlined above are consistent.  To rephrase
Stef's recent question, it seems to me that we may have reasonable
consensus on this, and on Base64.  Can we conclude that this is true
and get on with it? 

    --john