ietf-822

Re: 8bit transport and smart/trusted MTAs

1991-09-01 23:32:20
Bob Smart writes:

  Secondly I retain severe doubts about the whole process. The assumption
  of the Content-charset header is that the gateway designer can
  design an ASCII->EBCDIC algorithm that will be optimal for
  text/us-ascii and for text-plus/TeX with Content-Charset: us-ascii.
  My knowledge of TeX suggests that this is unlikely; I suspect
  that EBCDIC sites use slightly different macros for TeX functions
  to fit the limitations of the character set.

TeX is an excellent example that is well worth considering here.

When TeX is installed on a given computer system, part of that installation
defines a mapping from the system's character set to TeX's internal character
set. Originally the internal character set was just 7-bit US-ASCII (this is not
completely true, but it is close enough for this discussion); in TeX V2.0 8-bit
support was added, but the mapping of the additional 128 characters was left
undefined in general (i.e. these characters are not defined internally, nor is
the external-to-internal mapping for them defined).
Presumably, any implementation that chooses to define these characters would
also make appropriate changes in the various standard macro packages to support
these definitions internally. (I have not found this to be true in general, but
this is a problem with the installation methodology, not with the concept,
which is clearly laid out in the source code.)

A quote from the TeXBook might be useful at this point (from Appendix C):

  Different computers tend to have different ways of representing the
  characters in files of text, but TeX gives the same results on
  all machines, because it converts everything to a standard internal
  code when it reads a file. TeX also converts back from its internal
  representation to the appropriate external code, when it writes
  a file of text; therefore most users need not be aware of the fact
  that the codes have actually switched back and forth inside the machine.

So what does all this mean? Well, in effect it means that when you install a
particular implementation of TeX, you basically define the external character
set it expects its source files to be in. This is actually done in a clever way
by the TeX source code; assuming that you've converted the source from what one
Pascal compiler wants to what another wants, the binding of external characters
to internal characters is, for the most part, automatic.
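The binding works roughly like a pair of lookup tables; TeX's source calls them xord (external to internal) and xchr (internal to external). The following is a minimal Python sketch of the idea, not TeX's actual Pascal/WEB code; the identity mapping for printable ASCII is an assumption representing a typical ASCII installation:

```python
# Sketch of TeX's xchr/xord translation tables (names follow tex.web,
# but this is an illustration, not the real code). An installation
# fills in these tables for its own external character set; here we
# assume a plain ASCII site, so the mapping is the identity.

# xchr: internal code -> external character
xchr = {code: chr(code) for code in range(32, 127)}
# xord: external character -> internal code (the inverse table)
xord = {ch: code for code, ch in xchr.items()}

def read_line(external_line):
    """Convert one line of an input file to TeX's internal codes."""
    return [xord[ch] for ch in external_line]

def write_line(internal_codes):
    """Convert internal codes back to the external character set."""
    return ''.join(xchr[c] for c in internal_codes)

line = r'\hbox{hello}'
assert write_line(read_line(line)) == line  # the round trip is lossless
```

An EBCDIC installation would fill in the same two tables differently, which is exactly why TeX "gives the same results on all machines."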

Different implementations of TeX will use different input character sets, but
in all cases the meanings of the various characters in the external set should
be clearly defined. Note that the character set TeX uses may not be the one the
rest of the system's software uses, but it is clearly defined nevertheless.

Since the external character set being used is clearly defined, there should be
no problem with placing this information in a content-charset header (assuming
this character set has a name, which it usually does), and there should be no
problem in automatically converting TeX source from one charset to another. The
only problem is making sure that (1) you know the character set the source is
really in, and (2) you know what character set the target wants to get. If you
know these two things, automatic conversion should work well in practice.
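To make that concrete: once both charsets are known by name, the conversion itself is just a decode/encode pair through a common internal representation. A sketch, assuming your platform has tables for the named charsets (Python's codec name 'cp037' for one common EBCDIC variant is used purely for illustration):

```python
# Sketch: automatic conversion of TeX source between two *named*
# character sets. The hard part is knowing the names; the mechanics
# are a table lookup in each direction.

def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode text from one named charset to another."""
    return data.decode(source).encode(target)

ascii_tex = rb'\documentstyle{article}'
# Convert to one EBCDIC variant and back; nothing is lost because
# every character in the source exists in both sets.
ebcdic_tex = convert(ascii_tex, 'ascii', 'cp037')
assert convert(ebcdic_tex, 'cp037', 'ascii') == ascii_tex
```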

Since the TeX input language makes heavy use of characters that are not part of
the minimal invariant set of ASCII/EBCDIC characters (backslashes and curly
braces are essential in TeX source), ASCII<-->EBCDIC conversion of TeX source
is actually something of a problem since some of the common TeX characters do
move around from one character set to another. I've spent a fair amount of time
switching one character for another in TeX input sources. If I had known the
character sets being used I could have automated this process completely.
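The damage is easy to demonstrate with two real EBCDIC variants. Code pages 037 and 500 (chosen here as examples; both ship with Python's codecs) disagree on where the square brackets live, which is precisely the kind of character shuffling described above:

```python
# The same byte means different things in different EBCDIC variants.
# Code pages 037 (US/Canada) and 500 (International) place the square
# brackets at different code points.

assert '['.encode('cp500') == b'\x4a'   # cp500 puts '[' at 0x4A
assert '['.encode('cp037') == b'\xba'   # cp037 puts '[' at 0xBA

# A file written on a cp500 host, read assuming cp037, loses its
# brackets; byte 0x4A decodes to something else entirely.
assert b'\x4a'.decode('cp037') != '['
```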

For that matter, it should be possible to take a copy of TeX plus Keld's
tables of character sets and build a version where the character set is
specified on the command line. A scary thought.

  Basically I would like to see a description of a gateway which
  would use the Content-charset header to do translations + an example
  of a non-text Content-type with enough detail to convince me that
  this would actually work.

How about this -- I run a mail server on both the Internet and on BITNET that
provides access to a huge library of TeX sources. I've received a fair number
of complaints about curly braces and other characters getting seriously toasted
by the server. Of course, I just use the standard US-ASCII codes for these
things, but I use a fixed mapping of ASCII-->EBCDIC, so things can get pretty
ugly if you use some other EBCDIC variant that does not match this mapping. (I
have a similar problem with square brackets in input files, since the file
specifications on the gateway use them, and some EBCDIC hosts cannot produce
the right characters for them.)

This all changes with a content-charset header. First of all, I'd let sites
specify what character set they want the files converted to. But more
important, I'd mark the character set I use in the message header, so automatic
conversion by my gateway (which might eventually learn, via its tables, which
EBCDIC variants particular BITNET sites use) is feasible. I'd also
accept requests in different character sets, which would make it possible for
me to recognize variant forms of square brackets from other folks. (Should the
gateway care about the content-type of request messages it gets? I think not.
But it does have to care about the character set, and it would be nice if the
gateway could extract this information from messages regardless of
content-type.)
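The gateway logic this implies is small. A hypothetical sketch, assuming a Content-Charset header of the kind discussed in this thread (the function names and the two-entry charset table are illustrative only; a real gateway would carry much larger tables):

```python
# Hypothetical gateway step: trust the sender's Content-Charset header
# for the source charset, and convert the body to whatever the
# receiving site has asked for. 'ebcdic-cp-us' is the registered name
# for IBM code page 037; the mapping table here is a toy.

CODECS_BY_NAME = {'us-ascii': 'ascii', 'ebcdic-cp-us': 'cp037'}

def recode_body(headers: dict, body: bytes, target: str) -> bytes:
    """Convert a message body to the charset the receiving site wants."""
    source = headers.get('Content-Charset', 'us-ascii')
    return body.decode(CODECS_BY_NAME[source]).encode(CODECS_BY_NAME[target])

# An ASCII request containing TeX's troublesome characters, delivered
# to a site that wants EBCDIC:
out = recode_body({'Content-Charset': 'us-ascii'}, b'{[]}', 'ebcdic-cp-us')
assert out.decode('cp037') == '{[]}'   # the brackets survive the trip
```

Note that this step never looks at the content-type, only at the charset, which matches the point above: the gateway needs the character set, not the content-type, to do its job.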

So, Bob, I now have presented not one but two examples where a content-charset
header is useful and the content-type is not strictly text: (1) text-plus/tex
and (2) a case where I probably should accept anything that's of type text or
text-plus regardless of the subtype.

I should close by saying that I have no idea how most other document processors
like troff or Scribe work when it comes to character set issues. They may
provide counterexamples to all this, but that does not eliminate the
utility of this approach for TeX.

                                 Ned


  • Re: 8bit transport and smart/trusted MTAs, Ned Freed, Postmaster