non-ASCII in headers

Well, well.  21Kb of messages late on a Friday night and early Saturday 
morning.  I had every confidence that folks would speak up when things 
got out of hand.  I can't wait until Monday :-)

Let me try to address a few issues in the hope of focusing the 
discussion which I would hope we can now have....

(1) While they have been quiet, and polite, and really continue to be, I 
think there is now a clear message that something has to be done about
non-ASCII in headers in *this* version of RFC-XXXX.  Frankly, I would
have raised the issue but for a few things: I've been preoccupied with some
transport issues and some things unrelated to mail; I've been hoping
that the silent minority would speak up because their articulating their
own positions is, IMHO, critical to the overall success of this effort; 
and, ultimately, I don't need non-English characters in the headers of 
very many of the messages I write.  Not zero, just not very many.

(2) I want to disagree with Patrik, Peter, and Olle about one principle. 
While I think they are showing a spirit of willingness to compromise and 
accommodate that may be unusual for this list, non-ASCII in Subject, 
Comment, and Content-type lines alone is inadequate.  I would expect
that, if we make that decision, we will be faced with having to further
extend later (probably soon), and solutions that work well for one
particular field (Subject in this case) may turn out to be very poor
engineering if generalized. People get very sensitive about how to spell
their own names.  At the risk of drawing on national stereotypes, we
have just heard, and heard vigorously, from folks who come out of a part
of Europe with a long history of being reasonable, sometimes to their
own disadvantage.  We haven't yet heard from the folks to their south
and southeast who have, with some historical justification, got other
reputations.  And I'm not being Euro-centric here: if we can't solve the
problem for Latin-based alphabets, we certainly cannot solve it for the
situations in which the issues become even more important. 
  I think one could plausibly prohibit non-ASCII in parenthesised
comments and in undocumented "add on" fields, but I think subjects and
address "phrase" fields are critical, and that the engineering should be
done now. 

(3) In case it is not clear by now, Ned, your comments were out of line
(I am certain because they were based on a misunderstanding).  There
have traditionally been three ways to represent non-ASCII Latin based
characters in computer systems.  (i) Use of national language variations
on ISO 646, (ii) Use of ISO 2022 switching and registered character
sets, (iii) Use of ISO8859-n.  The first of these is always a 7bit
solution and the second can be configured to be a 7bit solution; only
the third requires 8bits (either in transport or in a re-encoding into a
7bit form).  To our considerable advantage and good luck--because it
poses hard problems that were discussed at length months ago--the
ISO2022 approach has not been heavily used in western Europe.  Until
quite recently, the norm has been the ISO 646 NLVs. 
  Since they don't identify which character set is being used, they 
cause a lot of problems, even within a country and certainly between 
countries.  They, and not bit-stripping, are, I think, why you see odd
characters in Patrik's name (and, if I recall, in Keld's).
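
  To make the ambiguity concrete, here is a minimal Python sketch
(illustrative only, not anything proposed on the list).  It covers only
the six bracket and brace positions that the Swedish and
Danish/Norwegian variants reassign to letters; the real variants differ
in a few more positions, and the names SWEDISH, DANISH_NORWEGIAN, and
read_as are assumptions made for the example.

```python
# Same 7-bit octets, three readings.  Only the six positions
# [ \ ] { | } are mapped; this is a deliberate simplification.
SWEDISH = str.maketrans("[\\]{|}", "ÄÖÅäöå")
DANISH_NORWEGIAN = str.maketrans("[\\]{|}", "ÆØÅæøå")


def read_as(octets: bytes, table=None) -> str:
    """Interpret 7-bit octets as ASCII or, given a table, as an NLV."""
    text = octets.decode("ascii")  # every octet has the high bit off
    return text.translate(table) if table else text


wire = b"Patrik F{ltstr|m"           # what travels through the mail system

as_ascii = read_as(wire)                     # Patrik F{ltstr|m
as_swedish = read_as(wire, SWEDISH)          # Patrik Fältström
as_danish = read_as(wire, DANISH_NORWEGIAN)  # Patrik Fæltstrøm

for reading in (as_ascii, as_swedish, as_danish):
    print(reading)
```

  Nothing in the octets themselves says which reading was intended,
and no bit-stripping is needed to produce the odd-looking form.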
  There is also a very simple way to handle the identification 
difficulty with the ISO 646 NLVs, and that is to add a header field or 
two that identifies the character set in use, typically by its 
registration number.  There is a lot of experience with that approach, 
and we have seen it in every message Keld has posted for the last 10 
months.  And, contrary to Patrik's admission of guilt, one cannot assert 
that the "ISOC-8859-1" extension to 822 is a violation without 
describing RFC-XXXX as an incompatible revision, rather than an 
extension.  RFC-822 says that new fields can be added, and does not 
specify the ritual for adding them.  So they added one.  The field
becomes "a violation" only if my receiving UA is required to recognize 
and understand it to present the message.
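
  As a rough illustration of that approach, the sketch below assumes a
hypothetical field named X-Charset whose value is a registration label
(iso-ir-10 for the Swedish variant, iso-ir-60 for the Norwegian/Danish
one).  The field name, the helper names, and the six-character tables
are assumptions for the example, and folded header lines are not
handled.

```python
# Hypothetical labelling field; the name is an assumption for this sketch.
CHARSET_FIELD = "x-charset"

# Simplified tables keyed by registration label (six positions only).
NLV_TABLES = {
    "iso-ir-10": str.maketrans("[\\]{|}", "ÄÖÅäöå"),  # Swedish
    "iso-ir-60": str.maketrans("[\\]{|}", "ÆØÅæøå"),  # Norwegian/Danish
}


def parse_headers(raw: str) -> dict:
    """Collect 'Name: value' lines up to the first blank line."""
    headers = {}
    for line in raw.splitlines():
        if not line:
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def display_subject(raw: str, aware: bool = True) -> str:
    """Return the Subject as a UA would show it.

    A UA that does not know the extra field (aware=False) simply
    ignores it and shows the 7-bit text unchanged.
    """
    headers = parse_headers(raw)
    subject = headers.get("subject", "")
    table = NLV_TABLES.get(headers.get(CHARSET_FIELD, "").lower())
    return subject.translate(table) if aware and table else subject


message = (
    "From: someone@example.org\n"
    "X-Charset: iso-ir-10\n"
    "Subject: M|te p} fredag\n"
    "\n"
    "body text\n"
)
print(display_subject(message))               # label understood
print(display_subject(message, aware=False))  # label ignored
```

  The second call is the point of the argument: a UA that has never
heard of the field still presents the message, just less prettily.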
  Independent of the use of unstandardized headers, is the use of
national variations of ISO 646 (other than ASCII) invalid under RFC822? 
Well, I think so.  As everyone who has followed these lists knows to the
point of exhaustion (;-) ),  I've historically tended to read 821 and
822 narrowly, and national variations on ISO 646 are not ASCII.  But
this is dangerous ground.  If one focuses only on the
religious/philosophical issues of retroactively reading a standard to
permit something that it (if read narrowly) previously explicitly
disallowed, then virtually any argument that permits RFC-XXXX (which
uses 7bit sequences (octets with the high-order bit off) that are not
intended to be interpreted as ASCII characters) permits ISO 646 NLVs.
And, conversely, any line of reasoning that bans one bans the other. 
  To put this in a more obnoxious way, if one is going to ban 646 NLVs 
but permit RFC-XXXX, one is walking dangerously close to a position that 
might be familiarly described as "we've improved the functionality 
definition around here and, if it gets you into trouble, that is your 
problem to fix".

(4) Having suggested that looking at "Subject" alone is not enough 
functionally and that, moreover, it will lead to bad engineering and the 
potential need to un-do what has been done, let me risk getting lynched 
and suggest that there is a possibility that RFC-XXXX itself is subject 
to a "bad engineering" criticism.  It is, I think, a masterful job of 
drawing together a framework for dealing with a lot of interesting 
problems.  But part (most?) of the original charge was to deal with
messages in international characters.  Without the headers, I suggest
that it is now a proposal that contains support for transporting
documents that contain international characters, but that is a little 
bit different, and not what was asked for.
  More important, the "how does one handle non-ASCII in the headers"
issue was raised and discussed as one of the hard (and critical)
problems many months ago, even before we split the list.  Certain of us
even argued that one of the major reasons for requiring [transport]
envelope changes was that, by specifying and negotiating a character set
*there* and a mandatory set of semantics to go with it, the
non-ASCII-in-headers problems could be dealt with cleanly and fairly
elegantly.  But the opinion and consensus was that all of this could be 
handled by 822 extensions alone, leading ultimately to RFC-XXXX.  So now 
we discard the one problem that was identified originally as the likely
sticking point... :-(
  When this sort of thing happens around my department, the story goes 
"This is a lovely design, the only problem is that it [the building] 
can probably not be built and, if it could be, it will fall down."  "Ah,
that is true.  But look what a nice job I did on all of the other design
criteria." 

(5)  Part of the other reason for trying to solve this problem now is 
that I fear that it may be the soft underbelly of RFC-XXXX, that 
studying it may lead to other changes in the model.  I am not convinced 
that it is really a problem that is isolated from the rest, even though 
one can (and we have) successfully ignore it and solve only the rest.

How do we get there from here?  Well, first of all, someone needs to sit 
down and study 822 and XXXX, one header type and field at a time.  For
each one, there is a decision whether it must be kept in something that
can clearly be mapped onto ASCII, whether it is a candidate for enhanced
character set treatment, and whether enhanced character set treatment is
necessary.  IMHO, fields not specified in 822 or XXXX can be ignored: if 
someone cares enough about them, let them write an RFC that describes 
what they are and whether enhanced character treatment applies and push 
it through the standards process.  Otherwise, I think we can safely 
consider them noise.  ASCII noise, but noise.  An assumption of 
ASCII-ness in unspecified headers is, however, an assertion that needs 
to make it into RFC-XXXX if there is going to be any provision for 
non-ASCII headers or fields.

Then we go back and look at the proposals again, with the understanding 
that none of them provides a perfect solution but that decent 
engineering should permit us to select "least bad", if not better than 
that.  We have had a tradition of not requiring that headers come in any 
particular order (other than the MTA-inserted "trace" materials at the 
beginning), and that header fields don't impact each other's semantics, 
but only, at most, the semantics of the message body.  I think those 
traditions are very handy in terms of processing and certainly preserve 
a cleaner layering than having a lot of intertwined and mutually 
interacting stuff.

To my recollection, the following options are on the table or have been 
on the table recently:

(i)         Text-Header-Field-Type
            Text-Header-Field-Transfer-Encoding

    which are directly parallel to Content-Type and Content-Transfer-
    Encoding respectively, with the restriction that the only permitted
    values for the former are Text/* and Text-Plus/* (and X-*).  (This idea
    is not mine, I'm just supporting and forwarding it.)

(From Peter's notes).
  Strengths: Someone clearly wants this one. Provides nice symmetry with 
the RFC-XXXX body approach.
  Weaknesses: Requires enumerating the headers and fields to which it
applies (most of these will). Creates an interleaving and
interdependency among headers. 

(ii) Use mnemonic in selected locations.
  Strengths:  Clean and elegant system.  Probably no inter-header 
dependencies.
  Weaknesses: Continued debate about Euro-bias; Keld is working on Asian 
character sets, but we won't be able to evaluate the success of that 
effort until he is finished.  Continued debate about whether this can 
cover "all characters". Requires enumerating the headers and fields to
which it applies. 

(iii) Use quoted-printable in selected locations.
  Strengths: Known to handle all possible character sets as long as you 
know what they are.
  Weaknesses:  Requires enumerating the headers and fields to which it 
applies.  Tends to look like an unsightly mess for languages outside the 
extended Roman alphabet group.  May require a separate header to 
identify the character set being referenced, causing the interdependency 
problem.

(iv) Provide for an extension to mnemonic to escape to quoted-printable 
with character set designation when characters for which no mnemonics
are known are encountered.   Then use this in selected locations.
  Strengths: General confidence that this can handle anything.  Probably 
the escape mechanism will never be used.  No inter-header dependencies.
  Weaknesses: Horrible kludge.  The escape basically re-invents ISO2022 
switching.  Most other criticisms of mnemonic apply.

(v) proto-10646 AUC throughout the headers, with ASCII subset
restrictions on some information. 
  Strengths:  Starts the move toward 10646 with all of the advantages of 
that level of standardization.  At least in an 8bit environment, has
sufficiently clean ASCII (and 8859-1) subsets that one could be more 
relaxed about where it was put and where it was not.  No inter-header 
dependencies.
  Weaknesses: Rules for use in 7bit environment not yet worked out.  
Could possibly be very confusing for software (as well as people) when 
not expected (e.g., in pre-XXXX UAs and gateways).  Using it now 
requires anticipating standards.  Waiting might hold up RFC-XXXX 
deployment.  In pathological cases, header lines might get quite long, 
especially after 8th bit escaping (I think that, in theory, one could
end up with 10 octets per character in the worst case). 

(vi) Subject: and Real-Subject:; From: and Real-From:
  Strengths: Does the job and provides both ASCII-ized and correct 
forms.  May not require enumerating the headers to which the model is to 
be applied if everything with semantics is kept in the not-Real part.
  Weaknesses: Inter-header dependencies in terms of linkages if not 
interacting semantics.  Probably fairly ugly.  A lot of such fields 
probably required.  Might interfere with human (and other out-of-band) 
rerouting of messages using address phrases.

(vii) Subject: and XXXX-Subject:; From: and XXXX-From:, with 
XXXX-compliant UAs using only the XXXX-headers and 822 UAs ignoring 
them and using their own.
   Strengths: Almost no inter-header dependencies.
   Weaknesses:  Means either duplicating a lot of headers (and requiring 
that) or a potentially-complex set of rules about defaults in mixed 
cases.  Probably fairly ugly.   XXXX-UAs may have to scan all headers to 
determine whether they are to act in "XXXX-mode" or in "822-mode" wrt a 
particular message.  See the discussion labeled "Also:" under number (x).

(viii) Change the SMTP envelope to specify that enhanced headers are in 
use and which character set is being used with them.
   Strengths: No inter-header dependencies.  Opens up a much wider range 
of options and solutions.  May permit avoiding enumerating headers and 
fields to which it applies, although some (e.g., mailboxes) would still
have to be required to be in characters easily mapped onto ASCII. 
   Weaknesses:  Requires changing MTAs, which seems to be anathema to 
many people on this list, including those who are quite willing to force 
the same MTAs into complex message analysis.
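
Option (iii) above is easy to try out.  The sketch below is a simplified
quoted-printable encoder for header text (no line folding, and it also
encodes spaces), written only to compare how a Latin-alphabet name and a
Greek word look on the wire.  The function name qp_encode and the sample
strings are illustrative assumptions.

```python
import quopri


def qp_encode(octets: bytes) -> str:
    """Encode everything except printable ASCII (and '=') as =XX."""
    return "".join(
        chr(b) if 33 <= b <= 126 and b != 0x3D else f"={b:02X}"
        for b in octets
    )


latin = qp_encode("Fältström".encode("iso8859_1"))
greek = qp_encode("Καλημέρα".encode("iso8859_7"))

print(latin)  # F=E4ltstr=F6m
print(greek)  # every character becomes an =XX triple

# The encoded form does not say which character set it came from.
# Decoding the same octets under the wrong set yields different text.
octets = quopri.decodestring(greek.encode("ascii"))
right = octets.decode("iso8859_7")
wrong = octets.decode("iso8859_1")
print(right, wrong)
```

The Latin name stays mostly readable; the Greek word does not, and
without a character set label the receiver cannot tell which decoding
is the right one.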


There have also been several other proposals that haven't been heard 
from lately.  In most cases, I can't remember why they disappeared, and 
suspect it was just exhaustion.  The two most interesting ones, in my 
opinion....

(ix) Inter-header symbol references, in which the symbols in some fields 
caused a symbol lookup in another header to get their "real" meaning.
   Strengths: Reasonable separation of information with both forms 
readable.  Much less noise than XXXX- and Real-.
   Weaknesses: Inter-header dependencies.  Questions about how the use 
of this feature would be detected.

(x) A header body part.
  This proposal has floated to the surface several times in several 
forms but, to my recollection and that of a few others, has never been 
discussed on the list.  In outline, it is a comprehensive variation on 
the Real-  or XXXX- proposals in which no non-ASCII characters appear in 
the primary header.  Content-type is extended to provide that, whatever 
the rest of the message looks like, there is an introductory body part 
to be handled in a special way.  That introductory body part contains 
all of the 822 headers, possibly without the primary-header XXXX 
information and/or without the transport-induced trace information.  It 
would have available to it all of the encoding and character set 
designation machinery of RFC-XXXX.  822-UAs would see it as part of the 
message text and ignore it; XXXX-UAs would observe its presence in the 
primary headers but, for purposes of presentation to the user, would 
ignore the primary headers entirely.
   Strengths: Good information layering with the obvious advantages of 
an inner envelope.  No modifications to header fields specified in 822, 
since the only impact of this is on new RFC-XXXX fields.  In many ways, 
much more elegant than the solutions that try to pour the new wine of 
non-ASCII characters into the old bottles of RFC-822-specified fields.
   Weaknesses: Requires duplicating information (all header
information).  Also requires changing the structure of Content-type in
RFC-XXXX and probably consequent delay, which is a reason the whole
header problem should have been dealt with earlier.   Makes most
messages lexically multipart. 
   Also: This inner envelope might be invisible to header-munging agents 
(such as pre-XXXX gateways), implying that they would not be able to 
make address rewritings in it.  Matching original names and addresses 
here to names and addresses in the 822/XXXX header might pose a
challenge to UAs trying to process "reply" commands.
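
   The mode-selection concern raised under option (vii) can be sketched
as follows.  This assumes one possible default rule for mixed cases (use
the XXXX- field when present, otherwise fall back to the 822 field); the
rule, the function name effective_headers, and the sample values are
assumptions, and XXXX- is the same placeholder used in the proposal.
The enhanced values are shown already decoded for readability.

```python
PREFIX = "xxxx-"  # placeholder prefix, as in the proposal text


def effective_headers(headers, xxxx_aware: bool) -> dict:
    """Pick the fields a UA would present.

    headers is a list of (name, value) pairs in any order.  The whole
    list has to be scanned before the UA knows which mode applies.
    """
    plain, enhanced = {}, {}
    for name, value in headers:
        key = name.lower()
        if key.startswith(PREFIX):
            enhanced[key[len(PREFIX):]] = value
        else:
            plain[key] = value

    if not xxxx_aware or not enhanced:
        return plain  # "822-mode": XXXX- fields are ignored

    # "XXXX-mode" with an assumed default: an 822 field is used only
    # when it has no XXXX- twin.
    merged = dict(plain)
    merged.update(enhanced)
    return merged


sample = [
    ("Subject", "Mote pa fredag"),
    ("Date", "Sat, 1 Jan 2000 10:00:00 +0000"),
    ("XXXX-Subject", "Möte på fredag"),
]
print(effective_headers(sample, xxxx_aware=True)["subject"])
print(effective_headers(sample, xxxx_aware=False)["subject"])
```

   Even this toy version needs a rule for fields that appear in only
one form, which is where the "potentially-complex set of rules about
defaults" comes from.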

------------------
So, can we now start going through those options (and any others that 
I've missed) and try to make some reasoned and thoughtful decisions?  I 
believe that the messages of last night and early this morning have 
written "SHOW STOPPER" next to an RFC-XXXX that does not contain support 
for non-ASCII characters in the Subject (and maybe Comments and 
Content-description) header lines.  I'd like to add SHOW STOPPER next to
any proposed solution that does not deal with the general problem, and,
in particular, does not deal with personal names (phrases in addresses)
for two reasons: 
  -- It won't solve the problem (and I predict that it will be only a
matter of days before someone who is directly impacted says that).
  -- I think that, from an engineering standpoint, Subject-line-only (or 
other "harmless whole header line") solutions are likely to have to be
backed out when general solutions are proposed and the "backing out"
variety of incompatible change is one of the worst things we can do to
ourselves. 

   john
------------
Thought and trivia question for the day:  Some popular operating systems 
have a native mode in which typing the sequence char1-backspace-char2
results in storage of char2 overstruck on char1, rather than the 
deletion of char1 or some strange cursor movement (e.g., to beginning of 
line or top left corner of screen).  What is the origin of this practice 
in operating systems and why?  Hint: the answer does not lie in the 
normal behavior of old teletypes or flexowriters.