Re: non-ASCII in headers

John Klensin writes:

(1) While they have been quiet, and polite, and really continue to be, I 
think there is now a clear message that something has to be done about
non-ASCII in headers in *this* version of RFC-XXXX.  Frankly, I would
have raised the issue but for a few things: I have been preoccupied with some
transport issues and some things unrelated to mail; I've been hoping
that the silent minority would speak up because their articulating their
own positions is, IMHO, critical to the overall success of this effort; 
and, ultimately, I don't need non-English characters in the headers of 
very many of the messages I write.  Not zero, just not very many.


I disagree. I think the only compelling reason that this material would have to
be addressed specifically in RFC-XXXX would be that it is not orthogonal to the
issues addressed by RFC-XXXX. Nobody has yet demonstrated that it is not
orthogonal. In fact, given the extensibility of RFC-XXXX there is every reason
to think it would be orthogonal.

Now, if you instead say that this is an issue the working group must deal with,
well, that I agree with. I definitely think we have to reach closure on this.
But whether it gets done in one RFC or six is not the point.

Indeed, if the issues are orthogonal, a compelling case, the case of simplicity
and organization, demands that it be addressed in a separate document. We
have done this in the past with issues we were unable to reach closure on
(X.400 bodyparts, for example).

(2) I want to disagree with Patrik, Peter, and Olle about one principle. 
While I think they are showing a spirit of willingness to compromise and 
accommodate that may be unusual for this list, non-ASCII in Subject, 
Comment, and Content-type lines alone is inadequate.  I would expect
that, if we make that decision, we will be faced with having to further
extend later (probably soon), and solutions that work well for one
particular field (Subject in this case) may turn out to be very poor
engineering if generalized. People get very sensitive about how to spell
their own names.  At the risk of drawing on national stereotypes, we
have just heard, and heard vigorously, from folks who come out of a part
of Europe with a long history of being reasonable, sometimes to their
own disadvantage.  We haven't yet heard from the folks to their south
and southeast who have, with some historical justification, got other
reputations.  And I'm not being Euro-centric here: if we can't solve the
problem for Latin-based alphabets, we certainly cannot solve it for the
situations in which the issues become even more important. 
  I think one could plausibly prohibit non-ASCII in parenthesised
comments and in undocumented "add on" fields, but I think subjects and
address "phrase" fields are critical, and that the engineering should be
done now.


Excellent. We agree on this at least. My position here is almost precisely the
same.

(3) In case it is not clear by now, Ned, your comments were out of line
(I am certain because they were based on a misunderstanding).  There
have traditionally been three ways to represent non-ASCII Latin based
characters in computer systems.  (i) Use of national language variations
on ISO 646, (ii) Use of ISO 2022 switching and registered character
sets, (iii) use of ISO8859-n.  The first of these is always a 7bit
solution and the second can be configured to be a 7bit solution; only
the third requires 8bits (either in transport or in a re-encoding into a
7bit form).  To our considerable advantage and good luck--because it
poses hard problems that were discussed at length months ago--the
ISO2022 approach has not been heavily used in western Europe.  Until
quite recently, the norm has been the ISO 646 NLVs.


I don't think I misunderstood and I don't think my comments were out of
line.

 Since they don't identify which character set is being used, they 
cause a lot of problems, even within a country and certainly between 
countries.  They, and not bit-stripping, are, I think, why you see odd
characters in Patrik's name (and, if I recall, in Keld's).


I understand this. I never thought this was a case of bit-stripping. It simply
means that the character set was not identified and properly converted for use
on my hardware. I never said or intended to say that this usage violates any
standard.
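
To make the labelling problem concrete, here is a minimal sketch of how the
same 7-bit octets read under two different ISO 646 variants. The table is
hand-built for illustration and covers only the six positions where the
Swedish variant is usually described as differing from US-ASCII; the names
SWEDISH_646 and decode_646, and the sample string, are assumptions of this
sketch and do not come from any message in this thread.

```python
from typing import Dict, Optional

# Hypothetical, hand-built table: code points where a Swedish ISO 646
# variant replaces the US-ASCII brackets and braces with national letters.
SWEDISH_646: Dict[int, str] = {
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Å",
    0x7B: "ä", 0x7C: "ö", 0x7D: "å",
}


def decode_646(octets: bytes, variant: Optional[Dict[int, str]] = None) -> str:
    """Decode 7-bit octets as US-ASCII, overriding the variant's positions."""
    table = variant or {}
    for b in octets:
        if b & 0x80:
            raise ValueError("not a 7-bit octet: 0x%02X" % b)
    return "".join(table.get(b, chr(b)) for b in octets)


# Illustrative sample only: identical octets, two different readings.
raw = b"F{ltstr|m"
as_ascii = decode_646(raw)                  # reader assumes US-ASCII
as_swedish = decode_646(raw, SWEDISH_646)   # reader knows the variant in use
```

Nothing in the octets themselves says which reading is intended, which is
why an explicit label is needed before any conversion can be done.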

  There is also a very simple way to handle the identification 
difficulty with the ISO 646 NLVs, and that is to add a header field or 
two that identifies the character set in use, typically by its 
registration number.  There is a lot of experience with that approach, 
and we have seen it in every message Keld has posted for the last 10 
months.


I understand this too, and in fact I endorse this practice. My understanding
was that you were one of the people that did not endorse this, since it leads
to a large number of character sets being permissible on the net. For this
reason and this reason only I have backed off of recommending that we
standardize some sort of header whose value could specify any of the 200-odd
character sets that Keld has itemized for us all. I don't have a problem with a
formal restriction to the use of mnemonic 8859 variants, and eventually 10646
only (coupled with whatever encoding is appropriate, of course), but the only
reason I saw for this is to ease the burden on implementors who don't want to
implement support for Keld's table in their software.

The scheme that Bob and I and others have been batting around easily extends
to cover this usage in any case. We can debate whether or not to allow
that scheme to do so later.

  And, contrary to Patrik's admission of guilt, one cannot assert 
that the "ISOC-8859-1" extension to 822 is a violation without 
describing RFC-XXXX as an incompatible revision, rather than an 
extension.  RFC-822 says that new fields can be added, and does not 
specify the ritual for adding them.  So they added one.  The field
becomes "a violation" only if my receiving UA is required to recognize 
and understand it to present the message.


What on earth do you mean here? The usage I have been talking about as being
illegal is the embedding of 8 bit characters in the phrases before routing
addresses and the use of 8 bit characters in the Subject: and Comments:
headers. RFC822 explicitly specifies that these headers are all in 7-bit ASCII.
It is in the BNF. While RFC822 explicitly allows the addition of new headers,
it does not specify a mechanism for modifying the syntax of headers it
standardizes.

There is definitely a loophole in that the character set used in an extension
or user-defined field is not specified. However, we are not talking about
those fields here. We are talking about standard RFC-822 fields, plain and
simple. And even if we were talking about extending other fields to allow
8 bit character sets, we would then have to address the transport issues for
them.
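
As a rough illustration of the distinction, the sketch below flags octets
with the high-order bit set in a header line. It assumes only that the
RFC822 grammar limits header characters to the 7-bit ASCII range; it does
not parse the header syntax or distinguish standard from extension fields.
The function name and the sample headers are hypothetical.

```python
from typing import List, Tuple


def eight_bit_octets(header_line: bytes) -> List[Tuple[int, int]]:
    """Return (offset, octet) pairs for every octet with the high bit set."""
    return [(i, b) for i, b in enumerate(header_line) if b & 0x80]


# A header that stays within 7-bit ASCII.
subject_ascii = b"Subject: non-ASCII in headers"

# The same kind of header carrying raw ISO 8859-1 octets (illustrative).
subject_8859 = "Subject: Fältström".encode("latin-1")

clean = eight_bit_octets(subject_ascii)
offending = eight_bit_octets(subject_8859)
```

An empty result means the line is at least 7-bit clean; a non-empty result
pinpoints the octets that fall outside what the RFC822 BNF allows in these
fields.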

  Independent of the use of unstandardized headers, is the use of
national variations of ISO 646 (other than ASCII) invalid under RFC822? 
Well, I think so.  As everyone who has followed these lists knows to the
point of exhaustion (;-) ),  I've historically tended to read 821 and
822 narrowly, and national variations on ISO 646 are not ASCII.  But
this is dangerous ground.


I tend to read things more liberally, but I agree this is dangerous ground. The
thing I fail to see is why we're even in/on it right now.

If one focuses only on the
religious/philosophical issues of retroactively reading a standard to
permit something that it (if read narrowly) previously explicitly
disallowed, then virtually any argument that permits RFC-XXXX (which
uses 7bit sequences (octets with the high-order bit off) that are not
intended to be interpreted as ASCII characters) permits ISO 646 NLVs.
And, conversely, any line of reasoning that bans one bans the other.


Correct, but this was never the point of any discussion I have seen on this
list.

  To put this in a more obnoxious way, if one is going to ban 646 NLVs 
but permit RFC-XXXX, one is walking dangerously close to a position that 
might be familiarly described as "we've improved the functionality 
definition around here and, if it gets you into trouble, that is your 
problem to fix".


Nobody has ever talked about banning the 646 NLVs! The only thing that has come
close to this is the desire to keep the number of character sets small. I have
lost interest in this issue since it does not matter to me. All I care about is
that we label things for what they are. Once we decide to do this I don't
really care what the set of things we allow labels for is.

The only, repeat ONLY, thing that is banned is the use of 8 bit characters! I
have only, repeat ONLY, objected to the notion that the use of 8859 does not
violate the standards! It does!

(4) Having suggested that looking at "Subject" alone is not enough 
functionally and that, moreover, it will lead to bad engineering and the 
potential need to un-do what has been done, let me risk getting lynched 
and suggest that there is a possibility that RFC-XXXX itself is subject 
to a "bad engineering" criticism.  It is, I think, a masterful job of 
drawing together a framework for dealing with a lot of interesting 
problems.  But part (most?) of the original charge was to deal with
messages in international characters.


This is not my understanding of the original charge. The original proposal put
forward that started this group was indeed the extension of SMTP to handle 8
bit characters. This was deemed interesting, but only because it was a place to
start. Right away there was discussion of RFC1154 and encapsulation
methodologies, the need for a separation of content type and encoding
information was identified, etc. Thus, while the impetus to start these
discussions was perhaps to allow the transmission of 8 bit, from the very
beginning the charge placed on the group was and is much broader. However, the
issue of whether we explicitly have to solve the header character set problem
or not was never raised, and it certainly was never considered to be the _only_
thing we had to get done.

Without the headers, I suggest
that it is now a proposal that contains support for transporting
documents that contain international characters, but that is a little 
bit different, and not what was asked for.


What was asked for originally and what this group was charged to come up
with are not the same things.

  More important, the "how does one handle non-ASCII in the headers"
issue was raised and discussed as one of the hard (and critical)
problems many months ago, even before we split the list.  Certain of us
even argued that one of the major reasons for requiring [transport]
envelope changes was that, by specifying and negotiating a character set
*there* and a mandatory set of semantics to go with it, the
non-ASCII-in-headers problems could be dealt with cleanly and fairly
elegantly.  But the opinion and consensus was that all of this could be 
handled by 822 extensions alone, leading ultimately to RFC-XXXX.  So now 
we discard the one problem that was identified originally as the likely
sticking point... :-(


I disagree with this conclusion. And even so, it seems to me that RFC-XXXX is
incontestably useful in its own right, and that whether it addresses all the
issues we have to deal with is irrelevant. The two questions
before us are:

(1) Should we address the character set in headers problem?
(2) Should we address it in RFC-XXXX?

The answer to (1) is yes, I believe, which totally negates any argument you may
make that we're ignoring the goals we were charged with. The answer to (2)
cannot be addressed by looking at the agenda. It can only be addressed by
looking at the issue of the inter-relatedness of all these proposals.

  When this sort of thing happens around my department, the story goes 
"This is a lovely design, the only problem is that it [the building] 
can probably not be built and, if it could be, it will fall down."  "Ah,
that is true.  But look what a nice job I did on all of the other design
criteria."


I think this is totally unwarranted and pretty unfair.

(5)  Part of the other reason for trying to solve this problem now is 
that I fear that it may be the soft underbelly of RFC-XXXX, that 
studying it may lead to other changes in the model.  I am not convinced 
that it is really a problem that is isolated from the rest, even though 
one can (and we have) successfully ignore it and solve only the rest.


Then convince me, because I'm not convinced that it cannot be isolated. It is definitely
true that RFC-XXXX has modified the framework on which the character set
solution must be built. But that is not the question. The question is: "Are
changes going to be needed in RFC-XXXX to deal with the solution to the
character set in headers problem?". In order to answer this with a "yes" you'll
have to put forward a proposal that's both a reasonable solution and actually
does make such changes. I have not seen such a proposal from anyone. In fact, I
only see two broad classes of proposals for solutions, and both of them are
orthogonal to RFC-XXXX.

How do we get there from here?  Well, first of all, someone needs to sit 
down and study 822 and XXXX, one header type and field at a time.  For
each one, there is a decision whether it must be kept in something that
can clearly be mapped onto ASCII, whether it is a candidate for enhanced
character set treatment, and whether enhanced character set treatment is
necessary.


I did this before I proposed anything. Granted, I only posted the conclusions I
came to, rather than the details. If you like I'll go ahead and itemize
everything.

  IMHO, fields not specified in 822 or XXXX can be ignored: if 
someone cares enough about them, let them write an RFC that describes 
what they are and whether enhanced character treatment applies and push 
it through the standards process.  Otherwise, I think we can safely 
consider them noise.  ASCII noise, but noise.  An assumption of 
ASCII-ness in unspecified headers is, however, an assertion that needs 
to make it into RFC-XXXX if there is going to be any provision for 
non-ASCII headers or fields.


Modulo the fact that we have to deal with RFC-822 changes that have
crept in via unrelated documents (RFC-1123 is the only example I know of),
I agree 100%.

Then we go back and look at the proposals again, with the understanding 
that none of them provides a perfect solution but that decent 
engineering should permit us to select "least bad", if not better than 
that.  We have had a tradition of not requiring that headers come in any 
particular order (other than the MTA-inserted "trace" materials at the 
beginning), and that header fields don't impact each other's semantics, 
only, at most, the semantics of the message body.  I think those 
traditions are very handy in terms of processing and certainly preserve 
a cleaner layering than having a lot of intertwined and mutually 
interacting stuff.


I agree, with only one minor nit to cite -- the PEM people use header order in
some cases. True, these are not RFC-822 headers, but I thought I should mention
the one exception in current practice that I'm aware of. I don't think
it concerns us here. (But insofar as we're now talking about working PEM
into things, and PEM has hence become something of an agenda item, I don't
think this issue can be ignored out of hand.)

To my recollection, the following options are on the table or have been 
on the table recently:


I think that you've elected to enumerate the proposals in more detail than is
warranted, and as a result proposals that are basically the same underneath
come out looking fairly different.

To my mind, there are really only four solutions that have been proposed in
broad terms:

(1) Select a group of pieces of existing headers and provide a mechanism for
    the specification of the character set and encoding they are represented in.

    This includes (i), (ii), (iii), and (iv).

(2) Duplicate information and cross-reference it.

    (vi), (vii), (ix), and (x).

(3) Simply change everything to use a new character set.

    (v).

(4) Do it in SMTP.

    (viii).

Of these, I believe categories (3) and (4) have been shot down in flames in the
past. In fact, I think they both went down in flames in St. Louis. I therefore
propose that we discuss in broad terms whether we want (1) or (2), and then
worry about the details. This is, in fact, precisely what I've been trying to
do in the discussions of Real- headers versus mnemonic. Instead of focusing
on the details, let's consider the limitations of each approach and see
where it leads us.

So, can we now start going through those options (and any others that 
I've missed) and try to make some reasoned and thoughtful decisions?  I 
believe that the messages of last night and early this morning have 
written "SHOW STOPPER" next to an RFC-XXXX that does not contain support 
for non-ASCII characters in the Subject (and maybe Comments and 
Content-description) header lines.


I agree that this may be a SHOW STOPPER for the group, but why can't we
get the work we've done out the door?

One of the biggest problems this group faces is the sheer repetition of settled
issues. A lot of this would disappear if we got something out and into the
pipe. This does not mean we cannot revise things somewhat after that, but it
does mean we can avoid having to reach consensus over and over again on the
same points.

I'd like to add SHOW STOPPER next to
any proposed solution that does not deal with the general problem, and,
in particular, does not deal with personal names (phrases in addresses)
for two reasons:


I have no problem with this.

                                Ned