Re: new-ish idea on non-ascii headers

John Klensin writes:

I've never heard this before. If it is true, I think the people who came up
with this idea should get a stern talking-to for being so silly. This idea is
completely unworkable in practice.

Routing and delivery of a given message instance should depend totally on the
message envelope. The message header has nothing whatsoever to do with it.
There's nothing that says an address in the envelope has to appear in the
header. It often does not in practice. There's also nothing to prevent an
...

I think I was not clear enough here.  I don't disagree about any of 
this.  At the level of "delivery" we are normally talking about when we 
are discussing "mail handling systems", of any flavor, systems deliver 
to mailboxes, mailboxes are specified in envelopes (and, secondarily, in 
headers), and that is the end of the story.
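
A minimal Python sketch of that split, with made-up addresses and a hypothetical helper name: the delivery decision looks only at the envelope recipient list and never reads the header.

```python
from email.message import EmailMessage


def delivery_targets(envelope_rcpts, message):
    """Return the mailboxes to deliver to, consulting only the envelope."""
    # The message is deliberately ignored: header addresses play no part here.
    return list(envelope_rcpts)


msg = EmailMessage()
msg["From"] = "ned@example.org"
msg["To"] = "Some List <list@example.org>"  # the header names a list
msg.set_content("hello")

# The envelope carries the actual recipients, none of which is in the header.
envelope = ["alice@example.com", "bob@example.net"]
targets = delivery_targets(envelope, msg)
```

Here the To: header and the envelope disagree completely, and delivery is still well defined.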


Ah, but we're only concerned with delivery up to the mailbox. Past that
point anything can happen (and probably will).

But there can be systems that touch mail post-mailbox-delivery.  The 
only point is that we should not deliberately write software that 
interferes with their operation.  If it helps to write the words 
"not-mail" in front of systems in that sentence, I have no problems with 
it.


But I don't think we have... Presently people can put anything they want in the
phrases that accompany addresses. And people take advantage of this fact. We're
simply adding a recommendation of how to display these phrases, and this only
when a new header is present. That's it. Nothing else changes.

This principle is reiterated in X.400, incidentally, and any
hope of X.400 interoperability depends totally on this being true.

  I don't want to debate this issue, but it is worth noting that the 
X.400 address itself contains provision for identifying the concepts 
that we would call "mailbox" and the concepts we might call "extended 
personal name" separately.  So one could reverse this argument by 
claiming that X.400 responded to our being vague about this issue and 
its semantics by providing for incorporating that information into the 
address itself, resulting in both having the information retained in 
canonical form *and* in "reiterating the principle".


Actually, this is a very interesting point, and there's more to it than you
have written. Specifically, there are two kinds of X.400 addresses, one for
envelopes and one for messages, and they are different. The ones that
appear in the message can contain something called a free form name. Since
this does not appear in the envelope, it cannot be used for routing purposes.

The envelope, on the other hand, contains information about how delivery is/was
accomplished (including expansions of distribution lists and delivery mechanism
preferences) that cannot appear in the message header! 

There is also language surrounding free form names that basically says you
cannot use them for the purposes we've discussed here. I don't know how formal
this is (I don't have the documents handy right now) but there is definitely a
sense that this extra field was added to prevent leakage of ad-hoc information
into any other form.

It is also interesting to note that a message content may contain nothing but
free form names -- the actual addresses do not have to appear anywhere. This is
actually a significant problem in practice since there is no analogue in RFC822
except for the group-syntax with no group members. And yes, this case actually
arises in practice. I wish it did not, but it does.

It is interesting to note that at least one significant commercial 
vendor of Internet-to-X.400 gateway services picks up the "phrase" from 
an Internet address being routed into its system and uses it as 
"surname".  They apparently do not provide any other way of specifying 
"surname" in the syntax they use when translating X.400 addresses back 
into the Internet.  And, for most X.400 systems, surname is pretty 
important, not miscellaneous noise.


Yuck.

Now I think they are doing bad stuff here, and are seriously broken.  
They have been flamed, lavishly and with extensive technical discussion, 
in private.  They have not felt it necessary to respond to those 
comments.  But it is precisely this type of behavior that started me 
thinking, again (it is a cyclic disease with me), that we need to take 
address phrases a little more seriously.


Actually, this is a little more serious than you let on. If RFC1148 goes to
standards-track (and this is a planned action, I believe) they will be more 
than broken, they will be incompliant with Internet standards. This will give
affected parties the possibility of saying "either conform to the standards or
we'll disconnect you". And that, I think, will have some effect.

   And it *is* a topic with which this group should concern itself, 
if only because nothing in the protocols justifies discarding that 
information or otherwise trashing it.


I don't think it is a matter for this group, actually. We might want to liaise
with the X.400 people on this, and in fact we do this (somewhat informally, but
I see a lot of familiar faces/addresses in the X.400-related discussions I
participate in). But as long as the emerging standards in this area endorse our
understanding of things (specifically, that the phrase is simply decorative
material and not useful for other purposes) I don't see a problem. My current
reading of RFC1148bis says we have nothing to worry about.

And frankly, stuff that the authors debated is not
of interest -- if it didn't make it into the document there's no reason we
should concern ourselves with it.

  The HR discussion was raised only to indicate that this is not a new
issue that I'm raising out of sheer perversity.  A lot of things were
left out of the documents because they would have required breaking new
ground (which the WG chair wisely (IMHO) kept that group out of) or
because no clear agreement could be reached about what needed to be done
(or, in fairness, whether anything needed to be done).  But, in many of
the latter cases, there was a clear hope that, when WGs came along to
deal with the embedding context, they would take up the issues and deal
with them.


As long as we also note that the HR group (wisely) left this alone, I 
have no problem with this.

That raises two issues, on which I'd appreciate comments from the Chair. 
The first is one of agenda: if the goal of this group is "finish 
RFC-XXXX", then I favor, strongly, clearing as much of everything else 
away as possible--presumably including trying to fold in PEM and sender 
authentication, a lot of essentially transport issues, etc.--adopting
only a "try to do no harm" guideline.  If it is "fix 822", then all of
these complex issues are on the table and won't go away.  Personally,
I'd prefer to see RFC-XXXX in my lifetime, and that argues for the first
approach.


I agree very strongly with this, and I have so indicated to Nathaniel (he is
the current owner of the draft). I think we should leave RFC-XXXX
neutral on the problem of how to specify extended character set material in
headers. This means leaving out any mechanism for doing it -- anything else
is not neutral.

  The second issue is whether, given that we are trying to extend 822 
and not replace it, it is reasonable to argue for a doctrine of 
being very conservative about the assumptions we make about what is 
going on in practice.  Doing so will tend to preserve interoperability 
with systems that are now doing things that are marginally within the 
specifications but in the "strange", "bizarre", "bad idea", or 
"stretched interpretation" categories.  I'm not talking about things 
that Ned and I (and any other right-thinking person :-) ) would conclude 
are clearly banned, either in RFC-821/822 or in HR, but about things 
that are out past where the specifications stop.


I am always in favor of conservative approaches. But I think one place
where we disagree is on whether or not the use of mnemonic or quoted-printable
is conservative. I happen to think they are, and moreover that they may
enhance interoperability with broken systems rather than hinder it.

  Take my brain-damaged Internet-PrivateSystem-X.400 gateway situation.  
The specifications in RFC-821/822 stop with mail transport and mail 
delivery to a mailbox.  As I said, Ned and I don't disagree about that.


Actually, specifications can extend to cover interaction with other standard
protocols like X.400, especially since there is X.400 on the Internet already.
But this simply refutes this instance, not the general case.

But, in this particular case, what that means is:
    S>  RCPT TO: <funny-string-with-more-%-signs%mumble@gateway.domain>
    R<  250 OK I know what to do with that.
    ...
    To: Smith <funny-string-with-%-signs%mumble@gateway.domain>
This is accepted for delivery, and the gateway host effectively delivers 
it to a mailbox with semantics specified by the local-part.  That is as 
far as we [can] specify.  Now the process that starts rewriting this for 
forwarding (an explicitly invoked gateway, with all of the authority 
that goes with that) might check the header address for consistency (as 
it defines that) against the envelope address, but we can't even require 
that it do that as long as we can't detect the symptoms of certain 
misbehavior from "the outside".  It then builds an X.400 address from 
the header address, and, if that process sucks up the "phrase" and gives 
it specific X.400 definition, well... I think it is a really stupid 
idea, but I don't think it is prohibited unless we suddenly and 
retroactively adopt an "anything not explicitly permitted is prohibited" 
doctrine.  This is simply beyond the scope, not only of what 821/822 
specified, but of the areas in which they presumed to specify.
   Note that, if we adopt such a doctrine, it probably kills RFC-XXXX, 
so let's not :-).
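
For concreteness, a small Python sketch of how the To: header from the exchange above splits into a phrase and an address. It uses the standard library's email.utils.parseaddr; the variable names are only illustrative, and nothing here describes what the gateway in question really does.

```python
from email.utils import parseaddr

header_value = "Smith <funny-string-with-%-signs%mumble@gateway.domain>"

# parseaddr separates the decorative phrase from the addr-spec.
phrase, addr_spec = parseaddr(header_value)

# Only the addr-spec identifies a mailbox; split it at the last "@".
local_part, _, domain = addr_spec.rpartition("@")
```

The phrase comes out as a separate string, so a gateway that turns it into an X.400 surname is making its own choice about data that the address syntax treats as a comment-like label.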


I don't disagree with any of this, but I do disagree about its applicability
to the present proposals. We're not proposing any restrictions on existing
legal usage of phrases in addresses. We are simply proposing a new 
mechanism that can optionally be used to view these phrases, together with
a mechanism for specifying when this new usage is in effect.

Present usage within the limits of US-ASCII and RFC822 does not change in any
way. New usage has to be dealt with, but new usage is _always_ a problem for
applications that don't implement the standards. I claim that we're not
stressing existing mechanisms much at all.

One case does come up, and that's existing incompliant usage (e.g. present use
of 8 bit characters in headers) and how a compliant system will cope with it.
Any compliant system I write will convert such incompliant material into
compliant material, guessing, if necessary, in an attempt to do it right. (The
alternative is to reject the message completely, and this is certainly an
option, but not one I will use personally since I feel it is a bad idea.)
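
A minimal sketch of that kind of conversion in Python, assuming quoted-printable-style escaping is the target form. The function name is hypothetical, and the sketch only recodes bytes; guessing which character set the bytes were in is a separate step it does not attempt.

```python
def to_7bit_header(raw):
    """Recode raw header bytes as 7-bit text using =XX escapes."""
    out = []
    for b in raw:
        # Printable US-ASCII passes through, except "=" (the escape itself).
        is_safe = (32 <= b <= 126) and b != 0x3D
        out.append(chr(b) if is_safe else "=%02X" % b)
    return "".join(out)


# An 8-bit byte in a header phrase becomes an escape sequence.
converted = to_7bit_header(b"Ren\xe9 Smith")
```

Material that was already plain US-ASCII, apart from a literal "=", passes through unchanged, so existing compliant usage is not disturbed.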

This last change may in fact screw up existing usage that's outside the scope
of the specifications. To which I say: "That's just too damn bad -- you did
something that wasn't allowed, and now you're screwed -- it serves you right".
Now, I would not say this if this consequence of the change were going to be
commonplace. But it won't be -- I don't think you can back up any claim that it
will be. In fact, my experience tells me that it is much more likely that
adoption of these new measures will, on the average, increase interoperability
by avoiding marginal situations, rather than decreasing it. You
cannot satisfy everyone universally, but you can make intelligent decisions
that minimize damage. I think the present proposal does just this.

I also don't see a lot of difference between the Real- headers and the
mnemonic proposal in this regard.

Also note that this argument doesn't overlap into the "just send 
8bit" stories: RFC821 and 822 are very clear on those issues.


Well, the one interoperability problem I see does overlap this case somewhat.
I think what you're saying here is that the one interoperability problem I've
identified is a non-issue to you too.

Similarly, if we know that certain features cause lots of 
interoperability problems already, is it permitted to argue that we 
should avoid stressing those features further, or is that argument 
prohibited on the grounds that the deviant and inadequate systems should 
be fixed?  Note that it is plausible to argue that precisely those 
features that have caused problems in the past, especially if they are 
marginal, should be used *more* because that is the best way to get them 
fixed in all cases.


No, I'm arguing that the reverse is true. These new proposals may have the
effect of lessening the stress on subtle features of RFC822. I believe that
they will have precisely this effect, and I've backed up this conclusion with a
careful examination of the character set issues that are involved here.

By taking this position I don't have to give an opinion on how to deal with
stressing marginal features of RFC822. But I'll give one anyway: we should not
stress such features unless we have to, but on the other hand we should not
avoid such stress because of the existence of broken software. I think the
existence or nonexistence of broken software should not concern us. The only
time it should concern us is when we're really in the margins, e.g. when we're
asking what an SMTP server does with unrecognized commands (the standard does
not say in so many words). Then, and only then, are we justified in taking the
status quo into account (and we have done so). By contrast, the syntactic
rules in RFC822 are quite rigid and are not subject to debate, hence there is
little need to scrutinize existing practice for information on how things are
done.

What I'm looking for here is a ruling/suggestion on whether these are 
legitimate areas of inquiry, or whether they should be dropped as not 
part of the WG mandate.  Mr Chairman?


Let's determine whether we can let sleeping dogs lie before we take a
vote on waking them up...

I disagree. I think the use of mnemonic is clean, simple, and elegant.

  With the understanding that I like mnemonic, and have always liked 
mnemonic, and will probably continue to like mnemonic...
  Every time mnemonic comes up, one of two objections comes up (the 
second more in the earlier days, and the first more lately):

   (i)  Mnemonic is much better adapted to the character sets that 
reflect languages that have a relatively small repertoire of alphabetic 
or phonetic characters than it is to languages with, e.g., potentially 
unbounded collections of ideographic characters.


Very true. But this extends to existing standard practices, computer languages,
and computer systems themselves as well. It is not in and of itself a problem
with mnemonic. It is more a limitation that mnemonic inherits than
an intrinsic problem with mnemonic itself.

   (ii) However mnemonic is expanded, and no matter what character 
collections are registered, there will always be "one more character 
set" that it does not accommodate today, even if it might accommodate it 
tomorrow.  Unless we are going to tell people to not use those character 
sets (tempting, indeed), we will always need an escape mechanism that 
depends on a pairing of character set identification and recoding of the 
bit patterns (e.g., quoted-printable) to supplement a system that 
depends on a glyph registry (e.g., mnemonic).


Yes, but... Regardless of whether we use mnemonic, or quoted-printable, or
whatever, the problem of registering character sets will not go away. In order
to achieve interoperability you have to register the damn things. Otherwise I
say I'm talking in "foo" and you don't have a clue what "foo" is. Or worse, you
think my "foo" is the same as the "foo" you got from someone else yesterday,
but it isn't.

The only solution to this is registration of the character sets we allow. This
is totally independent of the encodings we use (it applies even if we don't
encode things at all). 
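
To illustrate the point, a short Python sketch in which a receiver decodes labelled text only when the label is in an agreed registry. The registry contents and the function name are assumptions made for the example, not a proposal for what should be registered.

```python
# Hypothetical registry: agreed label -> Python codec that implements it.
REGISTERED_CHARSETS = {
    "us-ascii": "ascii",
    "iso-8859-1": "latin-1",
}


def decode_labelled(raw, charset):
    """Decode raw bytes, refusing any label that is not registered."""
    codec = REGISTERED_CHARSETS.get(charset.lower())
    if codec is None:
        # An unregistered "foo" means nothing to the receiver.
        raise LookupError("unregistered character set: %r" % charset)
    return raw.decode(codec)


text = decode_labelled(b"caf\xe9", "ISO-8859-1")
```

A label of "foo" raises an error here rather than being guessed at, which is the failure mode registration is meant to make explicit.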

Does mnemonic encoding impose an additional burden? The answer is "yes it
does". Precisely, it requires the person doing the registering to assess the
meanings of the glyphs in the character set and figure which of the "standard"
mnemonic codes apply, and to invent new ones if they are needed. I would argue
that any character set where this is an intractable problem does not deserve
registration. (And I expect this to be more of a problem for western
symbol sets than it is for eastern character sets.) In fact, one of the
reasons I support mnemonic is so that I am guaranteed of something more than
a name for a given character set. Mnemonic gives me the substance of the
character set if a mnemonic representation is a requirement for registration.

Perhaps what we need is a little structuring, internal to the "phrase" 
that permits either mnemonic or quoted-printable to be used as 
appropriate.  I think the idea stinks, but the alternatives may all be 
worse.   Maybe that is the two sentence summary of my long note.


As long as you're willing to accept the notion that one or the other would be
used consistently throughout a given header, this is in fact the way things are
stated now. The advantage of an external specification is that it does not
change the meaning of any existing header.
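
A rough Python sketch of "one or the other, consistently, with the choice stated outside the phrase". The chord table and the "&" introducer are invented for illustration and are not the real mnemonic registry; the UTF-8 bytes used for the fallback are likewise only an assumption so that the example runs.

```python
# Hypothetical chords for a few characters; NOT the real mnemonic table.
MNEMONIC = {"\u00e9": "e'", "\u00e4": "a:", "\u00f8": "o/"}
INTRO = "&"  # assumed introducer character


def encode_phrase(phrase):
    """Return (scheme, encoded), using a single scheme for the whole phrase."""
    extras = [c for c in phrase if ord(c) > 127]
    if all(c in MNEMONIC for c in extras):
        # Every non-ASCII character has a chord, so mnemonic can be used.
        body = "".join(
            INTRO + INTRO if c == INTRO
            else INTRO + MNEMONIC[c] if c in MNEMONIC
            else c
            for c in phrase
        )
        return "mnemonic", body
    # Otherwise fall back to =XX escapes over the raw bytes.
    raw = phrase.encode("utf-8")
    body = "".join(
        chr(b) if 32 <= b <= 126 and b != 0x3D else "=%02X" % b
        for b in raw
    )
    return "quoted-printable", body


scheme_a, body_a = encode_phrase("Ren\u00e9")
scheme_b, body_b = encode_phrase("\u65e5\u672c")
```

The returned scheme name stands in for the external specification: it would travel in a separate header, so the phrase itself keeps its existing meaning for software that knows nothing about either encoding.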

You really should take a look at mnemonic if you haven't
done so already -- I can usually figure out the meaning of the "chords" it
uses without looking at the document. Keld did a fantastic job in this regard.

   I have.  Several times.  And I agree about the fantastic job.  But, 
being of limited imagination, my ability to figure out "chords" seems to 
decrease as the underlying character set has more and more glyphs that 
don't have obvious analogies in the writing systems derived from Greek, 
Latin, and maybe North Semitic.  I wouldn't have expected otherwise.


True enough. But I think we're agreed that it does the best possible
job within the constraints we have to live with.

Regardless
of the direction you jump, I think the one sure way you lose is with the
Real- headers or with the status quo.

   I think I agree with this analysis.  The only other case that is 
plausibly worth considering would be the "complete alternate form" 
headers, which are not part of the "hunt for Real- headers" problem.


I concur. And while I think a discussion of such a scheme might be
interesting, I believe it is outside the purview of this group, unless
we extend our timetables to a degree that I have difficulty imagining.

                                Ned