Re: New-ish idea on non-ascii headers

Craig Everhart writes:

About the mnemonic-in-headers idea as the way to get non-USASCII text
into mail headers.  I want to expand on Nathaniel's problem and remind
the audience of another one.  Also, Robert Ullmann's suggestion deserves
a comment.

 Sure, Nathaniel's example was unquoted, and thus was a straw-man.  Too
bad.  Perhaps his problem could have been re-phrased something like:
Lots of systems out there will break when exposed to the morass of
quoted characters that this proposal will engender.


You have yet to demonstrate that this proposal will engender increased use
of quoting. A case can be made that it will not. I have made such a case
elsewhere.

Even if you disagree about mnemonic, you cannot possibly make this case
stick with quoted-printable. Quoted-printable will decrease the use of
special RFC822 quoting unconditionally, and if it is adopted I plan to
use it just for this reason!

That's a different sort of problem, indeed.  Some of us try to be careful
in our RFC 822 lexical scanners and quoters, and RFC 822 is pretty rigorous
in how correct quoters and scanners will behave, but that doesn't mean that
correct quoters and scanners will behave, but that doesn't mean that
those rules are followed rigorously.  If I can be forgiven the raising
of some hackles (Mark Crispin's, I think) a little, I'll remind folks of
the problems that MM had on CMU Tops-20 systems.  We guys on
andrew.cmu.edu started using plus signs in mail addresses (legitimate by
RFC 822), but MM treated such addresses as obviously invalid.
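
To make the "legitimate by RFC 822" point concrete, here is a minimal sketch of the RFC 822 atom rule (any ASCII character except specials, SPACE, and CTLs). The function names are illustrative only, and only the unquoted form of local-part is checked; quoted-strings are not handled.

```python
# RFC 822 "specials": characters that may not appear in an atom.
SPECIALS = set('()<>@,;:\\".[]')

def is_atom(text: str) -> bool:
    """Return True if text is one or more CHARs that are not specials, SPACE, or CTLs."""
    if not text:
        return False
    for ch in text:
        code = ord(ch)
        # Reject non-ASCII, controls (0-31, 127), SPACE (32), and specials.
        if code > 127 or code <= 32 or code == 127 or ch in SPECIALS:
            return False
    return True

def is_unquoted_local_part(local: str) -> bool:
    """local-part = word *("." word); this sketch accepts atoms only."""
    return all(is_atom(word) for word in local.split("."))

print(is_unquoted_local_part("user+folder"))  # plus sign is not a special
```

Under this reading a plus sign is an ordinary atom character, so a scanner that rejects it is stricter than the grammar requires.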


Yeah, but did this stop you from continuing to use plus signs? It did
not. Instead you insisted that people fix their software to comply with 
the standard. I have no problem with this. I do have a problem with defining
the present standard as broken (see below) but we're not doing that here.

What do you put in From and what goes in Real-From?  The From address
can be stripped down to the mailbox name, and the Real-From can be any
user-oriented decoration at all.  Thus, I'd express mail from somebody
with mailbox he(_at_)idt(_dot_)unit(_dot_)no and real-name (in mnemonic)
``H&XFard Eidnes'' as the combination (forgive any misunderstanding of the
details)
      From: he(_at_)idt(_dot_)unit(_dot_)no
      Real-From: US-ASCII/mnemonic: H&XFard Eidnes
instead of
      From: H&XFard Eidnes <he(_at_)idt(_dot_)unit(_dot_)no>


My experience indicates that this is a very bad idea. Many RFC822 parsers
I've encountered are properly compliant with RFC822, but do not include the
mandatory extensions made in RFC1123. Thus, you cannot blindly strip phrases
from routes without getting into trouble with RFC822-compliant parsers. In
fact, most of my software manufactures leading phrases (it just quotes the
local-part) to avoid this problem; I've seen it so often.
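
The workaround of manufacturing a leading phrase can be sketched roughly as follows. This is an illustration only, not the actual code in question: the function name and the example address are made up, and it assumes the phrase is simply the local-part wrapped as an RFC 822 quoted-string. The background, as I read RFC 1123 section 5.2.15, is that RFC 822 defines mailbox as addr-spec or phrase route-addr, and only RFC 1123 makes the phrase optional.

```python
def with_leading_phrase(addr_spec: str) -> str:
    """Build 'phrase <addr-spec>' using the quoted local-part as the phrase."""
    local, sep, domain = addr_spec.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an addr-spec: {addr_spec!r}")
    # Escape backslashes and double quotes so the phrase is a valid quoted-string.
    escaped = local.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}" <{addr_spec}>'

print(with_leading_phrase("user@example.org"))
```

A strict RFC 822 parser that rejects a bare `<user@example.org>` would still accept the result, since a phrase is always present.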

This of course does not mean that you cannot come up with something that will
work. You can. But it needs to be specified -- this is a lack of rigor in the
Real- header proposal that needs to be addressed.

A severe additional problem with the mnemonic proposal (besides its
western-centricity) is this.


See Keld's comments. Mnemonic can be made to support as many character sets as
you like. Mnemonic is therefore not western-centric, and it is definitely less
western-centric than are many other things (e.g. the use of English in header
tags, restrictions on the local-part of addresses, etc.) that we don't plan to
fix.

Given that the special header appears that
identifies key text strings as being in mnemonic encoding, what fields
and sub-fields are subject to this encoding (and therefore decoding)? 
Clearly, lines like Subject:/Comments: are to be decoded.  And the
intent is that the ``mailbox'' RFC 822 type is to receive special
treatment:

     mailbox     =  addr-spec                    ; simple address
                 /  phrase route-addr            ; name & addr-spec

I'd guess that the ``phrase'' should be decoded, but in addr-spec, the
``local-part'' should be left alone.  How about comments?  Are they to
be decoded in From: lines?  How about in other lines, like Received: or
Date:?  How about all the extension fields--are they to be decoded?  How
about the Received: header: its optional ``for'' clause contains an
addr-spec.  Is that to be decoded?  What about the message massagers
along the way that add Received: lines and possibly manipulate other
headers?  What should they do about interpreting encoded text, or making
sure that the text they generate is encoded if that's indicated?
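
One possible reading of the guess above (decode the phrase, leave the addr-spec alone) can be sketched as below. This is an assumption for illustration, not a rule taken from any of the proposals: `decode_mailbox` is a made-up name, the decoder is supplied by the caller because no mnemonic decoding table is given here, and `example.org` is a placeholder domain. Comments, Received: clauses, and extension fields are deliberately left out, since those are exactly the open questions.

```python
from email.utils import parseaddr
from typing import Callable, Tuple

def decode_mailbox(value: str, decode_text: Callable[[str], str]) -> Tuple[str, str]:
    """Split a single mailbox and decode only its phrase.

    The addr-spec (local-part and domain) is returned untouched.
    """
    phrase, addr_spec = parseaddr(value)
    return decode_text(phrase), addr_spec

# str.upper stands in for a real decoder, just to show which part changes.
print(decode_mailbox("H&XFard Eidnes <he@example.org>", str.upper))
```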


I've already addressed this in previous messages -- please read them.

Now, I can make guesses about these questions as well as the next guy,
but implementors can't be allowed to guess.  The problem must admit
these decisions as part of its specification, and I'm not convinced that
the problem will admit any such solution.


Of course you cannot simply guess, and such rigor is to be expected of
a standard. But such rigor has in fact been proposed -- you're simply
ignoring the messages that proposed it, and then claiming that this
problem has not been addressed. It has been. There are more open issues
with Real- headers than there are with the use of mnemonic.

About Robert Ullmann's observation, which I think is that 8-bit
characters can be encoded via RFC 822 simply by prefixing them with the
appropriate quote character.  My copy of RFC 822 says that the
characters that can appear in header fields must fundamentally be
subsets of its CHAR class, which is listed as:
     CHAR        =  <any ASCII character>        ; (  0-177,  0.-127.)
As far as I can tell, this excludes the possibility of putting 8-bit
characters in verbatim.  The 7-bit restriction isn't just an SMTP issue;
rather, I have to read this line as applying it to the headers (and
bodies) of RFC 822 messages.


I agree with you completely on this. An RFC822 parser might elect to strip high
bits or perhaps reject messages that contain such material -- it would be
completely within its rights in doing so. This is independent of any RFC821
requirement.
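
The two behaviors mentioned above, stripping high bits or rejecting the message, might look like this in a parser. This is a sketch under the assumption that headers arrive as raw bytes; the function names and the `policy` parameter are hypothetical.

```python
def is_seven_bit(raw: bytes) -> bool:
    """True if every octet falls in RFC 822's CHAR range (0-127)."""
    return all(octet < 128 for octet in raw)

def strip_high_bits(raw: bytes) -> bytes:
    """Clear the eighth bit of every octet."""
    return bytes(octet & 0x7F for octet in raw)

def accept_header(raw: bytes, policy: str = "reject") -> bytes:
    """Apply one of the two policies to a raw header line."""
    if is_seven_bit(raw):
        return raw
    if policy == "strip":
        return strip_high_bits(raw)
    raise ValueError("header contains octets outside the 7-bit CHAR range")

print(accept_header(b"Subject: H\xe5vard", policy="strip"))
```

Stripping turns the octet 0xE5 into 0x65, so the text is silently altered; that is one reason 8-bit data cannot simply be passed through and expected to survive.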

Either Mr. Ullmann made a mistake here, or this is another provocative and
contentious assertion that the 7-bit issues can go away just by pretending
that they don't exist, and declaring all non-8-bit mail handlers broken.


This has always boiled down to an axiomatic point. The question is, "can we
declare existing software broken?" I don't think we can, and therefore I don't
think we need to entertain discussion of "should we do it?"

It would be nice if we had clearer guidance on this. The little I have seen
from "higher up" has indicated that we CANNOT declare existing software as
broken. However, it would be nice if there were a final decision on this,
once and for all.

It would also be nice if Robert would call a spade a spade and not try to sneak
such changes by under the cloak of another proposal. I'm not claiming you did
this intentionally, Robert, but that's how it came across to me.

                                Ned