To: ietf-822(_at_)dimacs(_dot_)rutgers(_dot_)edu
Subject: Re: SPEAK NOW OR HOLD YOUR PEACE (Was Re: Yet another
proposal for non-ASCII chars in headers)
Date: Fri, 25 Oct 91 14:25:33 +1000
From: Bob Smart <smart(_at_)mel(_dot_)dit(_dot_)csiro(_dot_)au>
[ regarding my encoded-word proposal for non-ASCII charsets in headers ]
Firstly I think it should be a separate RFC. We want it to apply to all
messages, not just to rfc-xxxx messages with a Content-type header. The
reason is this: you send an rfc-xxxx message to someone with an old mail
reader/writer. They hit reply. We want the things in the headers that were
in an alternate character set and which get transferred by the reply program
to the new headers to become visible in their original form again when they
get to new mail-readers.
I too want the encoded-word stuff to apply to all messages, but I don't
think that in itself warrants making my proposal a separate RFC. There
is an advantage to having all of the message format extensions documented
in one place. Either way - if it allows us to cross the "header character
set" problem off of our list of outstanding issues, then I'm for it.
That being the case it might be nice to separate out the common character
set and encoding stuff into a separate RFC: "Character Sets and Encodings
for Internet Mail Messages". So there would be 3 RFCs. Alternatively
Keith's RFC can reference rfc-xxxx for this stuff. But the latter will
be a bit strange because there will be aspects which are not used in rfc-xxxx
(namely the short [single character] names for encodings and character
sets, and the _ in quoted-printable). The one character "names" for
character sets and encodings are an essential feature of Keith's proposal
in interoperating with existing software without embedding the real info
in very large amounts of junk.
I'd hate to see an extra RFC just for the encodings. If my proposal
remains a separate document it should be sufficient to reference
RFC XXXX.
(2) Numbering things is fine but I want the ability to name them as well.
The example "numbers" in Keith's proposal should be "small number
[normally one] of letters or digits". We should grab "M" for mnemonic.
I thought about "M" for "mnemonic" but I saw an advantage in being able
to easily distinguish numeric "aliases" from charset names (which presumably
don't begin with a digit). Mnemonic could have a number assigned ("10"?
"99"? "00"?).
I have to wonder if Q-encoded mnemonic text will retain enough readability
to make it worthwhile, especially in the context of a "phrase" preceding an
address, where the set of available special characters is so small. Perhaps
someone (Keld?) will experiment with this and let us know the results.
We also need to add "_" to quoted-printable encoding, with "=_" as an allowed
escape.
Not only should the Q encoding be made the same as quoted-printable,
but the B encoding should be the same as base64. The only difference for
the latter was the removal of "," from B. However this doesn't solve
the whole problem of vertical motion: The correct answer is to prohibit
vertical motion and then the ban on "," follows without requiring a
separate encoding. You can lift text from the mnemonic proposal on this
(or any other) matter: it will save it going to waste.
The "," in base64 does not necessarily indicate vertical motion; it
indicates end-of-line or end-of-record. The B encoding doesn't include
comma because (a) it wouldn't be legal in some contexts, and (b) I saw
no need to provide an end-of-record mechanism in message header text.
The "," in base64 can appear in body parts for which "end-of-line"
or "end-of-record" has no meaning. That doesn't make the message
syntactically invalid. On the other hand, a B-encoded-word that
appears in a "phrase" can never be allowed to contain a "," because
it would violate RFC 822 syntax rules.
While I appreciate the simplicity of having Q == quoted-printable and
B == base64, they were meant to address different problems. We shouldn't
combine the two solutions if doing so severely compromises either one.
Finally, Keith's proposal makes [cq ]text look more horrible than is
necessary. I think space should be allowed to stand for itself in
[cq ]text. It isn't hard to search past one word looking for the 4th "?".
With this change it makes sense to allow 7bit encoding as well [code "7",
usable when no "=" or "?" in text]. This change is only necessary
for improved interoperability with existing mail readers, but that is
likely to be an issue for quite a while.
While I agree that the result sometimes looks horrible, and could
probably be improved, I object to allowing spaces within encoded-words.
As currently defined, encoded-words are self-contained and easily
distinguished from non-encoded portions of the message header. If you allow
spaces within an encoded-word, it looks like several separate tokens to an
RFC 822 parser, and a single token to a parser that knows about
encoded-words. Danger!
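To illustrate the tokenization hazard, here is a minimal sketch in Python. It
assumes an encoded-word of the shape =?charset?encoding?text?= (the draft's
exact syntax may differ), and the names ENCODED_WORD, naive_tokens, and
is_encoded_word are hypothetical helpers, not part of any proposal. The naive
tokenizer splits on whitespace only, standing in for an RFC 822 parser that
knows nothing about encoded-words.

```python
import re

# Assumed shape of an encoded-word: =?charset?encoding?text?=
# No whitespace and no "?" is permitted inside any of the three fields.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]*\?=")


def naive_tokens(header_value):
    """Split on whitespace, as a parser unaware of encoded-words might."""
    return header_value.split()


def is_encoded_word(token):
    """True if the whole token has the assumed encoded-word shape."""
    return ENCODED_WORD.fullmatch(token) is not None


# Space represented by "_": one self-contained token.
with_underscore = "=?ISO-8859-1?Q?Keld_J=F8rn?="
# Literal space allowed: the old parser sees two unrelated fragments.
with_space = "=?ISO-8859-1?Q?Keld J=F8rn?="

for value in (with_underscore, with_space):
    tokens = naive_tokens(value)
    print(len(tokens), [is_encoded_word(t) for t in tokens])
```

With the underscore form, the old parser and the new parser agree on where
the token begins and ends. With a literal space, the old parser sees two
fragments, neither of which is recognizable as an encoded-word on its own.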
The "_" hack in the Q encoding was invented for this very reason. I'm
not sure it belongs in quoted-printable.
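As a rough sketch of the "_" convention, the following Python shows one way a
Q-style encoder and decoder could treat spaces. This is an illustration under
assumptions, not the text of the proposal: the function names q_encode and
q_decode are hypothetical, only ASCII letters and digits are passed through
unencoded (a deliberately conservative choice), and a literal "_" is written
as =5F so that decoding stays unambiguous.

```python
def q_encode(text, charset="iso-8859-1"):
    """Encode text so the result contains no spaces (assumed Q-style rules)."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch == " ":
            out.append("_")              # space is written as underscore
        elif ch.isascii() and ch.isalnum():
            out.append(ch)               # ASCII letters and digits pass through
        else:
            out.append("=%02X" % byte)   # everything else becomes =XX hex
    return "".join(out)


def q_decode(encoded, charset="iso-8859-1"):
    """Reverse q_encode: "_" becomes space, =XX becomes the byte 0xXX."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "_":
            raw.append(0x20)
            i += 1
        elif ch == "=":
            raw.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(ch))
            i += 1
    return raw.decode(charset)


print(q_encode("Keld Jørn"))
print(q_decode(q_encode("a_b c")))
```

Because the encoded text never contains a space, it can be wrapped in the
encoded-word delimiters and still look like a single token to existing
software.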
Keith