Here are some late comments on draft-ietf-822ext-mime-part2-00.txt
in response to the "WG Last Call!!!!" of Mon Apr 19 23:25:15 1993.
Most of them are minor issues of wording.
1) The "tspecials" definition:
tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" /
<"> / "/" / "[" / "]" / "?" / "." / "="
This means that this Part 2 of the MIME standard will use a
different "tspecials" concept than Part 1. (The difference is
that "." is a part-2-tspecials but not a part-1-tspecials.)
Is this intentional? If so, it might be better to use another
term than "tspecials" here to avoid unnecessary misunderstanding.
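To make the difference concrete, here is a minimal sketch in Python. The Part 2 set is copied from the definition quoted above; the Part 1 set is only my assumption that it is the same set without ".", and the helper name is_token is hypothetical.

```python
# Part 2 "tspecials", as quoted from the draft above.
PART2_TSPECIALS = set('()<>@,;:\\"/[]?.=')

# Assumption: Part 1 uses the same set except for ".".
PART1_TSPECIALS = PART2_TSPECIALS - {"."}


def is_token(s, tspecials):
    """True if s is non-empty and contains only printable ASCII
    characters other than SPACE and the given tspecials."""
    return bool(s) and all(32 < ord(c) < 127 and c not in tspecials for c in s)


# The same string is a token under one definition but not the other.
print(is_token("iso.8859", PART1_TSPECIALS))
print(is_token("iso.8859", PART2_TSPECIALS))
```

A name containing "." would thus be accepted where the Part 1 definition applies and rejected where the Part 2 definition applies, which is the kind of misunderstanding a different term would avoid.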
2) Status of malformed encoded words:
I'm not sure if strings such as
=?ISO-8859?Q?a?= (illegal "charset")
=?ISO-8859-1?A?a?= (illegal "encoding")
=?ISO-8859-1?Q? a?= (illegal character in "encoded-text")
=?ISO-8859-1?B?-abc?= (illegal character in "encoded-text"
according to the indicated "encoding")
J=?ISO-8859-1?Q?=E4?=rnefors ("encoded-word" not separated
by linear white-space from its
surroundings)
or strings with the ABNF syntax of an "encoded-word" but longer
than 75 characters are _allowed_ in e.g. the Subject: field of
a message conforming to MIME Part 2. _If_ they are allowed,
should they be treated (a) as just a sequence of ASCII
characters, or (b) as the sequence of ASCII characters
that form the "encoded-text" part, or (c) as an "encoded-word"
with undefined meaning (that should be displayed as nothing at
all or perhaps by an error indicator)? Maybe the correct
alternative should be indicated explicitly in the text.
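The cases listed above can be told apart mechanically. The following Python sketch is only an illustration of that classification, not a proposal for the text: the list of known charsets is a made-up subset, the function name and the returned strings are hypothetical, and I assume that "encoding" is matched without regard to case.

```python
import re
import string

# Assumption: an illustrative subset of acceptable charset names.
KNOWN_CHARSETS = {"us-ascii"} | {f"iso-8859-{n}" for n in range(1, 10)}
BASE64_CHARS = set(string.ascii_letters + string.digits + "+/=")

# Outer shape only: "=?" charset "?" encoding "?" encoded-text "?="
SHAPE = re.compile(r"=\?([^?]*)\?([^?]*)\?([^?]*)\?=")


def check_encoded_word(token):
    """Classify one whitespace-delimited token from a header field."""
    m = SHAPE.fullmatch(token)
    if m is None:
        return "not delimited as an encoded-word"
    charset, encoding, text = m.groups()
    if len(token) > 75:
        return "longer than 75 characters"
    if charset.lower() not in KNOWN_CHARSETS:
        return "illegal charset"
    if encoding.upper() not in ("Q", "B"):
        return "illegal encoding"
    if any(not 33 <= ord(c) <= 126 for c in text):
        return "illegal character in encoded-text"
    if encoding.upper() == "B" and not set(text) <= BASE64_CHARS:
        return "illegal character for the B encoding"
    return "ok"


for token in ["=?ISO-8859?Q?a?=", "=?ISO-8859-1?A?a?=", "=?ISO-8859-1?Q? a?=",
              "=?ISO-8859-1?B?-abc?=", "J=?ISO-8859-1?Q?=E4?=rnefors"]:
    print(token, "->", check_encoded_word(token))
```

What a reader should then do with a token that is not "ok", i.e. which of (a), (b), or (c) applies, is exactly what the sketch cannot decide and the text should.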
It says in the Conformance section:
A mail composing program claiming compliance with this specification
MUST ensure that any string of printable ASCII characters in a "text" or
"ctext" entity within a header, or any "atom" within a "phrase", that
begins with "=?" and ends with "?=" be a valid encoded-word.
Do we really have to forbid strings such as
Why you have seen a lot of =? and ?= in email recently
which clearly doesn't contain "encoded-word"s?
3) Displaying encoded-words with unsupported charset:
If the mail reader does not support the character set used, it may
either display the encoded-word as ordinary text (i.e., as it appears in
the header), or it may substitute an appropriate message indicating that
the decoded text could not be displayed.
There are several other "fall-back" methods which a non-minimal
implementation could use, e.g. display all ASCII characters
represented by the "encoded-word", with undisplayable characters
indicated by some marker such as "?" in inverse video (and
leaving out the ugly "=?...?...?...?=" stuff). At least this
paragraph should say that other, more sophisticated methods
than the two mentioned also exist and may be used by an
implementation.
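As an illustration of such a fall-back, here is a rough Python sketch. It assumes that the unsupported charset agrees with ASCII in its lower half (as the ISO 8859 sets do), uses a plain "?" in place of an inverse-video marker, and relies on the standard library's header-style quoted-printable decoder for "Q". The function name is hypothetical.

```python
import base64
import binascii
import re

ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def fallback_display(word, marker="?"):
    """Show the ASCII characters of an encoded-word whose charset is not
    supported, replacing every other octet by a marker.
    Assumption: octets 0x20-0x7E mean the same as in ASCII."""
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        return word  # not an encoded-word: display as it appears
    _charset, encoding, text = m.groups()
    if encoding in "Bb":
        raw = base64.b64decode(text)
    else:
        raw = binascii.a2b_qp(text.encode("ascii"), header=True)
    return "".join(chr(b) if 32 <= b < 127 else marker for b in raw)


print(fallback_display("=?ISO-8859-1?Q?J=E4rnefors?="))
```

With this method the reader at least sees the displayable part of the name instead of the raw "=?...?...?...?=" form.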
4) Ignoring linear-white-space:
A sequence of one or more encoded-words is used to represent non- ASCII
textual data within a header field. An encoded-word must be separated
from an adjacent encoded-word, "word", "text", "ctext", or "special" by
a linear white-space character or a newline. When displaying a
particular header field that contains multiple encoded-words, any
linear-white-space that separates a pair of adjacent encoded-words is
ignored. ...
I think it also should be said explicitly that
linear-white-space separating an "encoded-word" from something
other than an "encoded-word" _shall_ be displayed. (I assume
this is the intention.)
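The display rule as I read it can be sketched as follows in Python. This is an assumption about the intended behaviour, not text from the draft: white space is dropped only when it stands between two encoded-words and is kept everywhere else. The function names are hypothetical, and an unknown charset is not handled.

```python
import base64
import binascii
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def decode_word(match):
    """Decode one matched encoded-word into displayable text."""
    charset, encoding, text = match.groups()
    if encoding in "Bb":
        raw = base64.b64decode(text)
    else:
        raw = binascii.a2b_qp(text.encode("ascii"), header=True)
    return raw.decode(charset)


def display(field_body):
    """Decode encoded-words; ignore white space only between two of them."""
    parts = re.split(r"(\s+)", field_body)  # odd indexes hold white space
    out = []
    pending_space = ""
    previous_was_encoded = False
    for index, part in enumerate(parts):
        if index % 2 == 1:
            pending_space = part
            continue
        if part == "":
            continue
        match = ENCODED_WORD.fullmatch(part)
        if not (match and previous_was_encoded):
            out.append(pending_space)  # white space next to ordinary text stays
        out.append(decode_word(match) if match else part)
        previous_was_encoded = match is not None
        pending_space = ""
    return "".join(out)


print(display("=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?="))
print(display("=?ISO-8859-1?Q?a?= b"))
```

The first call shows the two encoded-words joined without a space; the second keeps the space because the neighbour is ordinary text.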
A consequence of this display rule is that a very long string with
one non-ASCII character and no spaces must be split into several
"encoded-word"s in sequence, only one of which will contain the
representation of a non-ASCII character; the other(s) will
represent only ASCII characters, a kind of "encoded-word" that
is discouraged later:
Use of these methods to encode non-textual data (e.g., pictures or
sounds) is not defined by this memo. Use of encoded-words to
represent strings of purely ASCII characters is allowed, but
discouraged.
To this text could be added something like: "... except in those
rare cases where such encoded-words must be used."
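The situation described above is easy to reproduce. The Python sketch below splits one long string into "Q" encoded-words of at most 75 characters each. It is deliberately conservative and is my own assumption of a simple encoder: only ASCII letters and digits are passed through literally, everything else becomes "=XX", and a single-octet charset is assumed. The function name is hypothetical.

```python
def q_encode_chunked(text, charset="ISO-8859-1", limit=75):
    """Encode text as a list of "Q" encoded-words, each at most
    `limit` characters long. Assumes a single-octet charset."""
    overhead = len(f"=?{charset}?Q??=")
    words = []
    current = ""
    for byte in text.encode(charset):
        literal = byte < 128 and chr(byte).isalnum()
        piece = chr(byte) if literal else f"={byte:02X}"
        if overhead + len(current) + len(piece) > limit:
            words.append(f"=?{charset}?Q?{current}?=")
            current = ""
        current += piece
    words.append(f"=?{charset}?Q?{current}?=")
    return words


# One non-ASCII character in a long string without spaces.
words = q_encode_chunked("x" * 100 + "\u00e4" + "y" * 100)
for word in words:
    print(len(word), word)
```

Only one of the resulting encoded-words contains "=E4"; the others represent purely ASCII characters, yet they cannot be left unencoded without introducing displayed white space.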
5) Separation of encoded-words:
A sequence of one or more encoded-words is used to represent non- ASCII
textual data within a header field. An encoded-word must be separated
from an adjacent encoded-word, "word", "text", "ctext", or "special" by
a linear white-space character or a newline. ...
This is not completely true, since "(" and ")" are "special"s
and may be adjacent to an "encoded-word" in a "comment", as is
shown in one of the examples:
From: Nathaniel Borenstein <nsb@thumper.bellcore.com>
      (=?iso-8859-8?b?7eXs+SDv4SDp7Oj08A==?=)
6) The encoded-word delimiters:
An "encoded-word" is more precisely defined by the following ABNF
grammar, using the notation of RFC 822:
encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="
Does the use of
"=" "?" and "?" "="
in this definition, rather than
"=?" and "?="
imply that linear-white-space may be inserted between "=" and
"?", or is there any other reason?
7) The "text" and "ctext" confusion:
One common mistake when using the syntactical concepts of
RFC 822 is to think of "text" and "ctext" objects as strings of
characters. This is false: a "text" or "ctext" is one single
character. The definition of e.g. "text" in RFC 822 is:
: text        =  <any CHAR, including bare    ; => atoms, specials,
:                 CR & bare LF, but NOT       ;  comments and
:                 including CRLF>             ;  quoted-strings are
:                                             ;  NOT recognized.
As a matter of fact, this mistake was made even in RFC 822 itself:
: Note: Any field which has a field-body that is defined as
: other than simply <text> is to be treated as a struc-
: tured field.
(In the right-hand part of the RFC 822 syntax definitions "text"
occurs only as "*text".)
An encoded-word may be distinguished from an ordinary "word", "text", or
"ctext", as follows: An encoded-word begins with "=?", ends with "?=",
It's very easy indeed to distinguish an "encoded-word" from a
"text" or "ctext", since these consist of only one character!
It would be better to write: "... from an ordinary "word", a
"*text" string, or a "*ctext" string, as follows ..."
contains exactly four "?" characters including the delimiters, and is
followed by a SPACE or newline. If the "word", "text", or "ctext" does
Use "*text" and "*ctext" here.
A mail composing program claiming compliance with this specification
MUST ensure that any string of printable ASCII characters in a "text" or
"ctext" entity within a header, or any "atom" within a "phrase", that
I think 'a "*text" or "*ctext" string within a header' would
be better here.
begins with "=?" and ends with "?=" be a valid encoded-word.
A mail reading program claiming compliance with this specification must
be able to distinguish encoded-words from "text", "ctext", or "word"s
anytime they appear in appropriate places in message headers. In
I suggest: '... from "*text" or "*ctext" strings, or "word"s ...'
8) The word "token":
Since the word "token" is given a specialized meaning in this
proto-RFC, it shouldn't be used with a looser meaning in the
same document:
- An encoded-word may replace a "text" token (as defined by RFC 822) in:
It would be better to write: '... may replace a "*text" string in: ...'
(1) a Subject or Comments header field, (2) any extension message
header field, (3) any user-defined message header field, or (4) any
RFC 1341++ body part header field (such as Content-Description) for
which the field body contains only "text"s.
It's probably better to write:
RFC 1341++ body part header field (such as Content-Description)
whose syntax is defined as "*text".
because in a way _all_ field bodies contain only "text"s (no
ASCII character is excluded from being a "text" object).
9) Where encoded-words can't be used:
These are the ONLY locations where an encoded-word may appear. In
particular, an encoded-word MUST NOT appear in any portion of an
"address". In addition, an encoded-word MUST NOT be used in a
Received header field.
I think it would be useful to add here:
Encoded-words cannot occur in quoted-strings. Anything
looking like an encoded-word in a quoted-string is to be
treated as an ordinary sequence of ASCII characters.
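A reader could apply the suggested rule roughly as in this Python sketch. It is a simplification under stated assumptions: quoted-strings are blanked out before searching, comments and nesting are not handled, and the function name is hypothetical.

```python
import re

# A quoted-string: '"' ... '"' with backslash-quoted characters inside.
QUOTED_STRING = re.compile(r'"(?:\\.|[^"\\])*"')
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]*\?=")


def encoded_words_outside_quotes(field_body):
    """Return the encoded-word candidates that are not inside a
    quoted-string. Simplified: comments are not parsed."""
    blanked = QUOTED_STRING.sub(lambda m: " " * len(m.group()), field_body)
    return [m.group() for m in ENCODED_WORD.finditer(blanked)]


body = '"=?ISO-8859-1?Q?a?=" =?ISO-8859-1?Q?b?= <user@example.org>'
print(encoded_words_outside_quotes(body))
```

Only the second candidate is reported; the first one sits inside a quoted-string and is left as an ordinary sequence of ASCII characters.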
10) Where phrases can occur:
- As a replacement for a "word" entity within a "phrase", for example,
one that precedes an address in a From, To, or Cc header. ...
To help the reader it could be added here:
("phrase"s also may occur in "group" addresses before the
":" and in the header fields In-Reply-To, Keywords, and
References.)
11) Excluding X-type charsets:
The Appendix states that X-* charsets are not allowed:
2. Character sets allowed to include IANA-registered charsets in
addition to those defined in RFC 1341++. (X-* charsets are still
excluded.)
In the section "Character sets", it says:
In an encoded-word, the character set associated with the unencoded text
is specified by a charset. A charset can be any of the character set
names allowed in an RFC 1341++ "charset" parameter of a "text/plain"
body part, or any character set name registered with IANA for use with
the MIME text/plain content-type. ...
This wording doesn't exclude the use of character set names
starting with "X-" (after private agreement), since such names
are allowed in RFC 1341++. Neither do I understand why they
should be excluded, if there exists a private agreement to use
such a charset name.
12) "Latin" character sets:
Initially, the legal values for "encoding" are "Q" and "B". These
encodings are described below. The "Q" encoding is recommended for
use with Latin character sets, and the "B" encoding for all others.
What is a "Latin" character set? ISO 8859-5 contains both the
Latin script and the Cyrillic script. Is it a "Latin" character
set? I would base a recommendation about Q and B encodings not
on the character set used, but on the relative frequency of
ASCII characters in the text to be encoded.
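A recommendation based on the text itself could be as simple as comparing the encoded lengths, as in this Python sketch. The cost model is my own assumption: in "Q", ASCII letters and digits take one character and every other octet takes three, while "B" takes four characters per three octets. The function name is hypothetical.

```python
def choose_encoding(text, charset):
    """Pick "Q" or "B" by comparing approximate encoded lengths.
    Assumption: Q passes only ASCII letters and digits literally."""
    raw = text.encode(charset)
    q_length = sum(1 if (b < 128 and chr(b).isalnum()) else 3 for b in raw)
    b_length = 4 * ((len(raw) + 2) // 3)
    return "Q" if q_length <= b_length else "B"


# Mostly ASCII text: Q is shorter and stays readable.
print(choose_encoding("Olle J\u00e4rnefors", "iso-8859-1"))
# Cyrillic text in ISO 8859-5: nearly every octet needs "=XX" in Q.
print(choose_encoding("\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440", "iso-8859-5"))
```

The same charset, ISO 8859-5, would get "Q" for a Latin-script name and "B" for a Cyrillic one, which is why the charset alone is a poor basis for the recommendation.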
13) The appearance of an encoded-word:
These are the ONLY locations where an encoded-word may appear. In
particular, an encoded-word MUST NOT appear in any portion of an
"address". In addition, an encoded-word MUST NOT be used in a
Received header field.
Whenever such words appear in a header being displayed, an enlightened
mail reader will decode the text and render it appropriately.
An "encoded-word" may of course be a part of that portion of an
"address" that is a "phrase" before the "addr-spec" that is
surrounded by "<" and ">". Furthermore, as "encoded-word" is
defined in the section Encodings (a sequence of printable ASCII
characters that begins with "=?", ends with "?=", and has two
"?"s in between) it may also occur in the "addr-spec" itself,
but it must not be _interpreted_ and displayed as an
"encoded-word" in that case.
I don't think that the second paragraph adds anything of
substance to the document, so it can be removed.
A better wording for this part of the text might then be:
These are the ONLY locations where the rules of this RFC
for interpreting what appears as an encoded-word are to
be applied. In particular, the encoded-word interpretation
MUST NOT be used for any portion of an "addr-spec". In
addition, it MUST NOT be used in a Received header field.
14) Miscellaneous:
1. Any 8-bit value may be represented by a "=" followed by two
hexadecimal digits. For example, if the character set in use
were ISO-8859-1, the "=" character would thus be encoded as
"=3D", and a SPACE by "=20".
The characters "=" and SPACE can be encoded in this way in _all_
character sets defined in RFC 1341++. It might be better to use
an example from the right-hand half of ISO-8859-1, such as "=A9"
COPYRIGHT SIGN or "=E9" LATIN SMALL LETTER E WITH ACUTE.
These are the ONLY locations where an encoded-word may appear. In
particular, an encoded-word MUST NOT appear in any portion of an
"address". In addition, an encoded-word MUST NOT be used in a
Received header field.
Whenever such words appear in a header being displayed, an enlightened
mail reader will decode the text and render it appropriately.
Only textual data (printable and white space characters) should be
encoded using this scheme. However, since these encoding schemes
allow the encoding of arbitrary 8-bit values, mail readers that
implement this decoding should also ensure that display of the decoded
data on the recipient's terminal will not cause unwanted side-effects.
Use of these methods to encode non-textual data (e.g., pictures or
sounds) is not defined by this memo. Use of encoded-words to
represent strings of purely ASCII characters is allowed, but
discouraged.
The indentation of these paragraphs should be reduced by two
positions, because they are not related only to the preceding
bulleted part of the text.
K. Moore [Page 2]
Internet Draft Expires 22 September 1993 22 March 1993
encoded- words before inserting them into the message header.
++++++++++++++
encoded-words
A sequence of one or more encoded-words is used to represent non- ASCII
                                                             ++++++++++
                                                             non-ASCII
textual data within a header field. ...
--
Olle Jarnefors                       Internet: ojarnef(_at_)admin(_dot_)kth(_dot_)se
Information Management Services      UUCP: ...!uunet!mcsun!sunic!kth!ojarnef
Royal Institute of Technology (KTH)  BITNET: ojarnef(_at_)sekth  Fax: +46 8 10 25 10
S-100 44 Stockholm, Sweden           Phone: +46 8 790 71 26 (time zone +0200)