Comments on the draft MIME Part 2 document

Here are some late comments on draft-ietf-822ext-mime-part2-00.txt
in response to the "WG Last Call!!!!" of Mon Apr 19 23:25:15 1993.
Most of them are minor issues of wording.


1) The "tspecials" definition:

   tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" /
               <"> / "/" / "[" / "]" / "?" / "." / "="


This means that Part 2 of the MIME standard will use a
different "tspecials" concept than Part 1.  (The difference is
that "." is a Part 2 "tspecials" character but not a Part 1 one.)
Is this intentional?  If so, it might be better to use a term
other than "tspecials" here, to avoid unnecessary misunderstanding.
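
To make the practical effect concrete, here is a minimal sketch in
Python.  The Part 2 set is copied from the definition quoted above; the
Part 1 set is an assumption (the same list without "."), and the
helper name is_token is only illustrative.

```python
# Part 2 "tspecials", as quoted from the draft above.
PART2_TSPECIALS = set('()<>@,;:\\"/[]?.=')

# Assumption: Part 1 "tspecials" is the same list without ".".
PART1_TSPECIALS = PART2_TSPECIALS - {"."}


def is_token(s, tspecials):
    """True if s is 1*<any CHAR except SPACE, CTLs, and tspecials>."""
    return bool(s) and all(32 < ord(c) < 127 and c not in tspecials for c in s)


# A hypothetical name containing "." is a token under one set only.
print(is_token("x.y", PART1_TSPECIALS), is_token("x.y", PART2_TSPECIALS))
```

A name such as "x.y" would thus be a "token" under one definition and
not under the other, which is the kind of misunderstanding a separate
term would avoid.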


2) Status of malformed encoded words:

I'm not sure if strings such as

   =?ISO-8859?Q?a?=       (illegal "charset")

   =?ISO-8859-1?A?a?=     (illegal "encoding")

   =?ISO-8859-1?Q? a?=    (illegal character in "encoded-text")

   =?ISO-8859-1?B?-abc?=  (illegal character in "encoded-text"
                          according to the indicated "encoding")

   J=?ISO-8859-1?Q?=E4?=rnefors  ("encoded-word" not separated
                                 by linear white-space from its
                                 surroundings)

or strings with the ABNF syntax of an "encoded-word" but longer
than 75 characters are _allowed_ in e.g. the Subject: field of
a message conforming to MIME Part 2. _If_ they are allowed,
should they be treated (a) as just a sequence of ASCII 
characters, or (b) as the sequence of ASCII characters 
that form the "encoded-text" part, or (c) as an "encoded-word"
with undefined meaning (that should be displayed as nothing at 
all or perhaps by an error indicator)?  Maybe the correct
alternative should be indicated explicitly in the text.
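
The cases listed above can be told apart mechanically.  The following
Python sketch only detects them; it does not settle which of (a), (b),
or (c) a reader should then do.  The set of known charsets is a tiny
illustrative subset, and case-insensitive matching of charset and
encoding names is my assumption.

```python
import re

# Assumption: an illustrative subset, not the real list of legal charsets.
KNOWN_CHARSETS = {"US-ASCII", "ISO-8859-1"}

SHAPE = re.compile(r"=\?([^?]*)\?([^?]*)\?([^?]*)\?=")
B_TEXT = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def classify(word):
    """Return "ok" or a short reason why word is not a valid encoded-word."""
    m = SHAPE.fullmatch(word)
    if m is None:
        return "not of encoded-word form"
    charset, encoding, text = m.groups()
    if len(word) > 75:
        return "longer than 75 characters"
    if charset.upper() not in KNOWN_CHARSETS:
        return "illegal charset"
    if encoding.upper() not in ("Q", "B"):
        return "illegal encoding"
    if any(c <= " " or c > "~" for c in text):
        return "illegal character in encoded-text"
    if encoding.upper() == "B" and not B_TEXT.fullmatch(text):
        return "illegal character for B encoding"
    return "ok"


for w in ("=?ISO-8859?Q?a?=", "=?ISO-8859-1?A?a?=", "=?ISO-8859-1?Q? a?=",
          "=?ISO-8859-1?B?-abc?=", "J=?ISO-8859-1?Q?=E4?=rnefors"):
    print(w, "->", classify(w))
```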

It says in the Conformance section:

   A mail composing program claiming compliance with this specification
   MUST ensure that any string of printable ASCII characters in a "text" or
   "ctext" entity within a header, or any "atom" within a "phrase", that
   begins with "=?" and ends with "?=" be a valid encoded-word.


Do we really have to forbid strings such as

   Why you have seen a lot of =? and ?= in email recently

which clearly doesn't contain "encoded-word"s? 


3) Displaying encoded-words with unsupported charset:

   If the mail reader does not support the character set used, it may
   either display the encoded-word as ordinary text (i.e., as it appears in
   the header), or it may substitute an appropriate message indicating that
   the decoded text could not be displayed.


There are several other "fall-back" methods that a non-minimal
implementation could use, e.g. displaying all ASCII characters
represented by the "encoded-word", with undisplayable characters
indicated by some marker such as "?" in inverse video (and
leaving out the ugly "=?...?...?...?=" stuff).  At the least, this
paragraph should say that other, more sophisticated methods
than the two mentioned exist and may be used by an
implementation.
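
As an illustration of such a fall-back, here is a rough Python sketch.
It assumes that the unsupported charset is a superset of ASCII, so that
octets 32-126 can be shown as they are, and it uses a plain "?" as the
marker; the function names are hypothetical.

```python
import base64
import re


def _unhex(match):
    """Turn one "=XX" escape into the character with that octet value."""
    return chr(int(match.group(1), 16))


def decode_payload(encoding, text):
    """Turn the encoded-text into raw octets (Q or B encoding)."""
    if encoding.upper() == "B":
        return base64.b64decode(text)
    text = text.replace("_", " ")  # Q encoding writes SPACE as "_"
    return re.sub(r"=([0-9A-Fa-f]{2})", _unhex, text).encode("iso-8859-1")


def display_fallback(word, marker="?"):
    """Show the ASCII characters of an encoded-word and mark the others."""
    # Assumes word already has the "=?charset?encoding?text?=" shape.
    _, _charset, encoding, text, _ = word.split("?")
    raw = decode_payload(encoding, text)
    return "".join(chr(b) if 32 <= b < 127 else marker for b in raw)


print(display_fallback("=?ISO-8859-1?Q?J=E4rnefors?="))  # J?rnefors
```

For a multi-octet charset this prints one marker per octet rather than
per character, which is one reason it is only a sketch.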


4) Ignoring linear-white-space:

   A sequence of one or more encoded-words is used to represent non- ASCII
   textual data within a header field.  An encoded-word must be separated
   from an adjacent encoded-word, "word", "text", "ctext", or "special" by
   a linear white-space character or a newline.  When displaying a
   particular header field that contains multiple encoded-words, any
   linear-white-space that separates a pair of adjacent encoded-words is
   ignored. ...


I think it should also be said explicitly that
linear-white-space separating an "encoded-word" from something
other than an "encoded-word" _shall_ be displayed.  (I assume
this is the intention.)
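
The display rule as I read it can be sketched in Python like this.  The
pattern for what counts as an "encoded-word" is deliberately loose (an
assumption, not the full ABNF), and decoding is left out.

```python
import re

# Loose pattern for something of encoded-word shape (not the full ABNF).
EW = r"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?="


def drop_whitespace_between_encoded_words(field_body):
    """Remove white space only where it separates two adjacent encoded-words."""
    return re.sub(rf"({EW})\s+(?={EW})", r"\1", field_body)


subject = "Re: =?ISO-8859-1?Q?J=E4?= =?ISO-8859-1?Q?rnefors?= said"
print(drop_whitespace_between_encoded_words(subject))
```

The space after "Re:" and the space before "said" are kept; only the
space between the two "encoded-word"s disappears.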

One consequence is that a very long string with one
non-ASCII character and no spaces must be split into several
"encoded-word"s in sequence, only one of which will contain the
representation of a non-ASCII character; the other(s) will
represent only ASCII characters, a kind of "encoded-word" that
is discouraged later:

     Use of these methods to encode non-textual data (e.g., pictures or
     sounds) is not defined by this memo.  Use of encoded-words to
     represent strings of purely ASCII characters is allowed, but
     discouraged.


To this text could be added something like: "... except in those
rare cases where such encoded-words must be used."
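
A small Python sketch of the rare case in question: a long word with a
single non-ASCII character, split into "encoded-word"s of at most 75
characters.  It assumes ISO-8859-1 text and the Q encoding, and it
encodes everything except ASCII letters and digits as "=XX", which is
more conservative than necessary.

```python
def q_encode_char(ch):
    """Q-encode one ISO-8859-1 character (letters and digits stay literal)."""
    if ch.isascii() and ch.isalnum():
        return ch
    return "=%02X" % ch.encode("iso-8859-1")[0]


def split_word(word, charset="ISO-8859-1", limit=75):
    """Split word into a list of encoded-words, each at most limit long."""
    prefix, suffix = f"=?{charset}?Q?", "?="
    room = limit - len(prefix) - len(suffix)
    chunks, current = [], ""
    for ch in word:
        enc = q_encode_char(ch)
        if len(current) + len(enc) > room:  # never split inside "=XX"
            chunks.append(current)
            current = ""
        current += enc
    if current:
        chunks.append(current)
    return [prefix + c + suffix for c in chunks]


parts = split_word("x" * 100 + "\u00e4" + "y" * 100)
print(len(parts), sum("=E4" in p for p in parts))
```

All parts but one represent only ASCII characters, which is exactly the
discouraged kind of "encoded-word".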


5) Separation of encoded-words:

   A sequence of one or more encoded-words is used to represent non- ASCII
   textual data within a header field.  An encoded-word must be separated
   from an adjacent encoded-word, "word", "text", "ctext", or "special" by
   a linear white-space character or a newline. ...


This is not completely true, since "(" and ")" are "special"s
and may be adjacent to an "encoded-word" in a "comment", as is
shown in one of the examples:

   From: Nathaniel Borenstein <nsb@thumper.bellcore.com>
         (=?iso-8859-8?b?7eXs+SDv4SDp7Oj08A==?=)



6) The encoded-word delimiters:

   An "encoded-word" is more precisely defined by the following ABNF
   grammar, using the notation of RFC 822:

   encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="


Does the use of

   "=" "?"    and    "?" "="

in this definition, rather than

      "=?"    and    "?="

imply that linear-white-space may be inserted between "=" and
"?", or is there any other reason?


7) The "text" and "ctext" confusion:

One common mistake when using the syntactical concepts of
RFC 822 is to think of "text" and "ctext" objects as strings of
characters.  This is false: a "text" or "ctext" is one single
character.  The definition of e.g. "text" in RFC 822 is:

:      text        =  <any CHAR, including bare    ; => atoms, specials,
:                      CR & bare LF, but NOT       ;  comments and
:                      including CRLF>             ;  quoted-strings are
:                                                  ;  NOT recognized.

As a matter of fact, this mistake was made even in RFC 822 itself:

:         Note:  Any field which has a field-body  that  is  defined  as
:                other  than  simply <text> is to be treated as a struc-
:                tured field.

(In the right-hand part of the RFC 822 syntax definitions "text"
occurs only as "*text".)
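
The distinction can be written down as two predicates.  This Python
sketch follows the RFC 822 definition quoted above; the function names
are my own.

```python
def is_text(ch):
    """RFC 822 "text": one single CHAR (ASCII 0-127)."""
    return len(ch) == 1 and ord(ch) < 128


def is_text_string(s):
    """A "*text" string: zero or more "text" characters, with no CRLF."""
    return "\r\n" not in s and all(is_text(c) for c in s)


print(is_text("ab"), is_text_string("ab"))  # False True
```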

   An encoded-word may be distinguished from an ordinary "word", "text", or
   "ctext", as follows: An encoded-word begins with "=?", ends with "?=",


It's very easy indeed to distinguish an "encoded-word" from a
"text" or "ctext", since these consist of only one character!
It would be better to write: "... from an ordinary "word", a
"*text" string, or a "*ctext" string, as follows ..."

   contains exactly four "?" characters including the delimiters, and is
   followed by a SPACE or newline.  If the "word", "text", or "ctext" does


Use "*text" and "*ctext" here.

   A mail composing program claiming compliance with this specification
   MUST ensure that any string of printable ASCII characters in a "text" or
   "ctext" entity within a header, or any "atom" within a "phrase", that


I think   'a "*text" or "*ctext" string within a header'   would
be better here.

   begins with "=?" and ends with "?=" be a valid encoded-word.

   A mail reading program claiming compliance with this specification must
   be able to distinguish encoded-words from "text", "ctext", or "word"s
   anytime they appear in appropriate places in message headers.  In


I suggest: '... from "*text" or "*ctext" strings, or "word"s ...'


8) The word "token":

Since the word "token" is given a specialized meaning in this
proto-RFC, it shouldn't be used with a looser meaning in the
same document:

   - An encoded-word may replace a "text" token (as defined by RFC 822) in:


It would be better to write: '... may replace a "*text" string in: ...'

     (1) a Subject or Comments header field, (2) any extension message
     header field, (3) any user-defined message header field, or (4) any
     RFC 1341++ body part header field (such as Content-Description) for
     which the field body contains only "text"s.


It's probably better to write:

       RFC 1341++ body part header field (such as Content-Description)
       whose syntax is defined as "*text".

because in a way _all_ field bodies contain only "text"s (no
ASCII character is excluded from being a "text" object).


9) Where encoded-words can't be used:

     These are the ONLY locations where an encoded-word may appear.  In
     particular, an encoded-word MUST NOT appear in any portion of an
     "address".  In addition, an encoded-word MUST NOT be used in a
     Received header field.


I think it would be useful to add here:

       Encoded-words cannot occur in quoted-strings.  Anything
       looking like an encoded-word in a quoted-string is to be
       treated as an ordinary sequence of ASCII characters.
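
A rough Python sketch of what a reader would do under this rule.  The
patterns are simplifications (quoted-strings inside comments, for
instance, are not handled), and the function name is hypothetical.

```python
import re

QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')
EW = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?=")


def encoded_word_candidates(field_body):
    """Find encoded-word lookalikes, skipping anything in a quoted-string."""
    # Blank out quoted-strings so that nothing inside them can match.
    without_quoted = QUOTED.sub(lambda m: " " * len(m.group()), field_body)
    return [m.group() for m in EW.finditer(without_quoted)]


print(encoded_word_candidates('"=?ISO-8859-1?Q?a?=" =?ISO-8859-1?Q?b?= <x@y>'))
```

Only the second lookalike is reported; the first one is inside a
quoted-string and stays an ordinary sequence of ASCII characters.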
       

10) Where phrases can occur:

   - As a replacement for a "word" entity within a "phrase", for example,
     one that precedes an address in a From, To, or Cc header.  ...


To help the reader it could be added here:

       ("phrase"s also may occur in "group" addresses before the
       ":" and in the header fields In-Reply-To, Keywords, and
       References.)


11) Excluding X-type charsets:

The Appendix states that X-* charsets are not allowed:

     2. Character sets allowed to include IANA-registered charsets in
        addition to those defined in RFC 1341++.  (X-* charsets are still
        excluded.)


In the section "Character sets", it says:

   In an encoded-word, the character set associated with the unencoded text
   is specified by a charset.  A charset can be any of the character set
   names allowed in an RFC 1341++ "charset" parameter of a "text/plain"
   body part, or any character set name registered with IANA for use with
   the MIME text/plain content-type.  ...


This wording doesn't exclude the use of character set names
starting with "X-" (after private agreement), since such names
are allowed in RFC 1341++.  Neither do I understand why they
should be excluded, if there exists a private agreement to use
such a charset name.


12) "Latin" character sets:

   Initially, the legal values for "encoding" are "Q" and "B".  These
   encodings are described below.  The "Q" encoding is recommended for
   use with Latin character sets, and the "B" encoding for all others.


What is a "Latin" character set?  ISO 8859-5 contains both the
Latin script and the Cyrillic script.  Is it a "Latin" character
set?  I would base a recommendation about Q and B encodings not
on the character set used, but on the relative frequency of
ASCII characters in the text to be encoded.
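
Such a recommendation could be as simple as comparing the two encoded
lengths.  In this Python sketch the set of characters that the Q
encoding may leave unencoded is an assumption (a conservative one), and
SPACE is counted as one character since it can be written as "_".

```python
import base64

# Assumption: octets that the Q encoding can represent by one character.
Q_LITERAL = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!*+-/ "
)


def q_length(raw):
    """Length of the encoded-text if raw is Q-encoded."""
    return sum(1 if b in Q_LITERAL else 3 for b in raw)


def b_length(raw):
    """Length of the encoded-text if raw is B-encoded."""
    return len(base64.b64encode(raw))


def choose_encoding(raw):
    """Pick the encoding that gives the shorter encoded-text."""
    return "Q" if q_length(raw) <= b_length(raw) else "B"


print(choose_encoding("J\u00e4rnefors".encode("iso-8859-1")))
print(choose_encoding("\u041f\u0440\u0438\u0432\u0435\u0442".encode("iso-8859-5")))
print(choose_encoding("Hello".encode("iso-8859-5")))
```

The second and third strings use the same character set, ISO 8859-5,
but come out differently because of their share of ASCII characters.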


13) The appearance of an encoded-word:

     These are the ONLY locations where an encoded-word may appear.  In
     particular, an encoded-word MUST NOT appear in any portion of an
     "address".  In addition, an encoded-word MUST NOT be used in a
     Received header field.

     Whenever such words appear in a header being displayed, an enlightened
     mail reader will decode the text and render it appropriately.


An "encoded-word" may of course be a part of that portion of an
"address" that is a "phrase" before the "addr-spec" that is
surrounded by "<" and ">".  Furthermore, as "encoded-word" is
defined in the section Encodings (a sequence of printable ASCII
characters that begins with "=?", ends with "?=", and has two
"?"s in between) it may also occur in the "addr-spec" itself,
but it must not be _interpreted_ and displayed as an
"encoded-word" in that case.

I don't think that the second paragraph adds anything of
substance to the document, so it can be removed.

A better wording for this part of the text might then be:

       These are the ONLY locations where the rules of this RFC
       for interpreting what appears as an encoded-word are to
       be applied.  In particular, the encoded-word interpretation
       MUST NOT be used for any portion of an "addr-spec".  In
       addition, it MUST NOT be used in a Received header field.



14) Miscellaneous:

   1.  Any 8-bit value may be represented by a "=" followed by two
       hexadecimal digits.  For example, if the character set in use
       were ISO-8859-1, the "=" character would thus be encoded as
       "=3D", and a SPACE by "=20".


The characters "=" and SPACE can be encoded in this way in _all_
character sets defined in RFC 1341++.  It might be better to use
an example from the right-hand half of ISO-8859-1, such as "=A9"
COPYRIGHT SIGN or "=E9" LATIN SMALL LETTER E WITH ACUTE.
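
The point can be checked with a few lines of Python.  The helper name
is hypothetical, and ISO-8859-5 is used only as a second character set
for comparison.

```python
def q_value(hex_pair, charset):
    """Decode the two hex digits that follow "=" under the given charset."""
    return bytes([int(hex_pair, 16)]).decode(charset)


# "=3D" and "=20" mean the same thing whatever the charset ...
print(repr(q_value("3D", "iso-8859-1")), repr(q_value("3D", "iso-8859-5")))
# ... while "=E9" depends on the charset in use.
print(repr(q_value("E9", "iso-8859-1")), repr(q_value("E9", "iso-8859-5")))
```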

     These are the ONLY locations where an encoded-word may appear.  In
     particular, an encoded-word MUST NOT appear in any portion of an
     "address".  In addition, an encoded-word MUST NOT be used in a
     Received header field.

     Whenever such words appear in a header being displayed, an enlightened
     mail reader will decode the text and render it appropriately.

     Only textual data (printable and white space characters) should be
     encoded using this scheme.  However, since these encoding schemes
     allow the encoding of arbitrary 8-bit values, mail readers that
     implement this decoding should also ensure that display of the decoded
     data on the recipient's terminal will not cause unwanted side-effects.

     Use of these methods to encode non-textual data (e.g., pictures or
     sounds) is not defined by this memo.  Use of encoded-words to
     represent strings of purely ASCII characters is allowed, but
     discouraged.


The indentation of these paragraphs should be reduced by two
positions, because they are not related only to the preceding
bulleted part of the text.

K. Moore                                                        [Page 2]
Internet Draft         Expires 22 September 1993           22 March 1993



   encoded- words before inserting them into the message header.

     ++++++++++++++
     encoded-words

   A sequence of one or more encoded-words is used to represent non- ASCII

                                                                  ++++++++++
                                                                  non-ASCII

   textual data within a header field. ...



--
Olle Jarnefors                     Internet: ojarnef(_at_)admin(_dot_)kth(_dot_)se
Information Management Services        UUCP: ...!uunet!mcsun!sunic!kth!ojarnef
Royal Institute of Technology (KTH)  BITNET: ojarnef(_at_)sekth  Fax:+46 8 10 25 10
S-100 44  Stockholm, Sweden           Phone: +46 8 790 71 26 (time zone +0200)