nmh-workers

Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new scan

2010-11-14 17:56:26
On Sun, Nov 14, 2010 at 11:45 AM, Jon Steinhart <jon(_at_)fourwinds(_dot_)com> 
wrote:
My preference is to say that we'll treat any =?...?= as an encoded word
wherever it appears and that we'll decode it.  It appears that the authors of
RFC2047 expect that everything will be parsed into tokens and examined before
looking for encoded words.

You are right.  RFC 822 defined the basic tokenization rules,
and MIME attempts to stay compatible with that.  I.e., you have
a system that knows how to do RFC 822 tokenization, and then
that token data can be passed to the MIME-aware layer.
Here is a relevant note from RFC 2047:

   IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
   by an RFC 822 parser.  As a consequence, unencoded white space
   characters (such as SPACE and HTAB) are FORBIDDEN within an
   'encoded-word'.  For example, the character sequence

      =?iso-8859-1?q?this is some text?=

   would be parsed as four 'atom's, rather than as a single 'atom' (by
   an RFC 822 parser) or 'encoded-word' (by a parser which understands
   'encoded-words').  The correct way to encode the string "this is some
   text" is to encode the SPACE characters as well, e.g.

      =?iso-8859-1?q?this=20is=20some=20text?=
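
To make the RFC's point concrete, here is a minimal Python sketch (not nmh code) that splits a field body into whitespace-separated atoms first and only then asks whether each atom is an encoded word. The names ENCODED_WORD and classify_atoms are made up for illustration, and splitting on whitespace is a simplification of real RFC 822 tokenization, which also handles specials, comments, and quoted strings.

```python
import re

# Simplified encoded-word pattern: =?charset?encoding?encoded-text?=
# with no whitespace allowed anywhere inside (an assumption-level sketch,
# not a full RFC 2047 grammar).
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

def classify_atoms(body):
    """Split on whitespace first, then test each atom as an encoded word."""
    return [(atom, bool(ENCODED_WORD.fullmatch(atom))) for atom in body.split()]

bad = classify_atoms("=?iso-8859-1?q?this is some text?=")
good = classify_atoms("=?iso-8859-1?q?this=20is=20some=20text?=")

print(len(bad), [is_ew for _, is_ew in bad])
print(len(good), [is_ew for _, is_ew in good])
```

With tokenization done first, the unencoded-space form falls apart into four ordinary atoms and none of them qualifies as an encoded word, while the =20 form survives as a single atom that does.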


I think many mail implementations today probably do
not work that way, mainly due to ignorance of the developers.
Although not related to this topic, an example of this
ignorance is the syntax adopted in DKIM headers.

As for space between encoded words, such space should be
collapsed.  I.e., two adjacent encoded words should be
concatenated together after decoding, with no space between
them.
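
A rough Python sketch of that rule, assuming an unstructured field body that has already been unfolded.  The helper names (decode_word, decode_unstructured) are invented for this example; whitespace is dropped only when it sits between two encoded words and is kept everywhere else.

```python
import base64
import quopri
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

def decode_word(match):
    """Decode one encoded word (B or Q encoding) to a Python string."""
    charset, encoding, payload = match.groups()
    raw = payload.encode("ascii")
    if encoding.lower() == "b":
        data = base64.b64decode(raw)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        data = quopri.decodestring(raw, header=True)
    return data.decode(charset, errors="replace")

def decode_unstructured(body):
    """Decode encoded words, dropping whitespace between adjacent ones."""
    out = []
    pending_ws = ""
    prev_encoded = False
    for part in re.split(r"(\s+)", body):
        if part == "":
            continue
        if part.isspace():
            pending_ws = part
            continue
        match = ENCODED_WORD.fullmatch(part)
        if match:
            if not prev_encoded:
                out.append(pending_ws)   # keep space after ordinary text
            out.append(decode_word(match))
            prev_encoded = True
        else:
            out.append(pending_ws)
            out.append(part)
            prev_encoded = False
        pending_ws = ""
    out.append(pending_ws)
    return "".join(out)

joined = decode_unstructured(
    "=?iso-8859-1?q?this=20is?= =?iso-8859-1?q?=20some=20text?=")
mixed = decode_unstructured("Re: =?utf-8?b?w6k=?= plain")
```

Note that the space inside "some text" has to be carried by the encoded words themselves (the =20), since the space between the two words is discarded.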

Note, it is a mistake to blindly assume that all sequences
of =?...?= should be decoded, which has led to some erroneous
uses by some software.  For example, using =?...?= inside
parameter values instead of using RFC 2184 (now RFC 2231).
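
For comparison, the extended parameter-value syntax from RFC 2184 and its successor RFC 2231 looks quite different from an encoded word: charset'language'percent-encoded-text, attached to a parameter name ending in "*".  A small Python sketch of decoding one such value follows; decode_ext_value is a hypothetical helper, and it ignores continuations (title*0*, title*1*, ...) entirely.

```python
from urllib.parse import unquote_to_bytes

def decode_ext_value(value):
    """Decode a single extended parameter value: charset'lang'%XX-text."""
    charset, _language, encoded = value.split("'", 2)
    # An empty charset is treated as us-ascii here; that default is an
    # assumption of this sketch.
    return unquote_to_bytes(encoded).decode(charset or "us-ascii")

# Value part of:  title*=us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A
title = decode_ext_value("us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A")
euro = decode_ext_value("utf-8''%E2%82%AC%20rates")
```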

My current plan for the new scan code is to:

 1.  Read a header field name.

 2.  Read a header field body if the header field is used by the format,
    unfolding folded lines in the process.

 3.  Look for encoded words and decode them creating a UTF-8 version of the
    header field body.
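
The three steps above could be sketched in Python roughly as follows.  This is only an illustration of the plan, not the nmh scan code: read_fields and to_utf8 are hypothetical names, and step 3 leans on Python's email.header module, which treats the whole body as unstructured text and so does not apply the context restrictions of RFC 2047 Section 5.

```python
from email.header import decode_header, make_header

def read_fields(lines, wanted):
    """Yield (name, unfolded body) for the header fields named in wanted."""
    wanted = {name.lower() for name in wanted}
    name, body, keep = None, [], False
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":                  # blank line ends the header
            break
        if line[0] in " \t":            # continuation of a folded field
            if keep:
                body.append(line)       # unfold: drop only the line break
            continue
        if keep:
            yield name, "".join(body).strip()
        # Step 1: read the field name.
        name, _, rest = line.partition(":")
        # Step 2: collect the body only if the format uses this field.
        keep = name.lower() in wanted
        body = [rest]
    if keep:
        yield name, "".join(body).strip()

def to_utf8(body):
    """Step 3: decode encoded words and return the body as UTF-8 bytes."""
    return str(make_header(decode_header(body))).encode("utf-8")

message = [
    "Subject: first part\n",
    "\tsecond part\n",
    "X-Unused: ignored\n",
    "\n",
    "body text\n",
]
fields = list(read_fields(message, ["Subject"]))
subject_bytes = to_utf8("=?iso-8859-1?q?this=20is=20some=20text?=")
```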

I've never really dived into the MH/nmh parsing code.  Is there any
attempt to perform RFC 822 based tokenization before doing any
other processing?

Decoding of encoded words should only be done in specific contexts.
Look at Section 5 of RFC 2047 for the contexts in which encoded words
are allowed.
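
As one example of context-sensitive decoding: in an address field, an encoded word may stand in for a word of the display-name phrase, but must not be decoded inside the addr-spec.  The Python sketch below uses email.utils.parseaddr to separate the two and decodes only the phrase.  display_address is a made-up name, and because parseaddr strips the quotes from a quoted display name, this sketch cannot honor the rule that encoded words are not recognized inside a quoted-string.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr

def decode_phrase(text):
    """Decode encoded words in a display-name phrase."""
    return str(make_header(decode_header(text)))

def display_address(field_body):
    """Split one address into (decoded display name, untouched addr-spec)."""
    phrase, addr_spec = parseaddr(field_body)
    # Only the phrase is decoded; the addr-spec is passed through as-is.
    return decode_phrase(phrase), addr_spec

name, addr = display_address("=?iso-8859-1?q?J=F6rg?= <joerg@example.com>")
```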

--ewh

_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
http://lists.nongnu.org/mailman/listinfo/nmh-workers
