
Re: Getting RFC 2047 encoding right

2004-01-04 02:37:22

Arnt Gulbrandsen wrote:


I have a tiny little problem that you chaps may be able to help me with.

Suppose that a mail client receives a message whose subject is encoded according to RFC 2047. Suppose further that the message is decoded and stored in a database somewhere - in RAM, on disk, on a server somewhere. Some time later, the same or a different program sends a reply.

Now, to be kind and courteous that program should use the same subject field, perhaps prefaced by "Re: " (or "Auto: "), such that if the recipient threads based on subject, everything works, no matter whether the recipient supports RFC 2047 or not.

That implies using the same character set, q/b encoding etc. as the original.

To be kind and courteous, the program should use a widely supported character set when 2047-encoding (whether it's composing original messages or replies), and use as few 2047-encoded words as possible.

To be kind and courteous, the program should observe all other rules set out in RFC 2047 and 2822 (including ones which were broken by an earlier message).

To be reliable and bug-free, the algorithm that does all this should be simple and straightforward.

There's a conflict here. How do you all address this?

In reverse order:

The nature of the problem precludes a truly simple algorithm; it's a complex issue.

Some errors can be repaired; others cannot. Attempts to repair errors might be
more-or-less successful. Successful repair is contingent on being able to
unambiguously determine what was intended.

As far as practicable, original content should be preserved. E.g. in a reply, the address given in the original message's Reply-To (or From) field should be used verbatim (same case, same display name if present, including the same encoded-words if used) in the To field of the reply.

I would extend that to the Subject field, and go so far as to say that "Re:", "Auto:", etc. are best avoided. Incidentally, collating (colloquially "sorting") by subject is not the same as "threading"; the latter entails use of References and/or In-Reply-To fields with Message-ID fields to follow a related "thread" of messages. (Consider a collection of 10 messages with "Subject: Help" and 50 with "Subject: Re: Help" -- collating by Subject (with or without stripping "Re: ") doesn't group responses with

Non-transient storage of a message is best done in RFC 2822/MIME format, possibly with lossless compression if the tradeoff between space and compression/decompression effort warrants it, and possibly with encryption where necessary or desirable. That does not preclude some additional metadata regarding the message, but the original message ought to be 100% recoverable for use in replies, etc. Even conversion of CRLF to local line endings can be troublesome (consider a multipart MIME message with a binary part containing the octets 0x0D 0x0A).
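
To make the line-ending hazard concrete, here is a minimal Python sketch. It assumes a binary part carried without base64 (i.e. with a binary content-transfer-encoding), and the helper names to_local and to_wire are made up for illustration. The sample octets are the PNG file signature, which contains both CR LF and a bare LF.

```python
# The 8-octet PNG signature: it contains CR LF (0x0D 0x0A) and a bare LF (0x0A).
binary_part = b"\x89PNG\r\n\x1a\n"


def to_local(data: bytes) -> bytes:
    """Hypothetical storage step: convert CRLF to a local LF line ending."""
    return data.replace(b"\r\n", b"\n")


def to_wire(data: bytes) -> bytes:
    """Hypothetical retrieval step: convert every LF back to CRLF."""
    return data.replace(b"\n", b"\r\n")


stored = to_local(binary_part)   # the CR LF inside the binary data is now LF
restored = to_wire(stored)       # both LFs become CR LF, including the bare one

# The round trip is not lossless: the original message cannot be recovered.
round_trip_ok = restored == binary_part
```

Once the conversion has been applied there is no way to tell which LF octets were originally CR LF, which is why storing the message octets untouched is the safer choice.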

In <vyDCOksjW0T5JTUD+zj/Vw(_dot_)md5(_at_)prosecco(_dot_)oryx(_dot_)com>:

If origmsg->subject is "=?latin_1?q?=80?=" and the user doesn't change the subject, newmsg->subject is "Re: =?latin_1?q?=80?=". If the decoder knows whether the string could have been generated by a reasonably conservative generator, that case can be avoided. A very tricky decision.

Observing RFC 2047 rules, "=?latin_1?q?=80?=" should remain unchanged -- it's NOT an encoded-word. RFC 2047 section 3 requires that the charset name be one that is allowed in a MIME charset parameter for media type text/plain or that it be registered for use with text/plain. The rules for text/plain are in RFC 2046 section 4.1.2 which
states (in part):

  "No character set name other than those defined above may be used in
   Internet mail without the publication of a formal specification and
   its registration with IANA, or by private agreement, in which case
   the character set name must begin with "X-"."

The "defined above" text refers to the us-ascii and iso-8859-X charsets. "latin_1" is not registered nor is it in the initial set of MIME-compatible charsets in RFC 2046 (all of which are now registered), and it obviously does not begin with "X-",
therefore the RFC 822 atom containing "latin_1" is NOT an encoded-word. It
should be displayed verbatim and should remain unchanged.
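
The check described above can be sketched roughly as follows, in Python. This is only an illustration: REGISTERED_CHARSETS is a tiny hypothetical stand-in for the IANA registry, the function name is invented, and the regular expression is a simplification of the RFC 2047 grammar (with the optional "*language" suffix).

```python
import re

# Hypothetical stand-in for the IANA charset registry (names and aliases),
# compared case-insensitively. A real MUA would carry the full list.
REGISTERED_CHARSETS = {"us-ascii", "utf-8", "iso-8859-1", "iso-8859-15", "latin1"}

# Simplified encoded-word syntax: =?charset[*language]?encoding?encoded-text?=
# The encoded-text is printable ASCII other than "?" and space.
ENCODED_WORD = re.compile(r"=\?([^?*\s]+)(?:\*[^?\s]+)?\?([bBqQ])\?([!->@-~]*)\?=")


def is_decodable_encoded_word(token: str) -> bool:
    """True only if token has encoded-word syntax and a recognized charset."""
    if len(token) > 75:          # RFC 2047 limits an encoded-word to 75 characters
        return False
    match = ENCODED_WORD.fullmatch(token)
    if match is None:
        return False
    # No fuzzy matching: "latin_1" is not "latin1".
    return match.group(1).lower() in REGISTERED_CHARSETS
```

With this sketch, "=?iso-8859-1?q?=E9?=" is accepted, while "=?latin_1?q?=80?=" is rejected and would therefore be shown and kept verbatim.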

In <vblVCzk7yuOTmh4aNJKFdQ(_dot_)md5(_at_)libertango(_dot_)oryx(_dot_)com>:

Suppose the original message had "Subject: =?latin_1?q?The price is =80216". The MUA fuzzily matches latin_1 to the IANA-defined alias latin1, knows about the Microsoft breakage, and presents the user with "Subject: The price is €216".

That's where things go wrong. There's no encoded-word -- "=?latin_1?q?The" and "=80216" should be displayed verbatim. Even if the spaces in the subject were replaced with underscores, and the subject ended with "?=" -- as would be the case in a real encoded-word -- the subject would have to be displayed verbatim as it still would not contain a valid encoded-word. And as Keith Moore has pointed out, if it so happened that "latin_1" were valid, but not recognized by the MUA in question, it should still be displayed verbatim (because that MUA has no way to know what to display). In this case the "fuzzily matches" is wrong, and two wrongs don't make a right.

The reply is a message from the user to the recipient(s), and should faithfully encode whatever the user saw and typed. The MUA SHOULD NOT substitute some other text of unknown meaning for its user's text.

And that's exactly *why* "fuzzily matches" is wrong -- it involves substituting text
of unknown meaning for the original text.

In <xqRenBEr5R5kjjJ6Xlpwgw(_dot_)md5(_at_)prosecco(_dot_)oryx(_dot_)com>:

Something like this, usually:

   a = new QLineEdit(...); // single-line edit field for the subject
   if ( reply )
       a->setText( "Re: " + orig->subject() );
   else if ( forwarding )
       a->setText( "Fwd: " + orig->subject() );

Why the special case for "Fwd: "? Where is that standardized? Why not "FW: "?

If the editor were internal the MUA could do 2047-decoding for display purposes and keep the raw data as its basic storage. But since the editor is external, the MUA must do 2047-decoding and hand the result to the editor. Later, when the editor hands it back, the "obvious" way is to 2047-encode the editor's result and use it. Then there's only one encoder to write and test, and it's used for original messages, for forwarding and for replies. Less to write, less to test, fewer bugs.

Editing of the subject field is contrary to "to be kind and courteous that program should use the same subject field". That's a design decision for an MUA author. Clearly, eliminating editing means "[l]ess to write, less to test, fewer bugs". If editing is desired, that need not mean that the entire field is decoded, then re-encoded; real-world editing usually means that some portion(s) of the text are added, deleted, or changed, while much is unchanged.
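
One possible shape for this, sketched in Python with the standard email.header module: keep the raw field alongside the decoded text handed to the editor, and re-encode only if the text that comes back actually differs. The function names are hypothetical, and the choice of utf-8 for re-encoding is an assumption, not a recommendation from this thread.

```python
from email.header import Header, decode_header, make_header


def display_form(raw_subject: str) -> str:
    """Decode a raw Subject value to the text that is handed to the editor."""
    return str(make_header(decode_header(raw_subject)))


def subject_after_editing(raw_subject: str, edited_text: str) -> str:
    """Return the field body to send once the editor hands the text back."""
    if edited_text == display_form(raw_subject):
        # Untouched by the user: reuse the original octets verbatim,
        # same charset, same Q/B encoding, same encoded-word boundaries.
        return raw_subject
    if edited_text.isascii():
        return edited_text       # plain ASCII needs no encoded-words
    # Assumption: re-encode changed non-ASCII text as utf-8.
    return Header(edited_text, "utf-8").encode()
```

This only covers the all-or-nothing case. A finer-grained version could compare the edited text against the decoded pieces and keep the original encoded-words for whatever portions are unchanged.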

"Re: " is an interesting case. To be kind, courteous, and RFC 2277 conforming, one should indicate the language. So that should probably be "=?us-ascii*la?q?Re:?= " (or the B encoded equivalent, and/or using the ISO 3-letter code for Latin, "lat", and/or using any other MIME-compatible charset, and/or another capitalization variant...). And the reply should have appropriate References and In-Reply-To fields, assuming that the original had a

Subject is supposed to be an unstructured field, but things like "Re: " impose unnecessary and useless structure. Consider

   Subject: FW: Sv: Fwd: Re^2: =?us-ascii*en-us?q?Auto:_?= =?iso-8859-1*lat?q?Re:?= RE: Auto: cmsg sendsys

Do you really want to have to be able to recognize and handle every type of hack to the subject field, in every possible combination of capitalization, in encoded or unencoded form? What do any of them indicate that isn't already evident via MIME packaging, Resent- fields, References, In-Reply-To, and Message-ID fields, and/or Auto-Submitted fields?

  • Re: Getting RFC 2047 encoding right, Bruce Lilly <=