Re: Getting RFC 2047 encoding right


Arnt Gulbrandsen wrote:

Hi,

I have a tiny little problem that you chaps may be able to help me with.
Suppose that a mail client receives a message whose subject is encoded according to RFC 2047. Suppose further that the message is decoded and stored in a database somewhere - in RAM, on disk, on a server somewhere. Some time later, the same or a different program sends a reply.
Now, to be kind and courteous that program should use the same subject field, perhaps prefaced by "Re: " (or "Auto: "), such that if the recipient threads based on subject, everything works, no matter whether the recipient supports RFC 2047 or not.
That implies using the same character set, q/b encoding etc. as the original.
To be kind and courteous, the program should use a widely supported character set when 2047-encoding (whether it's composing original messages or replies), and use as few 2047-encoded words as possible.
To be kind and courteous, the program should observe all other rules set out in RFC 2047 and 2822 (including ones which were broken by an earlier message).
To be reliable and bug-free, the algorithm that does all this should be simple and straightforward.
There's a conflict here. How do you all address this?


In reverse order:

The nature of the problem precludes a truly simple algorithm; it's a complex issue.

Some errors can be repaired; others cannot. Attempts to repair errors might be more-or-less successful. Successful repair is contingent on being able to unambiguously determine what was intended.

As far as practicable, original content should be preserved. E.g. in a reply, the address given in the original message's Reply-To (or From) field should be used verbatim (same case, same display name if present, including the same encoded-words if used) in the To field of the reply.

I would extend that to the Subject field, and go so far as to say that "Re:", "Auto:", etc. are best avoided. Incidentally, collating (colloquially "sorting") by subject is not the same as "threading"; the latter entails use of References and/or In-Reply-To fields with Message-ID fields to follow a related "thread" of messages. (Consider a collection of 10 messages with "Subject: Help" and 50 with "Subject: Re: Help" -- collating by Subject (with or w/o stripping "Re: ") doesn't group responses with originals.)
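
To make the distinction concrete, here is a minimal sketch of threading by message identifiers rather than by Subject text. The names ParentMap and threadRoot are made up for illustration, and the sketch assumes the In-Reply-To fields have already been parsed into a map; References handling is omitted.

```cpp
#include <map>
#include <set>
#include <string>

// Maps a Message-ID to the Message-ID named in that message's In-Reply-To
// field. Messages without an In-Reply-To field are absent from the map.
using ParentMap = std::map<std::string, std::string>;

// Follows In-Reply-To links upward and returns the Message-ID of the
// earliest known ancestor (the message itself if it has no known parent).
std::string threadRoot(const ParentMap& parentOf, std::string id) {
    std::set<std::string> seen;              // guards against reference loops
    while (seen.insert(id).second) {
        const auto it = parentOf.find(id);
        if (it == parentOf.end()) break;     // no parent recorded: stop here
        id = it->second;
    }
    return id;
}
```

Two replies that both say "Subject: Re: Help" end up in different threads if their In-Reply-To chains lead to different originals, which is exactly what collating by Subject cannot tell you.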

Non-transient storage of a message is best done in RFC 2822/MIME format, possibly with lossless compression if the tradeoff between space and compression/decompression effort warrants it, and possibly with encryption where necessary or desirable. That does not preclude some additional metadata regarding the message, but the original message ought to be 100% recoverable for use in replies, etc. Even conversion of CRLF to local line endings can be troublesome (consider a multipart MIME message with a binary part containing the octets 0x0D 0x0A).
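
A small sketch of why the line-ending conversion is risky, assuming a naive store that converts CRLF to LF on the way in and LF to CRLF on the way out (both function names are hypothetical). Text survives the round trip; binary content that also contains a bare 0x0A does not.

```cpp
#include <string>

// Naive "to local line endings": drops the CR of every CR LF pair.
std::string crlfToLf(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r' && i + 1 < in.size() && in[i + 1] == '\n')
            continue;                        // skip the CR, keep the LF
        out += in[i];
    }
    return out;
}

// Naive "back to wire format": puts a CR in front of every LF.
std::string lfToCrlf(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '\n') out += '\r';
        out += c;
    }
    return out;
}
```

After crlfToLf, an original 0x0D 0x0A and an original bare 0x0A look identical, so lfToCrlf cannot know which one to restore.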

In <vyDCOksjW0T5JTUD+zj/Vw(_dot_)md5(_at_)prosecco(_dot_)oryx(_dot_)com>:

If origmsg->subject is "=?latin_1?q?=80?=" and the user doesn't change the subject, newmsg->subject is "Re: =?latin_1?q?=80?=". If the decoder knows whether the string could have been generated by a reasonably conservative generator, that case can be avoided. A very tricky decision.

Observing RFC 2047 rules, "=?latin_1?q?=80?=" should remain unchanged -- it's NOT an encoded-word. RFC 2047 section 3 requires that the charset name be one that is allowed in a MIME charset parameter for media type text/plain or that it be registered for use with text/plain. The rules for text/plain are in RFC 2046 section 4.1.2 which states (in part):

" No character set name other than those defined above may be used in
Internet mail without the publication of a formal specification and
its registration with IANA, or by private agreement, in which case
the character set name must begin with "X-".
"

The "defined above" text refers to the us-ascii and iso-8859-X charsets. "latin_1" is not registered nor is it in the initial set of MIME-compatible charsets in RFC 2046 (all of which are now registered), and it obviously does not begin with "X-", therefore the RFC 822 atom containing "latin_1" is NOT an encoded-word. It should be displayed verbatim and should remain unchanged.
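
As a rough sketch of that rule, a decoder could test each whitespace-delimited token along these lines before attempting any decoding. The function name and the tiny charset list are assumptions for illustration; a real MUA would consult the IANA registry and its own decoder tables. The "*language" suffix handling follows RFC 2231.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical, deliberately short list of charsets this program can decode.
static const std::set<std::string> kKnownCharsets = {
    "us-ascii", "iso-8859-1", "iso-8859-15", "utf-8"
};

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// True only if token has the shape =?charset?encoding?text?=, the encoding
// is Q or B, and the charset is one this program recognizes. Anything else
// is to be displayed and stored verbatim.
bool isDecodableEncodedWord(const std::string& token) {
    if (token.size() < 4 || token.size() > 75) return false;  // 75 is the RFC 2047 limit
    if (token.compare(0, 2, "=?") != 0) return false;
    if (token.compare(token.size() - 2, 2, "?=") != 0) return false;

    const std::string inner = token.substr(2, token.size() - 4);
    const std::size_t q1 = inner.find('?');
    if (q1 == std::string::npos) return false;
    const std::size_t q2 = inner.find('?', q1 + 1);
    if (q2 == std::string::npos) return false;
    if (inner.find('?', q2 + 1) != std::string::npos) return false;

    std::string charset = toLower(inner.substr(0, q1));
    const std::string encoding = toLower(inner.substr(q1 + 1, q2 - q1 - 1));
    const std::string text = inner.substr(q2 + 1);

    const std::size_t star = charset.find('*');   // optional "*language" suffix
    if (star != std::string::npos) charset.erase(star);

    if (encoding != "q" && encoding != "b") return false;
    if (text.empty()) return false;
    for (unsigned char c : text)
        if (c <= 0x20 || c >= 0x7f) return false; // printable ASCII, no space
    return kKnownCharsets.count(charset) != 0;
}
```

With this check "=?latin_1?q?=80?=" is rejected because "latin_1" is not in the list, so the token passes through untouched in both directions.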

In <vblVCzk7yuOTmh4aNJKFdQ(_dot_)md5(_at_)libertango(_dot_)oryx(_dot_)com>:

Suppose the original message had "Subject: =?latin_1?q?The price is =80216". The MUA fuzzily matches latin_1 to the IANA-defined alias latin1, knows about the Microsoft breakage, and presents the user with "Subject: The price is €216".

That's where things go wrong. There's no encoded-word -- "=?latin_1?q?The" and "=80216" should be displayed verbatim. Even if the spaces in the subject were replaced with underscores, and the subject ended with "?=" -- as would be the case in a real encoded-word -- the subject would have to be displayed verbatim as it still would not contain a valid encoded-word. And as Keith Moore has pointed out, if it so happened that "latin_1" were valid, but not recognized by the MUA in question, it should still be displayed verbatim (because that MUA has no way to know what to display). In this case the "fuzzily matches" is wrong, and two wrongs don't make a right.

The reply is a message from the user to the recipient(s), and should faithfully encode whatever the user saw and typed. The MUA SHOULD NOT substitute some other text of unknown meaning for its user's text.

And that's exactly *why* "fuzzily matches" is wrong -- it involves substituting text of unknown meaning for the original text.

In <xqRenBEr5R5kjjJ6Xlpwgw(_dot_)md5(_at_)prosecco(_dot_)oryx(_dot_)com>:

Something like this, usually:

a = new QLineEdit(...); // http://doc.trolltech.com/3.0/qlineedit.html
if ( reply )
    a->setText( "Re: " + orig->subject() );
else if ( forwarding )
    a->setText( "Fwd: " + orig->subject() );
a->show();

Why the special case for "Fwd: "? Where is that standardized? Why not "FW: "?

If the editor were internal the MUA could do 2047-decoding for display purposes and keep the raw data as its basic storage. But since the editor is external, the MUA must do 2047-decoding and hand the result to the editor. Later, when the editor hands it back, the "obvious" way is to 2047-encode the editor's result and use it. Then there's only one encoder to write and test, and it's used for original messages, for forwarding and for replies. Less to write, less to test, fewer bugs.

Editing of the subject field is contrary to "to be kind and courteous that program should use the same subject field". That's a design decision for an MUA author. Clearly, eliminating editing means "[l]ess to write, less to test, fewer bugs". If editing is desired, that need not mean that the entire field is decoded, then re-encoded; real-world editing usually means that some portion(s) of the text are added, deleted, or changed, while much is unchanged.
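
One way to avoid wholesale re-encoding, sketched under assumptions: the MUA keeps the raw field body alongside the decoded text it handed to the editor, and only encodes what actually changed. StoredSubject, subjectForReply and the encode callback are hypothetical names; only the "unchanged" and "plain ASCII prefix added" cases are handled, and everything else falls back to the encoder.

```cpp
#include <functional>
#include <string>

// Hypothetical record: the Subject body exactly as received, plus the
// decoded text that was shown to the user and handed to the editor.
struct StoredSubject {
    std::string raw;
    std::string decoded;
};

// True if s is printable ASCII and cannot be mistaken for an encoded-word.
static bool isPlainAscii(const std::string& s) {
    for (unsigned char c : s)
        if (c < 0x20 || c > 0x7e) return false;
    return s.find("=?") == std::string::npos;
}

// Returns the field body to send. encode stands in for the MUA's RFC 2047
// encoder and is called only when the raw original cannot be reused.
std::string subjectForReply(
        const StoredSubject& orig, const std::string& edited,
        const std::function<std::string(const std::string&)>& encode) {
    if (edited == orig.decoded) return orig.raw;   // untouched: reuse verbatim

    const std::size_t n = orig.decoded.size();
    if (edited.size() > n &&
        edited.compare(edited.size() - n, std::string::npos, orig.decoded) == 0) {
        const std::string prefix = edited.substr(0, edited.size() - n);
        // A plain prefix ending in a space keeps the original encoded-words
        // properly delimited, so the raw text can follow it unchanged.
        if (isPlainAscii(prefix) && prefix.back() == ' ')
            return prefix + orig.raw;
    }
    return encode(edited);                         // anything else: re-encode
}
```

The point of the design is that the bytes the original sender chose are never regenerated unless the user really did rewrite them.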

"Re: " is an interesting case. To be kind, courteous, and RFC 2277 conforming, one should indicate the language. So that should probably be "=?us-ascii*la?q?Re:?=" (or the B encoded equivalent, and/or using the ISO 3-letter code for Latin, "lat", and/or using any other MIME-compatible charset, and/or another capitalization variant...). And the reply should have appropriate References and In-Reply-To fields, assuming that the original had a Message-ID.

Subject is supposed to be an unstructured field, but things like "Re: " impose unnecessary and useless structure. Consider

Subject: FW: Sv: Fwd: Re^2: =?us-ascii*en-us?q?Auto:_?= =?iso-8859-1*lat?q?Re:?= RE: Auto: cmsg sendsys

Do you really want to have to be able to recognize and handle every type of hack to the subject field, in every possible combination of capitalization, in encoded or unencoded form? What do any of them indicate that isn't already evident via MIME packaging, Resent- fields, References, In-Reply-To, and Message-ID fields, and/or Auto-Submitted fields?