Re: Getting RFC 2047 encoding right
2004-01-04 02:37:22
Arnt Gulbrandsen wrote:
Hi,
I have a tiny little problem that you chaps may be able to help me with.
Suppose that a mail client receives a message whose subject is encoded
according to RFC 2047. Suppose further that the message is decoded and
stored in a database somewhere - in RAM, on disk, on a server
somewhere. Some time later, the same or a different program sends a
reply.
Now, to be kind and courteous that program should use the same subject
field, perhaps prefaced by "Re: " (or "Auto: "), such that if the
recipient threads based on subject, everything works, no matter whether
the recipient supports RFC 2047 or not.
That implies using the same character set, q/b encoding etc. as the
original.
To be kind and courteous, the program should use a widely supported
character set when 2047-encoding (whether it's composing original
messages or replies), and use as few 2047-encoded words as possible.
To be kind and courteous, the program should observe all other rules
set out in RFC 2047 and 2822 (including ones which were broken by an
earlier message).
To be reliable and bug-free, the algorithm that does all this should
be simple and straightforward.
There's a conflict here. How do you all address this?
In reverse order:
The nature of the problem precludes a truly simple algorithm; it's a
complex issue.
Some errors can be repaired; others cannot. Attempts to repair errors
might be more-or-less successful. Successful repair is contingent on
being able to unambiguously determine what was intended.
As far as practicable, original content should be preserved. E.g. in a
reply, the address given in the original message's Reply-To (or From)
field should be used verbatim (same case, same display name if present,
including the same encoded-words if used) in the To field of the reply.
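As a rough illustration of "verbatim", here is a minimal Python sketch
that pulls the raw Reply-To (or From) value out of a message without
decoding or re-encoding it. The function name and the sample message are
made up for the example; it assumes unfolded header fields and the
standard library's default parsing policy.

```python
from email import message_from_string


def reply_to_value(raw_message: str) -> str:
    """Return the raw Reply-To value, falling back to From, untouched."""
    # The default (compat32) policy returns header values as received,
    # so encoded-words, case and display names are not altered.
    msg = message_from_string(raw_message)
    value = msg.get("Reply-To") or msg.get("From")
    if value is None:
        raise ValueError("message has neither Reply-To nor From")
    return value


# Hypothetical original message, used only for this example.
original = (
    "From: =?iso-8859-1?Q?J=F8rgen?= <Jorgen@Example.ORG>\r\n"
    "Subject: Help\r\n"
    "\r\n"
    "body\r\n"
)
reply_to = reply_to_value(original)
```

The returned string can be placed in the reply's To field as-is.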
I would extend that to the Subject field, and go so far as to say that
"Re:", "Auto:", etc. are best avoided. Incidentally, collating
(colloquially "sorting") by subject is not the same as "threading"; the
latter entails use of References and/or In-Reply-To fields with
Message-ID fields to follow a related "thread" of messages. (Consider a
collection of 10 messages with "Subject: Help" and 50 with
"Subject: Re: Help" -- collating by Subject (with or w/o stripping
"Re: ") doesn't group responses with originals.)
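To make the difference concrete, here is a small Python sketch
contrasting the two approaches. The message IDs, the tuple layout and
both function names are invented for the example, and only In-Reply-To
is followed (a fuller version would also consult References).

```python
from collections import defaultdict

# (message_id, in_reply_to, subject): hypothetical sample data with two
# unrelated "Help" requests and one reply to each.
messages = [
    ("<a1@example.org>", None, "Help"),
    ("<b1@example.org>", None, "Help"),
    ("<a2@example.org>", "<a1@example.org>", "Re: Help"),
    ("<b2@example.org>", "<b1@example.org>", "Re: Help"),
]


def thread_roots(msgs):
    """Group message IDs under the root reached via In-Reply-To links."""
    parent = {mid: irt for mid, irt, _ in msgs}
    threads = defaultdict(list)
    for mid, _, _ in msgs:
        root, seen = mid, set()
        # Walk up while the parent is a known message; 'seen' guards
        # against reference loops.
        while parent.get(root) in parent and root not in seen:
            seen.add(root)
            root = parent[root]
        threads[root].append(mid)
    return dict(threads)


def collate_by_subject(msgs):
    """Group message IDs by subject, stripping one leading 'Re: '."""
    groups = defaultdict(list)
    for mid, _, subject in msgs:
        key = subject[4:] if subject.lower().startswith("re: ") else subject
        groups[key].append(mid)
    return dict(groups)
```

With this data, thread_roots() yields two threads of two messages each,
while collate_by_subject() lumps all four messages into one group.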
Non-transient storage of a message is best done in RFC 2822/MIME format,
possibly with lossless compression if the tradeoff between space and
compression/decompression effort warrants it, and possibly with
encryption where necessary or desirable. That does not preclude some
additional metadata regarding the message, but the original message
ought to be 100% recoverable for use in replies, etc. Even conversion of
CRLF to local line endings can be troublesome (consider a multipart MIME
message with a binary part containing the octets 0x0D 0x0A).
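A tiny Python demonstration of the line-ending hazard, using an invented
six-octet payload; the two replace() calls stand in for whatever
conversion a store might apply on the way in and out.

```python
# A binary payload that happens to contain the octets 0x0D 0x0A.
binary_part = bytes([0x89, 0x50, 0x0D, 0x0A, 0x1A, 0x0A])

# Naive "convert CRLF to local line endings" pass on the way into storage.
stored = binary_part.replace(b"\r\n", b"\n")

# Converting back on the way out cannot tell which LF octets were
# originally part of a CRLF pair, so the lone 0x0A is expanded too.
restored = stored.replace(b"\n", b"\r\n")

round_trip_ok = restored == binary_part
```

Here round_trip_ok ends up False: the original octets are not
recoverable from the stored form.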
In <vyDCOksjW0T5JTUD+zj/Vw(_dot_)md5(_at_)prosecco(_dot_)oryx(_dot_)com>:
If origmsg->subject is "=?latin_1?q?=80?=" and the user doesn't change
the subject, newmsg->subject is "Re: =?latin_1?q?=80?=". If the
decoder knows whether the string could have been generated by a
reasonably conservative generator, that case can be avoided. A very
tricky decision.
Observing RFC 2047 rules, "=?latin_1?q?=80?=" should remain unchanged --
it's NOT an encoded-word. RFC 2047 section 3 requires that the charset
name be one that is allowed in a MIME charset parameter for media type
text/plain or that it be registered for use with text/plain. The rules
for text/plain are in RFC 2046 section 4.1.2 which states (in part):
" No character set name other than those defined above may be used in
Internet mail without the publication of a formal specification and
its registration with IANA, or by private agreement, in which case
the character set name must begin with "X-".
"
The "defined above" text refers to the us-ascii and iso-8859-X charsets.
"latin_1" is not registered nor is it in the initial set of
MIME-compatible charsets in RFC 2046 (all of which are now registered),
and it obviously does not begin with "X-", therefore the RFC 822 atom
containing "latin_1" is NOT an encoded-word. It should be displayed
verbatim and should remain unchanged.
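The check described above might be sketched in Python roughly as
follows. KNOWN_CHARSETS is a tiny illustrative subset (a real
implementation would consult the IANA charset registry), the regular
expression is a simplification of the RFC 2047 token grammar with the
RFC 2231 "*language" suffix allowed, and the names are invented for the
example.

```python
import re

# Illustrative subset only; assumption: a real MUA would carry the full
# IANA registry including aliases such as "latin1".
KNOWN_CHARSETS = {"us-ascii", "iso-8859-1", "iso-8859-15", "utf-8", "latin1"}

# =?charset[*language]?encoding?encoded-text?=  (simplified grammar)
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*\s]+)(?:\*(?P<lang>[^?\s]+))?"
    r"\?(?P<enc>[BbQq])\?(?P<text>[!->@-~]+)\?="
)


def is_encoded_word(atom: str) -> bool:
    """True only for a well-formed encoded-word with an acceptable charset."""
    if len(atom) > 75:  # RFC 2047 length limit for an encoded-word
        return False
    m = ENCODED_WORD.fullmatch(atom)
    if m is None:
        return False
    charset = m.group("charset").lower()
    # Registered name, or a private "X-" name; no fuzzy matching.
    return charset in KNOWN_CHARSETS or charset.startswith("x-")
```

Under these assumptions is_encoded_word("=?latin_1?q?=80?=") is False,
while the same atom with "iso-8859-1" as the charset is accepted.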
In <vblVCzk7yuOTmh4aNJKFdQ(_dot_)md5(_at_)libertango(_dot_)oryx(_dot_)com>:
Suppose the original message had "Subject: =?latin_1?q?The price is
=80216". The MUA fuzzily matches latin_1 to the IANA-defined alias
latin1, knows about the Microsoft breakage, and presents the user with
"Subject: The price is €216".
That's where things go wrong. There's no encoded-word --
"=?latin_1?q?The" and "=80216" should be displayed verbatim. Even if the
spaces in the subject were replaced with underscores, and the subject
ended with "?=" -- as would be the case in a real encoded-word -- the
subject would have to be displayed verbatim as it still would not
contain a valid encoded-word. And as Keith Moore has pointed out, if it
so happened that "latin_1" were valid, but not recognized by the MUA in
question, it should still be displayed verbatim (because that MUA has no
way to know what to display). In this case the "fuzzily matches" is
wrong, and two wrongs don't make a right.
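For illustration, a display routine that follows this rule could look
something like the Python sketch below. SUPPORTED is a made-up list of
what one hypothetical MUA can render, and the function names are
invented; anything that is not a valid encoded-word in a supported
charset is passed through untouched.

```python
import base64
import binascii
import re

WORD = re.compile(r"=\?([^?*\s]+)(?:\*[^?\s]+)?\?([BbQq])\?([!->@-~]+)\?=")
# MIME charset name -> Python codec; hypothetical list for this example.
SUPPORTED = {"us-ascii": "ascii", "iso-8859-1": "latin-1", "utf-8": "utf-8"}


def decode_word(atom):
    """Return decoded text, or None if the atom must be shown verbatim."""
    m = WORD.fullmatch(atom)
    if m is None or len(atom) > 75:
        return None
    codec = SUPPORTED.get(m.group(1).lower())
    if codec is None:
        return None  # unknown charset: no fuzzy matching
    try:
        if m.group(2) in "Bb":
            raw = base64.b64decode(m.group(3), validate=True)
        else:
            raw = binascii.a2b_qp(m.group(3), header=True)
        return raw.decode(codec)
    except (binascii.Error, UnicodeDecodeError):
        return None  # undecodable content is also shown verbatim


def display_subject(raw):
    """Decode valid encoded-words; leave everything else exactly as is."""
    out, prev_decoded, pending_ws = [], False, ""
    for piece in re.split(r"(\s+)", raw):
        if piece == "":
            continue
        if piece.isspace():
            pending_ws = piece
            continue
        text = decode_word(piece)
        # Whitespace between two adjacent encoded-words is not displayed.
        if not (text is not None and prev_decoded):
            out.append(pending_ws)
        out.append(piece if text is None else text)
        prev_decoded, pending_ws = text is not None, ""
    out.append(pending_ws)
    return "".join(out)
```

With this sketch the "latin_1" subject from the example comes back
byte-for-byte unchanged, so a reply built from it carries the same text
the sender wrote.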
The reply is a message from the user to the recipient(s), and should
faithfully encode whatever the user saw and typed. The MUA SHOULD NOT
substitute some other text of unknown meaning for its user's text.
And that's exactly *why* "fuzzily matches" is wrong -- it involves
substituting text of unknown meaning for the original text.
In <xqRenBEr5R5kjjJ6Xlpwgw(_dot_)md5(_at_)prosecco(_dot_)oryx(_dot_)com>:
Something like this, usually:
a = new QLineEdit(...); // http://doc.trolltech.com/3.0/qlineedit.html
if ( reply )
    a->setText( "Re: " + orig->subject() );
else if ( forwarding )
    a->setText( "Fwd: " + orig->subject() );
a->show();
Why the special case for "Fwd: "? Where is that standardized? Why not
"FW: "?
If the editor were internal the MUA could do 2047-decoding for display
purposes and keep the raw data as its basic storage. But since the
editor is external, the MUA must do 2047-decoding and hand the result
to the editor. Later, when the editor hands it back, the "obvious" way
is to 2047-encode the editor's result and use it. Then there's only one
encoder to write and test, and it's used for original messages, for
forwarding and for replies. Less to write, less to test, fewer bugs.
Editing of the subject field is contrary to "to be kind and courteous
that program should use the same subject field". That's a design
decision for an MUA author. Clearly, eliminating editing means "[l]ess
to write, less to test, fewer bugs". If editing is desired, that need
not mean that the entire field is decoded, then re-encoded; real-world
editing usually means that some portion(s) of the text are added,
deleted, or changed, while much is unchanged.
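One possible shape for such partial editing, as a Python sketch: keep
the raw field alongside the text shown to the user, and reuse the raw
form wherever the shown text survived the edit. Only the "unchanged" and
"ASCII prefix added" cases are handled; the function name, its
parameters, and the UTF-8 fallback are assumptions made for the example.

```python
from email.header import Header


def rebuild_subject(raw_original, shown_original, edited):
    """Re-use the original raw field body where the user left it alone."""
    # Nothing changed: send back exactly what was received.
    if edited == shown_original:
        return raw_original
    # Only a prefix was added in front of the original text.
    if shown_original and edited.endswith(shown_original):
        prefix = edited[: len(edited) - len(shown_original)]
        # The prefix must be plain ASCII and end in whitespace so that it
        # cannot run into a leading encoded-word of the original.
        if prefix.isascii() and prefix[-1].isspace() and "=?" not in prefix:
            return prefix + raw_original
    # Anything else: fall back to encoding the whole edited text.
    if edited.isascii() and "=?" not in edited:
        return edited
    return Header(edited, "utf-8").encode()
```

A fuller version would diff the edited text against the shown text and
preserve every unchanged encoded-word, not just a trailing run.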
"Re: " is an interesting case. To be kind, courteous, and RFC 2277
conforming, one should indicate the language. So that should probably be
"=?us-ascii*la?q?Re:?= " (or the B encoded equivalent, and/or using the
ISO 3-letter code for Latin, "lat", and/or using any other
MIME-compatible charset, and/or another capitalization variant...). And
the reply should have appropriate References and In-Reply-To fields,
assuming that the original had a Message-ID.
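The threading fields for a reply can be derived mechanically from the
original. A minimal Python sketch follows, with an invented function
name; it simplifies the RFC 2822 rules by ignoring the original's
In-Reply-To when the original has no References field.

```python
def reply_threading_fields(orig_message_id, orig_references=None):
    """Return (In-Reply-To, References) values for a reply.

    Both are None when the original had no Message-ID, since there is
    then nothing to refer to.
    """
    if not orig_message_id:
        return None, None
    # References = the original's References followed by its Message-ID.
    refs = orig_references.split() if orig_references else []
    refs.append(orig_message_id)
    return orig_message_id, " ".join(refs)
```

For an original with Message-ID "<c@example.org>" and References
"<a@example.org> <b@example.org>", this gives In-Reply-To
"<c@example.org>" and References listing all three IDs in order.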
Subject is supposed to be an unstructured field, but things like "Re: "
impose unnecessary and useless structure. Consider
Subject: FW: Sv: Fwd: Re^2: =?us-ascii*en-us?q?Auto:_?=
=?iso-8859-1*lat?q?Re:?= RE: Auto: cmsg sendsys
Do you really want to have to be able to recognize and handle every type
of hack to the subject field, in every possible combination of
capitalization, in encoded or unencoded form? What do any of them
indicate that isn't already evident via MIME packaging, Resent- fields,
References, In-Reply-To, and Message-ID fields, and/or Auto-Submitted
fields?
- Re: Getting RFC 2047 encoding right, Bruce Lilly