Re: The last structural shortcoming of MIME: how to remove it

1995-12-04 08:58:14
Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> wrote in message

Olle: In general, I like the proposal, *except* that it would allow a 
charset to be associated with any MIME parameter.  For some parameters, 
a charset might be meaningless (when the parameter describes non-
character data); for others, there is sometimes a need to limit the 
charsets which might be used.

The specification of a parameter can restrict the syntax of its
legal values. If charset indicators are forbidden, a conforming
implementation must not send values including such indicators.
If it receives such values, the charset indicator should be

Any proposal to extend parameters to handle different character sets 
needs to fully specify what happens when the character set in the 
parameter isn't understood by the recipient's mail reader.

I don't think that it should force a certain way of handling
this situation, but it should suggest one or a few reasonable
ways of doing that.

For example, it's completely unreasonable to expect a mail reader
to recognize a file name in any MIME character set and translate
that character set to whatever is used locally.

It's unreasonable for an arbitrary recipient in the world. But
in a nationally or linguistically closed group of recipients, a
significant part of them might be able to handle the most
frequently used coded character sets in that community in an
intelligent way.

But, as always, a well-behaved _sender_ should itself convert
text from the local coded character set to the first of the
following that can represent it:


1-9) ISO-8859-1 through ISO-8859-9 (but what about ISO-8859-10?)

10)  ISO 10646/Unicode _without_ combining characters, in an
     encoding suitable for the transport

11)  ISO 10646/Unicode _with_ combining characters, in an
     encoding suitable for the transport.
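
The preference order above can be sketched in Python roughly as
follows. This is only an illustration: the function name and the two
"iso-10646-..." labels are made up for the example, and composing the
text to NFC before testing it is my own assumption, not part of the
list.

```python
import unicodedata

# ISO-8859-1 through ISO-8859-9, in the order the list above gives them.
ISO_8859_FAMILY = [f"iso-8859-{n}" for n in range(1, 10)]


def choose_charset(text):
    """Return a label for the first charset in the preference order that fits."""
    # Assumption: compose to NFC first, so that a precomposed letter is
    # preferred over a base letter followed by a combining mark.
    text = unicodedata.normalize("NFC", text)
    for name in ISO_8859_FAMILY:
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    # Hypothetical labels for items 10 and 11; the list does not name them.
    if any(unicodedata.combining(ch) for ch in text):
        return "iso-10646-with-combining"
    return "iso-10646-without-combining"
```

A sender would call choose_charset() on each parameter value and fall
back to the original data plus a charset indicator only when it cannot
do the conversion itself.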

A message-composing program may, however, be unable to correctly
convert text, for which it knows the coded character set, into
one of these character sets. In that case it's better that it
doesn't make a blind fall-back conversion to US-ASCII (which in
some cases can give unwanted results) but keeps the original
data, supplemented with a charset indicator. The recipient, or
perhaps some of the recipients of a mailing list message, may be
able to use it in the original form. Or they may be able to
convert it, without distortion, to a coded character set that
the recipient can use.

For example, if an Icelandic filename, in its original coded
character set, the Macintosh character set for Iceland, is given
for a binary attachment and this is sent to a mainly Icelandic
mailing-list, the recipients that have Macintoshes themselves
can use the filename directly. Those Icelanders that have Unix
or MS Windows systems will quite likely have software support
for converting the IceMac character set to ISO-8859-1. For the
recipients who are unable to handle this character set, what
remains is to replace all non-ASCII characters in the suggested
filename by a suitable substitution character, say "x".
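
A receiving program could handle the three cases along these lines.
This is a minimal sketch: suggested_filename() and its "supported"
argument are hypothetical, and "mac_iceland" is simply the name
Python's codec library uses for the Icelandic Macintosh character set.

```python
def suggested_filename(raw, charset, supported, substitute="x"):
    """Turn the octets of a filename parameter into a usable local name.

    raw       -- the filename as octets in the sender's charset
    charset   -- the charset indicated by the sender
    supported -- charsets this (hypothetical) recipient can decode
    """
    if charset.lower() in supported:
        # The recipient knows the charset and can decode it directly.
        return raw.decode(charset)
    # Otherwise keep the ASCII octets and substitute every other octet.
    return "".join(chr(b) if b < 0x80 else substitute for b in raw)


raw = "Þjóðskrá.txt".encode("mac_iceland")

# A recipient with support for the charset gets the real name back ...
print(suggested_filename(raw, "mac_iceland", {"mac_iceland"}))
# ... and can re-encode it for a Latin-1 system without loss.
print(suggested_filename(raw, "mac_iceland", {"mac_iceland"}).encode("iso-8859-1"))
# A recipient without such support falls back to substitution.
print(suggested_filename(raw, "mac_iceland", set()))
```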

There is an even more difficult situation. The message composing
program may be including files from different machines on a
heterogeneous LAN with, for example, Macintoshes using the
"macintosh" charset, PCs running DOS using one of the "ibm437",
"ibm850", and "ibm865" IBM PC code pages, and PCs running
MS Windows, using "iso-8859-1". Normally, it will have no way
of knowing the coded character set for any filename on this LAN
that contains octets > 127.

Here, too, it's better if the filename is included in the
filename parameter in its original form, with high octets coded
with the %-notation. Either no charset should be indicated, or
the charset value "unknown-8bit" can be used. A character
set-enabled receiving UA can then guess the nationality of the
message from the top level domain name, select the three or four
most frequently used coded character sets in that country,
convert the filename value from each of these character sets to
the local coded character set, and display the different results
to the user as alternative filenames, when he/she wants to save
the message body part to a local file. If any of the guessed
character sets is the correct one, and the user understands the
language of the filename, he/she will immediately see which
alternative is the correct one (by means of so-called
non-artificial intelligence).
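
As a rough sketch of both halves of this procedure in Python: the
table of candidate charsets per top level domain is invented for the
example, the function names are hypothetical, and escaping a literal
"%" as "%25" is my own assumption so that the value can be decoded
unambiguously. The codec names are Python's (mac_roman for
"macintosh", cp850 and cp865 for "ibm850" and "ibm865").

```python
from urllib.parse import unquote_to_bytes

# Hypothetical table: likely charsets per top level domain.
CANDIDATES = {
    "no": ["iso-8859-1", "mac_roman", "cp850", "cp865"],
    "is": ["iso-8859-1", "mac_iceland", "cp850"],
}


def percent_encode(raw):
    """Sender side: keep ASCII as is, code high octets (and '%') as %XX."""
    return "".join(
        chr(b) if b < 0x80 and b != 0x25 else f"%{b:02X}" for b in raw
    )


def alternatives(param_value, sender_domain):
    """Receiver side: decode the value under each guessed charset."""
    raw = unquote_to_bytes(param_value)
    tld = sender_domain.rsplit(".", 1)[-1].lower()
    results = []
    for charset in CANDIDATES.get(tld, ["iso-8859-1"]):
        try:
            results.append((charset, raw.decode(charset)))
        except UnicodeDecodeError:
            continue  # these octets are not valid in that charset
    return results


# A DOS machine using code page 865 contributes the file name.
value = percent_encode("blåbær.txt".encode("cp865"))
for charset, name in alternatives(value, "example.no"):
    print(f"{charset:12} {name!r}")
```

The user is shown one line per guess and picks the one that reads as
a real word; nothing in the program decides which guess is right.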