Re: Character-set header (was Re: Minutes of the Atlanta 822ext meeting)

Nathaniel writes:

Like Einar and Mark, I'm very sorry to have missed the Atlanta meeting,
where it seems (to me) that negative progress was made.  I'm delighted
that we appear to be converging on a prohibition against nested
encodings.  Now I'd like to add my voice to those who are unhappy with
the separate Character-set header.


First of all, let me say that this is not a show-stopper for me -- yet. I was
strongly in favor of the Content-Charset: header; I don't believe that there
were any strong objections to it from any attendee. But while I can live
without it -- barely -- there were others who proposed and supported it 
strongly. They should speak up now, or risk losing it ;-)

The biggest problem I have with this header is that it will, much of the
time, be meaningless.  I have a problem with headers that are defined
because their meaning is semantically crucial to the message, but which
have no rational application.  Let's consider the 9 message types
defined by RFC-XXXX:


I don't find this to be a particularly valid argument against or for anything.
The point of having the header is that it will be meaningful some of the time.
Many headers are not meaningful all the time -- it is not an ipso facto
argument against having them.

At this point let me try to justify what I see as the reason for having this
separate header. It is very simple -- I want to be able to translate character
sets without having to worry about the specific subtype of data in the 
bodypart. In this respect I view a character set specification as something a 
little more powerful than a reference that indicates the character set -- it 
says, in effect, that the data is currently meaningful in character set x, and 
if translated in a loss-less fashion to another character set, it will remain
meaningful.

Consider the problem faced by a gateway that is converting from the RFC822
world to a non-RFC822-like system (e.g. a BITNET gateway, or an X.400
gateway, or various other sorts of gateways). In order to do an effective job 
the character set of the message being converted should be known. Without a 
character set header, this information is housed in different (sub)*type fields 
under different types. A table of all the type/subtype combinations is 
necessary to even know where the character set information is 
located! Since this information will change from time to time as new subtypes
are invented, it means that gateways will also have to be extended as new 
subtypes are added. I don't like this -- it breaks the "set it and forget it"
quality I think gateways need. It puts a burden on the gateway maintainer to
update yet another table. And worse, it is unnecessary -- the problem is
totally an artifact of the fact that we don't have a uniform place to specify
this information.
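
To make the difference concrete, here is a rough sketch in Python. It is only
an illustration under my own assumptions: headers are modelled as a plain
dict, and the table entry and the "text/iso-8859-1" form are made-up examples,
not anything defined by RFC-XXXX.

```python
# Hypothetical table: for each top-level type, where the charset is housed.
# A real gateway would need one entry per type/subtype scheme, kept current.
CHARSET_LOCATION = {
    "text": "subtype",  # assumed form: "text/iso-8859-1"
}


def charset_from_table(headers):
    """Locate the charset via the per-type table; None if the type is unknown."""
    ctype = headers.get("Content-Type", "")
    top, _, sub = ctype.partition("/")
    where = CHARSET_LOCATION.get(top.strip().lower())
    if where == "subtype":
        return sub.strip().lower() or None
    # Any type added after the table was written falls through here.
    return None


def charset_from_header(headers):
    """Locate the charset in one uniform place, whatever the content type."""
    value = headers.get("Content-Charset")
    return value.strip().lower() if value else None
```

The first function stops working for any type added after the table was last
edited; the second needs no knowledge of types at all.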

There's also the possibility that the character set specification might appear
as a content-type parameter (with whatever syntax somebody thinks is cool),
or on a random additional header. This possibility _IS_ a show-stopper for
me. I'm willing to have a table in gateways that specifies, for a given
type/subtype/... combination, which piece contains the character set 
specification and how to deal with it. But if that information starts popping
up in other fields, in other syntaxes, it means that new code is necessary
to deal with evolutionary additions to RFC-XXXX. This is too nasty to
contemplate -- and it is stupid and unnecessary. It is a show-stopper for me. 
If character set information is going to be specified, it has to be either in a
subtype or on a separate well-defined header. Nothing else is adequate.

Now, Nathaniel has argued that a charset specification is meaningless for
some content-types, and I agree with this. I'd go as far as listing whether
or not a character set specification on a particular top-level type is
meaningful or should be ignored. (Yet another advantage to having a limited
number of top-level types surfaces here.)

For subtypes under top-level types where a character set specification is
potentially meaningful, the subtypes are constrained to either allow 
specification of a character set or forbid it. The results will be meaningless 
when a character set is specified for a subtype that does not support it. And 
this is a danger. But remember that specification of the wrong type or subtype
is also possible, and no more risky, in my opinion. Thus, while this is
a danger, I think it is no more dangerous than the problem that type/subtype
information already presents. In addition, I can have the same old table
that tells me whether or not the specification is valid. The difference is
that while it still needs to be updated, not updating it will just mean that
someone's message that is noncompliant will not be handled correctly. Just
deserts -- it removes the urgency for keeping the table up-to-date, while
preserving the protection against silliness over time.
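
A rough sketch of that validity table, in Python, might look like the
following. The type names in the table are hypothetical placeholders, and
headers are again modelled as a plain dict; this is an assumption-laden
illustration, not a proposal for the actual list.

```python
# Hypothetical list of top-level types for which a charset header is
# meaningless and should be ignored. The names are illustrative only.
CHARSET_IGNORED = {"audio", "image"}


def effective_charset(headers):
    """Return the charset to act on, or None if absent or to be ignored."""
    top = headers.get("Content-Type", "").partition("/")[0].strip().lower()
    charset = headers.get("Content-Charset")
    if charset is None or top in CHARSET_IGNORED:
        return None
    # Types missing from the table are taken at their word. A stale table
    # only mishandles messages that were noncompliant in the first place.
    return charset.strip().lower()
```

Note that an out-of-date table never blocks a compliant message; it only
fails to catch a silly one.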

Now, given these facts, what does it mean to a UA if it sees a
character-set header?  It means that you have to look at the
Content-type header.  If that happens to be "text", the meaning is
obvious, but otherwise it is at best confusing and at worst totally
undefined.


A UA is a more restrictive case than a gateway. A UA must do something
more than repackage the data in a palatable way; it must interact with the
user and present the data in an acceptable format. From a UA's point of view
having the character set information in a well defined part of the type/subtype
specification is adequate. But it cannot move around -- if it does, the
UA, like the gateway, will need new code, and that is simply unacceptable.

But if the Content-type has a critical impact on the
semantics of the character set specification, why shouldn't it be
specified as part of the content-type?


See above. I only have a big problem with uncontrolled movement of the
specification. I have a little problem with it always being somewhere in the
type/subtype list.

And if only one (or even 2 or 3)
content-type can sensibly have character set information, why not make
that information part of the content-type for that (those) specified
type(s)?


I think it is inevitable that others are going to surface. I see quite a 
few possibilities already.

I have yet to hear any clear reason why the "text/char-set" model is
inadequate, and I find the addition of a Character-set header
potentially very confusing.  I would strongly advocate that we return to
using text subtypes to specify character sets.  This is close to being a
showstopper for me, though I'm trying to keep an open mind.


I'm rather surprised you see it this way. At worst the information is
moved to a well-defined alternate location. If you want to paste it back
into the type/subtype information so you can look up the combination in a
table, it seems that it would be rather easy to do this. Just remember that
while this operation is easy, the reverse (pulling it back out) is hard, since
you don't know where it is without full knowledge of all possible content-type
syntaxes.
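
For what it's worth, the easy direction can be sketched in a few lines of
Python. The "type/charset" key format is my own assumption for illustration,
and headers are a plain dict; nothing here is defined by RFC-XXXX.

```python
def paste_charset(headers):
    """Build a table lookup key by appending the charset to the content type."""
    ctype = headers.get("Content-Type", "").strip()
    charset = headers.get("Content-Charset")
    if not charset:
        return ctype.lower()
    # Assumed key format: "<content-type>/<charset>", lowercased.
    return f"{ctype}/{charset.strip()}".lower()
```

For example, a Content-Type of "text" with a Content-Charset of "ISO-8859-1"
yields the key "text/iso-8859-1". No comparably short function exists for the
reverse, since it would need to know every content-type syntax in use.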

                                        Ned