What about this?
1) In all cases, the charset parameter is required.
There is no default. Failure is an unrecoverable
error, for general applications. Detection is
mandatory.
2) In all cases, all code sequences in
the document must match code sequences allowed
by the encoding specified by the charset parameter.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory.
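Point 2) amounts to a strict decode check. A minimal sketch in Python (my illustration, not part of the proposal; the function name is mine):

```python
def bytes_match_charset(data: bytes, charset: str) -> bool:
    """Return True only if every code sequence in `data` is legal
    in the encoding named by the MIME charset parameter."""
    try:
        data.decode(charset, errors="strict")  # any illegal sequence raises
        return True
    except (UnicodeDecodeError, LookupError):  # LookupError: unknown charset
        return False
```

Corrupted data simply fails the check; no recovery is attempted.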
3) In all cases, if the document starts with a BOM,
the charset parameter must indicate which flavour
of UTF-16 is being used. There is no default.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory, but should
be made so at some future date.
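For 3), a receiver can at least sniff the BOM and compare it against the charset parameter. A rough sketch (the helper names are mine, and UTF-32 is ignored for brevity):

```python
def utf16_flavour_from_bom(data: bytes):
    """Report which UTF-16 byte order the BOM claims, or None if no BOM."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    return None

def bom_agrees_with_charset(data: bytes, charset: str) -> bool:
    """Check that the charset parameter names the flavour the BOM claims."""
    flavour = utf16_flavour_from_bom(data)
    if flavour is None:
        return True  # no BOM, nothing to contradict
    return charset.lower().replace("_", "-") == flavour
```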
4) If the document is sent text/xml, the encoding
parameter of the XML header is not checked. However,
well-behaved systems should rewrite the encoding
attribute of the XML header to agree with charset
parameter.
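The rewrite in 4) might look like the following sketch. It is label-only: a well-behaved transcoder must of course also re-encode the bytes, not just relabel them.

```python
import re

def relabel_xml_encoding(doc: str, charset: str) -> str:
    """Rewrite the encoding pseudo-attribute of the XML declaration
    so that it agrees with the MIME charset parameter."""
    return re.sub(r'(<\?xml[^?>]*encoding=["\'])[^"\']+(["\'])',
                  lambda m: m.group(1) + charset + m.group(2),
                  doc, count=1)
```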
5) If the data is sent application/xml, then
the charset parameter must agree with the
encoding attribute of the XML header. Failure is
an unrecoverable error, for general applications.
Detection is not mandatory.
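Checking 5) is mechanical: extract the encoding pseudo-attribute and compare it with the charset parameter, case-insensitively. A sketch (the helper names are my own):

```python
import re

def xml_decl_encoding(doc: str):
    """Extract the encoding pseudo-attribute from the XML declaration,
    or None if there is no declaration or no encoding attribute."""
    m = re.match(r'<\?xml[^?>]*encoding=["\']([\w.-]+)["\']', doc)
    return m.group(1) if m else None

def charset_agrees(charset_param: str, doc: str) -> bool:
    """True only when the XML header declares an encoding that matches
    the MIME charset parameter."""
    declared = xml_decl_encoding(doc)
    return declared is not None and declared.lower() == charset_param.lower()
```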
6) The rules above can be bent or strengthened for
specialist applications, by specific agreement between
the recipient and sending parties. The main
alteration envisaged would be to allow, as an
obvious error-recovery strategy, that if the
charset parameter is missing, the encoding attribute
of the XML header can be used. Another alteration
envisaged is for some defaulting to be used.
However, specialist applications which require this
behaviour should not, in general, be using text/xml*
or application/xml*.
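For completeness, the error-recovery fallback in 6) would be no more than this (again, by explicit agreement between the parties only):

```python
def effective_charset(charset_param, xml_encoding_attr):
    """Specialist-application fallback from 6): if the charset parameter
    is missing, fall back to the XML declaration's encoding attribute.
    General text/xml and application/xml processing must NOT do this."""
    return charset_param if charset_param else xml_encoding_attr
```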
Discussion:
The reason for 1) is that we have a clash between user expectations
(iso8859-1), RFCs (US-ASCII) and XML defaults (UTF-8). There is
no winnable choice of default.
The reason for 2) is simply to state clearly that error-recovery
from corrupted data is not the norm.
The reason for 3) is that, as Murata-san's proposed
Japanese Profile of XML makes clear, there are Japanese flavours
of Unicode floating about. So just relying on the BOM is not
satisfactory. (Also, I see no reason why it may not be useful
to distinguish in the charset parameter whether Unicode 2 or
Unicode 3 is being used; but that is another issue.) This
issue also impacts 1): if UTF-8 is the default, it is easier
to be lazy, which in turn makes it easier for Japanese data
to be mislabelled as standard UTF-8.
The reason for 4) is that traditionally the text/* types allow
point-to-point transcoding: DOS-to-Mac-to-UNIX newline conversion,
character re-encoding, perhaps even trailing white-space truncation
are the kinds of transformations allowed.
The reason for 5) is that the reason why we have application/xml
as well as text/xml is to prevent point-to-point manipulation of
the data. It should be treated like a binary file. It should
allow end-to-end data integrity.
(There is a fundamental weak point in point-to-point charset
parameter transmission: there is no standard mechanism for
registering the character set of individual files which a
webserver can pick up: furthermore, some programming languages
such as C do not have a character type but operate on storage types,
so the encoding data is not available automatically anyway;
also, on UNIX systems using pipes, there is no parallel channel
available for out-of-band information between the processes on
either side of the pipe, so encoding information may be
difficult to propagate automatically. However, the
point-to-point mechanism of text/xml is clearly generally
useful and usable for single-locale sites and important to
support.)
7) These measures are perhaps more extreme than many people
would wish. That is why the detection requirements are so lax,
and the provisions for bending the rules are spelled out.
Rick Jelliffe