What about this?
1) In all cases, the charset parameter is required.
There is no default. Failure is an unrecoverable
error, for general applications. Detection is
mandatory.
2) In all cases, all code sequences in
the document must match code sequences allowed
by the encoding specified by the charset parameter.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory.
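Point 2) amounts to a strict decode check. A minimal sketch in Python (my illustration, not part of the proposal; the function name is mine):

```python
def bytes_match_charset(data: bytes, charset: str) -> bool:
    """Return True only if every code sequence in `data` is legal
    in the encoding named by the MIME charset parameter."""
    try:
        data.decode(charset, errors="strict")  # any illegal sequence raises
        return True
    except (UnicodeDecodeError, LookupError):  # LookupError: unknown charset
        return False
```

Corrupted data simply fails the check; no recovery is attempted.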
3) In all cases, if the document starts with a BOM,
the charset parameter must indicate which flavour
of UTF-16 is being used. There is no default.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory, but should
be made so at some future date.
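For 3), a receiver can at least sniff the BOM and compare it against the charset parameter. A rough sketch (the helper names are mine, and UTF-32 is ignored for brevity):

```python
def utf16_flavour_from_bom(data: bytes):
    """Report which UTF-16 byte order the BOM claims, or None if no BOM."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    return None

def bom_agrees_with_charset(data: bytes, charset: str) -> bool:
    """Check that the charset parameter names the flavour the BOM claims."""
    flavour = utf16_flavour_from_bom(data)
    if flavour is None:
        return True  # no BOM, nothing to contradict
    return charset.lower().replace("_", "-") == flavour
```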
4) If the document is sent text/xml, the encoding
parameter of the XML header is not checked. However,
well-behaved systems should rewrite the encoding
attribute of the XML header to agree with charset
parameter.
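The rewrite in 4) might look like the following sketch. It is label-only: a well-behaved transcoder must of course also re-encode the bytes, not just relabel them.

```python
import re

def relabel_xml_encoding(doc: str, charset: str) -> str:
    """Rewrite the encoding pseudo-attribute of the XML declaration
    so that it agrees with the MIME charset parameter."""
    return re.sub(r'(<\?xml[^?>]*encoding=["\'])[^"\']+(["\'])',
                  lambda m: m.group(1) + charset + m.group(2),
                  doc, count=1)
```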
5) If the data is sent application/xml, then
the charset parameter must agree with the
encoding attribute of the XML header. Failure is
an unrecoverable error, for general applications.
Detection is not mandatory.
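Checking 5) is mechanical: extract the encoding pseudo-attribute and compare it with the charset parameter, case-insensitively. A sketch (the helper names are my own):

```python
import re

def xml_decl_encoding(doc: str):
    """Extract the encoding pseudo-attribute from the XML declaration,
    or None if there is no declaration or no encoding attribute."""
    m = re.match(r'<\?xml[^?>]*encoding=["\']([\w.-]+)["\']', doc)
    return m.group(1) if m else None

def charset_agrees(charset_param: str, doc: str) -> bool:
    """True only when the XML header declares an encoding that matches
    the MIME charset parameter."""
    declared = xml_decl_encoding(doc)
    return declared is not None and declared.lower() == charset_param.lower()
```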
6) The rules above can be bent or strengthened for
specialist applications, by specific agreement between
the recipient and sending parties. The main
alteration envisaged would be to allow, as an
obvious error-recovery strategy, that if the
charset parameter is missing, the encoding attribute
of the XML header can be used. Another alteration
envisaged is for some defaulting to be used.
However, specialist applications which require this
behaviour should not, in general, be using text/xml*
or application/xml*.
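For completeness, the error-recovery fallback in 6) would be no more than this (again, by explicit agreement between the parties only):

```python
def effective_charset(charset_param, xml_encoding_attr):
    """Specialist-application fallback from 6): if the charset parameter
    is missing, fall back to the XML declaration's encoding attribute.
    General text/xml and application/xml processing must NOT do this."""
    return charset_param if charset_param else xml_encoding_attr
```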
Discussion:
The reason for 1) is that we have a clash between user expectations
(iso8859-1), RFCs (US-ASCII) and XML defaults (UTF-8). There is
no winnable choice of default.
The reason for 2) is simply to state clearly that error-recovery
from corrupted data is not the norm.
The reason for 3) is that, as Murata-san's proposed
Japanese Profile of XML makes clear, there are Japanese flavours
of Unicode floating about. So just relying on the BOM is not
satisfactory. (Also, I see no reason why it may not be useful
to distinguish in the charset parameter whether Unicode 2 or
Unicode 3 is being used; but that is another issue.) This
issue also impacts 1): if UTF-8 is the default, it is easier
to be lazy, which in turn makes it easier for Japanese data
to be mislabelled as standard UTF-8.
The reason for 4) is that traditionally the text/* types allow
point-to-point transcoding: DOS-to-Mac-to-UNIX newline conversion,
character re-encoding, perhaps even trailing white-space truncation
are the kinds of transformations allowed.
The reason for 5) is that the reason why we have application/xml
as well as text/xml is to prevent point-to-point manipulation of
the data. It should be treated like a binary file. It should
allow end-to-end data integrity.
(There is a fundamental weak point in point-to-point charset
parameter transmission: there is no standard mechanism for
registering the character set of individual files which a
webserver can pick up: furthermore, some programming languages
such as C do not have a character type but operate on storage types,
so the encoding data is not available automatically anyway;
also, on UNIX systems using pipes, there is no parallel channel
available for out-of-band information between the processes on
either side of the pipe, so encoding information may be
difficult to propagate automatically. However, the
point-to-point mechanism of text/xml is clearly generally
useful and usable for single-locale sites and important to
support.)
7) These measures are perhaps more extreme than many people
would wish. That is why the detection requirements are so lax,
and the provisions for bending the rules are spelled out.
Rick Jelliffe