In message "Re: Some text that may be useful for the update of RFC 2376",
Rick Jelliffe wrote...
What about this?
1) In all cases, charset parameter is required.
There is no default. Failure is an unrecoverable
error, for general applications. Detection is
mandatory.
This is a change I can agree on.
2) In all cases, all code sequences in
the document must match code sequences allowed
by the encoding specified by the charset parameter.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory.
Agreed. I think that this is not an issue of RFC 2376 but an
issue of XML 1.0.
3) In all cases, if the document starts with a BOM,
the charset parameter must indicate which flavour
of UTF-16 is being used. There is no default.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory, but should
be made so at some future date.
UTF-16LE and UTF-16BE cannot be used for XML. XML mandates
the BOM for utf-16, while utf-16le and utf-16be cannot have
a BOM. For more about this, see RFC 2781.
4) If the document is sent text/xml, the encoding
parameter of the XML header is not checked. However,
well-behaved systems should rewrite the encoding
attribute of the XML header to agree with charset
parameter.
When the recipient has to discard the MIME header, it has
to change the encoding PI. I believe that RFC 2376 already
covers this.
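The rewriting that a well-behaved recipient would do could look roughly like this. A hypothetical sketch (the regex-based approach and function name are assumptions, not anything prescribed by RFC 2376) that makes the encoding pseudo-attribute of the XML declaration agree with the MIME charset parameter:

```python
import re

# Hypothetical sketch: rewrite the encoding pseudo-attribute of the
# XML declaration so that it agrees with the MIME charset parameter.

def rewrite_encoding_decl(xml_text: str, charset: str) -> str:
    decl = re.match(r'<\?xml\s[^?]*\?>', xml_text)
    if decl and 'encoding' in decl.group(0):
        # Replace the existing encoding pseudo-attribute.
        return re.sub(r'encoding\s*=\s*(["\']).*?\1',
                      f'encoding="{charset}"', xml_text, count=1)
    if decl:
        # Declaration present but no encoding: insert one.
        head = decl.group(0)
        new_head = head[:-2].rstrip() + f' encoding="{charset}"?>'
        return xml_text.replace(head, new_head, 1)
    # No XML declaration at all: prepend one.
    return f'<?xml version="1.0" encoding="{charset}"?>' + xml_text

print(rewrite_encoding_decl(
    '<?xml version="1.0" encoding="Shift_JIS"?><doc/>', 'UTF-8'))
# -> <?xml version="1.0" encoding="UTF-8"?><doc/>
```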
5) If the data is sent application/xml then
the charset parameter must agree with the
encoding attribute of the XML header. Failure is
an unrecoverable error, for general applications.
Detection is not mandatory.
In other words, you are proposing that XML-unaware transcoders
should not be used for application/xml. Since I would like to encourage
efficient and generic transcoders, I am reluctant.
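The agreement requirement in point 5 amounts to a simple consistency check. A hypothetical illustration (the function name and regex are my own; nothing here is prescribed by the proposal) comparing the MIME charset parameter with the encoding pseudo-attribute:

```python
import re

# Hypothetical check for point 5: does the MIME charset parameter
# agree with the encoding pseudo-attribute of the XML declaration?

def charsets_agree(charset: str, xml_bytes: bytes) -> bool:
    m = re.search(rb'encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
                  xml_bytes[:100])
    if not m:
        return True  # no encoding declaration: nothing to disagree with
    # Charset names are compared case-insensitively.
    return m.group(1).decode("ascii").lower() == charset.lower()

print(charsets_agree("UTF-8",
                     b'<?xml version="1.0" encoding="utf-8"?><d/>'))
# -> True
print(charsets_agree("UTF-8",
                     b'<?xml version="1.0" encoding="Shift_JIS"?><d/>'))
# -> False
```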
6) The rules above can be bent or strengthened for
specialist applications, by specific agreement between
the recipient and sending parties. The main
alteration envisaged would be to allow, as an
obvious error-recovery strategy, that if the
charset parameter is missing, the encoding attribute
of the XML header can be used. Another alteration
envisaged is for some defaulting to be used.
However, specialist applications which require this
behaviour should not, in general, be using text/xml*
or application/xml*.
Some restrictions are useful for some XML-based media types. For
example, application/iotp-xml might allow Unicode only. I am
willing to mention such restrictions in the I-D.
Discussion:
The reason for 1) is that we have a clash between user expectations
(iso8859-1), RFCs (US-ASCII) and XML defaults (UTF-8). There is
no winnable solution to defaults.
I am personally happy to mandate the charset parameter.
When RFC 2376 was sent to the IAB, the default for text/xml in the
case of HTTP was 8859-1. The IAB suggested US-ASCII.
The reason for 2) is simply to state clearly that error-recovery
from corrupted data is not the norm.
The reason for 3) is that, as Murata-san's proposed
Japanese Profile of XML makes clear, there are Japanese flavours
of Unicode floating about.
As Martin corrected, conversion tables are ambiguous. But there
are no flavors of Unicode.
The reason for 5) is that the reason why we have application/xml
as well as text/xml is to prevent point-to-point manipulation of
the data. It should be treated like a binary file. It should
allow end-to-end data integrity.
I do not understand why we have to prohibit transcoding that
does not rewrite encoding declarations. The main argument against
the charset parameter is that it is often missing or incorrect.
Application/xml allows the omission of the charset parameter.
If it is omitted, we rely on the autodetection described in XML 1.0.
I believe that it was Martin who proposed this compromise in the
W3C XML SIG, and everybody can live with it.
I see no reasons for preserving byte sequences. We only have to
preserve XML information sets.
(There is a fundamental weak point in point-to-point charset
parameter transmission: there is no standard mechanism for
registering the character set of individual files which a
webserver can pick up.)
AddType and AddCharset of Apache allow registration for
each directory. We can also use conventions for file extensions.
It would be great if the W3C team further enhanced Apache.
(Furthermore, some programming languages
such as C do not have a character type but operate on storage types,
so the encoding data is not available automatically anyway.)
Existing programming languages do not support Unicode very well, as
I see it.
(Also, on UNIX systems using pipes, there is no parallel channel
available for out-of-band information between the processes on
either side of the pipe, so encoding information may be
difficult to propagate automatically.)
This is true, but programs interchange DOM data rather than textual
XML.
----
MURATA Makoto muraw3c(_at_)attglobal(_dot_)net