In message "Re: Some text that may be useful for the update of RFC 2376",
Rick Jelliffe wrote...
What about this?
1) In all cases, charset parameter is required.
There is no default. Failure is an unrecoverable
error, for general applications. Detection is
mandatory.
This is a change I can agree on.
2) In all cases, all code sequences in
the document must match code sequences allowed
by the encoding specified by the charset parameter.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory.
Agreed. I think that this is not an issue of RFC 2376 but an
issue of XML 1.0.
3) In all cases, if the document starts with a BOM,
the charset parameter must indicate which flavour
of UTF-16 is being used. There is no default.
Failure is an unrecoverable error, for general
applications. Detection is not mandatory, but should
be made so at some future date.
UTF-16LE and UTF-16BE cannot be used for XML. XML mandates
the BOM for utf-16, while utf-16le and utf-16be cannot have
a BOM. For more about this, see RFC 2781.
4) If the document is sent text/xml, the encoding
parameter of the XML header is not checked. However,
well-behaved systems should rewrite the encoding
attribute of the XML header to agree with charset
parameter.
When the recipient has to discard the MIME header, it has
to change the encoding PI. I believe that RFC 2376 already
covers this.
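The rewriting that a well-behaved recipient would do could look roughly like this. A hypothetical sketch (the regex-based approach and function name are assumptions, not anything prescribed by RFC 2376) that makes the encoding pseudo-attribute of the XML declaration agree with the MIME charset parameter:

```python
import re

# Hypothetical sketch: rewrite the encoding pseudo-attribute of the
# XML declaration so that it agrees with the MIME charset parameter.

def rewrite_encoding_decl(xml_text: str, charset: str) -> str:
    decl = re.match(r'<\?xml\s[^?]*\?>', xml_text)
    if decl and 'encoding' in decl.group(0):
        # Replace the existing encoding pseudo-attribute.
        return re.sub(r'encoding\s*=\s*(["\']).*?\1',
                      f'encoding="{charset}"', xml_text, count=1)
    if decl:
        # Declaration present but no encoding: insert one.
        head = decl.group(0)
        new_head = head[:-2].rstrip() + f' encoding="{charset}"?>'
        return xml_text.replace(head, new_head, 1)
    # No XML declaration at all: prepend one.
    return f'<?xml version="1.0" encoding="{charset}"?>' + xml_text

print(rewrite_encoding_decl(
    '<?xml version="1.0" encoding="Shift_JIS"?><doc/>', 'UTF-8'))
# -> <?xml version="1.0" encoding="UTF-8"?><doc/>
```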
5) If the data is sent application/xml then
the charset parameter must agree with the
encoding attribute of the XML header. Failure is
an unrecoverable error, for general applications.
Detection is not mandatory.
In other words, you are proposing that XML-unaware transcoders
should not be used for application/xml. Since I would like to encourage
efficient and generic transcoders, I am reluctant.
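The agreement requirement in point 5 amounts to a simple consistency check. A hypothetical illustration (the function name and regex are my own; nothing here is prescribed by the proposal) comparing the MIME charset parameter with the encoding pseudo-attribute:

```python
import re

# Hypothetical check for point 5: does the MIME charset parameter
# agree with the encoding pseudo-attribute of the XML declaration?

def charsets_agree(charset: str, xml_bytes: bytes) -> bool:
    m = re.search(rb'encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
                  xml_bytes[:100])
    if not m:
        return True  # no encoding declaration: nothing to disagree with
    # Charset names are compared case-insensitively.
    return m.group(1).decode("ascii").lower() == charset.lower()

print(charsets_agree("UTF-8",
                     b'<?xml version="1.0" encoding="utf-8"?><d/>'))
# -> True
print(charsets_agree("UTF-8",
                     b'<?xml version="1.0" encoding="Shift_JIS"?><d/>'))
# -> False
```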
6) The rules above can be bent or strengthened for
specialist applications, by specific agreement between
the recipient and sending parties. The main
alteration envisaged would be to allow, as an
obvious error-recovery strategy, that if the
charset parameter is missing, the encoding attribute
of the XML header can be used. Another alteration
envisaged is for some defaulting to be used.
However, specialist applications which require this
behaviour should not, in general, be using text/xml*
or application/xml*.
Some restrictions are useful for some XML-based media types. For
example, application/iotp-xml might allow Unicode only. I am
willing to mention such restrictions in the I-D.
Discussion:
The reason for 1) is that we have a clash between user expectations
(iso8859-1), RFCs (US-ASCII) and XML defaults (UTF-8). There is
no winnable solution to defaults.
I am personally happy to mandate the charset parameter.
When RFC 2376 was sent to the IAB, the default for text/xml in the
case of HTTP was 8859-1. The IAB suggested US-ASCII.
The reason for 2) is simply to state clearly that error-recovery
from corrupted data is not the norm.
The reason for 3) is that, as Murata-san's proposed
Japanese Profile of XML makes clear, there are Japanese flavours
of Unicode floating about.
As Martin corrected, conversion tables are ambiguous. But there
are no flavors of Unicode.
The reason for 5) is that the reason why we have application/xml
as well as text/xml is to prevent point-to-point manipulation of
the data. It should be treated like a binary file. It should
allow end-to-end data integrity.
I do not understand why we have to prohibit transcoding that
does not rewrite encoding declarations. The main argument against
the charset parameter is that it is often missing or incorrect.
Application/xml allows the omission of the charset parameter.
If it is omitted, we rely on the autodetection described in XML 1.0.
I believe that it was Martin who proposed this compromise in the
W3C XML SIG, and everybody can live with it.
I see no reasons for preserving byte sequences. We only have to
preserve XML information sets.
(There is a fundamental weak point in point-to-point charset
parameter transmission: there is no standard mechanism for
registering the character set of individual files which a
webserver can pick up.)
AddType and AddCharset of Apache allow registration for
each directory. We can also use conventions for file extensions.
It would be great if the W3C team further enhanced Apache.
(Furthermore, some programming languages
such as C do not have a character type but operate on storage types,
so the encoding data is not available automatically anyway.)
Existing programming languages do not support Unicode very well, as
I see it.
(Also, on UNIX systems using pipes, there is no parallel channel
available for out-of-band information between the processes on
either side of the pipe, so encoding information may be
difficult to propagate automatically.)
This is true, but programs interchange DOM data rather than textual
XML.
----
MURATA Makoto muraw3c(_at_)attglobal(_dot_)net