ietf-xml-mime

Re: [xml-dev] Text/xml with omitted charset parameter

2001-10-28 16:43:37



Elliotte Rusty Harold wrote:

Among other things, this means that the same document may be interpreted with 
a different encoding when read from HTTP than when read from the local file 
system. YUCK! This is a major interoperability problem.

I agree, and I have always asserted that same position. I feel that the
xml encoding declaration, which is under the author's control and can be
created in a standard form by any authoring software, is a far more
secure and error-free way to indicate the encoding than relying on the
vagaries of how (if at all) some server is set up to generate charset
parameters, and what means (separate files, filename conventions, etc.)
it uses to trigger that behavior.

Those people who wished to use non-xml-aware transcoders to do content
encoding conversion, and were not willing to have those transcoders made
xml-aware enough to alter the xml encoding declaration when transcoding,
won (unfortunately, in my view). As a result, for application/xml as
well as for text/xml, the charset parameter *whether present or absent*
overrides the encoding declaration in the xml file.

Thus, simplicity for charset transcoders is achieved at the expense of
complexity for all xml clients, which have to perform a check and
potentially rewrite the xml file when saving it to disk, assuming they
want it to be parseable afterwards.
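To make that burden concrete, here is a hypothetical sketch (not any particular client's code) of the rewrite step: a client that received XML over HTTP, where the charset parameter overrode the document's own declaration, must fix the declaration before saving, or a later parse from disk will pick up the stale value.

```python
import re

def fix_declaration(data: bytes, mime_charset: str) -> bytes:
    """Rewrite (or insert) the xml encoding declaration so it matches
    the charset the HTTP layer actually delivered."""
    text = data.decode(mime_charset)
    new_decl = '<?xml version="1.0" encoding="%s"?>' % mime_charset
    decl = re.match(r'<\?xml[^?]*\?>', text)
    if decl:
        text = new_decl + text[decl.end():]
    else:
        text = new_decl + text
    return text.encode(mime_charset)

# The document declared utf-8, but the server transcoded it to
# iso-8859-1 and said so only in the charset parameter.
body = '<?xml version="1.0" encoding="utf-8"?><p>caf\xe9</p>'.encode('iso-8859-1')
saved = fix_declaration(body, 'iso-8859-1')
```

After the rewrite, the saved bytes declare iso-8859-1 and parse correctly with no HTTP header in sight.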


Perhaps worse yet, since the default for text/xml is us-ascii and not utf-8, 
this means that serving an XML document using any non-ASCII characters over 
HTTP requires the author to set the charset parameter of the MIME media type. 
This is non-trivial in most environments and impossible in many. 

Correct.

According to RFC 3023, "US-ASCII was chosen, since it is the intersection of 
UTF-8 and ISO-8859-1 and since it is already used by MIME." However, this 
really strikes me as insufficient justification given the major practical 
problems it presents for non-ASCII documents. 

This was chosen to retain backwards compatibility with what the MIME
spec allows for any text/* document - treat it as text/plain.

Is there any chance of superseding this RFC with one that specifies UTF-8? 
This still isn't perfect, but it at least allows full use of Unicode.

Interestingly, application/xml does not have this problem, at least not all 
of it.

Yes, although prior to this RFC, application/* had *none* of it because
there was no charset.

In the absence of an explicit charset parameter, then application/xml falls 
back to the normal heuristics for guessing the encoding of an XML document 
(e.g. byte order mark, encoding declaration, etc.) 

Thus, all xml content should be served as application/xml, and use of
text/xml should be deprecated.
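The in-band heuristics mentioned above can be sketched roughly as follows (a simplified version of the autodetection in Appendix F of the XML 1.0 spec; the function name is mine):

```python
import re

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its own bytes:
    byte order mark first, then the encoding declaration."""
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'            # UTF-8 BOM
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'        # little-endian UTF-16 BOM
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'        # big-endian UTF-16 BOM
    # No BOM: look for an encoding declaration in ASCII-compatible bytes.
    m = re.match(br'<\?xml[^?]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode('ascii')
    return 'utf-8'                # the XML default when nothing else is known

print(sniff_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><p/>'))
# -> iso-8859-1
```

The point is that the document is self-describing; no out-of-band charset parameter is needed for this to work.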

There's still a problem if the MIME charset disagrees with the document 
internal information, 

Yes. One of the things that can be done in practical terms is to make it
an error if these disagree.

but in practice this isn't nearly as big a problem. Maybe that's what should 
be done with text/xml as well? It certainly seems to be what Mozilla is 
already doing.
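A minimal sketch of that "make it an error" option, assuming both pieces of information are available to the client (names and error type are illustrative):

```python
import re

def check_consistency(mime_charset, data: bytes):
    """Raise an error when the MIME charset parameter and the xml
    encoding declaration are both present but disagree."""
    m = re.match(br'<\?xml[^?]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    declared = m.group(1).decode('ascii').lower() if m else None
    if mime_charset and declared and mime_charset.lower() != declared:
        raise ValueError('charset %r disagrees with encoding declaration %r'
                         % (mime_charset, declared))

# Consistent: passes silently.
check_consistency('utf-8', b'<?xml version="1.0" encoding="utf-8"?><p/>')
```

Whether the right response to a mismatch is a hard failure or a warning is exactly the policy question under discussion.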

In the meantime, I think I'm going to start recommending the use of 
application/xml and deprecating the use of text/xml.

I agree. Due to the legacy requirements of text/* types, it is really
not very suitable for XML.


-- 
Chris
