Re: Some text that may be useful for the update of RFC 2376



MURATA Makoto wrote:


In message "RE: Some text that may be useful for the update of RFC 2376",
Langer, Paul wrote...

 >We are developing an XML-database that gets input via HTTP.
 >In a previous release we implemented RFC 2376 correctly (for
 >media type text/xml we used the value of the charset parameter to
 >determine the encoding of input documents; if this parameter was
 >omitted we used the default "us-ascii").
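
A minimal sketch of the rule described above (take the charset parameter for text/xml, fall back to "us-ascii" when it is omitted). The function name `text_xml_charset` is hypothetical and only for illustration; header parsing uses Python's standard `email.message.Message`.

```python
from email.message import Message


def text_xml_charset(content_type: str) -> str:
    """Return the charset for a text/xml Content-Type header value.

    Hypothetical helper: follows the rule quoted above, i.e. use the
    charset parameter if present, otherwise default to "us-ascii".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/xml":
        raise ValueError("this sketch only covers text/xml")
    # get_param returns the fallback when the parameter is absent.
    return msg.get_param("charset", "us-ascii").lower()


print(text_xml_charset("text/xml; charset=Shift_JIS"))
print(text_xml_charset("text/xml"))
```

Note that the second call ignores whatever encoding declaration the document itself carries, which is the behaviour under discussion.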

We are all aware of this problem.  We are also aware of transcoders
which change the charset parameter but do not rewrite encoding
declarations.


Yes - such behaviour is clearly broken. Since a transcoder is changing many
or all the other bytes in the file, expecting it to also correctly update
the encoding declaration rather than leaving it broken is not asking too
much.
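
To illustrate how little is being asked, here is a rough sketch of a transcoder that also rewrites the encoding declaration. The name `transcode_xml` and the regular expression are assumptions made for this example, not any existing tool's behaviour; it only handles a declaration at the very start of the entity.

```python
import re

# Matches "<?xml", an optional version pseudo-attribute, and an optional
# encoding pseudo-attribute at the very start of the entity.
# The lookahead keeps "<?xml-stylesheet ...?>" from matching.
_DECL = re.compile(
    r"^<\?xml(?=\s)"
    r"(\s+version\s*=\s*(?:\"[^\"]*\"|'[^']*'))?"
    r"(?:\s+encoding\s*=\s*(?:\"[^\"]*\"|'[^']*'))?"
)


def transcode_xml(data: bytes, source: str, target: str) -> bytes:
    """Convert XML bytes from `source` to `target` encoding and make the
    encoding declaration say `target` (hypothetical helper)."""
    text = data.decode(source)
    if text.startswith("\ufeff"):
        text = text[1:]  # drop a byte order mark left over from decoding
    match = _DECL.match(text)
    if match:
        version = match.group(1) or ""
        text = f'<?xml{version} encoding="{target}"' + text[match.end():]
    else:
        # No declaration at all: add one so the output is self-describing.
        text = f'<?xml version="1.0" encoding="{target}"?>\n' + text
    return text.encode(target)


sjis = '<?xml version="1.0" encoding="Shift_JIS"?>\n<doc>\u65e5\u672c\u8a9e</doc>'.encode("shift_jis")
print(transcode_xml(sjis, "shift_jis", "utf-8").decode("utf-8"))
```

The transcoder already knows both encodings, so the extra work is a single substitution on the first line.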

In Japan, we have a very interesting problem.  We have XML, XSL,
JavaScript, VBScript, CSS, and HTML, which reference each other.  Some
formats provide inline declarations.  Other formats do not.  IE 5.0
appears to assume that if an HTML document is in UTF-16, anything
referenced from this HTML is also in UTF-16.


This assumption of IE5 is not correct, and will get them into severe
trouble if carried forward to XML: different entities can use different
encodings, as the XML spec clearly says.
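
A simplified sketch of deciding the encoding per entity, from that entity's own bytes, rather than inheriting it from whatever referenced it. It is loosely modelled on the autodetection idea in the XML specification but is deliberately incomplete: UTF-32, UTF-16 without a byte order mark, and EBCDIC are not handled. The name `sniff_entity_encoding` is hypothetical.

```python
import codecs
import re

# Byte order marks checked first (UTF-32 is omitted in this sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Encoding pseudo-attribute inside a leading XML or text declaration,
# for encodings whose declaration is readable as ASCII bytes.
_ENCODING_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_entity_encoding(data: bytes) -> str:
    """Guess the encoding of one entity from its own first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # no BOM and no declaration


document = codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")
entity = '<?xml version="1.0" encoding="Shift_JIS"?><part/>'.encode("shift_jis")
print(sniff_entity_encoding(document), sniff_entity_encoding(entity))
```

Each entity is examined independently, so a UTF-16 document and a Shift_JIS entity it references are both read correctly.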

Unfortunately, even
when XML, XSL, and CSS are all in Shift_JIS, an internally generated
HTML is in UTF-16.  Thus, we have data corruption.


It is entirely legal for the CSS stylesheet and for the HTML document to
use different encodings.  There is absolutely no problem with the style
sheet being in (one of the many) Shift-JISes and the HTML or XML document
being in UTF-16. If IE5 or any other browser does not deal with this use
case correctly, it should be fixed.

I have come to believe that we need a single solution for every format.


I have come to believe that taking an out-of-band solution which sort of
works for text/* over HTTP and email, and trying to extend it to
application/*, image/*, video/*, ftp, file and other sorts of
protocol so that it fits all cases, has some very clear problems in
extremely common use cases. For example, 99.999% of content providers
have no control over the configuration of the web server they use, but do
have control of the content that they place there. File-based processing
(on servers, on clients) is also extremely common. Making
these common cases not work, to save a few lines of code in a transcoder
which knows it is converting from encoding A to encoding B and thus knows
what encoding declaration to write out if it could be bothered to do so,
seems highly curious.

The charset parameter is such a solution.


It is one such solution. There are better ones, and indeed a much better
one in the XML specification. Wisely, XML instances which are read with the
wrong encoding give well-formedness errors and halt. This is excellent.
Unfortunately, complicating the issue with sometimes-there Content-Type
headers with sometimes-there encoding ("charset") declarations reverts the
state of play to transport-dependent defaults and "some sort of error
correction", a world we are trying to move away from, a world of silent
data corruption.
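
A small sketch of the halting behaviour, using Python's standard `xml.etree.ElementTree` (which wraps expat). It shows one case only: Shift_JIS bytes under a declaration that claims UTF-8. The helper name `parses_cleanly` is hypothetical, and not every mismatch between two encodings is detectable this way.

```python
import xml.etree.ElementTree as ET


def parses_cleanly(data: bytes) -> bool:
    """Return True if the parser accepts the bytes, False if it halts
    with a well-formedness error."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


text = '<?xml version="1.0" encoding="UTF-8"?><doc>\u65e5\u672c\u8a9e</doc>'

honest = text.encode("utf-8")         # bytes match the declaration
mislabelled = text.encode("shift_jis")  # bytes contradict the declaration

print(parses_cleanly(honest))
print(parses_cleanly(mislabelled))
```

The mislabelled input is rejected outright rather than being silently turned into the wrong characters.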

We should not try to bend
specifications only to invent an ad-hoc solution for a particular format.


I can only agree with that sentence by replacing "format" with "protocol".

On the contrary, I believe that the excellent basis of XML should not be
bent to cope with historical details in text/* media types and their
divergent defaults depending on transport protocol.

Let us strongly request internationalized WWW browsers & servers from
Microsoft and Netscape.


No one could fail to endorse such a general message (and to add the many
other suppliers of technology to the list) but it does not follow that your
preferred complete reliance on the charset parameter will achieve such an
end - in fact, I would say rather the reverse.

Earlier, I suggested revised rules for encoding determination which are
completely rigorous and deterministic, do not rely on any hand waving or
ill-specified error correction, allow automated content creation tools
to do the right thing simply and easily, and allow XML files to work
correctly in all cases and never to have multiple inconsistent sources of
encoding declaration. I would consider such rules to be an essential step
towards the worthy (and indeed, readily achievable) goal in your message
above.


--
Chris