ietf-xml-mime

Re: Some text that may be useful for the update of RFC 2376

2000-03-22 10:43:29


MURATA Makoto wrote:

In message "Re: Some text that may be useful for the update of RFC 2376",
Chris Lilley wrote...
 >and attempt to
 >stretch this to make loose and wooly the current, fairly good state of
 >encoding declaration of XML files?

Unfortunately, we do not have "fairly good state of encoding declaration
of XML files".  People generate XML documents by XSLT or their own programs,
and fail to specify the correct charset.

That is not a problem. Such files will not be well-formed and will thus
fail to parse.
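As an aside (a sketch of mine, not part of the original mail): a conforming parser rejects a document whose bytes do not match its declared encoding, which is exactly the failure mode described above. For example, in Python:

```python
# Sketch: the bytes are ISO-8859-1, but the declaration claims UTF-8,
# so the 0xE9 byte in "café" is an invalid UTF-8 sequence and a
# conforming parser rejects the document as not well-formed.
import xml.etree.ElementTree as ET

data = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'.encode("iso-8859-1")
try:
    ET.fromstring(data)
    print("parsed")
except ET.ParseError as e:
    print("not well-formed:", e)
```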


 Encoding PIs are not bad when
the MIME header is absent.  But mistakes do happen.

 >Several people have pointed out that I am focussing on XML here. I would
 >refer them to the name and scope of the mailing list.

I think that you are not paying attention to other textual formats.

Oh I am, but not on this list where it is off topic.

I would
like XML to be a good citizen of the WWW and to establish a good practice.

As would I. I don't consider the propagation of known faults to be "good
practice".


 >Incidentally, XML is probably not best described as a textual format. It is
 >a data format, which can among other things be used to describe
 >international text. I am aware that the text/* media types have some
 >historical requirements regarding 'character set'; this is sufficient that
 >my opinion is that text/* should not be used for XML in general.
 >Application/xml has no such problems (though it seems that people propose
 >to propagate these problems there).

I think that many XML documents are readable for casual users and that
the top-level type "text" is most appropriate.

As long as they only use US-ASCII.

 The charset parameter
is not a historical requirement.  Rather, it is the right solution,
which is just about to take off.  I think that we are wasting our
limited resources by repeating old discussions rather than doing more
implementations.

You consistently fail to address the issue of file system processing of
XML, and instead characterise all opposition to your proposal as "time
wasting". I will be happy to characterise it as that once you have given a
satisfactory response to the questions I pose.


 >It is possible for example to take a payload of image/svg-xml and alter it
 >from UTF-16 to ISO-8859-15 (this would entail rewriting the encoding
 >declaration and insertion of NCRs for any characters outside the repertoire
 >of 8859-15). I would be most upset, as would every decoder on the planet,
 >if the same conversion was performed on image/png.
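The NCR rewriting mentioned in the quoted passage is mechanical; as an illustration (mine, not the original author's), Python's codec error handlers can emit numeric character references for characters outside the target repertoire:

```python
# Illustrative sketch: encode to ISO-8859-15, substituting numeric
# character references (NCRs) for characters outside its repertoire.
text = "<p>Greek: \u03b1\u03b2</p>"
encoded = text.encode("iso-8859-15", errors="xmlcharrefreplace")
print(encoded.decode("iso-8859-15"))  # <p>Greek: &#945;&#946;</p>
```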

Since XML processors support UTF-8 and UTF-16, transcoding from Unicode to
legacy encodings does not look very attractive. 

I agree that such transcoding is unattractive, but you seem to want to bias
the XML MIME specification to supporting such transcoding whatever the cost
to other sorts of processing.

 What is needed is the
other way around: conversion from legacy encodings to Unicode.  Such
transcoders do not need character references by numbers.

Thanks. That is the first time that I saw you limit these "all text" 
transcoders to somewhere that they might at least be useful and be able to
represent all the characters.

However, something that converts an XML file from 8859-1 to UTF-8 and
leaves the encoding declaration saying 8859-1 is not useful. It has not
generated XML. It has made a thing which will fail to parse. A transcoder
that *knows* it is converting from (list of legacy 8-bit charsets) to (UTF-8
or UTF-16) can always do the right thing by always emitting an XML
declaration without an encoding declaration (or better, one that says which
of UTF-8 or UTF-16 is used). This is one line of code. So then all it needs
to do is strip out any existing XML declaration. That is pretty trivial,
too. I mean, grep -v will do in a pinch ;-)
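The transcoder described above can be sketched in a few lines (a minimal illustration under my own assumptions; the function name and the declaration-stripping regex are mine, not Chris's):

```python
import re

def transcode_to_utf8(data: bytes, legacy_encoding: str) -> bytes:
    """Convert an XML document in a *known* legacy encoding to UTF-8,
    emitting a declaration that says which encoding is actually used."""
    text = data.decode(legacy_encoding)
    # Strip any existing XML declaration (the "grep -v" step).
    text = re.sub(r'^<\?xml[^>]*\?>\s*', "", text)
    # Emit a declaration naming the encoding actually used (one line).
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + text).encode("utf-8")

src = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
print(transcode_to_utf8(src, "iso-8859-1").decode("utf-8"))
```

Because the source encoding is known, no characters are lost and no numeric character references are needed, which is the point made above.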

--
Chris
