Re: Announcement of a new I-D


Rick Jelliffe posted this reply to xml-dev.  Since this I-D 
should be discussed here, I am forwarding his message.

Murata Makoto wrote:

We are looking forward to your feedback.



Very disappointing to see: 


1) Charset "should" be given for application/xml. HTTP has a character 
set handling concept that comes from fantasyland. I would recommend a 
very different policy: never use xml/data, always use application/xml; 
never use charset, always use xml encoding declarations. 
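
As an illustration of the policy recommended in point 1, here is a minimal Python sketch using the standard library's ElementTree. The document bytes and the choice of ISO-8859-1 are assumptions made up for the example. No charset parameter is given to the parser; the only label is the XML encoding declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the only encoding label is the XML declaration.
doc_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

# No external charset is passed in; the parser reads the declaration
# from the byte stream and decodes the content accordingly.
root = ET.fromstring(doc_bytes)
text = root.text
print(text)
```

The same bytes would be decoded the same way whether they arrive over HTTP, from disk, or by mail, which is the attraction of self-labelling.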


2) "When non-validating processors handle XML documents, they do not 
      always read external parsed entities. Thus, interoperability is 
      not guaranteed." 
This is just FUD: why isn't this handled by the "standalone" declaration? 
If it is a comment about bugs in software, that is out-of-place here. 
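
For reference, the "standalone" declaration is visible to software. A small sketch, assuming Python's bundled expat bindings; the helper name read_standalone is hypothetical:

```python
import xml.parsers.expat

def read_standalone(doc: bytes) -> int:
    """Return 1 for standalone="yes", 0 for "no", -1 if not declared."""
    seen = {"standalone": -1}

    def on_decl(version, encoding, standalone):
        # expat reports 1, 0, or -1 (clause omitted)
        seen["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_decl
    parser.Parse(doc, True)
    return seen["standalone"]

yes_doc = b'<?xml version="1.0" standalone="yes"?><a/>'
no_doc = b'<?xml version="1.0" standalone="no"?><a/>'
print(read_standalone(yes_doc), read_standalone(no_doc))
```

A receiver could use such a check to decide whether skipping external entities is safe for a given document.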



3) Support for xml:base. Xml:base is currently being railroaded through 
W3C with requirements document--on behalf of my organization which is a 
W3C member I have repeatedly asked what its justification is, and there 
has never been any answer. Xml:base is dangerous because it creates an 
unlabelled dialect of XML--a general XML editor cannot treat URIs as 
text when cutting and pasting, it also may have to do something with 
xml:base. It would be OK if managed as part of some more general 
package, but not by itself. In any case, it is not clear whether 
xml:base applies to all data marked by a schema as a URI or just to data 
marked as an xlink:href. It does not apply to URIs in SYSTEM identifiers 
in entities, AFAIK. 


Without knowing which URIs are "known" by xml:base, and what its 
interaction is with xml schemas, broad statements that embedded URIs 
should be interpreted relative to xml:base are surely incorrect, or at 
least too early. There is no need for F with xml:base, but U and D are 
certainly warranted. 


4) The rocket scientists at IETF have managed a new thing with the spec 
for utf16be (if you use utf16be you cannot have a BOM apparently): it 
means that not only can you do too little as far as labelling your data, 
you can now do *too much*! If you want to use big-endian utf16 and your 
software sticks in a BOM just to be safe you are ruined. I thought I had 
seen everything. This makes the well-intentioned user pay: it looks like 
an enabling provision, but its effect is surely to prevent the use of 
big-endian UTF16. Users should not be penalized for providing "too 
much" labelling. 
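
The trap in point 4 can be seen in a short sketch. This shows how Python's own codecs behave, which is an assumption about one implementation and not a statement of what the spec requires of every decoder: a BOM in front of data labelled as big-endian UTF-16 is not consumed as a signature but comes through as a stray U+FEFF character.

```python
import codecs

# Big-endian UTF-16 data with a BOM stuck in front "just to be safe".
payload = "<a/>".encode("utf-16-be")
with_bom = codecs.BOM_UTF16_BE + payload

# Labelled "utf-16": the BOM is treated as a signature and removed.
as_utf16 = with_bom.decode("utf-16")

# Labelled "utf-16-be": the BOM is kept as a U+FEFF character, so the
# document no longer starts with "<".
as_utf16be = with_bom.decode("utf-16-be")

print(repr(as_utf16), repr(as_utf16be))
```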


5) Along similar lines, but far worse and of major importance for 
internationalization, the fragment identifier of a URI has to be in 
US-ASCII with %HH escaping. Here I am in Taipei and I want to include 
an Xpointer to refer to an ID or element name or attribute name or 
value, and I have to first find the numeric values of my Big5, then 
transcode it into Unicode, then find out what the Unicode values are in 
HEX, then put them in. Is that the way it is supposed to work? This is 
exactly the sort of thing that should be provided by the XML 
infrastructure, not by the poor user: to tell the user "you can say 
'yes' in your native language in that attribute value, but you cannot 
type it directly when you want to reference that element" is not 
acceptable. And what about XSLT: does it mean that when I use an XPath 
which includes a document() reference, that I have to suddenly stop when 
I get to the fragment identifier of the URI and switch to %HH? 
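
The conversion chain described in point 5 can be sketched as follows. This is only an illustration: it assumes the %HH escapes are taken over the UTF-8 octets of each character, the helper name escape_fragment is hypothetical, and the sample name is a single Chinese character meaning "yes".

```python
from urllib.parse import quote, unquote

def escape_fragment(big5_bytes: bytes) -> str:
    """Big5 bytes -> Unicode -> %HH escapes (assumed: over UTF-8 octets)."""
    text = big5_bytes.decode("big5")   # local encoding to Unicode
    return quote(text, safe="", encoding="utf-8")

name = "\u662f"                        # the character typed in the editor
escaped = escape_fragment(name.encode("big5"))
print(escaped)                         # what the user is asked to type instead

# The reverse step is equally mechanical.
restored = unquote(escaped, encoding="utf-8")
```

Every step here is mechanical, which is the argument for having a library do it and not the person typing the reference.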


This draft has lost the plot. XML is first and foremost a markup 
language: that is its name, that is its purpose, that is what we want. 
Someone should be able to open their local text editor and create a 
legitimate document using all the characters available in that editor, 
without ever having to perform any character-to-number conversions or 
looking up any character tables. This is a basic operational simplicity 
which gives XML 99% of its value. 


If HTTP requires data in a different format, the XML infrastructure 
should provide that transparently. If IETF or any RFC purports to make 
any requirement on how I can mark up legitimate characters, then the 
comment we should respond with is terse and monosyllabic: it is not the 
business of an RFC to mandate any particular encodings within an XML 
document. It is ultra vires. (Note, I am *not* saying that an RFC cannot 
proscribe certain characters. I am saying that an RFC oversteps itself 
if it tries to tell me that I must use %HH rather than &#HHHH; or a 
direct character inside my XML document.) 
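
A short sketch of the three spellings mentioned above, using Python's ElementTree; the element and attribute names are made up for the example. The direct character and the &#HHHH; form are resolved by the XML parser to the same value, while the %HH form passes through the parser untouched and needs a separate URI-level step.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# Hypothetical attribute carrying a fragment identifier, spelled three ways.
direct = ET.fromstring('<a href="#\u662f"/>').get("href")
ncr = ET.fromstring('<a href="#&#x662F;"/>').get("href")
escaped = ET.fromstring('<a href="#%E6%98%AF"/>').get("href")

# The parser makes the first two identical; the third is still escaped.
print(direct == ncr, escaped)

# Only an extra URI-unescaping pass recovers the same text.
unescaped = unquote(escaped)
```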


If a future technology like xml:base is included in discussion, why is 
there no discussion of international domain names? XML should not 
constrain which characters domain names may use (the RFCs currently keep 
the door open): whatever mechanism is eventually used to allow 
internationalised domain names, that should be handled transparently by 
the XML processor and the user should be able to see and type the direct 
characters (or have NCRs). 
So mandating %HH has the problem that we will have to revisit this RFC 
as soon as international DNS comes online (which will probably be sooner 
rather than later: there is a staggering demand here in Asia for it). 
It would be best if XML kept out of the issue entirely: in particular, 
if it is decided that CNRP should be used to convert IDNS names into 
ASCII domain names, then that is definitely something that would make 
the approach of %HH in the domain name part unneeded: why should we have 
one rule in the domain name and another rule for other places? 



Good to see: 


1) |xml suffix is a great idea 


2) MIME types for DTDs and external parsed entities 



I regret to say, I think these flaws are so great that the draft should 
be withdrawn and rethought at once. Especially point 5 is a disaster. In 
particular, whenever there is some conversion between IETF syntax 
requirements and simple plain text editing, this should be hidden from 
the user and taken care of by the XML processor. 


The current draft is a step backwards for internationalization of the 
WWW in practice. Or, at least, it makes life simpler for ASCII users 
but much more difficult for us non-ASCII users. And I think it makes 
life much more complicated for implementers: it means that user 
interfaces will have to have data conversion routines built in, rather 
than just leaving it to the URI referencing library routines. 



Rick Jelliffe