Re: Finishing the XML-tagging discussion

At 04:22 PM 3/17/00 -0800, Paul Hoffman / IMC wrote:

> Nope, we agree that there are very many. Where we disagree is in how they
> will be labelled when they are moved in MIME-based systems.
>
> Your assumption appears to be that they will all follow the lead of IOTP
> and have their own sub-type tags. A different assumption is that they will
> mostly use text/xml and application/xml. Doing so makes them quite friendly
> to the random MIME parsers you envision. Each will have a unique DTD
> identifier, so there is no need to have their own subtype.


I don't think text/xml and application/xml are going to have much of a life
except as a holding tank for XML formats that are either in development,
and thus not ready for a MIME content identifier, or which aren't meant for
any kind of large scale interchange.

If I'm going to the trouble of organizing a group of developers to create a
format we all agree on, I'm definitely going to take the tiny extra step of
creating a label for it that we can agree to use to simplify interchange.
If I have a million documents arriving at a gateway every minute, for three
different sets of purposes involving different DTDs, I'm not going to want
to sniff every document to find out what it contains.  I'd much rather look
at a content-type label and THEN pass it to a processor that can combine
the sniffing/validation with parsing that actually brings the information
into my application structures.
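
A minimal sketch of that label-first routing, in Python. The media type
names and processor names are hypothetical, invented only to illustrate
three sets of documents with different DTDs; nothing here is a registered
type or an existing gateway API.

```python
# Hypothetical media types and destinations, invented for this sketch.
ROUTES = {
    "application/x-purchase-order-xml": "order-processor",
    "application/x-invoice-xml": "invoice-processor",
    "application/x-shipping-notice-xml": "shipping-processor",
}


def media_type(content_type: str) -> str:
    """Return the bare type/subtype from a Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()


def route(content_type: str) -> str:
    """Pick a processor from the label alone, without reading the body."""
    return ROUTES.get(media_type(content_type), "unlabelled-queue")
```

The lookup never touches the message body, so its cost per document stays
the same however large the document is. Validation and parsing are left to
whichever processor the label selects.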

XML's ability to ride commodity HTTP infrastructures makes this possible,
cheap, and very likely.  I don't think the creators of most XML formats are
going to go to the length of creating their own transport protocols, but
they will want to be able to take maximum advantage of existing protocols.
The scenario described above takes maximum advantage of an HTTP server
(most likely) and can perform extensive and efficient processing without
having to resort to sniffing.

Sniffing XML can be fairly ugly - DOCTYPE declarations aren't required, and
they can appear fairly deep into a document under certain unpreventable
circumstances. (The XML declaration, comments, and processing instructions
can occupy space before the DOCTYPE declaration.)  Even sniffing the
DOCTYPE isn't completely reliable, thanks to a variety of issues involving
default values for namespaces within external DTDs and overrides inside the
internal DTD...
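
To make the problem concrete, here is a rough sketch of a sniffer built on
Python's standard expat binding. The function name and the returned
dictionary layout are assumptions made for this example. It has to parse
past the XML declaration, comments, and processing instructions before it
can report anything, and it reports no DOCTYPE at all for a perfectly
well-formed document that omits one.

```python
import xml.parsers.expat


class _RootSeen(Exception):
    """Raised to stop parsing once the root element has been reached."""


def sniff(data: bytes) -> dict:
    """Report the DOCTYPE (if any) and the root element name of a document."""
    found = {"doctype": None, "public": None, "system": None, "root": None}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(doctype=name, public=public_id, system=system_id)

    def on_start(name, attrs):
        # A DOCTYPE cannot appear after the root element starts, so stop here.
        found["root"] = name
        raise _RootSeen

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
    except _RootSeen:
        pass
    return found


labelled = (b'<?xml version="1.0"?>\n<!-- a comment -->\n<?app hint?>\n'
            b'<!DOCTYPE order SYSTEM "order.dtd">\n<order/>')
bare = b"<order/>"
print(sniff(labelled)["doctype"], sniff(bare)["doctype"])
```

Even this much requires a real XML parser on every incoming document, and
it still says nothing about namespace defaults supplied by an external DTD.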

Sniffing is not an acceptable solution in a large-scale environment, in any
case, and the widespread availability of commodity components will lead
developers to find the easiest way to connect those components.  Sniffing's
a mess.

While I can always slap the wrong MIME content type identifier on a
document and cause problems that resemble a failed sniff, that's a risk
with any MIME identifier and so far as I know the Internet is still running.

> For the record, I don't know why the authors of IOTP chose to use a
> different sub-tag; they may have a very good reason. But my guess is that
> most XML-based applications that want to be found by generic XML parsers
> SHOULD use text/xml and application/xml.


'want to be found' is making assumptions that don't hold very well with
XML.  Suppose that the gateway described above is also making copies of all
messages for regulatory reasons, passing them to different data
repositories based on their content.  Everything except the XML gets passed
to a traditional file storage system, while the XML gets fed into a
hierarchical store that provides random access to the information.

Having an -xml suffix on the content type would make it extremely easy to
sort out the XML from the non-XML without relying on fallible and
inefficient sniffing techniques.  If messages started arriving in XML
without the -xml (say in an x- type), they couldn't be analyzed with the
same set of tools.  Human intervention might be worthwhile at that point,
but the overall costs are still a lot lower than sniffing every message.
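
The archiving split described above could be as small as the following
Python sketch. The repository names and the example content types are
hypothetical; the only rule taken from the discussion is "text/xml,
application/xml, or a subtype ending in -xml".

```python
def is_xml_label(content_type: str) -> bool:
    """Decide from the Content-Type label alone whether the body is XML."""
    label = content_type.split(";", 1)[0].strip().lower()
    if label in ("text/xml", "application/xml"):
        return True
    # The proposed convention: any subtype ending in -xml carries XML.
    return label.endswith("-xml")


def repository_for(content_type: str) -> str:
    """Choose a (hypothetical) archive for a message based on its label."""
    return "hierarchical-store" if is_xml_label(content_type) else "file-store"
```

An XML message sent under a label without the suffix lands in the ordinary
file store, which is exactly the case where a person may need to step in.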

> Like Ned, I have nothing against -xml as a concept. I'm just convinced that
> the systems that use it will also reflexively resort to sniffing everything
> anyway, so why give the false impression that all subtags that go over XML
> should end in -xml? Let them sniff away.


I don't know where your assumptions about programmers come from - the ones
I know tend to prefer the easiest way of doing things.  And why the 'false
impression' rhetoric?  Should I talk about the 'false impression' that
image/png provides?

Put simply, sniffing isn't done with off-the-shelf parts and it doesn't
scale very well.  I think most programmers would rather read a clear label
than sniff the box to find out if it contains perfume, bills, or manure.

Simon St.Laurent
XML Elements of Style / XML: A Primer, 2nd Ed.
Building XML Applications
Inside XML DTDs: Scientific and Technical
Cookies / Sharing Bandwidth
http://www.simonstl.com