RE: Parameters for top-level XML media types?

At 12:31 PM 5/6/99 -0700, Larry Masinter wrote:

Before we invest a lot of effort into solving this problem, could you
please give a couple of _realistic_ examples where
  a) this is really a problem and
  b) MIME type labelling would actually help?


Sure.  Whenever I hear someone ask for '_realistic_ examples' I expect
they're planning to dislike whatever I present on philosophical grounds,
but I'll give it a shot.

It sounds nice in principle, but for the most part, the examples given
are very weak. For the most part, the examples I've seen are based on
some hypothetical world in which there are thousands of different documents
and document types in different repositories, and people are browsing
indiscriminately among the ones that fit their application.


The principle is extremely simple.  File types should be labeled as
precisely as possible.  XML documents have a double identity - they can be
processed by generic XML processors and stored in generic repositories -
and they also have specific content that may require a particular
processor.  At present, MIME types can identify the first identity (using
text/xml or application/xml) OR the second (typically using
application/x-whatever), but not both.

As for 'some hypothetical world', I don't think we're really that far away
from the very scenario you describe.  People and software browsing
indiscriminately is here today, as are thousands of different documents in
different repositories.  We don't _yet_ have thousands of different
document types, but as XML makes plenty of provision for this, it might be
wise to be prepared.

I've not yet seen a real example. The closest I've seen is the calendaring
example, where someone gets lots of calendar messages along with other
messages in their email box, and the application is trying to sort out
which things are calendar requests from just ordinary email by filtering
on the MIME type. But I'm suspicious of this application; for a long
list of reasons, it seems like the wrong use of the MIME type to determine
the intent of the message.


Lacking any strong theoretical feelings about the 'proper' use of MIME
types, I'd happily make the claim that if it works, great, and if we can
make it work better, even better.

Now for some 'real world' examples, though they'll have to be somewhat
vague given how little 'real world' XML is currently in circulation.

XMLNews (www.xmlnews.org) provides an XML vocabulary for marking up news
stories.  At present, I can't find them serving any 'live' XMLNews
information, so I'll make this up out of the usual conjecture and guesswork.

Suppose I subscribe to a news feed that uses XMLNews format, say one
focused on XML. (Maybe Robin Cover's site in XML form.)  The feed comes in
through my email, and hopefully I've finally built a mail program that I
like better than Eudora Pro. Because subject headers are such crummy things
to sort on, I feed the XMLNews information into a repository that works
well with XML.  There are a few ways I could set this up:

1) Filter based on sender (okay for simple case)
2) Filter based on MIME-type: all text/xml goes through another processor
and into the repository. If I'm smart, a post-processor figures out which
XML is which so I don't end up collecting business cards.
3) Filter based on MIME-type: all application/x-xmlnews is separated,
inspected to make sure it's 'really' XML and not a mere name collision and
then sent into the repository.
4) Filter based on MIME-type: all xml/x-xmlnews is separated, put in the
repository.

Number 4 seems to me the most trustworthy and requires the least
post-processing.  That looks like a good combination.
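
Options 2 through 4 can be sketched as a small routing function.  This is
only an illustration in Python, not a description of any existing mail
program: the xml/* top-level type is the proposal under discussion rather
than a registered type, and the destination names and the route() and
route_message() helpers are made up for the example.  Option 1 (filtering
on sender) is left out because it doesn't involve the MIME type at all.

```python
from email import message_from_string

XML_NEWS = "x-xmlnews"  # subtype used in the options above

def route(content_type):
    """Map a MIME type to (destination, needs_inspection)."""
    ctype = content_type.strip().lower()
    maintype, _, subtype = ctype.partition("/")
    if maintype == "xml":
        # Option 4: hypothetical top-level xml type; nothing left to verify.
        dest = "xmlnews-repository" if subtype == XML_NEWS else "xml-repository"
        return dest, False
    if ctype in ("text/xml", "application/xml"):
        # Option 2: known to be XML, but the vocabulary is still unknown.
        return "xml-repository", True
    if ctype == "application/" + XML_NEWS:
        # Option 3: vocabulary is named, but we must confirm it is really XML.
        return "xmlnews-repository", True
    return "inbox", False

def route_message(raw_message):
    """Route every non-multipart part of a raw RFC 822 message."""
    msg = message_from_string(raw_message)
    return [route(part.get_content_type())
            for part in msg.walk() if not part.is_multipart()]

if __name__ == "__main__":
    raw = "From: feed@example.org\nContent-Type: xml/x-xmlnews\n\n<story/>\n"
    print(route_message(raw))
```

Only the xml/* branch returns without asking for a second look at the
content, which is the point of option 4.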

Similarly, suppose I have a search engine that scours the Web seeking out
XMLNews information and indexing it.  ('News beyond the Wire' or
something.)  If XMLNews is identified as text/xml or application/xml, the
signal-to-noise ratio is going to be extremely low.  I'll be loading lots
of documents and throwing them in the discard pile.  If it's
application/x-xmlnews, the signal-to-noise ratio will be much higher. On the
other hand, if I build a search engine that simply indexes XML documents,
it's going to have to load all kinds of application/x-* to figure out
what's in XML and what isn't - a problem given the likely proliferation of
application/x-* that XML makes possible.

If it's xml/x-xmlnews, then both my XMLNews-specific search engine and my
generic XML search engine are happy - both know that they can index the
material, and the odds of wasted transfers decline. (Admittedly, because
non-XML formats are likely to grow much more slowly, we could train the
generic search engine to ignore bad matches.  On the other hand, it seems a
lot easier to get the matches right the first time.)
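
To make the wasted-transfer argument concrete, here is a toy tally in
Python.  Everything in it is an assumption for illustration: the six-document
corpus is invented, the scheme names and the label(), should_fetch() and
transfers() helpers are mine, and xml/* is the proposed type, not a
registered one.  It counts how many documents each kind of search engine
would have to download under each labelling scheme, and how many of those
downloads turn out to be useful.

```python
# Invented corpus of (is_xml, vocabulary) pairs; not real measurements.
CORPUS = [
    (True, "xmlnews"), (True, "xmlnews"),
    (True, "vcard"), (True, "calendar"),
    (False, "tarball"), (False, "spreadsheet"),
]

def label(is_xml, vocab, scheme):
    """MIME type a server would advertise under each labelling scheme."""
    if scheme == "generic":      # all XML is text/xml
        return "text/xml" if is_xml else "application/x-" + vocab
    if scheme == "specific":     # everything is application/x-*
        return "application/x-" + vocab
    if scheme == "two-level":    # hypothetical xml/* top-level type
        return ("xml/x-" if is_xml else "application/x-") + vocab
    raise ValueError(scheme)

def should_fetch(mime, scheme, engine):
    """Can the engine rule this document out from the label alone?"""
    if engine == "xmlnews":
        wanted = {"generic": "text/xml",
                  "specific": "application/x-xmlnews",
                  "two-level": "xml/x-xmlnews"}[scheme]
        return mime == wanted
    # Generic XML engine: fetch anything that might be XML.
    prefix = {"generic": "text/xml",
              "specific": "application/x-",
              "two-level": "xml/"}[scheme]
    return mime.startswith(prefix)

def transfers(scheme, engine):
    """Return (downloads, useful_downloads) for one scheme/engine pair."""
    downloads = useful = 0
    for is_xml, vocab in CORPUS:
        if should_fetch(label(is_xml, vocab, scheme), scheme, engine):
            downloads += 1
            if is_xml and (engine == "generic-xml" or vocab == "xmlnews"):
                useful += 1
    return downloads, useful

if __name__ == "__main__":
    for scheme in ("generic", "specific", "two-level"):
        for engine in ("xmlnews", "generic-xml"):
            print(scheme, engine, transfers(scheme, engine))
```

On this made-up corpus, the two-level scheme is the only one where neither
engine downloads a document it ends up discarding.  The real proportions
would depend entirely on what is actually out there.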

Similar issues arise for other specific and generic processors.  Browsers
can display (either as a tree or with style sheets) any XML that's text/xml
or application/xml.  If they get an unknown MIME type, however, they're
going to have to bug the user to figure out what to do with it. Again, they
could check everything they get to find out if it's XML, but why not get it
right the first time?

Creating a first-level xml space tells generic XML processors that they can
do _something_ to the material contained in the file. Then the second level
can provide a more specific description useful for processors that don't
want to waste their time with the wrong XML information. 

We have an opportunity here to strengthen the use of MIME types by making
them more meaningful.  MIME types are an amazingly underutilized and
frequently misunderstood resource for identifying information types.
Unfortunately, if we continue down the path of text/xml, application/xml,
and application/x-*, we're not providing applications with enough
information to use MIME types meaningfully and reliably.


Simon St.Laurent
XML: A Primer
Sharing Bandwidth / Cookies
http://www.simonstl.com