RE: MIME types and fragment identifiers in HTML and XML

One more reference to include: the XPointer draft claims to
define what fragment identifiers for XML content types are supposed
to be, but it isn't referenced in your analysis.

http://www.w3.org/1999/07/WD-xptr-19990709 says:

  XPointer defines the meaning of the "selector" or "fragment identifier"
  portion of URIs that locate resources of MIME media types "text/xml" and
  "application/xml".

1. URI and URI reference

A URI does not have a fragment identifier, but a URI reference (as defined by
RFC 2396) may have a fragment identifier.  (Note: HTML and XML very often
say "URI" when they should actually say "URI reference".)


To be fair, the terminology has changed over time; the term "URI" is
used ambiguously in various documents.
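
To make the distinction concrete, here is a small sketch in Python that
splits a URI reference into the URI proper and the fragment identifier.
It uses the standard library's urldefrag; the example reference is the
HTML 4.0 link cited later in this message, and the variable names are
only illustrative.

```python
from urllib.parse import urldefrag

# A URI reference: a URI followed by "#" and a fragment identifier.
reference = "http://www.w3.org/TR/html40/types.html#h-6.7"

# urldefrag separates the URI from the fragment identifier.
uri, fragment = urldefrag(reference)

print(uri)       # the URI, without any fragment
print(fragment)  # the fragment identifier, left to the media type's interpreter
```

Only the first part is used to obtain an entity; the second part is
applied afterwards, once the media type is known.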

1) URI

An entity is returned by some protocol  (here, the word "entity" is used as
in RFC 2616).  The protocol should provide some mechanism for transmitting or
inferring media type. In HTTP and email, this is done explicitly with the
'content-type' header.


Not all resources identified by a URI have a way of obtaining an
'entity'. HTTP does, and so "http:" URLs can talk about entities
being returned, but "mailto:" has no corresponding entity. It isn't
necessary for there to be a 'protocol' that 'returns' entities, either;
for example, the "cid" URL scheme makes reference to content of
MIME messages without a specified protocol, but it would be possible
to use a fragment identifier with a "cid" URL.

Those URIs which don't have a way of obtaining entities also don't allow
fragment identifiers.

Your use of 'email' here is somewhat confusing, because there is often no
appropriate URI to use for an email message. However, other protocols
use MIME content-type for content identification, including IMAP and
IPP.

2) URI reference

A URI is first constructed.


I'm not sure "constructed" is the right word here, but I'm also not
sure what you meant.

An entity is returned or accessed interactively
by some protocol.  The protocol should indicate the media type of the
entity.
Then, the user agent for this media type may extract or locate some fragment
of this entity by using the fragment identifier.


I don't think "user agent" is the right term here; it is the
"interpreter for this media type"; in some cases, the interpreter
is part of a user agent, and in others, it's part of some other
function.

The protocol does not indicate the media type for that fragment.


You're saying that fragments don't have media types themselves, I think.

 Thus, it
does not have content types, unless the fragment contains some other way of
specifying media types. (For example, RFC 2397, "data" URL scheme, provides
a way of including MIME content-type along with encoded data.)


I think you're trying to argue that fragments don't, in general, have
media types, even when there are some cases where compound objects have
embedded data which *does* have a media type. The data URL scheme is
certainly one kind of counterexample, but I'm not sure I would recommend
it.
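
For reference, this is roughly how a "data" URL carries its own media
type alongside the data. The sketch below is a simplified reading of
RFC 2397, not a complete parser: parse_data_url is a hypothetical helper,
and it does not handle a charset parameter given without a media type.

```python
import base64
from urllib.parse import unquote_to_bytes

def parse_data_url(url):
    """Split an RFC 2397 data URL into (media_type, payload_bytes)."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, _, encoded = url[len("data:"):].partition(",")
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[:-len(";base64")]
    # RFC 2397 default when the media type is omitted.
    media_type = header or "text/plain;charset=US-ASCII"
    if is_base64:
        payload = base64.b64decode(encoded)
    else:
        payload = unquote_to_bytes(encoded)
    return media_type, payload

# "PGEvPg==" is the base64 encoding of the four bytes "<a/>".
media_type, payload = parse_data_url("data:text/xml;base64,PGEvPg==")
print(media_type, payload)
```

The media type here travels inside the URL itself, which is what makes
it different from an ordinary fragment.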

2.  Media types specified by HTML or XML language constructs

HTML and XML provide many constructs which specify both a URI reference
and a media type.  The HTML and XML specifications are rather silent about
the intended semantics.

1) URI

One could argue that the specified media type is used when the protocol does
not indicate the content type of the entity.  One could even argue that
the specified media type always overrides the content type indicated by the
protocol.  (Note: Many implementations fail to indicate media types
correctly.)

One could also argue that the specified media type is used to predict or
restrict the content type of the desired entity.  That is, if an A
link contains a 'type' attribute and the resulting URI returns an
entity with a different content-type, then an error has occurred.

This isn't so different from getting a '404 not found'.  Something
happened which wasn't expected. There are various ways of recovering,
but any attempt to override one piece of MIME data with something
that's "fresher" and more authoritative seems wrong.


I think I sent something like this earlier, but it came out wrong.
The problem when you have conflicting sources of information ("what
is the MIME type of this data") is that you *do* want to select the
one that is fresher and more authoritative, but that there are cases
where the data associated with the URI itself is likely to be more
authoritative (e.g., with some FTP sources) and other cases where
the data associated with the entity is more authoritative (e.g., with
a recently maintained HTTP server.)

This is a design issue with HTML and (I suppose) XLink. Perhaps
there needs to be more than one way of associating a content-type
with a URI, one of which says "override content-type" and another
of which says "default content-type".
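
As a sketch of what the two kinds of association might mean, assuming a
link could carry a mode alongside its declared type (neither HTML nor
XLink defines such a mode; effective_type and the mode names are
hypothetical):

```python
def effective_type(protocol_type, link_type, mode):
    """Pick a content type from the protocol's type and the link's type.

    mode is a hypothetical marker on the link:
      "override" - the link's type always wins
      "default"  - the link's type is used only if the protocol gave none
    """
    if mode == "override":
        return link_type
    if mode == "default":
        return protocol_type if protocol_type is not None else link_type
    raise ValueError("unknown mode: " + mode)

# An FTP-like source that supplies no type: the link's default applies.
print(effective_type(None, "text/xml", "default"))
# A server that does supply a type: the default is ignored.
print(effective_type("text/html", "text/xml", "default"))
# An explicit override replaces whatever the protocol said.
print(effective_type("text/html", "text/xml", "override"))
```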

Note that even MIME-compliant protocols that normally associate
content-type with data can disclaim responsibility for it by,
say, using content-type: application/octet-stream. We might have
some theory of overriding, where text/xml would override
application/octet-stream and text/html would override text/xml
(specific overrides generic).
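
One possible reading of "specific overrides generic", as a sketch only:
the ranks below cover just the three types named above and are not taken
from any registry, and more_specific is a hypothetical helper. Types
outside the table raise KeyError rather than being guessed at.

```python
# Hypothetical specificity ranks for the three types in the example above.
SPECIFICITY = {
    "application/octet-stream": 0,  # "no claim made about the content"
    "text/xml": 1,
    "text/html": 2,
}

def more_specific(type_a, type_b):
    """Return whichever of two known content types ranks as more specific."""
    # Content types are case-insensitive, so compare in lower case.
    a, b = type_a.lower(), type_b.lower()
    return a if SPECIFICITY[a] >= SPECIFICITY[b] else b

print(more_specific("application/octet-stream", "text/xml"))
print(more_specific("text/xml", "Text/HTML"))
```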

2) URI reference

If a construct in HTML or XML specifies a URI reference containing a fragment
identifier, the construct also specifies a media type, and the protocol
indicates the content type of the entity, what will happen?

One could argue that the specified media type is used for the desired
fragment, unless the fragment contains some other way of specifying media
types.

I don't understand this case ('the fragment contains some other way...').

One could also argue that the fragment must indicate the media type and that
it must coincide with the media type specified by the HTML or XML
construct (fragment).


This is unreasonable, since it isn't compatible with normal usage
where fragments are used without explicit media type.

If the fragment does not explicitly specify the same media type, an error has
occurred.


Perhaps you mean "if the fragment explicitly specifies a media type but
it isn't the same", since not specifying a media type shouldn't be an error.

I believe that there is a generic form of a "fragment" which is
"an uninterpreted name", and that we should expect that most media
types that allow fragments have a way of looking up named components.
This would correspond to <A NAME=..> names in HTML and IDs in XML.
We should expect that other media types that have fragments will
define named components, too.
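
A rough sketch of the "uninterpreted name" lookup for the two cases
mentioned, using only the Python standard library. The helper names are
hypothetical, and the XML half assumes that an attribute literally named
"id" stands in for an XML ID; a real processor would consult the DTD to
learn which attributes are of type ID.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class AnchorNames(HTMLParser):
    """Collect the NAME values of A elements in an HTML document."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for key, value in attrs:
                if key == "name" and value:
                    self.names.add(value)

def html_has_fragment(html_text, fragment):
    """True if some <A NAME=...> in the HTML matches the fragment."""
    parser = AnchorNames()
    parser.feed(html_text)
    parser.close()
    return fragment in parser.names

def xml_has_fragment(xml_text, fragment):
    """True if some element carries id="<fragment>" (see assumption above)."""
    root = ET.fromstring(xml_text)
    return any(elem.get("id") == fragment for elem in root.iter())

print(html_has_fragment("<A NAME=top>Top</A>", "top"))
print(xml_has_fragment('<doc><sec id="intro"/></doc>', "intro"))
```

In both cases the fragment is just a name to be matched; which lookup
applies depends on the media type of the entity.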

I'd like this definition of fragment identifiers to be more specific
about encoding, though; currently it is common practice to use
spaces in fragment identifiers, for example, rather than %20 encoding
them.
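
To illustrate the encoding question, here is how a fragment name with a
space looks when percent-encoded, using Python's urllib.parse. The name
"section two" is made up for the example.

```python
from urllib.parse import quote, unquote

raw_name = "section two"

# Percent-encode everything outside the unreserved set, including the space.
encoded = quote(raw_name, safe="")
print(encoded)

# Decoding recovers the original name for comparison against the document.
decoded = unquote(encoded)
print(decoded)
```

A definition that is specific about encoding would say whether the name
in the document is compared against the encoded or the decoded form.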

[1] HTML 4.0

(http://www.w3.org/TR/html40/types.html#h-6.7)

6.7 Content types (MIME types)

Note. A "media type" (defined in [RFC2045] and [RFC2046]) specifies
the nature of a linked resource. This specification employs the term
"content type" rather than "media type" in accordance with current
usage. Furthermore, in this specification, "media type" may refer to
the media where a user agent renders a document.

This type is represented in the DTD by %ContentType;.

Content types are case-insensitive.

Examples of content types include "text/html", "image/png",
"image/gif", "video/mpeg", "audio/basic", "text/tcl",
"text/javascript", and "text/vbscript". For the current list of
registered MIME types, please consult [MIMETYPES].
Note. The content type "text/css", while not currently registered with
IANA, should be used when the linked resource is a [CSS1] style sheet.

http://www.w3.org/TR/REC-html40/present/styles.html#h-14.2.3


I've been trying to deal with issues around this paragraph in the
HTML working group; certainly, since "text/css" has been registered,
this paragraph should change.