Re: The last structural shortcoming of MIME: how to remove it

Harmonizing the syntax of MIME parameter values and URLs
--------------------------------------------------------

Referring to my recent proposal for allowing octets > 127 and
different character sets to be indicated in MIME parameters [1],
Glenn Adams <glenn(_at_)stonehand(_dot_)com> wrote in message
<9511292047(_dot_)AA03921(_at_)trubetzkoy(_dot_)stonehand(_dot_)com>:

I am very pleased to see your proposal, and strongly endorse it.  However,
I'd like to suggest an editorial change that is motivated by a conversation
I am currently having about how to solve the character encoding
identification problem that holds with URLs in general.

Given the following syntax:

quoted-string-with-charspec : '"' %-text-with-charspec '"'
%-text-with-charspec        : charspec %-text
charspec                    : charspec-prefix '<' charset '>'
charspec-prefix             : "=?%"
charset                     : as in RFC 1522
%-text                      : %-octet | %-octet %-text
%-octet                     : unescaped-octet | escaped-octet
unescaped-octet             : octet whose value is the code value of
                              any printable ASCII character other than
                              SPACE or %-specials
escaped-octet               : '%' hex-digit hex-digit
%-specials                  : '"' | '{' | '}' | '|' | '\' | '^' | '~' |
                              '[' | ']' | '`' | '#' | '<' | '>' | '%'


Your restriction to "%-specials" and "escaped-octet" for what's
allowed in a "quoted-string-with-charspec" after the "charspec"
wouldn't hurt.
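
As a rough illustration only, the grammar above could be checked and
decoded along the following lines. This is a minimal Python sketch, not
part of either proposal: the function name is invented here, and the
"charset" production is approximated by "anything without '<', '>',
'"' or white space" instead of the full RFC 1522 definition.

```python
import re
import string

# %-specials from the grammar: these may only appear in escaped form.
PERCENT_SPECIALS = set('"{}|\\^~[]`#<>%')

# Assumption: charset is approximated, not the exact RFC 1522 production.
VALUE_RE = re.compile(r'"=\?%<([^<>"\s]+)>(.+)"', re.S)

def parse_charspec_value(value):
    """Return (charset, octets) for a quoted-string-with-charspec."""
    m = VALUE_RE.fullmatch(value)
    if m is None:
        raise ValueError("not a quoted-string-with-charspec")
    charset, text = m.groups()
    octets = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            # escaped-octet: '%' hex-digit hex-digit
            pair = text[i + 1:i + 3]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("bad escaped-octet at offset %d" % i)
            octets.append(int(pair, 16))
            i += 3
        elif "!" <= ch <= "~" and ch not in PERCENT_SPECIALS:
            # unescaped-octet: printable ASCII, not SPACE, not a %-special
            octets.append(ord(ch))
            i += 1
        else:
            raise ValueError("octet not allowed unescaped: %r" % ch)
    return charset, bytes(octets)
```

The returned octets can then be decoded with the named charset, for
instance octets.decode(charset), if the local system knows that charset.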

Then RFC 1521 could be updated to read:

value : token | quoted-string-with-charspec


This is more problematic. I don't think we can ignore the
current use of "normal" quoted-strings to give values of MIME
parameters. So the RFC 1521 definition of value should rather be
updated to:

  value = token / quoted-string / quoted-string-with-charspec
     ; when the "quoted-string" and "quoted-string-with-charspec"
     ; interpretation of a "value" both are possible, the latter
     ; shall be used

In fact all "quoted-string-with-charspec"s are syntactically
also "quoted-string"s, so the proposal is syntactically, though
not semantically, backwards compatible.
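
The precedence rule in the comment above can be sketched in Python as
follows. This is only an illustration under simplifying assumptions: the
function and constant names are made up, the quoted-string pattern is a
simplified reading of RFC 822, and the charset pattern is approximated
as in the grammar discussion rather than taken from RFC 1522.

```python
import re

TSPECIALS = '()<>@,;:\\"/[]?='          # tspecials from RFC 1521
PERCENT_SPECIALS = '"{}|\\^~[]`#<>%'    # %-specials from the grammar

# One unescaped-octet: printable ASCII except SPACE and the %-specials.
_UNESCAPED = "(?![" + re.escape(PERCENT_SPECIALS) + "])[!-~]"
CHARSPEC_VALUE = re.compile(
    r'"=\?%<[^<>"\s]+>(?:%[0-9A-Fa-f]{2}|' + _UNESCAPED + r')+"')
# Simplified quoted-string: qtext or quoted-pair between double quotes.
QUOTED_STRING = re.compile(r'"(?:[^"\\\r]|\\.)*"', re.S)

def is_token(s):
    return bool(s) and all(
        0x20 < ord(c) < 0x7F and c not in TSPECIALS for c in s)

def classify_value(value):
    """Name the production a parameter value falls under."""
    if is_token(value):
        return "token"
    # Checked before quoted-string: when both readings are possible,
    # the charspec reading shall be used.
    if CHARSPEC_VALUE.fullmatch(value):
        return "quoted-string-with-charspec"
    if QUOTED_STRING.fullmatch(value):
        return "quoted-string"
    raise ValueError("not a valid parameter value")
```

Note that a value classified as "quoted-string-with-charspec" also
matches QUOTED_STRING, which is the syntactic compatibility claimed
above.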

And RFC 1738 could be updated to:

url-with-charspec           : charspec url
url                         : as specified by RFC 1738

Given these changes, url-with-charspec could be used as a parameter value
simply by adding quotes, since url-with-charspec satisfies the lexical syntax
of %-text-with-charspec.


The "charspec-prefix" must also be added.


The user-unfriendliness of not-only-ASCII URLs. HURLs
-----------------------------------------------------

Now that we have made things so simple, it's time to complicate
them again, but in other respects ...

The requirements RFC applicable to URLs (RFC 1736) asks for
several things, among them:

: 4.5 Locators are "transport-friendly".
:  
:    Internet locators can be transmitted from user to user (e.g, via e-
:    mail) across Internet standard communications protocols without loss
:    or corruption of information.

: 4.6 Locators are human transcribable.
:  
:    Users can copy Internet locators from one medium to another (such as
:    voice to paper, or paper to keyboard) without loss or corruption of
:    information.  This process is not required to be comfortable.

Requirement 4.5 makes it necessary to encode non-ASCII
characters as "escaped-octet"s, consisting of "%" and two
hexadecimal digits. This, however, makes it impossible to meet
requirement 4.6, if languages other than English are involved.

Take as an example a file called "Stockholm", but containing
information about my home city in Greek, located on a Greek WWW
server, where the filename of course is in Greek. Let's say the
filename is coded in ISO-8859-7. A "url-with-charspec" for this
resource could be of this form:

   =?%<iso-8859-7>http://...../%d3%f4%ef%ea%f7%fc%eb%ec%e7

The initial "charspec" part may be acceptable for human users,
considering that comfort is not required. The final part of the
URL, which carries the Greek filename, is not acceptable,
however. To a Greek it is merely an incomprehensible stream of
digits and letters from a foreign alphabet, with "%" signs
interspersed. To transcribe this from e.g. a newspaper article
without making any error would be an achievement in itself. To
be blunt, this URL is really _user-hostile_, and requirement 4.6
is not met for these kinds of URLs.
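
To make the contrast concrete, here is a small Python sketch that
decodes the escaped filename from the example with the standard
library's urllib.parse, and re-encodes it again. It assumes only that
the local Python installation has an "iso-8859-7" codec; the variable
names are arbitrary.

```python
from urllib.parse import quote, unquote

escaped = "%d3%f4%ef%ea%f7%fc%eb%ec%e7"

# Interpret each escaped octet as an ISO-8859-7 character.
name = unquote(escaped, encoding="iso-8859-7")
print(name)  # Στοκχόλμη

# Going the other way reproduces the escapes (quote() emits upper-case hex).
assert quote(name, encoding="iso-8859-7").lower() == escaped
```

Nine Greek letters on one side, twenty-seven characters of hex and
percent signs on the other.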

So what to do? The only solution I can see is to acknowledge
that a special human-oriented form of URLs is needed. Let's call
these things, to separate them from _real_ URLs, "human-oriented
uniform resource locators" (HURLs).

The HURLs should be designed to meet requirement 4.6, but still
be trivial to convert to a real URL by a program that knows
the coded character sets involved.

The simplest case of HURLs could be the form specified in the
appendix "Recommendations for URLs in Context" in RFC 1738:

   <URL:scheme:schemepart>

This would be the form of HURLs when the coded character set is
US-ASCII (only characters < 128 specified), or is unknown.
Linear white space would be allowed but insignificant. To secure
the HURLs against operations performed by text-reformatting
programs, it's perhaps also best to require that "-" is encoded
as "%2d" or "%2D", and that any "-" character that does occur is
insignificant.
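
A minimal Python sketch of how a program might reduce such a simple
HURL to a real URL, assuming the tentative hyphen rule above were
adopted. The function name is invented for this illustration and
ftp.example.org is a placeholder host.

```python
import re

def simple_hurl_to_url(hurl):
    """Strip the <URL:...> wrapper, white space and literal hyphens."""
    m = re.fullmatch(r"\s*<URL:(.*)>\s*", hurl, re.S)
    if m is None:
        raise ValueError("not of the form <URL:scheme:schemepart>")
    # White space and "-" are insignificant; a real hyphen in the URL
    # has to be written as %2d or %2D and is left untouched here.
    return re.sub(r"[\s-]", "", m.group(1))

# A HURL that a text formatter has broken and hyphenated across lines.
wrapped = "<URL:ftp://ftp.example.org/pub/long-\n    file%2dname.txt>"
print(simple_hurl_to_url(wrapped))
```

The hyphen and line break added by the formatter disappear, while the
intended hyphen survives in its escaped form.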

The really useful HURL form is needed for the new
"url-with-charspec" kind of URLs. The HURL would be formed by
replacing any "escaped-octet"s for an octet > 127 with the
character having that coded representation in the specified
charset. The HURL can then be written in any coded character set
that contains those non-ASCII characters, not only in the coded
character set that it specifies itself. Even more important, it
can be written and printed on paper with the component names
shown as the words or letter combinations they really consist
of, without any distracting or readability-destroying
%-encoding.

It's possible to simplify the syntactic sugar in HURLs somewhat.
Instead of

   =?%<charset>scheme:string-with-lots-of-percent-signs

the HURL could have the form

   <URL-charset:scheme:the-true-component-names>

Say that you are Greek and see the HURL for the resource
about Stockholm in the example above in a magazine. That HURL
would of course not itself _have_ any coded character set. The
word following "<URL-" at its beginning, however, would be
"iso-8859-7", and would _specify_ a coded character set, the Greek
part of the international standard for simple 8-bit character
sets.

If you had a sophisticated WWW-client on your computer that used
for example the ISO 10646 character set, it should be possible
to type this HURL, exactly as it's printed in the magazine, into
the text field "Location" of the "Open" command. The client will
then construct the real URL by
o  considering the ISO 10646-coded representation of each non-ASCII
   character typed,
o  looking up the coded representation of that same character
   in the coded character set specified at the beginning of the
   HURL, namely "iso-8859-7",
o  encoding this octet value > 127 with the %-encoding,
o  inserting this string into the real URL.

By reversing this process the client can also display any URL
containing a charset indicator that it receives as a more
user-friendly HURL.
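
The conversion in both directions might look roughly like this in
Python. It is a sketch under stated assumptions, not a definitive
implementation: the function names are invented, Python's own string
type stands in for ISO 10646, the reverse direction assumes a charset
with one octet per character (such as iso-8859-7), and example.org is a
placeholder host.

```python
import re

HURL_RE = re.compile(r"<URL-([^:>]+):(.+)>", re.S)
URL_RE = re.compile(r"=\?%<([^<>]+)>(.+)", re.S)
ESCAPE_RE = re.compile(r"%([0-9A-Fa-f]{2})")

def hurl_to_url(hurl):
    """<URL-charset:scheme:names>  ->  =?%<charset>scheme:%xx..."""
    m = HURL_RE.fullmatch(hurl.strip())
    if m is None:
        raise ValueError("not a HURL with charset")
    charset, rest = m.groups()
    parts = []
    for ch in rest:
        if ord(ch) < 128:
            parts.append(ch)            # ASCII is copied unchanged
        else:
            # Look the character up in the specified charset and
            # %-encode the resulting octet(s).
            parts.extend("%%%02x" % b for b in ch.encode(charset))
    return "=?%<" + charset + ">" + "".join(parts)

def url_to_hurl(url):
    """Reverse direction; assumes a single-octet charset."""
    m = URL_RE.fullmatch(url)
    if m is None:
        raise ValueError("not a url-with-charspec")
    charset, rest = m.groups()

    def show(match):
        octet = int(match.group(1), 16)
        if octet < 128:
            return match.group(0)       # keep ASCII escapes as they are
        return bytes([octet]).decode(charset)

    return "<URL-" + charset + ":" + ESCAPE_RE.sub(show, rest) + ">"

hurl = "<URL-iso-8859-7:http://example.org/Στοκχόλμη>"
url = hurl_to_url(hurl)
print(url)
print(url_to_hurl(url))
```

Escapes for octets below 128 are deliberately left alone in the reverse
direction, since they usually protect characters that are special in
URLs.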


There are not only URLs, but RURLs, FURLs, HURLs, maybe even TURLs
------------------------------------------------------------------

Besides the

 _real_ URLs


there were, from the beginning,

 _relative_ URLs, RURLs.


Another derivative class of identifiers are

 _fragment_ URLs, FURLs.


These are used mostly with the http: scheme, and consist of a
URL with a "#" followed by a fragment or anchor identifier
appended.

This proposal would introduce a fourth kind of URL-like string,

 _human-oriented_ URLs, HURLs.


A possible fifth category of strings, that can be reduced to
real URLs, is

 _template_ URLs, TURLs.


For example, to allow for some generality in mailserver: URLs,
it would be useful to have TURLs with a simple parameter
substitution mechanism, such as in this plain text example:

   To get a copy of the document to your fax at international
   telephone number ^fno, use:
   
<TURL:mailserver:query(_at_)admin(_dot_)kth(_dot_)se/ur-architecture-plan/fax:^fno>
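
One way such a substitution could work, sketched in Python. Everything
beyond the "^name" notation in the example is an assumption made for
this illustration: that parameter names consist of ASCII letters and
digits, that substituted values are %-encoded, the function name, and
the placeholder address query@example.org.

```python
import re
from urllib.parse import quote

PARAM_RE = re.compile(r"\^([A-Za-z][A-Za-z0-9]*)")

def expand_turl(turl, params):
    """Replace each ^name in a TURL by its value, giving a real URL."""
    m = re.fullmatch(r"<TURL:(.*)>", turl.strip(), re.S)
    if m is None:
        raise ValueError("not of the form <TURL:...>")
    # Values are %-encoded so that they cannot alter the URL structure.
    # An unknown parameter name raises KeyError.
    return PARAM_RE.sub(
        lambda mo: quote(params[mo.group(1)], safe=""), m.group(1))

turl = "<TURL:mailserver:query@example.org/ur-architecture-plan/fax:^fno>"
print(expand_turl(turl, {"fno": "4681234567"}))
```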


See also
--------

[1] <URL:ftp://ftp.admin.kth.se/pub/misc/ojarnef/disc/mime-param/
     951129-mime-param-1.mbox;type=a>
    From: Olle Jarnefors <ojarnef(_at_)admin(_dot_)kth(_dot_)se>
    Date: Wed, 29 Nov 95 17:46:59 +0100
    Message-ID: <9511291646(_dot_)AA01932(_at_)mercutio(_dot_)admin(_dot_)kth(_dot_)se>
    Subject: The last structural shortcoming of MIME: how to remove it


/Olle

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>