Re: Character set registration

1995-12-18 16:01:21
Maybe I misunderstood the discussion, but it was my belief from
comments on the mailing list that there were browsers and servers
today that supported accept-charset: unicode-1-1 and would transmit
documents in 16-bit Unicode using two octets for each character, and
where octets 10 and 13 had no special significance.
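To make the point about octets 10 and 13 concrete, here is a minimal Python sketch. It assumes Python's "utf-16-be" codec as a stand-in for the unicode-1-1 label (for the characters used here, both put each character in two octets). The sample string and the helper name are made up for illustration.

```python
# Assumption: "utf-16-be" stands in for the unicode-1-1 charset label.
SAMPLE = "\u010a\n\u010d"  # U+010A, U+000A (line feed), U+010D


def octet_hits(data):
    """Return offsets of octets 10 and 13, ignoring character boundaries."""
    return [i for i, b in enumerate(data) if b in (10, 13)]


octets = SAMPLE.encode("utf-16-be")
hits = octet_hits(octets)          # every octet with value 10 or 13
line_feeds = SAMPLE.count("\n")    # actual line feed characters

print(list(octets), hits, line_feeds)
```

Only one of the matching octets belongs to a real line feed; the others are halves of unrelated characters, which is why a transport that treats octets 10 and 13 specially cannot carry such text unchanged.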

The HTML parser requires a front end to translate the sequence of
octets used in the character encoding (iso-2022-jp, shift-jis, etc.)
into a sequence of characters. Even if it _is_ 'difficult' to
implement an agent that can accept such character encodings, it
shouldn't mean that text/html; charset=unicode-1-1 should be
disallowed even in negotiated situations.
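A rough sketch of that front end, in Python, assuming the codec names "iso2022_jp", "shift_jis" and "utf-16-be" (the last again standing in for unicode-1-1). The class and function names are hypothetical; the only point is that the parser proper sees characters, never octets.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Toy HTML parser that records start tags and text content."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_octets(octets, charset):
    # Front end: translate the octet sequence into characters.
    characters = octets.decode(charset)
    # Back end: the parser works on characters only.
    parser = TagCollector()
    parser.feed(characters)
    parser.close()
    return parser.tags, "".join(parser.text)


DOC = "<p>\u65e5\u672c\u8a9e</p>"
CHARSETS = ("iso2022_jp", "shift_jis", "utf-16-be")
results = {cs: parse_octets(DOC.encode(cs), cs) for cs in CHARSETS}
print(results)
```

With the decoding step factored out this way, the same parser code handles all three encodings.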

As it stands, the MIME proposal would make such an indication not only
unwanted in situations where the charset was not previously
negotiated, it would make the negotiation itself syntactically
illegal. I don't think this is a requirement either for mail or for
the web.

It may be that the distinction between text/* and application/* is
artificial, and that we should move toward automatically
cross-registering media types (or at least register anything that is
seen as text/foo to also be application/foo) in order to get around
this dilemma.

From:   Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu>
To:     Larry Masinter <masinter(_at_)parc(_dot_)xerox(_dot_)com>
cc:     NED(_at_)innosoft(_dot_)com, 
Subject: Re: Character set registration 
In-reply-to: Your message of "Mon, 18 Dec 1995 12:55:59 PST."
Date:   Mon, 18 Dec 1995 14:35:44 -0800

In the context of real time connection between sender and recipient
and the ability of the recipient to indicate the allowable media types
and charset values (such as found in HTTP), this is not an issue.

I disagree.  It's still an issue for HTML.  It's very difficult
to write an HTML parser that handles arbitrary byte sequences 
for end-of-line, "<", ">", "\", "!", "&", ";" and other characters 
that are significant in HTML.  (not impossible, but considerably
more difficult than assuming that those octet values are reserved).
Given that difficulty, it's quite reasonable for text/html to have
very similar restrictions on character sets to text/plain.
(We've had similar discussions with text/enriched.)
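The difficulty can be illustrated with a small Python sketch that counts octets the way a byte-oriented scanner would. The helper name is hypothetical, and the codec names "shift_jis" and "iso2022_jp" are the ones Python's standard library uses.

```python
def octet_count(text, charset, ch):
    """Count occurrences of ch's ASCII octet in the encoded form of text."""
    return text.encode(charset).count(ch.encode("ascii"))


# Katakana SO (U+30BD): its second shift_jis octet is 0x5C, the code for "\".
so_hits = octet_count("\u30bd", "shift_jis", "\\")

# Hiragana ZE (U+305C): its iso2022_jp form contains 0x3C, the code for "<".
ze_hits = octet_count("\u305c", "iso2022_jp", "<")

print(so_hits, ze_hits)
```

Neither string contains a "\" or "<" character, yet a scanner that looks only at octet values would find one in each.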

Which is not to say that someone can't define a more versatile html-like
type and call it application/html.

One consequence of using the same content-type system for both email and
the web is that any class of content-types used in both environments
must deal with the needs of both environments.