Re: Character set registration

1995-12-18 20:25:59
One distinction between text/* and application/*, which is highly
important in choosing which one to use, is that an unknown subtype
of text/* can be displayed as if it were text/plain.

The issue isn't whether an unknown subtype of text/* can be displayed
as if it is text/plain, it is whether an unknown charset can always be
treated as if it agreed with US-ASCII.

I beg to differ. This isn't the issue at all. If it were, we'd have problems
with various national character set variants, which routinely redefine various
US-ASCII characters in incompatible ways. It would also disallow character sets
based on ISO-2022 like ISO-2022-JP, as well as various Unicode-derived
character sets like UTF-7 and, I believe, UTF-8. We clearly could not tolerate
such a restrictive definition.
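
To make the ISO-2022 point concrete, here is a minimal sketch, assuming
Python's built-in iso2022_jp codec as a stand-in for the ISO-2022-JP charset.
The sample string and variable names are illustrative only. It shows a charset
whose octets all fit in 7 bits yet do not carry their US-ASCII meaning.

```python
# Illustrative only: ISO-2022-JP is a 7-bit charset whose octets do not
# mean what US-ASCII says they mean.
text = "\u65e5\u672c\u8a9e"  # three kanji, chosen arbitrarily as sample text

raw = text.encode("iso2022_jp")

# Every octet is below 0x80, so the data is "7bit" in the MIME sense.
is_seven_bit = all(octet < 0x80 for octet in raw)

# Reading the same octets as US-ASCII succeeds, but yields escape
# sequences and unrelated printable characters, not the original text.
as_ascii = raw.decode("ascii")
as_intended = raw.decode("iso2022_jp")

print(is_seven_bit, as_ascii == text, as_intended == text)
```

An agent that assumed agreement with US-ASCII would display the as_ascii
string, which is why such a requirement would rule this charset out.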

There are only three problem characters here: CR, LF, and NUL. The
specification of the text type says that CR and LF can only appear in the
context of an end-of-line sequence, that sequence must be CRLF, and NUL cannot
appear at all. This is far less restrictive than any requirement that the
entire 7bit space align perfectly with US-ASCII.
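
The three restrictions above can be sketched as a small check. This is an
illustration under stated assumptions, not a reference implementation: the
function name text_line_break_violations is hypothetical, and the check looks
only at raw octets, which is the whole point, since it needs no knowledge of
the charset in use.

```python
def text_line_break_violations(body: bytes) -> list[str]:
    """Report which of the three text/* octet restrictions a body breaks.

    Hypothetical helper: checks only CR, LF, and NUL, and knows nothing
    about the charset the body is written in.
    """
    problems = []
    if b"\x00" in body:
        problems.append("NUL present")
    # Remove every well-formed CRLF; any CR or LF left over is bare.
    remainder = body.replace(b"\r\n", b"")
    if b"\r" in remainder:
        problems.append("bare CR")
    if b"\n" in remainder:
        problems.append("bare LF")
    return problems


ok_ascii = text_line_break_violations(b"hello\r\nworld\r\n")
ok_2022 = text_line_break_violations("\u65e5\u672c\u8a9e\r\n".encode("iso2022_jp"))
bad_lf = text_line_break_violations(b"unix line\n")
bad_nul = text_line_break_violations("A".encode("utf-16-be"))
```

Under this check an ISO-2022-JP body passes untouched, while raw 16-bit
Unicode fails on NUL and would need a transfer-encoding such as base64.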

Personally, I prefer the model that suggests that text/* media types
are those that are logically considered as a sequence of character
objects and represented by a sequence of octets using a 'charset'
encoding, and that the restriction on 'charset' encodings is similar
to the restriction on transfer-encodings: don't send unknown
transfer-encodings to unsuspecting recipients.

Treating the 'charset' as a (nested) transfer-encoding has a lot of
advantages. Even for text/plain unicode, you might choose to use
base64 transfer-encoding with charset=unicode-1-1, quoted-printable
transfer-encoding with charset=unicode-1-1-utf8, or no encoding with
charset=unicode-1-1-utf7. The results would be the same sequence of
_characters_ but increasing legibility (for ascii text, at least) when
dealing with user agents that don't actually understand unicode.
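
A minimal sketch of the three combinations described above. The codec names
are assumptions: Python's utf-16-be, utf-8, and utf-7 codecs stand in for
charset=unicode-1-1, unicode-1-1-utf8, and unicode-1-1-utf7 respectively, and
the sample string is arbitrary.

```python
import base64
import quopri

text = "Caf\u00e9 menu"  # arbitrary sample with one non-ASCII character

# Stand-in codecs (assumptions, see the paragraph above).
wire_utf16 = base64.b64encode(text.encode("utf-16-be"))  # base64 + 16-bit form
wire_utf8 = quopri.encodestring(text.encode("utf-8"))    # quoted-printable + UTF-8
wire_utf7 = text.encode("utf-7")                         # no transfer-encoding

# Undo the transfer-encoding first, then the charset, for each variant.
decoded = [
    base64.b64decode(wire_utf16).decode("utf-16-be"),
    quopri.decodestring(wire_utf8).decode("utf-8"),
    wire_utf7.decode("utf-7"),
]

for wire in (wire_utf16, wire_utf8, wire_utf7):
    print(wire)
```

All three decode to the same sequence of characters. Printing the wire forms
shows the legibility difference: the base64 variant is opaque, while the other
two leave the ASCII letters readable to an agent that knows nothing of Unicode.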

This is a losing proposition for reasons I have already described -- it
means that any agent that does conversions to local format (and in practice
this means any agent, period) must maintain a comprehensive list of all
character sets and their characteristics. It also breaks the text you
yourself pointed out in the HTML specification (RFC1866).