Re: restrictions when defining charsets

In <IfQ02uK0M2Yt4JslNo(_at_)thumper(_dot_)bellcore(_dot_)com>, Nathaniel wrote:

> The term "character set", wherever it is used in this document, refers
> to a set of rules for the interpretation of an octet stream for the
> display of character-based text, such that the interpretation of each
> octet cannot be questioned, the number of representable characters is
> limited, and no further information is needed to determine the complete
> identity of the character set.  The term "character set" must not be
> misinterpreted as meaning "a set of characters".
>
> Will that make people happy?  Any suggested improvements?

I'm afraid that the above text is only an improvement if the
reader was also at the Santa Fe meeting (which I was not), or
otherwise knows as much as the IETF does about this squirrely
little issue.  For the discussion to have any real meaning, it
needs realistic and convincing examples of, or some other
equivalently elucidative explanation of, character sets which:

     1. do and do not permit the interpretation of each octet to
        be questioned;

     2. do and do not have a limited number of representable
        characters;

     3. do and do not require further information to determine
        the complete identity of the character set; and

     4. are and are not merely "a set of characters".

It could also use some mention of the problems which could arise
in the absence of these four rules, and which the rules are
intended to forestall.

As a recent participant in these discussions, I have found my own
assumptions about these issues to be significantly but "silently"
at odds with the thinking of the crafters of MIME and RFC-822.
By "silently" I mean that the differing assumptions have not made
their way to prominence in the discussions, such that
counterproductive discussion at cross purposes continued longer
than necessary.  I know that I am not alone in this regard;
unstated and unshared assumptions seem to be particularly rampant
in the world of character sets.

I am gradually figuring out what (in my case, anyway) these
differing assumptions are, and I hope in the next day or two to
post to this list a few notes dissecting them.  But the
differing assumptions will proliferate, and the IETF's task will
remain unfulfilled, unless language sufficiently descriptive to
impart the proper assumptions to an outside reader finds its way
into the appropriate published documents.

For two immediate examples, I would still think that "the
interpretation of each octet cannot be questioned" was intended
to rule out things like ISO-2022-JP, except for having read
suggestions on this list that it's really supposed to avoid
things like undifferentiated ISO-646.  I still have no idea what
"the number of representable characters is limited" is supposed
to accomplish.  (It probably rules out mnemonic encoding schemes,
but for what seem to me to be the wrong reasons.)
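
To make the two readings of "the interpretation of each octet
cannot be questioned" concrete, here is a minimal Python sketch.
It is an illustration under stated assumptions, not part of any
proposed text.  The first half uses Python's standard
"iso2022_jp" codec to show that the same two octets decode
differently depending on a preceding escape sequence.  The second
half uses a small hand-written table (ISO646_OVERRIDES and
decode_iso646 are hypothetical names, and the table lists only a
few well-known national-variant differences) to show that an
unlabeled ISO-646 octet depends on which variant is assumed.

```python
# Illustration only: the same octets can mean different characters.

# Case 1: ISO-2022-JP is stateful.  The octets 0x24 0x22 are '$"' in
# the initial ASCII state, but one JIS X 0208 character after the
# escape sequence ESC $ B.  ESC ( B switches back to ASCII.
PAYLOAD = b'$"'
as_ascii = PAYLOAD.decode("ascii")
as_jis = (b"\x1b$B" + PAYLOAD + b"\x1b(B").decode("iso2022_jp")

# Case 2: "undifferentiated" ISO-646.  Hypothetical, deliberately
# incomplete table of positions where national variants differ from
# the US variant (ASCII).
ISO646_OVERRIDES = {
    "US": {},
    "GB": {0x23: "\u00a3"},                  # 0x23 is POUND SIGN
    "JP": {0x5C: "\u00a5", 0x7E: "\u203e"},  # YEN SIGN, OVERLINE
}


def decode_iso646(octets, variant):
    """Decode 7-bit octets under one assumed ISO-646 variant."""
    overrides = ISO646_OVERRIDES[variant]
    chars = []
    for b in octets:
        if b > 0x7F:
            raise ValueError("not a 7-bit octet: 0x%02X" % b)
        chars.append(overrides.get(b, chr(b)))
    return "".join(chars)


if __name__ == "__main__":
    print(repr(as_ascii), repr(as_jis))
    for variant in ("US", "GB", "JP"):
        print(variant, decode_iso646(b"#\\~", variant))
```

In the first case the ambiguity is resolved by information inside
the octet stream itself (the escape sequences); in the second it
can only be resolved by a label outside the stream, which may be
the distinction the proposed wording is reaching for.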

                                        Steve Summit
                                        scs(_at_)adam(_dot_)mit(_dot_)edu