Re: restrictions when defining charsets

Nathaniel wrote on 5 Feb...

Keld has suggested more or less the following wording for the definition
of "character set" -- please comment if you object:

The term "character set", wherever it is used in this document, refers
to the ISO definition of the term "coded character set":  "a set of
unambiguous rules that establishes a character set and the one-to-one
...


Nathaniel,
   I've read this definition several times.   I think I know what it
means, but I'm not sure, which is not a good sign.
   The place where I'm getting stuck is on the combining sequences
associated with 10646 (or, for that matter, ASCII's funny language about
permitting overstruck character representations).
   In both of those situations, if we look at a code (or code sequence)
we get one-and-only-one character (plus or minus alternate uses for the
same positions).  But:

   -- if there are multiple names for the same position, then alternate
symbol stylizations are permitted, e.g., the ASCII apostrophe and acute
accent case.   We need to be very clear about whether that is permitted
and, if not, if there has to be an IETF profile that specifies one or
the other before the thing becomes 'a character set'.
    Whatever choice is made about that, we need to be very clear about
it.  The cited text isn't.
    Whatever we do, deciding that ASCII is not a character set is not,
IMnvHO, viable.

  -- If I start with a character (either name/concept or glyph), neither
ASCII nor 10646 guarantees a unique representation in a code sequence.
There are, for example, no rules in those standards (none in ASCII, none
that I'm aware of in 10646) about the order in which combining sequences
must appear with regard to the primary character, nor any rules that
require the use of a pre-composed character code rather than a
constructed sequence when the results would appear to be the same. 
Again, I think it is necessary to be very precise and clear about what
we are permitting or encouraging here, and I'm not persuaded that the
cited definition does this job for anyone who hasn't been following
these discussions and the rather careful use of language that goes with
them.
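
To make the second point concrete, here is a minimal sketch in Python.
It assumes the code positions later fixed by Unicode/10646 for the
pre-composed "e with acute" and for the combining acute accent, and it
assumes canonical composition (NFC) as the comparison convention.  That
convention comes from later Unicode practice; nothing in the standards
discussed above supplies it.  The helper name same_character is
hypothetical.

```python
import unicodedata

# Two code sequences that render as the same accented letter.
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE (one code)
constructed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two codes)


def same_character(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC).

    NFC is an assumed convention here, not a rule taken from ASCII
    or from the 10646 text under discussion.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Code-for-code comparison sees two different sequences.
print(precomposed == constructed)                 # False
# Only an agreed convention makes them "the same character".
print(same_character(precomposed, constructed))   # True
```

Without some such agreed convention, a receiver comparing code
sequences sees two different strings for what a reader sees as one
character.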

If we cannot quickly resolve this, I want to make an alternate
suggestion in what I think is the spirit of Dave Crocker's "take it
elsewhere" suggestions.  Realistically, I can't use a character set with
text/plain without first registering it with IANA.  And, from a MIME
standpoint at the moment, the only thing we really care about is what
can be used with text/plain (and maybe with text/richtext however that
is spelled).  These other issues are fascinating and vitally important,
but ultimately come down to questions about what we want to let IANA
register.

I suggest that this can be punted by rewriting the current registration
material for character sets, not to include a comprehensive definition,
but to indicate that IANA may register things from a certain enumerated
class (the ones in 7.1.1, plus "profiles on use of ISO 2022 of
specificity at least equivalent to that of the 2022-JP RFC" would be
fine with me) and that additional classes can be permitted by pushing
through a standards-track RFC and getting approval on either the
definition or its style.

This would change the F.2 template to require reference to one of:
  -- An Internet Standard specifically approved with reference to MIME
     use.
  -- An RFC or RFC-to-be or International Standard that fits within the
     rules for registerable character sets explicitly included in 
     RFC1341bis. 
  -- An RFC or RFC-to-be or International Standard that fits within the
     rules for registerable MIME character sets defined by another
     standards-track RFC, which must be cited.

The people who want to count the angels on the head of this pin can then
go try to develop that third bootstrap document.  I might even join in,
but we shouldn't hold up RFC1341bis until that discussion converges.

Incidentally, the "Beyond US-ASCII" note in the middle of 7.1.1 of
RFC1341 (bottom of page 19 in the PostScript version) needs a careful
review at this time.  While no one has seen it, it is likely that 10646
is about as "defined" as it is going to be for the next five years or
so.  And the recent discussions on this list make it clear that, if
nothing else, 10646 isn't useable without some additional profiling
and/or agreement about coding conventions.

The "lowest common denominator" rule at the end of 7.1.1 may also need
re-examination.  My recollection is that, e.g., ISO8859-1 does not
permit dual interpretation of even those characters that X3.4 does in
the "ASCII subset".  If true, 8859-1 may be considered to be more
precise than ASCII, even for the codes of columns 2-7, and the rule is
ambiguous.  The rule also runs into trouble with 8859-1 and a UTF-2
encoding of 10646: I believe that these are identical, but it might be
excessively burdensome (and certainly a violation of the "one universal
character set" hope) to insist that a message that contained UTF-2 and
might contain non-Latin-1 characters be examined for the Latin-1 (single
octet) subset.
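
For reference, one plausible reading of the "lowest common denominator"
rule can be sketched in Python as below.  The candidate list, its
ordering, and the function name lowest_charset are assumptions made for
illustration; the sketch treats US-ASCII as "lower" than ISO-8859-1,
which is exactly the assumption questioned above, and it leaves UTF-2
out entirely.

```python
# Candidate charsets, ordered from "lowest" to "highest".
# The ordering is an assumption for illustration only.
CANDIDATES = ["us-ascii", "iso-8859-1"]


def lowest_charset(text, candidates=CANDIDATES):
    """Return the first candidate charset that can represent text.

    Returns None when no candidate can encode every character.
    """
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the text; try the next
        return name
    return None


print(lowest_charset("plain text"))   # us-ascii
print(lowest_charset("caf\u00e9"))    # iso-8859-1
print(lowest_charset("\u65e5"))       # None
```

Note that the whole body has to be examined before a label can be
chosen, which is the burden objected to above for messages that might
contain characters outside Latin-1.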

Conversely, assuming Keld is correct (my memory is too vague, and my
copy of 8859-1 is in a box some miles away) that 8859-1 does not
specify control characters (the ECMA graphic character registrations
certainly don't, but they are separate), then there is no sense in which
X3.4 is a proper subset of the 8859-n sets, since it does specify
control characters and some things about their interpretation.

   --john