
Comments on RFC-CHAR

1992-01-01 17:39:26
I think it is time for this group to get started on serious review of RFC-CHAR.
Hence this message... I had refrained from posting many comments about RFC-CHAR
previously because I was uncertain of its status. I am still uncertain of its
status, but this general silence seems to have been taken as a sign that there
are no problems with RFC-CHAR. This is very far from true, at least as far as
I'm concerned.

Without further ado, here are the problems that I see in the current RFC-CHAR

(0) RFC-CHAR, in its current draft, makes use of the current draft version of
    10646. In fact, the character names and definitions of 10646 form the
    basis for RFC-CHAR. 10646 is in a state of flux at present, and its future
    is far from certain. I realize that RFC-CHAR only uses the formal names and
    the 2-octet encoding sequences from 10646, but the bottom line is that it
    references 10646 in a nontrivial way. The implication is that further
    tracking of the evolution of 10646 will also be done. My understanding is
    that this is an untenable position for an Internet standards-track document
    to take.

    Either the dependence on 10646 must go away (and having said that, I don't
    know quite to what extent this must be accomplished, and I would really
    like to see some authoritative information on this. Mr Chair?) or 10646
    must achieve solidity before RFC-CHAR proceeds down the standards path.

    I have no problem with the short-term or even immediate publication of
    RFC-CHAR as an experimental RFC. The remaining issues on this list still
    need to be addressed, however.

(1) There are a bunch of duplicate character set names and aliases in the table.
    In my opinion, a character set name or alias should uniquely identify the
    character set. If this is not the case, language describing how
    duplicates are to be handled MUST be added to the document.
    In case anyone cares, the duplicate names in the current table are:


    The alias cp290 is also defined twice, but it is entered twice for the same
    character set. This then is just a typo, I guess -- or does it have some
    extra meaning I'm not aware of? There is also a funny sequence in front of
    the "code 0" specification for IBM290. I think this is out of order:

        charset IBM290
        alias cp290
        alias EBCDIC-JP-kana
        alias cp290           <-- Unnecessary duplicate
        __                    <-- What is this?
        code 0

(2) RFC-CHAR defines various non-spacing characters. Most of them are
    COMBINING characters taken from 10646; one is a NON-SPACING character
    from ISO 5426:

    "2      e002    NON-SPACING UMLAUT (ISO 5426 201)
    "!      0300    COMBINING GRAVE ACCENT
    "'      0301    COMBINING ACUTE ACCENT
    "?      0303    COMBINING TILDE
    "-      0304    COMBINING MACRON
    "(      0306    COMBINING BREVE
    ".      0307    COMBINING DOT ABOVE
    ":      0308    COMBINING DIAERESIS
    "0      030a    COMBINING RING ABOVE
    "<      030c    COMBINING CARON
    ",      0327    COMBINING CEDILLA
    ";      0328    COMBINING OGONEK
    "_      0332    COMBINING LOW LINE
    "=      0333    COMBINING DOUBLE LOW LINE

    First of all, the terms COMBINING and NON-SPACING need to be defined. I
    realize that these terms are defined in various ISO documents, but the
    need to reference other documents to even be able to read RFC-CHAR should
    be avoided. (Quick question to the list readership: does a combining
    accent precede or follow the character it applies to? No fair peeking at
    other documents or reading the rest of this note!)
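
    For what it's worth, in 10646/Unicode a combining character follows the
    base character it modifies (T.61's non-spacing accents, by contrast,
    precede theirs). A quick sketch with Python's standard unicodedata
    module -- an anachronism here, but it encodes the same ordering rule --
    makes the 10646 convention visible:

```python
import unicodedata

# Canonically decomposing U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE)
# yields the base letter first, then the combining mark.
decomposed = unicodedata.normalize("NFD", "\u00c1")
print([unicodedata.name(c) for c in decomposed])
# ['LATIN CAPITAL LETTER A', 'COMBINING ACUTE ACCENT']
```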

    These combining characters can be used to build a lot of characters by
    composition that appear elsewhere in the canonical set. The reason for
    having such things in the base mnemonic set is threefold: (1) Other
    character sets (notably T.61) use this technique and must have something
    in mnemonic to convert to, (2) It is possible to build characters using
    these facilities that are not in the canonical set, and (3) They are
    definitely part of Unicode and 10646 and aligning mnemonic with them is
    a good idea.

    However, the presence of these combining characters presents us with two
    problems. The first is one of canonicalization -- when you can represent
    something in two different ways, which way do you use? This is a problem
    that Unicode and 10646 have to deal with as well, and I have no problem
    punting on the issue and letting those folks deal with it (but see (0)
    above and what this then implies about RFC-CHAR as a standard). But if
    RFC-CHAR is simply going to follow the Unicode/10646 lead on this point,
    it should say so.
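
    To make the canonicalization problem concrete, here is a small Python
    sketch (again using the modern unicodedata module; draft 10646 poses
    the same ambiguity) showing that the precomposed character and the
    base-plus-combining sequence are distinct encodings of one abstract
    character:

```python
import unicodedata

precomposed = "\u00c1"     # LATIN CAPITAL LETTER A WITH ACUTE, one code point
combining_pair = "A\u0301" # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The two spellings differ code point for code point...
assert precomposed != combining_pair
# ...yet each canonical normalization maps both to a single agreed form.
assert unicodedata.normalize("NFC", combining_pair) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining_pair
```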

    The second problem is more serious. Let's say I'm converting from
    T.61 to 8859-1 using RFC-CHAR's tables. There are no equivalents for T.61's
    combining characters in 8859-1. There are, however, many equivalents for
    the composition of various combining characters with other characters. For
    example, suppose you have a LATIN CAPITAL LETTER A followed by a
    COMBINING ACUTE ACCENT in T.61. There is no equivalent for the combining
    accent alone, but 8859-1 does have LATIN CAPITAL LETTER A WITH ACUTE.
    Should a conversion from T.61 to 8859-1 use it?

    You might say that most of this is obvious from inspection, and I'd be
    inclined to agree in most cases. But a reader should not have to do
    this sort of analysis to get the technical meat out of RFC-CHAR -- the
    rules should be spelled out clearly.
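
    One plausible rule, sketched in Python under the assumption that a
    converter may compose a base character with its accent whenever the
    target set contains the precomposed form (t61_pair_to_latin1 is a
    hypothetical helper, not anything defined by RFC-CHAR):

```python
import unicodedata

def t61_pair_to_latin1(accent: str, base: str) -> bytes:
    """Compose a T.61 non-spacing accent with its base letter and encode
    the result in ISO 8859-1. T.61 transmits the accent BEFORE the base;
    Unicode places the combining mark AFTER it, so we swap the order,
    compose, and let the encoder reject anything 8859-1 lacks."""
    composed = unicodedata.normalize("NFC", base + accent)
    return composed.encode("latin-1")

# ACUTE ACCENT + "A" in T.61 order -> 8859-1 byte 0xC1 (A WITH ACUTE)
print(t61_pair_to_latin1("\u0301", "A"))  # b'\xc1'
```

    A converter would still need explicit rules for the cases this sketch
    punts on: pairs whose composition is not in 8859-1, and accents that
    appear with no base character at all.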

    This is far from an academic point. T.61 is the dominant character set
    used in X.400. MIME has chosen to use 8859-n instead (a wise choice, in my
    opinion). But the result of this choice is to make such conversions
    essential. Moreover, the conversions need to be done as correctly as
    possible.

(3) The preceding point brings the whole issue of character set conversion
    into sharper focus. Currently RFC-CHAR only deals with the problem of
    converting from one character set to a mnemonic encoding on top of another
    character set. This is fine as far as it goes, but it does not solve
    real-world problems adequately. For example, when converting messages into
    X.400 and T.61, there is no way to say that a mnemonic encoding is being
    used on top of T.61. (For that matter, the tendency in MIME work has been
    to limit the character sets mnemonic can be used on top of to straight
    ASCII.)

    For these reasons it is imperative that conversions exploit the base
    character set to the fullest extent possible. And to do this, explicit
    rules must be provided that detail all the gnarly twists that have to be
    accounted for to do a "good" conversion.

    RFC-CHAR could elect to punt on the conversion issue. If it does so it
    becomes a far less useful document, in my opinion, and immediately raises
    the need for a companion document that covers this material.

(4) The list of references in RFC-CHAR is pretty skimpy. I cannot believe
    that all this detail about all these character sets was extracted from
    only five documents! I would like to see a much more complete
    bibliography. In fact, I'd like to have one that is comprehensive enough
    that I can tell where each character set's definition came from. Such a
    list is essential if conflicts arise (and they are certain to) -- it lets
    us determine what source was used and allows us to judge between
    conflicting sources of information. Moreover, it allows us to assess the
    technical validity of RFC-CHAR; without it all I can do are spot-checks.

(5) The current draft of RFC-CHAR (December 5th) does not seem to be
    available from the drafts repository. It has been out for almost a month
    now... Does this mean the current draft has not been accepted as an
    Internet draft? I got my copy from Keld's system directly. How many people
    have reviewed the current draft, given its limited availability?

(6) Extremely minor typos:

    First page. 5rd December 1991 --> 5th December 1991.

    Section 2.1. definiton --> definition.

That's it for now.

