
Re: Comments on RFC-CHAR

1992-01-02 10:51:23
(0) RFC-CHAR, in its current draft, makes use of the current draft version of
    10646. In fact, the character names and definitions of 10646 form the
    basis for RFC-CHAR. 10646 is in a state of flux at present, and its future
    is far from certain. I realize that RFC-CHAR only uses the formal names and
    the 2-octet encoding sequences from 10646, but the bottom line is that it
    references 10646 in a nontrivial way. The implication is that further
    tracking of the evolution of 10646 will also be done. My understanding is
    that this is an untenable position for an Internet standards-track document
    to take.

It uses the long descriptive names of 10646, but it could as well
use the long descriptive names of ISO 10367, which are the same,
for the same characters. 10367 is a full ISO standard, as far as I know.
But 10646 has more characters, mostly special characters if we focus
on the use in RFC-CHAR. I expect the DIS 10646 to be ready in
the coming two weeks, and I would be happy to use that as a reference.
ISO DIS-es are valid references in full ISO international standards,
so I expect that Internet standards can also reference DIS-es
(or else they will have conceptual problems referencing ISO standards
at all).

    Either the dependence on 10646 must go away (and having said that, I don't
    know quite to what extent this must be accomplished, and I would really
    like to see some authoritative information on this. Mr Chair?) or 10646
    must achieve solidity before RFC-CHAR proceeds down the standards path.

    I have no problem with the short-term or even immediate publication of
    RFC-CHAR as an experimental RFC. The remaining issues on this list still
    need to be addressed, however.

Yes, I second the request for authoritative information.

(1) There are a bunch of duplicate character set names and aliases in the
    table.
    In my opinion, a character set name or alias should uniquely identify the
    character set. If this is not the case, language that describes how
    duplicates are to be handled MUST be added to the document.
    In case anyone cares, the duplicate names in the current table are:


OK, they are corrected.

    The alias cp290 is also defined twice, but it is entered twice for the same
    character set. This then is just a typo, I guess -- or does it have some
    extra meaning I'm not aware of? There is also a funny sequence in front of
    the "code 0" specification for IBM290. I think this is out of order:

        charset IBM290
        alias cp290
        alias EBCDIC-JP-kana
        alias cp290           <-- Unnecessary duplicate
        __                    <-- What is this?
        code 0

Ok, the duplicate is removed.

__ indicates that more work is needed on this character set.
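
The uniqueness check asked for in (1) can be sketched as follows. This is only
an illustration, not part of RFC-CHAR: it assumes the table uses the
"charset NAME" / "alias NAME" line layout quoted above, and the function name
find_duplicate_names and the case-insensitive comparison are assumptions made
for the example.

```python
from collections import defaultdict

# Sample input copied from the IBM290 excerpt quoted in this message.
SAMPLE = """\
charset IBM290
alias cp290
alias EBCDIC-JP-kana
alias cp290
__
code 0
"""


def find_duplicate_names(table_text):
    """Return {name: [declaring charsets]} for names declared more than once.

    Names are compared case-insensitively (an assumption of this sketch).
    """
    owners = defaultdict(list)
    current = None
    for line in table_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank lines, "__" markers, and other one-word lines
        keyword, name = parts
        if keyword == "charset":
            current = name
            owners[name.lower()].append(current)
        elif keyword == "alias" and current is not None:
            owners[name.lower()].append(current)
        # any other keyword (for example "code") is ignored
    return {name: sets for name, sets in owners.items() if len(sets) > 1}


print(find_duplicate_names(SAMPLE))  # {'cp290': ['IBM290', 'IBM290']}
```

A name listed twice under the same charset, as here, is a harmless repeat; a
name listed under two different charsets would be a real ambiguity.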

(2) RFC-CHAR defines various non-spacing characters. Most of them are

    "2      e002    NON-SPACING UMLAUT (ISO 5426 201)
    "!      0300    COMBINING GRAVE ACCENT
    "'      0301    COMBINING ACUTE ACCENT
    "?      0303    COMBINING TILDE
    "-      0304    COMBINING MACRON
    "(      0306    COMBINING BREVE
    ".      0307    COMBINING DOT ABOVE
    ":      0308    COMBINING DIAERESIS
    "0      030a    COMBINING RING ABOVE
    "<      030c    COMBINING CARON
    ",      0327    COMBINING CEDILLA
    ";      0328    COMBINING OGONEK
    "_      0332    COMBINING LOW LINE
    "=      0333    COMBINING DOUBLE LOW LINE

    First of all, the terms COMBINING and NON-SPACING need to be defined. I
    realize that these terms are defined in various ISO documents, but the
    need to reference other documents to even be able to read RFC-CHAR should
    be avoided. (Quick question to the list readership: does a combining
    accent precede or follow the character it applies to? No fair peeking at
    other documents or reading the rest of this note!)

Well, I do not define the meaning of the characters in this document.
This could e.g. be done for control characters.
But I did not do that, and that is in line with how ISO treats characters.

I would like to provide a list of equivalence of characters - e.g.
in 6937-2 a <"'> and an <A> are equivalent to <A'> .

    These combining characters can be used to build a lot of characters by
    composition that appear elsewhere in the canonical set. The reason for
    having such things in the base mnemonic set is threefold: (1) Other
    character sets (notably T.61) use this technique and must have something
    in mnemonic to convert to, (2) It is possible to build characters using
    these facilities that are not in the canonical set, and (3) They are
    definitely part of Unicode and 10646 and aligning mnemonic with them is
    a good idea.

    However, the presence of these combining characters presents us with two
    problems. The first is one of canonicalization -- when you can represent
    something in two different ways, which way do you use? This is a problem
    that Unicode and 10646 have to deal with as well, and I have no problem
    punting on the issue and letting those folks deal with it (but see (0)
    above and what this then implies about RFC-CHAR as a standard). But if
    RFC-CHAR is simply going to follow the Unicode/10646 lead on this point,
    it should say so.

The application of 10646 and UNICODE to Internet mail may be a tricky
one, which it is my intention to work on when we have the final specs.
I think it will require some effort from the whole of this group.
I would like to address T.61 (6937-2) before that happens.
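
The canonicalization problem can be made concrete with a small sketch. It is
an illustration only: it uses the normalization forms of the present-day
Unicode standard through Python's unicodedata module, which postdate this
discussion and are not a rule taken from RFC-CHAR or the 10646 draft. The
helper name canonical is an assumption.

```python
import unicodedata

# Two spellings of the same letter:
precomposed = "\u00c1"   # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT


def canonical(text):
    """Map text to one chosen representation (here: composed form, NFC)."""
    return unicodedata.normalize("NFC", text)


# Without a canonical form the two spellings compare as different strings.
same_before = precomposed == decomposed
# After canonicalization they compare equal.
same_after = canonical(precomposed) == canonical(decomposed)

print(same_before, same_after)  # False True
```

Whichever form is chosen, the point is that the document has to name one, so
that two implementations comparing or converting the same text agree.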

    The second problem is more serious. Let's say I'm converting from
    T.61 to 8859-1 using RFC-CHAR's tables. There are no equivalents for 
    combining characters in 8859-1. There are, however, many equivalents for
    the composition of various combining characters with other characters. For
    example, suppose you have a LATIN CAPITAL LETTER A followed by a
    COMBINING ACUTE ACCENT in T.61. There is no equivalent for the combining
    accent in 8859-1, but there is a LATIN CAPITAL LETTER A WITH ACUTE in
    8859-1. Should a conversion from T.61 to 8859 use it?

    You might say that most of this is obvious from inspection, and I'd be
    inclined to agree in most cases. But a reader should not have to do
    this sort of analysis to get the technical meat out of RFC-CHAR -- the
    rules should be spelled out clearly.

    This is far from an academic point. T.61 is the dominant character set
    used in X.400. MIME has chosen to use 8859-n instead (a wise choice, in my
    opinion). But the result of this choice is to make such conversions
    essential. Moreover, the conversions need to be done as correctly as
    possible.

Yes, treatment of T.61 conversion to 8859-1 (and 2,3,4,10) is indeed
a much wanted feature of RFC-CHAR. I am working on it. It is not
necessary for supporting RFC-MIME though. Let's see if it will
happen in the next draft of RFC-CHAR. Or else it may be specified
in a revision of RFC-CHAR or a new RFC, much like we expect new
RFCs to be specified with the RFC-MIME framework.
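
One possible shape for such a conversion rule is sketched below. It is not the
RFC-CHAR method. It assumes the T.61 text has already been decoded into a
sequence where the combining character follows its base letter (as in the
example under discussion), it borrows present-day Unicode composition from
Python's unicodedata module in place of RFC-CHAR's own tables, and the policy
of falling back to the bare base letter is an assumption. The name to_latin1
is hypothetical.

```python
import unicodedata


def to_latin1(text):
    """Convert base+combining text to ISO 8859-1 bytes.

    1. Compose each base letter with its combining marks where a composed
       character exists.
    2. Emit the composed character if 8859-1 has it.
    3. Otherwise drop the combining marks and emit the base character
       (lossy), or "?" if even the base is not in 8859-1.
    """
    out = bytearray()
    for ch in unicodedata.normalize("NFC", text):
        try:
            out += ch.encode("latin-1")
        except UnicodeEncodeError:
            for part in unicodedata.normalize("NFD", ch):
                if unicodedata.combining(part):
                    continue  # no standalone combining marks in 8859-1
                out += part.encode("latin-1", errors="replace")
    return bytes(out)


print(to_latin1("A\u0301"))  # b'\xc1'  (A WITH ACUTE exists in 8859-1)
print(to_latin1("s\u0301"))  # b's'     (no S WITH ACUTE in 8859-1)
```

The second call shows why the rules need to be written down: dropping the
accent is one defensible choice, but a converter could equally refuse the
message or substitute a mnemonic.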

(3) The preceding point brings the whole issue of character set conversion
    into sharper focus. Currently RFC-CHAR only deals with the problem of
    converting from one character set to a mnemonic encoding on top of another
    character set. This is fine as far as it goes, but it does not solve
    real-world problems adequately. For example, when converting messages into
    X.400 and T.61, there is no way to say that a mnemonic encoding is being
    used on top of T.61. (For that matter, the tendency in MIME work has been
    to limit the character sets mnemonic can be used on top of to straight

Oh, well, there *is* a way to specify mnemonic on top of T.61,
and that is actually what RARE-WG3 (X.500) has specified to use in
the European x.500 pilot service. The charset is called
mnemonic+t.61+38 .

    For these reasons it is imperative that conversions exploit the base
    character set to the fullest extent possible. And to do this, explicit
    rules must be provided that detail all the gnarly twists that have to be
    accounted for to do a "good" conversion.

There are many things that can be addressed there.
Which do you think are relevant?

    RFC-CHAR could elect to punt on the conversion issue. If it does so it
    becomes a far less useful document, in my opinion, and immediately raises
    the need for a companion document that covers this material.

Conversion may be a big issue, and currently RFC-CHAR only addresses 
the mnemonic way of doing such conversion, which retains all
information. Some mechanisms are built into RFC-CHAR to do
conversion with information loss, but which in some cases provide
a "better" result.
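
The difference between the two kinds of conversion can be sketched as follows.
This is an illustration under stated assumptions, not the RFC-CHAR algorithm:
the base character set is taken to be ASCII, the trailing 38 in a name such as
mnemonic+t.61+38 is assumed to give the decimal code of the intro character
("&"), doubling the intro character to escape it is an assumption of this
sketch, and MNEMONICS is a one-entry hypothetical excerpt built from the <A'>
example above. The function names are hypothetical as well.

```python
import unicodedata

INTRO = chr(38)  # "&"; assumed meaning of the "+38" suffix

# Hypothetical one-entry excerpt of a mnemonic table.
MNEMONICS = {"\u00c1": "A'"}  # LATIN CAPITAL LETTER A WITH ACUTE


def encode_lossless(text):
    """Mnemonic-style encoding on top of ASCII: nothing is thrown away."""
    out = []
    for ch in text:
        if ch == INTRO:
            out.append(INTRO + INTRO)  # escape rule assumed for this sketch
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Raises KeyError when the excerpt has no mnemonic for ch.
            out.append(INTRO + MNEMONICS[ch])
    return "".join(out)


def encode_lossy(text):
    """Fallback encoding: keep the base letter, lose the accent."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            base = unicodedata.normalize("NFD", ch)[0]
            out.append(base if ord(base) < 128 else "?")
    return "".join(out)


print(encode_lossless("\u00c1rhus"))  # &A'rhus
print(encode_lossy("\u00c1rhus"))     # Arhus
```

The first form can be converted back exactly; the second reads more naturally
to a recipient without mnemonic support, which is the sense in which it can be
the "better" result.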

(4) The list of references in RFC-CHAR is pretty skimpy. I cannot believe
    that all this detail about all these character sets was extracted from
    only five documents! I would like to see a much more complete
    bibliography. In fact, I'd like to have one that is comprehensive enough
    that I can tell where each character set's definition came from. Such a
    list is essential if conflicts arise (and they are certain to) -- it lets
    us determine what source was used and allows us to judge between
    conflicting sources of information. Moreover, it allows us to assess the
    technical validity of RFC-CHAR; without it all I can do are spot-checks.

Almost all of the character sets were taken from the two main sources,
the ECMA registry and the IBM manual. And what I did was actually
just to use these documents and write the tables.
There may be a few vendor charsets that lack a reference, but that is all.
I will add those references to the draft.

(6) Extremely minor typos:

    First page. 5rd December 1991 --> 5th December 1991.

    Section 2.1. definiton --> definition.

OK, corrected.

That's it for now.

I appreciate your comments.

