
Comments on RFC-CHAR

1992-01-01 17:39:26
I think it is time for this group to get started on serious review of RFC-CHAR.
Hence this message... I had refrained from posting many comments about RFC-CHAR
previously because I was uncertain of its status. I am still uncertain of its
status, but this general silence seems to have been taken as a sign that there
are no problems with RFC-CHAR. This is very far from true, at least as far as
I'm concerned.

Without further ado, here are the problems that I see in the current RFC-CHAR

(0) RFC-CHAR, in its current draft, makes use of the current draft version of
    10646. In fact, the character names and definitions of 10646 form the
    basis for RFC-CHAR. 10646 is in a state of flux at present, and its future
    is far from certain. I realize that RFC-CHAR only uses the formal names and
    the 2-octet encoding sequences from 10646, but the bottom line is that it
    references 10646 in a nontrivial way. The implication is that further
    tracking of the evolution of 10646 will also be done. My understanding is
    that this is an untenable position for an Internet standards-track document
    to take.

    Either the dependence on 10646 must go away (and having said that, I don't
    know quite to what extent this must be accomplished, and I would really
    like to see some authoritative information on this. Mr Chair?) or 10646
    must achieve solidity before RFC-CHAR proceeds down the standards path.

    I have no problem with the short-term or even immediate publication of
    RFC-CHAR as an experimental RFC. The remaining issues on this list still
    need to be addressed, however.

(1) There are a bunch of duplicate character set names and aliases in the table.
    In my opinion, a character set name or alias should uniquely identify the
    character set. If this is not the case, language describing how
    duplicates are to be handled MUST be added to the document.
    In case anyone cares, the duplicate names in the current table are:


    The alias cp290 is also defined twice, but it is entered twice for the same
    character set. This then is just a typo, I guess -- or does it have some
    extra meaning I'm not aware of? There is also a funny sequence in front of
    the "code 0" specification for IBM290. I think this is out of order:

        charset IBM290
        alias cp290
        alias EBCDIC-JP-kana
        alias cp290           <-- Unnecessary duplicate
        __                    <-- What is this?
        code 0

(2) RFC-CHAR defines various non-spacing characters. Most of them are
    COMBINING characters taken from 10646; one is a NON-SPACING character
    from ISO 5426:

    "2      e002    NON-SPACING UMLAUT (ISO 5426 201)
    "!      0300    COMBINING GRAVE ACCENT
    "'      0301    COMBINING ACUTE ACCENT
    "?      0303    COMBINING TILDE
    "-      0304    COMBINING MACRON
    "(      0306    COMBINING BREVE
    ".      0307    COMBINING DOT ABOVE
    ":      0308    COMBINING DIAERESIS
    "0      030a    COMBINING RING ABOVE
    "<      030c    COMBINING CARON
    ",      0327    COMBINING CEDILLA
    ";      0328    COMBINING OGONEK
    "_      0332    COMBINING LOW LINE
    "=      0333    COMBINING DOUBLE LOW LINE

    First of all, the terms COMBINING and NON-SPACING need to be defined. I
    realize that these terms are defined in various ISO documents, but the
    need to reference other documents to even be able to read RFC-CHAR should
    be avoided. (Quick question to the list readership: does a combining
    accent precede or follow the character it applies to? No fair peeking at
    other documents or reading the rest of this note!)
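
    For what it's worth, in 10646/Unicode a combining character follows the
    base character it modifies (T.61's non-spacing accents, by contrast,
    precede theirs). A quick sketch with Python's standard unicodedata
    module -- an anachronism here, but it encodes the same ordering rule --
    makes the 10646 convention visible:

```python
import unicodedata

# Canonically decomposing U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE)
# yields the base letter first, then the combining mark.
decomposed = unicodedata.normalize("NFD", "\u00c1")
print([unicodedata.name(c) for c in decomposed])
# ['LATIN CAPITAL LETTER A', 'COMBINING ACUTE ACCENT']
```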

    These combining characters can be used to build a lot of characters by
    composition that appear elsewhere in the canonical set. The reason for
    having such things in the base mnemonic set is threefold: (1) Other
    character sets (notably T.61) use this technique and must have something
    in mnemonic to convert to, (2) It is possible to build characters using
    these facilities that are not in the canonical set, and (3) They are
    definitely part of Unicode and 10646 and aligning mnemonic with them is
    a good idea.

    However, the presence of these combining characters presents us with two
    problems. The first is one of canonicalization -- when you can represent
    something in two different ways, which way do you use? This is a problem
    that Unicode and 10646 have to deal with as well, and I have no problem
    punting on the issue and letting those folks deal with it (but see (0)
    above and what this then implies about RFC-CHAR as a standard). But if
    RFC-CHAR is simply going to follow the Unicode/10646 lead on this point,
    it should say so.
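
    To make the canonicalization problem concrete, here is a small Python
    sketch (again using the modern unicodedata module; draft 10646 poses
    the same ambiguity) showing that the precomposed character and the
    base-plus-combining sequence are distinct encodings of one abstract
    character:

```python
import unicodedata

precomposed = "\u00c1"     # LATIN CAPITAL LETTER A WITH ACUTE, one code point
combining_pair = "A\u0301" # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# The two spellings differ code point for code point...
assert precomposed != combining_pair
# ...yet each canonical normalization maps both to a single agreed form.
assert unicodedata.normalize("NFC", combining_pair) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining_pair
```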

    The second problem is more serious. Let's say I'm converting from
    T.61 to 8859-1 using RFC-CHAR's tables. There are no equivalents for T.61's
    combining characters in 8859-1. There are, however, many equivalents for
    the composition of various combining characters with other characters. For
    example, suppose you have a LATIN CAPITAL LETTER A followed by a
    COMBINING ACUTE ACCENT in T.61. There is no equivalent for the combining
    accent alone, but 8859-1 does have LATIN CAPITAL LETTER A WITH ACUTE.
    Should a conversion from T.61 to 8859-1 use it?

    You might say that most of this is obvious from inspection, and I'd be
    inclined to agree in most cases. But a reader should not have to do
    this sort of analysis to get the technical meat out of RFC-CHAR -- the
    rules should be spelled out clearly.
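
    One plausible rule, sketched in Python under the assumption that a
    converter may compose a base character with its accent whenever the
    target set contains the precomposed form (t61_pair_to_latin1 is a
    hypothetical helper, not anything defined by RFC-CHAR):

```python
import unicodedata

def t61_pair_to_latin1(accent: str, base: str) -> bytes:
    """Compose a T.61 non-spacing accent with its base letter and encode
    the result in ISO 8859-1. T.61 transmits the accent BEFORE the base;
    Unicode places the combining mark AFTER it, so we swap the order,
    compose, and let the encoder reject anything 8859-1 lacks."""
    composed = unicodedata.normalize("NFC", base + accent)
    return composed.encode("latin-1")

# ACUTE ACCENT + "A" in T.61 order -> 8859-1 byte 0xC1 (A WITH ACUTE)
print(t61_pair_to_latin1("\u0301", "A"))  # b'\xc1'
```

    A converter would still need explicit rules for the cases this sketch
    punts on: pairs whose composition is not in 8859-1, and accents that
    appear with no base character at all.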

    This is far from an academic point. T.61 is the dominant character set
    used in X.400. MIME has chosen to use 8859-n instead (a wise choice, in my
    opinion). But the result of this choice is to make such conversions
    essential. Moreover, the conversions need to be done as correctly as
    possible.

(3) The preceding point brings the whole issue of character set conversion
    into sharper focus. Currently RFC-CHAR only deals with the problem of
    converting from one character set to a mnemonic encoding on top of another
    character set. This is fine as far as it goes, but it does not solve
    real-world problems adequately. For example, when converting messages into
    X.400 and T.61, there is no way to say that a mnemonic encoding is being
    used on top of T.61. (For that matter, the tendency in MIME work has been
    to limit the character sets mnemonic can be used on top of to straight
    ASCII.)

    For these reasons it is imperative that conversions exploit the base
    character set to the fullest extent possible. And to do this, explicit
    rules must be provided that detail all the gnarly twists that have to be
    accounted for to do a "good" conversion.

    RFC-CHAR could elect to punt on the conversion issue. If it does so it
    becomes a far less useful document, in my opinion, and immediately raises
    the need for a companion document that covers this material.

(4) The list of references in RFC-CHAR is pretty skimpy. I cannot believe
    that all this detail about all these character sets was extracted from
    only five documents! I would like to see a much more complete
    bibliography. In fact, I'd like to have one that is comprehensive enough
    that I can tell where each character set's definition came from. Such a
    list is essential if conflicts arise (and they are certain to) -- it lets
    us determine what source was used and allows us to judge between
    conflicting sources of information. Moreover, it allows us to assess the
    technical validity of RFC-CHAR; without it all I can do are spot-checks.

(5) The current draft of RFC-CHAR (December 5th) does not seem to be
    available from the drafts repository. It has been out for almost a month
    now... Does this mean the current draft has not been accepted as an
    Internet draft? I got my copy from Keld's system directly. How many people
    have reviewed the current draft, given its limited availability?

(6) Extremely minor typos:

    First page. 5rd December 1991 --> 5th December 1991.

    Section 2.1. definiton --> definition.

That's it for now.

