In the interest of a better set of Internet Mail RFCs, I'm appending
my concerns below for everyone's consideration. I'd like to hear what
others think about the RFCs as well -- either in off-list mail or via
the list. I think that this is the right list because these two RFCs
directly impact the work done on RFC-XXXX.
Well, I'm impressed that, with something this complex, we are so quickly
and smoothly down to this level of quibbles (and some quibbles I, and
probably others, have discussed with Keld directly). I have to think
further about some of the points Ran raises, but my reactions to those I
can react to quickly follow:
2) RFC-CHAR refers to the ISO/ECMA registered version of ASCII which
is reportedly not identical with the current definition of ASCII
(ANSI X3.4). The Internet standards require ANSI X3.4 rather than
the ISO/ECMA version. RFC-CHAR should be modified to use ANSI X3.4
as the reference and conform to X3.4 in its table of characters.
To do otherwise creates needless incompatibility with existing
...
There is a problem here, but this isn't it.
Let's back up and review something. There are things that we have
historically referred to as "character sets". ASCII is one of them.
ISO646 (plus or minus weasel words about "national variations") is one
of them. ISO 8859-1 is one of them. The 7bit ones of those assign
specific meanings to each of 128 code positions (bit patterns). The
8bit ones of those assign specific meanings to each of 256 code
positions. Let's call these, notationally, Character Sets or, for
emphasis, Complete Character Sets.
ECMA DOES NOT REGISTER THOSE. NEVER HAS. These Complete Character
Sets grow up to be National or International Standards or CCITT
Recommendations, not registrations.
There are procedures for _registering_:
- collections of 94 graphic characters, usually designated for use in
the GL positions (columns 2 through 7) of something.
- collections of 96 graphic characters, usually designated for use in
the GR positions (columns 10 through 15) of something.
- collections of control characters for use in the C0 positions
(columns 0 and 1) or the C1 positions (columns 8 and 9) or both.
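
As a rough illustration of the column layout behind those three kinds of registration, here is a minimal Python sketch. It assumes the usual 16-column code table in which the high four bits of a byte select the column; the function name `region` is invented for this note and comes from no standard.

```python
def region(byte: int) -> str:
    """Return the code-table region (C0, GL, C1 or GR) a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    column = byte >> 4      # high four bits select the column, 0 through 15
    if column <= 1:
        return "C0"         # columns 0 and 1: control characters
    if column <= 7:
        return "GL"         # columns 2 through 7: left-hand graphics
    if column <= 9:
        return "C1"         # columns 8 and 9: additional controls
    return "GR"             # columns 10 through 15: right-hand graphics


if __name__ == "__main__":
    for value in (0x0A, 0x41, 0x85, 0xE9):
        print(f"{value:#04x} -> {region(value)}")
```

A 7-bit Character Set only ever uses the C0 and GL regions; an 8-bit one can use all four.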
Now, just to keep everyone alert, the things-that-are-registered are
usually referred to as "character sets". This similarity in terminology
confuses some of the people all of the time, and the rest of the people
some of the time. It is not Keld's fault, or Ran's fault, or my fault.
ISO/IEC JTC1/SC2 on Codes and Character Sets did it all by themselves,
and, ultimately, only they can clean it up.
Now, the general implication of the above is that the "ISO/ECMA
registration for Latin-1" is a registration of a 96 character set
defining what turns out to be the GR positions of the Latin-1 Character
Set, ISO8859-1. If that sentence isn't clear, folks, please go back
and reread the preceding part of this note until it is. The
distinction is subtle, but very important.
RFC-CHAR lists a registration number for ISO8859-1 that is actually the
registration number for a 96 character GR set. Keld correctly points out
that this is a convention commonly used by "character set experts", a
list that includes some of the people who have been involved in SC2 and
who, therefore, bear some of the blame for this mess. But the
registration number does not, strictly speaking, identify the ISO8859-1
Character Set. To do that, one also needs the registration numbers for
the GL portion of ISO8859-1 and for the C0 and C1 portions. One
Complete Character Set, four registered character sets (!) required to
identify it precisely. Otherwise, you are relying on two pieces of
out-of-band information when you see the registration number: (i) you
have to "know" that an 8 bit Character Set is really being discussed and
(ii) you have to "know" that it is a member of the 8859 family and,
hence, be able to infer the other three registrations.
Now that brings us back to Ran's point/question about ASCII. There is
an ECMA registration for the GL portion of ASCII. The US registered it,
and it is exact. The ASCII Standard cites the thing. But to have a
registration that completely identifies ASCII, you need a pair: that
registration for GL (since ASCII is a 7-bit set, there is no GR) *and*
the registration for the associated C0 set.
If one is being precise, that is the problem that needs solving, and it
is a problem with every Character Set which Keld identifies with a single
ISO/ECMA registration number.
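
The bookkeeping can be made concrete with a small Python sketch. Everything in it is illustrative: the class name, the field names, and especially the registration strings, which are placeholders rather than real ISO/ECMA registration numbers. The point is only that a Complete Character Set is named by a tuple of registrations, two for a 7-bit set and four for an 8-bit one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CompleteCharacterSet:
    """A Complete Character Set described by its component registrations."""
    name: str
    c0: str                     # registration of the C0 control set
    gl: str                     # registration of the 94-character GL set
    c1: Optional[str] = None    # registration of the C1 control set (8-bit only)
    gr: Optional[str] = None    # registration of the GR graphic set (8-bit only)

    def registrations(self) -> List[str]:
        """List every registration needed to identify the set precisely."""
        return [r for r in (self.c0, self.gl, self.c1, self.gr) if r is not None]


# Placeholder strings only; these are NOT actual registration numbers.
ascii_set = CompleteCharacterSet("ASCII", c0="<C0 reg>", gl="<ASCII GL reg>")
latin1 = CompleteCharacterSet(
    "ISO 8859-1",
    c0="<C0 reg>", gl="<GL reg>", c1="<C1 reg>", gr="<Latin-1 GR reg>",
)

if __name__ == "__main__":
    print(ascii_set.name, len(ascii_set.registrations()))   # two registrations
    print(latin1.name, len(latin1.registrations()))         # four registrations
```

Quoting only the GR registration for an 8859 set amounts to supplying one field of this record and leaving the reader to infer the other three.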
As I suggested to Keld in private correspondence, there is only one
"right" solution to this problem, and it involves communicating to SC2
that it is time to create simple identifications (isomorphic with
registration numbers) and designation sequences for all of their
Complete Character Sets, especially the 8859-n group about which there
is the most confusion. Find your national delegation to SC2 and tell
them. They have a plenary coming up this fall and it would be nice if
they had something significant to worry about besides 10646 ( :-) ).
The agenda probably closes soon.
4) The example header refers to "ISO_8859-1" while RFC-XXXX uses
"ISO-8859-1". RFC-MNEM should change to use the format and
syntax specified in RFC-XXXX.
A third candidate is ISO8859-1. There is a matter of correctness
here, in that ISO always writes these things as "ISO 8859-1" and, very
occasionally, as "ISO8859-1". They are never, never, written as
"ISO-8859-1". So, if one is going to follow ISO's notation, which is
usually considered polite, that form would be excluded. Now Keld
observed to me this morning that ISO usually writes the things with a
blank (which is true) and that his notation uses "_" consistently to
substitute for blanks (which is reasonable). I've had long-standing
human factors problems that cause me to argue against the use of "-" and
"_" in the same context, at least without a clear and easy-to-remember
rule, but "substitutes for blanks" may be such a rule, at least for
those of us who are familiar with ISO's own notation.
I agree that the two should be consistent, but would not mind
changing RFC-XXXX at the next draft, or changing both of them.
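
For what it is worth, the "substitutes for blanks" rule and the cost of the competing spellings can both be shown in a few lines of Python. This is a sketch under stated assumptions: the function names are made up here, and the tolerant comparison in `canonical` is not something either RFC proposes; it only shows what software would have to do if the spellings were left inconsistent.

```python
def underscore_form(iso_written: str) -> str:
    """Apply the 'substitutes for blanks' rule: every blank becomes '_'."""
    return iso_written.replace(" ", "_")


def canonical(name: str) -> str:
    """Fold case and drop one separator after the 'ISO' prefix.

    Hypothetical helper: it makes 'ISO 8859-1', 'ISO_8859-1',
    'ISO-8859-1' and 'ISO8859-1' compare equal.
    """
    upper = name.upper()
    if upper.startswith("ISO") and upper[3:4] in (" ", "_", "-"):
        return "ISO" + upper[4:]
    return upper


if __name__ == "__main__":
    print(underscore_form("ISO 8859-1"))
    spellings = ["ISO 8859-1", "ISO_8859-1", "ISO-8859-1", "ISO8859-1"]
    print({canonical(s) for s in spellings})
```

Agreeing on one spelling in both documents makes a helper like `canonical` unnecessary, which is the stronger argument for consistency.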
I want to take Ran's item 6 out of order....
I think it is important for all RFCs to be clear and unambiguous and
to actively try to prevent confusion from arising.
I think we all agree with this.
This is one area
where the IETF has historically done a good job (in contrast to other
standards groups).
I think this is debatable, but this is not the time or the place.
Referring to ISO standards for the sake of not
referring to the more correct ANSI standards is counter-productive.
I don't believe that this is what Keld intends, even if the effect
appears to be the same. The argument is for, when possible, referring
to ISO standards instead of equivalent, or even nearly equivalent, ANSI
Standards because (i) They are International Standards and the Internet
should be moving in that direction, (ii) it is sometimes very difficult
and time-consuming to obtain ANSI Standards outside the US, so the ISO
documents are better to reference, and (iii) ANSI is gradually moving
toward ratification of the ISO stuff, rather than developing local
variations anyway. In the third case, it would, I think, be appropriate
to include in a reference "ISO NNNN:YYYY (technically identical to ANSI
X3.MMM-YYYY)" and, if available and useful, to name the equivalent
national standards of other countries as well.
What we should avoid, and here is where the ASCII story comes back
in, is retroactively changing the definition of a protocol by changing
the targets of its references. So, for example, the definitive RFC822
character set is ASCII, not ISO646, because the latter isn't ASCII and
substituting one for the other could change the protocol in a subtle (or
not-so-subtle) way. It is appropriate to view ASCII as a national
variation on ISO646, but that makes it a national Standard, not an
International one.
6) It would be desirable for all references to "ASCII" in RFC-MNEM and
RFC-CHAR be changed to "US ASCII" so that people outside the US who
are accustomed to referring to ALL 7-bit character sets as "ascii" in
common usage do not inadvertently misread the content of the RFCs.
Again, I think there are *very* strong arguments for consistency with
RFC-XXXX. That said, I think "US ASCII" is redundant, silly, and a
general eyesore. There are lots more people within the US who confuse
ASCII with "any 7 bit character set" than there are outside. Most of
the "outsiders" are painfully aware of the differences between their
national 646 variations and ASCII, of the differences between their
keyboards and the US norms, etc. I don't think there is any special
virtue gained or lost in the process, but we (in the US) have been
amazingly successful in half-exporting our technology and conventions.
I'd guess there are a lot of devices in Denmark that display Keld's name
with a vertical bar, and I'd guess that, every time he sees that
vertical bar, he is reminded that the Danish national 646 variation
isn't ASCII. And that realization doesn't take Keld's knowledge and
experience to understand. I'd imagine every Danish schoolchild who has
been through "typing on the computer 1 and 2" understands it also.
The above story is obviously also true if you substitute any
non-English-speaking country that uses the Latin alphabet for Denmark.
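
The vertical-bar effect is easy to reproduce. The Python sketch below assumes that the Danish national 646 variation differs from ASCII in six code positions, with the slashed o at the position ASCII gives to the vertical bar; that table, the sample word, and the function name are assumptions made for illustration and should be checked against the Danish standard before being relied on.

```python
# Assumed differences between the Danish 646 variation and ASCII.
DANISH_646_OVERRIDES = {
    0x5B: "Æ", 0x5C: "Ø", 0x5D: "Å",
    0x7B: "æ", 0x7C: "ø", 0x7D: "å",
}


def render(data: bytes, overrides=None) -> str:
    """Display 7-bit data, using ASCII except where an override applies."""
    overrides = overrides or {}
    if any(b > 0x7F for b in data):
        raise ValueError("only 7-bit data is handled in this sketch")
    return "".join(overrides.get(b, chr(b)) for b in data)


if __name__ == "__main__":
    stored = bytes([0x7C, 0x6C])                  # a two-letter Danish word
    print(render(stored))                         # as an ASCII device shows it
    print(render(stored, DANISH_646_OVERRIDES))   # as a Danish device shows it
```

The same two bytes come out as "|l" on an ASCII device and as "øl" on a Danish one, which is the reminder described above.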
Now, I've got a friend who has several dogs. And one of them is very
large, and black, and shaggy, and sort of lumbers around. As a result,
it is often referred to as "the bear". Since he doesn't own any bears,
there is rarely any ambiguity. When ambiguity is possible, we don't
start using bear-bear and dog-bear, we just call dogs "dogs" and bears
"bears".
All ASCIIs are American [National] Standard... The only real excuse
for putting "US" in front of it is to clarify the same thing the
"National" part of ANSI's current name clarifies: that this is not a
Pan-American Standard, or a hemispherical Standard, or some such thing.
I don't feel nearly as strongly about this as I probably sound, but I
am thoroughly sick of the argument. :-}
7) It isn't clear to me that "quoted printable" is useful for many
non-European languages because many glyphs cannot be usefully
represented using strings of US ASCII. I'm not sure that this is
fixable, but I'm concerned that we not end up being Euro-centric.
We should in fact try to address the non-European concerns
(Chinese, etc.) as well.
I'm not convinced that quoted printable is Euro-centric. Even if
Keld's starting point was needs in Europe, I think things have evolved
beyond that (I do, to keep this in perspective, consider last winter's
DIS10646 to have been severely Western Euro-centric).
What it clearly is at the moment is Alpha-centric, in the sense that
it copes well with alphabetic and phonetic languages, and less well with
ideographic ones. For better or worse, that is consistent with the
general state of the art in both character handling and linguistics,
partially because alphabetic writing systems tend to involve a small and
closed set of symbols while the ideographic ones rapidly turn into
problems of enumeration and classification.
While I'd be pleased to see a few centuries of linguistic problems
solved in quoted-printable, I think that, if it could rationalize all of
the alphabetic characters (I'm not convinced of that, incidentally, but
have run out of counterexamples that Keld and I both find satisfying),
it would be a major step forward.
And the problem is not "European" versus "non-European", incidentally.
There is at least some evidence that old Minoan (along with Middle
Kingdom Egyptian, whose influence spread quite widely) is ideographic.
And Sanskrit and Thai are clearly alphabetic, as are Arabic, Hindi,...
--john