Re: Problems with RFC-MNEM & RFC-CHAR

 In the interest of a better set of Internet Mail RFCs, I'm appending
my concerns below for everyone's consideration.  I'd like to hear what
others think about the RFCs as well -- either in off-list mail or via
the list.  I think that this is the right list because these two RFCs
directly impact the work done on RFC-XXXX.


Well, I'm impressed that, with something this complex, we are so quickly 
and smoothly down to this level of quibbles (and some quibbles I, and 
probably others, have discussed with Keld directly).  I have to think 
further about some of the points Ran raises, but my reactions to those I 
can react to quickly follow:

2) RFC-CHAR refers to the ISO/ECMA registered version of ASCII which
 is reportedly not identical with the current definition of ASCII
 (ANSI X3.4).  The Internet standards require ANSI X3.4 rather than
 the ISO/ECMA version.  RFC-CHAR should be modified to use ANSI X3.4
 as the reference and conform to X3.4 in its table of characters.
 
 To do otherwise creates needless incompatibility with existing 
...

  There is a problem here, but this isn't it.
  Let's back up and review something.  There are things that we have
historically referred to as "character sets".  ASCII is one of them.
ISO646 (plus or minus weasel words about "national variations") is one
of them.   ISO 8859-1 is one of them.  The 7bit ones of those assign
specific meanings to each of 128 code positions (bit patterns).  The
8bit ones of those assign specific meanings to each of 256 code
positions.   Let's call these, notationally, Character Sets or, for 
emphasis, Complete Character Sets.

   ECMA DOES NOT REGISTER THOSE.  NEVER HAS.  These Complete Character 
Sets grow up to be National or International Standards or CCITT 
Recommendations, not registrations.

   There are procedures for _registering_:
  -  collections of 94 graphic characters, usually designated for use in
the GL positions (columns 2 through 7) of something. 
  - collections of 96 graphic characters, usually designated for use in
the GR positions (columns 10 through 15) of something. 
  - collections of control characters for use in the C0 positions 
(columns 0 and 1) or the C1 positions (columns 8 and 9) or both.
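
To make the column arithmetic concrete, here is a minimal sketch of how a code position falls into one of those four areas.  The function names are mine, purely for illustration, and the exclusion of the two corner positions 2/0 and 7/15 from a 94-character set is stated as an assumption in the code rather than taken from anything above.

```python
def code_area(byte: int) -> str:
    """Classify an 8-bit code position by its column in the 16x16 code table.

    The column is the high four bits and the row is the low four bits,
    so 0x41 is position 4/1.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit code position")
    column = byte >> 4
    if column <= 1:
        return "C0"   # columns 0 and 1
    if column <= 7:
        return "GL"   # columns 2 through 7
    if column <= 9:
        return "C1"   # columns 8 and 9
    return "GR"       # columns 10 through 15


def in_94_set(byte: int) -> bool:
    """True if a 94-character set designated into GL can fill this position.

    Assumption: positions 2/0 and 7/15 are left out of the registered set,
    which is why GL holds 94 characters while GR can hold 96.
    """
    return 0x21 <= byte <= 0x7E
```

Counting over all 256 positions with these two functions gives 96 positions in GR and 94 fillable positions in GL, matching the sizes quoted above.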

Now, just to keep everyone alert, the things-that-are-registered are 
usually referred to as "character sets".  This similarity in terminology 
confuses some of the people all of the time, and the rest of the people 
some of the time.  It is not Keld's fault, or Ran's fault, or my fault. 
ISO/IEC JTC1/SC2 on Codes and Character Sets did it all by themselves, 
and, ultimately, only they can clean it up.

Now, the general implication of the above is that the "ISO/ECMA 
registration for Latin-1" is a registration of a 96 character set 
defining what turns out to be the GR positions of the Latin-1 Character 
Set, ISO8859-1.  If that sentence isn't clear, folks, please go back 
and reread the preceding part of this note until it is.  The 
distinction is subtle, but very important.

RFC-CHAR lists a registration number for ISO8859-1 that is actually the 
registration number for a 96 character GR set.  Keld correctly points out 
that this is a convention commonly used by "character set experts", a 
list that includes some of the people who have been involved in SC2 and 
who, therefore, bear some of the blame for this mess.  But the 
registration number does not, strictly speaking, identify the ISO8859-1 
Character Set.  To do that, one also needs the registration numbers for 
the GL portion of ISO8859-1 and for the C0 and C1 portions.  One 
Complete Character Set, four registered character sets (!) required to 
identify it precisely.  Otherwise, you are relying on two pieces of 
out-of-band information when you see the registration number: (i) you 
have to "know" that an 8 bit Character Set is really being discussed and 
(ii) you have to "know" that it is a member of the 8859 family and, 
hence, be able to infer the other three registrations.
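
One way to picture "one Complete Character Set, four registered character sets" is as a record with four slots.  This is only a sketch: the class name, the field names, and above all the registration identifiers are placeholders I made up for illustration, not real registration numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CompleteCharacterSet:
    """A Complete Character Set described by its registered pieces."""
    name: str
    c0: str                    # control set for columns 0 and 1
    gl: str                    # 94-character set for columns 2 through 7
    c1: Optional[str] = None   # control set for columns 8 and 9 (8-bit only)
    gr: Optional[str] = None   # 96-character set for columns 10 through 15

    def registrations(self) -> List[str]:
        """Every registration needed to identify this set precisely."""
        parts = (self.c0, self.gl, self.c1, self.gr)
        return [p for p in parts if p is not None]

    def is_8bit(self) -> bool:
        return self.gr is not None


# Placeholder identifiers only; these are NOT actual registration numbers.
LATIN_1 = CompleteCharacterSet(
    name="ISO 8859-1",
    c0="REG-C0-PLACEHOLDER",
    gl="REG-GL-PLACEHOLDER",
    c1="REG-C1-PLACEHOLDER",
    gr="REG-GR-LATIN1-PLACEHOLDER",
)
```

Quoting only the `gr` field leaves the other three to be inferred, which is exactly the out-of-band knowledge described above.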

Now that brings us back to Ran's point/question about ASCII.  There is 
an ECMA registration for the GL portion of ASCII.  The US registered it, 
and it is exact.  The ASCII Standard cites the thing.  But to have a 
registration that completely identifies ASCII, you need a pair: that 
registration for GL (since ASCII is a 7-bit set, there is no GR) *and* 
the registration for the associated C0 set.

If one is being precise, that is the problem that needs solving, and it 
is a problem with every Character Set which Keld identifies with a single 
ISO/ECMA registration number.

As I suggested to Keld in private correspondence, there is only one 
"right" solution to this problem, and it involves communicating to SC2 
that it is time to create simple identifications (isomorphic with 
registration numbers) and designation sequences for all of their 
Complete Character Sets, especially the 8859-n group about which there 
is the most confusion.  Find your national delegation to SC2 and tell 
them.  They have a plenary coming up this fall and it would be nice if 
they had something significant to worry about besides 10646 ( :-) ).  
The agenda probably closes soon.

4) The example header refers to "ISO_8859-1" while RFC-XXXX uses 
   "ISO-8859-1".  RFC-MNEM should change to use the format and
  syntax specified in RFC-XXXX.

  A third candidate is ISO8859-1.  There is a matter of correctness 
here, in that ISO always writes these things as "ISO 8859-1" and, very 
occasionally, as "ISO8859-1".  They are never, never, written as 
"ISO-8859-1".  So, if one is going to follow ISO's notation, which is 
usually considered polite, that form would be excluded.  Now Keld 
observed to me this morning that ISO usually writes the things with a 
blank (which is true) and that his notation uses "_" consistently to 
substitute for blanks (which is reasonable).  I've had long-standing 
human factors problems that cause me to argue against the use of "-" and 
"_" in the same context, at least without a clear and easy-to-remember 
rule, but "substitutes for blanks" may be such a rule, at least for 
those of us who are familiar with ISO's own notation.
   I agree that the two should be consistent, but would not mind 
changing RFC-XXXX at the next draft, or changing both of them.
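
For what it's worth, the "substitutes for blanks" rule is easy to state mechanically.  The sketch below is an illustration only; neither function comes from either draft, and treating the three spellings as equivalent is my assumption about what an implementation might want to tolerate, not something either RFC specifies.

```python
import re


def iso_to_underscore(iso_name: str) -> str:
    """Apply the "_ substitutes for blanks" rule to a name written ISO's way."""
    return iso_name.replace(" ", "_")


def same_designation(a: str, b: str) -> bool:
    """Compare two spellings of an ISO designation.

    Assumption: case is ignored, and a single blank, "_" or "-" directly
    after the "ISO" prefix is treated as insignificant.  The hyphen before
    the part number (the "-1" in 8859-1) is left alone.
    """
    def canonical(name: str) -> str:
        return re.sub(r"^ISO[ _-]?", "ISO", name.strip().upper())

    return canonical(a) == canonical(b)
```

With this, "ISO 8859-1", "ISO_8859-1", "ISO-8859-1" and "ISO8859-1" all compare equal to one another, while "ISO 8859-2" does not match any of them.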

I want to take Ran's item 6 out of order....

 I think it is important for all RFCs to be clear and unambiguous and
 to actively try to prevent confusion from arising.

   I think we all agree with this.

This is one area
 where the IETF has historically done a good job (in contrast to other
 standards groups).

    I think this is debatable, but this is not the time or the place.

Referring to ISO standards for the sake of not 
 referring to the more correct ANSI standards is counter-productive.

   I don't believe that this is what Keld intends, even if the effect 
appears to be the same.  The argument is for, when possible, referring
to ISO standards instead of equivalent, or even nearly equivalent, ANSI
Standards because (i) They are International Standards and the Internet 
should be moving in that direction, (ii) it is sometimes very difficult 
and time-consuming to obtain ANSI Standards outside the US, so the ISO
documents are better to reference, and (iii) ANSI is gradually moving 
toward ratification of the ISO stuff, rather than developing local 
variations anyway.  In the third case, it would, I think, be appropriate 
to include in a reference "ISO NNNN:YYYY (technically identical to ANSI 
X3.MMM-YYYY)" and, if available and useful, references to other 
appropriate national standards.
   What we should avoid, and here is where the ASCII story comes back 
in, is retroactively changing the definition of a protocol by changing 
the targets of its references.  So, for example, the definitive RFC822 
character set is ASCII, not ISO646, because the latter isn't ASCII and 
substituting one for the other could change the protocol in a subtle (or 
not-so-subtle) way.  It is appropriate to view ASCII as a national 
variation on ISO646, but that makes it a national Standard, not an 
International one.

6) It would be desirable for all references to "ASCII" in RFC-MNEM and
 RFC-CHAR be changed to "US ASCII" so that people outside the US who
 are accustomed to referring to ALL 7-bit character sets as "ascii" in
 common usage do not inadvertently misread the content of the RFCs.


Again, I think there are *very* strong arguments for consistency with 
RFC-XXXX.  That said, I think "US ASCII" is redundant, silly, and a 
general eyesore.  There are lots more people within the US who confuse 
ASCII with "any 7 bit character set" than there are outside.  Most of 
the "outsiders" are painfully aware of the differences between their 
national 646 variations and ASCII, of the differences between their 
keyboards and the US norms, etc.  I don't think there is any special 
virtue gained or lost in the process, but we (in the US) have been 
amazingly successful in half-exporting our technology and conventions.  
I'd guess there are a lot of devices in Denmark that display Keld's name 
with a vertical bar, and I'd guess that, every time he sees that 
vertical bar, he is reminded that the Danish national 646 variation 
isn't ASCII.  And that realization doesn't take Keld's knowledge and 
experience to understand.  I'd imagine every Danish schoolchild who has 
been through "typing on the computer 1 and 2" understands it also.
   The above story is obviously also true if you substitute any 
non-English-speaking country that uses the Latin alphabet for Denmark.
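
The vertical-bar effect can be sketched in a few lines.  The table below is an assumption on my part: it lists the six positions that, as I understand the Danish national variant, carry the Danish letters, and it ignores any other positions where the variant may differ from ASCII.  The sample word is just a convenient one containing the letter in question.

```python
# Assumed mapping: code positions where the Danish 646 variant carries
# national letters instead of the ASCII punctuation at the same positions.
DANISH_646_OVERRIDES = {
    0x5B: "Æ", 0x5C: "Ø", 0x5D: "Å",
    0x7B: "æ", 0x7C: "ø", 0x7D: "å",
}


def decode_danish_646(data: bytes) -> str:
    """Interpret 7-bit data using the assumed Danish variant table."""
    if any(b > 0x7F for b in data):
        raise ValueError("expected 7-bit data")
    return "".join(DANISH_646_OVERRIDES.get(b, chr(b)) for b in data)


def decode_ascii(data: bytes) -> str:
    """Interpret the same data as ASCII, as a US-configured device would."""
    return data.decode("ascii")


sample = b"sm|rrebr|d"   # the same seven-bit codes, read two ways
```

Here `decode_ascii(sample)` shows the vertical bars, and `decode_danish_646(sample)` shows the intended letters.  The bytes are identical; only the Character Set assumed by the reader differs.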
   Now, I've got a friend who has several dogs.  And one of them is very 
large, and black, and shaggy, and sort of lumbers around.  As a result, 
it is often referred to as "the bear".  Since he doesn't own any bears, 
there is rarely any ambiguity.  When ambiguity is possible, we don't 
start using bear-bear and dog-bear, we just call dogs "dogs" and bears 
"bears".
   All ASCIIs are American [National] Standard...  The only real excuse 
for putting "US" in front of it is to clarify the same thing the 
"National" part of ANSI's current name clarifies: that this is not a 
Pan-American Standard, or a hemispherical Standard, or some such thing.
   I don't feel nearly as strongly about this as I probably sound, but I 
am thoroughly sick of the argument.   :-}

7)  It isn't clear to me that "quoted printable" is useful for many
 non-European languages because many glyphs cannot be usefully
 represented using strings of US ASCII.  I'm not sure that this is
 fixable, but I'm concerned that we not end up being Euro-centric.
 We should in fact try to address the non-European concerns 
 (Chinese, etc.) as well.

   I'm not convinced that quoted printable is Euro-centric.  Even if 
Keld's starting point was needs in Europe, I think things have evolved 
beyond that (I do, to keep this in perspective, consider last winter's 
DIS10646 to have been severely Western Euro-centric).
   What it clearly is at the moment is Alpha-centric, in the sense that 
it copes well with alphabetic and phonetic languages, and less well with 
ideographic ones.  For better or worse, that is consistent with the 
general state of the art in both character handling and linguistics, 
partially because alphabetic writing systems tend to involve a small and 
closed set of symbols while the ideographic ones rapidly turn into 
problems of enumeration and classification.
   While I'd be pleased to see a few centuries of linguistic problems 
solved in quoted-printable, I think that, if it could rationalize all of 
the alphabetic characters (I'm not convinced of that, incidentally, but 
have run out of counterexamples that Keld and I both find satisfying), 
it would be a major step forward.
  And the problem is not "European" versus "non-European", incidentally. 
There is at least some evidence that old Minoan and Middle Kingdom 
Egyptian (whose influence spread quite widely) are ideographic.  
And Sanskrit and Thai are clearly alphabetic, as are Arabic, Hindi,...
   --john