Comments & Reactions to the draft


 Here are some comments on the draft RFC, along with some reactions
 to others' comments.  I've tried to keep the quotes in context; if
 I've failed and gotten an attribution wrong, please accept my
 apologies.  I may repeat myself a bit in this note because I'm
 replying to several messages at once; apologies for any
 inconvenience this causes.

 Date: Tue, 23 Apr 91 14:42:45 -0400
 From: Greg Vaudreuil <gvaudre(_at_)nri(_dot_)reston(_dot_)va(_dot_)us>
 Message-Id: <9104231442(_dot_)aa11986(_at_)NRI(_dot_)NRI(_dot_)Reston(_dot_)VA(_dot_)US>

% Third, none of the content-types are for 8 bit text.  Only ASCII is
% specified as a defined content type.  I realize that this is not yet
% settled on the mailing list, but it would be nice to have at least a
% strawman available for other content-types as well as examples.
% Possible examples include the 8859-n family, the 2022 family, 646
% national variant family, and the 10646 set.

 I think that we do need to explicitly support at least the ISO 8859/n
 family of character set encodings.  These are already widely implemented
 and are clearly where much of the world is heading.  The ISO DIS 10646
 and UNICODE drafts are not yet complete or approved and so it would be
 premature to explicitly support them.  It might be reasonable to indicate
 that future support for them would be considered, but actually adding it
 would be premature just now.

% That said, I'd like to see a "common" character set defined in this
% document.  At this point, that seems to point to either 10646, or
% Unicode.  Both have their dis-advantages, but they are implementable.
% Use of other codes are also acceptable. 

 The above proposal seems premature in terms of what is practical in
 the real world for the near term.  The IETF and IAB really can't 
 force someone to pay the extra dollars for multilingual support on 
 their system if they don't need it.  Already there are a lot of 
 systems out there that cannot conform to other RFCs (e.g. using
 the DNS instead of /etc/hosts) and I don't see the point in drafting
 a requirement that isn't going to be enforceable.

 No common character set should be defined in the current RFC;
 instead it should support the key US and ISO character set
 standards that are in wide use within the Internet community.

 Date: Tue, 23 Apr 91 21:41:26 +0200
 From: Keld J|rn Simonsen <keld(_at_)dkuug(_dot_)dk>
 Message-Id: <9104231941(_dot_)AA07836(_at_)dkuug(_dot_)dk>
 To: erik(_at_)sra(_dot_)co(_dot_)jp, ietf-822(_at_)dimacs(_dot_)rutgers(_dot_)edu
 Subject: Re:  ISO-CHARSET-TYPE -- some comments

%Erik (living in Japan) writes:

%> Another approach would be to acknowledge that what we are really
%> trying to support are the national ISO 646 variants. So you might give
%> the Content-Type a name like this:
%> 
%>      ISO-646-<registration-number>
%> 
%> E.g.
%> 
%>      ISO-646-4       (for the United Kingdom)
%> 
%> Comments? Keld?

% I don't think that the main thing is support for national ISO 646 variants.
% The main thing is support for other characters/letters than ASCII. 
% And then support for the character sets that is used on the machines
% that the users use... Be it latin1, latin2, Greek, national ISO 646
% variant, japanese encoding, IBM codepages ..., and hopefully
% without loss of interoperability and without information loss.

% I would rather stick to the more general naming than ISO-646-4,
% such as ISO-IR-4 - as I think the 8-bit sets may become extremely
% important. The ISO-IR-nr is not my invention, BTW, but an EWOS PT
% recommendation - as mentioned earlier.

  I am firmly with Erik on this.  The main issue is support for
character set encodings that are widely used within the Internet
community.  This clearly includes non-Roman character sets.  Implementors
are much less likely to implement this RFC correctly if it is defined
in terms of the ISO-IR-* registrations than if it is defined in terms
of standards in common use such as the ISO 646 and ISO 8859 families
of character set encodings.

  Also, the control code issues don't arise within either the ISO 646
or ISO 8859 families because the base control code definitions are the
same as in US ASCII (X3.4-1986).  Also see Erik's comments below on
why defining ISO 8859/1 in terms of ISO-IR-* is a bad idea and my
later comments in this note.

  Please note that I am really a lot more concerned that ISO-8859-N be
added as a fully supported type than I am with ISO-646-N, because the
ISO 646 standards are quickly fading away in favor of the ISO 8859
family which supersedes them.  Also, we need to keep in mind that
16-bit and 32-bit support will be needed in the future.

 Date: Tue, 23 Apr 91 22:31:47 +0200
 From: Keld J|rn Simonsen <keld(_at_)dkuug(_dot_)dk>
 Message-Id: <9104232031(_dot_)AA09292(_at_)dkuug(_dot_)dk>
 To: gvaudre(_at_)nri(_dot_)reston(_dot_)va(_dot_)us, net(_at_)ymir(_dot_)claremont(_dot_)edu
 Subject: Re: TEXT version of Draft RFC

% On the other hand, I would recommend that only a selected list of character
% sets should be generally accepted. NETF and EUnet has decided for 2
% such universal accepted character sets namely ASCII and 10646 in compaction
% method 5 level 2. If this list should be extended, I would recommend
% the 8859 series and nothing more. Well, Japanese, Chinese ...

 It seems to me that it is necessary to support the ISO 8859 family of
full character set definitions in the proposed RFC.  In particular,
the font selections under X11 on all the systems here (various
vendors) define the font in terms of WHICH member of the ISO 8859
family is implemented, not in terms of ISO IR-N or other less commonly
used terminology or definitions.

  The ISO DIS 10646 is not yet complete and still has proposed changes
pending (for example it needs some slight changes in order to support
Vietnamese correctly).  It should not be explicitly supported until it
is a final, approved ISO standard rather than just a draft.

  UNICODE is similarly dynamic at present and lacks support for
Vietnamese, Thai, and a number of other languages.  Once it settles
down, consideration as to whether it should be supported would be
appropriate, but not until it is finished being defined.  Also, the
concern of the RFC should be on character set encodings that are or
will be widely used in the Internet community and it isn't clear that
UNICODE will be in that group.  There has been some discussion about
combining the UNICODE effort with the DIS 10646, though there is
dissension between the two groups.

  The Japanese and Chinese situation is awkward at best.  It appears
that the ISO DIS 10646 will be the best long term solution, but it
isn't here yet.  (I just got an email note from a colleague at HP that
indicated that ANSI has rejected the DIS 10646 with specific comments
about changes needed to support Vietnamese and also seeking Han
character unification.  It might take a while for the DIS to be
approved at this rate. :-)  The Japanese standards, though, are already
clearly defined and widely used, and a case could be made that they
should be included.

 From: "Erik M. van der Poel" <erik(_at_)sra(_dot_)co(_dot_)jp>
 To: ietf-822(_at_)dimacs(_dot_)rutgers(_dot_)edu
 Subject: Re: ISO-CHARSET-TYPE -- some comments
 Date: Wed, 24 Apr 91 13:41:39 +0900

% OK, let me try to clarify what I meant about ISO-IR-n. Let's start
% with Latin-1 as an example. This is used as an 8-bit code, maximally
% with C0, G0, C1 and G1 (though C1 may not be as frequently used -- I
% don't know), so the registration numbers would be something like:
%
%       C0      control         ISO-IR-1
%       G0      graphic         ISO-IR-6
%       C1      control         ISO-IR-77
%       G1      graphic         ISO-IR-100
%
% These numbers may be wrong, but the point is that "Latin-1" (ISO 8859/1)
% can be construed to include the above *four* sets. So how
% would you write the header?
%
%       Content-Type: ISO-IR-1,6,77,100 ???
%
% This is why I'm suggesting a separate Content-Type for Latin-1.

Erik's example clearly shows why we need to specifically support not
just ISO 8859/1 ("LATIN-1") but also all of the other ISO 8859 family
of character sets.  It is much cleaner and easier to implement support
for a single ISO 8859 definition than to try to piece it together bit
by bit.  As noted above, the stock implementation of X11 on the machines
here is oriented towards the ISO 8859 family already.  Similarly, DEC
and HP terminals have been implementing ISO 8859/1 (more or less) for
some time.  The ISO 8859 family is rapidly replacing the various ISO
646 standards within North America and Europe.
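
  To make the four pieces in Erik's example concrete, here is a rough
sketch of how the 256 positions of an 8-bit code divide into the C0,
G0, C1 and G1 regions.  The function name code_region is made up for
illustration, and the boundaries assume a 96-character G1 set such as
the right-hand half of ISO 8859/1; SPACE and DELETE are reported
separately because they belong to neither control set nor to a
94-character G0 set.

```python
def code_region(byte: int) -> str:
    """Name the region of an 8-bit code that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    if byte <= 0x1F:
        return "C0"      # control characters, same as in US ASCII
    if byte == 0x20:
        return "SPACE"   # kept apart from the 94-character G0 set
    if byte <= 0x7E:
        return "G0"      # 7-bit graphic characters
    if byte == 0x7F:
        return "DEL"
    if byte <= 0x9F:
        return "C1"      # additional control characters
    return "G1"          # 8-bit graphic characters, 0xA0..0xFF


print([code_region(b) for b in (0x0A, 0x41, 0x85, 0xE9)])
```

A single name such as ISO-8859-1 fixes all four regions at once, which
is the point of preferring it over a list of separate registrations.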

% Now to come back to ISO 646. These national variants are used as 7-bit
% sets, with the "usual" set for C0. So my suggestion was to fix the C0
% part, and make the G0 part variable, the value given by the header:
% 
%       Content-Type: ISO-646-<registration number>

  Erik's suggestion seems to be very worthwhile and a practical
approach to the problem of supporting national variants during this
transition period to the ISO 8859 family and the ISO 10646 (once it is
finalised and approved).  Implementing in terms of the ISO 646
definitions is much more straightforward than trying to do so in
terms of ISO IRs.
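
  As a rough sketch of what "fix the C0 part and make the G0 part
variable" could look like in practice, the fragment below decodes
7-bit data using a per-variant table of G0 replacements.  The names
G0_OVERRIDES and decode_iso646 are hypothetical, and the table is
deliberately partial: it carries only the US ASCII reference set
(registration 6) and the pound sign that the United Kingdom variant
(registration 4) places at position 0x23.

```python
# Hypothetical table: registration number -> {7-bit position: replacement}.
# Deliberately partial; a real table would list every differing position.
G0_OVERRIDES = {
    6: {},                   # US ASCII, the reference set
    4: {0x23: "\u00a3"},     # United Kingdom: POUND SIGN replaces '#'
}


def decode_iso646(data: bytes, registration: int) -> str:
    """Decode 7-bit data, applying the G0 replacements of one variant."""
    overrides = G0_OVERRIDES[registration]
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError("ISO 646 variants are 7-bit codes")
        # Control codes and unlisted positions keep their ASCII meaning.
        out.append(overrides.get(b, chr(b)))
    return "".join(out)


print(decode_iso646(b"#5\r\n", 4) == "\u00a35\r\n")
```

The control codes never pass through the table's replacements, so CR
and LF mean the same thing whichever variant the header names.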

  I really have come to dislike the term "MAILASCII" as I've read
through the draft RFC.  It appears to just be the US ANSI definition
of ASCII (X3.4-1986) and it would be clearer and more in accord with
common usage amongst computer users to just call it "ASCII" or maybe
"US-ASCII".

  In accord with all of these comments, I would like to see the
definition of iso-charset-type modified roughly as shown below 
(please forgive any E-BNF errors):

  iso-charset-type := "ISO-8859-" 1*DIGIT /
                      "ISO-646-"  2*DIGIT 

  This could be extended to include the string "ISO-10646" as well
(once the DIS becomes an approved International Standard).  The
ISO-IR-* approach could also be kept IF it provides some specifically
needed ability that the proposal above does not.
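
  A minimal sketch of a recognizer for the iso-charset-type rule
proposed above follows.  It assumes the RFC 822 reading of the
notation, in which "1*DIGIT" means one or more digits and "2*DIGIT"
means two or more, and it matches the literal strings without regard
to case, as RFC 822 does for quoted literals.  The function name
parse_iso_charset_type is made up for illustration.

```python
import re


def parse_iso_charset_type(value: str):
    """Return (family, number) for a valid iso-charset-type, else None."""
    m = re.fullmatch(r"ISO-(8859|646)-([0-9]+)", value.strip(), re.IGNORECASE)
    if m is None:
        return None
    family, digits = m.group(1), m.group(2)
    # "ISO-8859-" 1*DIGIT needs one or more digits,
    # "ISO-646-"  2*DIGIT needs two or more.
    min_digits = 1 if family == "8859" else 2
    if len(digits) < min_digits:
        return None
    return family, int(digits)


print(parse_iso_charset_type("ISO-8859-1"))
print(parse_iso_charset_type("iso-646-04"))
print(parse_iso_charset_type("ISO-IR-100"))
```

Read this strictly, a single-digit form such as Erik's ISO-646-4 is
not accepted; it would have to be written with a leading zero, or the
rule relaxed to 1*DIGIT.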

  The text on the top of page 6 of the current draft should be modified to 
replace the references to ISO-IRs with text similar to:

  "Indicates that the document contains text encoded in the ISO standard
   character set indicated.  Each ISO standard character set in these
   families defines a new standard mail content type.  The ISO 8859
   family of character set standards defines multiple 8-bit character
   encodings to support different areas of the world.  For example,
   ISO 8859/1, which is also called "Latin-1", supports the languages in
   common use in Western Europe.  The ISO 646 family defines 7-bit national
   variants derived from US ASCII.  In any event, the character positions
   10, 13, and 32 (decimal) will always be interpreted as LF, CR, and SPACE,
   respectively."
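
  The rule about positions 10, 13 and 32 can be checked mechanically.
The sketch below uses the ISO 8859 codecs that ship with Python purely
as a stand-in for whatever tables an implementation would carry; the
helper name fixed_positions_ok and the choice of parts 1, 2, 5 and 7
are only for illustration.

```python
def fixed_positions_ok(codec_name: str) -> bool:
    """True if octets 10, 13 and 32 decode to LF, CR and SPACE."""
    return bytes([10, 13, 32]).decode(codec_name) == "\n\r "


for part in (1, 2, 5, 7):
    print(part, fixed_positions_ok(f"iso8859_{part}"))
```

Because every member of the family shares these positions, a mail
reader can find line boundaries before it knows which member is in use.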

  Finally, the UNICODE draft is a 16-bit character set and the ISO DIS
10646 is a 32-bit character set.  Most implementations supporting
Chinese or Japanese characters require at least 16 bits.  Even though
there might not be full support for either the DIS or the UNICODE
drafts now, there should be support for 16-bit and 32-bit character
sets in the RFC so that implementers can start working on adding the
needed underpinnings now.  Also, this would permit sites that agree to
support some Japanese or Chinese standard to get the data correctly
sent via mail.  Hence, the "Content-Encoding:" field should add the
types below to support 16-bit and 32-bit encoded character sets:
        "16bit"
        "32bit"

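  To illustrate what the "16bit" and "32bit" types would mean on the
wire, here is a rough sketch that packs character codes into
fixed-width units.  This note does not say which byte order should be
used, so the most-significant-octet-first order below is purely an
assumption, as are the function names.

```python
import struct

_FORMATS = {16: "H", 32: "I"}   # unsigned 16-bit and 32-bit integers


def pack_code_units(codes, width_bits):
    """Pack character codes as fixed-width units, high octet first (assumed)."""
    fmt = _FORMATS[width_bits]
    return struct.pack(f">{len(codes)}{fmt}", *codes)


def unpack_code_units(data, width_bits):
    """Recover the list of character codes from packed octets."""
    size = width_bits // 8
    if len(data) % size:
        raise ValueError("data is not a whole number of code units")
    fmt = _FORMATS[width_bits]
    return list(struct.unpack(f">{len(data) // size}{fmt}", data))


codes = [0x3042, 0x0041]
print(pack_code_units(codes, 16).hex())
print(pack_code_units(codes, 32).hex())
```

The same two codes occupy four octets as "16bit" and eight as "32bit",
so a receiver has to know the unit width before it can split the body
into characters at all.
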
  The discussion on compress and uuencode caught my attention.  Both
were considered as part of the IEEE POSIX.2a standards work and both
were dropped from the POSIX.2a drafts, in part because of the lack of
a concise programming language-independent definition of the widespread
implementations.  This might have a bearing on the discussion here,
though I have no opinion on whether either belongs or not (surprise :-).

I apologise for the length of this note and hope that it has been
reasonably clear and will contribute towards a practical solution
to the problems addressed by the draft RFC.

Randall Atkinson
randall(_at_)Virginia(_dot_)EDU