Re: Character set Detail Considered Harmful

1991-12-23 12:58:51

This is in response to the notes from Vaudreuil, Borenstein, Klensin and 
Simonsen who were, in turn, responding to my original note.  I've tried 
to summarize their set of reactions, but generally without attributing 
them to individuals.

1.  Challenge to the timing of this note & claim of no earlier comments 
from me:

This was not the first time I've raised this concern.  I have been 
raising this concern for some months, both publicly and privately.  I 
decided to raise it rather more forcefully, now, because we are at a 
point-of-no-return and this is the one item that I see as having the 
potential of seriously injuring the success of RFC XXXX.  I believe that 
a lack of understanding of character set issues will render 
uses of XXXX non-interoperable in cases that should interoperate.

While it probably won't make anyone feel any better, I really am not 
happy about raising the issue and suspect that I know just the kind of 
frustration this sort of note can engender.  I am hoping that we can 
understand the basis for the concern and find a workable resolution to it.

2.  Only US ASCII is needed, or we should wait to standardize any other 
character sets:

These were remarkable mis-interpretations of my note, since I said 
exactly the opposite.  I think that the character set efforts should 
proceed aggressively and I think that the Internet should be rather 
embarrassed that we have not attended to this topic sooner.  However, 
the complexity of the topic dictates that it be treated separately from 
an email format standard, though I believe that it is essential that the 
email standard contain the appropriate hook for accessing this other 
work.  (And I believe charset= is adequate to that end.)  This does not 
have to introduce any delay at all.

3.  The topic IS messy and should be left to the experts:

Well, I agree completely.  In fact, that is the major reason I think it 
should be handled separately from an email format standard.  I am trying 
to get an email spec, namely XXXX, out of the business of discussing 
character set detail, unless it wishes to provide discussion about 
translating between character sets, which it currently does not do.

4.  The topic is quite stable and well-established standards already 
exist:

This item conflicts somewhat with the previous item, which is exactly 
the conflict that I read in the set of notes from Vaudreuil, et al.  They 
seem to have some disparity among themselves, about the breadth and 
depth of existing experience.  However, there seems to be some common 
thread, through their notes, which suggests that a subset of the 
documents cited in RFC XXXX really are well-established, have 
significant field experience, and are well understood.  Assuming 
this is true, that is great.  It may even make them appropriate to cite 
within RFC XXXX, though I claim it isn't necessary.  Registering these 
character sets with IANA is all that is truly required.

From the collection of responses, it does appear, however, that XXXX has 
some citations specified incompletely and/or is citing some 
specifications which are quite unstable.  At the very minimum, I believe 
that all such citations should be removed, since they only serve to pass 
their instability on to XXXX.

From the collection of notes, it sounds as if:

8859 has exactly one very-well established part (part 1); does this 
overlap with ASCII?  If so, how are users of each to interoperate?  
Klensin indicates that translation behavior is well-established, but I 
see no indication of any such documentation in XXXX, to give guidance to 
implementors.  How are implementors to know what to do with mail that is 
in a different character set than they display (but which they could 
translate from, if only they knew how)?  It also sounds as if the 
citation for 8859 may need tightening.  Some notes thought that I was 
claiming that 8859 had an uncertain status or that I was otherwise 
misrepresenting 8859; I was merely noting the lack of that information 
in XXXX.  

2022jp is claimed to have a solid user base.  That is fine, but I 
believe that documenting 2022jp details within XXXX is entirely 
inappropriate.  Worse, it sounds as if getting those details correct is 
difficult.  However, I suspect that having a Japanese-only version 
of the spec is workable, assuming that appropriately knowledgeable 
persons can speak to the IETF/IESG/IAB and convince us of the 
specification's stability and experience.  However, I'm unclear why we 
would want to specify a regional convention, within XXXX, rather than 
merely citing it, via IANA.  We don't have that kind of detail about 
ulaw encoding of audio in the spec.

10646 apparently is every bit as unstable as I had thought.

MNEMONIC is acknowledged to be new and, therefore, untested, as is 
RFC-CHAR.  Some of the responses to me suggested that I was critical of 
them.  I am not.  Actually, I think they are quite good efforts, 
within the limits of my ability to judge this topic.  Rather, my point 
is that they are working within an area that clearly is taking a long 
time to settle down and, therefore, I think that XXXX should detach 
itself from the details of that entire realm, except for the charset= 
hook, and a pointer to IANA registration of specs.
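
On the 8859 overlap question raised above, one piece can be checked mechanically: the lower 128 code positions of ISO 8859 part 1 coincide with US ASCII. The following is only a minimal sketch, assuming Python's built-in "ascii" and "latin-1" codecs as stand-ins for the two character sets; the function names are hypothetical and nothing here comes from XXXX itself.

```python
def ascii_subset_of_latin1() -> bool:
    """Return True if every 7-bit byte decodes identically under both codecs."""
    return all(
        bytes([b]).decode("ascii") == bytes([b]).decode("latin-1")
        for b in range(128)
    )


def relabel_as_ascii(data_8859_1: bytes) -> bytes:
    """Re-label 8859-1 data as US ASCII only when no byte has the high bit set.

    Hypothetical helper: it refuses, rather than guesses, when the data
    uses characters that US ASCII cannot represent.
    """
    if all(b < 0x80 for b in data_8859_1):
        return data_8859_1  # same bytes are already valid US ASCII
    raise ValueError("data uses 8859-1 characters outside US ASCII")
```

The sketch covers only the trivial direction. What a receiver should do with the upper 128 positions when it can display only ASCII is exactly the guidance that would need to be written down somewhere.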

Some of the references to multiple, interoperable implementations 
surprised me, since I don't recall having seen email about email-based 
use of these character sets.  I would appreciate hearing more (or at 
least receiving copies of the previous group discussion about it; sorry 
I missed it.)

5.  International standards already exist, so the Internet should just 
adopt them:

Sorry, but no.  The Internet is quite selective in its use of 
specifications, including those from outside the Internet.  The fact 
that a spec is a standard from another international body IS quite 
important, but does not guarantee use within the Internet.  Worse, it is 
quite clear that the character set topic is in considerable flux.  It 
is not simply the case that there are multiple specs because there are 
multiple real-world character sets; it appears that there are also 
multiple specs which cover the same territory.  That is, specs which 
overlap.  This is an invitation to interoperability problems and we 
should not ignore the potential.

6.  RFC XXXX must cite some or all of the character set specifications, 
or else there will be no support for multiple character sets:

I believe that this is a technically incorrect assessment of the results 
of following my suggestion.  RFC XXXX has many places in which it allows 
extension to various lists, via IANA.  Audio, for example, cites only 
one spec, but leaves the door open for more.  I also should note that 
the debate over the citation for ulaw ended up making things absolutely 
as simple as possible, even removing the ability to specify options.  I 
am merely suggesting that we keep the same philosophy for charset.

In any event, having XXXX point to IANA, for the list of authorized 
character sets, is entirely sufficient.  I do not understand the 
assertion that the failure to include the details in XXXX somehow 
cripples XXXX.  It doesn't.
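
To illustrate why the hook alone can be sufficient, here is a minimal sketch of a receiver that reads only the charset= label and consults a separate table for everything else. The table REGISTERED_CHARSETS is a hypothetical stand-in for the IANA list, its two entries and their mapping to Python codec names are assumptions, and decode_body is an illustrative name rather than anything defined by XXXX.

```python
from email.message import Message

# Hypothetical stand-in for the IANA list of registered character sets.
# The labels and their mapping to Python codec names are assumptions.
REGISTERED_CHARSETS = {
    "us-ascii": "ascii",
    "iso-8859-1": "latin-1",
}


def decode_body(content_type: str, body: bytes) -> str:
    """Decode body using the charset= parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns the parameter value, or the default when it is absent.
    label = str(msg.get_param("charset", "us-ascii")).lower()
    codec = REGISTERED_CHARSETS.get(label)
    if codec is None:
        raise LookupError(f"charset {label!r} is not in the registry")
    return body.decode(codec)
```

Adding a character set in this sketch means adding one entry to the table. The code that parses the message format does not change, which is the separation argued for here.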

7.  Use of "X-" labelled charsets is inappropriate:

While I don't agree with the severity of this response, I think I erred 
in referencing only X- labels.  Klensin thinks that X- means 
experimental; I believe it merely means "private" and that is the error:  
I believe that IANA can register names for specs which are published but 
not yet standardized.  Hence, the X- needn't be used; the details for 
the charset can be available; and it only is the standards status of 
each charset spec that would remain at issue.

The RFC Editor probably can clarify this procedural point.

8. ... Wrapping up...

The first Internet network management MIB specified only a very small, 
very simple set of variables.  Many, many more have been specified since 
then.  But the concern, initially, was to require only a minimum set, so 
that the focus could be on building the network management 
infrastructure, rather than on the details of all the network management 
information.  For example, it took about two additional years to get a 
reasonably complete set of MIB extensions for the common media.

I am suggesting that we take a similarly conservative approach for XXXX, 
since I see it as serving a similar role of establishing an 
infrastructure.  This can result in a free market for various 
extensions, including character sets, if it does not lock down an issue 
too quickly and if we can get the infrastructure stable.  The 
standardized use of varied character sets, in the Internet, appears to 
be a complex issue and needs an opportunity to gain experience.  XXXX 
provides the platform for gaining that experience, but it gets bogged 
down when it tries to state the details of specific character sets.  It 
shouldn't try. It doesn't need to.