Character Set Detail Considered Harmful

An initial motivation for this working group's efforts was to
develop an enhancement to electronic mail which would comfortably 
support languages other than English, by extending the set of 
characters that could be transmitted.  As is often true with such 
projects, the bulk of the effort has been devoted to fundamental 
enhancement of the email content-structuring infrastructure, with 
the resulting, explicit support for multiple character sets being 
a relatively small part of the new specification.

But there has been significant discussion on the topic of 
character sets and, I believe, extremely useful education of the 
technical community about the issues.  Unfortunately, from the 
discussions, I have come to one conclusion that I find 
inescapable:  

     Support for international character sets is a messy 
     problem, for which there is no clear solution and 
     apparently no significant field experience even with 
     messy solutions.  

For example, there are multiple relevant international standards, 
and no clear basis for believing that any one of them dominates.  
Worse, there is even a standard which apparently only serves the 
purpose of, itself, selecting different character sets from other 
specifications, just as RFC XXXX's charset= attribute allows.

     By any reasonable measure, the topic of communicating 
     information in an environment that supports multiple 
     character sets MUST be viewed as exploratory and
     inadequately-understood.  In Internet parlance, I take 
     this to mean that any specification of character set 
     detail MUST be considered to be Experimental.

Electronic mail on the Internet requires a level of 
interoperability that is unique, since mail objects traverse a 
much larger space than the IP Internet.  Further, levels of 
implementation compliance tend to be poor.  Therefore, I feel that 
it is essential that there be a reasonably strong basis for 
believing that the Internet understands how to use multiple 
character sets.  

     I currently believe that we have no basis for such an 
     understanding.  Hence, I strongly doubt that mail with 
     multiple character sets is going to be interoperable, 
     except in private environments. 

One could argue that any and all new specifications obviously are 
poorly understood, since they are new.  But the Internet tends to 
distinguish between specifications which document ideas and 
procedures that either a) have considerable operational experience 
as the basis for the specification, or b) involve system behaviors 
which are sufficiently simple as to leave most readers of the 
specification with a high comfort level.  That is, specifications 
which the Internet moves to Proposed Standard, without prior 
testing, either contain little new technical content or contain 
technical content which is simple.  When neither of these 
conditions is met, the community tends to be uncomfortable with 
the specification, until it receives testing.

RFC XXXX contains a basic mechanism for labeling a content part as 
being in an arbitrary character set.  The charset= Content-type 
attribute is simple and straightforward and seems, to me, like an 
excellent framework for labeling different content as to different 
character sets.
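
As a minimal sketch of how little machinery the label itself 
requires, the following reads the charset= attribute from a 
message header using Python's standard email package.  The sample 
messages, the function name declared_charset, and the label 
X-HYPOTHETICAL are illustrative assumptions, not text from RFC 
XXXX.  The us-ascii fallback is likewise an assumption of this 
sketch.

```python
from email import message_from_string

# Hypothetical sample message; the addresses and body are made up.
RAW = (
    "From: sender@example.com\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "body\n"
)


def declared_charset(raw_message, default="us-ascii"):
    """Return the charset label declared on a message, lowercased.

    The label is treated as an opaque token: nothing here needs to
    know how any particular character set is defined.
    """
    msg = message_from_string(raw_message)
    # get_content_charset() returns failobj when there is no
    # Content-Type header or no charset parameter on it.
    return msg.get_content_charset(failobj=default)


print(declared_charset(RAW))
print(declared_charset("Subject: hi\n\nbody\n"))
print(declared_charset(
    "Content-Type: text/plain; charset=X-HYPOTHETICAL\n\nbody\n"
))
```

Note that the unknown label X-HYPOTHETICAL is carried through as 
readily as a well-known one.  The labeling framework does not 
depend on the definition of any specific character set.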

However, RFC XXXX also contains significant reference to details 
about SPECIFIC character set specifications.  I believe that 
virtually all such references should be removed, since they refer 
to specifications which apparently have little or no concrete 
experience and about which there is no general, strong community 
sense of comfort.  In other words, there is no reason to believe 
that the character set mechanisms that are cited will be 
sufficient or will be used in the real world, in spite of the fact 
that some of the citations are for documents on the international 
standards track.

The Internet has been very conservative in its adoption of 
specifications which are poorly understood, lacking field 
experience, and likely to take a long time to stabilize and 
mature.  I believe that it is essential that RFC XXXX continue 
this tradition, by retaining the specification for the charset= 
attribute, but removing essentially all references to specific 
character set definitions.  

     I believe that RFC XXXX, with such modifications, would 
     then contain technical specifications which are 
     generally well-understood and relatively simple (and 
     powerful) thereby making it entirely appropriate for
     entering the standards track immediately.  I believe 
     that retention of the character set detail will render 
     the document appropriate for Experimental, non-standards 
     track status.

RFC XXXX specifies mechanisms which will result in a very 
substantial improvement in the capabilities of Internet mail.  In 
my opinion, it generally specifies those mechanisms remarkably 
well, while simultaneously juggling requirements to a) disrupt the 
current Internet mail installed base as little as possible, b) 
provide a rich set of new functions, and c) keep the new functions 
simple, easy to understand, and safe.  The issue of character set 
detail is the one place in which I think RFC XXXX leaves itself 
seriously exposed to dangerous misunderstanding and misuse.



The rest of this note discusses RFC XXXX details:

In section 7.1.1, The charset parameter, the text contains an 
italicized note which begins "Beyond US-ASCII..." and offers a 
view of engineering preference, as well as stating a belief about 
the long-term outcome.  It includes the sentence "This future ISO 
10646 standard will probably provide the best means for universal 
text representation."  The next paragraph acknowledges that the 
spec is not complete.  It is my understanding that that area of 
work is very much in flux.  It therefore seems, to me, 
unreasonable to anchor RFC XXXX to that specification.  When 10646 
gets enough experience and demonstrates its leadership position, 
then the Internet can specify its use.  At the moment, however, 
the field still appears to be open.

I should note that it is legal for Internet specifications at the 
Proposed Standard level to cite documents from other standards 
groups which are not yet full standards, in their own community.  
However, the Internet document may not advance to full Internet 
Standard until the cited documents have reached full standard 
status in their own community, so that RFC XXXX progress could be 
delayed by this dependency.

The rest of section 7.1.1 goes into detail about specific, 
character-set related specifications, including ISO-8859-X, 
ISO-2022-jp, ISO-10646, and MNEMONIC.  10646 apparently is at DIS 
level. 2022 is a full standard, but is only a means of switching 
to character sets rather than, itself, specifying a character set.  
The status of 8859 is not clear, from the References section of 
RFC XXXX.  And MNEMONIC is a brand new spec, from the Internet 
community.

For one thing, the mere presence of such a large set of 
alternatives ought to give one pause and further ought to suggest 
that no specification should tie itself to any of these documents, 
individually or collectively.  RFC XXXX should let the character 
set area progress at its own pace and should wait for its dynamics 
to settle down.

The discussion of ISO-2022-jp includes "It appears necessary to 
explicitly specify the ISO-2022 methods that will be permitted in 
text mail so as to avoid the need for private agreements about, 
e.g., the specific character sets being used in message.  IT IS 
EXPECTED THAT THOSE INTERESTED IN ISO-2022 MAIL WILL DEVISE AND 
PUBLISH SUCH A SPECIFICATION IN THE FUTURE."  (emphasis mine.)

     In other words, ISO-2022 is not yet usable.  

Discussion of ISO-10646 and MNEMONIC is prefaced with the 
statement "The use of the following... is expected to be defined 
by forthcoming documents."

     In other words, use of 10646 and MNEMONIC is, at this 
     point, purely speculative.

The last paragraph of section 7.1.1 does attempt to give some 
guidance that is designed to increase interoperability.  It 
advises senders to use the "lowest common denominator" character 
set.  While it then also provides an example, RFC XXXX contains no 
detailed specification of determining the lowest common 
denominator.  And, at this point, I claim that it would be 
impossible for it to contain such detail, since I believe that the 
basis for making such a judgement is not yet understood.

Appendix F contains detail about current Japanese use of 2022, but 
it also states that it expects to be superseded by a more 
formal specification.  The fact that this appendix is only 
informational, refers only to use by a specific community, and is 
expected to be replaced (soon?) strongly suggests that it is not 
appropriate content for a standards specification.