Re: Character set Detail Considered Harmful

An initial motivation for this working group's efforts was to
develop an enhancement to electronic mail which would comfortably 
support languages other than English, by extending the set of 
characters that could be transmitted.  As is often true with such 
projects, the bulk of the effort has been devoted to fundamental 
enhancement of the email content-structuring infrastructure, with 
the resulting, explicit support for multiple character sets being 
a relatively small part of the new specification.


Yes, support for multiple character sets was the initial
goal of this effort, and we have reached consensus on it.
Why do you throw a bomb like this into the debate at the
last minute?

But there has been significant discussion on the topic of 
character sets and, I believe, extremely useful education of the 
technical community about the issues.  Unfortunately from the 
discussions, I have come to one conclusion that I find 
inescapable:  

     Support for international character sets is a messy 
     problem, for which there is no clear solution and 
     apparently no significant field experience even with 
     messy solutions.  

For example, there are multiple relevant international standards, 
and no clear basis for believing that any one of them dominates.  
Worse, there even is a standard which apparently only serves the 
purpose of, itself, selecting different character sets from other 
specifications, just as RFC XXXX's charset= attribute allows.


Well, this is what character sets are all about. They cover
different languages of the world - because the world has different
languages, and different characters are used for those languages.
What you are saying is that the world is messy, and we should not try
to accommodate it.

I would agree with you that the world is a mess, but some order has
been brought to its use of character sets. This is done by ISO
through the ECMA registry, and the issue of different character sets
is well defined in a full ISO standard, ISO 2375.
It may be confusing to you, perhaps because of the volume of it, but it
is still well defined. It is not confusing to me and a lot of other people.
I, in turn, am confused about video formats etc., but I realize that I
am not an expert on video formats, so I let the experts write
the specifications on that subject.

As to experience with multiple character sets: there has been
considerable experience with this in Europe within EUnet, for
more than a full year. So this has more experience behind
it than the rest of RFC-XXXX, where the specs are only about to be
implemented. To me it sounds like you are saying that the mechanisms
in the new RFC-XXXX have not been tested, so we cannot make it an
RFC. But this will be the case with any new RFC.

One idea to remedy this situation is to first publish RFC-XXXX
as an Experimental RFC, and then, based on the experience with it,
promote it to Proposed Standard and Internet Standard in due time.

     By any reasonable measure, the topic of communicating 
     information in an environment that supports multiple 
     character sets MUST be viewed as exploratory and
     inadequately-understood.  In Internet parlance, I take 
     this to mean that any specification of character set 
     detail MUST be considered to be Experimental.

Electronic mail on the Internet requires a level of 
interoperability that is unique, since mail objects traverse a 
much larger space than the IP Internet.  Further, levels of 
implementation compliance tend to be poor.  Therefore, I feel that 
it is essential that there be a reasonably strong basis for 
believing that the Internet understands how to use multiple 
character sets.


The same can be said for the other new concepts in RFC-XXXX.


However, RFC XXXX also contains significant reference to details 
about SPECIFIC character set specifications.  I believe that 
virtually all such references should be removed, since they refer 
to specifications which apparently have little or no concrete 
experience and about which there is no general, strong community 
sense of comfort.  In other words, there is no reason to believe 
that the character set mechanisms that are cited will be 
sufficient or will be used in the real world, in spite of the fact 
that some of the citations are for documents on the international 
standards track.


I think you are using quite strong wording here without having
sufficient backing information. It is the consensus of this
list and the WG that the charsets specified in RFC-XXXX are adequate
for use in Internet mail. Why have you not expressed
your concerns earlier? Where do you see the problems?


The rest of this note discusses RFC XXXX details:

In section 7.1.1, The charset parameter, the text contains an 
italicized note which begins "Beyond US-ASCII..." and offers a 
view of engineering preference, as well as stating a belief about 
the long-term outcome.  It includes the sentence "This future ISO 
10646 standard will probably provide the best means for universal 
text representation."  The next paragraph acknowledges that the 
spec is not complete.  It is my understanding that that area of 
work is very much in flux.  It therefore seems, to me, 
unreasonable to anchor RFC XXXX to that specification.  When 10646 
gets enough experience and demonstrates its leadership position, 
then the Internet can specify its use.  At the moment, however, 
the field still appears to be open.


ISO 10646 is indeed not yet ready, and care should be taken
not to place too much emphasis on it.

The rest of section 7.1.1 goes into detail about specific, 
character-set related specifications, including ISO-8859-X, 
ISO-2022-jp, ISO-10646, and MNEMONIC.  10646 apparently is at DIS 
level. 2022 is a full standard, but is only a means of switching 
to character sets rather than, itself, specifying a character set.  
The status of 8859 is not clear, from the References section of 
RFC XXXX.  And MNEMONIC is a brand new spec, from the Internet 
community.


The ISO 8859 standards are full ISO standards. Please do not state
your own misguided opinions as facts.

For one thing, the mere presence of such a large set of 
alternatives ought to give one pause and further ought to suggest 
that no specification should tie itself to any of these documents, 
individually or collectively.  RFC XXXX should let the character 
set area progress at its own pace and should wait for its dynamics 
to settle down.


There are full ISO standards on character sets and that means that
they have been brought to a completion that is more settled than
any Internet standard will ever be. Come on! The character sets are well
defined and it should be possible to handle them in an Internet RFC.
I would say that not being able to handle more than ASCII in RFC-XXXX
in a well-defined way will be a SHOW STOPPER for me and
probably for most other Europeans.

The discussion of ISO-2022jp includes "It appears necessary to 
explicitly specify the ISO-2022 methods that will be permitted in 
text mail so as to avoid the need for private agreements about, 
e.g., the specific character sets being used in message.  IT IS 
EXPECTED THAT THOSE INTERESTED IN ISO-2022 MAIL WILL DEVISE AND 
PUBLISH SUCH A SPECIFICATION IN THE FUTURE."  (emphasis mine.)

     In other words, ISO-2022 is not yet usable.


The definition of ISO-2022-JP is, as far as I can see, complete;
it defines the usage unambiguously, and there is consensus in this
WG that the specs are OK.
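
As a rough illustration of what ISO-2022-JP switching looks like on
the wire, here is a minimal sketch in Python. It assumes that the
"iso2022_jp" codec in Python's standard library is a fair stand-in for
the encoding referenced in RFC-XXXX; the sample text and the helper
name encode_iso2022jp are made up for illustration and are not taken
from the draft.

```python
# Sketch only: shows the escape sequences that ISO-2022-JP uses to
# switch between ASCII and JIS X 0208. Relies on Python's built-in
# "iso2022_jp" codec; the sample text below is arbitrary.

ESC_TO_JIS_X_0208 = b"\x1b$B"  # ESC $ B: switch to JIS X 0208-1983
ESC_TO_ASCII = b"\x1b(B"       # ESC ( B: switch back to ASCII


def encode_iso2022jp(text: str) -> bytes:
    """Hypothetical helper: encode text as ISO-2022-JP bytes."""
    return text.encode("iso2022_jp")


sample = "abc 日本語 xyz"
encoded = encode_iso2022jp(sample)

# The ASCII parts pass through unchanged; the Japanese part is
# bracketed by the two escape sequences defined above.
print(encoded)

# Every byte stays below 0x80, so the result fits 7-bit mail transport.
print(all(b < 0x80 for b in encoded))

# Decoding with the same codec recovers the original text.
print(encoded.decode("iso2022_jp") == sample)
```

The point is only that the escape sequences mark each switch
explicitly, so a receiver that knows the permitted set of sequences
needs no private agreement to decode the message.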

Discussion of ISO-10646 and MNEMONIC is prefaced with the 
statement "The use of the following... is expected to be defined 
by forthcoming documents."

     In other words, use of 10646 and MNEMONIC is, at this 
     point, purely speculative.


MNEMONIC is defined in RFC-CHAR, available as
internet-drafts-822ext-charsets-01.txt from your local
internet-drafts provider. It has been around since July this
year, and the new draft is very compatible with the old draft,
mostly reflecting decisions of the WG in Santa Fe.
The statement about "forthcoming documents" was misleading for MNEMONIC.

Appendix F contains detail about current Japanese use of 2022, but 
it also states that it expects to be superseded by a more 
formal specification.  The fact that this appendix is only 
informational, refers only to use by a specific community, and is 
expected to be replaced (soon?) strongly suggests that this is not 
appropriate content for a standards specification.


I think this is a problem with the wording in RFC-XXXX.
I believe the specification of ISO-2022-JP is formal enough
for Internet usage. I would rather use the reference in RFC-CHAR,
which is a document on charsets, and have more-or-less the same
wording on ISO-2022-JP.

Keld