An initial motivation for this working group's efforts was to
develop an enhancement to electronic mail which would comfortably
support languages other than English, by extending the set of
characters that could be transmitted. As is often true with such
projects, the bulk of the effort has been devoted to fundamental
enhancement of the email content-structuring infrastructure, with
the resulting, explicit support for multiple character sets being
a relatively small part of the new specification.
But there has been significant discussion on the topic of
character sets and, I believe, extremely useful education of the
technical community about the issues. Unfortunately, from these
discussions, I have come to one conclusion that I find
inescapable:
Support for international character sets is a messy
problem, for which there is no clear solution and
apparently no significant field experience even with
messy solutions.
For example, there are multiple relevant international standards,
and no clear basis for believing that any one of them dominates.
Worse, there is even a standard which apparently serves only the
purpose of selecting among character sets from other
specifications, just as RFC XXXX's charset= attribute allows.
By any reasonable measure, the topic of communicating
information in an environment that supports multiple
character sets MUST be viewed as exploratory and
inadequately-understood. In Internet parlance, I take
this to mean that any specification of character set
detail MUST be considered to be Experimental.
Electronic mail on the Internet requires a level of
interoperability that is unique, since mail objects traverse a
much larger space than the IP Internet. Further, levels of
implementation compliance tend to be poor. Therefore, I feel that
it is essential that there be a reasonably strong basis for
believing that the Internet understands how to use multiple
character sets.
I currently believe that we have no basis for such an
understanding. Hence, I strongly doubt that mail with
multiple character sets is going to be interoperable,
except in private environments.
One could argue that any and all new specifications obviously are
poorly understood, since they are new. But the Internet tends to
distinguish between specifications which document ideas and
procedures that either a) have considerable operational experience
as the basis for the specification, or b) involve system behaviors
which are sufficiently simple as to leave most readers of the
specification with a high comfort level. That is, specifications
which the Internet moves to Proposed Standard, without prior
testing, either contain little new technical content or contain
technical content which is simple. When neither of these
conditions is met, the community tends to be uncomfortable with
the specification, until it receives testing.
RFC XXXX contains a basic mechanism for labeling a content part as
being in an arbitrary character set. The charset= Content-type
attribute is simple and straightforward and seems, to me, like an
excellent framework for labeling different content as to different
character sets.
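As an illustration of how little machinery the labeling mechanism
itself needs, the following is a minimal sketch in Python of reading
the charset= attribute from a Content-type header. The function name
charset_label and the US-ASCII fallback are assumptions made for this
example and are not taken from RFC XXXX; the parsing relies on
get_param from the Python standard library's email.message module.

```python
from email.message import Message


def charset_label(content_type_header: str, default: str = "us-ascii") -> str:
    """Return the charset= label of a Content-type header value.

    `default` is an assumption for this sketch: it is what we report
    when the header carries no charset= attribute at all.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    value = msg.get_param("charset", failobj=default)
    # get_param may return a (charset, language, value) tuple for
    # parameters written in the extended encoded form; keep the value.
    if isinstance(value, tuple):
        value = value[2]
    # Labels are compared case-insensitively here, so normalize.
    return str(value).lower()


if __name__ == "__main__":
    print(charset_label('text/plain; charset="ISO-8859-1"'))
    print(charset_label("text/plain"))
```

Note that the sketch only reads the label. It says nothing about
whether the receiver can actually display the named character set,
which is the part of the problem that remains open.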
However, RFC XXXX also contains significant reference to details
about SPECIFIC character set specifications. I believe that
virtually all such references should be removed, since they refer
to specifications which apparently have little or no concrete
experience and about which there is no general, strong community
sense of comfort. In other words, there is no reason to believe
that the character set mechanisms that are cited will be
sufficient or will be used in the real world, in spite of the fact
that some of the citations are for documents on the international
standards track.
The Internet has been very conservative in its adoption of
specifications which are poorly understood, lacking field
experience, and likely to take a long time to stabilize and
mature. I believe that it is essential that RFC XXXX continue
this tradition, by retaining the specification for the charset=
attribute, but removing essentially all references to specific
character set definitions.
I believe that RFC XXXX, with such modifications, would
then contain technical specifications which are
generally well-understood and relatively simple (and
powerful), thereby making it entirely appropriate for
entering the standards track immediately. I believe
that retention of the character set detail will render
the document appropriate for Experimental, non-standards
track status.
RFC XXXX specifies mechanisms which will result in a very
substantial improvement in the capabilities of Internet mail. In
my opinion, it generally specifies those mechanisms remarkably
well, while simultaneously juggling requirements to a) disrupt the
current Internet mail installed base as little as possible, b)
provide a rich set of new functions, and c) keep the new functions
simple, easy to understand, and safe. The issue of character set
detail is the one place in which I think RFC XXXX leaves itself
seriously exposed to dangerous misunderstanding and misuse.
The rest of this note discusses RFC XXXX details:
In section 7.1.1, The charset parameter, the text contains an
italicized note which begins "Beyond US-ASCII..." and offers a
view of engineering preference, as well as stating a belief about
the long-term outcome. It includes the sentence "This future ISO
10646 standard will probably provide the best means for universal
text representation." The next paragraph acknowledges that the
spec is not complete. It is my understanding that that area of
work is very much in flux. It therefore seems, to me,
unreasonable to anchor RFC XXXX to that specification. When 10646
gets enough experience and demonstrates its leadership position,
then the Internet can specify its use. At the moment, however,
the field still appears to be open.
I should note that it is legal for Internet specifications at the
Proposed Standard level to cite documents from other standards
groups which are not yet full standards in their own communities.
However, the Internet document may not advance to full Internet
Standard until the cited documents have reached full standard
status in their own community, so that RFC XXXX progress could be
delayed by this dependency.
The rest of section 7.1.1 goes into detail about specific,
character-set-related specifications, including ISO-8859-X,
ISO-2022-jp, ISO-10646, and MNEMONIC. 10646 apparently is at DIS
level. 2022 is a full standard, but is only a means of switching
among character sets rather than, itself, specifying a character set.
The status of 8859 is not clear from the References section of
RFC XXXX. And MNEMONIC is a brand new spec, from the Internet
community.
For one thing, the mere presence of such a large set of
alternatives ought to give one pause and further ought to suggest
that no specification should tie itself to any of these documents,
individually or collectively. RFC XXXX should let the character
set area progress at its own pace and should wait for its dynamics
to settle down.
The discussion of ISO-2022-jp includes "It appears necessary to
explicitly specify the ISO-2022 methods that will be permitted in
text mail so as to avoid the need for private agreements about,
e.g., the specific character sets being used in message. IT IS
EXPECTED THAT THOSE INTERESTED IN ISO-2022 MAIL WILL DEVISE AND
PUBLISH SUCH A SPECIFICATION IN THE FUTURE." (emphasis mine.)
In other words, ISO-2022 is not yet usable.
Discussion of ISO-10646 and MNEMONIC is prefaced with the
statement "The use of the following... is expected to be defined
by forthcoming documents."
In other words, use of 10646 and MNEMONIC is, at this
point, purely speculative.
The last paragraph of section 7.1.1 does attempt to give some
guidance that is designed to increase interoperability. It
advises senders to use the "lowest common denominator" character
set. While it then also provides an example, RFC XXXX contains no
detailed specification of how to determine the lowest common
denominator. And, at this point, I claim that it would be
impossible for it to contain such detail, since I believe that the
basis for making such a judgement is not yet understood.
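To make the difficulty concrete, here is one naive reading of the
"lowest common denominator" advice, sketched in Python. Everything in
it is an assumption for illustration: the function name, the candidate
list, and above all the ordering of that list, which is exactly the
judgement that RFC XXXX does not specify. It is not offered as the
rule the document intends.

```python
from typing import Optional, Sequence

# Hypothetical preference order, from "most widely understood" to
# "least". Choosing this order is the unresolved judgement call.
CANDIDATES: Sequence[str] = ("us-ascii", "iso-8859-1", "iso-2022-jp")


def lowest_common_denominator(
    text: str, candidates: Sequence[str] = CANDIDATES
) -> Optional[str]:
    """Return the first candidate charset able to represent `text`.

    Returns None when no candidate can represent the text.
    """
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the text
        return name
    return None


if __name__ == "__main__":
    for sample in ("hello", "caf\u00e9"):
        print(repr(sample), "->", lowest_common_denominator(sample))
```

Under this reading, plain English text is labeled us-ascii and text
with accented Latin letters is labeled iso-8859-1. A different
ordering of the candidates gives different answers for the same text,
which illustrates why a detailed rule is hard to state.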
Appendix F contains detail about current Japanese use of 2022, but
it also states that it expects to be superseded by a more
formal specification. The fact that this appendix is only
informational, refers only to use by a specific community, and is
expected to be replaced (soon?) strongly suggests that it is not
appropriate content for a standards specification.