draft-hoffman-utf8-rfcs-03.txt

Hi.

A few brief comments on this document....

(1) A mailing list for discussion is not designated.  While I
would normally suggest rfc-interest, the document appears to
be written on the assumption that approval of this proposal
rests with the IETF (presumably via the IESG) and IAOC and not
with the RFC Editor (with presumed review by the IAB), so I am
sending to this list.

(2) The document seems to assume that availability of UTF-8
systems (or other systems based on Unicode with easy
transcoding) is now near-ubiquitous.  Actual experience,
especially with documents being transmitted between computers by
email and similar means, appears to be different.  While I look
forward to the day at which comprehensive UTF-8 support is
universally available, at least as an interchange format, I do
not believe that we are there yet.  There is still a
considerable gap, especially among systems that, instead of
being ASCII-only, have been developed with a focus on either
ISO 8859-1 or a national coding system in East Asia.

(3) The document indicates that display systems that cannot
properly handle UTF-8 usually display an incorrect character
from which the user can make inferences.   While that sometimes
happens --sometimes with considerable information loss as we
have seen with a common anomaly with quoted sections of email
when the system on which the response is being composed and
those on both sides do not all prefer UTF-8-- it is at least
equally common to see an "undisplayable character" indication
which is the same for all such characters, e.g., a small box or
question-mark.  The problem is less likely with RFCs than with
random email, but we do, often, quote from RFCs and I-Ds in
email messages while working on them.  Once those "undisplayable
character" indicators are transferred from one system to
another, information is irretrievably lost... finding "better
display software" (rarely a realistic choice) is not an option
for recovering that information.
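
To make the distinction in (3) concrete, here is a minimal Python
sketch.  The sample name is hypothetical and not taken from the
draft, and the "?" placeholder stands in for whatever box or
question-mark a given system uses.  It contrasts bytes that are
merely misread (and so can be recovered) with text in which a
single placeholder has already been substituted (and so cannot).

```python
# Hypothetical sample text, chosen only for illustration.
original = "Jürgen Müller, Zürich"
utf8_bytes = original.encode("utf-8")

# Case 1: a system that assumes ISO 8859-1 misreads the UTF-8 bytes.
# The result looks wrong, but every byte is still present, so the
# original text can be recovered by reversing the misinterpretation.
misread = utf8_bytes.decode("latin-1")
recovered = misread.encode("latin-1").decode("utf-8")

# Case 2: a system substitutes one placeholder for every character it
# cannot represent.  Once this text is quoted and forwarded, nothing
# distinguishes one substituted character from another.
placeholder = original.encode("ascii", errors="replace").decode("ascii")

# Two different names collapse to the same string after substitution.
collapsed_a = "Jürgen".encode("ascii", errors="replace")
collapsed_b = "Jörgen".encode("ascii", errors="replace")

print(misread)
print(recovered == original)
print(placeholder)
print(collapsed_a == collapsed_b)
```

In the second case no choice of display software on the receiving
side can tell the two names apart, which is the loss described
above.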

(4) Permitting critical information in RFCs (including any
information that is considered normative and author contact
information) to be exclusively in non-ASCII UTF-8 creates the
possibility that a would-be implementer may not be able to
interpret the document or that it will be impossible to contact
the author(s), especially if, as an anti-spam precaution,
authors supply postal addresses and not email ones.

(5) I think we could quibble at great length about the advice
that should be given about compatibility characters.  While it
is probably sensible to discourage their use, it is quite easy
to imagine cases in which they might be important if a string
is to be represented correctly.  Those cases
specifically include correct spelling of author names in some
parts of the world and examples that, for one reason or another,
actually have to illustrate the role of those characters... and
author names and examples appear to be the main justifications
for this proposal.  On the other hand, as the authors point out,
the issues with input methods and display of compatibility
characters are often much more serious than they are with their
equivalents, especially when display routines start performing
character substitutions for characters for which they lack
precise and accurate display capability.


I suggest that the authors concentrate less on painting a rosy
picture of how widely UTF-8 is deployed and how easily the
problems can be overcome (e.g., "just get better display
software [, even if that requires replacing hardware and
operating systems ]"), and, instead, concentrate on a definition
that would provide reasonable and effective fallbacks when
things go wrong as, at least for the present, they certainly
will.  For example, permitting UTF-8 (with arbitrary non-ASCII
characters) by itself in contact information is not sensible for
the reasons given above, but permitting UTF-8 only with a
requirement for either ASCII transliteration (or equivalent) or
RFC 5137 encoding to be present as an alternative might be
perfectly sensible at the current level of UTF-8 deployment and
availability.  Similar comments would apply to references,
especially normative ones (the principle that the IETF operates
in English and that English, and only English, is needed to
understand its technical specifications goes well beyond the
question of UTF-8 in RFCs and this document does not appear to
intend to change it), and to at least some examples that are
necessary to understand the normative text.
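
As a sketch of what such a fallback could look like for contact
information, the Python below pairs a UTF-8 string with an
ASCII-only alternative.  It assumes the backslash-u-with-delimiters
form, \u'XXXX', as the RFC 5137 encoding; the function names and
the "name (escaped form)" layout are hypothetical and are not
proposed by the draft.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a \\u'XXXX' escape.

    Assumption: the escape uses 4 to 6 uppercase hex digits of the
    code point.  Backslashes already in the input are left alone.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("\\u'{:04X}'".format(ord(ch)))
    return "".join(out)


def with_ascii_fallback(text: str) -> str:
    """Return the text followed by its ASCII-only form when they differ."""
    escaped = escape_non_ascii(text)
    if escaped == text:
        return text
    return "{} ({})".format(text, escaped)


# Hypothetical contact lines, for illustration only.
print(with_ascii_fallback("Zürich"))
print(with_ascii_fallback("Boston"))
```

A reader whose system cannot display the first form still has the
second, and the escapes can be converted back to the original
characters mechanically.  A human-chosen ASCII transliteration
could be carried in the same position instead.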

    --john
