Re: I-D file formats and internationalization
At 5:59 PM -0800 11/30/05, Douglas Otis wrote:
On Nov 30, 2005, at 2:23 PM, Paul Hoffman wrote:
At 1:54 PM -0800 11/30/05, Douglas Otis wrote:
Rather than opening RFCs to text utilizing any character-set
anywhere, as this draft suggests,
That is not what the RFC suggests at all. The character set is
Unicode. The encoding is UTF-8. That's it.
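As a small illustration of the distinction drawn above (character set versus encoding), the following sketch prints the Unicode code points of a short string and the bytes produced when it is encoded as UTF-8. The sample string is an arbitrary assumption, not something taken from the draft.

```python
# Illustrative sketch: the sample text is an arbitrary choice.
text = "caf\u00e9"  # "café", written with an escape so the source stays ASCII

# The character set: each character has one Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The encoding: UTF-8 turns those code points into bytes.
utf8_bytes = text.encode("utf-8")

print(code_points)                 # four code points
print(utf8_bytes)                  # five bytes, since U+00E9 needs two
print(len(text), len(utf8_bytes))
```

The ASCII characters keep their one-byte values under UTF-8, and only the non-ASCII character expands to more than one byte.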
Unicode provides a unique number for every possible character within
a current range of about 97,000 characters. These characters
include punctuation marks, diacritics, mathematical and technical
symbols, arrows, dingbats, etc. Displaying one of these characters
requires a character-set (synonymous with a display system's
font-set or character-repertoire), or using the Unicode vernacular,
a script. It is not just a matter of which character is displayed
or which character-repertoire is used; there are also Middle
Eastern right-to-left issues.
It may be better to use a single vocabulary for discussing things
such as internationalization and character sets. That's the purpose
of RFC 3536.
Being able to review the ID as it would appear as an RFC would
also seem to be a requirement.
That means changing the Internet Drafts process as well. Certainly
possible, but more daunting than changing one process at a time.
As an ID becomes an RFC, it seems expecting last minute changes to
the document would be even more daunting.
Yep, that's the tradeoff. We already make some automatic changes
after an Internet Draft is approved by the IESG, and we allow others
without IESG oversight. This would be another class. That scares some
people, and not others. Having Internet Drafts use Unicode in UTF-8
instead of ASCII scares some people, and not others.
It seems problematic for protocol examples to use non-ASCII
characters owing to there not being ubiquitously displayable
Unicode is universally displayable if you have the right font(s).
Regardless of that, however, any sane document author would not
assume that every person reading the document could display it.
They would put a legend or explanation near the example.
Assume such characters cannot be displayed, at least not with the
ASCII version that excludes the extended character-set allowed by
Unicode. An escape mechanism would be needed to accommodate
alternative text, where displaying '?' for the Unicode characters
that extend beyond ASCII would not be a very satisfactory solution,
as this would make the ASCII version less authoritative, to say the
least, and break the way many use the RFC text files.
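The concern about '?' substitution can be sketched as follows. This assumes a hypothetical conversion step that forces text into ASCII with Python's standard "replace" error handler; it is not a description of how any RFC tooling actually works.

```python
# Illustrative sketch: the sample text and the conversion step are assumptions.
original = "na\u00efve r\u00e9sum\u00e9"  # "naïve résumé"

# Force the text into ASCII, substituting '?' for anything outside ASCII.
ascii_lossy = original.encode("ascii", errors="replace").decode("ascii")

print(ascii_lossy)
print(ascii_lossy == original)  # the original characters cannot be recovered
```

Each non-ASCII character collapses to the same '?', so the ASCII rendering no longer carries the information needed to reconstruct the original text.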
No escape mechanism is needed. Non-displayable characters are still
in the RFC, they simply can't be displayed by everyone (but they can
be displayed by many). This is infinitely simpler, and a much better
long-term solution, than "an escape mechanism". Further, there would
be no more "ASCII version" to be authoritative. The Internet Draft
clearly says that there is a single RFC, and it has a single encoding.
I liked the idea that Frank suggested, use the HTML escape
sequence to declare the Unicode character. This allows the ASCII
version to remain authoritative.
... as well as unreadable and unsearchable using normal search
mechanisms. The purpose of the proposal is to allow RFCs to be
readable and searchable using the encoding that is common on the
Internet, without resorting to sorta-kinda-HTML or an "escape
mechanism". Remaining with plain ASCII would be better than either of
--Paul Hoffman, Director
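The searchability point in this exchange can be illustrated with a short sketch. It assumes the "HTML escape sequence" means a numeric character reference such as `&#223;`, which Python's "xmlcharrefreplace" error handler happens to produce; the sample word is arbitrary.

```python
import html

# Illustrative sketch: the sample word and the escaping scheme are assumptions.
original = "Stra\u00dfe"  # "Straße"

# ASCII-only rendering using numeric character references.
escaped = original.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# A plain substring search for the word, as a reader would type it.
found_in_utf8 = "Stra\u00dfe" in original
found_in_escaped = "Stra\u00dfe" in escaped

# The escaped form is recoverable, but only by tools that know to unescape it.
recovered = html.unescape(escaped)

print(escaped)
print(found_in_utf8, found_in_escaped)
print(recovered == original)
```

Unlike '?' substitution, the escaped form loses nothing, but a reader searching for the word as written finds it only in the UTF-8 text.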