
Re: Troubles with UTF-8

2005-12-24 05:18:23
Dear Ned,
I do not want to reopen an issue on this, and I thank you for your answer. I think I could support most of what you say, but across different layers. The real issue appears to me to be a layer confusion. You describe it well when you contrast charsets and Unicode. For historical reasons the Internet has been developed and is maintained as a US-ASCII, English-based system. This certainly simplified many things at the prototyping stage (and can be retained as a useful test-bed default configuration). But the resulting mono-layer Internet meets architectural difficulties. It can address multi-layer demands only by blocking them (RFC 2277, RFC 3066bis) or through options (e.g. the IDNA tables). It does not scale to the need.

Also, I thank you for confirming that the IETF doctrine is Unicode-based and not ISO-based. The work of Unicode is oriented towards globalization. The world does not want globalization but an equal-opportunity multinationalisation/multilingualisation (that is, an equal internationalisation for every language). This does not mean that Unicode is not a key element. But Unicode is not the _central_ element. The central element is the user, whatever his language.

jfc




At 19:13 23/12/2005, Ned Freed wrote:
> The IETF mandates the use of UTF-8 for text [RFC2277] as part of
> internationalisation.  When writing an RFC, this raises a number of issues.

> A) Character set. UTF-8 implicitly specifies the use of Unicode/IS10646 which
> contains 97,000 - and rising - characters.  Some (proposed) standards limit
> themselves to 0000..007F, which is not at all international, others to
> 0000-00FF, essentially Latin-1, which suits many Western languages but is not
> truly international. Is 97,000 really appropriate or should there be a defined
> subset?

Short answer: No.

Longer answer: Subsetting makes sense in some places, such as when defining
canonicalization schemes like our various *prep profiles. You pretty much have
to limit yourself to what's currently defined for these and Unicode is, for
better or worse, open-ended.

However, limiting everything to a subset of Unicode simply because "97,000
characters are too many" is a terrible idea. This has been tried - take a look at the various string types for Unicode in ASN.1 or the text body part types in
X.400 - and the result has been nothing but a mess.

> B) Code point. Many standards are defined in ABNF [RFC4234] which allows code
> points to be specified as, eg,  %b00010011 %d13 or %x0D none of which are
> terribly Unicode-like (U+000D). The result is standards that use one notation
> in the ABNF and a different one in the body of the document; should ABNF allow
> something closer to Unicode (as XML has done with &#000D;)?

ABNF is charset-independent, mapping onto non-negative integers, not
characters. Nothing prevents a specification from saying that a given ABNF
grammar specifies a series of Unicode characters represented in UTF-8 and using
%xFEFF or whatever in the grammar itself.
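
A minimal Python sketch (not part of the original exchange) of the layering Ned describes: the ABNF terminal %xFEFF names an abstract code point, while UTF-8 is a separate encoding layer that maps that code point to and from octets.

    # Illustrative only: %xFEFF in an ABNF rule names a code point, not bytes.
    wire = "\uFEFF".encode("utf-8")      # three octets on the wire: EF BB BF
    assert wire == b"\xef\xbb\xbf"

    decoded = wire.decode("utf-8")       # back to one abstract character
    assert ord(decoded) == 0xFEFF        # the integer the ABNF terminal ranges over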

> C) Length. Text is often variable in length so the length must be determined.
> This may be implicit from the underlying protocol or explicit as in a TLV. The
> latter is troublesome if the protocol passes through an application gateway
> which wants to normalise the encoding so as to improve security and wants to
> convert UTF to its shortest form with corresponding length changes

The various length issues and tradeoffs that exist in different Character
Encoding Schemes for Unicode are well known, have been extensively debated, and
are well understood. There are inherent tensions between the various formats
(preserve ASCII vs. equal length for all characters vs. wasting space)
and this makes any choice a compromise.
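
As a rough illustration of those tradeoffs (a Python sketch, not from the original message), the same short text costs a different number of octets in each encoding scheme:

    # ASCII letter, Latin letter, CJK ideograph, emoji (outside the BMP)
    text = "A\u00e9\u4e2d\U0001f600"

    print(len(text.encode("utf-8")))      # 1+2+3+4 = 10 octets; ASCII preserved
    print(len(text.encode("utf-16-le")))  # 2+2+2+4 = 10 octets; surrogate pair for the emoji
    print(len(text.encode("utf-32-le")))  # 4*4    = 16 octets; fixed width, more space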

> (Unicode
> lacks a no-op, a meaningless octet, one that could be added or removed without
> causing any change to the meaning of the text).

NBSP is used for this purpose.

> Other protocols use a terminating sequence. NUL is widely used in *ix; some
> protocols specify that NUL must terminate the text, some specify that it must
> not, one at least specifies that embedded NUL means that text after a NUL must
> not be displayed (interesting for security). Since UTF-8 encompasses so much,
> there is no natural terminating sequence.

This simply isn't true. NUL is present in Unicode and is commonly used as a
terminator.
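
A short sketch of why this still works (my wording, not Ned's): RFC 3629 guarantees that the octet 0x00 appears in UTF-8 only as the encoding of U+0000 itself, so NUL termination carries over unchanged.

    sample = "h\u00e9llo \u4e16\u754c \U0001f600"
    encoded = sample.encode("utf-8")
    assert b"\x00" not in encoded          # no embedded NUL unless the text contains U+0000

    terminated = encoded + b"\x00"         # C-style termination is still unambiguous
    text, _, _ = terminated.partition(b"\x00")
    assert text.decode("utf-8") == sample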

> D) Transparency. An issue linked to C), protocols may have reserved characters,
> used to parse the data, which must not then appear in text.  Some protocols
> prohibit these characters (or at least the single octet encoding of them),
> others have a transfer syntax, such as base64, quoted-printable, %xx or an
> escape character ( " \ %).  We could do with a standard syntax.

I disagree. Different environments have very different constraints, and in many
cases these constraints interact with the underlying character data in ways
that force the use of different escaping conventions. For example, the
differences between the quoted-printable content-transfer-encoding and the Q
encoding of encoded-words are NOT gratuitous.
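
For instance (a Python sketch using the standard library; the exact escaped output shown in the comments is approximate), the same text is escaped differently by the quoted-printable content-transfer-encoding and by the RFC 2047 "Q" encoding, because spaces and '?' are significant inside an encoded-word but not in a body:

    import quopri
    from email.charset import Charset, QP
    from email.header import Header

    raw = "caf\u00e9 time?"

    # Body escaping: quoted-printable leaves the space and '?' literal.
    print(quopri.encodestring(raw.encode("utf-8")))   # e.g. b'caf=C3=A9 time?'

    # Header escaping: "Q" must also protect the space (as '_') and the '?'.
    cs = Charset("utf-8")
    cs.header_encoding = QP
    print(Header(raw, charset=cs).encode())           # e.g. =?utf-8?q?caf=C3=A9_time=3F?=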

> E) Accessibility. The character encoding is specified in UTF-8 [RFC3629] which
> is readily accessible (of course:-) but to use it properly needs reference to
> IS10646, which is not.  I would like to check the correct name of eg
> hyphen-minus (Hyphen-minus, Hyphen-Minus, ???) and in the absence of IS10646 am
> unable to do so.

The entire Unicode character database is readily available online:

   http://www.unicode.org/ucd/

A quick check shows that 0x002D is written as HYPHEN-MINUS in the database and
in the code charts.
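
The same lookup can also be done locally with Python's bundled copy of the Unicode Character Database (a quick illustration, not part of the original message):

    import unicodedata

    print(unicodedata.name("-"))        # HYPHEN-MINUS
    print(unicodedata.name("\u00a0"))   # NO-BREAK SPACE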

I didn't need (and never have needed) a copy of ISO 10646 to find out stuff
like this. I do find a copy of the printed Unicode book useful, though not
required (I have versions 1 through 3); it is readily available, albeit
not free.

The reason many of our standards documents refer to ISO 10646 is that at one
time there was concern that Unicode wasn't sufficiently stable, and it was felt
that reference to the ISO document would offer some protection against
capricious change. I think in retrospect this concern has been shown to be
unwarranted, and all things being equal I would prefer to see references to the more readily available Unicode materials. (Given the wide deployment of Unicode
now there is effectively no chance of a major change along the lines of the
Hangul reshuffle between V1 and V2.)

> Overall, my perception is that we have the political statement - UTF-8 will be
> used - but have not yet worked out all the engineering ramifications.

Well, we have a lot more than a political statement - a huge amount of
engineering work has been done to make Unicode workable in IETF protocols and
elsewhere. Does more work need to be done? Of course it does - these tasks are
by their very nature pretty much unending. But all of the points you have
raised here are either nonissues, settled issues, or engineering compromises
where there is no "right" answer.

                                Ned

_______________________________________________
Ietf mailing list
Ietf@ietf.org
https://www1.ietf.org/mailman/listinfo/ietf


