
Re: Troubles with UTF-8

2005-12-23 07:50:08
At 13:44 23/12/2005, Masataka Ohta wrote:
Tom.Petch wrote:
> Overall, my perception is that we have the political statement - UTF-8 will be
> used - but have not yet worked out all the engineering ramifications.

Correct. Like so many results of IETF, enforcing Unicode just does
not work.

Amen. This is an architectural feature decided for political reasons which does not scale.

But, never mind. Unicode has nothing to do with the internationalization.

I beg to differ on the wording. Internationalization is an IETF/Unicode word. It is part of the equation "globalization = global environment internationalization + local environment localization". As IBM understands it, the aim is to reduce the language barrier between the core and the ends it relates to. I think it is appropriate for the IETF's US-ASCII-based Internet technology.

But the real world is "multinationalization" (to keep the same image, or multilingualization): the same thing, but for every end-to-end relation (and language). Let us consider the IETF RFC 2277 proposition: content must be in Unicode (client system) and the protocol is in US-ASCII (core system). A document may appear to be in a given language, but when you read its source it is in English interspersed with Unicode-encoded text.

The internationalization (RFC 3066bis) culture is unilateral. Networking calls for a multilateral culture architecture (RFC 4151 may help).

The only solution I see that addresses Tom Petch's requirements is to go through a common universalisation layer (not charset dependent), accepting Masataka Ohta's existing US-ASCII environment as a maximum. It should then go down to hexadecimal, getting rid of the Unicode-based layer violations and permitting a full charset support strategy in which Unicode could fully play its role as a common reference.

Obviously two-tier policies based on langtags could not develop as easily as planned.
jfc






> others to
> 0000-00FF, essentially Latin-1, which suits many Western languages but
> is not truly international.

The only appropriate subset of Unicode is 0000-007f, ASCII. Latin-1,
which introduced the confusions of the currency symbol and NBSP, is
already overkill.
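
[Illustration, not part of the original message.] The two subsets under discussion behave differently once encoded as UTF-8, and a short Python sketch may make that concrete. The helper name utf8_len is hypothetical, introduced only for this example; the sketch shows octet counts and does not take a side in the argument.

```python
def utf8_len(code_point):
    """Return how many octets UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))

# 0000-007F (ASCII): every code point is a single octet in UTF-8,
# identical to its US-ASCII octet.
ascii_lengths = {utf8_len(cp) for cp in range(0x00, 0x80)}

# 0080-00FF (the upper half of Latin-1): every code point takes
# two octets in UTF-8, unlike the single octet used by ISO 8859-1.
latin1_lengths = {utf8_len(cp) for cp in range(0x80, 0x100)}

# The two Latin-1 characters named in the message.
nbsp_octets = "\u00a0".encode("utf-8")      # NO-BREAK SPACE
currency_octets = "\u00a4".encode("utf-8")  # CURRENCY SIGN

print(ascii_lengths, latin1_lengths, nbsp_octets, currency_octets)
```

So a "Latin-1 subset of Unicode" carried in UTF-8 is not octet-compatible with ISO 8859-1, whereas the ASCII subset is octet-compatible with US-ASCII.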

> Unicode lacks a no-op, a meaningless octet,

The confusion over NBSP implies that spaces are not such meaningful
octets, so they may be replaced by line break characters.

So, the situation is worse than you would have considered and even
full Latin-1 is hopeless.

Just interpret UTF-8 as ASCII.
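
[Illustration, not part of the original message.] One possible reading of this advice is that a protocol can parse UTF-8 text purely on ASCII octets and pass everything else through untouched. This works because UTF-8 never uses an octet below 0x80 inside a multi-octet sequence. The function name split_on_ascii and the sample header line are assumptions made for this sketch.

```python
def split_on_ascii(data, delimiter=b":"):
    """Split raw octets on a single ASCII delimiter without decoding.

    Safe for UTF-8 input: octets below 0x80 only ever represent
    ASCII characters, never part of a multi-octet sequence.
    """
    if len(delimiter) != 1 or delimiter[0] >= 0x80:
        raise ValueError("delimiter must be a single ASCII octet")
    return data.split(delimiter)

# Hypothetical header line: ASCII protocol syntax, UTF-8 content.
line = "Subject: 日本語".encode("utf-8")
name, value = split_on_ascii(line)

# Every octet of the non-ASCII content has its high bit set.
high_bit_only = all(octet >= 0x80 for octet in "日本語".encode("utf-8"))

print(name, value.decode("utf-8"), high_bit_only)
```

The parser above never needs to know which characters the non-ASCII octets represent, which is the sense in which the content is treated as opaque.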

                                                        Masataka Ohta


_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf


