ietf-822

Re: Upgrading to UTF-8

2003-02-11 14:49:11

Sam Roberts <sroberts(_at_)uniserve(_dot_)com> wrote:
- objections to utf-8 as a standard/"privileged" character set

  Fair enough. It has some problems, as any possible charset probably
  would.

Well, there are still competitors to UTF-8, such as GB18030 and byText,
to name only a few.

 However, a number of IETF protocol families, like PKIX, are going
 utf-8, rather than have to deal with multiple national language
 encodings.

Given the wide range of upcoming protocols using UTF-8 (or another
Unicode encoding) as the standard charset, the contenders IMO only have
a chance as an ``internal encoding of Unicode'' (where they might be
superior).

Future users may wonder why all network protocols use such a funny
``Unicode-UTF-8'' encoding of whatever charset is in widespread use by
then, when converting to and from it involves such strange combining and
bidirectionality rules. But I believe that Unicode will continue to be
able to represent all character sequences (UTF-8 can theoretically be
extended up to 42 bits).
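
The 42-bit figure can be reproduced with a little arithmetic. This is only
a sketch, and it assumes that the lead bytes 0xFE and 0xFF (which no UTF-8
specification assigns) would introduce 7-byte and 8-byte sequences. The
function name utf8_payload_bits is made up for this illustration.

```python
def utf8_payload_bits(length):
    """Payload bits in a UTF-8-style sequence of `length` bytes.

    Lengths 7 and 8 are a hypothetical extension (lead bytes 0xFE, 0xFF),
    not part of any UTF-8 specification.
    """
    if length == 1:
        return 7  # 0xxxxxxx
    if not 2 <= length <= 8:
        raise ValueError("length must be between 1 and 8")
    # The lead byte spends `length` one-bits (plus a zero bit, while there
    # is still room) on the length marker; what is left carries payload.
    lead_bits = max(0, 7 - length)
    # Every continuation byte has the form 10xxxxxx and carries 6 bits.
    return lead_bits + 6 * (length - 1)


for n in range(1, 9):
    print(n, "bytes:", utf8_payload_bits(n), "bits")
```

Under these assumptions, six bytes give 31 bits and eight bytes give
7 * 6 = 42 bits.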

- It is incompatible with RFC[2]822

The problem here seems to be that UTF-8 messages might leak out of a
confined UTF-8 environment. This can happen with binary messages, too.

The interesting questions seem to be:

1 - does this mean that it can't be standardized?

It could be standardised as an SMTP extension, similar to 8BITMIME.
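
As a rough sketch of what that would mean for a sender: it would look at
the EHLO reply and send raw UTF-8 headers only if the server advertises the
extension, just as it does for 8BITMIME today. The keyword UTF8HEADERS and
both function names below are invented for this example; no such extension
is defined.

```python
def advertised_extensions(ehlo_reply):
    """Collect the extension keywords from a multi-line EHLO reply."""
    keywords = set()
    lines = ehlo_reply.splitlines()
    # The first line is the greeting. Each following line has the form
    # "250-KEYWORD [params]", or "250 KEYWORD [params]" for the last one.
    for line in lines[1:]:
        if not line.startswith("250") or len(line) < 5:
            continue
        rest = line[4:].strip()
        if rest:
            keywords.add(rest.split()[0].upper())
    return keywords


def may_send_raw_utf8_headers(ehlo_reply, keyword="UTF8HEADERS"):
    """True if the (hypothetical) extension keyword is advertised."""
    return keyword in advertised_extensions(ehlo_reply)


reply = "250-mail.example.org Hello\r\n250-8BITMIME\r\n250 UTF8HEADERS"
print(sorted(advertised_extensions(reply)))
print(may_send_raw_utf8_headers(reply))
```

A sender that does not see the keyword would have to down-code the message
or refuse to send it, which leads to the next question.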

2 - can a utf-8 encoded message be down-coded during transport?

This is the real problem and IMO the only argument against UTF-8 (but
also a very strong one). Down-coding seems impossible because of the
problems that you describe here:

  This is the real problem, it seems, and it seems to be a fundamental
  property of the RFC822 format: the header field formats aren't
  self-describing. It's not possible to know whether a header field is
  unstructured, structured, and if structured, whether words are allowed
  to be encoded.

At least, that is how it has looked to me and others here. But now that
you spell out the problem so clearly, I see a possible solution:

It does not have to stay that way. The format of known headers is known.
The format of new headers can be made self-describing to the extent
that they can be encoded without loss. We just have to
define a new general format and encoding to be used for all headers
defined in the future. (Actually, ``self-describing'' could also mean
that only an encoding for characters is defined and that all user agents
have to support it).

A gateway could then encode the header the ``old'' way if it is one from
the set of the headers known at day X and the ``new'' way otherwise.
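
A minimal sketch of such a gateway rule follows. Everything in it is an
assumption made for illustration: the two header sets stand in for ``the
headers known at day X'', the ``old'' way is shown only for unstructured
fields (as RFC 2047 encoded-words), and the ``new'' way is represented by
simple percent-escaping, which is merely a placeholder for whatever general
encoding would actually be defined.

```python
from email.header import Header, decode_header, make_header
from urllib.parse import quote, unquote

# Hypothetical snapshot of the headers known at day X.
KNOWN_UNSTRUCTURED = {"subject", "comments"}
KNOWN_STRUCTURED = {"from", "to", "cc", "date", "message-id"}

# Printable ASCII except "%" passes through the placeholder "new" encoding.
SAFE = "".join(chr(c) for c in range(0x20, 0x7F) if chr(c) != "%")


def downcode_header(name, value):
    """Return a 7-bit form of a header value for a legacy next hop."""
    key = name.lower()
    if key in KNOWN_UNSTRUCTURED:
        # "Old" way: the whole value may become encoded-words.
        # (Header.encode() may fold long values across lines.)
        return value if value.isascii() else Header(value, "utf-8").encode()
    if key in KNOWN_STRUCTURED:
        if value.isascii():
            return value
        # Only some words may be encoded here; that needs a parser for
        # the specific field, which this sketch leaves out.
        raise NotImplementedError("field-specific encoding for " + name)
    # "New" way: one generic, lossless encoding for every future header.
    return quote(value, safe=SAFE)


def upcode_header(name, value):
    """Reverse downcode_header for the cases handled above."""
    key = name.lower()
    if key in KNOWN_UNSTRUCTURED:
        return str(make_header(decode_header(value)))
    if key in KNOWN_STRUCTURED:
        return value
    return unquote(value)


print(downcode_header("Subject", "Grüße"))
print(downcode_header("X-Future", "Grüße 100%"))
```

The point of the sketch is the last branch: the gateway does not need to
know anything about X-Future to encode it without loss, because all headers
outside the fixed set share one encoding.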

But remember that the use of Unicode in ``names'' (addresses, newsgroup
names) is a completely different issue. Names need to have a canonical
encoding and have to be copyable even if fonts are missing.

Claus
-- 
------------------------ http://www.faerber.muc.de/ ------------------------
OpenPGP: DSS 1024/639680F0 E7A8 AADB 6C8A 2450 67EA AF68 48A5 0E63 6396 80F0
