Re: [Encode] UCS/UTF mess and Surrogate Handlings

On Fri, 5 Apr 2002, Jarkko Hietaniemi wrote:

P.S.  Does utf8 support surrogates?  Surrogate pair is definitely the


No.  Surrogates are solely for UTF-16.  There's no need for surrogates
in UTF-8 -- if we wanted to encode U+D800 using UTF-8, we *could* --
BUT we should not.  Encoding U+D800 as UTF-8 should not be attempted,
the whole surrogate space is a discontinuity in the Unicode code point
space reserved for the evils of UTF-16.


  I couldn't agree more with you on this. Unfortunately, people
at Oracle and PeopleSoft think differently. What actually happened is
that they made a serious design mistake: their DBs understand UTF-8
sequences only up to 3 bytes long, even though it was already plainly
clear when they added UTF-8 support that ISO 10646/Unicode was not
just for the BMP. When the planes beyond the BMP finally began to be
filled with actual characters, they came up with the stupid idea of
representing each such character as two 3-byte UTF-8 sequences, one
for each half of its UTF-16 surrogate pair.
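
  To make the difference concrete, here is a rough Python sketch of
the two byte layouts for a single supplementary character. The function
names are made up for illustration and are not part of Encode or any
other library; the bit arithmetic is just the usual UTF-16 surrogate
split followed by the generic 3-byte UTF-8 pattern.

```python
def utf16_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def three_byte_form(unit):
    """Encode one 16-bit value with the 3-byte UTF-8 bit pattern
    1110xxxx 10xxxxxx 10xxxxxx (hypothetical helper)."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_supplementary(cp):
    """Six-byte form: each surrogate half encoded separately."""
    hi, lo = utf16_surrogates(cp)
    return three_byte_form(hi) + three_byte_form(lo)


cp = 0x10400  # U+10400 DESERET CAPITAL LETTER LONG I
print(chr(cp).encode("utf-8").hex())  # f0909080      (real UTF-8, 4 bytes)
print(cesu8_supplementary(cp).hex())  # eda081edb080  (two 3-byte units)
```

  The same character thus ends up as 4 bytes in UTF-8 proper and as
6 bytes in the surrogate-based form, and a strict UTF-8 decoder should
reject the latter.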

  A lot of people on the Unicode mailing list voiced very strong
and technically solid objections to this, but Oracle and PeopleSoft
went on to publish DUTR #26: Compatibility Encoding Scheme for UTF-16:
8-Bit (CESU-8) (http://www.unicode.org/unicode/reports/tr26). Does
Encode need to support this monster?  I hope not.

   Jungshik Shin
