Re: AL32UTF8

Dear Tim,

"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit form similar to the UTF-8 transformation, but
without first converting the input surrogate pairs to a scalar value."

Yes, that sounds like it.  But see my quote from Oracle docs in my
reply to Lincoln's email to make sure.
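
For what it's worth, here is a rough Python sketch of the transformation
the quoted paragraph describes. The function name cesu8_encode is my own
invention, not part of any library, and I am assuming the input holds no
lone surrogates. Each UTF-16 surrogate code unit is pushed through the
ordinary three-byte UTF-8 bit layout on its own, instead of first being
combined into a scalar value.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as described in the CESU-8 quote.

    BMP characters are encoded exactly as in UTF-8. A supplementary
    character is split into its UTF-16 surrogate pair, and each surrogate
    is written as a separate three-byte sequence (six bytes in total).
    Assumes the input contains no lone surrogates.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: identical to UTF-8.
            out += ch.encode("utf-8")
        else:
            # Supplementary character: derive the UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for unit in (high, low):
                # Three-byte UTF-8 bit layout applied to one 16-bit unit.
                out += bytes((
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ))
    return bytes(out)


if __name__ == "__main__":
    ch = "\U00010400"                  # U+10400, a supplementary character
    print(cesu8_encode(ch).hex())      # eda081edb080 (six bytes)
    print(ch.encode("utf-8").hex())    # f0909080     (four bytes)
```

For text that stays within the BMP the two encodings give identical
bytes, so the difference only shows up with supplementary characters.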

(I presume it dates from before UTF-16 had surrogate pairs. When
surrogate pairs were added to UTF-16, the name "CESU-8" was given to
what old UTF-16 to UTF-8 conversion code would produce when handed
surrogate pairs. A classic standards maneuver :)

IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member
of Unicode) because they were storing higher plane codes using the
surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in
2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single
char of 4+ bytes. There is no real trouble doing it that way since
anyone can convert between the 'wrong' UTF-8 and the correct form. But
they found that if you do a simple binary based sort of a string in
AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly
different order. On this basis they made a request to the UTC to have
AL32UTF8 added as a kludge and out of the kindness of their hearts the
UTC agreed thus saving Oracle from a whole heap of work. But all are
agreed that UTF-8 and not AL32UTF8 is the way forward.
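
To make the sort-order point concrete, here is a small Python sketch
comparing a plain byte-wise sort of true UTF-8 against a byte-wise sort
of the six-byte surrogate-pair form. The helper name surrogate_form and
the sample strings are made up for illustration. The difference appears
when a BMP character in the range U+E000..U+FFFF sits next to a
supplementary character.

```python
def surrogate_form(text: str) -> bytes:
    """Hypothetical helper: encode supplementary characters as two
    three-byte sequences, one per UTF-16 surrogate, and everything
    else as ordinary UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8", "surrogatepass")
        else:
            v = cp - 0x10000
            # "surrogatepass" lets Python emit the bytes for a lone surrogate.
            out += chr(0xD800 | (v >> 10)).encode("utf-8", "surrogatepass")
            out += chr(0xDC00 | (v & 0x3FF)).encode("utf-8", "surrogatepass")
    return bytes(out)


# Sample data (illustrative only): ASCII, a high BMP character, and a
# supplementary character.
strings = ["\uff5e", "\U00010400", "a"]

# Binary sort on true UTF-8 bytes follows code point order.
by_utf8 = sorted(strings, key=lambda s: s.encode("utf-8"))

# Binary sort on the surrogate-pair form puts the supplementary
# character (lead byte 0xED) before U+FF5E (lead byte 0xEF).
by_surrogate_form = sorted(strings, key=surrogate_form)

if __name__ == "__main__":
    print([hex(ord(s)) for s in by_utf8])            # ['0x61', '0xff5e', '0x10400']
    print([hex(ord(s)) for s in by_surrogate_form])  # ['0x61', '0x10400', '0xff5e']
```

The two orders agree for everything below U+E000, which is presumably
why the difference is easy to miss.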


Yours,
Martin
