Tim Bunce wrote:
> On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> > IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member
> > of Unicode) because they were storing higher-plane codes using the
> > surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting
> > in two 3-byte UTF-8 sequences, 6 bytes in all) rather than the correct
> > UTF-8 way of a single 4-byte char. There is no real trouble doing it
> > that way, since anyone can convert between the 'wrong' UTF-8 and the
> > correct form. But they found that if you do a simple binary-based sort
> > of a string in AL32UTF8 and compare it to a sort in true UTF-8, you
> > end up with a subtly different order. On this basis they made a
> > request to the UTC to have AL32UTF8 added as a kludge, and out of the
> > kindness of their hearts the UTC agreed, thus saving Oracle from a
> > whole heap of work. But all are agreed that UTF-8, and not AL32UTF8,
> > is the way forward.
>
> Um, now you've confused me.
>
> The Oracle docs say "In AL32UTF8, one supplementary character is
> represented in one code point, totalling four bytes." which you
> say is the "correct UTF-8 way". So the old Oracle ``UTF8'' charset
> is what's now called "CESU-8", and what Oracle call ``AL32UTF8''
> is the "correct UTF-8 way".
>
> So did you mean CESU-8 when you said AL32UTF8?
I guess so.
Thank you for reminding me of this. I used to know that, but I forgot
it and was about to tell my colleague to use 'UTF8' (instead of
'AL32UTF8') when she creates an Oracle database for our project.
Oracle is notorious for using 'incorrect' and confusing character
encoding names. Their 'AL32UTF8' is the true and only UTF-8, while
__their__ 'UTF8' is CESU-8, a beast that MUST be confined within Oracle
and MUST NOT be leaked out to the world at large. (Needless to say, it
would have been better had it never been born.)
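
To make the difference concrete at the byte level, here is a minimal
Python sketch (my own illustration, not anything Oracle ships) of the
same supplementary character in both encodings:

    import struct

    ch = "\U0001D11E"                   # MUSICAL SYMBOL G CLEF, beyond the BMP

    # The true UTF-8 (Oracle's AL32UTF8): one 4-byte sequence.
    print(ch.encode("utf-8").hex(" "))  # f0 9d 84 9e

    # CESU-8 (Oracle's 'UTF8'): the UTF-16 surrogate pair, each half
    # pushed through the 3-byte UTF-8 pattern -- 6 bytes for the same char.
    hi, lo = struct.unpack(">2H", ch.encode("utf-16-be"))
    cesu = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (hi, lo))
    print(cesu.hex(" "))                # ed a0 b4 ed b4 9e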
Oracle has no excuse whatsoever for failing to get their 'UTF8' right
in the first place, because Unicode had been extended beyond the BMP
long before they introduced UTF8 into their product(s), let alone the
fact that ISO 10646 had non-BMP planes from the very beginning in the
1980s and that UTF-8 was devised to cover the full set of ISO 10646.
Nevertheless they failed, and in their 'UTF8' a single character beyond
the BMP was (and still is) encoded as a pair of 3-byte representations
of surrogate code points. Apparently for the sake of backward
compatibility (I wonder how many Oracle databases actually held non-BMP
characters in their 'UTF8' when they chose this route), they kept the
designation 'UTF8' for CESU-8 and came up with a new designation,
'AL32UTF8', for the true and only UTF-8.
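
The subtle sort-order difference Martin mentioned above is just as easy
to reproduce. Here is a rough sketch of the CESU-8 scheme (a toy encoder
of my own, not Oracle's code), plus a pair of strings whose binary order
differs between the two encodings:

    def cesu8(s: str) -> bytes:
        """Encode s as CESU-8: BMP characters as in UTF-8, supplementary
        characters as two 3-byte sequences, one per UTF-16 surrogate."""
        out = bytearray()
        for ch in s:
            cp = ord(ch)
            if cp < 0x10000:
                out += ch.encode("utf-8")     # BMP: identical to UTF-8
            else:
                cp -= 0x10000                 # split into a surrogate pair
                for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                    out.append(0xE0 | (unit >> 12))           # 3-byte UTF-8
                    out.append(0x80 | ((unit >> 6) & 0x3F))   # pattern, applied
                    out.append(0x80 | (unit & 0x3F))          # per surrogate
        return bytes(out)

    a = "\U00010000"   # first supplementary character (UTF-8 lead byte 0xf0)
    b = "\uE000"       # a BMP character (UTF-8 lead byte 0xee)

    # Binary comparison of true UTF-8 follows code point order...
    assert a.encode("utf-8") > b.encode("utf-8")   # f0... > ee...
    # ...but CESU-8 sorts the supplementary character first, because its
    # surrogate-derived lead byte 0xed is below 0xee.
    assert cesu8(a) < cesu8(b)                     # ed... < ee...

(CESU-8's binary order matches that of UTF-16 code units, which is
presumably the order Oracle's existing data was already sorted in.)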
Jungshik