Tim Bunce wrote:
> On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> > IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member
> > of Unicode) because they were storing higher-plane codes using the
> > surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting
> > in two 3-byte UTF-8 sequences, 6 bytes in all) rather than the correct
> > UTF-8 way of a single 4-byte char. There is no real trouble doing it
> > that way, since anyone can convert between the 'wrong' UTF-8 and the
> > correct form. But they found that if you do a simple binary-based sort
> > of a string in AL32UTF8 and compare it to a sort in true UTF-8, you
> > end up with a subtly different order. On this basis they made a
> > request to the UTC to have AL32UTF8 added as a kludge, and out of the
> > kindness of their hearts the UTC agreed, thus saving Oracle from a
> > whole heap of work. But all are agreed that UTF-8, and not AL32UTF8,
> > is the way forward.
>
> Um, now you've confused me.
>
> The Oracle docs say "In AL32UTF8, one supplementary character is
> represented in one code point, totalling four bytes." which you
> say is the "correct UTF-8 way". So the old Oracle ``UTF8'' charset
> is what's now called "CESU-8", and what Oracle call ``AL32UTF8''
> is the "correct UTF-8 way".
>
> So did you mean CESU-8 when you said AL32UTF8?
I guess so.
Thank you for reminding me of this. I used to know that, but I forgot
it and was about to tell my colleague to use 'UTF8' (instead of
'AL32UTF8') when she creates an Oracle database for our project.
Oracle is notorious for using 'incorrect' and confusing character
encoding names. Their 'AL32UTF8' is the true and only UTF-8, while
__their__ 'UTF8' is CESU-8, a beast that MUST be confined within Oracle
and MUST NOT be leaked out to the world at large. (Needless to say, it
would have been better had it never been born.)
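
To make the difference concrete at the byte level, here is a minimal
Python sketch (my own illustration, not anything Oracle ships) of the
same supplementary character in both encodings:

    import struct

    ch = "\U0001D11E"                   # MUSICAL SYMBOL G CLEF, beyond the BMP

    # The true UTF-8 (Oracle's AL32UTF8): one 4-byte sequence.
    print(ch.encode("utf-8").hex(" "))  # f0 9d 84 9e

    # CESU-8 (Oracle's 'UTF8'): the UTF-16 surrogate pair, each half
    # pushed through the 3-byte UTF-8 pattern -- 6 bytes for the same char.
    hi, lo = struct.unpack(">2H", ch.encode("utf-16-be"))
    cesu = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (hi, lo))
    print(cesu.hex(" "))                # ed a0 b4 ed b4 9e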
Oracle has no excuse whatsoever for failing to get their 'UTF8' right
in the first place, because Unicode had been extended beyond the BMP
long before they introduced UTF8 into their product(s), let alone the
fact that ISO 10646 had non-BMP planes from the very beginning in the
1980s and that UTF-8 was devised to cover the full set of ISO 10646.
Nevertheless they failed, and in their 'UTF8' a single character beyond
the BMP was (and still is) encoded as a pair of 3-byte representations
of surrogate code points. Apparently for the sake of backward
compatibility (I wonder how many Oracle databases actually held non-BMP
characters in their 'UTF8' when they chose this route), they kept the
designation 'UTF8' for CESU-8 and came up with a new designation,
'AL32UTF8', for the true and only UTF-8.
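
The subtle sort-order difference Martin mentioned above is just as easy
to reproduce. Here is a rough sketch of the CESU-8 scheme (a toy encoder
of my own, not Oracle's code), plus a pair of strings whose binary order
differs between the two encodings:

    def cesu8(s: str) -> bytes:
        """Encode s as CESU-8: BMP characters as in UTF-8, supplementary
        characters as two 3-byte sequences, one per UTF-16 surrogate."""
        out = bytearray()
        for ch in s:
            cp = ord(ch)
            if cp < 0x10000:
                out += ch.encode("utf-8")     # BMP: identical to UTF-8
            else:
                cp -= 0x10000                 # split into a surrogate pair
                for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                    out.append(0xE0 | (unit >> 12))           # 3-byte UTF-8
                    out.append(0x80 | ((unit >> 6) & 0x3F))   # pattern, applied
                    out.append(0x80 | (unit & 0x3F))          # per surrogate
        return bytes(out)

    a = "\U00010000"   # first supplementary character (UTF-8 lead byte 0xf0)
    b = "\uE000"       # a BMP character (UTF-8 lead byte 0xee)

    # Binary comparison of true UTF-8 follows code point order...
    assert a.encode("utf-8") > b.encode("utf-8")   # f0... > ee...
    # ...but CESU-8 sorts the supplementary character first, because its
    # surrogate-derived lead byte 0xed is below 0xee.
    assert cesu8(a) < cesu8(b)                     # ed... < ee...

(CESU-8's binary order matches that of UTF-16 code units, which is
presumably the order Oracle's existing data was already sorted in.)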
Jungshik