perl-unicode

Re: AL32UTF8

2004-04-30 06:30:09
[The background to this is that Lincoln and I have been working on
Unicode support for DBD::Oracle. (Actually Lincoln's done most of
the heavy lifting, I've mostly been setting goals and directions
at the DBI API level and scratching at edge cases. Like this one.)]

On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
Tim Bunce wrote:

Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?

Mmmh.  Right and wrong... as a single code point, yes, since the real
UTF-8 doesn't do surrogates which are only a UTF-16 thing.  4 bytes, no,
3 bytes.

This is the form that Oracle call AL32UTF8.

Does this

http://www.unicode.org/reports/tr26/

look like Oracle's older (?) UTF8?

"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit form similar to the UTF-8 transformation, but
without first converting the input surrogate pairs to a scalar value."
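The difference described in that passage can be sketched in a few lines. This is only an illustration in Python, not anything taken from Oracle or DBD::Oracle; the helper name cesu8_encode is made up for the example, and U+10400 is simply a convenient supplementary character. It encodes the same character both ways, so the byte counts can be compared directly: four bytes in standard UTF-8, and six in CESU-8 (three per surrogate code unit).

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    Each UTF-16 code unit (including each half of a surrogate pair)
    is put through the UTF-8 transformation on its own.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate,
        # which the strict UTF-8 codec would refuse to produce.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"            # one supplementary-plane code point

utf8 = ch.encode("utf-8")    # standard UTF-8: one 4-byte sequence
cesu8 = cesu8_encode(ch)     # CESU-8: two 3-byte sequences

print(utf8.hex(" "))         # f0 90 90 80
print(cesu8.hex(" "))        # ed a0 81 ed b0 80
```

For characters in the Basic Multilingual Plane the two encodings produce identical bytes; they only diverge for supplementary characters.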

Yes, that sounds like it.  But see my quote from Oracle docs in my
reply to Lincoln's email to make sure.

(I presume it dates from before UTF16 had surrogate pairs. When
they were added to UTF16 they gave a name "CESU-8" to what old UTF16
to UTF8 conversion code would produce when given surrogate pairs.
A classic standards maneuver :)

What would be the effect of setting SvUTF8_on(sv) on a valid utf8
byte string that used surrogates? Would there be problems?

You would get out the surrogate code points from the sv, not the
supplementary plane code point the surrogate pairs are encoding.
Depends what you do with the data: this might be okay, might not.
Since it's valid UTF-8, nothing should croak perl-side.
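As a rough illustration of "you would get out the surrogate code points", here is a Python stand-in. It is not Perl and does not call SvUTF8_on; it assumes that a lenient decode (Python's "surrogatepass" error handler) behaves like the case described above, and the helper combine_surrogates is hypothetical. The input is the six-byte surrogate form of U+10400.

```python
# Six bytes: the surrogate pair D801 DC00, each half UTF-8-transformed separately.
surrogate_bytes = bytes.fromhex("eda081edb080")

# Lenient decode: the surrogates come out as two separate code points.
lenient = surrogate_bytes.decode("utf-8", "surrogatepass")
code_points = [f"U+{ord(c):04X}" for c in lenient]
print(code_points)           # ['U+D801', 'U+DC00']

# Python's strict UTF-8 decoder rejects the same bytes outright.
try:
    surrogate_bytes.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True


def combine_surrogates(s: str) -> str:
    """Hypothetical helper: merge surrogate pairs into supplementary code points
    by round-tripping through UTF-16."""
    return s.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


combined = combine_surrogates(lenient)
print(f"U+{ord(combined):04X}")   # U+10400
```

Whether the two-code-point form is acceptable depends, as noted, on what is done with the data: string length, comparison, and re-encoding all see two characters rather than one until the pair is combined.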

Okay. Thanks.

Basically I need to document that Oracle "AL32UTF8" should be used
as the client charset in preference to the older "UTF8" because
"UTF8" doesn't do the "best"? thing with surrogate pairs.

Seems like "best" is the, er, best word to use here as "right"
would be too strong. But then the "shortest form" requirement
is quite strong so perhaps "modern standard" would be the right words.

Tim.
