perl-unicode

Re: AL32UTF8

2004-04-29 11:30:12
Tim Bunce wrote:

Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?

Mmmh.  Mostly right... as a single code point, yes, since real UTF-8
doesn't do surrogates, which are only a UTF-16 thing.  4 bytes, yes, for
any code point above U+FFFF; it is a surrogate encoded on its own that
would take 3 bytes.
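
The byte counts are easy to check by hand. The following is only a sketch in Python (used here purely for illustration, not perl's internals), and U+10400 is an arbitrarily chosen supplementary-plane code point. It shows that standard UTF-8 encodes it as one 4-byte sequence, while UTF-16 needs a surrogate pair, each half of which would take 3 bytes if encoded separately.

```python
# U+10400 is an arbitrary supplementary-plane code point (illustrative only).
ch = "\U00010400"

utf8 = ch.encode("utf-8")          # one 4-byte sequence
utf16 = ch.encode("utf-16-be")     # two 16-bit code units (a surrogate pair)

# Split the UTF-16 bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# Each surrogate, encoded on its own, takes 3 bytes. "surrogatepass" is
# needed because Python refuses to encode lone surrogates by default.
per_surrogate = [len(chr(u).encode("utf-8", "surrogatepass")) for u in units]

print(len(utf8), utf8.hex())
print([hex(u) for u in units])
print(per_surrogate)
```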

This is the form that Oracle call AL32UTF8.

Does this

http://www.unicode.org/reports/tr26/

look like Oracle's older (?) UTF8?
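
For reference, the scheme TR26 describes (CESU-8) can be sketched roughly as below. This is a minimal illustration in Python, not Oracle's or perl's code; the function name cesu8_encode is made up for this example. Code points above U+FFFF are first split into a UTF-16 surrogate pair, and each surrogate is then written as its own 3-byte sequence, giving 6 bytes where UTF-8 would use 4.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text the way TR26 (CESU-8) describes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as an ordinary 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for s in (high, low):
                out += chr(s).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "A\U00010400"   # arbitrary example string
print(cesu8_encode(sample).hex())
print(sample.encode("utf-8").hex())
```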

What would be the effect of setting SvUTF8_on(sv) on a valid utf8
byte string that used surrogates? Would there be problems?

You would get out the surrogate code points from the sv, not the
supplementary plane code point the surrogate pairs are encoding.
Depends what you do with the data: this might be okay, might not.
Since it's valid UTF-8, nothing should croak perl-side.
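
A rough illustration of that effect, again in Python rather than perl: the bytes below are U+10400 (an arbitrary example) written as a surrogate pair. Python's strict decoder is stricter than the perl behaviour described above and rejects them, so the "surrogatepass" error handler is used here as a stand-in for a lax decoder. That substitution is an assumption of this sketch, not a statement about how perl is implemented.

```python
# U+10400 written as the surrogate pair D801 DC00, each half encoded as
# a 3-byte sequence.
raw = b"\xed\xa0\x81\xed\xb0\x80"

# Python's strict UTF-8 decoder refuses encoded surrogates.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lax decode hands back the two surrogate code points, not U+10400.
lax = raw.decode("utf-8", "surrogatepass")
code_points = [hex(ord(c)) for c in lax]

# Getting the supplementary code point back takes an explicit pairing
# step; a round trip through UTF-16 is one way to do it.
fixed = lax.encode("utf-16-be", "surrogatepass").decode("utf-16-be")

print(strict_ok)
print(code_points)
print(hex(ord(fixed)))
```

Whether the unpaired form matters depends, as said, on what is done with the data: length counts, comparisons, and re-encoding to other charsets will all see two code points instead of one.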

(For example, a string returned from Oracle when using the UTF8
character set instead of the newer AL32UTF8 one.)

Tim.
