Re: AL32UTF8



Jarkko Hietaniemi wrote:


Tim Bunce wrote:

Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?


Mmmh.  Right and wrong... as a single code point, yes, since 
the real UTF-8 doesn't do surrogates which are only a UTF-16 
thing.  4 bytes, no, 3 bytes.


Surrogates are the way UTF-16 encodes non-BMP (>16 bit) 
codepoints.

BMP code points are the Unicode codepoints 0 to 0xFFFF (16 bit).
The non-BMP codepoints are 0x10000-0x10FFFF (21 bit); a surrogate
pair carries the 20 bit offset from 0x10000.
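
To make the surrogate arithmetic concrete, here is a minimal Python
sketch of how a non-BMP codepoint is split into a UTF-16 surrogate
pair and joined back. The function names are made up for this
illustration; they are not taken from perl or Oracle.

```python
def to_surrogate_pair(cp):
    """Split a non-BMP codepoint into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP codepoint: %#x" % cp)
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Join a (high, low) surrogate pair back into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x10000) gives (0xD800, 0xDC00), and
from_surrogate_pair reverses it.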

The "shortest form" security requirement means that both BMP and
non-BMP codepoints (the latter encoded as surrogates in UTF-16) must
be encoded in the minimal number of bytes. For UTF-8 this means:

1-3 UTF-8 bytes encodes the BMP
-------------------------------
1 UTF-8 byte  = 7 bits
2 UTF-8 bytes = 5 bits + 6 bits = 11 bits
3 UTF-8 bytes = 4 bits + 6 bits + 6 bits = 16 bits

4 UTF-8 bytes encodes the non-BMP
---------------------------------
4 UTF-8 bytes = 3 bits + 6 bits + 6 bits + 6 bits = 21 bits
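
The byte counts in the table can be checked with a small Python
sketch. The helper name utf8_length is hypothetical; the comparison
against Python's own UTF-8 encoder is only a sanity check.

```python
def utf8_length(cp):
    """Number of bytes in the shortest-form UTF-8 encoding of a codepoint."""
    if cp < 0:
        raise ValueError("negative codepoint")
    if cp < 0x80:          # up to 7 bits
        return 1
    if cp < 0x800:         # up to 11 bits
        return 2
    if cp < 0x10000:       # up to 16 bits (the BMP)
        return 3
    if cp <= 0x10FFFF:     # non-BMP, up to 21 bits
        return 4
    raise ValueError("beyond the Unicode range: %#x" % cp)


# Compare against Python's encoder at the range boundaries.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```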

I suspect there is confusion in the original posting about 
what is meant by surrogates. Perhaps the question actually 
was intended to be: "when converting from UTF-16 to UTF-8 do 
the surrogate pairs become 4 or 6 UTF-8 bytes?".

The single 4-byte form is what Oracle calls AL32UTF8.
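
The 4 versus 6 byte question can be shown side by side. The sketch
below is Python, not perl or Oracle code, and the function name is
made up. It encodes one non-BMP codepoint directly (4 bytes) and then
encodes each of its UTF-16 surrogates as if it were a codepoint of its
own (3 + 3 = 6 bytes), which is the scheme the tr26 report calls
CESU-8. Python's "surrogatepass" error handler is used only because
its strict UTF-8 encoder refuses to encode surrogates.

```python
import struct


def encode_surrogates_separately(cp):
    """Encode a codepoint by UTF-8-encoding each of its UTF-16 code units."""
    utf16 = chr(cp).encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


cp = 0x10000
four_byte = chr(cp).encode("utf-8")             # one 4-byte sequence
six_byte = encode_surrogates_separately(cp)     # two 3-byte sequences
```

For BMP codepoints the two encodings are identical; they differ only
for codepoints above 0xFFFF.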


Does this

http://www.unicode.org/reports/tr26/

look like Oracle's older (?) UTF8?

What would be the effect of setting SvUTF8_on(sv) on a valid utf8
byte string that used surrogates? Would there be problems?


You would get out the surrogate code points from the sv, not the
supplementary plane code point the surrogate pairs are encoding.
Depends what you do with the data: this might be okay, might not.
Since it's valid UTF-8, nothing should croak perl-side.

(For example, a string returned from Oracle when using the UTF8
character set instead of the newer AL32UTF8 one.)
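
As an analogy for the answer above, this Python sketch decodes a
6-byte surrogate-style string leniently and shows that two surrogate
code points come out, not one supplementary plane code point. It is
not a description of what perl does internally: Python's strict
decoder rejects these bytes, so the "surrogatepass" handler stands in
for a decoder that lets them through. The combine_surrogates helper is
hypothetical and shows one way the data could be repaired afterwards.

```python
def combine_surrogates(text):
    """Replace each high+low surrogate pair with the codepoint it encodes."""
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(text[i])   # ordinary character or unpaired surrogate
        i += 1
    return "".join(out)


raw = b"\xed\xa0\x80\xed\xb0\x80"               # U+10000 in the 6-byte form
decoded = raw.decode("utf-8", "surrogatepass")  # two surrogate code points
repaired = combine_surrogates(decoded)          # one code point, U+10000
```

Whether the repair step is needed depends on what is done with the
data afterwards.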

Tim.


-- 
Brian Stell