On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?
This is the form that Oracle call AL32UTF8.
What would be the effect of setting SvUTF8_on(sv) on a valid utf8
byte string that used surrogates? Would there be problems?
(For example, a string returned from Oracle when using the UTF8
character set instead of the newer AL32UTF8 one.)
I think it makes no difference (at least I could not find one), except
for the internal storage. Several of the tests I wrote print a SQL
DUMP(nch), and you can see the difference in the internal storage in that
output. The strings come back to the client the way they were put in.
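
To illustrate the storage difference, here is a minimal sketch. It is not
taken from the tests, and Python is used purely for illustration. It assumes,
as the question describes, that the older UTF8 character set stores a
supplementary character as two surrogate code points of 3 bytes each, while
AL32UTF8 stores it as one 4-byte sequence. The function name utf8_forms and
the sample code point U+10400 are hypothetical choices.

```python
def utf8_forms(code_point):
    """Return (4-byte UTF-8, 6-byte surrogate-pair form) for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point above U+FFFF")

    # Standard UTF-8: one code point, one 4-byte sequence (the AL32UTF8 form).
    standard = chr(code_point).encode("utf-8")

    # Split into a UTF-16 surrogate pair, then encode each half separately
    # as a 3-byte sequence (assumed here to match the older UTF8 form).
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    pair = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

    return standard, pair


standard, pair = utf8_forms(0x10400)
print(standard.hex(), len(standard))  # f0909080 4
print(pair.hex(), len(pair))          # eda081edb080 6
```

Both byte strings stand for the same character; only the stored bytes differ,
which is the kind of difference DUMP() makes visible.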
I have tested this with 4 databases:

dbcharset  ncharset
---------  ---------
us7ascii   utf8
us7ascii   al16utf16
utf8       utf8
utf8       al16utf16

All tests produce the same results with all databases using both .UTF8
and .AL32UTF8 in NLS_LANG.
Lincoln