On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?
This is the form that Oracle call AL32UTF8.
[snip]
Were you using characters that require surrogates in UTF16?
If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.
Hmmm...err.. probably not... I guess I need to hunt one up.
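For anyone hunting for such a character: a minimal sketch (Python, not part of the original exchange) of why only supplementary characters can show a difference. It assumes that Oracle's UTF8 character set behaves like CESU-8, where each UTF-16 surrogate is encoded as its own 3-byte sequence, and that AL32UTF8 is standard UTF-8. The helper name `cesu8` and the test character U+10400 are illustrative choices.

```python
def cesu8(text: str) -> bytes:
    """Encode text CESU-8 style: each UTF-16 code unit becomes its own UTF-8 sequence."""
    units = text.encode("utf-16-be")
    out = b""
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # 'surrogatepass' lets a lone surrogate be written as a 3-byte sequence
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out

bmp_char = "\u263a"        # inside the BMP, no surrogates needed
supp_char = "\U00010400"   # outside the BMP, needs a surrogate pair in UTF-16

for ch in (bmp_char, supp_char):
    print(hex(ord(ch)), ch.encode("utf-8").hex(), cesu8(ch).hex())
```

For the BMP character both encodings give the same bytes; for the supplementary character standard UTF-8 uses one 4-byte sequence while the CESU-8 form uses two 3-byte sequences.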
There is only one case in which 3 and 4 byte characters can be round
tripped. After a bunch of other changes and fixups, I tested with the
following two new totally invented (by me) super wide characters:
row: 8: nice_string=\x{32263A} byte_string=248|140|162|152|186 (3 byte wide char)
row: 9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4 byte wide char)
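The byte strings above are what the original, pre-RFC 3629 UTF-8 scheme (sequences of up to 6 bytes) produces for those two code points. A minimal sketch to check that, in Python rather than Perl; the function name `encode_extended_utf8` is made up for illustration, and this is not how perl itself implements it:

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode a code point using the original UTF-8 layout (1 to 6 bytes)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, total length in bytes, lead-byte marker)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in layouts:
        if cp < limit:
            out = []
            # Peel off 6 bits at a time for the continuation bytes
            for _ in range(length - 1):
                out.append(0x80 | (cp & 0x3F))
                cp >>= 6
            # Whatever is left goes into the lead byte
            out.append(lead | cp)
            return bytes(reversed(out))
    raise ValueError("code point too large for 6-byte UTF-8")

for cp in (0x32263A, 0x2532263A):
    print(hex(cp), "|".join(str(b) for b in encode_extended_utf8(cp)))
```

Both values lie above U+10FFFF, so they are not valid Unicode code points, and a strictly conforming UTF-8 converter is entitled to reject them.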
In a database with NCHARSET=AL16UTF16 (NLS_NCHAR=UTF8 or AL32UTF8), storage is
as follows:
row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253
row 9: nch=Typ=1 Len=12:
255,253,255,253,255,253,255,253,255,253,255,253
Values can NOT be round tripped.
In a database with Ncharset=utf8 storage is as follows (NLS_NCHAR=AL32UTF8)
row 8: nch=Typ=1 Len=15:
239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
row 9: nch=Typ=1 Len=18:
239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191
Values can NOT be round tripped.
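A quick way to read the two failing dumps (a sketch in Python, using only the complete byte lists shown above): 255,253 is U+FFFD, the Unicode replacement character, in big-endian UTF-16, and 239,191,189 is the same character in UTF-8. My reading, which is an interpretation and not something the dumps prove, is that the conversion replaced each input byte with one replacement character.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Byte lists copied from the dumps above
row8_al16utf16 = bytes([255, 253] * 5)   # Len=10
row9_al16utf16 = bytes([255, 253] * 6)   # Len=12
row8_utf8 = bytes([239, 191, 189] * 5)   # Len=15

decoded = {
    "row 8, AL16UTF16": row8_al16utf16.decode("utf-16-be"),
    "row 9, AL16UTF16": row9_al16utf16.decode("utf-16-be"),
    "row 8, UTF8": row8_utf8.decode("utf-8"),
}

for label, text in decoded.items():
    only_replacement = set(text) == {REPLACEMENT}
    print(label, len(text), "characters, all U+FFFD:", only_replacement)
```

Row 8 went in as 5 bytes and comes back as 5 replacement characters; row 9 went in as 6 bytes and comes back as 6 in the AL16UTF16 case.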
In a database with Ncharset=utf8 and NLS_NCHAR=UTF8 storage is as follows:
row 8: nch=Typ=1 Len=5: 248,140,162,152,186
row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186
Values CAN be round tripped!
So it would appear that UTF8 is the PREFERRED database NCHARSET, not AL16UTF16,
and that NLS_NCHAR=UTF8 is more portable than NLS_NCHAR=AL32UTF8.
[snip]
Seems reasonable. I think you made a good point about the cost of
crawling through the data. I'm convinced. If you have not already
changed it, I will.
p.s. If we do opt for (effectively) defaulting NLS_NCHAR when neither NLS_LANG
nor NLS_NCHAR is defined, then we should use AL32UTF8 if possible.
I changed that last night (to use AL32UTF8).
But given the above results... perhaps I should change it back.
Lincoln