perl-unicode

Re: AL32UTF8

2004-05-02 02:30:05
On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?

This is the form that Oracle call AL32UTF8.

[snip]

Were you using characters that require surrogates in UTF16?
If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.
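
For illustration, a minimal Python sketch of the difference being asked about. This is not Oracle's or Perl's code, and `cesu8_encode` is a hypothetical helper. It assumes that Oracle's UTF8 character set stores a supplementary character as two 3-byte encoded surrogates (CESU-8 style), while AL32UTF8 uses the standard single 4-byte UTF-8 sequence:

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style: supplementary characters become
    a UTF-16 surrogate pair, each half written as a 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            hi = 0xD800 | (cp >> 10)     # high surrogate
            lo = 0xDC00 | (cp & 0x3FF)   # low surrogate
            for unit in (hi, lo):
                out += bytes([
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ])
        else:
            # Characters in the BMP are encoded identically in both forms.
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010400"                 # a character that needs surrogates in UTF-16
utf8_bytes = ch.encode("utf-8")   # one 4-byte sequence
cesu8_bytes = cesu8_encode(ch)    # two 3-byte sequences
bmp_same = cesu8_encode("\u263a") == "\u263a".encode("utf-8")
```

Under that assumption the two encodings only diverge for characters outside the BMP, which is why test data without such characters would show no difference.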

Hmmm...err.. probably not... I guess I need to hunt one up.

There is only one case in which 3 and 4 byte characters can be round
tripped.  After a bunch of other changes and fixups, I tested with the
following two new totally invented (by me) super wide characters:

row:   8: nice_string=\x{32263A}   byte_string=248|140|162|152|186     (3 byte wide char)
row:   9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4 byte wide char)
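
The byte strings above are consistent with the original 1-to-6-byte UTF-8 scheme, which permits 5- and 6-byte sequences for code points beyond U+10FFFF. A minimal Python sketch, assuming that scheme is what produced them (`extended_utf8` is a hypothetical helper, not a Perl or DBD::Oracle function):

```python
def extended_utf8(cp: int) -> bytes:
    """Encode a code point using the original 1..6-byte UTF-8 scheme."""
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    forms = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, n in forms:
        if cp < limit:
            # Continuation bytes carry 6 bits each, most significant first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
            return bytes([lead | (cp >> (6 * n))] + tail)
    raise ValueError("code point too large for the 6-byte form")


row8 = list(extended_utf8(0x32263A))     # 5-byte sequence
row9 = list(extended_utf8(0x2532263A))   # 6-byte sequence
```

Both test values lie far outside the Unicode range, so a strict UTF-8 decoder would be expected to reject these sequences.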

In a database with ncharset=al16utf16, storage is as follows (NLS_NCHAR=UTF8 or AL32UTF8):

        row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253
        row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253
        
        Values can NOT be round tripped.

In a database with Ncharset=utf8 storage is as follows (NLS_NCHAR=AL32UTF8)

        row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
        row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191
        
        Values can NOT be round tripped.
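
The stored values in the two failing cases look like one U+FFFD REPLACEMENT CHARACTER per input byte: 255,253 is U+FFFD in big-endian UTF-16 and 239,191,189 is U+FFFD in UTF-8. A minimal Python sketch that reproduces the row 8 dumps, assuming the conversion treats each rejected byte separately (this mimics the observed output only and makes no claim about how Oracle implements it):

```python
# Row 8 as sent by the client: a 5-byte sequence a strict decoder rejects.
row8 = bytes([248, 140, 162, 152, 186])

# A strict UTF-8 decoder with replacement substitutes U+FFFD for each bad byte.
replaced = row8.decode("utf-8", errors="replace")

as_utf16 = list(replaced.encode("utf-16-be"))  # compare with the al16utf16 dump
as_utf8 = list(replaced.encode("utf-8"))       # compare with the utf8 dump
```

Once the original bytes have been replaced this way, nothing is left from which to recover the input, which would explain why these rows cannot be round tripped.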

In a database with Ncharset=utf8 and NLS_NCHAR=UTF8 storage is as follows:

        row 8: nch=Typ=1 Len=5: 248,140,162,152,186
        row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186
        
        Values CAN be round tripped!
        
So it would appear that UTF8 is the PREFERRED database NCHARSET, not AL16UTF16,
and that NLS_NCHAR=UTF8 is more portable than NLS_NCHAR=AL32UTF8.

[snip]
Seems reasonable.  I think you made a good point about the cost of
crawling through the data. I'm convinced. If you have not already
changed it, I will. 

p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.

I changed that last night (to use AL32UTF8).

But given the above results... perhaps I should change it back.

Lincoln

