perl-unicode

Re: AL32UTF8

2004-04-30 06:30:09
On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?

This is the form that Oracle call AL32UTF8.

What would be the effect of setting SvUTF8_on(sv) on a valid utf8
byte string that used surrogates? Would there be problems?
(For example, a string returned from Oracle when using the UTF8
character set instead of the newer AL32UTF8 one.)

I think it makes no difference (at least I could not find one), except
for the internal storage.  Several of the tests I wrote print a sql
DUMP(nch), and you can see the difference in the internal store in those
prints.  The strings come back to the client, the way they were put in.

I have tested this with 4 databases

dbcharset/ncharset
--------- ---------
us7ascii /utf8
us7ascii /al16utf16
utf8     /utf8
utf8     /al16utf16

All tests produce the same results with all databases using both .UTF8
and .AL32UTF8 in NLS_LANG.

Were you using characters that require surrogates in UTF16?
If not, then you wouldn't see a difference between .UTF8 and .AL32UTF8.
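
A quick way to check whether test data exercises this case at all is to
look for code points above U+FFFF, since those are the ones that need a
surrogate pair in UTF-16. A minimal sketch follows, in Python purely for
illustration; the function name and the sample strings are made up here
and are not taken from the DBD::Oracle tests.

```python
def needs_surrogates(text: str) -> bool:
    """True if any character lies outside the BMP (code point > U+FFFF),
    i.e. it would need a surrogate pair in UTF-16."""
    return any(ord(ch) > 0xFFFF for ch in text)

# Hypothetical sample data: only the last entry is a supplementary character.
samples = {
    "ascii": "hello",
    "bmp": "\u20ac",                 # EURO SIGN, three bytes in UTF-8
    "supplementary": "\U00010400",   # DESERET CAPITAL LETTER LONG I
}

for name, s in samples.items():
    print(name, needs_surrogates(s))
```

If every test string reports False, the two NLS_LANG settings would be
expected to behave identically for that data.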

Here's a relevant quote from the Oracle 9.2 docs at
http://www.dbis.informatik.uni-goettingen.de/Teaching/oracle-doc/server.920/a96529/ch6.htm#1005295

"You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
applications. If you do not need supplementary characters, then it
does not matter whether you choose UTF8 or AL32UTF8. However, if
your OCI applications might handle supplementary characters, then
you need to make a decision. Because UTF8 can require up to three
bytes for each character, one supplementary character is represented
in two code points, totalling six bytes. In AL32UTF8, one supplementary
character is represented in one code point, totalling four bytes."
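
The byte counts in that quote can be reproduced with a small sketch.
Python is used only for illustration and the function names are
hypothetical. The six-byte form is built by hand, encoding each UTF-16
surrogate with the ordinary three-byte pattern, which is an assumption
about what the quoted text describes for the UTF8 character set.

```python
def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def encode_3byte(unit: int) -> bytes:
    """Encode one 16-bit value (0x0800..0xFFFF) with the three-byte pattern."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])

def encode_surrogate_form(cp: int) -> bytes:
    """Two code points of three bytes each: six bytes in total."""
    hi, lo = utf16_surrogates(cp)
    return encode_3byte(hi) + encode_3byte(lo)

def encode_4byte_form(cp: int) -> bytes:
    """One code point, four bytes: standard UTF-8."""
    return chr(cp).encode("utf-8")

cp = 0x10400
print(encode_surrogate_form(cp).hex())  # eda081edb080
print(encode_4byte_form(cp).hex())      # f0909080
```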

So the key question is... can we just do SvUTF8_on(sv) on either
of these kinds of Oracle UTF8 encodings? Seems like the answer is
yes, from what Jarkko says, because they are both valid UTF8.
We just need to document the issue.
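
For the documentation, it may help to show how the two forms can be told
apart in a returned byte string. The following is only a sketch, in
Python for illustration, with hypothetical function names. It assumes
the surrogate form always appears as a high surrogate (ED A0..AF xx)
immediately followed by a low surrogate (ED B0..BF xx), and it says
nothing about what perl itself does with such bytes.

```python
import re

# High surrogate (ED A0..AF xx) followed by low surrogate (ED B0..BF xx),
# each encoded with the three-byte pattern.
_PAIR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")

def has_surrogate_pairs(data: bytes) -> bool:
    """True if the byte string contains a six-byte surrogate pair."""
    return _PAIR.search(data) is not None

def _decode3(b: bytes) -> int:
    """Decode one three-byte sequence to its 16-bit value."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

def to_4byte_form(data: bytes) -> bytes:
    """Rewrite each six-byte surrogate pair as a single four-byte sequence."""
    def fix(m):
        raw = m.group(0)
        hi, lo = _decode3(raw[:3]), _decode3(raw[3:])
        cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        return chr(cp).encode("utf-8")
    return _PAIR.sub(fix, data)

six = b"a\xed\xa0\x81\xed\xb0\x80b"   # U+10400 in the surrogate form
print(has_surrogate_pairs(six))        # True
print(to_4byte_form(six).hex())        # 61f090908062
```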

Tim.

p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.
