perl-unicode

Re: AL32UTF8

2004-05-01 14:30:07
On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:

> "You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
> applications. If you do not need supplementary characters, then it
> does not matter whether you choose UTF8 or AL32UTF8. However, if
> your OCI applications might handle supplementary characters, then
> you need to make a decision. Because UTF8 can require up to three
> bytes for each character, one supplementary character is represented
> in two code points, totalling six bytes. In AL32UTF8, one supplementary
> character is represented in one code point, totalling four bytes."
>
> So the key question is... can we just do SvUTF8_on(sv) on either
> of these kinds of Oracle UTF8 encodings? Seems like the answer is
> yes, from what Jarkko says, because they are both valid UTF8.
> We just need to document the issue.

No, Oracle's "UTF8" is very much not valid UTF-8. It encodes each
supplementary character as a pair of surrogates, and valid UTF-8
cannot contain surrogates. If you mark a string like this as UTF-8,
neither the Perl core nor other extension modules will be able to
interpret it correctly.
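
To make the problem concrete, here is a minimal sketch of how one might
check a buffer for encoded surrogates before deciding whether it is safe
to mark as UTF-8. The function name has_encoded_surrogate is hypothetical
(it is not part of Perl or of any Oracle library), and the check assumes
the input is otherwise well-formed: a surrogate (U+D800..U+DFFF) written
in three-byte form always starts with 0xED followed by a byte in
0xA0..0xBF.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf contains a three-byte sequence
 * that encodes a UTF-16 surrogate (U+D800..U+DFFF), else 0.
 * Such sequences are legal in Oracle's "UTF8" but not in real UTF-8. */
int has_encoded_surrogate(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 2 < len; i++) {
        if (buf[i] == 0xED                      /* lead byte for U+D000..U+DFFF */
            && buf[i + 1] >= 0xA0
            && buf[i + 1] <= 0xBF               /* narrows it to U+D800..U+DFFF */
            && (buf[i + 2] & 0xC0) == 0x80)     /* trailing continuation byte   */
            return 1;
    }
    return 0;
}
```

For U+10400, for example, Oracle's "UTF8" would give the six bytes
ED A0 81 ED B0 80 (flagged by the check), while AL32UTF8 gives the four
bytes F0 90 90 80 (not flagged).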

(As people have pointed out earlier in the thread, if you want a
standard name for this weird form of encoding, that's "CESU-8":
http://www.unicode.org/reports/tr26/.)

You'll need to do a conversion pass before you can mark it as UTF-8.
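
A rough sketch of what such a conversion pass could look like, assuming
the input is well-formed CESU-8. The name cesu8_to_utf8 is made up for
illustration and is not an existing Perl or DBD::Oracle function. Each
six-byte surrogate pair is folded into one four-byte UTF-8 sequence and
everything else is copied as-is. Unpaired surrogates are copied through
unchanged here; real code would probably want to reject or replace them.

```c
#include <stddef.h>

/* Hypothetical helper: convert CESU-8 in in[0..len) to UTF-8 in out.
 * out must have room for at least len bytes (the output never grows).
 * Returns the number of bytes written. */
size_t cesu8_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < len) {
        /* High surrogate (ED A0..AF xx) followed by low (ED B0..BF xx)? */
        if (i + 5 < len
            && in[i] == 0xED
            && (in[i + 1] & 0xF0) == 0xA0
            && (in[i + 2] & 0xC0) == 0x80
            && in[i + 3] == 0xED
            && (in[i + 4] & 0xF0) == 0xB0
            && (in[i + 5] & 0xC0) == 0x80) {
            unsigned long hi = 0xD000UL
                | ((unsigned long)(in[i + 1] & 0x3F) << 6)
                | (unsigned long)(in[i + 2] & 0x3F);
            unsigned long lo = 0xD000UL
                | ((unsigned long)(in[i + 4] & 0x3F) << 6)
                | (unsigned long)(in[i + 5] & 0x3F);
            /* Combine the pair into one supplementary code point. */
            unsigned long cp = 0x10000UL + ((hi - 0xD800UL) << 10)
                             + (lo - 0xDC00UL);

            /* Emit it as a four-byte UTF-8 sequence. */
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 6;
        } else {
            out[o++] = in[i++];   /* anything else is copied unchanged */
        }
    }
    return o;
}
```

Only after a pass like this would the result be something you could
sensibly call SvUTF8_on(sv) on.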

Regards,
                                                Owen
