perl-unicode

Re: AL32UTF8

2004-05-01 15:30:06
Hello Owen, 

On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:

"You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
applications. If you do not need supplementary characters, then it
does not matter whether you choose UTF8 or AL32UTF8. However, if
your OCI applications might handle supplementary characters, then
you need to make a decision. Because UTF8 can require up to three
bytes for each character, one supplementary character is represented
in two code points, totalling six bytes. In AL32UTF8, one supplementary
character is represented in one code point, totalling four bytes."

So the key question is... can we just do SvUTF8_on(sv) on either
of these kinds of Oracle UTF8 encodings? Seems like the answer is
yes, from what Jarkko says, because they are both valid UTF8.
We just need to document the issue.

No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
contain surrogates. If you mark a string like this as UTF-8 neither
the Perl core nor other extension modules will be able to interpret
it correctly.

(As people have pointed out earlier in the thread,
if you want a standard name for this weird form of encoding, that's
"CESU-8": http://www.unicode.org/reports/tr26/.)

You'll need to do a conversion pass before you can mark it as UTF-8.
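
For illustration, a minimal sketch of what such a conversion pass might
look like in plain C. This is not code from DBD::Oracle or the Perl core;
the names cesu8_to_utf8 and decode3 are made up for the example. It
assumes the input is otherwise well formed, and it only rewrites
surrogate pairs (two 3-byte sequences) into a single 4-byte sequence,
copying everything else through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Decode the 3-byte sequence at p. Returns its 16-bit value, or -1 if
 * fewer than 3 bytes are available or the bytes are not a 3-byte form. */
static long decode3(const unsigned char *p, size_t avail)
{
    if (avail < 3) return -1;
    if ((p[0] & 0xF0) != 0xE0) return -1;
    if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80) return -1;
    return ((long)(p[0] & 0x0F) << 12)
         | ((long)(p[1] & 0x3F) << 6)
         |  (long)(p[2] & 0x3F);
}

/* Hypothetical helper: rewrite CESU-8 surrogate pairs as 4-byte UTF-8.
 * The output is never longer than the input, so dst needs len bytes.
 * Returns the number of bytes written. No other validation is done. */
size_t cesu8_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        long hi = decode3(src + i, len - i);
        if (hi >= 0xD800 && hi <= 0xDBFF) {
            long lo = decode3(src + i + 3, len - i - 3);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                /* Combine the pair into one supplementary code point. */
                unsigned long cp = 0x10000UL
                    + (((unsigned long)(hi - 0xD800) << 10)
                       | (unsigned long)(lo - 0xDC00));
                dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
                dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
                i += 6;
                continue;
            }
        }
        /* Not a surrogate pair: copy the byte as it is. */
        dst[o++] = src[i++];
    }
    return o;
}
```

With this sketch, U+10400 arrives as the six bytes ED A0 81 ED B0 80 and
leaves as the four bytes F0 90 90 80. Only after a pass like this would
it be safe to set the UTF-8 flag on the result.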


Your message comes at a PERFECT time!

I just spent about 3 hours coming to that same conclusion empirically:

I made the changes to do what Tim had asked (just mark the string
as UTF8), and it breaks a bunch of stuff, like the 8-bit nchar test,
and the long test when the column type is LONG.

I think I am going to back out (or rather... NOT COMMIT) those changes,
leaving in place the code that inspects the fetched string to see whether
it "looks like" UTF-8 before setting the flag.
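
For readers following along, here is a rough sketch of what a strict
"looks like UTF-8" check could be in plain C. It is not the check that
the driver actually uses; the function name is_strict_utf8 is invented
for the example. It rejects stray 8-bit bytes, truncated or overlong
sequences, encoded surrogates (so Oracle's "UTF8" form of a
supplementary character fails), and anything above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if s[0..len) is well-formed UTF-8,
 * 0 otherwise. */
int is_strict_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0, k, need;
    unsigned long cp, min;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) {               /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                 /* stray continuation or bad lead byte */
        }

        if (len - i - 1 < need) return 0;          /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                    /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* encoded surrogate */
        if (cp > 0x10FFFF) return 0;               /* beyond Unicode */
        i += need + 1;
    }
    return 1;
}
```

A check along these lines returns 0 for a single-byte 8-bit string such
as "caf\xE9", and also for the six-byte surrogate form ED A0 81 ED B0 80,
so in neither case would the flag be set on bytes that Perl cannot
interpret as UTF-8.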

Thanks for chiming in.

Lincoln


