perl-unicode

Re: AL32UTF8

2004-05-02 00:30:08

So the key question is... can we just do SvUTF8_on(sv) on either
of these kinds of Oracle UTF8 encodings? Seems like the answer is
yes, from what Jarkko says, because they are both valid UTF8.
We just need to document the issue.


No, Oracle's "UTF8" is very much not valid UTF-8. Valid UTF-8 cannot
contain surrogates. If you mark a string like this as UTF-8 neither
the Perl core nor other extension modules will be able to interpret
it correctly.
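
To make the surrogate point concrete: a surrogate code point (U+D800..U+DFFF) pushed through the UTF-8 bit layout always comes out as the byte 0xED followed by a byte in 0xA0..0xBF and one more continuation byte. Well-formed UTF-8 never contains that pattern, so a driver could check for it before deciding whether SvUTF8_on(sv) is safe. A minimal sketch in C; has_encoded_surrogate is a made-up name for illustration, not anything in Perl or DBD::Oracle:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Perl or Oracle API): return 1 if buf
 * contains the UTF-8-style encoding of a surrogate code point
 * (U+D800..U+DFFF), i.e. the bytes ED A0..BF 80..BF, else 0. */
int has_encoded_surrogate(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i + 2 < len; i++) {
        if (buf[i] == 0xED
            && (buf[i + 1] & 0xE0) == 0xA0   /* second byte in A0..BF */
            && (buf[i + 2] & 0xC0) == 0x80)  /* third byte is a continuation */
            return 1;
    }
    return 0;
}
```

A buffer for which this returns 0 may still be malformed in other ways; the check only covers the surrogate case discussed here.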

Well, it depends what you mean by "interpret correctly"... they will
be perfectly fine _separate_ characters.  But yes, they are pretty
useless -- the UTF-8 machinery of Perl 5 gets rather upset at seeing
these surrogate code points.  No croaks, yes, as I said earlier, but
a lot of -w-noise, and also deeper gurglings from e.g. the regex engine.

(As people have pointed out earlier in the thread, if you want a
standard name for this weird form of encoding, it is "CESU-8":
http://www.unicode.org/reports/tr26/.)

You'll need to do a conversion pass before you can mark it as UTF-8.
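
One way such a conversion pass might look, as a rough sketch in C rather than a tested implementation: each supplementary character arrives as two 3-byte surrogate encodings (6 bytes), which get decoded, combined into a single code point, and written back as one 4-byte UTF-8 sequence. The function name cesu8_to_utf8 and its calling convention are assumptions for illustration only. Unpaired surrogates are copied through unchanged here, so a real driver would still need to decide how to treat them before calling SvUTF8_on(sv).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy len bytes of CESU-8 from src to dst,
 * rewriting every encoded surrogate pair as a single 4-byte UTF-8
 * sequence.  The output never grows, so dst needs room for len bytes.
 * Returns the number of bytes written to dst. */
size_t cesu8_to_utf8(const unsigned char *src, size_t len,
                     unsigned char *dst)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        if (len - i >= 6
            && src[i] == 0xED     && (src[i + 1] & 0xF0) == 0xA0  /* high: D800..DBFF */
            && (src[i + 2] & 0xC0) == 0x80
            && src[i + 3] == 0xED && (src[i + 4] & 0xF0) == 0xB0  /* low: DC00..DFFF */
            && (src[i + 5] & 0xC0) == 0x80) {
            /* Decode both 3-byte sequences back to UTF-16 code units. */
            unsigned long hi = 0xD000UL
                | ((unsigned long)(src[i + 1] & 0x3F) << 6)
                | (unsigned long)(src[i + 2] & 0x3F);
            unsigned long lo = 0xD000UL
                | ((unsigned long)(src[i + 4] & 0x3F) << 6)
                | (unsigned long)(src[i + 5] & 0x3F);
            /* Combine the pair into one supplementary code point. */
            unsigned long cp = 0x10000UL
                + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);

            /* Re-encode as a 4-byte UTF-8 sequence. */
            dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            i += 6;
        } else {
            dst[o++] = src[i++];  /* everything else passes through */
        }
    }
    return o;
}
```

For example, the CESU-8 bytes ED A0 BD ED B8 80 (the surrogate pair D83D DE00) become F0 9F 98 80, the UTF-8 encoding of U+1F600.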

I think an Encode translation table would be the best place to do this
kind of mapping.  Encode::CESU, anyone?

-- 
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'."
-- Jack Cohen
