At 5:19 PM -0800 1/18/01, Gisle Aas wrote:
Paul Hoffman <phoffman(_at_)proper(_dot_)com> writes:
> { $OutString .= utf8(uchr(hex("0x$PartString"))); }
> Why is uchr putting out UTF16 instead of UTF8 for the non-BMP character?
Unicode::String is simply UTF16 internally. The ->length and ->substr
methods all operate directly on the UTF16 representation without
looking for surrogates. This is actually the wrong thing to do. If
these were fixed to know about surrogates then I think this example
would work as you expected. The ->hex function should probably also
be made surrogate aware.
Fully agree. Given that Unicode 3.1 is about to come out with >40,000
characters outside of the BMP, doing this soon would be a Very Good
Thing.
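
For illustration, here is a minimal sketch of what "surrogate aware" could mean for a length calculation over UTF16 code units. The helper names to_utf16_units and surrogate_aware_length are hypothetical and are not part of Unicode::String; the sketch works on plain lists of integers and only assumes the standard UTF16 surrogate ranges (0xD800-0xDBFF high, 0xDC00-0xDFFF low).

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of Unicode::String.

# Split a code point into UTF16 code units: one unit inside the BMP,
# a high/low surrogate pair outside it.
sub to_utf16_units {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;
    my $v = $cp - 0x10000;
    return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Count characters rather than code units: a high surrogate that is
# immediately followed by a low surrogate counts as one character.
sub surrogate_aware_length {
    my @units = @_;
    my $count = 0;
    for (my $i = 0; $i < @units; $i++) {
        $count++;
        if (   $units[$i] >= 0xD800 && $units[$i] <= 0xDBFF
            && $i + 1 < @units
            && $units[$i + 1] >= 0xDC00 && $units[$i + 1] <= 0xDFFF) {
            $i++;    # skip the low half of the pair
        }
    }
    return $count;
}

my @units = (to_utf16_units(0x41), to_utf16_units(0x10000));
printf "%d code units, %d characters\n",
    scalar(@units), surrogate_aware_length(@units);
# prints "3 code units, 2 characters"
```

A length that simply counts code units would report 3 here, which is the behaviour described above for ->length.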
> Even if uchr is putting out UTF16, why isn't the utf8() call coercing
> the value from UTF16 to UTF8?
utf8() is actually converting from UTF8 to UTF16. uchr() is
converting a numeric value to UTF16.
Well, a UTF8 version of uchr would be good, even if it has a new name.
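
As a rough sketch of what such a function might look like, the following encodes a single code point directly to UTF8 bytes without passing through UTF16. The name uchr_utf8 is hypothetical and does not exist in Unicode::String; the variable names $PartString and $OutString are borrowed from the example in this thread, and the sample value "10000" is an assumption chosen to exercise a non-BMP character.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of Unicode::String.
# Encodes one code point as a string of UTF8 bytes.
sub uchr_utf8 {
    my ($cp) = @_;
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    die "surrogate code point\n"    if $cp >= 0xD800 && $cp <= 0xDFFF;

    return pack("C", $cp) if $cp < 0x80;
    return pack("C2", 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F))
        if $cp < 0x800;
    return pack("C3", 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F))
        if $cp < 0x10000;
    return pack("C4", 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

my $PartString = "10000";    # assumed sample input: U+10000
my $OutString  = "";
$OutString .= uchr_utf8(hex($PartString));

# Show the resulting bytes in hex.
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $OutString)), "\n";
# prints "F0 90 80 80"
```

The four-byte result is the single UTF8 sequence for the character, rather than two three-byte sequences for the halves of a surrogate pair.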
> How do I get this to put out UTF8, which is what I need?
The ->utf8 method should do that.
Sorry, I don't understand this. Do you mean change
{ $OutString .= utf8(uchr(hex("0x$PartString"))); }
to
{ $OutString .= uchr(hex("0x$PartString"))->utf8; }
If so, that doesn't change the output at all. It is still a surrogate.
--Paul Hoffman