Re: Hassles with Unicode::String

At 5:19 PM -0800 1/18/01, Gisle Aas wrote:

Paul Hoffman <phoffman(_at_)proper(_dot_)com> writes:
 >          { $OutString .= utf8(uchr(hex("0x$PartString"))); }
 > Why is uchr putting out UTF16 instead of UTF8 for the non-BMP character?

Unicode::String is simply UTF16 internally.  The ->length and ->substr
methods all operate directly on the UTF16 representation without
looking for surrogates.  This is actually the wrong thing to do.  If
these were fixed to know about surrogates then I think this example
would work as you expected.  The ->hex function should probably also
be made surrogate aware.

Fully agree. Given that Unicode 3.1 is about to come out with more than
40,000 characters outside of the BMP, doing this soon would be a Very
Good Thing.
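
For anyone following along, here is a rough sketch of what "knowing about surrogates" means in practice: a code point above U+FFFF is stored in UTF16 as two 16-bit units. The helper name to_surrogates is made up for illustration and is not part of Unicode::String; the arithmetic is the standard UTF16 one.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::String): split a code point
# into the 16-bit units that UTF16 uses to represent it.
sub to_surrogates {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;        # BMP: a single 16-bit unit
    my $v = $cp - 0x10000;                # remaining 20-bit value
    return (0xD800 + ($v >> 10),          # high (lead) surrogate
            0xDC00 + ($v & 0x3FF));       # low (trail) surrogate
}

my @units = to_surrogates(0x10400);
print join(" ", map { sprintf "%04X", $_ } @units), "\n";   # D801 DC00
```

A ->length that counts 16-bit units would report 2 for this one character, which is the behavior being discussed.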

 > Even if uchr is putting out UTF16, why isn't the utf8() call coercing
 > the value from UTF16 to UTF8?


utf8() is actually converting from UTF8 to UTF16.  uchr() is
converting a numeric value to UTF16.


Well, a UTF8 version of uchr would be good, even if it has a new name.
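
As a sketch of what I have in mind, something like the following would do. The name uchr_utf8 is hypothetical and nothing like it exists in Unicode::String today; it encodes the code point straight to UTF8 bytes without going through UTF16, and it does no checking for surrogate values or values above U+10FFFF.

```perl
use strict;
use warnings;

# Hypothetical uchr_utf8(): encode one code point directly as UTF8 bytes.
# Assumes a valid code point; no range or surrogate checks are done.
sub uchr_utf8 {
    my ($cp) = @_;
    my @b;
    if ($cp < 0x80) {                     # 1 byte
        @b = ($cp);
    } elsif ($cp < 0x800) {               # 2 bytes
        @b = (0xC0 | ($cp >> 6),
              0x80 | ($cp & 0x3F));
    } elsif ($cp < 0x10000) {             # 3 bytes
        @b = (0xE0 | ($cp >> 12),
              0x80 | (($cp >> 6) & 0x3F),
              0x80 | ($cp & 0x3F));
    } else {                              # 4 bytes, outside the BMP
        @b = (0xF0 | ($cp >> 18),
              0x80 | (($cp >> 12) & 0x3F),
              0x80 | (($cp >> 6) & 0x3F),
              0x80 | ($cp & 0x3F));
    }
    return pack("C*", @b);                # byte string
}

my $bytes = uchr_utf8(hex("10400"));
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $bytes)), "\n";
# F0 90 90 80
```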

 > How do I get this to put out UTF8, which is what I need?

The ->utf8 method should do that.


Sorry, I don't understand this. Do you mean change
        { $OutString .= utf8(uchr(hex("0x$PartString"))); }
to
        { $OutString .= uchr(hex("0x$PartString"))->utf8; }
If so, that doesn't change the output at all. It is still a surrogate.
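
What I would expect instead is for the surrogate pair to be folded back into a single code point before the UTF8 bytes are produced. A minimal sketch of that folding step, assuming the UTF16 units are available as a list of integers (from_surrogates is a made-up name, not something the module provides):

```perl
use strict;
use warnings;

# Hypothetical helper: merge surrogate pairs in a list of UTF16 units
# into code points. Unpaired surrogates are passed through unchanged.
sub from_surrogates {
    my @units = @_;
    my @cps;
    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF
            && @units
            && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF) {
            my $lo = shift @units;
            push @cps, 0x10000 + (($u - 0xD800) << 10) + ($lo - 0xDC00);
        } else {
            push @cps, $u;
        }
    }
    return @cps;
}

printf "U+%04X\n", $_ for from_surrogates(0xD801, 0xDC00);   # U+10400
```

Encoding that one code point as UTF8 gives four bytes, where encoding each surrogate separately gives two three-byte sequences.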

--Paul Hoffman