Re: Weird interaction of ord, split, and substr with UTF-8?


I'm not completely sure what you are trying to do with ord,
but when I want to look at the UTF-8 bytes I do something
along these lines:

    @utf8_chars = split('', $utf8_str);
    foreach $utf8_char (@utf8_chars) {
        @utf8_bytes = unpack('C*', $utf8_char);
        ...
    }
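
For comparison, the same split can be done by hand in C, since the
lead byte of each UTF-8 character says how many bytes follow. This is
only a sketch: the names utf8_seq_len, utf8_next and utf8_count_chars
are made up for illustration, and the code does not reject overlong
forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence. (Hypothetical helper.) */
size_t utf8_seq_len(uint8_t b)
{
    if (b <= 0x7F)          return 1;   /* 0xxxxxxx: ASCII          */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx                 */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx                 */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx                 */
    return 0;                           /* continuation or invalid  */
}

/* Length of the first character in s (n bytes available), or 0 if the
 * sequence is truncated or a continuation byte is malformed. */
size_t utf8_next(const uint8_t *s, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return 0;

    size_t len = utf8_seq_len(s[0]);
    if (len == 0 || len > n)
        return 0;

    /* Every byte after the lead byte must look like 10xxxxxx. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    }
    return len;
}

/* Number of characters in the first n bytes of s, or -1 on bad input. */
long utf8_count_chars(const uint8_t *s, size_t n)
{
    long count = 0;
    while (n > 0) {
        size_t len = utf8_next(s, n);
        if (len == 0)
            return -1;
        /* s[0] .. s[len-1] are the bytes of one character. */
        s += len;
        n -= len;
        count++;
    }
    return count;
}
```

Inside the loop of utf8_count_chars, the bytes s[0] through s[len-1]
play the role of @utf8_bytes in the Perl snippet.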

As I believe this thread has already discussed, substr in 5.6.0 does
not work correctly on UTF-8 strings.

If one wants to look at the 32-bit value, the code is quite
a bit more complex. Here is a C macro I use (when converting
UTF-8 strings to UTF-16 strings):

#   define UTF8_APPEND_UCHAR(s, c) { \
            if (((uint32_t)(c)) <= 0x7F) { \
                (*(s)++) = ((uint8_t)(c)); \
            } \
            else if (((uint32_t)(c)) <= 0x7FF) { \
                (*(s)++) = (0xC0 | (uint8_t)(((uint32_t)(c))>>6)); \
                (*(s)++) = (0x80 | (uint8_t)(((uint32_t)(c))&0x3F)); \
            } \
            else if (((uint32_t)(c)) <= 0xFFFF) { \
                (*(s)++) = (0xE0 | (uint8_t)(((uint32_t)(c))>>12)); \
                (*(s)++) = (0x80 | (uint8_t)((((uint32_t)(c))>>6)&0x3F)); \
                (*(s)++) = (0x80 | (uint8_t)(((uint32_t)(c))&0x3F)); \
            } \
            else if (((uint32_t)(c)) <= 0x10FFFF) { \
                (*(s)++) = (0xF0 | (uint8_t)(((uint32_t)(c))>>18)); \
                (*(s)++) = (0x80 | (uint8_t)((((uint32_t)(c))>>12)&0x3F)); \
                (*(s)++) = (0x80 | (uint8_t)((((uint32_t)(c))>>6)&0x3F)); \
                (*(s)++) = (0x80 | (uint8_t)(((uint32_t)(c))&0x3F)); \
            } \
            else { \
                (*(s)++) = 0xEF; \
                (*(s)++) = 0xBF; \
                (*(s)++) = 0xBF; \
            } \
    }
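
The macro goes from a 32-bit value to UTF-8 bytes. Going the other
way, from the bytes of one character to its 32-bit value, might look
roughly like the sketch below. The function name utf8_decode_one and
the choice to return 0 on bad input are my own assumptions for the
example, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, overlong, or above 0x10FFFF.
 * Surrogate values (0xD800..0xDFFF) are NOT rejected here. */
size_t utf8_decode_one(const uint8_t *s, size_t n, uint32_t *out)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t len;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    /* The lead byte gives the length and the top payload bits. */
    if (s[0] <= 0x7F)               { c = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else
        return 0;                   /* stray continuation or bad lead byte */

    if (len > n)
        return 0;                   /* sequence runs past the buffer */

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (c < min_for_len[len] || c > 0x10FFFF)
        return 0;                   /* overlong form or out of range */

    *out = c;
    return len;
}
```

For example, the three bytes E2 82 AC decode to 0x20AC, and feeding
0x20AC back through UTF8_APPEND_UCHAR produces the same three bytes.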

Paul Hoffman wrote:


At 6:20 AM +0100 10/31/00, Andreas J. Koenig wrote:

 >>>>> On Mon, 30 Oct 2000 19:02:25 -0800, Paul Hoffman <phoffman(_at_)proper(_dot_)com> said:

 > Has anyone else come across this? Is there a way to use ord in a
 > loop after a split that works?

The bug has been fixed in the development version a while after 5.7.0
came out. You find instructions on how to get at the patches in the
perlhack manpage.


Thanks. However, I can't find a patch at
<http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/> that seems
related to the bug. I searched for "utf-8 ord". Is there a patch
number you can give me?

Also, I'd like to distribute my code to others who probably won't
have a patched system. Thus, I'd love to find a way, even a kludgy
way, in 5.6.0 to split up a string into utf-8 characters that will
work with ord. If need be, I could even use Unicode::String, convert
to UCS-4, slice into four-octet chunks, then convert them back to
UTF-8, but I'd like something less ugly to show the public.