I'm not completely sure what you are trying to do with ord,
but when I want to look at the UTF-8 bytes I do something
along these lines:
    @utf8_chars = split('', $utf8_str);
    foreach $utf8_char (@utf8_chars) {
        @utf8_bytes = unpack('C*', $utf8_char);
        ...
    }
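For comparison, the same per-character byte inspection can be sketched in C, where the string is just bytes and the character boundaries come from each lead byte. This is only a rough sketch: the names utf8_seq_len and utf8_dump_chars are made up for illustration, and the input is assumed to be a NUL-terminated string.

```c
#include <stddef.h>
#include <stdio.h>

/* Number of bytes in the UTF-8 sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence (continuation byte, or a lead
 * byte that never occurs in well-formed UTF-8). */
size_t utf8_seq_len(unsigned char b)
{
    if (b <= 0x7F) return 1;              /* 0xxxxxxx */
    if (b >= 0xC2 && b <= 0xDF) return 2; /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF) return 3; /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4) return 4; /* 11110xxx */
    return 0;
}

/* Write the bytes of each character of str to out in hex, one
 * character per line.  Returns the number of characters, or -1 as
 * soon as a malformed or truncated sequence is found. */
long utf8_dump_chars(const char *str, FILE *out)
{
    const unsigned char *p = (const unsigned char *)str;
    long count = 0;

    while (*p != 0) {
        size_t n = utf8_seq_len(*p);
        size_t i;

        if (n == 0) return -1;
        /* Check the continuation bytes before printing.  The NUL
         * terminator fails this check, so we never read past it. */
        for (i = 1; i < n; i++)
            if ((p[i] & 0xC0) != 0x80) return -1;
        for (i = 0; i < n; i++)
            fprintf(out, "%s%02X", i ? " " : "", (unsigned)p[i]);
        fputc('\n', out);
        p += n;
        count++;
    }
    return count;
}
```

Calling utf8_dump_chars("a\xC3\xA9", stdout) writes "61" and then "C3 A9" on separate lines and returns 2.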
As I believe this thread has already discussed: substr in 5.6.0 does
not work correctly.
If one wants to look at the 32-bit value, the code is quite
a bit more complex. Here is a C macro I use (when converting
UTF8 strings to UTF16 strings):
# define UTF8_APPEND_UCHAR(s, c) { \
    if (((uint32_t)(c)) <= 0x7F) { \
        (*(s)++) = ((uint8_t)(c)); \
    } \
    else if (((uint32_t)(c)) <= 0x7FF) { \
        (*(s)++) = (0xC0 | (uint8_t)(((uint32_t)(c))>>6)); \
        (*(s)++) = (0x80 | (uint8_t)(((uint32_t)(c))&0x3F)); \
    } \
    else if (((uint32_t)(c)) <= 0xFFFF) { \
        (*(s)++) = (0xE0 | (uint8_t)(((uint32_t)(c))>>12)); \
        (*(s)++) = (0x80 | (uint8_t)((((uint32_t)(c))>>6)&0x3F)); \
        (*(s)++) = (0x80 | (uint8_t)(((uint32_t)(c))&0x3F)); \
    } \
    else if (((uint32_t)(c)) <= 0x10FFFF) { \
        (*(s)++) = (0xF0 | (uint8_t)(((uint32_t)(c))>>18)); \
        (*(s)++) = (0x80 | (uint8_t)((((uint32_t)(c))>>12)&0x3F)); \
        (*(s)++) = (0x80 | (uint8_t)((((uint32_t)(c))>>6)&0x3F)); \
        (*(s)++) = (0x80 | (uint8_t)(((uint32_t)(c))&0x3F)); \
    } \
    else { \
        (*(s)++) = 0xEF; \
        (*(s)++) = 0xBF; \
        (*(s)++) = 0xBF; \
    } \
}
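The macro above goes from a 32-bit value to UTF-8 bytes. Going the other way, which is what looking at the 32-bit value of a character requires, might be sketched as below. The function name utf8_decode and its interface are assumptions made for illustration, not part of any library. It is also stricter than the macro: it rejects overlong forms, surrogates and values above 0x10FFFF instead of substituting a replacement.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, with len bytes available.
 * On success, store the code point in *cp and return the number of
 * bytes consumed (1 to 4).  Return 0 if the input is empty, truncated,
 * malformed, overlong, a surrogate, or above 0x10FFFF. */
size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0) return 0;

    /* The lead byte gives the length and the top bits of the value. */
    if (s[0] <= 0x7F)               { n = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or bad lead */

    if (n > len) return 0;          /* truncated sequence */

    /* Each continuation byte is 10xxxxxx and adds six bits. */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (c < min_for_len[n]) return 0;             /* overlong form */
    if (c > 0x10FFFF) return 0;                   /* out of range */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;     /* surrogate */

    *cp = c;
    return n;
}
```

For example, the three bytes E2 82 AC decode to 0x20AC with a return value of 3. Walking a whole buffer is a matter of calling the function repeatedly and advancing by the returned count until it returns 0 or the buffer is used up.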
Paul Hoffman wrote:
At 6:20 AM +0100 10/31/00, Andreas J. Koenig wrote:
>>>>> On Mon, 30 Oct 2000 19:02:25 -0800, Paul Hoffman <phoffman(_at_)proper(_dot_)com> said:
> Has anyone else come across this? Is there a way to use ord in a
> loop after a split that works?
The bug has been fixed in the development version a while after 5.7.0
came out. You find instructions on how to get at the patches in the
perlhack manpage.
Thanks. However, I can't find a patch at
<http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/> that seems
related to the bug. I searched for "utf-8 ord". Is there a patch
number you can give me?
Also, I'd like to distribute my code to others who probably won't
have a patched system. Thus, I'd love to find a way, even a kludgy
way, in 5.6.0 to split up a string into utf-8 characters that will
work with ord. If need be, I could even use Unicode::String, convert
to a UCS-4, slice into four-octet chunks, then convert them back to a
UTF-8, but I'd like something less ugly to show the public.