perl-unicode

Re: bytes::substr() ?

2003-08-27 09:30:07
Hi,

ed-perluni(_at_)inkdroid(_dot_)org wrote:
I'm working with a byte oriented protocol, and need to extract byte n1 through
byte n2 from a string.

I read this as "*character* n1 through *character* n2", right?

Problem is, the string can be UTF8, and substr() is character oriented. What (if anything) is the best way to do this in Perl?

If the string *can* be UTF-8 but you can *not* be sure about that, then it is hopeless, because the byte stream does not contain any magic that identifies it as UTF-8. You have to know that beforehand.
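What a strict decode can do is rule UTF-8 out: malformed input makes it fail. Success proves nothing about the sender's intent, since plain ASCII and many other byte sequences are also well-formed UTF-8. A minimal sketch, in which the helper name looks_like_utf8 is my own invention and not part of Encode:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $octets is well-formed UTF-8, else 0.
# A true result only means "could be UTF-8", never "is UTF-8".
sub looks_like_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps decode from modifying $octets.
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}
```

So a lone "\xFC" (Latin-1 u-umlaut) is rejected, while "\xC3\xBC" passes whether or not the sender actually meant UTF-8.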

Any/all ideas welcome. I would prefer a pure Perl (non XS) solution, but if
that's the way to go then that's the way to go.

If you can be sure that the byte stream is a UTF-8 string, you can convert it to a representation with fixed character width (UTF-16 is only fixed-width as long as no characters outside the Basic Multilingual Plane occur, UCS-4 is always safe), extract the substring - now that you know the size of a single character in bytes - and convert the extracted substring back to UTF-8 or whatever you need. Encode.pm provides the necessary conversion routines.

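        # Extract the characters at 0-based offsets 5 to 7 from a
        # UTF-8 byte stream.
        use Encode qw(decode encode);

        # encode() expects characters, so decode the UTF-8 octets first.
        my $chars = decode 'UTF-8', $utf_8_stream;

        # Use UTF-32BE rather than 'UCS-4': as far as I know Encode maps
        # 'UCS-4' to UTF-32, which prepends a byte order mark and would
        # shift every offset by 4 octets.
        my $encoded = encode 'UTF-32BE', $chars;

        # One character is 4 bytes/octets in UTF-32.
        my $extracted_bytes = substr $encoded, 5 * 4, (8 - 5) * 4;
        my $extracted_chars = decode 'UTF-32BE', $extracted_bytes;

        # decode() returns a character string (UTF-8 flag on); encode it
        # once more to get UTF-8 octets. The snippet is still not tested
        # at all. :-(
        my $extracted_utf8 = encode 'UTF-8', $extracted_chars;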
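
Since decode already returns a character string, substr can also be applied to it directly, without the fixed-width detour. And if it really is octets n1 through n2 of the wire form that are wanted, substr on the encoded octets counts bytes. A sketch of both; the two helper names are hypothetical and offsets are 0-based with an exclusive end:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: characters $from .. $to-1 of a UTF-8 octet
# string, returned as UTF-8 octets again.
sub utf8_char_slice {
    my ($octets, $from, $to) = @_;
    my $chars = decode('UTF-8', $octets);            # octets -> characters
    my $slice = substr($chars, $from, $to - $from);  # counts characters
    return encode('UTF-8', $slice);                  # characters -> octets
}

# Hypothetical helper: octets $from .. $to-1 of the UTF-8 encoding of a
# character string. This may cut a multi-byte character in half.
sub utf8_byte_slice {
    my ($string, $from, $to) = @_;
    my $octets = encode('UTF-8', $string);           # characters -> octets
    return substr($octets, $from, $to - $from);      # counts octets
}
```

For the octets "H\xC3\xBCrth", utf8_char_slice with offsets 1 and 3 yields "\xC3\xBCr" (two characters, three octets), whereas utf8_byte_slice on the character string "H\x{FC}rth" with the same offsets yields just "\xC3\xBC".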

hth

Guido
--
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.de/
