Hi,
ed-perluni(_at_)inkdroid(_dot_)org wrote:
> I'm working with a byte oriented protocol, and need to extract byte n1 through
> byte n2 from a string.
I read this as "*character* n1 through *character* n2", right?
> Problem is, the string can be UTF8, and substr() is
> character oriented. What (if anything) is the best way to do this in Perl?
If the string *can* be UTF-8 but you can *not* be sure about that, then
it is hopeless, because the byte stream does not contain any magic that
identifies it as UTF-8. You have to know that beforehand.
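What you can do is check whether the bytes are at least well-formed UTF-8. A minimal sketch, assuming Encode.pm is available; the helper name looks_like_utf8 is made up for illustration. A positive answer proves nothing, though: plain ASCII passes too, and byte sequences produced in some other encoding can happen to be valid UTF-8 as well.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if $octets is well-formed UTF-8.
# This is a plausibility check only, it cannot prove that the
# sender really meant UTF-8.
sub looks_like_utf8 {
    my ($octets) = @_;    # a copy, so decode() may consume it freely
    # FB_CROAK makes decode() die on malformed input.
    return eval { decode('UTF-8', $octets, FB_CROAK); 1 } ? 1 : 0;
}

print looks_like_utf8("H\xC3\xBCrth") ? "ok\n" : "not UTF-8\n";  # well-formed
print looks_like_utf8("H\xFCrth")     ? "ok\n" : "not UTF-8\n";  # lone Latin-1 byte
```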
> Any/all ideas welcome. I would prefer a pure Perl (non XS) solution, but if
> that's the way to go then that's the way to go.
If you can be sure that the byte stream is a UTF-8 string, one way is to
convert it to a representation with fixed character width (UTF-16 is
only safe as long as no character outside the BMP occurs, because those
take a surrogate pair, i.e. 4 bytes instead of 2; UCS-4 is always safe),
extract the substring - now that you know the size of a single
character in bytes - and convert the extracted substring back to UTF-8
or whatever you need. Encode.pm provides the necessary conversion
routines.
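# Extract characters 5 to 7 (counting from 0, i.e. 8 is excluded)
# from a UTF-8 byte stream.
use Encode qw(decode encode);
# encode() expects characters, so decode the UTF-8 octets first.
my $chars = decode 'UTF-8', $utf_8_stream;
# UTF-32BE is UCS-4 with a fixed byte order and no BOM in front
# that would shift the offsets.
my $encoded = encode 'UTF-32BE', $chars;
# One character is 4 bytes/octets in UCS-4.
my $extracted_bytes = substr $encoded, 5 * 4, (8 - 5) * 4;
my $extracted = decode 'UTF-32BE', $extracted_bytes;
# The resulting $extracted is a Perl character string; run it
# through encode 'UTF-8' if you need octets again. The code
# snippet is not tested at all. :-(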
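For comparison, a shorter sketch that skips the fixed-width detour: once the octets are decoded, substr() already counts characters, and if it is really bytes n1 through n2 that are wanted, substr() on the undecoded octets counts bytes. The sample string and the variable names are just for illustration, and the sketch assumes the input is known to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample input: the octets of "Hürth-Köln" in UTF-8 (assumed to
# arrive as raw bytes, e.g. read from a socket).
my $utf_8_stream = "H\xC3\xBCrth-K\xC3\xB6ln";

# Character oriented: decode first, then substr() counts characters.
my $chars      = decode('UTF-8', $utf_8_stream);
my $char_slice = substr($chars, 1, 4);          # characters 1..4
my $char_bytes = encode('UTF-8', $char_slice);  # back to octets

# Byte oriented: substr() on the undecoded octets counts bytes.
# Beware that such a slice may cut a multi-byte character in half.
my $byte_slice = substr($utf_8_stream, 1, 4);   # bytes 1..4

printf "%d characters, %d bytes\n", length($chars), length($utf_8_stream);
```

Which of the two you want depends on whether the protocol's n1 and n2 count bytes or characters.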
hth
Guido
--
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.de/