<ed-perluni(_at_)inkdroid(_dot_)org> writes:
On Wed, Aug 27, 2003 at 06:04:48PM +0200, Guido Flohr wrote:
Hi,
ed-perluni(_at_)inkdroid(_dot_)org wrote:
I'm working with a byte oriented protocol, and need to extract byte n1 through
byte n2 from a string.
No problem (honest;-)) (At least in perl5.8 ...)
A byte is a number between 0..255
We can represent that as a character with ordinal value < 256.
So your sequence of bytes maps exactly to a sequence of characters.
So you can take your bytes and put them in a string and then use
substr() etc. on them, just as you always could in traditional perl (and other
languages).
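
To make the point above concrete, here is a minimal sketch. The byte values in $message are made up for illustration; any sequence of ordinals in 0..255 would behave the same way.

```perl
use strict;
use warnings;

# Hypothetical 8-byte message: every character has an ordinal in 0..255.
my $message = pack("C*", 0x01, 0x02, 0xC3, 0xA9, 0xFF, 0x00, 0x10, 0x20);

# substr() counts characters, and here one character is exactly one byte.
my $slice = substr($message, 2, 3);

# Show the slice as hex so the byte values are visible.
printf "%d bytes: %s\n", length($slice),
    join(" ", map { sprintf "%02X", ord($_) } split //, $slice);
# prints "3 bytes: C3 A9 FF"
```

Nothing here has told perl that C3 A9 happens to be UTF-8 for something, so it just treats them as two characters with ordinals 195 and 169.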
Where the snags could creep in is if other parts of your application
are dealing with Characters in their Wider meaning. If that is the
case you must make sure they get "encoded" into a byte stream
before your protocol gets to see them. That is what the Encode module
and MIME::Base64 etc. are for.
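
As a rough illustration of that encoding step, assuming the wire format wants UTF-8 (the sample text is invented for the example):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Six characters; U+263A has an ordinal above 255, so this is not a byte string.
my $text = "caf\x{e9} \x{263A}";

# Encode to UTF-8: U+00E9 becomes C3 A9, U+263A becomes E2 98 BA.
my $bytes = encode("UTF-8", $text);

print length($text), " characters, ", length($bytes), " bytes\n";
# prints "6 characters, 9 bytes"
```

After the encode() call every character in $bytes is below 256, so byte offsets and substr() offsets agree again.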
I read this as "*character* n1 through *character* n2", right?
Alas, no -- I'm interested in byte n1 through byte n2. This is because the
protocol I am working with uses byte offsets. substr() works like a charm as
long as 1 char = 1 byte, but in utf8 it breaks down.
No it doesn't - so long as you don't tell perl there are UTF-8 encoded
characters in there then it will not notice.
IO still defaults to reading bytes. You can tell perl that those
bytes represent encoded characters (either as UTF-8 or in your
current locale's encoding) but you don't have to.
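
A small sketch of the difference, using an in-memory filehandle so it is self-contained. It assumes nothing (such as the -C switch or the open pragma) has changed the default layers.

```perl
use strict;
use warnings;

my $raw = "caf\xC3\xA9";   # 5 bytes; the last two are UTF-8 for U+00E9

# Default: no encoding layer, so perl hands back the bytes as they are.
open(my $as_bytes, '<', \$raw) or die "open: $!";
my $bytes = <$as_bytes>;
close $as_bytes;

# Opt in: an :encoding layer decodes the same bytes into characters.
open(my $as_chars, '<:encoding(UTF-8)', \$raw) or die "open: $!";
my $chars = <$as_chars>;
close $as_chars;

print length($bytes), " vs ", length($chars), "\n";
# prints "5 vs 4"
```

For a byte-offset protocol the first form is the one you want; substr() on $bytes then works in bytes.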
So given a string of utf8 data $x I want to be able to extract bytes 3 - 12
from it...not characters :(
So
use Encode qw(encode);
my $string = "Any \x{xxxx} etc.";       # xxxx stands for some wide code point
my $bytes = encode("UTF-8", $string);   # output is bounded 0..255
my $field = substr($bytes,3,9);         # 9 bytes starting at byte offset 3
(Now back in perl5.6 we had not got this thought through and there were
all kinds of weird "use bytes" and "no utf8" confusion in the descriptions.)
Note the above assumes that something is working with $string as characters.
Just messing with an all-bytes protocol is even simpler - perl does not need
to know that (some of) those bytes are UTF-8 for characters. That is,
you only need to get Encode involved when you need to mix bytes-for-protocol
with payload-is-characters.
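
To sketch that mixed case: the frame layout below (a 2-byte big-endian length followed by the payload) is purely hypothetical, not the original poster's protocol, and the payload text is invented.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical frame layout: 2-byte big-endian payload length, then payload.
my $payload = encode("UTF-8", "na\x{ef}ve \x{2713}");   # characters -> bytes
my $frame   = pack("n", length($payload)) . $payload;

# Receiving side: byte offsets only, no character semantics needed.
my $len  = unpack("n", substr($frame, 0, 2));
my $body = substr($frame, 2, $len);

# Decode only when the payload is wanted as text again.
my $text = decode("UTF-8", $body);
print "$len payload bytes, ", length($text), " characters\n";
# prints "10 payload bytes, 7 characters"
```

Encode is only touched at the two edges where characters enter or leave; everything in between is plain substr(), pack() and unpack() on bytes.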
//Ed