perl-unicode

Re: bytes::substr() ?

2003-08-27 10:30:08
<ed-perluni(_at_)inkdroid(_dot_)org> writes:
On Wed, Aug 27, 2003 at 06:04:48PM +0200, Guido Flohr wrote:
Hi,

ed-perluni(_at_)inkdroid(_dot_)org wrote:
I'm working with a byte oriented protocol, and need to extract byte n1 through
byte n2 from a string. 

No problem (honest;-)) (At least in perl5.8 ...)

A byte is a number in the range 0..255.
We can represent that as a character with ordinal value < 256.
So your sequence of bytes maps exactly to a sequence of characters.
So you can take your bytes and put them in a string and then use 
substr() etc. on them just as you always could in traditional perl (and other 
languages).
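
A minimal sketch of that, assuming perl 5.8 or later. The frame contents and
the offsets $n1 and $n2 are made up for illustration:

```perl
use strict;
use warnings;

# A hypothetical protocol frame built from raw byte values 0..255.
my $frame = pack("C*", 0x01, 0xFF, 0x00, 0xC3, 0xA9, 0x7F, 0x80, 0x10);

# Every character in $frame has an ordinal below 256, so
# character offsets and byte offsets are the same thing.
my $n1 = 2;                                     # first byte wanted (0-based)
my $n2 = 5;                                     # last byte wanted (inclusive)
my $field = substr($frame, $n1, $n2 - $n1 + 1);

# Show the extracted bytes in hex.
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $field)), "\n";
# prints: 00 C3 A9 7F
```

Note that 0xC3 0xA9 happens to be the UTF-8 form of an accented character,
but since nothing told perl so, they are just two bytes.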

Where the snags could creep in is if other parts of your application 
are dealing with Characters in their Wider meaning. If that is the 
case you must make sure they get "encoded" into a byte stream
before your protocol gets to see them. That is what the Encode module 
and MIME::Base64 etc. are for.


I read this as "*character* n1 through *character* n2", right?

Alas, no -- I'm interested in byte n1 through byte n2. This is because the
protocol I am working with uses byte offsets. substr() works like a charm as
long as 1 char = 1 byte, but in utf8 it breaks down.

No it doesn't - so long as you don't tell perl there are UTF-8 encoded 
characters in there then it will not notice.

IO still defaults to reading bytes. You can tell perl that those 
bytes represent encoded characters (either as UTF-8 or in your 
current locale's encoding) but you don't have to.
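
A small sketch of the difference, using an in-memory filehandle so it needs no
real file. The sample data is invented, and the variable names are only for
illustration:

```perl
use strict;
use warnings;

# Six bytes: "caf", the two-byte UTF-8 form of e-acute, and a newline.
my $raw = "caf\xC3\xA9\n";

# No layer given: perl hands back bytes and does not look inside them.
open(my $in, "<", \$raw) or die "open: $!";
my $as_bytes = <$in>;
close($in);

# Same data, but now we tell perl the bytes are UTF-8 encoded characters.
open($in, "<:encoding(UTF-8)", \$raw) or die "open: $!";
my $as_chars = <$in>;
close($in);

print length($as_bytes), " ", length($as_chars), "\n";
# prints: 6 5
```

With the first form substr() offsets are byte offsets; only the second form
makes them character offsets.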


So given a string of utf8 data $x I want to be able to extract bytes 3 - 12 from
it...not characters :(

So 

  use Encode qw(encode);

  my $string = "Any \x{xxxx} etc.";      # xxxx stands for any wide character
  my $bytes  = encode("UTF-8", $string); # output is bounded 0..255
  my $field  = substr($bytes,3,9);       # 9 bytes from offset 3 (bytes 3..11)
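
To make that concrete, here is the same thing with an actual character filled
in. U+263A is only an example choice (it takes three bytes in UTF-8), and the
names $chars_field and $bytes_field are just for this sketch:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "Any \x{263A} etc.";          # 10 characters
my $bytes  = encode("UTF-8", $string);     # 12 bytes, each 0..255

my $chars_field = substr($string, 3, 9);   # counts characters
my $bytes_field = substr($bytes,  3, 9);   # counts bytes

print length($string), " characters, ", length($bytes), " bytes\n";
# prints: 10 characters, 12 bytes
```

Here $bytes_field holds nine bytes (a space, the three bytes E2 98 BA, then
" etc."), whereas $chars_field only has seven characters left to take.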

(Now back in perl5.6 we had not got this thought through and there were 
all kinds of weird "use bytes" and "no utf8" confusion in the descriptions.)

Note the above assumes that something is working with $string as characters.
Just messing with an all-bytes protocol is even simpler - perl does not need
to know that (some of) those bytes are UTF-8 encoded characters. That is,
you only need to get Encode involved when you need to mix bytes-for-protocol
with payload-is-characters.
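
A rough sketch of that mixed case. The wire format here (a 2-byte big-endian
length followed by the payload) is invented purely for illustration and is not
the protocol from the original question:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sending side: the payload starts life as characters.
my $text    = "caf\x{E9} \x{263A}";                  # 6 characters
my $payload = encode("UTF-8", $text);                # characters -> 9 bytes
my $packet  = pack("n", length($payload)) . $payload;

# Receiving side: work in byte offsets only, decode the payload last.
my $len  = unpack("n", substr($packet, 0, 2));
my $back = decode("UTF-8", substr($packet, 2, $len));

print "payload is $len bytes, ", length($back), " characters\n";
# prints: payload is 9 bytes, 6 characters
```

Everything between encode() and decode() is plain bytes, so the length field
and the substr() offsets all agree with what goes over the wire.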

//Ed
