Explaining this behavior (was Re: good name for characters matching [^\0-\377]?)

On 10/18/07, Juerd Waalboer <juerd(_at_)convolution(_dot_)nl> wrote:

E R wrote 2007-10-18 16:21 (-0500):

...

To be honest, I'm not sure you know enough about Perl's string model to
be giving a presentation about Unicode in Perl. You just learnt very
important aspects, and from the things you write, I'd say you still have
some other important aspects to learn or accept. No offense meant.


To be honest, I'm not sure you know enough about the scope, purpose or
audience of my talk to comment on whether or not I'm qualified to give it.
But if it will make you feel any better, I'll be sure to tell the audience
that you don't think I'm ready to talk about Unicode in Perl.

The problem I need to understand now is the following:

  # running under perl 5.8.0
  $x = "\x{e4}";
  $y = $x . "\x{101}";
  chop $y;  # Encode::is_utf8($y) == 1

  print STDOUT $x; # emits 1 octet
  print STDOUT $y; # emits 1 octet

  # using mod_perl 1.28
  # note: binmode(STDOUT, ":utf8") has no effect
  $r->print($x); # emits 1 octet
  $r->print($y); # emits 2 octets

I get similar behavior when storing $y into an Oracle DB - a string of length 2
is stored. Storing $x, however, results in a length 1 string.

So it seems that, in light of this, one should always call Encode::encode
before handing a string to these modules, to ensure the data is represented
the way you want it.

Just wondering if this is the best way to handle these kinds of situations,
or is there a better way (in these cases or in general)?

Stated another way: if you use a module which converts a Perl string to an octet
sequence, and there is no provision for specifying an encoding, that should be a
red flag that you need to encode the string before you send it to the module.
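
As a rough sketch of what I mean by encoding first (to_octets is a
hypothetical helper name of my own, not something from Encode, mod_perl or
DBI, and UTF-8 is only an assumed target encoding), something along these
lines should give the same octets for both strings:

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# The same two strings as in the example above.
my $x = "\x{e4}";
my $y = $x . "\x{101}";
chop $y;    # same single character as $x, but is_utf8($y) is now true

# Hypothetical helper: turn a character string into octets in an
# explicitly chosen encoding before handing it to another module.
sub to_octets {
    my ($string, $encoding) = @_;
    return encode($encoding, $string);
}

# Assumption: the receiving side expects UTF-8.
my $x_octets = to_octets($x, 'UTF-8');
my $y_octets = to_octets($y, 'UTF-8');

# Both results are byte strings with identical contents, regardless of
# how the original string happened to be stored internally.
printf "x: %d octets, y: %d octets, same: %s\n",
    length $x_octets, length $y_octets,
    ($x_octets eq $y_octets ? 'yes' : 'no');
```

The encoded values ($x_octets, $y_octets) would then be what gets passed to
$r->print or to the database layer, instead of $x and $y themselves.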
