perl-unicode

Re: Explaining this behavior (was Re: good name for characters matching [^\0-\377]?)

2007-10-19 15:27:26
E R wrote 2007-10-19 17:14 (-0500):
The problem I need to understand now is the following:
  # using mod_perl 1.28
  # note: binmode(STDOUT, ":utf8") has no effect
  $r->print($x); # emits 1 octet
  $r->print($y); # emits 2 octets
I get similar behavior when storing $y into an Oracle DB: a string of length 2
is stored. Storing $x, however, results in a length 1 string.

These don't use a filehandle, so :utf8 or :encoding layers don't work.
That leaves two options: either use the encoding functionality provided
by the module (if any), or encode manually.
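
A minimal sketch of why the octet counts can differ. This message does not
show how $x and $y were built, so the values below are an assumption: the same
single character, stored once in Perl's internal Latin-1 form and once in its
upgraded UTF-8 form. A sink that looks at the internal buffer sees 1 or 2
octets, while an explicit encode gives the same octets for both.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: both variables hold the single character U+00E9 and differ
# only in how Perl stores them internally.
my $x = "\xE9";
my $y = "\xE9";
utf8::upgrade($y);    # same string value, internal UTF-8 representation

# As Perl strings they are identical: equal, and one character long.
my $same  = ($x eq $y) ? 1 : 0;
my $len_x = length $x;
my $len_y = length $y;

# What an encoding-unaware sink may see: the size of the internal buffer.
my $raw_x = do { use bytes; length $x };
my $raw_y = do { use bytes; length $y };

# Encoding explicitly removes the dependence on the internal form.
my $oct_x = Encode::encode('UTF-8', $x);
my $oct_y = Encode::encode('UTF-8', $y);

printf "equal=%d chars=%d/%d internal=%d/%d encoded=%d/%d\n",
    $same, $len_x, $len_y, $raw_x, $raw_y, length $oct_x, length $oct_y;
```

Which target encoding to pass to encode (UTF-8 here) depends on what the
receiving side expects; that choice is not stated in the thread.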

AFAIK, mod_perl does not provide transparent encoding for output.
DBD::Oracle does, but you need to enable it. (Don't ask me how; I bailed
out when I saw the complexity of Oracle's charset/encoding support.)

When doing the encoding manually, I strongly suggest that you subclass
the module in question, so that the encoding logic does not end up spread
all over the place. (And please release your subclass to CPAN :))
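
The subclassing idea could look roughly like this. Both class names are
hypothetical: My::OctetSink stands in for a module that writes octets and knows
nothing about encodings, and My::EncodingSink is the subclass that keeps the
encoding step in one place. The fixed UTF-8 target is also an assumption. A
real mod_perl or DBI subclass would need that module's own subclassing rules.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an encoding-unaware module: it just collects
# whatever octets it is given.
package My::OctetSink;
sub new    { my ($class) = @_; return bless { buffer => '' }, $class }
sub print  { my ($self, @chunks) = @_; $self->{buffer} .= join '', @chunks; return 1 }
sub buffer { return $_[0]{buffer} }

# Hypothetical subclass: the only place that knows the output encoding.
package My::EncodingSink;
use Encode ();
our @ISA = ('My::OctetSink');

sub print {
    my ($self, @chunks) = @_;
    # Encode a copy of each chunk (assumed target: UTF-8) before handing
    # it to the parent class.
    my @octets = map { my $text = $_; Encode::encode('UTF-8', $text) } @chunks;
    return $self->SUPER::print(@octets);
}

package main;

my $out = My::EncodingSink->new;
$out->print("na\x{EF}ve ", "\x{263A}");    # callers pass text strings only
my $octets = $out->buffer;
printf "%d octets written\n", length $octets;
```

Callers then deal only in text strings, and changing the output encoding means
touching a single method.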

So it seems that in light of this one should always use Encode::encode with
these modules to ensure the data is represented the way you want it.

Encode::encode, Encode::encode_utf8, or utf8::encode.
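
A short sketch comparing the three calls named above on a sample string (the
string itself is only an illustration). Note the difference in calling style:
the two Encode functions return the octets, while utf8::encode changes its
argument in place.

```perl
use strict;
use warnings;
use Encode ();

my $text = "caf\x{E9} \x{263A}";    # 6 characters

# Returns a new octet string; the encoding is named explicitly.
my $via_encode = Encode::encode('UTF-8', $text);

# Returns a new octet string; always UTF-8.
my $via_encode_utf8 = Encode::encode_utf8($text);

# Works in place, so operate on a copy to keep the original text.
my $in_place = $text;
utf8::encode($in_place);

printf "%d characters, %d octets\n", length $text, length $via_encode;
```

For a string like this one all three produce the same octets.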

Stated another way: if you use a module which converts a Perl string to an
octet sequence, and there is no provision for specifying an encoding, that
should be a red flag that you need to encode the string before you send it to
the module.

Well stated. I have collected a summary at http://juerd.nl/perluniadvice
that is neither complete nor accurate, but it provides more information
than most documentation does. Unfortunately I lack tuits to send bug
reports and make patches.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####(_at_)juerd(_dot_)nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales(_at_)convolution(_dot_)nl>