perl-unicode

Re: questions about encode/decode

2007-10-15 15:01:26
Thanks for the detailed response - it was very helpful!

As a follow-up, does anyone have any suggestions about optimizing a
routine such as this:

sub escapeHTML {
  my $x = shift;

  $x =~ s/&/&amp;/g;
  $x =~ s/</&lt;/g;
  ...
  Encode::encode("iso-8859-1", $x);
}

Basically I'm concerned about the overhead of constantly looking up the
encoder sub for every fragment of HTML I need to escape.
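
For reference, a minimal sketch of one way to avoid the repeated lookup,
assuming the cost really is in resolving the encoding name:
Encode::find_encoding returns an encoding object that can be kept around
and reused. The names $latin1 and escape_html_cached are made up for
illustration, and whether this is measurably faster would need
benchmarking.

```perl
use strict;
use warnings;
use Encode ();

# Resolve the encoding name once, instead of on every call.
# ($latin1 is a hypothetical variable name.)
my $latin1 = Encode::find_encoding("iso-8859-1")
    or die "iso-8859-1 encoding not available";

# Hypothetical variant of the routine above that reuses the object.
sub escape_html_cached {
    my $x = shift;

    $x =~ s/&/&amp;/g;
    $x =~ s/</&lt;/g;
    # ... remaining substitutions as in the original routine ...

    # Encode the Unicode string to ISO-8859-1 bytes via the cached object.
    return $latin1->encode($x);
}
```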

Thanks...


On 10/15/07, Juerd Waalboer <juerd(_at_)convolution(_dot_)nl> wrote:
E R wrote 2007-10-15 16:25 (-0500):
1. What is the result of Encode::encode("iso-8559-1", $x) if $x is not
a utf8 string (i.e. Encode::is_utf8($x) returns false)?

The term "utf8 string" is itself ambiguous. It can mean either of the
following:

1. byte string with UTF8 encoded text
2. Perl Unicode string that at this point in time is encoded as UTF8
   *internally*

Encode::is_utf8 indicates that the latter is true. You should NOT have
to peek at the status of this internal flag, except for debugging perl
itself.

Encode::encode expects a Unicode string, which can be encoded as
ISO-8859-1 or UTF8 internally. If the Unicode string is ISO-8859-1
internally, is_utf8 returns false, and if it is UTF8 internally, it
returns true.

This flag is how Encode::encode knows (again, *internally*) how to
convert the string.

Assuming you meant 8859, not 8559, the answer to your question is: a
copy of $x is returned, because the encoding you used happens to equal
the encoding that Perl used internally.
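
To illustrate the point about the internal flag, here is a minimal
sketch; the sample string and the variable names are made up for
illustration. It builds the same Unicode string in both internal
representations and encodes each one to ISO-8859-1.

```perl
use strict;
use warnings;
use Encode ();

# A Unicode string containing U+00E9; written this way, Perl stores it
# internally as ISO-8859-1, so the internal flag is off.
my $s = "caf\x{e9}";

# The same string, forced into the internal UTF8 representation.
my $t = $s;
utf8::upgrade($t);

# Record the internal flags before encoding (for debugging only).
my $flag_s = Encode::is_utf8($s) ? 1 : 0;   # 0
my $flag_t = Encode::is_utf8($t) ? 1 : 0;   # 1

# As far as Perl code is concerned, the two strings are equal.
my $same = ($s eq $t) ? 1 : 0;              # 1

# Both encode to the same four ISO-8859-1 bytes.
my $bytes_s = Encode::encode("iso-8859-1", $s);
my $bytes_t = Encode::encode("iso-8859-1", $t);
```

The flag differs between $s and $t, but the encoded result does not,
which is why normal code should not need to look at it.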

2. What is the result of $string = decode("iso-8859-1", $octets) if
$octets is a utf8 string?

Do not use Encode::decode on Unicode strings; use it on byte strings
only. When decoding as ISO-8859-1, every individual byte of the byte
string is seen as a single ISO-8859-1 character, so a multi-byte UTF8
sequence will *not* be interpreted as a single character.
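
A minimal sketch of that effect, with a made-up two-byte input: the
UTF8 encoding of U+00E9 decoded once as ISO-8859-1 and once as UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Byte string holding the UTF8 encoding of U+00E9 (two bytes).
my $octets = "\xC3\xA9";

# Decoded as ISO-8859-1, each byte becomes its own character.
my $as_latin1 = Encode::decode("iso-8859-1", $octets);

# Decoded as UTF-8, the two bytes form one character.
my $as_utf8 = Encode::decode("UTF-8", $octets);

printf "iso-8859-1: %d characters\n", length($as_latin1);  # 2
printf "UTF-8:      %d characters\n", length($as_utf8);    # 1
```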

Perhaps helpful: http://tnx.nl/perlunitut,perlunifaq
--
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####(_at_)juerd(_dot_)nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales(_at_)convolution(_dot_)nl>

