perl-unicode

tr/// and use encoding

2002-10-03 05:30:04
On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi wrote:
Both.  I think the operation needed is straight-forward.  When you get
tr[LHS][RHS], decode'em then
feed it to the naked tr// .

Urk...  That means a dip into the toke.c, how the tr/// ranges are
implemented is... tricky.  sv_recode_to_utf8() is needed somewhere...
but I'm a little bit pressed for time right now.  I suggest you
perlbug this and move the process to perl5-porters.  (Inaba Hiroto
also might have insight on this; he's the tr///-with-Unicode sensei,
really-- he practically implemented all of it.  And he might read
*[gk]ana much better than me :-)

So now this thread is in perl5-porters. Since this "undocumented (lack of) feature" has a very easy workaround, I have yet to perlbug it.

=head1 PROBLEM

C<use encoding 'foo-encoding'> nicely converts string literals and regexes into UTF-8, so you can get the power of perl 5.8.0 even when your source code is in a text encoding other than UTF-8. But tr/// does not embrace this magic.

=head1 WORKAROUND

Suppose your script is in EUC-JP and your source contains this:

  $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
              -------- -------- -------- --------

And you want perl to do the following:

  $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

All you have to do is:

  use encoding 'euc-jp';
  # ....
  eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };
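
The same effect can be sketched without the pragma by doing the decoding by hand. This is only an illustration, assuming Encode is installed with EUC-JP support; C<tr_range> is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn an EUC-JP byte range such as "\xA4\xA1-\xA4\xF3"
# into the text '\x{3041}-\x{3093}', ready to be spliced into tr///.
sub tr_range {
    my ($encoding, $bytes) = @_;              # $bytes is a private copy
    my $chars = decode($encoding, $bytes);    # bytes -> characters
    return join '-',
        map { sprintf('\x{%04X}', ord($_)) } split /-/, $chars;
}

my $from = tr_range('euc-jp', "\xA4\xA1-\xA4\xF3");   # hiragana range
my $to   = tr_range('euc-jp', "\xA5\xA1-\xA5\xF3");   # katakana range

my $kana = "\x{3042}\x{3044}\x{3046}";   # HIRAGANA LETTER A, I, U

# tr/// builds its table at compile time, so the ranges have to be
# spliced in through a string eval, just as in the workaround above.
eval "\$kana =~ tr/$from/$to/; 1" or die $@;

printf "%04X\n", ord($_) for split //, $kana;
```

The string eval is the essential part: tr/// does not interpolate variables, so the decoded ranges can only reach it as source text.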

=over

=item chars in this example

  utf8     euc-jp   charnames::viacode()
  -----------------------------------------
  \x{3041} \xA4\xA1 HIRAGANA LETTER SMALL A
  \x{3093} \xA4\xF3 HIRAGANA LETTER N
  \x{30a1} \xA5\xA1 KATAKANA LETTER SMALL A
  \x{30f3} \xA5\xF3 KATAKANA LETTER N

=back
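
The table above can be cross-checked with a few lines of perl. This is a rough sketch, assuming Encode knows 'euc-jp'; the variable names are mine.

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(decode);

my @rows;
for my $bytes ("\xA4\xA1", "\xA4\xF3", "\xA5\xA1", "\xA5\xF3") {
    my $octets = $bytes;                            # work on a copy
    my $code   = ord(decode('euc-jp', $octets));    # code point of the char
    my $hex    = join '',
        map { sprintf('\x%02X', ord($_)) } split //, $bytes;
    # Same column order as the table: utf8, euc-jp, character name.
    push @rows, sprintf('\x{%04x} %s %s',
                        $code, $hex, charnames::viacode($code));
}
print "$_\n" for @rows;
```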

=head1 DISCUSSION

I found this when I was writing a CGI book and wanted form validation/correction. The example above converts all Hiragana to Katakana, which is a common task in Japan. Traditionally this kind of operation was done via jcode::tr() (require "jcode.pl";) or Jcode::tr() (use Jcode;). But as of perl 5.6.0 you can use Japanese directly in regex and tr/// -- so long as your script is in UTF-8.

With perl 5.8.0, the direct application of multibyte regex was made possible via the C<use encoding> pragma, which applies its magic as follows. Suppose you C<use encoding 'foo';>

=over

=item 0.

${^ENCODING}, a special, non-scoped variable, is set to C<Encode::find_encoding('foo')>. If 'foo' is an encoding supported by Encode, ${^ENCODING} is now a "transcoder" object.

=item 1.

All string literals in q//, qq//, qw// and qr// (not sure of qx//) are first fed to ${^ENCODING}->decode(). So from perl's point of view, they are the same as literals written in UTF-8.

=item 2.

C<binmode STDIN, ":encoding(foo)";> and C<binmode STDOUT, ":encoding(foo)";> are implicitly applied, so you can feed STDIN in encoding 'foo' and get STDOUT in encoding 'foo'.

=back
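
Spelled out by hand, the three steps look roughly like this. It is a sketch of the equivalent explicit calls, not of how the pragma is implemented, with 'euc-jp' standing in for 'foo' and the variable names chosen only for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Step 0: look up the transcoder object (what ${^ENCODING} would hold).
my $enc = Encode::find_encoding('euc-jp')
    or die "euc-jp is not supported by this Encode";

# Step 1: decode the raw bytes of a literal into characters.
my $bytes   = "\xA4\xA1";               # EUC-JP for HIRAGANA LETTER SMALL A
my $literal = $enc->decode($bytes);

# Step 2: make the standard handles read and write the same encoding.
binmode STDIN,  ':encoding(euc-jp)';
binmode STDOUT, ':encoding(euc-jp)';

printf STDERR "U+%04X\n", ord($literal);
```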

Very clever and powerful. But step 1 is not applied to tr///. qq{} is under the control of C<use encoding>, so eval qq{} works as expected.

Though the workaround is simple, easy and clever, it still leaves an inconsistency in how ${^ENCODING} gets used; it does indeed work on non-interpolated literals already.

=head1 REPORTED BY

Dan the Encode Maintainer E<lt>dankogai(_at_)dan(_dot_)co(_dot_)jpE<gt>