perl-unicode

Encode::Tcl Mystery Solved!

2002-01-28 22:08:59
Folks,

I think I have finally solved the mystery of why Encode::Tcl's decode() works while encode() does not. It was quite simple after all. First, please take a look at this code. Both table.euc and table.utf8 are guaranteed to be valid.

#!/path/to/perl5.7.2
use strict;
use Test;
use Encode;
use Encode::Tcl;

my $euc_file = "t/table.euc";   # Valid EUC-JP text file
my $euc_data;
open my $fh, $euc_file or die "$euc_file:$!";
read $fh, $euc_data, -s $euc_file;
close $fh;

my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";
read $fh, $utf8_data, -s $utf8_file;
close $fh;

BEGIN { plan tests => 2 ; }
ok(encode('euc-jp', $utf8_data), $euc_data);
ok(decode('euc-jp', $euc_data), $utf8_data);
__END__

  Will it work?  NO!  It will fail like this.

not ok 1
# Test 1 got: <UNDEF> (t/classic.pl at line 24)
#   Expected: '0x0020:
 ...

  You fed it pre-certified data and it still fails.  What's wrong?
The answer is: $utf8_data is not UTF-8 UNLESS YOU EXPLICITLY SAY SO SOMEWHERE!
Insert

        Encode::_utf8_on($utf8_data);

  before ok() and now it works.  You can also make it work by replacing

        open my $fh, $utf8_file or die "$utf8_file:$!";

  with

        open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";
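
  To see the flag at work outside the test script, here is a minimal sketch. It assumes a present-day Encode (decode('UTF-8', ...) and is_utf8() as they exist today), which may not behave exactly like the one shipped with perl5.7.2, and it uses a hard-coded three-octet sample instead of t/table.utf8. Decoding the octets is a third way to get a flagged string without touching _utf8_on().

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# UTF-8 octets for U+3042 (HIRAGANA LETTER A), as if just read from a
# file opened without any :utf8 layer.  Sample data, not from the post.
my $octets = "\xE3\x81\x82";

# Raw read: perl sees three octets and the UTF8 flag is off.
printf "raw:     length=%d flagged=%d\n",
    length($octets), is_utf8($octets) ? 1 : 0;

# After decoding: one character and the UTF8 flag is on.
my $chars = decode('UTF-8', $octets);
printf "decoded: length=%d flagged=%d\n",
    length($chars), is_utf8($chars) ? 1 : 0;

# The decoded string is what encode() should be fed.
my $euc = encode('euc-jp', $chars);
printf "euc-jp:  %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $euc;
```

  The raw string and the decoded string hold the same bytes internally; only the flag tells perl whether to treat them as octets or as characters.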

  The encoding engines themselves appear OK.
  I repeat. THE ENCODING ENGINES THEMSELVES APPEAR OK!
Am I dumb to have taken so long to find this out? Maybe. But code obfuscation, misleading error messages and erroneous documentation are definitely also to blame. If encode() demands an SV explicitly marked as UTF8, it should carp BEFORE it attempts to encode in the first place. I also found that croaking in (en|de)code is problematic in cases where you need to determine encodings dynamically. With this in mind, I made changes to encode() and decode() as follows:

sub encode
{
    my ($name,$string,$check) = @_;
    my $enc = find_encoding($name);
    unless (defined $enc){
           # Maybe we should set $Encode::$! or something instead....
           # or should we cast _utf8_on()?
        carp("Unknown encoding '$name'");
        return;
    }
    unless (is_utf8($string)){
        $check += 0; # numify when empty
        carp("\$string is not UTF-8: encode('$name', \$string, $check)");
        return;
    }
    my $octets = $enc->encode($string,$check);
    return undef if ($check && length($string));
    return $octets;
}

sub decode
{
    my ($name,$octets,$check) = @_;
    my $enc = find_encoding($name);
    unless(defined $enc){
        carp("Unknown encoding '$name'");
        return;
    }
    my $string = $enc->decode($octets,$check);
    $_[1] = $octets if $check;
    return $string;
}
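
The "determine encodings dynamically" case can be sketched roughly as below. This is not the patch above: guess_decode() is a hypothetical helper, and it assumes a present-day Encode where find_encoding() returns undef for an unknown name and Encode::FB_CROAK makes decode() die on malformed input. The point is only that a caller trying candidates in turn wants a false return rather than a fatal error.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return (name, string) for the first candidate
# encoding that decodes $octets cleanly, or an empty list if none does.
sub guess_decode {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        my $enc = find_encoding($name) or next;   # unknown name: skip it
        my $copy = $octets;    # decode with a check may modify its argument
        my $string = eval { $enc->decode($copy, Encode::FB_CROAK()) };
        return ($name, $string) if defined $string;
    }
    return;
}

# "\xA4\xA2" is malformed as UTF-8 but valid EUC-JP (sample data).
my ($name, $string) =
    guess_decode("\xA4\xA2", 'no-such-encoding', 'UTF-8', 'euc-jp');
print defined $name ? "decoded as $name\n" : "no candidate matched\n";
```

With encode() and decode() carping and returning instead of croaking, the eval wrapper would not be needed for the unknown-name case.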

There are other places that croak() where they should carp(), but I'll wait for the next bleadperl to commit these changes. As relieved as I feel now, I still feel uncomfortable with the API of Encode. The UTF8 flag must be set explicitly, yet the use of _utf8_on() is deprecated. I am looking for a more elegant way to handle this....

Dan the Man with too Many Charsets to Handle.