Encode::Tcl Mystery Solved!

Folks,

I think I have finally solved the mystery of why Encode::Tcl's decode() works while encode() does not. It was quite simple after all. First, please take a look at this code. Both table.euc and table.utf8 are guaranteed to be valid.


#!/path/to/perl5.7.2
use strict;
use Test;
use Encode;
use Encode::Tcl;

my $euc_file = "t/table.euc";   # Valid EUC-JP text file
my $euc_data;
open my $fh, $euc_file or die "$euc_file:$!";
read $fh, $euc_data, -s $euc_file;
close $fh;

my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";
read $fh, $utf8_data, -s $utf8_file;
close $fh;

BEGIN { plan tests => 2 ; }
ok(encode('euc-jp', $utf8_data), $euc_data);
ok(decode('euc-jp', $euc_data), $utf8_data);
__END__

  Will it work?  NO!  It will fail like this.

not ok 1
# Test 1 got: <UNDEF> (t/classic.pl at line 24)
#   Expected: '0x0020:
 ...


  You fed it pre-certified data and it still fails.  What's wrong?

The answer is: $utf8_data is not UTF-8 UNLESS YOU EXPLICITLY SPECIFY IT SOMEWHERE!

Insert

        Encode::_utf8_on($utf8_data);

  before ok() and now it works.  You can also make it work by replacing

        open my $fh, $utf8_file or die "$utf8_file:$!";

  with

        open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";

  The encoding engines themselves appear OK.
  I repeat. THE ENCODING ENGINES THEMSELVES APPEAR OK!
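
To see the flag on its own, without the test files, here is a minimal sketch. It assumes a released Encode, where encode() may be more forgiving than the 5.7.2 version discussed here; the octets and variable names are made up for illustration. It only shows that the same bytes count as three octets or one character depending on the flag.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative data: U+3042 (HIRAGANA LETTER A) as three UTF-8 octets.
my $octets = "\xE3\x81\x82";

# Octets as read from a plain filehandle carry no UTF8 flag.
my $flag_raw = Encode::is_utf8($octets) ? 1 : 0;
my $len_raw  = length $octets;

# decode() returns a character string with the flag set.
my $chars        = Encode::decode('UTF-8', $octets);
my $flag_decoded = Encode::is_utf8($chars) ? 1 : 0;
my $len_decoded  = length $chars;

# _utf8_on() flips the flag in place without validating anything,
# so it is only safe on octets already known to be valid UTF-8.
my $forced = "\xE3\x81\x82";
Encode::_utf8_on($forced);
my $flag_forced = Encode::is_utf8($forced) ? 1 : 0;
my $len_forced  = length $forced;

printf "raw:     flag=%d length=%d\n", $flag_raw,     $len_raw;
printf "decoded: flag=%d length=%d\n", $flag_decoded, $len_decoded;
printf "forced:  flag=%d length=%d\n", $flag_forced,  $len_forced;
```

The raw string reports three octets with the flag off; the decoded and forced strings each report one character with the flag on.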

Am I dumb to have taken so long to find this out? Maybe. But code obfuscation, misleading error messages and erroneous documentation are definitely also to blame.

If encode() demands an SV explicitly marked as UTF8, it should carp BEFORE it attempts to encode in the first place.

I also found that croaking in (en|de)code is problematic on occasions when you need to determine encodings dynamically. With this in mind, I made changes to encode() and decode() as follows:


sub encode
{
    my ($name,$string,$check) = @_;
    my $enc = find_encoding($name);
    unless (defined $enc){
           # Maybe we should set $Encode::$! or something instead....
           # or should we cast _utf8_on()?
        carp("Unknown encoding '$name'");
        return;
    }
    unless (is_utf8($string)){
        $check += 0; # numify when empty
        carp("\$string is not UTF-8: encode('$name', \$string, $check)");
        return;
    }
    my $octets = $enc->encode($string,$check);
    return undef if ($check && length($string));
    return $octets;
}

sub decode
{
    my ($name,$octets,$check) = @_;
    my $enc = find_encoding($name);
    unless(defined $enc){
        carp("Unknown encoding '$name'");
        return;
    }
    my $string = $enc->decode($octets,$check);
    $_[1] = $octets if $check;
    return $string;
}
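
As an illustration of the dynamic case, here is a sketch of a caller that probes several candidate encodings. It does not use the patched functions above. It assumes a released Encode, where find_encoding() returns undef for an unknown name and decode() croaks under FB_CROAK, so the croak has to be trapped with eval. first_decodable() is a hypothetical helper, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return the name of the first candidate encoding
# that decodes $octets without error, or nothing if none of them does.
sub first_decodable {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        # An unknown name yields undef; skip it instead of dying.
        my $enc = find_encoding($name) or next;
        # decode() with a check value may modify its argument,
        # so work on a copy each time.
        my $copy = $octets;
        my $ok = eval { $enc->decode($copy, Encode::FB_CROAK()); 1 };
        return $enc->name if $ok;
    }
    return;
}

# Illustrative data: U+3042 encoded as EUC-JP.
my $euc = "\xA4\xA2";
my $hit = first_decodable($euc, 'no-such-encoding', 'ascii', 'euc-jp');
print defined $hit ? "decoded as $hit\n" : "no candidate matched\n";
```

Every probe needs its own eval here. With functions that carp and return undef, the loop body would reduce to a plain defined() test.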

There are other places that croak() where they should carp(), but I'll wait for the next bleadperl to commit these changes.

Much as I feel relieved now, I still feel uncomfortable about the API of Encode. The UTF8 flag must be explicitly set, yet the use of _utf8_on() is deprecated. I am looking for a more elegant way to handle this....


Dan the Man with too Many Charsets to Handle.