Folks,
I think I have finally solved the mystery of why Encode::Tcl's decode()
works while encode() does not. It was quite simple after all.
First please take a look at this code. Both table.euc and table.utf8
are guaranteed to be valid.
#!/path/to/perl5.7.2
use strict;
use Test;
use Encode;
use Encode::Tcl;
my $euc_file = "t/table.euc"; # Valid EUC-JP text file
my $euc_data;
open my $fh, $euc_file or die "$euc_file:$!";
read $fh, $euc_data, -s $euc_file;
close $fh;
my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";
read $fh, $utf8_data, -s $utf8_file;
close $fh;
BEGIN { plan tests => 2 ; }
ok(encode('euc-jp', $utf8_data), $euc_data);
ok(decode('euc-jp', $euc_data), $utf8_data);
__END__
Will it work? NO! It fails like this:
not ok 1
# Test 1 got: <UNDEF> (t/classic.pl at line 24)
# Expected: '0x0020:
...
You fed it pre-certified data and it still fails. What's wrong?
The answer is: $utf8_data is not UTF-8 UNLESS YOU EXPLICITLY SAY SO
SOMEWHERE!
Insert
Encode::_utf8_on($utf8_data);
before ok() and now it works. You can also make it work by replacing
open my $fh, $utf8_file or die "$utf8_file:$!";
with
open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";
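To make the point concrete, here is a minimal sketch of how the flag behaves. It assumes a current perl with the released Encode module rather than the 5.7.2 snapshot discussed here, and it uses a three-byte literal and an in-memory filehandle in place of t/table.utf8. The variable names are mine, for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# UTF-8 bytes for U+3042 (HIRAGANA LETTER A); stands in for t/table.utf8.
my $bytes = "\xe3\x81\x82";

# Raw octets as read from a file without any layer.
my $raw_len  = length $bytes;
my $raw_flag = is_utf8($bytes) ? 1 : 0;

# decode() returns a character string that perl treats as text.
my $chars      = decode('UTF-8', $bytes);
my $chars_len  = length $chars;
my $chars_flag = is_utf8($chars) ? 1 : 0;

# Reading the same octets through the :utf8 layer (in-memory handle here).
open my $fh, '<:utf8', \$bytes or die "in-memory open: $!";
my $layered = <$fh>;
close $fh;
my $layered_flag = is_utf8($layered) ? 1 : 0;
my $same         = $layered eq $chars ? 1 : 0;

print "raw:     length=$raw_len flag=$raw_flag\n";
print "decoded: length=$chars_len flag=$chars_flag\n";
print "layered: flag=$layered_flag same_as_decoded=$same\n";
```

The raw scalar is just octets as far as perl is concerned, however valid the file is. Only decode() or the layer turns it into something the encoder recognizes as text.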
The encoding engines themselves appear OK.
I repeat. THE ENCODING ENGINES THEMSELVES APPEAR OK!
Am I dumb to have taken so long to find this out? Maybe. But code
obfuscation, misleading error messages and erroneous documentation are
definitely also to blame.
If encode() demands an SV explicitly marked as UTF8, it should carp
BEFORE it attempts to encode in the first place.
I also found that croaking in (en|de)code is problematic on occasions
when you need to determine encodings dynamically. With this in mind, I
made changes to encode() and decode() as follows:
sub encode
{
    my ($name,$string,$check) = @_;
    my $enc = find_encoding($name);
    unless (defined $enc){
        # Maybe we should set $Encode::$! or something instead....
        # or should we cast _utf8_on()?
        carp("Unknown encoding '$name'");
        return;
    }
    unless (is_utf8($string)){
        $check += 0; # numify when empty
        carp("\$string is not UTF-8: encode('$name', \$string, $check)");
        return;
    }
    my $octets = $enc->encode($string,$check);
    return undef if ($check && length($string));
    return $octets;
}
sub decode
{
    my ($name,$octets,$check) = @_;
    my $enc = find_encoding($name);
    unless(defined $enc){
        carp("Unknown encoding '$name'");
        return;
    }
    my $string = $enc->decode($octets,$check);
    $_[1] = $octets if $check;
    return $string;
}
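For the dynamic case, the calling side might look something like the sketch below. It is not the patch above: try_decode() and guess_decode() are hypothetical helpers of mine, written against the released Encode API (find_encoding, FB_CROAK) on a current perl. They return undef rather than dying, so a caller can simply move on to the next candidate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode $octets as $name, returning undef instead of
# dying when the encoding is unknown or the data does not fit it.
sub try_decode {
    my ($name, $octets) = @_;
    my $enc = find_encoding($name);
    return undef unless defined $enc;
    my $copy = $octets;    # decode() with a CHECK value may modify its argument
    my $string = eval { $enc->decode($copy, Encode::FB_CROAK()) };
    return $string;        # undef if decode() croaked
}

# Hypothetical helper: return the first candidate that decodes cleanly,
# as a (name, string) pair, or an empty list if none does.
sub guess_decode {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        my $string = try_decode($name, $octets);
        return ($name, $string) if defined $string;
    }
    return;
}

# "\xa4\xa2" is U+3042 in EUC-JP; it is neither valid ASCII nor valid UTF-8.
my ($name, $string) =
    guess_decode("\xa4\xa2", qw(no-such-encoding ascii UTF-8 euc-jp));
print defined $name ? "decoded as $name\n" : "no candidate matched\n";
```

Wrapping the call in eval works, but every caller has to remember to do it, which is the argument for having encode() and decode() carp and return instead.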
There are other places that croak() where they should carp(), but I'll wait
for the next bleadperl before committing these changes.
Much as I feel relieved now, I still feel uncomfortable with the API
of Encode. The UTF8 flag must be explicitly set, yet the use of _utf8_on()
is deprecated. I am looking for a more elegant way to handle this....
Dan the Man with too Many Charsets to Handle.