Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";
That is supposed to be:
open my $fh, "<:utf8", $utf8_file;
to tell perl that the data is UTF-8.
read $fh, $utf8_data, -s $utf8_file;
close $fh;
BEGIN { plan tests => 2 ; }
ok(encode('euc-jp', $utf8_data), $euc_data);
ok(decode('euc-jp', $euc_data), $utf8_data);
__END__
Will it work? NO! It will fail like this:
not ok 1
# Test 1 got: <UNDEF> (t/classic.pl at line 24)
# Expected: '0x0020:
...
You fed it pre-certified data and it still fails. What's wrong?
The answer is: $utf8_data is not UTF-8 UNLESS YOU EXPLICITLY SAY SO
SOMEWHERE!
Yes - things are sequences of iso-8859-1 until told otherwise.
Insert
Encode::_utf8_on($utf8_data);
before ok() and now it works. You can also make it work by replacing
open my $fh, $utf8_file or die "$utf8_file:$!";
with
open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";
which is the preferred way.
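A self-contained sketch of that fix (the temporary file name here is made up, standing in for t/table.utf8 which we don't have): writing and reading through the :utf8 layer leaves the scalar with the UTF8 flag already set.

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# Write some sample UTF-8 data to a temporary file
# (a stand-in for t/table.utf8).
my $utf8_file = "table.utf8.tmp";
open my $out, ">:utf8", $utf8_file or die "$utf8_file:$!";
print $out "\x{3042}\x{3044}\x{3046}";    # HIRAGANA A I U
close $out;

# Read it back through the :utf8 layer, so perl knows
# the incoming octets are UTF-8 and flags the scalar.
my $utf8_data;
open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";
read $fh, $utf8_data, -s $utf8_file;
close $fh;
unlink $utf8_file;

print is_utf8($utf8_data) ? "flagged\n" : "not flagged\n";
```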
The encoding engines themselves appear OK.
I repeat. THE ENCODING ENGINES THEMSELVES APPEAR OK!
I think we knew that ;-)
Am I dumb to take so long to find this out? Maybe. But code
obfuscation, misleading error messages and erroneous documentation are
definitely also to blame.
If encode() demands an SV explicitly marked as UTF8, it should carp
BEFORE it attempts to encode in the first place.
It doesn't. If it is not marked as UTF-8 it assumes it isn't. So
(Jarkko's locale stuff aside) it is a sequence of iso-8859-1 chars
for legacy compatibility. You then ask it to convert those bytes to
EUC-JP and lots of high-bit iso-8859-1's (which is what UTF8 encoded
stuff looks like) don't map so you get undefs.
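A small sketch of what that means in practice: raw UTF-8 octets without the flag are treated as iso-8859-1 characters, so the cure is to decode() them into a character string before asking for EUC-JP.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# The raw octets of U+3042 (HIRAGANA A) in UTF-8, with
# no UTF8 flag on the scalar:
my $octets = "\xE3\x81\x82";
print is_utf8($octets) ? "flagged\n" : "not flagged\n";

# Handed straight to encode(), those bytes would be taken as
# three iso-8859-1 characters. Decoding first gives encode()
# a genuine character string:
my $chars = decode('utf8', $octets);
my $euc   = encode('euc-jp', $chars);

print length($chars), " char(s)\n";
```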
Back to locale ... The idea of the locale stuff is to say "aha - the
user is in a Japanese locale, so in the absence of instructions to the
contrary I will assume that files are full of iso2022-jp encoded
stuff" (or whatever the right thing is).
So you will still need to explicitly tell it when you are breaking
that assumption.
I also found that croaking in (en|de)code is problematic in situations
where you need to determine encodings dynamically. With this in mind,
I made the following changes to encode() and decode():
sub encode
{
    my ($name, $string, $check) = @_;
    my $enc = find_encoding($name);
    unless (defined $enc){
        # Maybe we should set $Encode::$! or something instead....
        # or should we cast _utf8_on()?
        carp("Unknown encoding '$name'");
        return;
    }
    unless (is_utf8($string)){
        $check += 0; # numify when empty
        carp("\$string is not UTF-8: encode('$name', \$string, $check)");
I assume that ESC sequences are iso2022 - this is also "the wrong thing".
Eventually carp is going to write to the STDERR stream, and it may "know"
that STDERR is iso2022 and do the right thing.
        return;
    }
    my $octets = $enc->encode($string, $check);
    return undef if ($check && length($string));
    return $octets;
}
sub decode
{
    my ($name, $octets, $check) = @_;
    my $enc = find_encoding($name);
    unless (defined $enc){
        carp("Unknown encoding '$name'");
        return;
    }
    my $string = $enc->decode($octets, $check);
    $_[1] = $octets if $check;
    return $string;
}
There are other places where croak() should be carp(), but I'll wait
for the next bleadperl to commit these changes.
The idea of the croak is that you can catch it silently with
eval { $string = decode($trythis,... }
(or better yet call find_encoding yourself before getting that far).
The carp is going to leak out to the user and look messy.
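A sketch of that eval pattern when probing candidate encodings dynamically (the encoding names here are just illustrative): the croak from decode() for an unknown encoding is trapped instead of killing the program.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xA4\xA2";    # EUC-JP octets for HIRAGANA A
my $string;

# Try candidate encodings in turn; a croak (e.g. for an
# unknown encoding name) is caught silently by eval.
for my $trythis ('no-such-encoding', 'euc-jp') {
    $string = eval { decode($trythis, $octets) };
    next unless defined $string;
    print "decoded with $trythis\n";
    last;
}
```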
Much as I feel relieved now, I still feel uncomfortable with the API
of Encode. The UTF8 flag must be explicitly set, yet the use of
_utf8_on() is deprecated.
Yes, you are supposed to set it on the file handle. Setting it by hand
may be appropriate if the data comes in magically from somewhere else.
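Setting it on an already-open handle can be done with binmode(), which pushes the :utf8 layer after the fact; a minimal sketch (the temporary file name is made up):

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

my $tmp = "handle-layer.tmp";

# Push the :utf8 layer onto an already-open write handle.
open my $out, ">", $tmp or die "$tmp:$!";
binmode $out, ":utf8";
print $out "\x{3042}";      # HIRAGANA A
close $out;

# Same on the read side: the layer, not the SV, carries
# the "this is UTF-8" information.
open my $in, "<", $tmp or die "$tmp:$!";
binmode $in, ":utf8";
my $char = <$in>;
close $in;
unlink $tmp;

print is_utf8($char) ? "flagged by the layer\n" : "not flagged\n";
```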
I am looking for a more elegant way to handle this....
Dan the Man with too Many Charsets to Handle.
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/