jhi and porters,
With Encode done, I am now focusing on the other code and documentation
that are related. Naturally there is a lot of it, but before just
sending patches, I would like to call your attention to one thing. Many
documents in the core state that the legacy encoding defaults to
ISO-8859-1. Though this is NOT WRONG, it is NOT CORRECT EITHER.
There are millions of perl scripts with non-Latin-1 literals. /(.)/
matches a single "character" not only in ISO-8859-1 but also in
virtually any single-byte encoding. The correct answer is: perl was
not encoding-conscious, except for single-byte locales, until Unicode
support was introduced.
To make my point clear, let me show you a table. "SBE" is short for
Single Byte Encodings, "OK" means it works as expected, and "NG" means
it does not.
             Literals   substr()   /(.)/   /\w/
----------------------------------------------------------
ASCII        OK         OK         OK      OK
Latin1       OK         OK         OK      OK (with locale)
Other SBE    OK         OK         OK      Somewhat OK
EUC          OK         NG         NG      NG
Shift_JIS    NG         NG         NG      NG
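The EUC and Shift_JIS rows can be reproduced with a small sketch. It is only an illustration, not part of any patch: it assumes a perl that ships with Encode, and the sample characters (U+65E5, U+672C, U+8868) are my own choice. The strings are built from code points so the script itself stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two Japanese characters, written as code points to keep this file ASCII.
my $text = "\x{65E5}\x{672C}";

# EUC-JP: each of these characters becomes two bytes, both >= 0x80.
my $euc = encode('euc-jp', $text);
my ($first_byte) = $euc =~ /(.)/s;    # perl sees bytes, not characters
printf "EUC-JP bytes:  length=%d, /(.)/ captured %d byte (\\x%02x)\n",
    length($euc), length($first_byte), ord($first_byte);

# Once decoded, the same operations work on characters.
my $decoded = decode('euc-jp', $euc);
my ($first_char) = $decoded =~ /(.)/s;
printf "decoded chars: length=%d, /(.)/ captured U+%04X\n",
    length($decoded), ord($first_char);

# Shift_JIS: the second byte of U+8868 is 0x5C, an ASCII backslash,
# which is why even literals are "NG" in that row.
my $sjis = encode('shiftjis', "\x{8868}");
printf "Shift_JIS bytes of U+8868: %s\n", unpack('H*', $sjis);
```

On the byte string, /(.)/ captures half of a character. On the decoded string it captures the whole character, which is the distinction the table is drawing.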
The popular (mis)?belief that you needed a special version of perl
such as Jperl until perl 5.6 is very wrong. jcode.pl by utashiro-san
has existed since perl4 (and it does encoding conversion via regex!
And it is still maintained). Speaking of the Japanese environment
alone, my Jcode took over in Perl 5 and it has been working, with
(crude) Unicode support via XS. Jperl was needed when and only when
you really wanted a CHARACTER in character-oriented functions and
regexes. If you only wanted a whole string (which is the case for most
CGIs), the same old perl was good enough before the arrival of perl
5.6.
With this understood, let's review the document once again.
I REPEAT: until perl 5.6, PERL KNEW NOTHING ABOUT ENCODING.
And here is just one of many documents that blindly assume legacy data
is in ISO 8859-1:
pod/perluniintro.pod

  When you combine legacy data and Unicode the legacy data needs
  to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
  applicable) is assumed.  You can override this assumption by
  using the C<encoding> pragma, for example

      use encoding 'latin2'; # ISO 8859-2

  in which case literals (string or regular expression) and chr/ord
  in your whole script are assumed to produce Unicode characters from
  ISO 8859-2 code points.  Note that the matching for the encoding
  names is forgiving: instead of C<latin2> you could have said
  C<Latin 2>, or C<iso8859-2>, and so forth.  With just

      use encoding;

  first the environment variable C<PERL_ENCODING> will be consulted,
  and if that doesn't exist, ISO 8859-1 (Latin 1) will be assumed.
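To show what the ISO 8859-1 assumption in that passage means in practice, here is a minimal sketch. It assumes a non-EBCDIC platform with Encode installed. The byte 0xB1 is an arbitrary example of mine, and the sketch calls Encode::decode directly instead of the C<encoding> pragma so that the upgrade step is explicit.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $legacy = "\xB1";    # one legacy byte; perl has no idea what encoding it is

# Implicit upgrade: joining it with a wide character keeps the value 0xB1,
# i.e. perl treats the byte as ISO 8859-1 (on non-EBCDIC platforms).
my $joined = $legacy . "\x{263A}";
printf "implicit:   U+%04X\n", ord(substr($joined, 0, 1));

# Explicit decoding: the caller states what the byte really means.
my $latin1 = decode('iso-8859-1', $legacy);
my $latin2 = decode('iso-8859-2', $legacy);
printf "iso-8859-1: U+%04X\n", ord($latin1);
printf "iso-8859-2: U+%04X\n", ord($latin2);
```

The same byte becomes U+00B1 under the implicit rule but U+0105 if the data was really ISO 8859-2, so "Latin 1 is assumed" is a statement about the upgrade rule, not about what the legacy data actually contains.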
Dan the Encode Maintainer
P.S. Here is a small script that checks the behavior of /\w/ in perl.
Use it as
    env LC_CTYPE=foo.bar /path/to/perl w-test.pl
If /\w/ behaves differently from LC_CTYPE=C, it prints it.
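#!/usr/local/bin/perl
# w-test.pl: list the bytes that /\w/ matches under the current LC_CTYPE
use strict;
use locale;         # without this, \w does not consult LC_CTYPE
use Getopt::Std;
my %Opt;
getopts("v", \%Opt);
# -v also checks the ASCII range; by default only 0x80..0xff is checked
my $first = $Opt{v} ? 0 : 0x80;
for my $i ($first .. 0xff) {
    my $c = chr($i);
    # pass $c as an argument so it is never parsed as part of the format
    printf qq{'%s' (\\x%02x) =~ /\\w/\n}, $c, $i if $c =~ /\w/;
}
__END__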