[Encode] In what character encoding are legacy scripts written?

jhi and porters,

With Encode done, I am now focusing on the other code and documentation that are related to it. Naturally there is a lot of it, but before just sending patches I would like to call for your attention. Many documents in the core state that the legacy encoding defaults to ISO-8859-1. Though this is NOT WRONG, it is NOT CORRECT EITHER. There are millions of perl scripts with non-Latin-1 literals. /(.)/ matches a single "character" not only in ISO-8859-1 but also in virtually any single-byte encoding. The correct answer is: perl was not encoding-conscious, except for single-byte locales, until Unicode support was introduced.

To make my point clear, let me show you a table. "SBE" is short for Single-Byte Encodings, "OK" means works as expected, and "NG" means it does not.


                        Literals    substr()    /(.)/    /\w/
        ------------------------------------------------------------------
        ASCII           OK          OK          OK       OK
        Latin1          OK          OK          OK       OK (with locale)
        Other SBE       OK          OK          OK       Somewhat OK
        EUC             OK          NG          NG       NG
        Shift_JIS       NG          NG          NG       NG
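
To make the NG entries concrete, here is a minimal sketch of what byte-oriented perl does with multibyte data. The sample bytes are my own choice for illustration ("日本" in EUC-JP and "表" in Shift_JIS), and the script assumes no Unicode decoding has been applied to them.

```perl
use strict;
use warnings;

# "日本" in EUC-JP: two characters, two bytes each (illustrative sample).
my $euc = "\xC6\xFC\xCB\xDC";

# Without decoding, perl counts and slices bytes, not characters.
my $len   = length $euc;            # 4 bytes, although there are 2 characters
my $first = substr($euc, 0, 1);     # "\xC6", only half of the first character
my ($dot) = $euc =~ /(.)/s;         # /(.)/ also captures a single byte

# "表" in Shift_JIS is 0x95 0x5C. The trail byte is an ASCII backslash,
# which is why even string literals are NG in Shift_JIS source.
my $sjis          = "\x95\x5C";
my $has_backslash = index($sjis, "\\") >= 0 ? 1 : 0;

printf "length=%d substr=%s regex=%s backslash=%d\n",
    $len, unpack("H*", $first), unpack("H*", $dot), $has_backslash;
```

EUC literals are OK because every byte of a multibyte EUC character has the high bit set, so none of them collides with an ASCII metacharacter.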

The popular (mis)?belief that you needed a special version of perl such as Jperl until perl 5.6 is very wrong. jcode.pl by utashiro-san has existed since perl4 (and it does encoding conversion via regex! And it is still maintained). Speaking of the Japanese environment alone, my Jcode took over in Perl5 and it has been working, with (crude) Unicode support via XS. Jperl was needed when and only when you really wanted a CHARACTER in character-oriented functions and regexes. If you only wanted a whole string (which is the case for most CGIs), the same old perl had been good enough before the arrival of perl 5.6.

  With this understood,  let's review the document once again.

  I REPEAT: until perl 5.6, PERL KNEW NOTHING ABOUT ENCODING.

And here is just one of many documents that blindly assume legacy data is in ISO 8859-1:


pod/perluniintro.pod

When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
applicable) is assumed.  You can override this assumption by
using the C<encoding> pragma, for example

    use encoding 'latin2'; # ISO 8859-2

in which case literals (string or regular expression) and chr/ord
in your whole script are assumed to produce Unicode characters from
ISO 8859-2 code points.  Note that the matching for the encoding
names is forgiving: instead of C<latin2> you could have said
C<Latin 2>, or C<iso8859-2>, and so forth.  With just

    use encoding;

first the environment variable C<PERL_ENCODING> will be consulted,
and if that doesn't exist, ISO 8859-1 (Latin 1) will be assumed.
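
To show why the ISO 8859-1 assumption matters, here is a minimal sketch using Encode's decode(). The sample bytes ("日本" in EUC-JP) are my own illustration, not taken from the document above. Upgrading them as Latin-1 silently produces four unrelated characters, while naming the real encoding gives the intended two.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "日本" as EUC-JP bytes (illustrative sample data).
my $bytes = "\xC6\xFC\xCB\xDC";

# Treating the bytes as ISO 8859-1 maps each byte to one character.
my $as_latin1 = decode('iso-8859-1', $bytes);

# Naming the actual encoding yields the two intended characters.
my $as_eucjp = decode('euc-jp', $bytes);

printf "latin1: %d chars, euc-jp: %d chars (U+%04X U+%04X)\n",
    length $as_latin1, length $as_eucjp,
    map { ord } split //, $as_eucjp;
```

Neither call fails, which is exactly the problem: the Latin-1 reading is not an error, just not what the author of the script meant.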



Dan the Encode Maintainer

P.S. Here is a small script that checks the behavior of /\w/ in perl. Use it as


env LC_CTYPE=foo.bar /path/to/perl w-test.pl

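It prints every byte in 0x80..0xff that matches /\w/, that is, every place where /\w/ behaves differently from LC_CTYPE=C. Pass -v to check 0x00..0xff instead.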

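#!/usr/local/bin/perl
# w-test.pl: list the bytes that /\w/ matches under the current LC_CTYPE
use strict;
use warnings;
use locale;        # without this pragma, /\w/ ignores LC_CTYPE
use Getopt::Std;

my %Opt;
getopts("v", \%Opt);
# -v checks every byte; the default checks only the non-ASCII half
my $first = $Opt{v} ? 0 : 0x80;

for my $i ($first .. 0xff) {
    my $c = chr($i);
    # pass $c as an argument so it is never parsed as a format directive
    printf qq{'%s' (\\x%02x) =~ /\\w/\n}, $c, $i if $c =~ /\w/;
}
__END__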