[Encode] In what character encoding legacy scripts are written?

2002-04-05 04:49:25
jhi and porters,

With Encode done, I am now focusing on the related code and documentation. Naturally there is a lot of it, but before just sending patches, I would like to call your attention to one point. Many documents in the core state that legacy encoding defaults to ISO-8859-1. Though this is NOT WRONG, it is NOT CORRECT EITHER. There are millions of perl scripts with non-latin1 literals. /(.)/ matches a single "character" not only in ISO-8859-1 but also in virtually any single-byte encoding. The correct answer is: perl was not encoding-conscious, except for single-byte locales, until Unicode support was introduced.

To make my point clear, let me show you a table. "SBE" is short for Single Byte Encodings and "OK" means works as expected.

                    Literals    substr()    /(.)/       /\w/
        ASCII       OK          OK          OK          OK
        Latin1      OK          OK          OK          OK (with locale)
        Other SBE   OK          OK          OK          Somewhat OK
        EUC         OK          NG          NG
        Shift_JIS   NG          NG          NG
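
To make the EUC and Shift_JIS rows concrete, here is a minimal sketch using the Encode module. The sample octets and variable names are my own illustration, not taken from any existing script. The backslash explanation for the Shift_JIS literal is the commonly cited cause, not something the table itself states.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+65E5 U+672C as raw EUC-JP octets: 2 characters, 4 bytes.
my $euc = "\xC6\xFC\xCB\xDC";

# Byte semantics: /(.)/ and substr() each return half a character.
my ($re_byte) = $euc =~ /(.)/s;
my $sub_byte  = substr($euc, 0, 1);
my $byte_len  = length $euc;

# Character semantics, available once the octets are decoded.
my $text      = decode('euc-jp', $euc);
my ($re_char) = $text =~ /(.)/s;
my $char_len  = length $text;

printf "EUC bytes: length=%d, /(.)/ gives 0x%02X\n", $byte_len, ord $re_byte;
printf "EUC chars: length=%d, /(.)/ gives U+%04X\n", $char_len, ord $re_char;

# Shift_JIS: U+8868 encodes as 0x95 0x5C, and 0x5C is the ASCII backslash.
my $sjis = encode('shiftjis', "\x{8868}\x{793A}");

# Simulate a script written in Shift_JIS: the octets sit inside a
# double-quoted literal, so perl parses 0x5C as an escape and drops it.
my $literal = eval(q{"} . $sjis . q{"});

printf "Shift_JIS: %d octets encoded, %d octets after literal parsing\n",
    length($sjis), length($literal);
```

Decoding first costs an extra step, but it is what turns "one byte" into "one character" for substr() and regexes.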

The popular (mis)?belief that you needed a special version of perl such as Jperl until perl 5.6 is very wrong. Code by utashiro-san has existed since perl4 (and it does encoding conversion via regex! And it is still maintained). Speaking of the Japanese environment alone, my Jcode took over in Perl5 and it has been working, with (crude) Unicode support via XS. Jperl was needed when and only when you really wanted a CHARACTER in character-oriented functions and regexes. If you only wanted a whole string (which is the case for most CGIs), the same old perl had been good enough before the arrival of perl 5.6.
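
As an illustration of the whole-string case, the sketch below converts an EUC-JP byte string to Shift_JIS with Encode's from_to() without ever looking at individual characters. It uses the current Encode module rather than Jcode or the perl4-era code mentioned above, and the sample octets are my own.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# U+65E5 U+672C U+8A9E as EUC-JP octets (sample data for this sketch).
my $octets = "\xC6\xFC\xCB\xDC\xB8\xEC";

# from_to() converts in place and returns the new length in octets.
my $len = from_to($octets, 'euc-jp', 'shiftjis');

# Whole-string operations such as eq and concatenation only compare
# or copy octets, so they work on the converted string as-is.
my $expected = "\x93\xFA\x96\x7B\x8C\xEA";
my $same     = $octets eq $expected ? 1 : 0;
my $page     = "<p>" . $octets . "</p>";

printf "converted %d octets, matches expected Shift_JIS: %s\n",
    $len, $same ? "yes" : "no";
```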
With this understood, let's review the document once again.


And here is just one of the many documents that blindly assume legacy data is in ISO 8859-1:

When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
applicable) is assumed.  You can override this assumption by
using the C<encoding> pragma, for example

    use encoding 'latin2'; # ISO 8859-2

in which case literals (string or regular expression) and chr/ord
in your whole script are assumed to produce Unicode characters from
ISO 8859-2 code points.  Note that the matching for the encoding
names is forgiving: instead of C<latin2> you could have said
C<Latin 2>, or C<iso8859-2>, and so forth.  With just

    use encoding;

first the environment variable C<PERL_ENCODING> will be consulted,
and if that doesn't exist, ISO 8859-1 (Latin 1) will be assumed.
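
A short sketch of why the assumed legacy encoding matters: the same octet decodes to different characters, and /\w/ gives different answers, depending on whether it is read as ISO 8859-1 or ISO 8859-2. It calls Encode::decode() explicitly rather than using the encoding pragma, and the octet 0xB1 is my own example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octet = "\xB1";

my $as_latin1 = decode('iso-8859-1', $octet);  # PLUS-MINUS SIGN
my $as_latin2 = decode('iso-8859-2', $octet);  # LATIN SMALL LETTER A WITH OGONEK

for my $pair ([latin1 => $as_latin1], [latin2 => $as_latin2]) {
    my ($name, $char) = @$pair;
    printf "%s: U+%04X %s /\\w/\n", $name, ord($char),
        $char =~ /\w/ ? "matches" : "does not match";
}
```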

Dan the Encode Maintainer

P.S. Here is a small script that checks the behavior of /\w/ in perl. Use it as

env /path/to/perl

It prints every character for which /\w/ behaves differently from LC_CTYPE=C.

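use strict;
use locale;     # needed for \w to honor LC_CTYPE
use Getopt::Std;

my %Opt;
getopts("v", \%Opt);
# -v: check every byte; default: only 0x80..0xff, where LC_CTYPE=C matches nothing
my $first = $Opt{v} ? 0 : 0x80;

for my $i ($first..0xff){
    my $c = chr($i);
    printf qq{'$c' (\\x%02x) =~ /\\w/\n}, $i if $c =~ /\w/;
}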
