Re: My favorite bug to fix for 5.8.0

On 2002.03.10, at 04:59, Larry Wall wrote:

In Markus's lovely http://www.cl.cam.ac.uk/~mgk25/unicode.html document,
he writes:

    On POSIX systems, the selected locale identifies already the encoding
    expected in all input and output files of a process.

With all due respect, I have to tell you that locale is one of the perl features that sucks the most, especially where CJK is concerned. It was a long time before perl stopped spitting out warning after warning unless LANG=C. So my .cshrc has contained 'unsetenv LANG' to keep perl happy for more than a decade. I can't help wondering how much existing perl code actually uses locale.

Well, perl is not to blame where locale is concerned. It's the standards (and the very lack thereof) that are to blame. Locale might have worked on Roman-alphabet systems, but hardly ever on CJK (yeah, I know some programs do use locale, such as tcsh, but I never dared use it or my serial console might be rendered useless).

Perl currently violates this, and I'm getting very tired very quickly
of having to put things like

I have been glad perl does violate that, among other things that suck. Why are you, the postmodernistic social hacker, suddenly behaving so modernistically here?

    eval {
        binmode IN, ":utf8";
        binmode STDIN, ":utf8";
        binmode STDOUT, ":utf8";
    };

I am not going to yell at something already fixed, but I can't help grumbling that this new use of binmode sucks. It reminds me too much of DOS.

in my programs, despite running in a LANG=en_US.UTF-8 locale with a
UTF-8 aware xterm and a UTF-8 aware editor.  What will it take to fix
that?  Not much, I think.

Not much for perl, maybe. So much for the OSes, definitely. For one thing, none of the platforms I use daily has such locales as *.UTF-8.

In the more-difficult-but-oh-so-user-friendly category, it would also
be lovely if someone came up with a dwimmish layer that could recognize
when it isn't getting UTF-8 and attempt autorecognition of other
encodings, perhaps with hints from the locale.  Camel III called it
:any, but maybe :guess would be better documentation.  Then saying
C<use open ":guess"> could just dwim all the opens.  There's arguments
both for and against making that the default.  After all, just because
you've set a UTF-8 locale doesn't actually mean that all the files you
receive are in that format.  It has to be at least easy to turn on
guessing, even if that's not the default.  But if we do want to establish
guessing as a default, then the transition to widespread use of UTF-8
locales is probably our only chance.


I beg you not to squeeze locale into Unicode features.

From perlunicode as of 5.7.3:

       Use of locales with utf8 may lead to odd results.  Cur-
       rently there is some attempt to apply 8-bit locale info to
       characters in the range 0..255, but this is demonstrably
       incorrect for locales that use characters above that range
       (when mapped into Unicode).  It will also tend to run
       slower.  Avoidance of locales is strongly encouraged.

By letting locale into Unicode features you are going to make this even worse.

Markus, what's your take on this?  Do you think open by default should
try to Do the Right Thing?  I'm trying to balance out the needs of
neophytes with experts here.  Perhaps this is another of those things
that should work differently under C<use strict>.  But it's so
pitifully easy to distinguish UTF-8 from ISO-8859-1 that it seems like
that should almost be mandatory.

And it is so pitifully hard, if not impossible, to distinguish UTF-8 from EUC, Shift JIS, Big5, GB2312, KSC5601....
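To make the contrast concrete, here is a minimal sketch, assuming the Encode module and its bundled Japanese encodings are installed. The helper name decodes_as is hypothetical, made up for this illustration only. It checks whether a byte string is well-formed in a given encoding, which is about the only evidence a guesser has to go on.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if $bytes is a well-formed byte sequence
# in the named encoding, false otherwise.
sub decodes_as {
    my ($encoding, $bytes) = @_;   # a copy, since decode() may modify its input
    return eval { decode($encoding, $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $latin1    = "caf\xE9 au lait";   # ISO-8859-1 bytes
my $ambiguous = "\xC3\xA9";          # two high-bit bytes, encoding unknown

# The ISO-8859-1 sample against UTF-8.
my $latin1_is_utf8 = decodes_as('UTF-8', $latin1);

# The ambiguous sample against several candidate encodings.
my %fits = map { $_ => decodes_as($_, $ambiguous) } qw(UTF-8 euc-jp shiftjis);

printf "%-9s %s\n", $_, $fits{$_} ? 'well-formed' : 'malformed'
    for sort keys %fits;
```

The ISO-8859-1 sample fails as UTF-8 because 0xE9 announces a multi-byte sequence that never follows, so telling those two apart is cheap. The two-byte sample, however, is well-formed in all three candidate encodings, and well-formedness alone cannot say which one the sender meant.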

But the first step is recognizing UTF-8 locales.

Or kiss locale goodbye. After all, locale is only good for bilingual systems. With the Internet, bilingual is not good enough even just for web surfing (you never know what encoding your next click will take you to). IMHO, one of the best things about Unicode is that it gets rid of the very need for locale once and for all....

We already have the utf8 pragma. If we really need something to make streams utf8 by default (yet leave other things in the 'use bytes;' realm), why don't we just extend it like


    use utf8 qw(:filehandle);

for instance?
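For comparison, a minimal sketch of the nearest thing that already exists: the open pragma, which sets default layers for every open() in its lexical scope without consulting the locale. The ':filehandle' tag above is only a proposal and is not implemented anywhere. The temporary file and variable names below are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Existing pragma: open() calls in this lexical scope get a :utf8
# layer by default. No binmode calls and no locale lookup involved.
use open ':utf8';

my (undef, $path) = tempfile(UNLINK => 1);

open my $out, '>', $path or die "open for write: $!";
print {$out} "\x{263A}\n";          # one character above 0xFF
close $out or die "close: $!";

open my $in, '<', $path or die "open for read: $!";
chomp(my $line = <$in>);
close $in;

my $chars = length $line;           # characters as perl sees them
my $bytes = -s $path;               # bytes actually written to disk

print "characters: $chars, bytes on disk: $bytes\n";
```

Because the pragma is lexically scoped, code outside the block that declares it stays in the bytes world, which is roughly the split asked for above.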

Dan the Man with Too Many Encodings to Deal with