perl-unicode

Re: My favorite bug to fix for 5.8.0

2002-03-09 18:06:30
On 2002.03.10, at 04:59, Larry Wall wrote:
In Markus's lovely http://www.cl.cam.ac.uk/~mgk25/unicode.html document,
he writes:

On POSIX systems, the selected locale identifies already the encoding
    expected in all input and output files of a process.

With all due respect, I have to tell you that locale is one of the perl features that sucks the most, especially where CJK is concerned. It was a long time before perl stopped spitting warning after warning unless LANG=C, so for more than a decade my .cshrc has contained 'unsetenv LANG' to keep perl happy. I can't help wondering how much existing perl code actually uses locale. Well, perl is not to blame where locale is concerned. It's the standards (and the very lack thereof) that are to blame. Locale might have worked on Roman-alphabet systems, but hardly ever on CJK ones (yeah, I know some programs do use locale, such as tcsh, but I never dared use it, or my serial console might be rendered useless).

Perl currently violates this, and I'm getting very tired very quickly
of having to put things like

I have been glad that perl does violate that, among other things that suck. Why are you, the postmodernistic social hacker, suddenly behaving so modernistic here?

    eval {
        binmode IN, ":utf8";
        binmode STDIN, ":utf8";
        binmode STDOUT, ":utf8";
    };

I am not going to yell at something already fixed, but I can't help grumbling that this new use of binmode sucks. It reminds me too much of DOS.

in my programs, despite running in a LANG=en_US.UTF-8 locale with a
UTF-8 aware xterm and a UTF-8 aware editor.  What will it take to fix
that?  Not much, I think.

Not much for perl, maybe. A lot for the OSes, definitely. For one thing, none of the platforms I use daily has any such locale as *.UTF-8.

In the more-difficult-but-oh-so-user-friendly category, it would also
be lovely if someone came up with a dwimmish layer that could recognize
when it isn't getting UTF-8 and attempt autorecognition of other
encodings, perhaps with hints from the locale.  Camel III called it
:any, but maybe :guess would be better documentation.  Then saying
C<use open ":guess"> could just dwim all the opens.  There's arguments
both for and against making that the default.  After all, just because
you've set a UTF-8 locale doesn't actually mean that all the files you
receive are in that format.  It has to be at least easy to turn on
guessing, even if that's not the default. But if we do want to establish
guessing as a default, then the transition to widespread use of UTF-8
locales is probably our only chance.

I beg you not to squeeze locale into the Unicode features.

perlunicode as of 5.7.3:
       Use of locales with utf8 may lead to odd results.  Currently
       there is some attempt to apply 8-bit locale info to
       characters in the range 0..255, but this is demonstrably
       incorrect for locales that use characters above that range
       (when mapped into Unicode).  It will also tend to run
       slower.  Avoidance of locales is strongly encouraged.

By letting locale into the Unicode features, you are going to make this even worse.

Markus, what's your take on this?  Do you think open by default should
try to Do the Right Thing?  I'm trying to balance out the needs of
neophytes with experts here.  Perhaps this is another of those things
that should work differently under C<use strict>.  But it's so
pitifully easy to distinguish UTF-8 from ISO-8859-1 that it seems like
that should almost be mandatory.

And it is so pitifully hard, if not impossible, to distinguish UTF-8 from EUC, Shift JIS, Big5, GB2312, KSC5601....
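To make the contrast concrete, here is a minimal sketch, not anything perl does by itself. It assumes the Encode module as bundled with the 5.7.x development series, and plausible_encodings is a helper name made up for this illustration. It asks for a strict decode under each candidate encoding and keeps the candidates that do not fail:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: return the names of the candidate encodings
# under which $bytes decodes without error.
sub plausible_encodings {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $name (@candidates) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its argument
        push @ok, $name if eval { decode($name, $copy, FB_CROAK); 1 };
    }
    return @ok;
}

# ISO-8859-1 bytes: 0xE9 followed by a space is malformed as UTF-8.
my @latin1 = plausible_encodings("caf\xE9 au lait", 'UTF-8');

# The same two bytes offered to three multibyte encodings.
my @ambiguous = plausible_encodings("\xC3\xA9", qw(UTF-8 euc-jp shiftjis));

print "latin1 sample accepted as: @latin1\n";
print "C3 A9 accepted as: @ambiguous\n";
```

ISO-8859-1 text with an accented letter is rejected as UTF-8 right away, which is why that case is easy. The two bytes C3 A9, on the other hand, are well-formed in more than one of the candidates, so a clean decode alone cannot tell which encoding the sender meant.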

But the first step is recognizing UTF-8 locales.

Or kiss locale goodbye. After all, locale is only good for bilingual systems. With the Internet, bilingual is not good enough even just for web surfing (you never know what encoding your next click will take you to). IMHO, one of the best things about Unicode is that it gets rid of the very need for locale once and for all.... We already have the utf8 pragma. If we really need something to make streams utf8 by default (yet leave other things in the 'use bytes;' realm), why don't we just extend it like

use utf8 qw(:filehandle);

for instance?
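For comparison, the existing open pragma already sets a default layer for filehandles opened in its lexical scope. This is only a sketch of that pragma, not of the utf8 extension suggested above (the :filehandle tag is a proposal, not an existing import), and the temporary file is just scaffolding for the demonstration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use open ':utf8';    # default layer for open() calls in this lexical scope

# Scaffolding: get a scratch file name that is removed at exit.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# No explicit layer and no binmode: the pragma supplies :utf8.
open(my $out, '>', $path) or die "cannot write $path: $!";
print {$out} "\x{263A}\n";    # U+263A, one character, three bytes in UTF-8
close $out;

open(my $in, '<', $path) or die "cannot read $path: $!";
my $text = <$in>;
close $in;
chomp $text;

my $chars_read    = length $text;    # counts characters, not bytes
my $bytes_on_disk = -s $path;        # the UTF-8 bytes plus the line ending

print "read $chars_read character(s) from $bytes_on_disk byte(s)\n";
```

The point of the sketch is only that a per-scope default for streams can be had without touching locale and without a binmode call per handle.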

Dan the Man with Too Many Encodings to Deal with