On 2002.03.10, at 04:59, Larry Wall wrote:
> In Markus's lovely http://www.cl.cam.ac.uk/~mgk25/unicode.html document,
> he writes:
>
>> On POSIX systems, the selected locale identifies already the encoding
>> expected in all input and output files of a process.
With all due respect, I have to tell you that locale is one of the perl
features that sucks the most, especially where CJK is concerned. For a
long time perl would spit warning after warning unless LANG=C, so my
.cshrc has contained 'unsetenv LANG' to keep perl happy for more than a
decade. I can't help wondering how much existing perl code actually
uses locale.
Well, perl is not to blame where locale is concerned. The standards
(and the very lack thereof) are to blame. Locale might have worked on
roman-alphabet systems but hardly ever did on CJK (yeah, I know some
programs do use locale, such as tcsh, but I never dared use it lest my
serial console be rendered useless).
> Perl currently violates this, and I'm getting very tired very quickly
> of having to put things like
I have been glad perl violates that, among other things that suck. Why
are you, the postmodernist social hacker, suddenly behaving so
modernistically here?
>     eval {
>         binmode IN,     ":utf8";
>         binmode STDIN,  ":utf8";
>         binmode STDOUT, ":utf8";
>     };
I am not going to yell at something already fixed, but I can't help
grumbling that this new use of binmode sucks. Reminds me too much of DOS.
> in my programs, despite running in a LANG=en_US.UTF-8 locale with a
> UTF-8 aware xterm and a UTF-8 aware editor. What will it take to fix
> that? Not much, I think.
Not much for perl, maybe; quite a lot for the OSes, definitely. For one
thing, none of the platforms I use daily has any *.UTF-8 locales.
> In the more-difficult-but-oh-so-user-friendly category, it would also
> be lovely if someone came up with a dwimmish layer that could recognize
> when it isn't getting UTF-8 and attempt autorecognition of other
> encodings, perhaps with hints from the locale. Camel III called it
> :any, but maybe :guess would be better documentation. Then saying
> C<use open ":guess"> could just dwim all the opens. There's arguments
> both for and against making that the default. After all, just because
> you've set a UTF-8 locale doesn't actually mean that all the files you
> receive are in that format. It has to be at least easy to turn on
> guessing, even if that's not the default. But if we do want to
> establish guessing as a default, then the transition to widespread use
> of UTF-8 locales is probably our only chance.
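[Editorial aside: such a ":guess" layer does not exist; the following is a
minimal sketch of the try-strict-UTF-8-then-fall-back idea, in Python for
concreteness. The name `decode_guess` and its `fallback` parameter are
illustrative, not any real API.]

```python
import locale

def decode_guess(raw, fallback=None):
    """Illustrative ':guess' sketch: `raw` is bytes read from a file
    opened in binary mode.  Try strict UTF-8 first; if the bytes are
    not well-formed UTF-8, fall back to an explicitly given encoding,
    or else to one derived from the locale."""
    try:
        return raw.decode("utf-8")  # strict: raises on malformed UTF-8
    except UnicodeDecodeError:
        enc = fallback or locale.getpreferredencoding(False)
        return raw.decode(enc)      # e.g. ISO-8859-1, EUC-JP, ...
```

For example, `decode_guess(b"caf\xc3\xa9")` takes the UTF-8 path, while
`decode_guess(b"caf\xe9", fallback="latin-1")` falls back, and both yield
"café".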
I beg you not to squeeze locale into the Unicode features. perlunicode
as of 5.7.3 says:
    Use of locales with utf8 may lead to odd results.  Currently
    there is some attempt to apply 8-bit locale info to
    characters in the range 0..255, but this is demonstrably
    incorrect for locales that use characters above that range
    (when mapped into Unicode).  It will also tend to run
    slower.  Avoidance of locales is strongly encouraged.
By letting locale into Unicode features you are going to make this
even worse.
> Markus, what's your take on this? Do you think open by default should
> try to Do the Right Thing? I'm trying to balance out the needs of
> neophytes with experts here. Perhaps this is another of those things
> that should work differently under C<use strict>. But it's so
> pitifully easy to distinguish UTF-8 from ISO-8859-1 that it seems like
> that should almost be mandatory.
And it is so pitifully hard, if not impossible, to distinguish UTF-8
from EUC, Shift JIS, Big5, GB2312, KSC5601....
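[Editorial aside: both points can be demonstrated with any strict decoder;
here in Python for concreteness. Typical Latin-1 text is malformed UTF-8,
so a validity check tells them apart, while the very same byte pair can be
valid UTF-8 and valid EUC-JP at once, so no validity check can decide.]

```python
# 'café' in ISO-8859-1: the lone 0xE9 byte is malformed UTF-8, so a
# strict UTF-8 decoder rejects Latin-1 text almost immediately.
latin1 = b"caf\xe9"
try:
    latin1.decode("utf-8")
    looks_like_utf8 = True
except UnicodeDecodeError:
    looks_like_utf8 = False
print(looks_like_utf8)  # False: easy to tell apart

# But the byte pair C2 A1 is simultaneously well-formed UTF-8
# (U+00A1, '¡') and a valid two-byte EUC-JP character, so both
# decodes succeed and the bytes alone cannot settle the question.
ambiguous = b"\xc2\xa1"
print(ambiguous.decode("utf-8"))   # '¡'
print(ambiguous.decode("euc-jp"))  # also succeeds, as a JIS X 0208 character
```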
> But the first step is recognizing UTF-8 locales.
Or kiss locale goodbye. After all, locale is only good for bilingual
systems, and with the Internet bilingual is not good enough even just
for websurfing (you never know to what encoding your next click will
take you). IMHO, one of the best things about Unicode is that it gets
rid of the very need for locale once and for all....
We already have the utf8 pragma. If we really need something to make
streams utf8 by default (yet leave other things in the 'use bytes;'
realm), why don't we just extend it, like

    use utf8 qw(:filehandle);

for instance?
Dan the Man with Too Many Encodings to Deal with