Re: Interpretation of non-UTF8 strings

On Mon, 16-08-2004, at 16:18 +0300, Jarkko Hietaniemi
wrote:

Those who think UTF-8 should be switched on based on locale settings
should spend some time with the Redhat bug database.  RH 8 and 9 used an
early prerelease version of Perl 5.8.0, which did switch on full
UTF-8-ness based on locale settings.  This turned out to be quite a mess
because RH8/9 had *by default* such locales - *every* RH8/9 user was
subjected to full UTF-8, e.g. UTF-8 I/O.


It's because RedHat tried to globally switch to UTF-8 too early, when
many programs were not ready for it and many people still used other
encodings extensively. It should have given them a choice (or the choice
should have been clearer - I haven't used newer RedHats, so I don't know
how easy it is to set them up to use some other encoding by default).

We can't expect everybody to use UTF-8 yet, but we also can't assume
that ISO-8859-1 is enough for everyone. The world and each personal
computer uses a large number of encodings. We can't change that
immediately; we must adapt to it.
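
To make this concrete, here is a minimal sketch (in Python rather than
Perl, purely for illustration; the variable names are mine) of how one
and the same byte reads under three assumed encodings:

```python
# One byte, three assumed encodings. 0xBF is a valid character in both
# ISO-8859-1 and ISO-8859-2, but it is not valid UTF-8 on its own.
raw = b"\xbf"

as_latin1 = raw.decode("iso8859-1")  # inverted question mark, U+00BF
as_latin2 = raw.decode("iso8859-2")  # z with dot above, U+017C

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone continuation byte cannot start a UTF-8 sequence.
    utf8_ok = False

print(repr(as_latin1), repr(as_latin2), utf8_ok)
```

A program that must interpret the byte has to be told which of these
readings applies; guessing ISO-8859-1 silently corrupts Polish text.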

The locale mechanism is there so programs have a chance of working
without forcing a particular encoding on all people, and without
specifying the encoding separately for each program. This is not
perfect, because some data (e.g. transferred from the Internet) will
likely use some other encoding, but I don't know of any better option.

To repeat: forcing a single encoding everywhere is not an option (at
least at the moment), and forcing each program (including ls and grep,
literally each program dealing with texts) to have encoding switches in
its configuration and to perform the recoding on the fly is not an
option either.

The locale is a default. It should be overridable in programs where it's
important enough to support various encodings, and where the authors
were patient enough to implement it, but otherwise it should be the
default for all programs which must interpret particular characters
themselves and don't only pass them through.
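
As a rough sketch of that policy (again in Python, only as an
illustration; resolve_encoding and its ASCII fallback are hypothetical
choices of mine, not an existing API), a program could pick its encoding
like this:

```python
import codecs
import locale


def resolve_encoding(override=None):
    """Return a canonical codec name: the explicit override if given,
    otherwise whatever the user's locale declares."""
    if override is not None:
        # The program has its own encoding switch and the user used it.
        return codecs.lookup(override).name
    try:
        # Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale: keep whatever is already in effect
    name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Assumption: fall back to ASCII if the codeset is unknown.
        return "ascii"


print(resolve_encoding())              # depends on the user's locale
print(resolve_encoding("ISO-8859-2"))  # explicit choice wins
```

Programs with no encoding switch of their own would simply call
resolve_encoding() with no argument and get the locale's answer.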

When someone feels ready to use UTF-8 for daily work (terminal
emulators, filenames, program sources etc.), he will set up a UTF-8
locale to inform software about this fact.

But there is a simple workaround for that, as perluniintro would tell
you: the encoding pragma.


The encoding pragma only partially works. It doesn't influence the
encoding assumed for files opened without an explicit encoding, nor the
handling of filenames, and it needs to be told the encoding literally.
How do I tell it to take the encoding from the locale?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak(_at_)knm(_dot_)org(_dot_)pl
    ^^     http://qrnik.knm.org.pl/~qrczak/