Re: perlunicode comment - when Unicode does not happen

On Sun, 28 Dec 2003, Nick Ing-Simmons wrote:

Jungshik Shin <jshin(_at_)mailaps(_dot_)org> writes:


 Then, he should switch to en_GB.UTF-8.


I probably will.


  Good!

Besides, he implied that
he still uses ISO-8859-1 for files whose names can be covered by
ISO-8859-1, which is why I wrote about mixing up two encodings
in a single file system _under_ his control.


There is a tendency for programs to assume that the locale's encoding
is used for the contents of the file. In the UK there are a LOT of files
which are not UTF-8 but iso8859-1 or iso8859-15.


  Sure, there are tons of text files in EUC-JP, GB2312, EUC-KR,
ISO-8859-7, Windows-1251, ISO-8859-1, TIS-620, KOI8-R. Switching to
a UTF-8 locale means converting them all to UTF-8 (which is a one-time
cost), as well as their names. I did that almost two years ago, and so
have others. If you want to keep them in ISO-8859-1/15, fine. That's
your choice, but please don't blame programs (or their tendency) for
making a fair-enough assumption when *** NO OTHER EXTERNAL information is
available ***.

  Not all files are under your control? That's when 'additional
external information' comes into play. Computers are stupid. You know
that well. Often you have to help them instead of being helped
by them.

assumptions are "mostly harmless". If I switch to a UTF-8 locale and
a stupid program dies because I spelt naïve correctly in 8859-1
and that is a UTF-8 coding violation I don't gain much.
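
The failure described here is easy to reproduce. The following is only a minimal sketch in Python (chosen for illustration; the thread itself concerns Perl), and `is_valid_utf8` is a hypothetical helper name. It shows that "naïve" stored as ISO-8859-1 contains the byte 0xEF, which a strict UTF-8 decoder rejects because that byte opens a three-byte sequence and no continuation bytes follow it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same word, written out under the two encodings discussed in the thread.
latin1_bytes = "na\u00efve".encode("iso-8859-1")  # i-diaeresis is one byte, 0xEF
utf8_bytes = "na\u00efve".encode("utf-8")         # i-diaeresis is two bytes, 0xC3 0xAF

print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

A program that only checks validity this way can detect the violation, but it still cannot tell which legacy encoding the file was really written in.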


  You're not supposed to do that if you're in UTF-8. Why would you
want to use anything other than UTF-8 if you like Unicode/UTF-8
so much?

 Moreover, why would you think that the en_GB.UTF-8 locale gives him a
time and date format NOT suitable for him? You're making the mistake of
binding locale and encoding together. Encoding should never be a part of
the locale definition.


That is EXACTLY the point Jarkko and I are making. The locale setting
really tells you NOTHING about the encoding.


   So, what is nl_langinfo(CODESET) for?
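
For reference, a program can ask for the codeset the current locale advertises. This is a rough sketch using Python's wrapper around the C call, assuming a Unix-like system where `locale.nl_langinfo` is available; `locale_codeset` is a hypothetical helper name, and the value printed depends entirely on the environment it runs in.

```python
import locale


def locale_codeset() -> str:
    """Return the codeset advertised by the environment's locale.

    Hypothetical helper: wraps nl_langinfo(CODESET) and stays in the
    current locale if the environment names one that is not installed.
    """
    try:
        # Adopt LC_CTYPE from LANG / LC_ALL / LC_CTYPE in the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unknown locale name; keep whatever locale is already active
    return locale.nl_langinfo(locale.CODESET)


print(locale_codeset())
```

The result is a default for data the process itself reads and writes. It says nothing about who created a given file or file name.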

So when presented with

if (-d "\x{20ac}4") ...

how is "locale" supposed to help poor Joe in his en_US.utf8 locale looking
at a sub-dir created by Kurt in de_DE@euro, or was it Karl in de_DE.utf8?
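
To make the mismatch concrete, here is a small sketch of the bytes each user would put on disk for that directory name. It assumes, as on glibc systems, that de_DE@euro uses ISO-8859-15; it is written in Python purely for illustration, and the variable names are hypothetical.

```python
name = "\u20ac4"  # EURO SIGN followed by "4", as in the Perl test above

# Bytes each user's process would hand to the file system.
as_latin9 = name.encode("iso8859-15")  # de_DE@euro: the euro sign is the single byte 0xA4
as_utf8 = name.encode("utf-8")         # de_DE.utf8: the euro sign is 0xE2 0x82 0xAC

print(as_latin9)
print(as_utf8)

# Joe's UTF-8 locale cannot interpret the ISO-8859-15 name without outside help.
try:
    as_latin9.decode("utf-8")
except UnicodeDecodeError:
    print("ISO-8859-15 name is not valid UTF-8")
```

The two names are different byte strings, so a lookup using one encoding will not find a directory created under the other.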


   How could it? No way. It CANNOT. Have I ever said it could?
Absolutely not. It's YOUR responsibility to take care of the mess that
was created by you or your colleagues. You have to pay the price for
mixing up multiple encodings. Even if it was your friends or colleagues
who made those files, you're the one trying to access them, so you have
to make it work by providing additional information. Otherwise, programs
cannot help resorting to the locale-based default. Why do you think I
proposed a set of options that you agreed would work, more or less?

 Before writing that, please read the man pages of 'smbmount' and
'mount' if a Linux system is available to you. They're not environment
variables.

I think you are on "our" side.


  Sure, I am, but I'm afraid you learned 'too many lessons' from
Perl 5.8.

  Jungshik