perl-unicode

Re: perlunicode comment - when Unicode does not happen

2003-12-25 03:30:04
On Tue, 23 Dec 2003, Nick Ing-Simmons wrote:
Ed Batutis <ed@batutis.com> writes:
I don't think we understand common practice (or that such practices
are even established yet) well enough to specify that yet.

  Common practice is that file names on 'local disks' are assumed to be
in the character encoding of the current locale. Of course, this
assumption doesn't always hold and can break things with networked file
systems and all sorts of other file systems, but what could Perl do
about it other than offer some options/flexibility to let users do
what they want? Perl users are supposed to be 'consenting adults' (maybe
not in terms of physical age for some young users), so, given a set
of options, they can pick the one most suitable for them for a given
task.
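
  To make that concrete, here is a minimal sketch of what that assumption
looks like in today's Perl on a POSIX system, using Encode and
I18N::Langinfo; the directory and the output layer are just examples:

    #!/usr/bin/perl
    # Sketch only: treat the bytes readdir() returns as being in the
    # locale's encoding, i.e. the "common practice" described above.
    use strict;
    use warnings;
    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(decode);

    setlocale(LC_CTYPE, '');           # adopt the user's locale
    my $codeset = langinfo(CODESET);   # e.g. 'UTF-8', 'ISO-8859-1', 'EUC-JP'
    binmode STDOUT, ":encoding($codeset)";

    opendir my $dh, '.' or die "opendir: $!";
    for my $raw (readdir $dh) {
        my $name = decode($codeset, $raw);   # locale bytes -> Perl string
        print "$name\n";
    }
    closedir $dh;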

Because we don't know how, because the "common practice" isn't established.

  As I wrote, it was established well before Unicode came onto the
scene. It has little to do with UTF-8 or Unicode.

If we "just fix it" now the behaviour will be tied down and when the
"common practice" is established we will not be able to support it.

   Let's not 'fix' it (not carve it in stone), but offer a few
well-thought-out options. For instance, Perl could offer (not that these
are particularly well thought out) 'just treat this as a sequence of
octets', 'locale', and 'unicode'. 'locale' on Unix means the multibyte
encoding returned by nl_langinfo(CODESET) or equivalent. On Windows,
it's whatever the 'A' APIs accept, i.e. the ANSI code page returned by
GetACP(). 'unicode' is UTF-8 on Unix-like OSes and BeOS, and UTF-16(LE)
on Windows.
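
  In code, such a knob might look roughly like this; the fs_encode() name
and the policy strings are made up for illustration, nothing like it
exists in Perl today:

    use strict;
    use warnings;
    use Encode qw(encode);
    use I18N::Langinfo qw(langinfo CODESET);

    # Hypothetical helper: turn a Perl (Unicode) string into the bytes
    # handed to the OS, according to one of the three proposed policies.
    sub fs_encode {
        my ($name, $policy) = @_;
        return $name                            if $policy eq 'octets';
        return encode(langinfo(CODESET), $name) if $policy eq 'locale';
        return encode('UTF-8', $name)           if $policy eq 'unicode';
        die "unknown filename policy '$policy'";
    }

    # e.g. create a file whose name follows the current locale:
    open my $fh, '>', fs_encode("r\x{e9}sum\x{e9}.txt", 'locale')
        or die "open: $!";

(On Windows, the 'unicode' case would really have to go through the 'W'
APIs with UTF-16LE names, which a byte-oriented interface like this can't
express; that's one reason the knob belongs in the core rather than in
user code.)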

When _I_ want Unicode named things on Linux I just put file names in UTF-8.

  In that case, you're mixing two encodings on your file system by
creating files with UTF-8 names while still using the en_GB.ISO-8859-1
locale. Why should Perl be held responsible for an intentional act
that is bound to break things? Because I don't want to be restricted by
the character repertoire of legacy encodings, I switched over to a UTF-8
locale almost two years ago.
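
  For existing files, a one-off rename pass along these lines is roughly
what the switch involves; the charsets and the current directory are
assumptions for the example:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # One-off sketch: rename files whose names are ISO-8859-1 bytes so
    # that they become UTF-8 bytes.
    opendir my $dh, '.' or die "opendir: $!";
    my @names = readdir $dh;
    closedir $dh;

    for my $old (@names) {
        next if $old eq '.' or $old eq '..';
        my $new = encode('UTF-8', decode('ISO-8859-1', $old));
        next if $new eq $old;          # pure-ASCII names are unchanged
        rename $old, $new or warn "rename $old: $!\n";
    }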

Suits me fine, but is not going to mesh with my locale setting because
I am going to leave that as en_GB otherwise piles of legacy C apps get ill.

  Well, things are changing rapidly on that front.

Now when I have samba-mounted a WinXP file system that is wrong, same for

  Well, actually, if your WinXP file system has only characters covered
by Windows-1252, you can use 'codepage=cp1252' and 'iocharset=iso8859-1'
for smbmount/mount. Obviously, there's a problem there because ISO 8859-1
is only a subset of Windows-1252 (names using characters from the
0x80-0x9F range, such as curly quotes or the euro sign, can't be
represented). If you used en_GB.UTF-8 on Linux, there would be no such
problem, because you could use 'codepage=cp1252' and 'iocharset=utf8'.
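
  Concretely, that amounts to something like the following, where the
share name and mount point are placeholders:

    # ISO 8859-1 locale on the Linux side:
    smbmount //winxp/share /mnt/winxp -o codepage=cp1252,iocharset=iso8859-1
    # UTF-8 locale on the Linux side (no repertoire lost):
    smbmount //winxp/share /mnt/winxp -o codepage=cp1252,iocharset=utf8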

CDROMs most likely. This mess will converge some more - I can already
see that happening.

  UDF is the way to go for CD-ROMs/DVD-ROMs.


_My_ gut feeling is that on Linux at least the way forward is to
pass the UTF-8 string through -d - and indeed possibly "upgrade" to UTF-8
if the string has high-bit octets.
But you seem to be making the case that UTF-8 should be converted to
some "local" multi-byte encoding - which is the "common practice" ?

  That's because there are a lot of people like you who still use en_GB
(ja_JP.eucJP, de_DE.iso8859-1, etc.) instead of en_GB.UTF-8 (ja_JP.UTF-8,
de_DE.UTF-8) :-) On Linux, that number is dwindling, but on Solaris
and other Unixes (not that they don't support UTF-8 locales, but most
system admins don't bother to install the necessary locales and support
files), it's not decreasing as fast.
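
  To spell out the two behaviours in question for a name held as a Perl
(Unicode) string (the directory name is only an example):

    use strict;
    use warnings;
    use Encode qw(encode);
    use I18N::Langinfo qw(langinfo CODESET);

    my $dir = "caf\x{e9}";             # a Perl (Unicode) string

    # (a) hand the kernel UTF-8 bytes, regardless of the locale
    my $bytes_utf8   = encode('UTF-8', $dir);
    # (b) the "common practice": convert to the locale's encoding first
    my $bytes_locale = encode(langinfo(CODESET), $dir);

    print "as UTF-8 bytes:  ", (-d $bytes_utf8   ? "found\n" : "not found\n");
    print "as locale bytes: ", (-d $bytes_locale ? "found\n" : "not found\n");

The two agree only under a UTF-8 locale.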

   Jungshik
