In 'perlunicode', under 'When Unicode Does Not Happen', there is this
statement (regarding Unicode functionality and the file-system-related
functions and operators; by the way, the author missed '-X'):
Oops.
"One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s)."
This statement seems a bit evasive, especially since no other reasons
are listed. What are the other reasons? Is there a plan to make things
easier in this area? If not, why not?
I don't understand. The statement is "evasive" only because the answer
itself is: what "other reasons"? There are none. The answers really do
depend highly on the operating system and on the file system(s).
This statement also seems like an exaggeration. On Unix-like systems, it
is obvious how to deal with the file system: you convert Unicode to
multibyte.
*Which* multibyte? There are multiple different encodings just for
Unicode, and Perl cannot know which one is being used. Just by virtue of
"being in UNIX" a process cannot start playing Unicode games with
filenames; it must know it is in the right directory / filesystem before
doing that. And _other applications_ must know about it too; otherwise
(say) "foo" will look like "\0f\0o\0o".
(AFAIK) W2K and later _are able_ to use UTF-16LE-encoded Unicode for
filenames, but for backward-compatibility reasons the use of 8-bit
codepages is much more likely.
Apple's HFS+ handles Unicode using _normalized_ (decomposed, a variant
of NFD) UTF-8.
There we have two different Unicode encodings, both in use.
On Windows, there is a choice when dealing with the file system -
multibyte or Unicode - but the new "-C" switch seems to cover that
choice.
How so? The *old* -C switch (as in 5.6) did attempt to cover Windows'
Unicode filename support, but Gurusamy Sarathy deemed the support broken
(one aspect of the brokenness being that it was a global switch) and
unused enough that the -C switch was recycled for completely different
semantics that have nothing to do with filenames, on Windows or anywhere
else.
On other systems - who knows (true!) - but isn't that a porting issue?
Those 'other' people wouldn't be harmed by making things easier for the
rest of us, right?
Any solutions will have to be OS-dependent, quite possibly
application-dependent, and I very much think the solutions do not belong
in the core language.
Dealing with qx and system() also seems less than mysterious to me:
there's no 'wfork', as far as I know, so you use multibyte.
How do you know what kinds of strings are sent to the system? How do
you know what kinds of strings are returned from the system?
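If the application does know - or, more honestly, assumes - what
encoding the system speaks, it can decode command output and encode
arguments explicitly. A minimal sketch with the core Encode module;
the choice of 'UTF-8' here is purely the programmer's assumption, not
anything Perl can detect:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the shell and the commands it runs speak UTF-8.
my $assumed = 'UTF-8';

my $raw   = `echo hello`;            # bytes; encoding unknown to Perl
my $chars = decode($assumed, $raw);  # now a Perl character string

# Going the other way: encode arguments before handing them out.
my $arg = encode($assumed, "hello");
```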
Don't get me wrong - I love the Unicode support in Perl - it is an
amazing effort. But dealing with the file system (and to a lesser extent
qx/system) seems like a big hole to me.
It is a big hole but I think Perl cannot portably do much to fill it.
Perl cannot know what your filesystem can handle.
An example. Say I have a Shift-JIS string and I want to do a mkdir (on a
Shift-JIS-enabled OS, with 5.8.1 build 807):
"a Shift-JIS-enabled OS"? I have no idea what you mean by that. OSes are
somewhat unlikely to assume a character set, since that is rather more
an application-level issue.
    $newdir = "kanji_here_\x89\x5C";
    mkdir $newdir;
The above works the way I'd expect, although

    print(-d $newdir ? 'yes' : 'no');

prints 'no' - oops, a character handling bug! The second byte of the
kanji is a backslash, which apparently confuses Perl. "-d" really ought
to assume the user knows what he is doing
I tend to disbelieve that :-) All "-d" is doing is passing the $newdir
(UTF-8) bytes to stat(2).
and do character-handling based on the
current file system encoding setting (LC_CTYPE or the equivalent).
There is no portable "current file system encoding setting" API.
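The closest portable approximation is the locale's codeset, which
describes the current *locale*, not any particular filesystem. A sketch
using the core I18N::Langinfo module (POSIX-ish systems only):

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# The locale codeset, e.g. "UTF-8" or "ANSI_X3.4-1968".  Note that
# this reflects the user's locale settings; a mounted filesystem can
# expect something entirely different, and Perl cannot tell.
my $codeset = langinfo(CODESET);
print "locale codeset: $codeset\n";
```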
It seems counter-intuitive that this fails:
    use encoding 'shiftjis';
    $newdir = "kanji_here_\x89\x5C";
    mkdir $newdir;
Whoops - I just created a directory whose name is UTF-8 bytes (which do
not assemble into valid Japanese characters in Shift-JIS). I don't think
that's what most users would expect - 'mkdir' could do better than that.
How did you expect Perl to know that your filesystem expects and
accepts shiftjis?
What if you do a chdir() to a filesystem that does not?
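Under 'use encoding', the \x89\x5C literal becomes Perl characters, and
what reaches mkdir() is their UTF-8 representation. If the filesystem is
known (by the programmer, not by Perl) to expect Shift-JIS bytes, the
conversion has to be requested explicitly. A sketch with the core Encode
module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Interpret the literal bytes as Shift-JIS, yielding a Perl
# character string (what 'use encoding' would have done).
my $chars = decode('shiftjis', "kanji_here_\x89\x5C");

# Explicitly encode back to Shift-JIS bytes before touching the
# filesystem -- the encoding is the programmer's assumption.
my $sjis_bytes = encode('shiftjis', $chars);
# mkdir $sjis_bytes;   # would create the Shift-JIS-named directory
```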
Anyway, I don't mean to criticize all the wonderful work that has been
done. This is more a question about future direction, and also a request
to update the documentation: if this kind of thing isn't going to be
fixed soon, it would be nice to add some sample code showing how to
write a proper Unicode-ized Perl script that deals with the file system
properly.
Perl 5.8 has all the bits and pieces required to do whatever you want
with filenames, but it cannot know when to do which conversions. In some
cases and on some OSes you can just convert the characters into whatever
bytes you want (say, a directory name into UTF-8), push them out as-is,
and the system will happily do just that. In other cases you would need
to call a different set of system calls (as on Windows).
Again, I think the right way to do what you want is to create a set of
(operating-system-dependent) modules (some may require XS) that
introduce the necessary filesystem-related variants (mkdir etc.), or
overrides, if one wants those.
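As an illustration of that suggestion, here is a minimal pure-Perl
sketch (no XS needed for this much). The module name
Filesystem::ShiftJIS and the function mkdir_sjis are hypothetical, and
hard-coding Shift-JIS is exactly the kind of application-level decision
Perl itself cannot make:

```perl
package Filesystem::ShiftJIS;   # hypothetical module name
use strict;
use warnings;
use Encode qw(encode);
use Exporter 'import';
our @EXPORT_OK = qw(mkdir_sjis);

# A mkdir variant: takes a Perl character string and writes
# Shift-JIS bytes to the filesystem.  The encoding choice belongs
# to the module (i.e. the application), not to Perl.
sub mkdir_sjis {
    my ($chars, $mask) = @_;
    $mask = 0777 unless defined $mask;
    return CORE::mkdir(encode('shiftjis', $chars), $mask);
}

1;
```

A matching rmdir/opendir/stat family (and, on Windows, variants built on
the wide system calls) would follow the same pattern.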
--
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is
'dead'." -- Jack Cohen