perl-unicode

perlunicode comment - when Unicode does not happen

2003-12-22 13:30:05
In 'perlunicode', under 'When Unicode Does Not Happen', there is this
statement about Unicode functionality and the file system-related functions
and operators (BTW, the author missed the '-X' file tests):

"One reason why Perl does not attempt to resolve the role of Unicode in this
cases is that the answers are highly dependent on the operating system and
the file system(s)."

This statement seems a bit evasive - especially since there are no other
reasons listed. What are the other reasons? Is there a plan to make things
easier in this area? Or not? If not, why?

This statement also seems like an exaggeration. On Unix-like systems, it is
obvious how to deal with the file system: you convert Unicode to the
locale's multibyte encoding. On Windows, there is a choice when dealing with
the file system (multibyte or Unicode), but the new "-C" switch seems to
cover that choice. On other systems, who knows (true!), but isn't that a
porting issue? Those 'other' people wouldn't be harmed by making things
easier for the rest of us, right? Dealing with qx and 'system()' also seems
less than mysterious to me: there's no 'wfork', as far as I know, so you use
multibyte.
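
To make the "convert Unicode to multibyte" idea concrete, here is a minimal
sketch of doing the conversion by hand with the Encode module. The helper
name fs_bytes and the hard-coded 'shiftjis' setting are my own assumptions
for illustration; a real script would take the encoding from LC_CTYPE or the
equivalent, and Perl itself does none of this for you.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Assumption: the file system wants Shift-JIS bytes. A real script would
# derive this from LC_CTYPE (or the Windows code page) instead.
my $fs_encoding = 'shiftjis';

# Hypothetical helper: character string in, file system bytes out.
# FB_CROAK makes unmappable characters die instead of turning into '?'.
sub fs_bytes {
    my ($chars) = @_;
    return encode($fs_encoding, $chars, Encode::FB_CROAK);
}

# Start from characters, as a Unicode-aware script would.
my $chars = decode('shiftjis', "kanji_here_\x89\x5C");
my $bytes = fs_bytes($chars);

# Work in a scratch directory so the sketch cleans up after itself.
my $scratch = tempdir(CLEANUP => 1);
my $path    = "$scratch/$bytes";

# Both calls get the same explicitly encoded byte string.
mkdir $path or warn "mkdir failed: $!";
print((-d $path) ? "yes\n" : "no\n");
```

The point of the design is that every file system call sees bytes that the
script encoded on purpose, so nothing depends on how Perl happens to store
the string internally. Whether the mkdir succeeds still depends on the file
system accepting those bytes.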

Don't get me wrong - I love the Unicode support in Perl - it is an amazing
effort. But dealing with the file system (and to a lesser extent qx/system)
seems like a big hole to me. For example, say I have a Shift-JIS string and
I want to do a mkdir (on a Shift-JIS-enabled OS with 5.8.1 build 807):

$newdir = "kanji_here_\x89\x5C";
mkdir $newdir;

The above works the way I'd expect, although

print (-d $newdir ? 'yes' : 'no');

prints 'no' - oops, a character-handling bug! The second byte of the kanji
(0x5C) is a backslash, which apparently confuses Perl. "-d" really ought to
assume the user knows what he is doing and do its character handling based
on the current file system encoding setting (LC_CTYPE or the equivalent).
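
Here is a small sketch of why I suspect the trailing byte. It only compares
a byte-oriented check with a character-oriented one on the same name; it
does not reproduce whatever "-d" does internally, and treating the name as
Shift-JIS is my assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $newdir = "kanji_here_\x89\x5C";

# Byte view: the name appears to end in a backslash (0x5C).
my $ends_in_backslash_bytes = ($newdir =~ /\\\z/) ? 1 : 0;

# Character view: decode first (Shift-JIS is an assumption here), and the
# same test sees one kanji at the end instead.
my $chars = decode('shiftjis', $newdir);
my $ends_in_backslash_chars = ($chars =~ /\\\z/) ? 1 : 0;

print "bytes: $ends_in_backslash_bytes, chars: $ends_in_backslash_chars\n";
```

Any code that looks at the name one byte at a time will see a path
separator where a character-aware check sees none.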

It seems counter-intuitive that this fails:

use encoding 'shiftjis';
$newdir = "kanji_here_\x89\x5C";
mkdir $newdir;

Whoops - I just created a directory whose name is made of UTF-8 bytes (which
don't assemble into valid Japanese characters on a Shift-JIS file system). I
don't think that's what most users would expect - and 'mkdir' could do
better than that.
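
A rough sketch of what I believe is going on, using Encode directly instead
of the pragma. The decode() call stands in for what 'use encoding' does to
the literal; that equivalence is my assumption, not something I have
verified in the source.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $sjis_bytes = "\x89\x5C";   # the kanji as Shift-JIS: two bytes

# Assumption: this is roughly what 'use encoding' does to the literal,
# i.e. it turns the two bytes into a single character.
my $chars = decode('shiftjis', $sjis_bytes);

# A non-Latin-1 character is stored internally as UTF-8, and those are
# the bytes that reach the operating system if nothing re-encodes them.
my $utf8_bytes = encode('UTF-8', $chars);

printf "Shift-JIS: %d bytes (%vX)\n", length($sjis_bytes), $sjis_bytes;
printf "UTF-8:     %d bytes (%vX)\n", length($utf8_bytes), $utf8_bytes;
```

The two byte strings differ in both length and content, so a Shift-JIS file
system cannot display the second one as the intended kanji.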

Anyway, I don't mean to criticize all the wonderful work that has been done.
This is more a question about future direction and also a request to update
the documentation. If this kind of thing isn't going to be fixed soon, it
would be nice to add some sample code showing how to write a proper
Unicode-ized Perl script that deals with the file system properly.

Regards,

=ED