In 'perlunicode', under 'When Unicode Does Not Happen', there is this
statement (regarding Unicode functionality and the file-system-related
functions and operators; by the way, the author missed '-X'):
Oops.
"One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s)."
This statement seems a bit evasive, especially since no other reasons
are listed. What are the other reasons? Is there a plan to make things
easier in this area? If not, why not?
I don't understand. The statement is "evasive" only because the answer
itself is: what "other reasons"? There are none. The answers really do
depend highly on the operating system and on the file system(s).
This statement also seems like an exaggeration. On Unix-like systems, it
is obvious how to deal with the file system: you convert Unicode to
multibyte.
*Which* multibyte? There are multiple different encodings just for
Unicode, and Perl cannot know which one is being used. Just by virtue of
"being in UNIX" a process cannot start playing Unicode games with
filenames; it must know it is in the right directory / filesystem before
doing that. And _other applications_ must know about it too; otherwise
(say) "foo" will look like "\0f\0o\0o".
(AFAIK) W2K and later _are able_ to use UTF-16LE-encoded Unicode for
filenames, but for backward-compatibility reasons the use of 8-bit
codepages is much more likely.
Apple's HFS+ handles Unicode using _normalized_ (decomposed, a variant
of NFD) UTF-8.
There we have two different Unicode encodings, both in use.
On Windows, there is a choice when dealing with the file system -
multibyte or Unicode - but the new "-C" switch seems to cover that
choice.
How so? The *old* -C switch (as in 5.6) did attempt to cover Windows'
Unicode filename support, but Gurusamy Sarathy deemed the support broken
(one aspect of the brokenness being that it was a global switch) and
unused enough that the -C switch was recycled for completely different
semantics that have nothing to do with filenames, on Windows or anywhere
else.
On other systems - who knows (true!) - but isn't that a porting issue?
Those 'other' people wouldn't be harmed by making things easier for the
rest of us, right?
Any solutions will have to be OS-dependent, quite possibly
application-dependent, and I very much think the solutions do not belong
in the core language.
Dealing with qx and system() also seems less than mysterious to me:
there's no 'wfork', as far as I know, so you use multibyte.
How do you know what kinds of strings are sent to the system? How do
you know what kinds of strings are returned from the system?
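If the application does know - or, more honestly, assumes - what
encoding the system speaks, it can decode command output and encode
arguments explicitly. A minimal sketch with the core Encode module;
the choice of 'UTF-8' here is purely the programmer's assumption, not
anything Perl can detect:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the shell and the commands it runs speak UTF-8.
my $assumed = 'UTF-8';

my $raw   = `echo hello`;            # bytes; encoding unknown to Perl
my $chars = decode($assumed, $raw);  # now a Perl character string

# Going the other way: encode arguments before handing them out.
my $arg = encode($assumed, "hello");
```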
Don't get me wrong - I love the Unicode support in Perl - it is an
amazing effort. But dealing with the file system (and to a lesser extent
qx/system) seems like a big hole to me.
It is a big hole but I think Perl cannot portably do much to fill it.
Perl cannot know what your filesystem can handle.
An example. Say I have a Shift-JIS string and I want to do a mkdir (on a
Shift-JIS-enabled OS, with 5.8.1 build 807):
"a Shift-JIS-enabled OS"? I have no idea what you mean by that. OSes are
somewhat unlikely to assume a character set, since that is rather more
an application-level issue.
    $newdir = "kanji_here_\x89\x5C";
    mkdir $newdir;
The above works the way I'd expect, although

    print(-d $newdir ? 'yes' : 'no');

prints 'no' - oops, a character handling bug! The second byte of the
kanji is a backslash, which apparently confuses Perl. "-d" really ought
to assume the user knows what he is doing
I tend to disbelieve that :-) All "-d" is doing is passing the $newdir
(UTF-8) bytes to stat(2).
and do character-handling based on the
current file system encoding setting (LC_CTYPE or the equivalent).
There is no portable "current file system encoding setting" API.
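The closest portable approximation is the locale's codeset, which
describes the current *locale*, not any particular filesystem. A sketch
using the core I18N::Langinfo module (POSIX-ish systems only):

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# The locale codeset, e.g. "UTF-8" or "ANSI_X3.4-1968".  Note that
# this reflects the user's locale settings; a mounted filesystem can
# expect something entirely different, and Perl cannot tell.
my $codeset = langinfo(CODESET);
print "locale codeset: $codeset\n";
```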
It seems counter-intuitive that this fails:
    use encoding 'shiftjis';
    $newdir = "kanji_here_\x89\x5C";
    mkdir $newdir;
Whoops - I just created a directory whose name is UTF-8 bytes (which do
not assemble into valid Japanese characters in Shift-JIS). I don't think
that's what most users would expect - 'mkdir' could do better than that.
How did you expect Perl to know that your filesystem expects and
accepts shiftjis?
What if you do a chdir() to a filesystem that does not?
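Under 'use encoding', the \x89\x5C literal becomes Perl characters, and
what reaches mkdir() is their UTF-8 representation. If the filesystem is
known (by the programmer, not by Perl) to expect Shift-JIS bytes, the
conversion has to be requested explicitly. A sketch with the core Encode
module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Interpret the literal bytes as Shift-JIS, yielding a Perl
# character string (what 'use encoding' would have done).
my $chars = decode('shiftjis', "kanji_here_\x89\x5C");

# Explicitly encode back to Shift-JIS bytes before touching the
# filesystem -- the encoding is the programmer's assumption.
my $sjis_bytes = encode('shiftjis', $chars);
# mkdir $sjis_bytes;   # would create the Shift-JIS-named directory
```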
Anyway, I don't mean to criticize all the wonderful work that has been
done. This is more a question about future direction, and also a request
to update the documentation: if this kind of thing isn't going to be
fixed soon, it would be nice to add some sample code showing how to
write a proper Unicode-ized Perl script that deals with the file system
properly.
Perl 5.8 has all the bits and pieces required to do whatever you want
with filenames, but it cannot know when to do which conversions. In some
cases and on some OSes you can just convert the characters into whatever
bytes you want (say, a directory name into UTF-8), push them out as-is,
and the system will happily do just that. In other cases you would need
to call a different set of system calls (as on Windows).
Again, I think the right way to do what you want is to create a set of
(operating-system-dependent) modules (some may require XS) that
introduce the necessary filesystem-related variants (mkdir etc.), or
overrides, if one wants those.
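As an illustration of that suggestion, here is a minimal pure-Perl
sketch (no XS needed for this much). The module name
Filesystem::ShiftJIS and the function mkdir_sjis are hypothetical, and
hard-coding Shift-JIS is exactly the kind of application-level decision
Perl itself cannot make:

```perl
package Filesystem::ShiftJIS;   # hypothetical module name
use strict;
use warnings;
use Encode qw(encode);
use Exporter 'import';
our @EXPORT_OK = qw(mkdir_sjis);

# A mkdir variant: takes a Perl character string and writes
# Shift-JIS bytes to the filesystem.  The encoding choice belongs
# to the module (i.e. the application), not to Perl.
sub mkdir_sjis {
    my ($chars, $mask) = @_;
    $mask = 0777 unless defined $mask;
    return CORE::mkdir(encode('shiftjis', $chars), $mask);
}

1;
```

A matching rmdir/opendir/stat family (and, on Windows, variants built on
the wide system calls) would follow the same pattern.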
--
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is
'dead'." -- Jack Cohen