perl-unicode

Re: perlunicode comment - when Unicode does not happen

2003-12-23 07:30:03
On Tue, 23 Dec 2003, Jarkko Hietaniemi wrote:

It works because it relies
on iconv(3) to convert between the current locale codeset and UTF-16
(used internally by Mozilla) if/wherever possible. 'wc*to*mb/mb*to*wc'
is used only where iconv(3) is not available. Anyway, yes, that's
possible.

Note that I'm not *opposed* to someone fixing e.g. Win32 being able to
access Unicode names in NTFS/VFAT.  What I'm opposed to is anyone
thinking there are (a) easy (b) portable solutions.  We are always
talking of very OS- and FS-specific solutions.

  OK. I'm sorry if I misunderstood you. You're absolutely right that
we're talking about very OS/FS-dependent issues.

Win32 and Mac OS X are probably the most "well-off".  For (other) UNIXy
systems, I don't know.

  I guess BeOS is in the same league as Win2k/XP [1] and Mac OS X.
There, everything should be in UTF-8.

If one is happy
with just using UTF-8 filenames, Perl 5.8 already can work fine.  If one

  I wish everybody on Unix were :-)  Fortunately, UTF-8 seems to be
catching on, judging from the 'emergence' of two 'file system
conversion' tools. See, for instance,
<http://osx.freshmeat.net/releases/144059/>.
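
  To make that concrete, this is all that "work fine" needs today if
one sticks to UTF-8: encode the character string to UTF-8 octets before
handing it to the filesystem, and decode what readdir() returns. (Just
a sketch of the status quo, nothing proposed; the filename is made up.)

    use strict;
    use warnings;
    use Encode qw(encode decode);

    my $name  = "r\x{e9}sum\x{e9}.txt";      # filename as a character string
    my $bytes = encode('UTF-8', $name);      # UTF-8 octets handed to the OS

    open my $fh, '>', $bytes or die "open: $!";
    print $fh "hello\n";
    close $fh;

    opendir my $dh, '.' or die "opendir: $!";
    for my $entry (readdir $dh) {
        my $chars = decode('UTF-8', $entry); # back to characters for display
        # ... use $chars as an ordinary Perl character string ...
    }
    closedir $dh;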

If a user mixes multiple encodings/code sets in her/his file
system, that's not Perl's problem but her/his problem, so I don't
think that's a valid reason for not doing something reasonable.

wants to use locales and especially some non-8-bit locales, well, Perl
currently most definitely does not switch its "filename encoding" based
on locales.  Personally I think that's a daft idea... at least without
a new specific (say) LC_FILENAME control; overloading the poor LC_CTYPE
sounds dangerous.

 I don't see how introducing a new LC_* would help here. Whether
it's LC_CTYPE or LC_FILENAME, the problem is still there.

Perhaps we need a pragma to indicate which of the following is to be
assumed about the file system character encoding: 'locale', 'native',
'unicode', or 'user-specified'. On Unix, 'locale' and 'native' would be
identical, both meaning that Perl should convert its internal Unicode
to and from the codeset returned by 'nl_langinfo(CODESET)'. Directly
inspecting LC_CTYPE or other environment variables is a BAD idea and
should be used as a fallback only where nl_langinfo(CODESET) is not
supported. When converting to and from the 'native' encoding, Perl
should rely on the iconv(3) available on the system instead of its
internal 'encoding' converter.  However, there's a problem here. A lot
of system admins on commercial Unix install only the minimal set of
iconv(3) modules. See
<http://bugzilla.mozilla.org/show_bug.cgi?id=202747#c18>. Therefore,
perhaps we should first try iconv(3) and then fall back to using
Perl's 'encoding'. There are other problems with using iconv(3)
(e.g. <http://bugzilla.mozilla.org/show_bug.cgi?id=197051>).
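
  Roughly what I have in mind, as a sketch only (to_fs_bytes() is a
made-up name, and Text::Iconv is the CPAN binding to the system's
iconv(3); neither is part of any existing or proposed Perl interface):

    use strict;
    use warnings;
    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(encode);

    setlocale(LC_CTYPE, '');                 # pick up the user's locale

    sub to_fs_bytes {
        my ($chars) = @_;
        my $codeset = langinfo(CODESET);     # e.g. "UTF-8", "EUC-KR"

        # First try the system iconv(3) via Text::Iconv, if it is
        # installed and its iconv knows this codeset.
        my $bytes = eval {
            require Text::Iconv;
            Text::Iconv->new('UTF-8', $codeset)
                       ->convert(encode('UTF-8', $chars));
        };
        return $bytes if defined $bytes;

        # Otherwise fall back to Perl's own 'encoding'/Encode tables.
        return encode($codeset, $chars);
    }

    my $fs_name = to_fs_bytes("r\x{e9}sum\x{e9}.txt");
    open my $fh, '>', $fs_name or die "open: $!";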

  'unicode' on Unix means 'utf8'.  'user-specified' means whatever a
user wants to use. On Windows, 'locale' means using the code page of
the current system locale. 'native' is UTF-16LE (but on Win 9x/ME, the
character repertoire would be limited to that of the system codepage).
The same is true of 'unicode'.  On Mac OS X, 'locale', 'native' and
'unicode' would all mean the same thing (UTF-8). As for
'normalization', I have to think more about it. And so on...  I've
just been thinking aloud, so you'll have to bear with some incoherence.
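
  (To pin down the Unix case above in code form: the mode names below
are purely hypothetical, nothing like them exists in Perl today.)

    use strict;
    use warnings;
    use I18N::Langinfo qw(langinfo CODESET);

    sub fs_encoding_for {
        my ($mode, $user_enc) = @_;
        return langinfo(CODESET) if $mode eq 'locale'
                                 or $mode eq 'native';    # identical on Unix
        return 'UTF-8'           if $mode eq 'unicode';   # 'unicode' == utf8
        return $user_enc         if $mode eq 'user-specified';
        die "unknown file system encoding mode '$mode'";
    }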

   Jungshik