perl-unicode

Re: perlunicode comment - when Unicode does not happen

2003-12-22 21:30:04
On Mon, 22 Dec 2003, Ed Batutis wrote:

"Jarkko Hietaniemi" <jhi(_at_)iki(_dot_)fi> wrote in message
news:0C06A42A-34CE-11D8-A034-00039362CB92(_at_)iki(_dot_)fi(_dot_)(_dot_)(_dot_)

You do know that ...
Yes.

If wctomb() or mbtowc() are to be used, then Perl's Unicode must be converted
either to the locale's wide-char form or to its multibyte form. This isn't
trivial, but Mozilla solved this same problem, and it can work portably.
(Are you listening, Brian Stell!) It wasn't easy for them, but they did it.
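
For concreteness, the standard-C route being described looks roughly like
this (a minimal sketch; the file name is illustrative):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Wide characters to the locale's multibyte encoding via the
       standard library. The catch, discussed below, is that what
       wchar_t actually holds is locale- and compiler-dependent. */
    int main(void)
    {
        wchar_t wname[] = L"example";   /* stand-in for a file name */
        char mbname[4096];

        setlocale(LC_CTYPE, "");        /* adopt the user's locale */
        if (wcstombs(mbname, wname, sizeof mbname) == (size_t)-1) {
            perror("wcstombs");         /* unconvertible character */
            return 1;
        }
        printf("%s\n", mbname);
        return 0;
    }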

  You're probably talking about nsNativeCharsetUtils.cpp in Mozilla
(http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsNativeCharsetUtils.cpp).
I'm familiar with that part because I made a few changes there in the last
six months. Mozilla doesn't use the wctomb()/mbtowc() family because it
can't possibly know _what_ 'wchar_t' actually is in the current locale.
Note that 'wchar_t' is not only locale-dependent (i.e. a run-time
dependency) on a single platform but also compiler-dependent. Mozilla's
code works because it relies on iconv(3), wherever it's available, to
convert between the current locale codeset and UTF-16 (used internally by
Mozilla); the wctomb()/mbtowc() family is used only where iconv(3) is not.
Anyway, yes, it's possible. If a user mixes multiple encodings/codesets in
her/his file system, that's not Perl's problem but her/his problem, so I
don't think that's a valid reason for not doing something reasonable.
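
The core of that iconv(3) route fits in a few lines; a sketch (error
handling pared down, buffer sizes assumed adequate):

    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>

    /* UTF-16LE (the internal form) to the current locale's codeset.
       nl_langinfo(CODESET) assumes setlocale(LC_CTYPE, "") was
       called earlier. Returns bytes written, or -1 on failure. */
    static int utf16_to_native(char *in, size_t inleft,
                               char *out, size_t outleft)
    {
        size_t outsize = outleft;
        iconv_t cd = iconv_open(nl_langinfo(CODESET), "UTF-16LE");

        if (cd == (iconv_t)-1)
            return -1;              /* no converter for this pair */
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            iconv_close(cd);
            return -1;              /* unconvertible input, etc. */
        }
        iconv_close(cd);
        return (int)(outsize - outleft);
    }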

Imagine ...

I don't have to imagine. But I think that where a Perl script opens its
files is its own business; I don't see why Perl would have to do anything
in that regard. Even if it did, I don't see that feature as blocking the
simpler one of just converting to/from multibyte before/after a system
call. If I'm dealing with just Japanese on a Japanese system, that's all
I need.

Uhhh... from a Win32 API bug workaround you deduce that ... SJIS should
work?

 Well, Win32 has an API to test whether a backslash byte is really a
backslash or the trail byte of a 'multibyte character'. That is, the code
snippet given by Ed could have been written more robustly with that API.
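
(The API is presumably IsDBCSLeadByte()/IsDBCSLeadByteEx(); a sketch of
the safer scan, stepping with CharNextExA():)

    #include <windows.h>

    /* Find the last path separator in a code-page-encoded ('A' API)
       path without mistaking the trail byte of a double-byte
       character -- 0x5C in some Shift_JIS characters -- for a
       real backslash. */
    const char *find_last_backslash(const char *path)
    {
        const char *p = path, *last = NULL;

        while (*p) {
            if (IsDBCSLeadByteEx(CP_ACP, (BYTE)*p)) {
                p = CharNextExA(CP_ACP, p, 0);  /* skip both bytes */
            } else {
                if (*p == '\\')
                    last = p;
                p++;
            }
        }
        return last;
    }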


Here's my dilemma: UTF-8 doesn't work as an argument to -d, and neither does
Shift-JIS (at least with certain Shift-JIS characters). Those are my only
choices. So you are saying, basically, 'Shift-JIS be damned - write a
module'? I hope you'll understand if I find it hard to sympathize with that.

 Win32 is troublesome because it has two tiers of APIs: code-page-dependent
'A' APIs and Unicode-based 'W' APIs. If the 'W' APIs were guaranteed to be
available everywhere (from Win95 to WinXP), Perl could just convert whatever
legacy encoding it has into UTF-16LE and call the 'W' APIs. Actually, you
don't even have to call the 'W' APIs directly: uses of the 'generic' API
names are translated into 'W' APIs if the UNICODE macro is defined at
compile time.
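
Roughly, each 'generic' name is just a macro over the two entry points; a
minimal sketch (illustrative file name and flags):

    #include <windows.h>

    /* With UNICODE defined, <windows.h> resolves CreateFile to
       CreateFileW; without it, to CreateFileA. TEXT() likewise
       yields L"..." or "..." to match. */
    HANDLE open_log(void)
    {
        return CreateFile(TEXT("log.txt"), GENERIC_READ,
                          FILE_SHARE_READ, NULL, OPEN_EXISTING,
                          FILE_ATTRIBUTE_NORMAL, NULL);
    }
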
Now the question is whether the 'W' APIs are available on old Win95/98/ME.
They are available if MS IE 5.x or later and/or a relatively new version of
MS Word/Office is installed, because those come with the MSLU (Microsoft
Layer for Unicode) DLL. So, for the majority of cases, the above should
work. However, there is a small number of cases where MSLU is not available
on Win 9x/ME; there you have to fall back on the 'A' APIs. Moreover, even
with MSLU installed, on Win 9x/ME you're limited to the character repertoire
of the legacy code page (i.e. Shift_JIS on Japanese Windows, Windows-936 on
Simplified Chinese Windows, Windows-1252 on Western European Windows).
Therefore, a better approach might be to detect the OS and use the 'A' APIs
on Win 9x/ME and the 'W' APIs on Win 2k/XP. That's what Mozilla does.
Unfortunately, this code is not yet deployed in the file I/O part of
Mozilla, which is the cause of several bugs. (See
http://bugzilla.mozilla.org/show_bug.cgi?id=162361) Still another approach
is to build two separate binaries of Win32 Perl, one for Win 9x/ME and the
other for Win 2k/XP.
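
The run-time check is small; a sketch (GetVersionEx()-based, with the
actual A/W dispatch left out):

    #include <windows.h>

    /* Decide once at startup whether the 'W' APIs are natively
       available. NT-based Windows (NT4/2k/XP) implements them;
       Win 9x/ME does not (MSLU aside). */
    static BOOL use_wide_apis(void)
    {
        OSVERSIONINFO vi;

        ZeroMemory(&vi, sizeof vi);
        vi.dwOSVersionInfoSize = sizeof vi;
        if (!GetVersionEx(&vi))
            return FALSE;           /* be conservative on failure */
        return vi.dwPlatformId == VER_PLATFORM_WIN32_NT;
    }

    /* e.g. h = use_wide_apis() ? CreateFileW(wpath, ...)
                                : CreateFileA(apath, ...); */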

  Jungshik