That is helpful information. I have been spending time trying to determine
the local codepage by other means, but have consistently been challenged
that this is the wrong approach and that Perl must know somehow. Getting a
definitive answer is almost as helpful as getting a better answer.
Based on what you are saying, there is no way to ask Perl what the "local
codepage" is and hence there can be no variant of "Encode" which can be
told to convert from "local codepage" to UTF8 without having to provide
the "local codepage" value explicitly.
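
To make the point above concrete, here is a minimal sketch of the conversion with the codepage passed in explicitly. The helper name to_utf8 and the sample bytes are hypothetical; decode() and encode() are the standard Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert bytes in a named codepage to UTF-8 bytes.
# The caller has to supply the codepage name; Encode cannot guess it.
sub to_utf8 {
    my ($bytes, $codepage) = @_;
    my $text = decode($codepage, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $text);          # characters -> UTF-8 bytes
}

my $latin1 = "caf\xE9";                     # "cafe" with e-acute in ISO-8859-1
my $utf8   = to_utf8($latin1, 'iso-8859-1');
```

Here $utf8 ends up five bytes long, with the single byte 0xE9 replaced by the two-byte sequence 0xC3 0xA9.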
Is langinfo(CODESET) from I18N::Langinfo the best way to determine the
local codepage on Unix? Windows seems to reliably include the codepage
number in the locale, but Unix is all over the map.
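
For reference, a rough sketch of that approach, assuming a platform where I18N::Langinfo is available (it wraps the C nl_langinfo() call). The helper names local_codeset and local_encoding are hypothetical, and the name the system reports is not guaranteed to be one that Encode recognises, so the lookup may come back undef.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

# Hypothetical helper: ask the C library for the codeset of the
# locale selected by the environment.
sub local_codeset {
    setlocale(LC_CTYPE, '');        # adopt the environment's locale
    return langinfo(CODESET);       # e.g. "UTF-8" or "ISO-8859-1"
}

# Hypothetical helper: map that name to an Encode object, or undef
# if Encode has no encoding or alias under that name.
sub local_encoding {
    my $name = local_codeset();
    return undef unless defined $name && length $name;
    return find_encoding($name);
}

my $enc = local_encoding();
print "codeset: ", local_codeset(), "\n";
print "Encode name: ", ($enc ? $enc->name : '(not recognised)'), "\n";
```

When local_encoding() does return an object, its name can be passed to decode() in place of a hard-coded codepage.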
I greatly appreciate your responses.
Nicholas Clark <nick(_at_)ccl4(_dot_)org>
Sent by: Nicholas Clark <nick(_at_)flirble(_dot_)org>
11/09/2005 05:49 AM
To
David Schlegel/Lexington/IBM(_at_)IBMUS
cc
David Graff <graff(_at_)ldc(_dot_)upenn(_dot_)edu>,
perl-unicode(_at_)perl(_dot_)org
Subject
Re: Converting between UTF8 and local codepage without specifying local
codepage
On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
> And yes, figuring out the local code page on unix is particularly
> squirrelly. The codepage for "fr_CA.ISOxxx" is pretty easy but what
> about "fr_CA" and "fr"? There are a lot of aliases and rules involved so
> that the locale is just about useless (in one case you can tell it is
> Shift-JIS because the "j" in the locale is capitalized (I wish I was
> kidding!)).
>
> As a number of others have suggested to me it seems like something basic
> that Perl should absolutely know someplace internally. But I have yet to
> find an API to get it.
>
> If there was some way to do decode/encode without having to know the
> local codepage that would make me happy too. I just want to get
> encode/decode to work.
No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8 bit cleanly but assuming
US-ASCII for 8 bit data. If you "use locale;" then system locales are used
for case related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables
LC_CTYPE and LC_COLLATE, which sets the behaviour of C functions such as
toupper() and tolower(). Hence Perl *still* has no idea what the local
code page is called, even when it's told to use it. The situation is the
same for any C program.
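
A small sketch of what that looks like from Perl, using POSIX::setlocale(). The helper codeset_from_locale_name is hypothetical: it only picks out a ".codeset" suffix when the locale name happens to carry one, which is exactly the part that cannot be relied on.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: pull the codeset suffix out of a locale name,
# e.g. "fr_CA.ISO8859-1" -> "ISO8859-1". Returns undef when the name
# has no suffix, as with plain "fr_CA" or "fr".
sub codeset_from_locale_name {
    my ($name) = @_;
    return undef unless defined $name;
    return $name =~ /\.([^@]+)/ ? $1 : undef;
}

# Adopt the environment's locale. The return value is only the locale
# name (or undef on failure), not the name of a code page.
my $locale  = setlocale(LC_CTYPE, '');
my $codeset = codeset_from_locale_name($locale);

printf "locale=%s codeset=%s\n",
    $locale  // '(unset)',
    $codeset // '(not in name)';
```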
Unicode tables are used for Unicode data, and there is a (buggy)
assumption that 8 bit data can be converted to Unicode by assuming that
it's ISO-8859-1. Definitely buggy. Not possible to change without breaking
backward compatibility.
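
To illustrate the ISO-8859-1 assumption, here is a minimal sketch. The byte value and the choice of ISO-8859-5 as the "real" codepage are arbitrary examples. Concatenating an 8 bit string with a wide character forces the implicit conversion, while an explicit decode() from a different codepage gives a different character for the same byte.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xE9";                  # one byte of 8 bit data

# Joining with a wide character upgrades the 8 bit data implicitly,
# treating byte 0xE9 as ISO-8859-1, i.e. as U+00E9.
my $mixed    = $byte . "\x{263A}";
my $implicit = ord(substr($mixed, 0, 1));

# If the data was really ISO-8859-5, the same byte means U+0449.
my $explicit = ord(decode('iso-8859-5', $byte));

printf "implicit: U+%04X, explicit: U+%04X\n", $implicit, $explicit;
```

The implicit path yields U+00E9 regardless of what the bytes were meant to be, so data in any other codepage has to go through decode() first.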
Nicholas Clark