perl-unicode

Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-09 09:27:21
That is helpful information. I have been spending time trying to determine
the local codepage by other means, but I have consistently been told that
this is the wrong approach and that Perl must know it somehow. Getting a
definitive answer is almost as helpful as getting a better answer.

Based on what you are saying, there is no way to ask Perl what the "local 
codepage" is and hence there can be no variant of "Encode" which can be 
told to convert from "local codepage" to UTF8 without having to provide 
the "local codepage" value explicitly. 

Is langinfo(CODESET) from I18N::Langinfo the best way to determine the
local codepage on Unix? Windows seems to reliably include the codepage
number in the locale, but Unix is all over the map.
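
For reference, a minimal sketch of that query, assuming a POSIX-like
system where I18N::Langinfo is available. The string that comes back is
whatever the C library calls the codeset, so its spelling differs between
platforms and it is not guaranteed to be a name that Encode recognises.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the locale named by the environment (LC_ALL / LC_CTYPE / LANG).
# Without this call the process may still be in the default "C" locale.
setlocale(LC_CTYPE, "");

# Ask the C library for its name for the current locale's character set.
# The result is platform-specific, so treat it as a hint, not a standard.
my $codeset = langinfo(CODESET);

print "codeset: $codeset\n";
```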

I greatly appreciate your responses. 




Nicholas Clark <nick(_at_)ccl4(_dot_)org> 
Sent by: Nicholas Clark <nick(_at_)flirble(_dot_)org>
11/09/2005 05:49 AM

To: David Schlegel/Lexington/IBM(_at_)IBMUS
cc: David Graff <graff(_at_)ldc(_dot_)upenn(_dot_)edu>, perl-unicode(_at_)perl(_dot_)org
Subject: Re: Converting between UTF8 and local codepage without specifying
local codepage

On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
And yes, figuring out the local code page on Unix is particularly
squirrelly. The codepage for "fr_CA.ISOxxx" is pretty easy, but what about
"fr_CA" and "fr"? There are so many aliases and rules involved that the
locale is just about useless (in one case you can tell it is Shift-JIS
because the "j" in the locale is capitalized; I wish I was kidding!).

As a number of others have suggested to me, it seems like something basic
that Perl should absolutely know someplace internally. But I have yet to
find an API to get it.

If there was some way to do decode/encode without having to know the local
codepage, that would make me happy too. I just want to get encode/decode to
work.

No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8 bit cleanly but assuming
US-ASCII for 8 bit data. If you C<use locale;> then system locales are used
for case related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables
LC_CTYPE and LC_COLLATE, which sets the behaviour of C functions such as
toupper() and tolower(). Hence Perl *still* has no idea what the local code
page is called, even when it's told to use it. The situation is the same
for any C program.
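
Since Perl will not supply the name itself, the conversion has to be
given one from somewhere. A rough sketch, assuming nl_langinfo(CODESET)
is trusted as the source: local_to_utf8 is a hypothetical helper, not
part of Encode, and it gives up if Encode does not recognise the name
the C library returns.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode encode);
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: convert bytes in the local codepage to UTF-8 bytes.
# $codeset is optional; when omitted it is taken from langinfo(CODESET).
sub local_to_utf8 {
    my ($bytes, $codeset) = @_;

    unless (defined $codeset) {
        setlocale(LC_CTYPE, "");        # adopt the environment's locale
        $codeset = langinfo(CODESET);   # the C library's name for it
    }

    # The C library's spelling may not be one that Encode knows about.
    my $enc = find_encoding($codeset)
        or die "Encode does not know codeset '$codeset'\n";

    my $chars = decode($enc->name, $bytes);   # local bytes -> characters
    return encode("UTF-8", $chars);           # characters -> UTF-8 bytes
}
```

Passing the codeset explicitly, as in local_to_utf8($bytes, "iso-8859-1"),
skips the locale lookup entirely.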

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8 bit data can be converted to Unicode by assuming that it's
ISO-8859-1. Definitely buggy. Not possible to change without breaking
backward compatibility.
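
A small illustration of that assumption, using byte 0xA4 as an arbitrary
example: it is the currency sign in ISO-8859-1 but the euro sign in
ISO-8859-15. The variable names are only for this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte: the currency sign in ISO-8859-1, the euro sign in ISO-8859-15.
my $bytes = "\xA4";

# Implicit upgrade: joining 8 bit data to a Unicode string treats each
# byte as the ISO-8859-1 character with the same number.
my $mixed    = $bytes . "\x{263A}";
my $implicit = ord(substr($mixed, 0, 1));

# Explicit decode: the caller names the codepage, so the byte is mapped
# through that codepage's table instead.
my $explicit = ord(decode("iso-8859-15", $bytes));

printf "implicit U+%04X, explicit U+%04X\n", $implicit, $explicit;
# prints "implicit U+00A4, explicit U+20AC"
```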

Nicholas Clark