perl-unicode

Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-09 03:49:43
On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
And yes, figuring out the local code page on unix is particularly 
squirrelly.  The codepage for "fr_CA.ISOxxx" is pretty easy but what about 
"fr_CA" and "fr" ? There are a lot of aliases and rules involved so that 
the locale is just about useless (in one case you can tell it is Shift-JIS 
because the "j" in the locale is capitalized (I wish I was kidding!)). 

As a number of others have suggested to me it seems like something basic 
that Perl should absolutely know someplace internally. But I have yet to 
find an API to get it. 
If there was some way to do decode/encode without having to know the local 
codepage that would make me happy too. I just want to get encode/decode to 
work. 

No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8 bit cleanly but assuming
US-ASCII for 8 bit data. If you C<use locale;> then system locales are used
for case related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables LC_CTYPE
and LC_COLLATE, which sets the behaviour of C functions such as toupper() and
tolower(). Hence Perl *still* has no idea what the local code page is
called, even when it's told to use it. The situation is the same for any C
program.
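
To make that concrete, here is a rough C sketch of the mechanism described
above. The helper names upper_in_locale() and current_ctype_locale() are
made up for illustration; only setlocale() and toupper() are real library
calls. It shows that the locale changes what toupper() does, while the only
name setlocale() hands back is the locale name itself.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: upper-case one byte under the named locale.
 * Pass "" to adopt whatever the environment (LC_ALL, LC_CTYPE, LANG) says.
 * Returns -1 if the C library rejects the locale name. */
int upper_in_locale(const char *locale_name, int byte)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    /* toupper() needs an unsigned char value (or EOF). */
    return toupper((unsigned char)byte);
}

/* Hypothetical helper: ask which LC_CTYPE locale is in effect.
 * The answer is a locale name such as "C" or "fr_CA", which need not
 * say anything about the code page. */
const char *current_ctype_locale(void)
{
    return setlocale(LC_CTYPE, NULL);
}
```

In the "C" locale, upper_in_locale("C", 0xE9) hands the byte back unchanged,
because only a-z count as lower case there. Under an ISO-8859-1 French
locale the same byte would typically come back as 0xC9, assuming such a
locale is installed on the system.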

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8 bit data can be converted to Unicode by assuming that it's ISO-8859-1.
Definitely buggy. Not possible to change without breaking backward
compatibility.
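
As an illustration of what that assumption amounts to (a sketch only, not
Perl's actual code; the function name latin1_to_utf8() is invented here):
every 8 bit value N is treated as code point U+00NN and then encoded as
UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: convert bytes to UTF-8 on the assumption that they
 * are ISO-8859-1, so byte value N is taken to be code point U+00NN.
 * out must have room for 2 * len bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* US-ASCII range: one byte, unchanged. */
            out[n++] = b;
        } else {
            /* U+0080..U+00FF: two byte UTF-8 sequence. */
            out[n++] = (unsigned char)(0xC0 | (b >> 6));
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}
```

The trouble shows up when the data was not ISO-8859-1 to begin with. Byte
0xA4 is the euro sign in ISO-8859-15, but under this rule it becomes U+00A4
(the generic currency sign) rather than U+20AC.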

Nicholas Clark