Yes, I've re-read it after your suggestion, but the one area it completely
dances around is the local codepage.
And from my use of Encode::decode and encode, it is the one piece of
information that it seems I am required to know when converting local
strings to UTF8.
The data isn't of "unknown origin" - it just came in from stdin or a local
file that I know is in "local codepage".
And yes, figuring out the local code page on unix is particularly
squirrelly. The codepage for "fr_CA.ISOxxx" is pretty easy but what about
"fr_CA" and "fr" ? There are a lot of aliases and rules involved so that
the locale is just about useless (in one case you can tell it is Shift-JIS
because the "j" in the locale is capitalized; I wish I was kidding!).
As a number of others have suggested to me it seems like something basic
that Perl should absolutely know someplace internally. But I have yet to
find an API to get it.
If there was some way to do decode/encode without having to know the local
codepage, that would make me happy too. I just want to get encode/decode to
work.
David Graff <graff(_at_)ldc(_dot_)upenn(_dot_)edu>
11/07/2005 08:20 PM
To
David Schlegel/Lexington/IBM(_at_)IBMUS
cc
perl-unicode(_at_)perl(_dot_)org
Subject
Re: Converting between UTF8 and local codepage without specifying local
codepage
dschlege(_at_)us(_dot_)ibm(_dot_)com said:
Is there some way to convert from "whatever" the local codepage is to utf8
and back again?
The Encode::encode and decode routines require passing a specific
codepage to do the conversion but finding out what the "local codepage"
is is very tricky across different platforms, particularly UNIX where it
is hard to determine.
Have you looked at the "perllocale" man page? It's not clear to me that
figuring out the "local codepage" (i.e. the "locale") is particularly hard
on unix systems -- that's what the POSIX "locale" protocol is for. (I
don't know how you would figure it out on MS-Windows systems, but that's
more a matter of me being blissfully ignorant of MS software generally.)
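For what it's worth, here is a rough sketch of how the POSIX route might look
from Perl. It assumes the platform provides nl_langinfo(), which the core
I18N::Langinfo module wraps. The helper names local_codeset, local_to_utf8 and
utf8_to_local are made up for illustration, and the ISO-8859-1 fallback is an
arbitrary assumption for the case where Encode does not recognize the name the
system reports.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode encode find_encoding);

# Adopt the locale named by the environment (LC_ALL, LC_CTYPE, LANG).
setlocale(LC_CTYPE, "");

# Hypothetical helper: ask the C library for the locale's codeset and
# fall back to ISO-8859-1 (an assumption) if Encode does not know the name.
sub local_codeset {
    my $codeset = langinfo(CODESET);
    return $codeset
        if defined $codeset && length $codeset && find_encoding($codeset);
    return 'iso-8859-1';
}

# Hypothetical helper: bytes in the local codepage -> UTF-8 bytes.
sub local_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode(local_codeset(), $bytes));
}

# Hypothetical helper: UTF-8 bytes -> bytes in the local codepage.
sub utf8_to_local {
    my ($bytes) = @_;
    return encode(local_codeset(), decode('UTF-8', $bytes));
}

print "codeset in use: ", local_codeset(), "\n";
```

The names returned by nl_langinfo() are not guaranteed to match the names
Encode uses, so the find_encoding() check is there to catch mismatches rather
than to guarantee a correct answer.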
If you're dealing with data of unknown origin, and it's in some clearly
non-ASCII, non-Unicode encoding, then being able to detect its character
set is a speculative matter, especially for text in languages that use
single-byte encodings.
The "Encode::Guess" module can help in detecting any of the Unicode
encodings and most of the multi-byte non-Unicode sets (i.e. the legacy code
pages for Chinese, Japanese and Korean), but it can't help much when it
comes to correctly detecting, say, ISO Cyrillic vs. ISO Greek (vs. Thai vs.
Arabic ...), let alone "Latin1" vs. "Latin2".
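A small sketch of that limitation, assuming only the documented behaviour of
guess_encoding(): it returns an encoding object when exactly one candidate
fits and a plain error string otherwise. The wrapper name try_guess is
hypothetical.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical wrapper: returns (decoder, undef) on success, or
# (undef, message) when the guess fails or is ambiguous.
sub try_guess {
    my ($bytes, @suspects) = @_;
    my $result = guess_encoding($bytes, @suspects);
    return ref($result) ? ($result, undef) : (undef, $result);
}

# UTF-8 bytes: the default suspects are enough to settle on one encoding.
my ($dec, $err) = try_guess("caf\xC3\xA9");
print defined $dec ? "guessed: " . $dec->name . "\n" : "no guess: $err\n";

# A single high byte is valid in both Latin1 and Latin2, so the
# guess cannot choose between them and reports a message instead.
($dec, $err) = try_guess("caf\xE9", qw(iso-8859-1 iso-8859-2));
print defined $dec ? "guessed: " . $dec->name . "\n" : "no guess: $err\n";
```

Since every byte value is legal in most single-byte sets, byte patterns alone
give the module nothing to distinguish them by.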
David Graff