use Encode 'from_to';
my $orjan = 'ÖRJAN';
my $lundstrom = 'LUNDSTRÖM';
print $orjan . ' ' . $lundstrom . "\n";
from_to $orjan,'latin1','utf-8';
from_to $lundstrom,'latin1','utf-8';
It is my understanding that from_to is the wrong thing to use here. The
Your understanding is correct.
- You obtain some character data, for example by putting it literally in
your script. If the script itself is in utf-8, it should contain
"use utf8;". If not (like your script), perl will assume ISO-8859-1.
Or "use encoding 'whatever';", and Perl actually assumes whatever is
your native encoding, be it ISO 8859-1, or -2, or CP1252, or EBCDIC,
or whatever.
A different source of data would be reading from a file, which is
opened with the correct encoding specified (see Andreas' reply).
A third source would be by reading a file or a socket and obtaining raw
bytes which can be interpreted as characters using decode().
In this case, e.g.:
$lundstrom = decode("latin-1", $lundstrom);
- Manipulate the data using perl string operations
- Output the data to a filehandle which is opened using the correct
encoding.
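The three steps above can be sketched roughly as follows. This is only an
illustration under a few assumptions: the input is taken to be ISO-8859-1
bytes (written with a \xD6 escape so the example does not depend on how the
script file itself is saved), and an in-memory filehandle stands in for a
real output file. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw Latin-1 bytes, as they might arrive from a file or socket.
# \xD6 is the ISO-8859-1 byte for "Ö".
my $bytes = "LUNDSTR\xD6M";

# 1. Obtain character data: decode the bytes into a character string.
my $name = decode('iso-8859-1', $bytes);

# 2. Manipulate with ordinary string operations, which now see characters.
my $pretty = ucfirst lc $name;

# 3. Output through a handle opened with the target encoding.
#    (In-memory handle used only to keep the sketch self-contained.)
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die "open: $!";
print {$fh} $pretty, "\n";
close $fh or die "close: $!";

printf "%d characters in, %d bytes out\n", length($pretty), length($out);
```

With a real file you would open it by name with the same
'>:encoding(UTF-8)' layer; nothing else in the script needs to know
about the output encoding.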
The from_to function looks enticing, particularly because everyone has
heard about perl and utf8 strings, but it's almost always the wrong
thing to use. And perl does not use utf8, but supports Unicode character
semantics.
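A small sketch of what "character semantics" means in practice, assuming
only the core utf8::upgrade and utf8::is_utf8 functions (the variable names
are invented for the example): the same string can be stored internally in
two different ways, and string operations are not supposed to care.

```perl
use strict;
use warnings;

my $bytewise = "\xD6RJAN";      # "ÖRJAN", stored one byte per character
my $upgraded = $bytewise;
utf8::upgrade($upgraded);       # same characters, internal storage now UTF-8

printf "same length:    %s\n", length($bytewise) == length($upgraded) ? 'yes' : 'no';
printf "compare equal:  %s\n", $bytewise eq $upgraded ? 'yes' : 'no';
printf "storage differs: %s\n",
    (!utf8::is_utf8($bytewise) && utf8::is_utf8($upgraded)) ? 'yes' : 'no';
```

The internal flag is an implementation detail; what matters to the script
is whether a string holds characters or encoded bytes.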
At least in the current Encode doc there is a section:
B<CAVEAT>: The following operations look the same but are not quite so;
from_to($data, "iso-8859-1", "utf8"); #1
$data = decode("iso-8859-1", $data); #2
Both #1 and #2 make $data consist of a completely valid UTF-8 string
but only #2 turns utf8 flag on. #1 is equivalent to
$data = encode("utf8", decode("iso-8859-1", $data));
See L</"The UTF-8 flag"> below.
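To make the caveat concrete, here is a minimal sketch comparing the two
operations on the same input. It assumes ISO-8859-1 input bytes written
with an escape, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

my $latin1 = "\xD6RJAN";        # "ÖRJAN" as ISO-8859-1 bytes

# #1: from_to converts in place and leaves a string of UTF-8 bytes.
my $octets = $latin1;
from_to($octets, 'iso-8859-1', 'utf8');

# #2: decode returns a character string.
my $chars = decode('iso-8859-1', $latin1);

printf "from_to: length %d, utf8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "decode:  length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';

# The equivalence stated in the documentation for #1.
print "from_to matches encode(decode(...))\n"
    if $octets eq encode('utf8', decode('iso-8859-1', $latin1));
```

After from_to, perl sees six separate bytes, so length, uc, regexes and
so on will treat the two bytes of the encoded "Ö" as two characters. After
decode, the string has five characters and behaves as expected.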
--
Bart.
--
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen