Re: Decoding more languages

Octavian Rasnita <orasnita(_at_)fcc(_dot_)ro> writes:

   Oh, sorry, I made a mistake when writing the message.
The Romanian language uses ISO-8859-2, not ISO-8859-1.
So the question remains: is it possible to decode a text written in several
languages that use different charsets?


Yes. But perhaps not as easily as you would like.
You need markers which show where the encodings change.
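
As a rough sketch of what such markers could look like: the "{charset}" marker
syntax below is purely hypothetical (invented for this example, not any
standard), and decode_marked is just an illustrative helper name. It assumes
the input is a byte string in which each marker names the encoding of the
bytes that follow it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical marker format: "{charset-name}" switches the encoding
# used for the bytes that follow it.
sub decode_marked {
    my ($octets) = @_;
    my $charset = 'iso-8859-1';    # assumed default until a marker is seen
    my $out     = '';

    # split with a capture group returns: text, marker, text, marker, ...
    my @parts = split /\{([A-Za-z0-9-]+)\}/, $octets;
    for my $i (0 .. $#parts) {
        if ($i % 2) {
            $charset = $parts[$i];                    # odd index: marker name
        }
        else {
            $out .= decode($charset, $parts[$i]);     # even index: raw bytes
        }
    }
    return $out;
}

# Spanish "año" as iso-8859-1 bytes, Romanian "mână" as iso-8859-2 bytes.
my $text = decode_marked("{iso-8859-1}a\xF1o {iso-8859-2}m\xE2n\xE3");
printf "%d characters decoded\n", length $text;
```

The result is a single Perl character string, whatever encodings the pieces
started in.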

For Perl's purposes the language is not important; it is the
"charset" (encoding) that matters. The encoding determines what
the 8-bit bytes (also called octets) in a file mean as characters.
So one "file" can normally only be in one encoding - this includes
the Perl script. Unicode and UTF-8 are designed to avoid this problem
because UTF-8 can represent any Unicode code point and there
are Unicode code points for (almost) all characters used by any
language.
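
To illustrate the point that the encoding decides what an octet means, here is
a small sketch that decodes the same byte under both encodings. The choice of
byte 0xE3 is just an example; it is a-tilde in iso-8859-1 and a-breve in
iso-8859-2.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octet = "\xE3";    # one byte, meaning depends on the encoding

my $as_latin1 = decode('iso-8859-1', $octet);
my $as_latin2 = decode('iso-8859-2', $octet);

# Same byte in, different Unicode code points out.
printf "iso-8859-1: U+%04X\n", ord $as_latin1;    # U+00E3
printf "iso-8859-2: U+%04X\n", ord $as_latin2;    # U+0103
```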

However, older 8-bit encodings like iso-8859-1 and iso-8859-2 pick
different 256-character subsets. If I recall correctly

So you cannot just enter 8-bit string literals in both encodings
into one Perl script, and have Perl know what they are directly.
But you can have:


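use Encode;

my $spanish = "...";     # bytes in iso-8859-1
my $romanian = "...";    # bytes in iso-8859-2
# Note that only one of those can "look right" in an iso-8859-* editor

# Decode each byte string with its own encoding, then join the results
my $combined = Encode::decode('iso8859-1',$spanish).
               Encode::decode('iso8859-2',$romanian);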

You can then "print" the combined string as UTF-8 (or other Unicode
encoding). But you will then need some way of viewing the Unicode
file. An editor which can view the UTF-8 file will probably also
allow you to enter UTF-8 strings directly as well. So you could
write your script in UTF-8 (with the "use utf8;" pragma, so that Perl
reads the literals as UTF-8) and avoid the problem.
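
A minimal sketch of printing a combined string as UTF-8. The sample words
("año" and "mână", written as byte escapes so the script itself stays plain
ASCII) are only illustrative, and sending the output to STDOUT is an
assumption; a file handle opened with the same layer would work the same way.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $spanish  = "a\xF1o";        # "año" as iso-8859-1 bytes
my $romanian = "m\xE2n\xE3";    # "mână" as iso-8859-2 bytes

my $combined = decode('iso-8859-1', $spanish) . ' '
             . decode('iso-8859-2', $romanian);

# Let the output layer do the character-to-UTF-8 conversion.
binmode(STDOUT, ':encoding(UTF-8)');
print $combined, "\n";

# Or build the UTF-8 octets explicitly, e.g. to hand to another module.
my $utf8_octets = encode('UTF-8', $combined);
printf "%d characters, %d UTF-8 octets\n",
    length($combined), length($utf8_octets);    # 8 characters, 11 UTF-8 octets
```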

Note that you cannot (in general) "print" the combined string as
either 8859-1 or 8859-2, because each of them lacks some characters
that the other one has.
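
One way to see this is to ask Encode to fail instead of quietly substituting a
replacement character. This is only a sketch: the two-character test string
(n-tilde, which iso-8859-2 lacks, and a-breve, which iso-8859-1 lacks) and the
%encodable hash are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $combined = "\x{F1}\x{103}";    # n-tilde followed by a-breve

my %encodable;
for my $charset ('iso-8859-1', 'iso-8859-2') {
    my $copy = $combined;    # encode() may modify its input when CHECK is set
    # FB_CROAK makes encode() die on any character the charset cannot hold.
    my $ok = eval { encode($charset, $copy, Encode::FB_CROAK()); 1 };
    $encodable{$charset} = $ok ? 1 : 0;
    print "$charset: ", ($ok ? "ok" : "cannot represent every character"), "\n";
}

# UTF-8 has no such restriction.
my $utf8_octets = encode('UTF-8', $combined);
print "UTF-8: ok, ", length($utf8_octets), " octets\n";
```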


Thank you.


----- Original Message ----- 
From: "Nick Ing-Simmons" <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com>
To: <orasnita(_at_)fcc(_dot_)ro>
Sent: Tuesday, April 13, 2004 11:13 AM
Subject: Re: Decoding more languages

Octavian Rasnita <orasnita(_at_)fcc(_dot_)ro> writes:

Hello all,

I want to convert a text that contains words in several languages (it is a
course for learning a foreign language) to UTF-8.
I have 2 texts, one that contains Romanian and French words, and another
one that contains Romanian and Spanish words.
I have seen that I can Encode::decode('ISO-8859-1', $text) the romanian