Re: UTF-8 case conversion

On Wed, 3 Sep 2003, Bart Schuller wrote:

On Wed, Sep 03, 2003 at 01:05:21PM +0200, sigfrid(_dot_)lundberg(_at_)lub(_dot_)lu(_dot_)se wrote:

use Encode 'from_to';

my $orjan = 'ÖRJAN';
my $lundstrom = 'LUNDSTRÖM';

print $orjan . ' ' . $lundstrom . "\n";

from_to $orjan,'latin1','utf-8';
from_to  $lundstrom,'latin1','utf-8';


It is my understanding that from_to is the wrong thing to use here. The
variables $orjan and $lundstrom contain perl strings containing perl
characters with unicode semantics.


I think I'm starting to understand... Terry Jones pointed out the problem
with from_to. I removed those calls, added the use utf8; pragma, and
'rewrote' my program in utf-8 using the recode program. Then it started to
work as expected.

from_to is used to convert bytes in one encoding into bytes in another
encoding. Neither before nor after this operation are these bytes
characters as far as perl is concerned. So you should not use perl-level
operations like uc, lc, regexes or substr on them.
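
A minimal sketch of that difference, assuming a perl with the Encode
module and no locale or unicode_strings pragma in effect. The variable
names are made up for illustration, and the input is written with an
escape so the example does not depend on how the script file is encoded:

```perl
use strict;
use warnings;
use Encode qw(from_to decode);

# Latin-1 bytes for "örjan" (0xF6 is "ö" in ISO-8859-1).
my $bytes = "\xF6rjan";

# from_to converts in place, bytes to bytes. The result is the UTF-8
# byte sequence 0xC3 0xB6 followed by "rjan", not a character string.
from_to($bytes, 'latin1', 'utf-8');
printf "after from_to: %d bytes\n", length $bytes;

# uc sees six separate byte values. The two bytes of the encoded "ö"
# are not treated as one letter, so only "rjan" is upper-cased.
my $upper_bytes = uc $bytes;

# decode turns the same bytes into a five-character string, and uc
# then maps "ö" (U+00F6) to "Ö" (U+00D6).
my $upper_chars = uc decode('UTF-8', $bytes);
printf "after decode:  %d characters\n", length $upper_chars;
```

The byte version still holds 0xC3 0xB6 for the first letter, while the
decoded version starts with the single character U+00D6.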

The way it all is supposed to work is:

- you obtain some character data, for example by putting it literally in
  your script. If the script itself is in utf-8, it should contain
  "use utf8;". If not (like your script), perl will assume ISO-8859-1.


Exactly, this is my working case above.

  A different source of data would be reading from a file, which is
  opened with the correct encoding specified (see Andreas' reply).


I tested that. I created a file with fake personal names, first
in iso-8859-1, and just read it. That didn't work well. Then I converted
it to utf-8, and with

binmode STDOUT, ":utf8";
binmode STDIN, ":utf8";

it worked.

  A third source would be reading a file or a socket and obtaining raw
  bytes, which can be interpreted as characters using decode().

- Manipulate the data using perl string operations

- Output the data to a filehandle which is opened using the correct
  encoding.
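
The three steps above can be sketched roughly as follows. This is only an
illustration: in-memory filehandles stand in for real files so that it
runs on its own, and the variable names are hypothetical. With real files
you would pass a filename instead of a scalar reference.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: the UTF-8 bytes of "örjan lundström\n".
my $input_bytes  = "\xC3\xB6rjan lundstr\xC3\xB6m\n";
my $output_bytes = '';

# Step 1: obtain character data. The :encoding layer decodes on read.
open my $in, '<:encoding(UTF-8)', \$input_bytes
    or die "open for reading: $!";

# Step 3 (set up early): a handle whose layer encodes on write.
open my $out, '>:encoding(UTF-8)', \$output_bytes
    or die "open for writing: $!";

while (my $line = <$in>) {
    # Step 2: manipulate with ordinary perl string operations.
    print {$out} uc $line;
}

close $in;
close $out or die "close: $!";
```

After this runs, $output_bytes holds the UTF-8 bytes of the upper-cased
line, with each "Ö" encoded as 0xC3 0x96. No from_to call is involved
anywhere.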

The from_to function looks enticing, particularly because everyone has
heard about perl and utf8 strings, but it's almost always the wrong
thing to use. At the language level perl does not deal in utf8; it
supports unicode character semantics.
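
One way to see this, as a small sketch with made-up variable names: the
same character string has one length in characters, while the number of
bytes depends entirely on which encoding you pick at output time.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build the character string "ÖRJAN" from Latin-1 bytes.
my $name = decode('iso-8859-1', "\xD6RJAN");

# Character semantics: five characters, and lc knows about "Ö".
my $char_count = length $name;
my $lower      = lc $name;

# The byte count only appears once an encoding is chosen.
my $latin1_len = length encode('iso-8859-1', $name);
my $utf8_len   = length encode('UTF-8',      $name);
my $utf16_len  = length encode('UTF-16BE',   $name);

print "$char_count characters; $latin1_len, $utf8_len and $utf16_len bytes\n";
```

Here the string is five characters long, but it takes 5 bytes as
ISO-8859-1, 6 as UTF-8 and 10 as UTF-16BE.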


And the problem is to grasp the difference! Thanks a lot all of you!

Sigge

--
Bart.