perl-unicode

Unicode traps (Was: Unicode aware module)

1999-06-14 01:48:12
On 13 Jun 1999 09:04:36 -0700, Russ Allbery <rra(_at_)stanford(_dot_)edu> said:

> I find myself wanting some clear idea of how much stuff can potentially
> break if a routine not written with use utf8 in mind suddenly finds itself
> operating in that environment.  Putting myself forward as a "typical Perl
> programmer and module writer who's never had to deal with Unicode before,"
> I actually have no idea what exactly utf8 will do to me.  Sure, there's
> documentation, but there's an additional conceptual leap needed too.
> Maybe more specific perltrap-like examples....  (It's possible that I'm
> just out of the loop and someone already has plenty of those.)

Excellent idea. That would be very helpful indeed. I had extreme
difficulty following Ilya's and Sarathy's arguments.

I have two examples.

-----------------------------------------------------------------------

This one is from Jarkko (printed without permission ;-)

        # Only lead bytes \xC2 and \xC3 encode characters that fit in LATIN1.
        s/([\xC2\xC3])([\x80-\xBF])
         /chr(ord($1)<<6&0xC0|ord($2)&0x3F)/egx;

It converts convertible UTF8 to LATIN1 (faster than light), but only if
utf8 is not in effect (or only if some 'use byte' were in effect),
because the pattern has to see the individual bytes of each sequence.
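
To make the trap concrete, here is a minimal sketch of that substitution
applied to the same text twice: once as raw UTF8 bytes, once after
decoding to characters. It assumes a current Perl with the core Encode
module, whose string model differs from the experimental utf8 pragma of
1999. The helper name utf8_bytes_to_latin1 and the sample word are made
up for illustration, and the lead-byte class is limited to \xC2 and \xC3
on the assumption that "convertible" means U+0080..U+00FF.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not from the original post. It expects UTF-8 encoded
# bytes and rewrites every two-byte sequence that encodes U+0080..U+00FF
# (lead bytes \xC2 and \xC3) into the single Latin-1 byte.
sub utf8_bytes_to_latin1 {
    my ($string) = @_;
    $string =~ s/([\xC2\xC3])([\x80-\xBF])
                /chr(ord($1)<<6&0xC0|ord($2)&0x3F)/egx;
    return $string;
}

# Case 1: raw UTF-8 bytes for "caf" followed by e-acute (\xC3\xA9).
my $bytes  = "caf\xC3\xA9";
my $latin1 = utf8_bytes_to_latin1($bytes);
printf "bytes in:  %d elements, out: %d elements, last = 0x%02X\n",
    length($bytes), length($latin1), ord(substr($latin1, -1));

# Case 2: the same text decoded to characters. e-acute is now the single
# character U+00E9, so the byte-pair pattern has nothing to match.
my $chars = decode('UTF-8', $bytes);
my $same  = utf8_bytes_to_latin1($chars);
printf "chars in:  %d elements, out: %d elements, changed = %s\n",
    length($chars), length($same), ($same eq $chars ? 'no' : 'yes');
```

On the byte string the helper shortens five bytes to four and the last
one becomes 0xE9. On the decoded string nothing matches and the string
comes back unchanged, which is the failure mode the example is about.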

-----------------------------------------------------------------------

And this one is from Ulrich Pfeifer (printed without permission ;-)

  static unsigned char *lchars = 
          "abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß";
  static unsigned char *uchars = 
          "ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß";

You'll notice that this isn't Perl. But it is a trap nonetheless, because
it is in WAIT.xs and is used from Perl code to convert LATIN1 to
lowercase. It would produce arbitrary nonsense on data in other encodings.
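
The Perl-level effect of such a table can be sketched as follows. This is
not the code from WAIT.xs, only an assumed equivalent: a tr/// over the
same byte ranges as the two strings above (the name latin1_lc is made
up). Fed LATIN1 bytes it lowercases as intended; fed the UTF8 encoding of
the same word it rewrites a lead byte and leaves an invalid sequence
behind.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a table-driven Latin-1 lowercase routine.
# The ranges mirror the lchars/uchars tables: A-Z, \xC0-\xCF, \xD1-\xD6
# and \xD8-\xDD map to the byte 0x20 above them; everything else is kept.
sub latin1_lc {
    my ($bytes) = @_;
    $bytes =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $bytes;
}

# "ECOLE" with an acute accent on the first E, in two encodings.
my $as_latin1 = "\xC9COLE";        # E-acute is the single byte 0xC9
my $as_utf8   = "\xC3\x89COLE";    # E-acute is the byte pair 0xC3 0x89

# Print input and output as hex bytes so the damage is visible.
for my $input ($as_latin1, $as_utf8) {
    my $output = latin1_lc($input);
    print join(' ', map { sprintf '%02X', ord } split //, $input),
          ' -> ',
          join(' ', map { sprintf '%02X', ord } split //, $output), "\n";
}
```

In the second case 0xC3 falls inside the \xC0-\xCF range and becomes
0xE3, a three-byte lead in UTF8, while 0x89 is left alone, so the result
no longer decodes as UTF8 at all.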

-----------------------------------------------------------------------

Now, is it a property of the code or a property of the string?

Both examples assume a property of the argument they handle: the first
one assumes UTF8, the second LATIN1. So it becomes a property of the
code that it handles only a certain encoding. I get the impression that
no code can ever do the right thing on characters without knowing the
encoding of the bits it receives.
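
One way to picture that conclusion: the encoding travels with the data as
an explicit argument, and the code decodes to characters before doing
anything character-like. The sketch below assumes a current Perl (5.14
or later) with the core Encode module, both of which postdate this
discussion; lc_in_encoding is a made-up name.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # character semantics for lc on all strings
use Encode qw(decode encode);

# Hypothetical helper: lowercase a byte string whose encoding the caller
# states explicitly. Bytes are decoded to characters, lowercased as
# characters, then encoded back to the same encoding.
sub lc_in_encoding {
    my ($bytes, $encoding) = @_;
    my $chars = decode($encoding, $bytes);
    return encode($encoding, lc $chars);
}

# The same word, "ECOLE" with an acute accent on the first E, supplied
# once as Latin-1 bytes and once as UTF-8 bytes.
my $from_latin1 = lc_in_encoding("\xC9COLE",     'ISO-8859-1');
my $from_utf8   = lc_in_encoding("\xC3\x89COLE", 'UTF-8');

# Print the resulting bytes as dot-separated hex values.
printf "%vX\n", $from_latin1;
printf "%vX\n", $from_utf8;
```

The routine itself is the same for both inputs; only the declared
encoding differs. Passing the wrong encoding name would bring the
nonsense back, so the knowledge has to come from somewhere outside the
bytes.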

-- 
andreas
