On 13 Jun 1999 09:04:36 -0700, Russ Allbery <rra(_at_)stanford(_dot_)edu>
said:
> I find myself wanting some clear idea of how much stuff can potentially
> break if a routine not written with use utf8 in mind suddenly finds itself
> operating in that environment. Putting myself forward as a "typical Perl
> programmer and module writer who's never had to deal with Unicode before,"
> I actually have no idea what exactly utf8 will do to me. Sure, there's
> documentation, but there's an additional conceptual leap needed too.
> Maybe more specific perltrap-like examples.... (It's possible that I'm
> just out of the loop and someone already has plenty of those.)
Excellent idea. That would be very helpful indeed. I had extreme
difficulties following Ilya's and Sarathy's arguments.
I have two examples.
-----------------------------------------------------------------------
This one is from Jarkko (printed without permission ;-)
s/([\xC0-\xDF])([\x80-\xBF])
/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/egx;
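It converts convertible UTF8 (two-byte sequences whose lead byte is \xC2
or \xC3) to LATIN1 (faster than light), but only if the regex sees bytes:
that is, only if utf8 is not in effect (or only if some 'use byte' were in
effect). For lead bytes above \xC3 the 0xC0 mask silently drops the high
bits, so those characters come out wrong.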
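
To make the byte-level assumption concrete, here is a minimal sketch in
present-day Perl. The helper name utf8_octets_to_latin1 is made up for
illustration and is not Jarkko's code; it narrows the lead-byte class to
\xC2 and \xC3, and it assumes its argument holds raw UTF-8 octets rather
than decoded characters. utf8::decode is the core function that turns
octets into characters in place.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse two-byte UTF-8
# sequences for U+0080..U+00FF into single Latin-1 bytes.
# Assumes $octets holds raw UTF-8 octets, not decoded characters.
sub utf8_octets_to_latin1 {
    my ($octets) = @_;
    $octets =~ s/([\xC2\xC3])([\x80-\xBF])
                /chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))/egx;
    return $octets;
}

my $octets = "caf\xC3\xA9";    # UTF-8 octets for "café": 5 bytes
my $latin1 = utf8_octets_to_latin1($octets);
printf "%d bytes in, %d bytes out, last byte 0x%02X\n",
    length($octets), length($latin1), ord(substr($latin1, -1));
# prints: 5 bytes in, 4 bytes out, last byte 0xE9

# Once the same data has been decoded into characters, the pattern has
# nothing to match: the string already holds U+00E9 as one character.
my $decoded = $octets;
utf8::decode($decoded);
printf "decoded length %d, unchanged by the helper: %s\n",
    length($decoded),
    (utf8_octets_to_latin1($decoded) eq $decoded ? "yes" : "no");
```

The substitution is only meaningful on octets; what it does to a string
depends entirely on whether that string holds bytes or characters.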
-----------------------------------------------------------------------
And this one is from Ulrich Pfeifer (printed without permission ;-)
static unsigned char *lchars =
"abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß";
static unsigned char *uchars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß";
You'll notice this isn't Perl. But it is a trap nonetheless, because
it is in WAIT.xs and is used by Perl code to convert LATIN1 to
lowercase. It would do arbitrary nonsense on data in other encodings.
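
As a rough illustration of that nonsense, here is a Perl sketch of the
same table-driven idea. The names latin1_lc and hex_dump are made up for
this example, and the tr ranges are my reading of the lchars/uchars
tables above (no entries for 0xD0, 0xD7 or 0xDE), not code taken from
WAIT.xs. It assumes one byte per character, exactly as the C tables do.

```perl
use strict;
use warnings;

# Hypothetical Perl analogue of the lchars/uchars tables: lowercase by
# table lookup, assuming every byte is one Latin-1 character.
sub latin1_lc {
    my ($bytes) = @_;
    (my $out = $bytes) =~
        tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $out;
}

# Show a byte string as hex so the result is readable on any terminal.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Latin-1 input "ÉCOLE": 0xC9 becomes 0xE9, as intended.
print hex_dump(latin1_lc("\xC9COLE")), "\n";      # E9 63 6F 6C 65

# UTF-8 octets for the same word: the lead byte 0xC3 is taken for a
# Latin-1 letter and mapped to 0xE3, leaving an invalid UTF-8 sequence.
print hex_dump(latin1_lc("\xC3\x89COLE")), "\n";  # E3 89 63 6F 6C 65
```

Nothing in the routine can tell the two inputs apart; only the caller
knows which encoding the bytes are in.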
-----------------------------------------------------------------------
Now, is it a property of the code or a property of the string?
Both examples assume a property of the argument they handle: the first
assumes UTF8, the second LATIN1. So it becomes a property of the
code that it handles only a certain encoding. I get the impression that
no code can ever do the right thing on characters without knowing the
encoding of the bits it receives.
--
andreas