perl-unicode

Re: Converting string to UTF-16LE

2004-03-02 12:30:04
On Tue, Mar 02, 2004 at 05:25:21PM +0100, Robert Allerstorfer wrote:
: > On Mon, 01 Mar 2004 20:55:14 +0000 Nick Ing-Simmons
: > <nick(_at_)ing-simmons(_dot_)net> wrote: 
: 
: > lib/unicore/To/Upper.pl includes a toupper mapping of ñ to Ñ properly.
: 
: While you are paying attention to the
: 
: unicore/To/Upper.pl
: 
: file, you may also want to note that I have found a very nasty bug
: related to this file, as well as to the files
: 
: unicore/To/Lower.pl
: unicore/To/Fold.pl
: 
: which I have reported using the suggested perlbug tool back on January
: 6, but nobody has yet responded to it. The bug report is still online
: at
: 
: http://bugs6.perl.org/rt3/Ticket/Display.html?id=24826
: 
: In short, I discovered that these files cause *all* string operations
: on utf8 strings to be very slooow! Removing the
: 
: '%utf8::ToSpecFUNCTION = (...)'
: 
: definitions from these files increased the speed on a simple test
: regex with Perl 5.8.2 from 77 to 9 seconds on a Windows machine and
: from 775 s to 66 s on a (slower) BSD/OS system!!!

Offhand (and I'm just guessing here from the contents of the hashes),
somebody has overgeneralized somewhere, and applied language-specific
transformations when they're not desired, with the result that utf8
strings have to be prepared to change lengths at various times.  And
changing string lengths is always going to slow you down compared to
doing things in place.
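
To make the length-change concern concrete, here is a minimal sketch. It is
written in Python rather than Perl and is not taken from the thread or from
the unicore tables; it only assumes that Python's str.upper() applies
Unicode full case mappings, which is enough to show that uppercasing can
change both the codepoint count and the UTF-8 byte count of a string. The
helper name upper_report is hypothetical.

```python
# Illustrative only: this uses Python's str.upper(), not Perl's
# unicore/To/Upper.pl tables. The sample characters are chosen by the editor.
samples = ["\u00f1", "\u00df", "\ufb01", "\u0131"]  # n-tilde, sharp s, fi ligature, dotless i


def upper_report(s):
    """Return (uppercased, codepoints before, codepoints after,
    UTF-8 bytes before, UTF-8 bytes after)."""
    u = s.upper()
    return (u, len(s), len(u), len(s.encode("utf-8")), len(u.encode("utf-8")))


for s in samples:
    print(repr(s), upper_report(s))
```

The n-tilde case keeps its length, while the other three grow or shrink in
codepoints, bytes, or both, so an uppercase operation cannot in general be
done in place on a utf8 buffer.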

If this is somewhere near the mark, then the appropriate solution is
for Perl 5 to distinguish levels of Unicode support like Perl 6 will:

    Level 0: a character is a byte
    Level 1: a character is a codepoint (no canonicalization)
    Level 2: a character is a grapheme (language independent)
    Level 3: a character is a psychological unit for a particular language
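
As a rough illustration of how Levels 0 to 2 can disagree about the length
of the same text, here is a small Python sketch (again not Perl, and not
anything Perl 5 or Perl 6 actually does). The function name level_lengths is
hypothetical, and the Level 2 count is only an approximation: it skips
combining marks instead of implementing real grapheme cluster rules. Level 3
is left out because it needs to know the language.

```python
import unicodedata


def level_lengths(s):
    """Return the length of s at Level 0, Level 1 and (approximately) Level 2."""
    level0 = len(s.encode("utf-8"))  # Level 0: bytes of the UTF-8 encoding
    level1 = len(s)                  # Level 1: codepoints, no normalization
    # Level 2 (approximation): count only codepoints that are not combining
    # marks, so a base letter plus its accents counts once.
    level2 = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return level0, level1, level2


decomposed = "n\u0303"  # LATIN SMALL LETTER N + COMBINING TILDE
precomposed = unicodedata.normalize("NFC", decomposed)  # single codepoint U+00F1

print(level_lengths(decomposed))
print(level_lengths(precomposed))
```

Both strings display as the same character, yet they differ in byte count
and codepoint count and agree only at the approximate grapheme level.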

Anyway, sounds to me like someone has mixed Level 3 support into levels
1 and 2.  If that's the case, I think it's a fundamental mistake.  Perl 5
should pick a level to default to, and stick with it.  Going to other
levels should require explicit lexically-scoped declaration to minimize
magical action at a distance.  In particular, decent level 3 support
requires knowing the language you're working under, and is expected
to run a lot slower in some cases.

There's also some performance loss between levels 1 and 2.  That may be
the case here, if the tables in question are deemed to be "language
independent" (probably in the sense of "best guess at language
dependent behavior in the absence of actually knowing the language").

Of course, I could be all wet, and it's just a normal bug.  Needs to
be researched, and I'd love to figure it out myself, but unfortunately
I am required to delegate that sort of stuff these days.

Larry