perl-unicode

Re: Unicode aware module

1999-06-16 13:52:48
On Sun, 13 Jun 1999 15:46:30 EDT, Ilya Zakharevich wrote:
I mean "eight bits", nothing more, nothing less.  In particular,
when I say "byte", I care nothing about character encodings.

From the point of view of Perl, the width of the "components of a string"
is absolutely irrelevant.  AFAIK, Perl happily works on systems where
U8 is 64 bits wide.  C<use utf8> brings this to our kitchen-top
systems too.

If you mean a particular encoding, then the Perl code *looks*
indifferent to this.  C<chop> removes the last small integer in a
sequence, no matter whether that integer falls in 0..255 or in
0..2**36-1.
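
For illustration, a minimal sketch of that claim (my own example,
written with the C<\x{...}> notation of later Perls, which did not
yet exist at the time of this exchange):

    use strict;
    use warnings;

    my $bytes = "abc\xFE";        # last element is the small integer 0xFE
    chop $bytes;                  # drops exactly one element: "abc"

    my $wide  = "abc\x{263A}";    # last element is 0x263A, well above 255
    chop $wide;                   # likewise drops exactly one element: "abc"

    print length($bytes), " ", length($wide), "\n";   # prints "3 3"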

It seems you want to restrict perl operations to operating _only_
on characters.  The horse has already bolted out of the barn on
that--there's code in modules out there that edit binaries/network
packets using index() and substr().
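
For illustration, one made-up example of that kind of byte-oriented
editing with index() and substr() (not taken from any real module):

    use strict;
    use warnings;

    my $packet = "\x00\x01HDR\xFF\xFE\x00payload";

    # Find a one-octet marker and overwrite the two octets after it.
    my $pos = index $packet, "\xFF";
    substr($packet, $pos + 1, 2) = "\xAB\xCD" if $pos >= 0;

    printf "octet at %d is now 0x%02X\n",
        $pos + 1, ord(substr($packet, $pos + 1, 1));    # octet at 6 is now 0xAB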

See what I wrote.  They will continue to work as before as long as byte
0xFE is represented as the small integer 0xFE.

You want the code to lose this transparency when compiled?  Why?

The transparency you speak of only exists when you're confining
yourself to operating with characters.

I do not follow.  Any small integer in 0..255 *is* a small integer in
0..2**36-1.  Thus "confining yourself to chars" may be a pessimization,
but it will not change the results of the operations.
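
As a concrete (if anachronistic) sketch: C<utf8::upgrade> in later
Perls changes only the internal encoding of a string, and character
operations give the same answers either way:

    use strict;
    use warnings;

    my $narrow = "A\xFEz";         # stored one octet per character
    my $wide   = $narrow;
    utf8::upgrade($wide);          # same characters, utf8 storage

    print "same characters\n" if $narrow eq $wide;     # prints
    print ord(substr($narrow, 1, 1)), " ",
          ord(substr($wide,   1, 1)), "\n";            # 254 254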

Why would they care how Perl represents their text in memory, as long
as all operations work as expected?

They'd care if it was ten times slower.  (But this is a side issue.)

There is no (simple) way to speed it up.

The problem to solve is:

   How do you avoid breaking things that rely on manipulating
   individual bytes when you (globally) switch Perl operations
   to operate on characters instead of bytes?

As I said, there is no such problem.  255 <= 2**36-1.  *This is taken
care of in I/O operations*.

If I understand you correctly, you are suggesting that Perl should
use utf8 (not bytes) as its internal representation for all data.

This sounds like a good thing *in theory*, but in practice, I do
not see how it can be implemented without slowing things down
considerably.  I/O operations will need to copy things around (or
loop through every byte read) to convert between bytes and utf8.
Skipping over characters will not be a simple C<string++>; instead,
it will need to be done with C<string += UTF8SKIP(string)>.

Those point to *very* good reasons why Larry chose not to make
utf8 the default internal representation for data.
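
A rough sketch of that variable-width stepping, in Perl rather than C;
utf8_skip() below is my own stand-in for what a UTF8SKIP-style lookup
derives from the leading byte (real perl handles longer sequences too):

    use strict;
    use warnings;

    # Number of octets in a UTF-8 sequence, judged from its leading byte.
    sub utf8_skip {
        my ($byte) = @_;
        return 1 if  $byte < 0x80;              # 0xxxxxxx
        return 2 if ($byte & 0xE0) == 0xC0;     # 110xxxxx
        return 3 if ($byte & 0xF0) == 0xE0;     # 1110xxxx
        return 4 if ($byte & 0xF8) == 0xF0;     # 11110xxx
        die "not a leading byte";
    }

    my $octets = "caf\xC3\xA9";     # "cafe" with e-acute, as UTF-8 octets
    my ($i, $chars) = (0, 0);
    while ($i < length $octets) {
        $i += utf8_skip(ord(substr($octets, $i, 1)));   # variable step, not ++
        $chars++;
    }
    print "$chars characters in ", length($octets), " octets\n";   # 4 in 5

Walking every string that way, instead of with a plain increment, is
exactly the per-character overhead being worried about above.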


Sarathy
gsar(_at_)activestate(_dot_)com
