perl-unicode

Re: Unicode aware module

1999-06-13 12:54:26
On Sun, Jun 13, 1999 at 12:10:20PM -0700, Gurusamy Sarathy wrote:
No.  My point is different: *There is no need to know* whether the
program "wants" to treat it as bytes or characters.  Perl's strings are
sequences of small integers.

Not currently, no.  They are sequences of *bytes* (even when
C<use utf8> is in effect).  It is the *ops* that choose to treat
them as either bytes or encoded characters, depending on whether
C<use utf8> is in effect.

Absolutely.  And my point is that the current implementation is broken
beyond recognition ;-).  I'm discussing how "a clean" implementation
may (=should ;-) behave.  My point is that such a clean implementation
is just a tiny modification of what we have now.

[My main metaobjection is to using the word "character" to denote "a
 small integer with or without cultural data associated with it".]

Correct, I/O doesn't have any notions of encoding attached to it
currently.

And this should change, to give Perl transparent access to
"international" data.

                                                    Now what do you
mean when you say "byte"?

I mean "eight bits", nothing more, nothing less.  In particular,
when I say "byte", I care nothing about character encodings.

From the point of view of Perl, the width of "components of a string" is
absolutely irrelevant.  AFAIK, Perl happily works on systems where U8
is 64 bits wide.  C<use utf8> brings this to our kitchentop
systems too.

If you mean a particular encoding, then the Perl code *looks*
indifferent to this.  C<chop> removes the last small integer in a
sequence no matter whether this last integer is between 0..255 or
0..2**36-1.
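
A minimal sketch of the C<chop> point, assuming a present-day Perl (5.8
or later, which postdates this thread).  The variable names are made up
for illustration; the only claim is that C<chop> drops one element
whether or not that element fits in 0..255.

```perl
use strict;
use warnings;

# One string whose elements all fit in 0..255, one with a larger element.
my $narrow = chr(0x41) . chr(0xFE);
my $wide   = chr(0x41) . chr(0x263A);

chop $narrow;   # removes the element 0xFE
chop $wide;     # removes the element 0x263A, however it is stored internally

# Both strings are now the single element 0x41.
printf "%d %d\n", length $narrow, length $wide;
```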

It seems you want to restrict perl operations to operating _only_
on characters.  The horse has already bolted out of the barn on
that--there's code in modules out there that edit binaries/network
packets using index() and substr().

See what I wrote.  They will continue to work as before as long as byte
0xFE is represented as a small integer 0xFE.

           You want the code to lose this transparency when compiled?
Why?

The transparency you speak of only exists when you're confining
yourself to operating with characters.

I do not follow.  Any small integer in 0..255 *is* a small integer in
0..2**36-1.  Thus "confining yourself to chars" may be a pessimization,
but it will not change the results of the operations.
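
A rough illustration of this claim, again assuming a present-day Perl
(5.8 or later) rather than the implementation under discussion.  It
uses C<utf8::upgrade> to force a copy of a byte string into the wider
internal representation, then checks that C<index> and C<substr> give
the same results on both copies.  The packet contents are invented for
the example.

```perl
use strict;
use warnings;

# A "binary" string: every element is in 0..255.
my $packet = pack 'C*', 0x01, 0xFE, 0x00, 0xFF, 0xFE;

# Same elements, but stored in Perl's wider internal representation.
my $wide = $packet;
utf8::upgrade($wide);

# index() reports the same position in both copies.
my $pos_packet = index $packet, chr(0xFE);
my $pos_wide   = index $wide,   chr(0xFE);

# Editing one element via substr() behaves the same in both copies.
substr($packet, $pos_packet, 1) = chr(0x7F);
substr($wide,   $pos_wide,   1) = chr(0x7F);

print "positions: $pos_packet $pos_wide\n";
print $packet eq $wide ? "same\n" : "different\n";
```

The wider copy may well be slower to work on, which is the
"pessimization" mentioned above, but the results do not change.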

Why would they care how Perl represents their text in memory, as long
as all operations work as expected?

They'd care if it was ten times slower.  (But this is a side issue.)

There is no (simple) way to speed it up.

The problem to solve is:

   How do you avoid breaking things that rely on manipulating
   individual bytes when you (globally) switch Perl operations
   to operate on characters instead of bytes?

As I said, there is no such problem.  255 <= 2**36-1.  *This is taken
care of in I/O operations*.
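
One way to picture "taken care of in I/O operations", sketched with
the PerlIO layers of a present-day Perl (5.8 or later).  These layers
did not exist when this was written, so this is an assumption about
how the idea can look in practice, not the design proposed here.  The
string holds small integers; the conversion to and from bytes happens
only at the filehandle.

```perl
use strict;
use warnings;

# Two elements, one of them above 255.
my $text = chr(0x263A) . chr(0xFE);

# Write through an encoding layer into an in-memory file.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out;

# Read it back raw: every element is now a byte in 0..255.
open my $raw, '<:raw', \$buffer or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the same encoding layer: the original elements return.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $decoded = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d elements in memory\n",
    length $bytes, length $decoded;
```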

[...things about I/O that are irrelevant to the discussion...]

I'm not worried about I/O currently, because the internals always
treat I/O as bytes, and there is no ambiguity.

As I said, moving all the encoding issues (except width 8/36) to I/O
resolves your other questions.

Ilya
