perl-unicode

Re: Unicode aware module

1999-06-13 12:20:15
On Sun, 13 Jun 1999 14:27:09 EDT, Ilya Zakharevich wrote:
> On Sun, Jun 13, 1999 at 09:32:06AM -0700, Gurusamy Sarathy wrote:
>> I put it to you that it is impossible to know whether the program
>> "wants" to treat it as bytes or characters, irrespective of what
>> the data is supposed to be.

> No.  My point is different: *There is no need to know* whether the
> program "wants" to treat it as bytes or characters.  Perl's strings are
> sequences of small integers.

Not currently, no.  They are sequences of *bytes* (even when
C<use utf8> is in effect).  It is the *ops* that choose to treat
them as either bytes or encoded characters, depending on whether
C<use utf8> is in effect.
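
For contrast, here is a minimal sketch with the Encode module (which
postdates this thread; in post-5.8 Perl a per-string flag plays the
role C<use utf8> plays here), showing the same bytes viewed both ways:

    use Encode qw(decode);

    my $bytes = "\xc3\xa9";                 # two bytes: UTF-8 for "é"
    my $chars = decode("UTF-8", $bytes);    # same data as one character

    print length($bytes), "\n";             # 2 -- ops see a byte sequence
    print length($chars), "\n";             # 1 -- ops see characters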

>                             The value of *small* changes depending
> on byte-vs-utf8 encoding.  If you read a GIF file in utf8 mode, it is
> *still* translated to the same sequence of small integers.

> [Summary: Switching on utf8 encoding does not break anything, as long
> as system interaction commands know how to translate data.]

Correct, I/O doesn't have any notions of encoding attached to it
currently. (And binmode() is just a hint to the CRT to deal with
the CRLF abomination, so it is not an "encoding" that perl has to
deal with.)
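
(As a sketch of that: binmode() below only turns off CRLF translation
on platforms that do it; it implies nothing about character encodings.
Assumes a file named "image.gif" exists.)

    open(my $gif, "<", "image.gif") or die "open: $!";
    binmode($gif);                 # raw bytes; only CRLF translation is off
    read($gif, my $header, 6);     # first 6 bytes: "GIF87a" or "GIF89a"
    print "$header\n";
    close($gif);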

> Additionally, one may have an assignment of cultural information to
> these small integers, such as \w and capitalization.

IOW, you have a means to group arbitrary bytes into categories.
Using C<use utf8> changes that to grouping arbitrary *characters*
(which may occupy one byte or several, depending on the encoding
in effect).
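
A sketch of that regrouping, again with the later Encode module
standing in for C<use utf8>:

    use Encode qw(decode);

    my $bytes = "na\xc3\xafve";             # raw bytes (UTF-8 for "naïve")
    my $chars = decode("UTF-8", $bytes);    # the same data, 5 characters

    print $bytes =~ /^\w+$/ ? "word" : "no", "\n";   # "no":  \xc3 isn't \w
    print $chars =~ /^\w+$/ ? "word" : "no", "\n";   # "word": ï is \w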

>                                                     Now what do you
> mean when you say "byte"?

I mean "eight bits", nothing more, nothing less.  In particular,
when I say "byte", I care nothing about character encodings.

> If you mean a particular encoding, then the Perl code *looks*
> indifferent to this.  C<chop> removes the last small integer in a
> sequence no matter whether this last integer is between 0..255 or
> 0..2**36-1.

It seems you want to restrict perl operations to operating _only_
on characters.  The horse has already bolted out of the barn on
that--there's code in modules out there that edits binaries/network
packets using index() and substr().
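
The sort of byte-oriented code meant here, as a sketch (the packet
layout is hypothetical):

    # Patch the 16-bit length field of a binary packet in place;
    # every offset below is a *byte* offset.
    my $packet = pack("nnC4", 0x1234, 8, 192, 168, 0, 1);
    substr($packet, 2, 2) = pack("n", 12);            # overwrite bytes 2..3
    my ($id, $len, @addr) = unpack("nnC4", $packet);
    printf "id=%#06x len=%d addr=%s\n", $id, $len, join(".", @addr);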

>            You want the code to lose this transparency when compiled?
> Why?

The transparency you speak of only exists when you're confining
yourself to operating with characters.
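
That transparency, sketched with the later Encode module: C<chop>
removes one "small integer" whether it encodes to one byte or several.

    use Encode qw(decode);

    my $s = decode("UTF-8", "caf\xc3\xa9");   # "café": 4 characters
    chop $s;                                  # removes the two-byte "é" whole
    print "$s\n";                             # "caf"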

> [Summary: when you say "code operating on bytes", I do not know what
> you mean.]

This is a major part of the problem. :-)

>>> I do not think we need more than two internal encodings: a quick
>>> American, and a slow Universal.  All the others should be done as i/o
>>> filters.

>> Try saying that to someone from China, or India.  :-)

> Why would they care how Perl represents their text in memory, as long
> as all operations work as expected?

They'd care if it was ten times slower.  (But this is a side issue.)

>> These are arbitrary restrictions you speak of, and I (naturally :)
>> don't agree.

> *What* restriction do you mean?  What is your *problem to solve*?

The problem to solve is:

   How do you avoid breaking things that rely on manipulating
   individual bytes when you (globally) switch Perl operations
   to operate on characters instead of bytes?

(C<use utf8> currently just skirts the issue by being a lexical
pragma.  But that also makes it more cumbersome to use, because
to utf8-enable an application means putting C<use utf8> or
C<use caller 'encoding'> in every module used by the application.
Hence the motivation for a global utf8 switch, which creates the
problem above.)
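
A sketch of the breakage, with Encode's decode() standing in for a
global switch to character semantics:

    use Encode qw(decode);

    my $raw = "\x01\xc3\xa9\x02";      # 4 bytes of binary data
    print length($raw), "\n";          # 4 -- byte-offset code is happy

    # If the same data were silently reinterpreted as characters,
    # every byte-based offset, length, and index() result would shift:
    my $oops = decode("UTF-8", $raw);
    print length($oops), "\n";         # 3 -- substr()/index() math breaks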

> [...things about I/O that are irrelevant to the discussion...]

I'm not worried about I/O currently, because the internals always
treat I/O as bytes, and there is no ambiguity.


Sarathy
gsar@activestate.com
