perl-unicode

Re: Unicode aware module

1999-06-13 11:34:08
On Sun, Jun 13, 1999 at 09:32:06AM -0700, Gurusamy Sarathy wrote:
          But that's a poor example of what I'm driving at.
chop($foo) would be a better example.  Would you want that to
remove one byte or one character?

I do not care, as long as it chop()s.  If $foo contains bytes, let
it chop a byte.  If $foo contains chars, let it chop a char.

I put it to you that it is impossible to know whether the program
"wants" to treat it as bytes or characters, irrespective of what
the data is supposed to be.

No.  My point is different: *There is no need to know* whether the
program "wants" to treat it as bytes or characters.  Perl's strings are
sequences of small integers.  The value of *small* changes depending
on byte-vs-utf8 encoding.  If you read a GIF file in utf8 mode, it is
*still* translated to the same sequence of small integers.

[Summary: Switching on utf8 encoding does not break anything as long as
 system interaction commands know how to translate data.]

Additionally, one may have an assignment of cultural information to
these small integers, such as \w and capitalization.  Now what do you
mean when you say "byte"?

If you mean a particular encoding, then the Perl code *looks*
indifferent to this.  C<chop> removes the last small integer in a
sequence no matter whether this last integer is between 0..255 or
0..2**36-1.  Do you want the code to lose this transparency when
compiled?  Why?
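
A minimal sketch of that transparency, as it can be observed in a
current perl (the variable names are mine, chosen only for
illustration).  The same chop() removes one element whether every
element fits in 0..255 or not:

```perl
use strict;
use warnings;

my $bytes = "GIF89a\xFF";           # 7 elements, all within 0..255
my $chars = "caf\x{E9}\x{263A}";    # 5 elements, the last one is 0x263A

# chop() works on copies here so the originals stay available.
chop(my $short_bytes = $bytes);
chop(my $short_chars = $chars);

# Only lengths and integer values are printed, so no wide characters
# ever reach STDOUT.
printf "%d -> %d elements\n", length($bytes), length($short_bytes);  # 7 -> 6
printf "%d -> %d elements\n", length($chars), length($short_chars);  # 5 -> 4
printf "new last element: 0x%X\n", ord(substr($short_chars, -1));    # 0xE9
```

Nothing in the code says which internal representation either string
uses, and nothing needs to.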

If you mean ignoring the cultural info: why care?  Perl will ignore
cultural info as long as you do not apply any operation which wants
cultural info.  You do not need any particular pragma for this.
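
To illustrate the split (a sketch only, run against a current perl;
the Greek sample string is my own choice): structural operations look
at positions and integer values, while uc() and \w are the ones that
consult the cultural tables, and only when you call them.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}\x{3B2}";    # GREEK SMALL LETTER ALPHA, BETA

# Structural operations: only positions and integer values matter.
my @ints     = map { ord } split //, $alpha;    # (0x3B1, 0x3B2)
my $reversed = scalar reverse $alpha;           # "\x{3B2}\x{3B1}"

# Operations that ask for cultural info (case mapping, word-ness).
my $upper   = uc $alpha;                        # "\x{391}\x{392}"
my $is_word = $alpha =~ /\A\w+\z/ ? 1 : 0;      # 1

printf "ints: %s\n", join ' ', map { sprintf '0x%X', $_ } @ints;
printf "upper: %s\n", join ' ', map { sprintf '0x%X', ord } split //, $upper;
printf "word: %d\n", $is_word;
```

No pragma is involved in either half.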

C<no 'locale'> was a poor man's way to implement lexical switching of
locales - between a "C" one and the current one.  The only motivation
for this was to avoid bugs/attacks in suid scripts.  Now
internally-kept tables allow us to have cultural info which is free of
these security considerations.  IMO we do not want to repeat this
disastrous error with utf8.

[Summary: when you say "code operating on bytes", I do not know what
you mean.]

I do not think we need more than two internal encodings: a quick
American, and a slow Universal.  All the others should be done as i/o
filters.

Try saying that to someone from China, or India.  :-)

Why would they care how Perl represents their text in memory, as long
as all operations work as expected?

I'm not convinced that you can guarantee the answer will be correct.
Consider a piece of code that must convert raw utf8 data to utf16.
Will it "do the right thing" when globalutf16 (or whatever) is
in effect?

I do not see a place for such code in Perl.

These are arbitrary restrictions you speak of, and I (naturally :)
don't agree.

*What* restriction do you mean?  What is your *problem to solve*?
Given a sequence of small integers, find another sequence of small
integers, algorithmically constructed based on the first one?  This
problem is transparent wrt representation of these integers in
memory.  As long as Perl OPs act transparently wrt representation, so
will Perl programs.

When you say "raw utf8 data" and "utf16", you somehow bind yourself to
the *external* representation of data.  My point is that these
aspects should be treated on the level of I/O.

Suppose that cvt8to16 is a program which takes a sequence of small
integers, fails if any of them is not in the range 0..255, and
converts it to another sequence of small integers which are also in
the range 0..255.

Then

  open I, "<i";
  open O, ">o";
  I->encoding('rawbytes');
  O->encoding('rawbytes');
  print O cvt8to16($_) while <I>;

will convert from utf8 to utf16 

     no matter whether you have a global utf8 switch or not, and
     no matter whether <I> actually does any translation

as long as

     <I> marks data as being in width36- or width8-encoding depending
     on whether it chooses to do a translation raw->width36 or do no
     translation (which is an optimization, since it will get width8-data);

     all encoding-specific operations honor these markings;

     `print O' honors these markings.
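
For concreteness, here is one way the same shape could be sketched
with interfaces that exist in a current perl.  All of it is assumption
on my side, not the interface proposed above: the ':raw' layer stands
in for encoding('rawbytes'), the Encode module does the actual
conversion inside cvt8to16, UTF-16BE without a byte-order mark is
picked arbitrarily, and in-memory filehandles replace the files "i"
and "o" so that the sketch is self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical body for the cvt8to16 described in the text: takes a
# sequence of integers in 0..255 (UTF-8 octets), fails on anything
# larger, and returns integers in 0..255 (UTF-16BE octets).
sub cvt8to16 {
    my ($octets) = @_;
    die "element outside 0..255\n" if $octets =~ /[^\x00-\xFF]/;
    my $chars = decode('UTF-8', $octets,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
    return encode('UTF-16BE', $chars);
}

my $input  = "caf\xC3\xA9\n";    # "cafe" with e-acute, as UTF-8 octets
my $output = '';

# Both handles move raw octets; no translation happens at the I/O level.
open my $in,  '<:raw', \$input  or die "open input: $!";
open my $out, '>:raw', \$output or die "open output: $!";

# Converting line by line is safe: octet 0x0A never occurs inside a
# multi-byte UTF-8 sequence.
print {$out} cvt8to16($_) while <$in>;
close $out or die "close output: $!";

printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $output;
# 00 63 00 61 00 66 00 E9 00 0A
```

The conversion routine only ever sees and returns integers in 0..255;
everything about external representation lives in the open() calls.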

Ilya
