perl-unicode

Re: Unicode aware module

1999-06-13 02:31:03
On Sun, Jun 13, 1999 at 01:46:41AM -0700, Gurusamy Sarathy wrote:
On Sun, 13 Jun 1999 04:21:03 EDT, Ilya Zakharevich wrote:
Please tell me whether

  sub foo { return $1 if s/\btypedef\s+int\s+(\w+)// }

is operating on bytes or characters.

Characters.  

Wrong.  You cannot win.  ;-)

Apparently the above sub processes a C file, so it should enforce "C"
locale, and in Perl-speak it means it is operating on bytes.

             But that's a poor example of what I'm driving at.
chop($foo) would be a better example.  Would you want that to
remove one byte or one character?

I do not care, as long as it chop()s.  If $foo contains bytes, let
it chop a byte.  If $foo contains chars, let it chop a char.
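
For illustration, a minimal sketch of that behaviour in present-day Perl.
It assumes the Encode module, which postdates this thread, and the
variable names are made up for the example.  The same chop() removes one
byte from a byte string and one character from a decoded string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "caf" followed by e-acute encoded as UTF-8 (two bytes: 0xC3 0xA9).
my $bytes = "caf\xC3\xA9";             # byte string: 5 bytes
my $chars = decode('UTF-8', $bytes);   # character string: 4 characters

# chop() works on whatever units the string holds.
chop(my $byte_copy = $bytes);          # drops the trailing byte 0xA9
chop(my $char_copy = $chars);          # drops the trailing character U+00E9

printf "bytes: %d -> %d\n", length($bytes), length($byte_copy);
printf "chars: %d -> %d\n", length($chars), length($char_copy);
```

Note that the byte-wise chop leaves a dangling 0xC3, which is no longer
valid UTF-8 on its own.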

Then we need to determine another way to say that the subroutine
operates on a *sequence of integers 0..255 packed into a sequence of
bytes* (which is C<no utf8>, required if we have a globalutf8 pragma).

Uhh, that's what C<use byte> is.  If the code wants to play with
bytes, C<no utf8> makes little sense.  You'll have to exhaustively
deny all possible current and future character encodings via
C<no utf16>, C<no big5>, ad nauseam.  (We're talking about
hypothetical encodings yet to be supported, but you get the idea.)

I do not think we need more than two internal encodings: a quick
American, and a slow Universal.  All the others should be done as i/o
filters.

I'm not convinced that you can guarantee the answer will be correct.
Consider a piece of code that must convert raw utf8 data to utf16.
Will it "do the right thing" when globalutf16 (or whatever) is
in effect?

I do not see a place for such code in Perl.  (Assuming C<use utf8;>):

  open I, "<i";
  open O, ">o";
  I->encoding('big5');
  O->encoding('utf16');
  print O <I>;

The program knows that all the data is either 'byte' or 'utf8'.  All
the other formats are done on the boundary of perl and the system.
Inside Perl you may need only two translations: force_utf8 and
force_byte, and in fact I do not even see a place for them as long as
all the operations give equivalent results on equivalent 'byte' and
utf8 strings.
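
As a point of comparison, here is a hedged sketch of how conversion at
the boundary can be written with the PerlIO encoding layers of later
Perl releases.  It is not the hypothetical encoding() method above; the
temporary file names and the sample text are assumptions, and UTF-16LE
is chosen here only to keep the output free of a byte-order mark:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($in_path, $out_path) = ("$dir/i", "$dir/o");

# Write a small Big5 sample: bytes A4 A4 A4 E5 encode U+4E2D U+6587.
open my $raw, '>:raw', $in_path or die "open $in_path: $!";
print {$raw} "\xA4\xA4\xA4\xE5\n";
close $raw or die "close $in_path: $!";

# Decoding and encoding happen in the I/O layers; the program itself
# only ever sees character strings.
open my $in,  '<:encoding(big5)',     $in_path  or die "open $in_path: $!";
open my $out, '>:encoding(UTF-16LE)', $out_path or die "open $out_path: $!";
print {$out} <$in>;
close $in  or die "close $in_path: $!";
close $out or die "close $out_path: $!";

# Read the result back as raw bytes to inspect the encoded form.
open my $check, '<:raw', $out_path or die "open $out_path: $!";
my $result = do { local $/; <$check> };
close $check or die "close $out_path: $!";

print join(' ', map { sprintf '%02X', ord } split //, $result), "\n";
```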

[Here a string is a sequence of small integers, 0..255 or 0..2**36-1 in
 the 'byte' and 'utf8' cases respectively.]
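
A small sketch of that equivalence, using the utf8::upgrade and
utf8::downgrade functions of later Perl releases as rough stand-ins for
the force_utf8 and force_byte idea (an analogy, not the proposal itself).
The sample string is an assumption.  Both variables hold the same
sequence of integers, only the internal storage differs:

```perl
use strict;
use warnings;

my $byte = "na\xEFve";     # integers 0..255, one byte per element
my $wide = $byte;
utf8::upgrade($wide);      # same integers, stored internally as UTF-8

# Operations are expected to agree on the two representations.
my $same_string = ($byte eq $wide)                 ? 1 : 0;
my $same_length = (length($byte) == length($wide)) ? 1 : 0;
my $same_last   = (ord(substr($byte, -1)) == ord(substr($wide, -1))) ? 1 : 0;

printf "equal: %d, same length: %d, same last element: %d\n",
    $same_string, $same_length, $same_last;
printf "internal UTF-8 storage: byte=%d wide=%d\n",
    (utf8::is_utf8($byte) ? 1 : 0), (utf8::is_utf8($wide) ? 1 : 0);

# Going back is possible because every element fits in 0..255.
my $back = $wide;
my $downgraded = utf8::downgrade($back, 1) ? 1 : 0;
```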

Ilya
