perl-unicode

Re: Unicode aware module

1999-06-12 14:47:30
On Sat, 12 Jun 1999 14:54:10 EDT, Ilya Zakharevich wrote:
On Sat, Jun 12, 1999 at 06:30:09AM -0700, Gurusamy Sarathy wrote:
It seems better to have both C<use utf8> and C<use locale> have
a global effect, and add a lexically scoped C<use byte> (or
similar) that will mark places that operate on binary data,
and therefore turn off any encoding related pragmata.  What
do you think?

Thanks.  This is what I have been advocating (for locales) for the last
two years, and what I realized about utf8 a couple of weeks ago (but have
not collected enough strength to start Yet Another Fruitless Campaign yet).

Again, the one thing I don't like about the above is that
code that operates on binary data will silently fail to do the
right thing when the user enables the global switch.  I'm
considering doing it iff we can find some way to warn users
that the data may not match the mode in effect.

The only addition: being locale-sensitive and UTF-8 encoded is a
property of *data*, not of a Perl script.  An attempt to handle them by
marking sections of code may be noble, but looks like a lost cause.

The C<use byte> I'm talking about has little to do with data.
Perl *operations* behave *differently* depending on the mode in
effect, so C<use byte> would simply be a way to mark sections
where string operations should affect bytes rather than
characters.
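
To make the distinction concrete, here is a minimal sketch of one string measured under character semantics and under byte semantics. It assumes the `bytes` pragma that later Perl releases ship as a stand-in for the proposed C<use byte>; the pragma discussed in this thread may not behave identically.

```perl
use strict;
use warnings;

# One character, U+263A, which occupies three bytes when encoded as UTF-8.
my $s = "\x{263A}";

# Default semantics: length counts characters.
my $chars = length $s;

# Inside this block only, string operations see the underlying bytes.
# (Assumption: the "bytes" pragma approximates the proposed "use byte".)
my $bytes = do { use bytes; length $s };

print "chars=$chars bytes=$bytes\n";   # prints "chars=1 bytes=3"
```

The pragma is lexically scoped, so the byte view ends with the enclosing block and the rest of the script keeps character semantics.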

We *could* find two extra bits in SvFLAGS to mark these properties - if
we decide that instead of 'use byte' we do

  open FOO, 'foo';
  binmode FOO, 'byte';

or

  my $foo : bytes;

This is just like bless(\"foo", 'utf8').  All encodings are just
different 'types' of binary data.
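
The idea of marking a data source rather than a section of code can be sketched with the I/O layers found in later Perl releases. This is only an illustration: `binmode FOO, 'byte'` as written above is a proposal, and the `:encoding(UTF-8)` layer used here is an assumed stand-in for it. The same three bytes are read once as raw bytes and once as encoded text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three raw bytes: the UTF-8 encoding of U+263A.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xe2\x98\xba";
close $out or die "close: $!";

# Source marked as binary: the data arrives as three bytes.
open my $raw, '<', $path or die "open: $!";
binmode $raw;
my $as_bytes = do { local $/; <$raw> };
close $raw;

# Source marked as UTF-8: the same data arrives as one character.
open my $txt, '<', $path or die "open: $!";
binmode $txt, ':encoding(UTF-8)';
my $as_chars = do { local $/; <$txt> };
close $txt;

my ($byte_len, $char_len) = (length $as_bytes, length $as_chars);
print "byte_len=$byte_len char_len=$char_len\n";   # prints "byte_len=3 char_len=1"
```

Here the property travels with the data from the moment it is read, which is the point being argued: the code that calls `length` is identical in both cases.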

But finding a way to easily "mark" data (by marking its source?)
has potential, because then we can detect mismatch between data
and operations.  We may even be able to piggyback on taint magic
for propagating the type.

Of course, the default would still be byte mode for both operations
and data.  If the user enables any of the global switches for
special encodings, they are supposed to mark their data sources
as well.  Mismatches then generate warnings.  Is this acceptable
and/or useful?
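
A rough sketch of the mismatch check described above follows. Everything in it is hypothetical: `$OPERATION_MODE`, `mark_data` and `check_mode` are names invented for illustration, and a real implementation would presumably live in the interpreter (for instance alongside taint magic) rather than in user code.

```perl
use strict;
use warnings;

# Hypothetical: the mode a global switch would select for string operations.
# Byte mode is the default, as proposed above.
our $OPERATION_MODE = 'byte';

# Hypothetical helper: record which encoding a data source declared.
sub mark_data {
    my ($value, $encoding) = @_;
    return { value => $value, encoding => $encoding };
}

# Hypothetical check: warn when the data's mark and the operation mode
# disagree.  Returns 1 on a match and 0 on a mismatch.
sub check_mode {
    my ($data) = @_;
    return 1 if $data->{encoding} eq $OPERATION_MODE;
    warn "data marked '$data->{encoding}' used in '$OPERATION_MODE' mode\n";
    return 0;
}

my $binary = mark_data("\x00\xff",      'byte');
my $text   = mark_data("caf\xc3\xa9",   'utf8');

my $ok_binary = check_mode($binary);   # matches the default byte mode
my $ok_text   = check_mode($text);     # mismatch: emits a warning
```

With the default left at byte mode, unmarked legacy code keeps working silently, and only data explicitly marked with another encoding can trigger the warning.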


Sarathy
gsar(_at_)activestate(_dot_)com
