Re: Interpretation of non-UTF8 strings

In a message of Mon, 16-08-2004, at 18:56 +0200, Marcin 'Qrczak' Kowalczyk
wrote:

There are also two models of how a Perl script may operate, which are
best not mixed in one program:
A. The old model: it tries to work on the original encoding of the data.
   It uses non-UTF-8 scalars exclusively if the encoding is a byte
   encoding other than UTF-8, and UTF-8 scalars if it is UTF-8. Some
   things (e.g. regexps) break for multibyte encodings other than UTF-8.
B. The new model: it uses Unicode internally, which is physically
   represented by non-UTF-8 scalars if it happens to fit in ISO-8859-1
   and by UTF-8 scalars otherwise.
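
To make model B concrete, here is a minimal sketch of how the same
characters can sit in either physical representation. It assumes a
plain script with no "use utf8" or "use encoding" in effect; the
variable names are illustrative only.

```perl
use strict;
use warnings;

# U+00E9 fits in ISO-8859-1, so this literal is stored without the UTF-8 flag.
my $narrow = "caf\xe9";

# Same characters, forced into the UTF-8 internal representation.
my $wide = $narrow;
utf8::upgrade($wide);

# A character above U+00FF can only be stored with the UTF-8 flag on.
my $polish = "dost\x{119}p";

printf "narrow: flag=%d length=%d\n", utf8::is_utf8($narrow) ? 1 : 0, length $narrow;
printf "wide:   flag=%d length=%d\n", utf8::is_utf8($wide)   ? 1 : 0, length $wide;
printf "polish: flag=%d length=%d\n", utf8::is_utf8($polish) ? 1 : 0, length $polish;

# Under model B the flag is an implementation detail: the strings compare equal.
print "narrow and wide are equal\n" if $narrow eq $wide;
```

Under model B only the sequence of code points matters, so $narrow and
$wide are the same string even though their flags differ.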


In my Kogut<->Perl bridge I would like to use Perl in the B model,
because I was told here that non-UTF-8 scalars are interpreted according
to the B model when they are mixed with UTF-8 scalars, so this is the
only model which makes sense when Unicode is used. It's also more
convenient for me because Kogut strings are Unicode internally.
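
The mixing rule mentioned above can be seen in a short sketch. It
assumes a model-A script holding the byte 0xEA, meant as "ę" in
ISO-8859-2, and joins it to a genuine Unicode string.

```perl
use strict;
use warnings;

my $bytes   = "\xea";      # meant by a model-A script as ISO-8859-2 "ę"
my $unicode = "\x{119}";   # U+0119, the same letter as a Unicode character

# Concatenation upgrades the non-UTF-8 scalar as if it were ISO-8859-1.
my $mixed = $bytes . $unicode;

printf "U+%04X U+%04X\n", map { ord } split //, $mixed;
```

The first character comes out as U+00EA rather than U+0119, which is
why the two models should not be combined in one program.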

Which Perl interpreter invocation arguments are needed to make this
work? See below for what I mean by "work".


The following places exchange text with the external world, encoding
characters as bytes, without an explicit encoding specified by the
protocol. They should either use an encoding of my choice, which I will
put somewhere in the invocation arguments (usually the default encoding
of the locale), or use the default encoding of the locale themselves.
Either of these is fine with me:

- file contents, including stdin/stdout/stderr and sockets,
  unless overridden explicitly
- filenames (including functions like mkdir, stat, glob)
- arguments of system and exec
- @ARGV
- %ENV
- $! when it contains the result of strerror()
- and probably other similar things I've forgotten.
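
Lacking an automatic mechanism, the conversion at these boundaries can
be done by hand with the Encode module. The sketch below is only an
illustration: from_external and to_external are hypothetical helper
names, and ISO-8859-2 stands in for whatever encoding the embedder
picks.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes from the outside world -> Unicode string.
sub from_external {
    my ($encoding, $bytes) = @_;
    return decode($encoding, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Hypothetical helper: Unicode string -> bytes for the outside world.
sub to_external {
    my ($encoding, $text) = @_;
    return encode($encoding, $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

my $encoding = 'ISO-8859-2';    # assumption: chosen by the embedder

# Decode incoming data once, at the boundary.
my @args = map { from_external($encoding, $_) } @ARGV;
my %env  = map { $_ => from_external($encoding, $ENV{$_}) } keys %ENV;

# Encode outgoing data, e.g. a name about to be passed to mkdir or stat.
my $dirname_bytes = to_external($encoding, "dost\x{119}p");
printf "%d arguments, %d environment variables decoded\n",
    scalar @args, scalar keys %env;
```

This covers @ARGV, %ENV and filenames; it does nothing for system, exec
or glob, which would each need the same treatment.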

There are also places in the Perl API which use Perl scalars. They
should always interpret them according to the B model, i.e. a scalar
with the UTF-8 flag turned off is interpreted as ISO-8859-1.

There are also places which don't have to support more than ASCII, but
it would be nice if they had an official interpretation of non-ASCII
characters, either the locale encoding or ISO-8859-1 I suppose, so I
know how to convert Unicode strings for them:
- variable and package names (get_sv, gv_stashpv)

The encoding of the script source should be specified separately from
everything else, because it depends on how the script has been
written, while the others depend on where it is run.

In the case of my Kogut<->Perl bridge there is no such thing as script
source (the interpreter is invoked with -e ""). But code might call
eval_sv among other things, and its argument, being a Perl scalar,
should be interpreted as above.


Note: with these options:
   -Mencoding=$ENCODING -Mopen=:encoding($ENCODING)
file contents are recoded correctly, but all other things are broken,
including eval_sv which interprets non-UTF-8 strings according to the
locale.

OTOH with this option:
   -Mopen=:encoding($ENCODING)
eval_sv works, but stdin/stdout/stderr are not recoded.
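
One possible workaround for the standard handles, offered only as a
sketch, is to push the encoding layer onto them explicitly with
binmode. The in-memory file below exists only so that the bytes
produced by the layer can be inspected; ISO-8859-2 is again an assumed
choice.

```perl
use strict;
use warnings;

my $encoding = 'ISO-8859-2';    # assumption: chosen by the embedder

# An in-memory file stands in for a real handle so the output is visible.
open(my $out, ">:encoding($encoding)", \my $buffer) or die "open: $!";
print {$out} "dost\x{119}p\n";
close($out) or die "close: $!";

# The layer turned U+0119 into the single byte 0xEA.
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $buffer;

# The same layer can be applied to handles that are already open.
binmode(STDIN,  ":encoding($encoding)") or die "binmode STDIN: $!";
binmode(STDOUT, ":encoding($encoding)") or die "binmode STDOUT: $!";
binmode(STDERR, ":encoding($encoding)") or die "binmode STDERR: $!";
```

This has to run inside the interpreter, so for the bridge it would mean
evaluating a few lines of setup code after startup rather than relying
on invocation arguments alone.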

Note: $! has the same weird behavior as @ARGV:
$ perl -Mencoding=ISO-8859-2 -Mopen=:encoding\(ISO-8859-2\) -e '
eval {open F, "/etc/shadow"}; print "$!\n"'
Brak dostępu
$ perl -Mencoding=ISO-8859-2 -Mopen=:encoding\(ISO-8859-2\) -e '
eval {open F, "/etc/shadow"}; print substr($!, 0), "\n"'
"\x{00ea}" does not map to iso-8859-2 at -e line 1.
Brak dost\x{00ea}pu
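
A likely explanation of the second result, assuming $! holds the raw
ISO-8859-2 bytes returned by strerror(), is that those bytes get
treated as ISO-8859-1 characters before being encoded for output. The
sketch below reproduces the effect with Encode alone and shows that
decoding first avoids it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: strerror() returned these ISO-8859-2 bytes for "Brak dostępu".
my $raw = "Brak dost\xeapu";

# Treated as ISO-8859-1, byte 0xEA is U+00EA, which ISO-8859-2 lacks.
my $wrong = encode('ISO-8859-2', $raw, Encode::FB_PERLQQ);
print "$wrong\n";

# Decoding first yields U+0119, which encodes back to the original bytes.
my $text  = decode('ISO-8859-2', $raw,  Encode::FB_CROAK | Encode::LEAVE_SRC);
my $right = encode('ISO-8859-2', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
print "round trip ok\n" if $right eq $raw;
```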

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak(_at_)knm(_dot_)org(_dot_)pl
    ^^     http://qrnik.knm.org.pl/~qrczak/