In a message of Mon, 16-08-2004, 18:56 +0200, Marcin 'Qrczak' Kowalczyk
wrote:
There are also two models of how a Perl script may operate, which are
best not mixed in one program:
A. The old model: the script tries to work in the original encoding of
the data. It uses non-UTF-8 scalars exclusively if the encoding is a
byte encoding other than UTF-8, and UTF-8 scalars if it is UTF-8. Some
things (e.g. regexps) break for multibyte encodings other than UTF-8.
B. The new model: the script uses Unicode internally, which is physically
represented by non-UTF-8 scalars if the text happens to fit in ISO-8859-1
and by UTF-8 scalars otherwise.
My point for now is that Perl should be aware of which model it is
actually using in which circumstances (I assume it has decided to support
both). If Perl always interpreted non-UTF-8 scalars in the locale encoding,
it would be easier for the two models to coexist and to be told apart
locally, because the meaning of a given string would be unambiguous without
context (as long as the concept of the locale encoding is defined well
enough).
$ perl -e 'use open ":locale"; use encoding(latin2); print chr(260), "\n"'
Ą
$ perl -e 'use encoding(latin2); use open ":locale"; print chr(260), "\n"'
"\x{12a9}" does not map to iso-8859-2 at -e line 1.
panic: sv_setpvn called with negative strlen at -e line 1.
"\x{12a1}" does not map to iso-8859-2.
\x{12a1}
$ echo -e '\0241' | perl -e 'use open ":locale"; use encoding(latin2); print ord(<>), "\n"'
260
$ echo -e '\0241' | perl -e 'use encoding(latin2); use open ":locale"; print ord(<>), "\n"'
196
$ echo -e '\0241' | perl -e 'use open ":encoding(latin2)"; use encoding(latin2); print ord(<>), "\n"'
260
$ echo -e '\0241' | perl -e 'use encoding(latin2); use open ":encoding(latin2)"; print ord(<>), "\n"'
260
196 is the first byte of the UTF-8 encoding of the Unicode character 260
(U+0104). I guess something forgot to turn on the UTF-8 flag.
--
__("< Marcin Kowalczyk
\__/ qrczak(_at_)knm(_dot_)org(_dot_)pl
^^ http://qrnik.knm.org.pl/~qrczak/