Re: Interpretation of non-UTF8 strings

In a message of Mon, 16-08-2004, 18:56 +0200, Marcin 'Qrczak' Kowalczyk
wrote:

There are also two models of how a Perl script may operate, which had
better not be mixed in one program:
A. The old model: it tries to work in the original encoding of the data.
   It uses non-UTF-8 scalars exclusively if the encoding is a byte encoding
   other than UTF-8, and UTF-8 scalars if it is UTF-8. Some things break
   for multibyte encodings other than UTF-8 (e.g. regexps).
B. The new model: it uses Unicode internally, which is physically
   represented by non-UTF-8 scalars if it happens to fit in ISO-8859-1
   and by UTF-8 scalars otherwise.


My point for now is that Perl should be aware of which model it is
actually using in which circumstances (as I assume it decided to support
both).

If Perl always interpreted non-UTF-8 scalars in the locale encoding,
it would be easier for the two models to coexist and to be told apart
locally, because the meaning of a given string would be unambiguous
without context (as long as the concept of the locale encoding is
defined well enough).


$ perl -e 'use open ":locale"; use encoding(latin2); print chr(260), "\n"'
Ą
$ perl -e 'use encoding(latin2); use open ":locale"; print chr(260), "\n"'
"\x{12a9}" does not map to iso-8859-2 at -e line 1.
panic: sv_setpvn called with negative strlen at -e line 1.
"\x{12a1}" does not map to iso-8859-2.
\x{12a1}

$ echo -e '\0241' | perl -e 'use open ":locale"; use encoding(latin2); print ord(<>), "\n"'
260
$ echo -e '\0241' | perl -e 'use encoding(latin2); use open ":locale"; print ord(<>), "\n"'
196
$ echo -e '\0241' | perl -e 'use open ":encoding(latin2)"; use encoding(latin2); print ord(<>), "\n"'
260
$ echo -e '\0241' | perl -e 'use encoding(latin2); use open ":encoding(latin2)"; print ord(<>), "\n"'
260

The 196 in the second case is the first byte of the UTF-8 encoding of
Unicode character 260 (U+0104, encoded as the bytes 0xC4 0x84).
I guess something forgot to turn on the UTF-8 flag.
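
The byte arithmetic behind these numbers can be checked outside Perl.
Below is a minimal Python sketch, assuming only that echo -e '\0241'
sends the single byte 0xA1 and that Python's iso8859_2 codec agrees with
Perl's latin2. It does not model Perl's UTF-8 flag; it only shows that
196 is what comes out when the UTF-8 bytes of character 260 are read as
one character per byte. The variable names are illustrative.

```python
# The byte that `echo -e '\0241'` feeds to the script (octal 241 = 0xA1).
raw = b"\xa1"

# Decoded as ISO-8859-2 (latin2), 0xA1 is U+0104,
# LATIN CAPITAL LETTER A WITH OGONEK, i.e. character number 260.
char = raw.decode("iso8859_2")
print(ord(char))

# The UTF-8 encoding of that character is the two bytes 0xC4 0x84.
utf8_bytes = char.encode("utf-8")
print(list(utf8_bytes))

# If those bytes are then treated as one character per byte
# (roughly what a missing UTF-8 flag would amount to),
# the first "character" has code 196.
misread = utf8_bytes.decode("latin-1")
print(ord(misread[0]))
```

This would be consistent with the input having been decoded correctly
and then stored as UTF-8 bytes without being marked as UTF-8.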

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak(_at_)knm(_dot_)org(_dot_)pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
