Re: Interpretation of non-UTF8 strings

Marcin 'Qrczak' Kowalczyk <qrczak(_at_)knm(_dot_)org(_dot_)pl> writes:

On Mon, 16-08-2004, at 11:16 +0100, Nick Ing-Simmons wrote:

Perl treats them inconsistently. On one hand they are read from files
and used as filenames without any recoding, which implies that they are
assumed to be in some unspecified default encoding.


Actually perl makes no such assumption - this is just historical
"it just works" code which is compatible with perls before 5.6.


There is a reasonable assumption that all textual data are in some
unspecified but consistent encoding unless specified otherwise.

The historical model is that data is manipulated in its stored encoding
directly. It kind of works, as long as all data uses the same encoding,
and as long as locale is consulted when the actual meaning of non-ASCII
bytes is important (which can be quite hard if the encoding happens to
be UTF-8 or another multibyte encoding).

As exchanging non-ASCII data between computers becomes more
important, as the assumption that all data uses the same encoding more
often proves false, and as a multibyte encoding, UTF-8, becomes more
common, another text processing model is emerging. The model is to use
Unicode internally and convert data on I/O. This model is usually
better suited for handling non-ASCII data, especially if different
sources use different encodings, but it's a switch from existing
practice, so it's not universally adopted yet.
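
To make the "Unicode internally, convert on I/O" model concrete, here is
a minimal sketch using the Encode module that ships with perl 5.8. The
sample bytes and the choice of ISO-8859-2 in and UTF-8 out are only
assumptions for illustration; a real program would take the bytes from a
file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the word "żółw" as ISO-8859-2 bytes.
my $bytes_in = "\xBF\xF3\xB3w";

# Input boundary: bytes become a character string.
my $text = decode('iso-8859-2', $bytes_in);

# Processing sees characters, so uc() applies Unicode case rules.
my $upper = uc $text;

# Output boundary: characters become bytes in the target encoding.
my $bytes_out = encode('UTF-8', $upper);

print join(' ', map { sprintf '%02X', ord } split //, $bytes_out), "\n";
# prints "C5 BB C3 93 C5 81 57", i.e. "ŻÓŁW" in UTF-8
```

The point is that the encoding is named exactly twice, at the edges, and
the code in between never has to care which encoding the data came from.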

It's convenient to assume that this conversion uses some default
encoding unless specified otherwise, so not all programs must deal with
encodings explicitly. Programs which don't specify encodings at all work
too, as long as all data they encounter is encoded using that encoding.
The locale mechanism is used on Unix to specify the default encoding
and other things.

In my case the encoding is ISO-8859-2. It will become UTF-8 in the
future, when more programs are compatible with UTF-8.

On the other hand
they are upgraded to UTF-8 as if they were ISO-8859-1.
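
A small sketch of the difference being described, assuming a single
ISO-8859-2 byte as input. utf8::upgrade is used here only to make the
implicit upgrade visible; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0xB1 is "ą" in ISO-8859-2 but "±" in ISO-8859-1.
my $bytes = "\xB1";

# Upgrading without naming an encoding maps each byte to the code point
# with the same number, which is the ISO-8859-1 interpretation.
my $implicit = $bytes;
utf8::upgrade($implicit);

# Decoding with the encoding named gives the intended character.
my $explicit = decode('iso-8859-2', $bytes);

printf "implicit: U+%04X\n", ord $implicit;   # prints "implicit: U+00B1"
printf "explicit: U+%04X\n", ord $explicit;   # prints "explicit: U+0105"
```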


This is possibly a dubious practice, but it is what happened in 5.6,
which had Unicode but no Encode module. That situation lasted
long enough that there is a code base that relies on it.


This is broken.


But (sadly) we have to be compatible with some 5.6 codebase.


perl -e 'use Glib; use Gtk2 -init;
$window = Gtk2::Object->new(Gtk2::Window, title => "ąćęłńóśźż");
$window->show_all(); Gtk2->main()'


so perl knows what you are doing.


It shows an incorrect title: characters are treated as if they were
ISO-8859-1. It's unreasonable to assume that everybody lives in the USA
or Western Europe and uses ISO-8859-1. I have my locale set correctly to
pl_PL with ISO-8859-2. How do I tell Perl to respect that?


Add 'use encoding qw(iso-8859-2);'

IMHO it would be more logical to assume that strings without the UTF-8
flag are in some default encoding, probably taken from the locale.
Upgrading them to UTF-8 should take it into account instead of blindly
assuming ISO-8859-1,


It would be more logical but would break things.


They are already broken by assuming that everyone uses ISO-8859-1.


perl5.8 allows you to specify it.
If you don't specify it, it assumes perl5.6 compatibility mode ;-)
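
As a sketch of specifying the encoding from the locale rather than
hard-coding it, the core I18N::Langinfo and Encode modules can be
combined as below. decode_from_locale is a hypothetical helper, not
something perl provides, and the fallback to ISO-8859-1 is an assumption
chosen here to mirror the 5.6-compatible default.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode find_encoding);

# Adopt the locale from the environment, then ask for its character set.
setlocale(LC_CTYPE, '');
my $codeset = langinfo(CODESET);   # e.g. "ISO-8859-2" under pl_PL

# Assumption: fall back to ISO-8859-1 if Encode does not know the name.
my $enc_name = ($codeset && find_encoding($codeset)) ? $codeset : 'iso-8859-1';

# Hypothetical helper: decode bytes using the locale's encoding.
sub decode_from_locale {
    my ($bytes) = @_;
    return decode($enc_name, $bytes);
}

# Under a pl_PL ISO-8859-2 locale these bytes decode to "ąćę".
my $title = decode_from_locale("\xB1\xE6\xEA");
```

The decoded string can then be handed to a Unicode-aware API such as the
Gtk2 call in the example above, instead of the raw bytes.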