perl-unicode

Re: Interpretation of non-UTF8 strings

2004-08-16 05:30:10
In a message of Mon, 2004-08-16 at 11:16 +0100, Nick Ing-Simmons wrote:

> > Perl treats them inconsistently. On one hand they are read from files
> > and used as filenames without any recoding, which implies that they are
> > assumed to be in some unspecified default encoding.

> Actually perl makes no such assumption - this is just historical
> "it just works" code which is compatible with perls before 5.6.

There is a reasonable assumption that all textual data are in some
unspecified but consistent encoding unless specified otherwise.

The historical model is that data is manipulated in its stored encoding
directly. It kind of works, as long as all data uses the same encoding,
and as long as the locale is consulted when the actual meaning of
non-ASCII bytes is important (which can be quite hard if the encoding
happens to be UTF-8 or another multibyte encoding).
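
For instance, under this model a program never decodes anything and lets
the locale give meaning to the bytes; a minimal sketch (pl_PL is just an
example locale name and is assumed to be installed):

use locale;                        # make \w, uc, lc etc. follow LC_CTYPE
use POSIX qw(setlocale LC_CTYPE);
setlocale(LC_CTYPE, "pl_PL");      # example locale, assumed installed

my $line = <STDIN>;                # raw ISO-8859-2 bytes, never decoded
print uc $line;                    # case mapping driven by locale tables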

As exchanging non-ASCII data between computers becomes more important,
as the assumption that all data uses the same encoding too often turns
out to be false, and as a multibyte encoding - UTF-8 - becomes more
common, another text processing model has appeared. The model is to use
Unicode internally and convert data on I/O. This model is usually better
suited to handling non-ASCII data, especially if different sources use
different encodings, but it's a departure from existing practice, so
it's not universally adopted yet.
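
A minimal sketch of this second model, assuming the data is known to be
ISO-8859-2:

use Encode qw(decode encode);

my $bytes = <STDIN>;                        # octets as stored
my $text  = decode("ISO-8859-2", $bytes);   # now abstract characters
$text = uc $text;                           # Unicode-aware operations
print encode("ISO-8859-2", $text);          # octets again on output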

It's convenient to assume that this conversion uses some default
encoding unless specified otherwise, so that not every program must deal
with encodings explicitly. Programs which don't specify encodings at all
still work, as long as all the data they encounter is encoded in that
default encoding. On Unix the locale mechanism is used to specify the
default encoding, among other things.
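
On Unix the default encoding can be looked up explicitly; a sketch using
I18N::Langinfo's CODESET, which should be available on most Unix perls:

use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

setlocale(LC_CTYPE, "");           # adopt the user's environment
my $codeset = langinfo(CODESET);   # e.g. "ISO-8859-2" or "UTF-8"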

In my case the encoding is ISO-8859-2. It will become UTF-8 in the
future, when more programs are compatible with UTF-8.

> > On the other hand
> > they are upgraded to UTF-8 as if they were ISO-8859-1.

> This is possibly dubious practice, but was what happened in 5.6
> which had Unicode but no Encode module. That situation lasted
> long enough that there is a code base that relies on it.

This is broken.

perl -e 'use Glib; use Gtk2 -init;
$window = Gtk2::Object->new(Gtk2::Window, title => "ąćęłńóśźż");
$window->show_all(); Gtk2->main()'

It shows an incorrect title: the characters are treated as if they were
ISO-8859-1. It's unreasonable to assume that everybody lives in the USA
or Western Europe and uses ISO-8859-1. I have my locale set correctly to
pl_PL with ISO-8859-2. How do I tell Perl to respect that?
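
The only workaround I know is to decode the literal by hand before
handing it to Gtk2; a sketch, with ISO-8859-2 hardcoded because it is my
locale's charset (in general it would come from langinfo(CODESET)):

use Encode qw(decode);

# the source file is ISO-8859-2, so the literal is ISO-8859-2 bytes;
# decoding yields a UTF-8-flagged scalar which Gtk2 displays correctly
my $title = decode("ISO-8859-2", "ąćęłńóśźż");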

> > IMHO it would be more logical to assume that strings without the
> > UTF-8 flag are in some default encoding, probably taken from the
> > locale. Upgrading them to UTF-8 should take it into account instead
> > of blindly assuming ISO-8859-1,

> It would be more logical but would break things.

They are already broken by assuming that everyone uses ISO-8859-1.

> > and using a UTF-8 string as a filename should
> > convert it back to this encoding (on Unix)

> This is the tricky bit; the theory goes that the way forward on Unix
> is (probably) UTF-8 filenames, and that in a UTF-8 locale and
> with 'use utf8' this works already. (There are a few rough edges...)

Filenames should be assumed to use the locale's encoding by default,
because C programs generally don't recode filenames read from files or
typed in by the user. Tons of programs, including ls, would break if the
filename encoding were different from the terminal encoding. Filenames
should be assumed to be UTF-8 when the locale says the default encoding
is UTF-8.
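
Put differently, a program holding a Unicode filename should encode it
into the locale's encoding before handing it to the OS; a sketch with
ISO-8859-2 hardcoded:

use Encode qw(encode);

my $name = "za\x{17C}\x{F3}\x{142}\x{107}";     # Unicode text, "zażółć"
open my $fh, ">", encode("ISO-8859-2", $name)
    or die "cannot create $name: $!";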

> > This leaves chr() ambiguous,

> It isn't ambiguous, it is always (ignoring EBCDIC platforms for now)
> Unicode/Latin1 - which can be represented one of two ways - UTF-8 or
> a single octet.

If it's ISO-8859-1, then the default I/O behavior is broken, because
it passes bytes without recoding, thus assuming that everybody uses the
same encoding as non-UTF-8 Perl scalars, i.e. ISO-8859-1.

I understand that not recoding anything by default is required for
compatibility. The wrong choice was assuming that unrecoded data is
ISO-8859-1, because that makes maintaining the compatibility harder.

Python explicitly distinguishes byte strings and Unicode strings,
which allows the two models to coexist without ambiguity.

If Perl scalars are a mixture of ISO-8859-1 and UTF-8, instead of a
mixture of the default locale encoding and UTF-8, how do I tell Perl to
recode external strings (default I/O, including stdin/stdout/stderr,
@ARGV, and filenames) between the default locale encoding and Perl's
internal encodings?
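
What I would like is a way to say all of this once; a sketch of the
recoding I have in mind, assuming I18N::Langinfo and the :encoding
PerlIO layer:

use Encode qw(decode);
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);
binmode $_, ":encoding($codeset)" for *STDIN, *STDOUT, *STDERR;
@ARGV = map { decode($codeset, $_) } @ARGV;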


I'm making a bridge between Perl and my language Kogut, which uses
the second model above: data is manipulated in Unicode (it's stored
internally in a mixture of ISO-8859-1 and UTF-32) and recoded during
communication with the world, using a specified or default encoding,
which defaults to the encoding taken from the locale.

When converting strings from Kogut to Perl I do the following: if the
string is ASCII-only, I turn it into a Perl scalar directly and turn
off the UTF-8 flag (I ignore the existence of EBCDIC as the default),
otherwise I encode it in UTF-8 and turn on the UTF-8 flag to make sure
that Perl interprets it correctly.
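
At the Perl level the effect is equivalent to this sketch (the real
bridge manipulates the SvUTF8 flag through the C API):

sub kogut_to_perl {
    my ($text) = @_;                  # stands in for the Kogut string
    if ($text =~ /\A[\x00-\x7F]*\z/) {
        utf8::downgrade($text);       # pure ASCII: plain octets, flag off
    } else {
        utf8::upgrade($text);         # otherwise UTF-8 with the flag on
    }
    return $text;
}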

If Perl scalars without the UTF-8 flag are indeed meant to be encoded
in ISO-8859-1, I could save some bytes and use non-UTF-8 scalars when
the string is purely ISO-8859-1. I was worried that the two forms are
not equivalent, because filename handling treats them differently, and
because non-UTF-8 strings are the default for I/O.

When converting strings from Perl to Kogut, it's obvious what to do when
the UTF-8 flag is on, but it's not clear what to do when it's off. If
I treat the string as ISO-8859-1, then it breaks on all non-ISO-8859-1
locales if the programmer was not careful to recode everything on the
Perl side. I expect almost all Perl code keeps @ARGV and filenames and
file contents in non-UTF-8 scalars, so it would break almost all code.
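
Concretely, the choice is between these two readings; a sketch (0xB1 is
"ą" in ISO-8859-2 but a plus-minus sign in ISO-8859-1):

use Encode qw(decode);

my $scalar    = "\xB1";                          # a non-UTF-8 Perl scalar
my $as_latin1 = decode("ISO-8859-1", $scalar);   # U+00B1, what Perl assumes
my $as_locale = decode("ISO-8859-2", $scalar);   # U+0105, what my locale implies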

But treating it as the current locale's encoding breaks in other places.
There are fewer such places, but the breakage is more severe: e.g. chr
becomes inconsistent. Perl's non-UTF-8 strings are indeed treated as
ISO-8859-1 by Perl itself when they are concatenated with UTF-8 strings.
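
For example (U+0105 is "ą", whose ISO-8859-2 byte is 0xB1):

my $byte = chr(0xB1);            # non-UTF-8 scalar, a single octet
my $uni  = chr(0x105);           # UTF-8-flagged scalar, U+0105
my $cat  = $byte . $uni;         # $byte is upgraded as ISO-8859-1
printf "U+%04X\n", ord $cat;     # prints U+00B1 (plus-minus), not U+0105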

It follows that either choice breaks much code. I made the bridge for
entertainment and as research in language interoperability, so I don't
lose much if it can't be made to work. But maybe there is hope that it
will work consistently in the future. I would like to know how it is
supposed to eventually work.

I also made a Python bridge, which was easier for many reasons. In
particular, handling strings was easier: I use Python Unicode strings to
represent all non-ASCII data, and in the other direction I assume the
default encoding for byte strings.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak(_at_)knm(_dot_)org(_dot_)pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
