perl-unicode

Interpretation of non-UTF8 strings

2004-08-13 11:30:07
Should strings without the UTF8 flag be interpreted in the default
encoding of the current locale or in ISO-8859-1?

Perl treats them inconsistently. On one hand they are read from files
and used as filenames without any recoding, which implies that they are
assumed to be in some unspecified default encoding. On the other hand
they are upgraded to UTF-8 as if they were ISO-8859-1.
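
The upgrade half of this can be observed directly. The following is a
minimal sketch, assuming a Perl with the built-in utf8:: functions; the
variable names are only for illustration. It upgrades a one-byte string
and shows that the byte 0xE0 becomes the UTF-8 encoding of U+00E0,
whatever the locale says:

```perl
use strict;
use warnings;

# A one-character byte string; the UTF8 flag is off.
my $s = "\xE0";
my $before = utf8::is_utf8($s) ? 1 : 0;

# utf8::upgrade changes only the internal representation, treating
# each byte as the ISO-8859-1 character with the same number.
utf8::upgrade($s);
my $after = utf8::is_utf8($s) ? 1 : 0;

# Copy the string and expose the UTF-8 bytes of the copy.
my $internal = $s;
utf8::encode($internal);
my $hex = join " ", map { sprintf "%02X", ord } split //, $internal;

print "flag before: $before, flag after: $after\n";
print "length: ", length($s), ", ord: ", sprintf("%02X", ord $s), "\n";
print "internal bytes: $hex\n";
```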

Perl is inconsistent about whether "\xE0" or chr(0xE0) means the
character 0xE0 in the default encoding or U+00E0:

perl -e '
$x = "foo\xE0";                     # byte string, UTF8 flag off
$y = substr($x . chr(300), 0, 4);   # same characters, UTF8 flag on
print $x eq $y, "\n";               # prints 1: the strings are equal
open F1, ">$x";
open F2, ">$y"'

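The strings compare equal, yet two different files are created: $x is
used as a name ending in the single byte E0, while $y is used in its
internal UTF-8 form, ending in the bytes C3 A0. I consider this
behavior broken.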
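
The difference can be inspected without creating any files. This is a
sketch only: internal_bytes is a hypothetical helper that mimics the
behavior described here (the internal buffer being handed to the
system as the file name); it is not something Perl provides.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the bytes in the internal buffer,
# i.e. the UTF-8 form for flagged strings and the raw bytes otherwise.
sub internal_bytes {
    my ($str) = @_;                              # $str is a copy
    utf8::encode($str) if utf8::is_utf8($str);
    return join " ", map { sprintf "%02X", ord } split //, $str;
}

my $x = "foo\xE0";
my $y = substr($x . chr(300), 0, 4);

my $equal  = ($x eq $y) ? "yes" : "no";
my $x_flag = utf8::is_utf8($x) ? "on" : "off";
my $y_flag = utf8::is_utf8($y) ? "on" : "off";

print "equal: $equal\n";
print "x: flag $x_flag, bytes ", internal_bytes($x), "\n";
print "y: flag $y_flag, bytes ", internal_bytes($y), "\n";
```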

IMHO it would be more logical to assume that strings without the UTF-8
flag are in some default encoding, probably taken from the locale.
Upgrading them to UTF-8 should take this into account instead of blindly
assuming ISO-8859-1, and using a UTF-8 string as a filename should
convert it back to this encoding (on Unix) or use the UTF-16 API (on
Windows).
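
For the Unix side, the idea can be approximated today in user code
with the Encode and I18N::Langinfo modules. The sketch below is not
how Perl behaves and not a finished design: default_encoding,
from_default and to_filename are hypothetical helpers, the ISO-8859-1
fallback is an arbitrary choice, and what langinfo(CODESET) reports
depends on the platform and locale setup. ISO-8859-2 is hard-coded in
the example so that the result does not depend on the environment.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: pick the default encoding from the locale,
# falling back to ISO-8859-1 when the codeset is unknown to Encode.
sub default_encoding {
    my $codeset = eval { langinfo(CODESET) };
    return $codeset
        if defined $codeset && length $codeset && find_encoding($codeset);
    return "iso-8859-1";
}

# Hypothetical helper: byte string in the default encoding -> characters.
sub from_default {
    my ($bytes, $enc) = @_;
    return decode($enc, $bytes);
}

# Hypothetical helper: characters -> bytes for a Unix file name.
# Dies if a character cannot be represented in the encoding.
sub to_filename {
    my ($chars, $enc) = @_;
    return encode($enc, $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Fixed for the example; default_encoding() would be used in practice.
my $enc = "iso-8859-2";

my $x = from_default("foo\xE0", $enc);   # 0xE0 is U+0155 in ISO-8859-2
my $y = substr($x . chr(300), 0, 4);     # same characters as $x

my $name_x = to_filename($x, $enc);
my $name_y = to_filename($y, $enc);

print "locale encoding: ", default_encoding(), "\n";
print "same file name: ", ($name_x eq $name_y ? "yes" : "no"), "\n";
```

With both strings going through the same conversion, equal strings
give the same file name, and that name is the original byte sequence.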

This leaves chr() ambiguous, so there should be some other function for
making Unicode code points, since chr should probably keep meaning a
character in the default encoding for compatibility.
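
To make the distinction concrete, here is a rough sketch of the two
meanings as separate functions. Both names are hypothetical, the
default encoding is assumed to be ISO-8859-2 for the example, and
pack("U", ...) merely stands in for whatever the new function would be:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical name: always a Unicode code point, whatever its value.
sub unichr {
    my ($code) = @_;
    return pack("U", $code);
}

# Hypothetical name: what chr would mean under the proposal for values
# 0..255, namely that byte interpreted in the default encoding.
sub default_chr {
    my ($byte, $enc) = @_;
    die "byte value out of range: $byte" if $byte < 0 || $byte > 255;
    return decode($enc, chr($byte));
}

my $enc = "iso-8859-2";   # assumed default encoding for the example
printf "unichr(0xE0)      = U+%04X\n", ord(unichr(0xE0));
printf "default_chr(0xE0) = U+%04X\n", ord(default_chr(0xE0, $enc));
```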

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak(_at_)knm(_dot_)org(_dot_)pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
