perl-unicode

Re: Interpretation of non-UTF8 strings

2004-08-16 03:30:06
Marcin 'Qrczak' Kowalczyk <qrczak(_at_)knm(_dot_)org(_dot_)pl> writes:
> Should strings without the UTF8 flag be interpreted in the default
> encoding of the current locale or in ISO-8859-1?

This is a tricky question, and the status quo is likely to remain
for compatibility reasons.


> Perl treats them inconsistently. On one hand they are read from files
> and used as filenames without any recoding, which implies that they are
> assumed to be in some unspecified default encoding.

Actually perl makes no such assumption - this is just historical
"it just works" code which is compatible with perls before 5.6.

> On the other hand
> they are upgraded to UTF-8 as if they were ISO-8859-1.

This is possibly dubious practice, but it is what happened in 5.6,
which had Unicode but no Encode module. That situation lasted
long enough that there is a code base that relies on it.

In perl5.8 you can use explicit Encode, or the :encoding layer,
or use encoding, or ... to get what you want.


> Perl is inconsistent whether "\xE0" or chr(0xE0) means the character
> 0xE0 in the default encoding or U+00E0:
>
> perl -e '
> $x = "foo\xE0";
> $y = substr($x . chr(300), 0, 4);
> print $x eq $y, "\n";
> open F1, ">$x";
> open F2, ">$y"'
>
> The strings are equal, yet two filenames are created. I consider this
> behavior broken.

FWIW so do I, but consensus has not been reached on the right fix.

I would also like to see something akin to 'use locale' (which would
treat 0xE0 according to the locale's CTYPE), but which treats 0x80..0xFF
according to Unicode (== latin1 by definition) semantics.


> IMHO it would be more logical to assume that strings without the UTF-8
> flag are in some default encoding, probably taken from the locale.
> Upgrading them to UTF-8 should take it into account instead of blindly
> assuming ISO-8859-1,

It would be more logical but would break things.

> and using a UTF-8 string as a filename should
> convert it back to this encoding (on Unix)

This is the tricky bit. The theory goes that the way forward on Unix
is (probably) UTF-8 filenames, and that in a UTF-8 locale and
with 'use utf8' this works already. (There are a few rough edges...)

For older Unixes which don't do UTF-8 there is the issue of how 
you discover what the current locale's encoding is - if they 
are old enough to not have UTF-8 locales, they probably lack 
the API to get the encoding as well :-(

But I agree that getting two different file names is bad.

> or use UTF-16 API (on
> Windows).

The snag here is that when you say "Windows" you mean WinNT and later;
Win9X (and WinME?) can't do that. For Win9x you have to convert to the
current "code page" - akin to the Unix case.


> This leaves chr() ambiguous,

It isn't ambiguous: it is always (ignoring EBCDIC platforms for now)
Unicode/Latin1 - which can be represented in one of two ways, UTF-8 or
single octet. The representation is supposed to be invisible to perl code,
but in the case of file names it isn't.

> so there should be some other function for
> making Unicode code points, as chr should probably be kept for
> compatibility to mean the default encoding.
