perl-unicode

Re: Interpretation of non-UTF8 strings

2004-08-24 10:30:09
> Portability is not a sufficient excuse though. There are bugs, like

That's right, we haven't fixed things because we are lazy and stupid.
How did you guess?

> that with double recoding, or with $ARGV[0] not being equivalent to
> substr($ARGV[0], 0).

Which substr() example are you referring to here?  I cannot find it
in your recent messages.
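If the claim is that a plain substr() copy can compare unequal to the
original under the encoding pragma, a test would be something like the
sketch below (a hypothetical reconstruction, since the original example
is not quoted in this thread):

    #!/usr/bin/perl
    # Hypothetical: run with a high-bit argument, e.g. perl test.pl "ą"
    use encoding 'iso-8859-2';
    my $copy = substr($ARGV[0], 0);   # should be an exact copy
    print $ARGV[0] eq $copy ? "equivalent\n" : "NOT equivalent\n";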

> The API is, I'm afraid, not good enough, even if we ignore the old mode
> of manipulating data in its external encoding. Namely, it doesn't
> distinguish specifying the encoding of the script source (which depends
> on where it was written) from specifying the encoding that the
> script should assume on STDIN/STDOUT/STDERR and other places (which
> depends on where it is being run). Well, the other places once they are
> implemented, assuming they will indeed be triggered by the 'encoding'
> pragma.
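For reference, the 5.8 pragma couples the two by default: a bare
use encoding 'iso-8859-2' both recodes the source literals and pushes
that encoding onto STDIN and STDOUT.  The streams can at least be
overridden explicitly:

    # Source is ISO-8859-2; without the extra arguments STDIN/STDOUT
    # would default to ISO-8859-2 as well, conflating "where the script
    # was written" with "where it happens to be run".
    use encoding 'iso-8859-2', STDIN => 'UTF-8', STDOUT => 'UTF-8';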

You may consider the encoding pragma broken for your uses, and that is
fine, but I have to point out that many people are happily using it.

If your environment is such that your script is in encoding X and
your utilities operate in encoding X, all is fine.  It's when you mix
encodings that things get murkier.

Take for example the output of qx(): you may declare somehow that it is
in UTF-8, but the moment some utility behaves differently and spits out
Latin-1 or Latin-2 or SJIS, you are screwed.
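About the best you can do is decode at the boundary and fail loudly
instead of silently mangling the data; a sketch (the utility name is a
placeholder):

    use Encode qw(decode);
    my $raw  = qx(some-utility);   # raw bytes, real encoding unknown
    # Assume UTF-8, but croak if the bytes are really Latin-2 or SJIS:
    my $text = decode('UTF-8', $raw, Encode::FB_CROAK);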

> I hope the -C flag is considered a temporary hack, to be eventually
> replaced with something which supports other encodings, not only
> UTF-8.

Possibly.  It was an explicit solution for the much greater brokenness
that resulted from assuming implicit UTF-8 from locales.

> use encoding files => "ISO-8859-2";
> use encoding terminal => "UTF-8";

What do you mean by "terminal"?  The STD* streams or /dev/tty?

> use encoding filenames => "ISO-8859-1";
> use encoding env => "locale";

Something like that would be nice, yes.  Someone needs to implement it,
though, and that's the problem.
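Pieces of it exist already, for what it's worth; a sketch of the
closest current equivalents (filenames and %ENV still have to be
decoded by hand, and the encodings here are just examples):

    use open IO => ':encoding(iso-8859-2)'; # default layers for new handles
    binmode STDOUT, ':encoding(UTF-8)';     # one particular stream
    use Encode qw(decode);
    my $home = decode('ISO-8859-1', $ENV{HOME});  # %ENV stays manual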

> We should think about how it interacts with the locale-aware behavior
> of functions. Without 'use locale' and other pragmas it's clear: Perl
> consistently assumes that every text is ISO-8859-1. When something like

Well, no.  In that case Perl assumes that everything is in whatever
8-bit encoding the platform happens to be using, with the exception that
/\w/ and so forth only implement the character set of ASCII (in effect,
the raw underlying <ctype.h> API).
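A sketch of that split behavior on a 5.8 perl:

    my $s = "\xE9";     # the byte 0xE9, e-acute if read as Latin-1
    print $s =~ /\w/ ? "word\n" : "not a word\n";  # "not a word": ASCII \w
    utf8::upgrade($s);  # same character, now in the internal UTF-8 form
    print $s =~ /\w/ ? "word\n" : "not a word\n";  # "word": Unicode rules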

> 'use encoding' is in effect, Perl still interprets the scalars in the
> same way, but treats them differently when they interact with the world.

> But with 'use locale' it assumes that non-UTF-8 scalars are in the
> current locale encoding, which is incompatible with the assumptions
> made when UTF-8 and non-UTF-8 scalars are mixed. So the two will
> probably never work together. If 'use locale' provides some essential
> features besides the treatment of texts, like date/time formatting,
> they should be available by other means, without at the same time
> causing ord(lc(chr(161))) to be equal to 177, which doesn't make sense
> if character codes are interpreted according to Unicode. It implies
> that when localized texts are taken from the system, they must be
> decoded from the locale encoding.
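Concretely (a sketch; the locale name is an assumption and varies by
system):

    use locale;
    use POSIX qw(setlocale LC_CTYPE);
    setlocale(LC_CTYPE, 'pl_PL.ISO8859-2')
        or die "locale not available\n";
    # 161 is A-ogonek in ISO-8859-2 and lowercases to 177 (a-ogonek),
    # but as Unicode, 161 is U+00A1 '¡', which has no lowercase mapping.
    printf "%d\n", ord(lc(chr(161)));   # prints 177 under this locale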

If you really do have a Grand Plan of how to integrate locales and
Unicode happily, congratulations.

-- 
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen
