perl-unicode

Re: Correct use of UTF-8 under Unix

1999-11-01 21:34:14
Markus Kuhn writes:
: I have just read through the list archive, and noted that a few people
: might have some doubts about how UTF-8 is used under Unix.

Well, I just read through your list archive, and I think you are more
of an idealist than I can afford to be.  You keep saying, "If Plan 9
can do a complete conversion, so can we."  But you'll notice that
people aren't in fact using Plan 9, by and large.  Plan 9 is a research
project.  It doesn't have millions of installations or millions of
interconnections with other installations.

Don't get me wrong.  Perl will work fine in your idealized world.  But
I intend it to work okay in the other world too.  I simultaneously try
to keep my head in the clouds and my feet on the ground.  Sometimes
it's a stretch, though.

: They
: apparently got confused by many of the features described in the Unicode
: standard (BOM, line separator, etc.), and thereby completely forgot the
: big UTF-8 prime directive under Unix:
: 
:   UTF-8 is ASCII compatible

Sure, and Perl banks on that to a great extent, but much of the world is
not ASCII compatible.

: Not only the encoding, but also the use of it.

Er, only until you actually start trying to use it for anything both
useful and un-American, like sorting, or updating your screen...
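
To make that concrete, here is a minimal sketch in Perl -- using the
Encode module, which postdates this message -- of both halves of the
claim: pure ASCII really is byte-for-byte untouched by UTF-8, but
character-level operations stop agreeing with byte-level ones the
moment anything non-ASCII shows up.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $ascii = "plain ASCII text";
    print encode('UTF-8', $ascii) eq $ascii
        ? "same bytes\n" : "different bytes\n";   # prints "same bytes"

    my $naive = "na\x{EF}ve";           # U+00EF, LATIN SMALL LETTER I WITH DIAERESIS
    my $bytes = encode('UTF-8', $naive);
    print length($naive), "\n";         # 5 characters
    print length($bytes), "\n";         # 6 bytes: byte-wise tools now disagree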

: So don't change anything
: about how ASCII was used when introducing UTF-8, because only this means
: that UTF-8 can truly substitute ASCII in a realistic way:

To the extent possible, I agree with you.  :-)

: This means the following:
: 
:   - A UTF-8 Unix plain text file that contains only ASCII characters
:     (and this is the majority of files on Unix installations all over
:     the world) will *not* change a single bit.

That may be true, but I don't think it's true enough.  49% of the files
in the world could be in non-ASCII encodings, and your statement would still be
strictly true.  But not terribly useful.  The problem is not so much
files as it is interfaces.  What percentage of the text you use comes
from the system you're on?  How is that percentage changing over time?
What about if you're running a Linux set-top box that doesn't even have
a disk?  Or closer to current reality, did that tar file you just
unpacked come from a UTF-8 only system?  Will your browser convert text
to UTF-8 when it saves it?  What's coming down that socket you just
opened?  What's coming out of the file descriptor my process just
inherited?  Was it a pipe to a process on my machine, or was it a
foreign port?

I'm not suggesting there is an easy answer to this.  In fact, I'm
suggesting there isn't.  And that any suggestion that there is isn't.

:   - This means that there is never a BOM at the start of a file. BOMs could
:     be ignored by special new Unicode programs, but they are definitely
:     not ignored by the many existing ASCII programs. Adding a
:     BOM would break a tremendous amount of things and would violate the
:     prime directive, as BOMs are definitely not ASCII compatible.

I don't like BOMs either, in case you missed that.  Of course, I loathe
UTF-16 too, so that's not too terribly surprising.  Surrogate characters
are too pukey to contemplate.
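
For what it's worth, here is a hedged sketch of why a BOM and
byte-oriented programs don't mix: the three bytes 0xEF 0xBB 0xBF land
in front of whatever such a program expects to see first, and stripping
them is only trivial if every existing program is taught to do it.

    # A UTF-8 BOM ahead of a shebang line.  The kernel does not strip
    # it, so exec() would no longer recognize the "#!" at offset zero.
    my $data = "\xEF\xBB\xBF#!/bin/sh\n";
    $data =~ s/\A\xEF\xBB\xBF//;    # removing a leading BOM is the easy part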

:   - This means that lines in UTF-8 plaintext files are terminated
:     in one and only one way: 0x0a = LF. Neither U+2028 (line separator,
:     introduced for use inside *.doc-style word processing binary files)
:     nor overlong UTF-8 sequences for LF such as 0xC0 0x8A must be accepted
:     as line terminators, otherwise we would get into the horrible
:     scenario that programs start to disagree about what exactly a line
:     is (with a whole load of new security risks associated). Programs
:     such as "wc -l" must work on UTF-8 files without any modification
:     whatsoever! There is no reason to change the Unix line semantics when
:     moving from ASCII to UTF-8. U+2028 is treated just like any other
:     character and has no special meaning in a Unix plaintext file.

Fine by me, till someone asks to treat a file otherwise, in which case
they should be let.  What's more at issue is whether a *file* should be
able to request being treated otherwise, if we give the user the right
to request that files be given the right to request that they be so
treated.  Or some such.  :-)
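
A quick sketch (again leaning on the later Encode module) of both of
Kuhn's rules at once: a strict decoder refuses the overlong form of
LF, and counting lines remains a pure byte affair, which is exactly
what keeps "wc -l" working unmodified.

    use Encode qw(decode);

    my $overlong = "\xC0\x8A";       # overlong two-byte encoding of U+000A
    eval { decode('UTF-8', $overlong, Encode::FB_CROAK) };
    print "rejected\n" if $@;        # strict UTF-8 refuses overlong forms

    my $text  = "one\ntwo\nthree\n";
    my $lines = () = $text =~ /\x0A/g;   # 3, the same answer "wc -l" gives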

: How do applications find out that files are now in UTF-8? Simple
: applications such as cat and echo do not have to. For them UTF-8 is
: just like ASCII.

You oversimplify again.  Even "cat -v" has to know how to treat bytes
with the high bit set.  And "echo -e" probably wants a way to interpolate
characters larger than \nnn can express.
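
(Perl's own way out, for what it's worth, is the \x{...} escape from
the then-forthcoming 5.6 line, since octal \nnn can only name a single
byte; the binmode layer shown here is a later idiom still.)

    binmode STDOUT, ':utf8';   # let STDOUT carry multi-byte characters
    print "\x{263A}\n";        # U+263A WHITE SMILING FACE, three bytes in UTF-8
    print "\012";              # a plain LF, still expressible in octal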

: However, programs which count characters, position
: cursors, determine character classes, use regexp, etc. have to know
: about the file encoding, and there are well-established mechanisms to do
: that: they are told, preferably via established POSIX mechanisms
: (LC_CTYPE, LANG), or via other command line switches.

You have a major showstopper here as far as us Perl folks are
concerned.  Neither the environment nor the command line can be trusted
in a setuid situation.  The Perl community is for this reason
particularly leery of anything having to do with locales.  I noticed
that you frequently invoke the name of POSIX on your mailing list, but
that won't work here.  Around here people will actually shudder if you
say "POSIX".

: Ideally, all that should be necessary to turn a Unix installation into a
: pure UTF-8 system is the addition of the line
: 
:   export LC_CTYPE=UTF-8
: 
: in /etc/profile, plus conversion of the existing ISO 8859, JIS, KOI8,
: etc. files and file names.

No.  It is not ideal.  If you're going to have a kernel-wide switch,
then ideally the kernel should tell the process.  The environment
simply cannot be trusted, any historical POSIX botches to the contrary
notwithstanding.  You've been arguing for LC_CTYPE for several months
now.  I hope you haven't argued for it for so long that you can't see
its problems anymore.

As for Perl, although it will ideally keep everything as UTF-8
internally, it'll still be assuming that it has to know on an
interface-by-interface basis whether to expect UTF-8 or something
else.  Even on your idealized Linux, we'll still have to know what to
do with the sockets connected to the real world.  It is not so much
more of a stretch for us to decide on a file-by-file basis, using the
best available information.  On your ideal system, the best available
information might be that we should always guess files to be UTF-8.
That's fine.  But please don't use the environment to convey such
important, system-wide information.
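
What that looks like in practice is per-handle rather than system-wide
configuration.  Here is a sketch using the PerlIO layers that
eventually shipped well after this message, with hypothetical
filenames:

    open my $local, '<:encoding(UTF-8)', 'local.txt'
        or die "local.txt: $!";          # a file we trust to be UTF-8
    open my $wire, '<:raw', 'from-the-wire.dat'
        or die "from-the-wire.dat: $!";  # unknown provenance: leave it as bytes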

: Editors and terminal emulators will then
: activate their UTF-8 modes, email software will convert received
: messages from the indicated MIME character set into UTF-8 before saving
: them as a file, etc. We are not quite there yet, but that should be the
: long-term goal.

I would like that too.  But Perl has always been about getting from here
to there, and this is very much a getting-from-here-to-there problem.

Nevertheless, I do appreciate idealists--at least as long as they're
not collectivizing the peasants, some of whom were my third cousins
living in the Ukraine before they were starved to death.  So I feel I
owe it to them to be able to distinguish Unicode from Russian.

When the whole world joins your collective, I'll say I believed in it
all along.  :-)

Larry
