perl-unicode

Re: Correct use of UTF-8 under Unix

1999-11-02 03:50:16

Larry Wall wrote:

> Markus Kuhn writes:
> : I have just read through the list archive, and noted that a few people
> : might have some doubts about how UTF-8 is used under Unix.
>
> Well, I just read through your list archive, and I think you are more
> of an idealist than I can afford to be.  You keep saying, "If Plan 9
> can do a complete conversion, so can we."  But you'll notice that
> people aren't in fact using Plan 9, by and large.  Plan 9 is a research
> project.  It doesn't have millions of installations or millions of
> interconnections with other installations.
[...]

Thanks to Larry for this argument to deal with a mixed encoding environment!
I have tried to argue for this before, but you can say it much better.

> Don't get me wrong.  Perl will work fine in your idealized world.  But
> I intend it to work okay in the other world too.  I simultaneously try
> to keep my head in the clouds and my feet on the ground.  Sometimes
> it's a stretch, though.

Great line for a quote! :-)

I completely agree that a UTF-8 only system is an ideal, which we will not see
for quite a while.  Just unpacking a tar archive can give you files in any
encoding that it happens to contain.

So, what now?  I certainly would like Vim to be able to handle multiple
encodings.  The easy way out is to let the user set the 'fileencoding' option.
This is actually already working, with these values:
            ansi        default setting, good for most Western languages
            japan       set to use shift-JIS (Windows CP 932) encoding
            korea       set to use Korean DBCS
            prc         use simplified Chinese encoding
            taiwan      use traditional Chinese encoding

People in Korea especially are using this now.  UTF-8 should soon be added to
this list; someone has started work on it.

There should at least be a good default for 'fileencoding'.  Even better would
be if it could be set automatically when a file is opened.  Unfortunately,
I don't see a hint on how to do this in Larry's comments.  Only how _not_ to
do it.

I suppose the default could come from the system.  If it's not the
environment, then perhaps the terminal setting.  For Vim this should work
quite well, since the text needs to be displayed properly.  What the terminal
can display is a good hint for what a file may contain.  Not perfect though,
since a file could be converted when displaying it.
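
To make the "default from the system" idea concrete, here is a minimal sketch
in Python (chosen only for illustration; it is not how Vim does it).  It
assumes the usual POSIX precedence of LC_ALL, then LC_CTYPE, then LANG, and a
locale name of the form language_TERRITORY.codeset@modifier.  The function
name default_fileencoding and the "latin1" fallback are made up for this
example.

```python
def default_fileencoding(env, fallback="latin1"):
    """Guess a default file encoding from locale-style environment variables.

    `env` is any mapping, for example os.environ. The fallback value is an
    assumption for this sketch, not a Vim default.
    """
    # Assumed precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if not value:
            continue  # an unset or empty variable is skipped
        # Drop a trailing "@modifier" such as "@euro".
        locale_name = value.split("@", 1)[0]
        if "." in locale_name:
            # Everything after the first dot is the codeset.
            return locale_name.split(".", 1)[1]
        # The first variable that is set wins, even if it names no codeset.
        return fallback
    return fallback
```

Calling default_fileencoding(os.environ) would give the guess for the current
session.  As noted above, this is only a hint: it says what the terminal is
expected to display, not what a particular file actually contains.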

> :   - This means that there is never a BOM at the start of a file. BOMs could
> :     be ignored by special new Unicode programs, but they are definitely
> :     not ignored by the many existing ASCII programs. Adding a
> :     BOM would break a tremendous amount of things and would violate the
> :     prime directive, as BOMs are definitely not ASCII compatible.
>
> I don't like BOMs either, in case you missed that.  Of course, I loathe
> UTF-16 too, so that's not too terribly surprising.  Surrogate characters
> are too pukey to contemplate.

I agree that a BOM in an otherwise ASCII file is evil.  But what about when the
file contains Unicode characters >= 0x80?  If you can handle those, wouldn't
you also be able to handle the BOM?  UCS-2 and UCS-4 files aren't ASCII
compatible anyway, so you could add a BOM to them without breaking anything.
The BOM must be a valid, non-printing, zero-width Unicode character, of
course.  I believe it does exist for Unicode (sorry, couldn't find it right
now...).  I don't know about the other encodings.
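
For reference, the character in question is U+FEFF, which Unicode names ZERO
WIDTH NO-BREAK SPACE.  The short Python check below (Python is just an
illustration choice, and the variable names are made up) shows its byte forms.
It uses Python's big-endian UTF-16 and UTF-32 codecs as stand-ins for UCS-2
and UCS-4, which give the same bytes for this character.

```python
import unicodedata

BOM_CHAR = "\ufeff"

# Official Unicode character name for U+FEFF.
BOM_NAME = unicodedata.name(BOM_CHAR)

# Byte sequences of the same character in three encodings.
BOM_AS_UTF8 = BOM_CHAR.encode("utf-8")       # three bytes: EF BB BF
BOM_AS_UCS2_BE = BOM_CHAR.encode("utf-16-be")  # two bytes: FE FF
BOM_AS_UCS4_BE = BOM_CHAR.encode("utf-32-be")  # four bytes: 00 00 FE FF
```

Note that the UTF-8 form consists entirely of bytes above 0x7F, so it is
exactly the kind of data an ASCII-only program will not ignore.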

The rule could be:
- ASCII only or 8 bit encoding: no BOM
- non-ASCII characters present: add a BOM

This would work for Vim.  When reading a file, the BOM can be used to set the
'fileencoding' option.  Without the BOM it can use a default encoding, or use
another method to detect the encoding.  When writing a file, Vim can check if
non-ASCII characters are present, and prepend a BOM only when needed.
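
A rough sketch of that read/write rule, in Python for illustration only.  The
function names read_with_bom and write_with_bom and the "latin1" default are
invented for this example, and the writer is simplified to always use UTF-8
when non-ASCII characters are present; an editor would of course write in
whatever 'fileencoding' is set to.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_with_bom(data, default="latin1"):
    """Return (encoding, text). A leading BOM selects the encoding and is
    removed from the text; without a BOM the default encoding is used."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)
    return default, data.decode(default)


def write_with_bom(text):
    """Return the bytes to write: no BOM for pure ASCII text, a BOM plus
    UTF-8 when any non-ASCII character is present."""
    if text.isascii():
        return text.encode("ascii")
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

With this, a pure ASCII file round-trips byte for byte, so existing ASCII
tools never see a BOM.  The weak spot is the no-BOM case: a file in an 8-bit
encoding other than the default is still decoded with the default, which is
where "another method to detect the encoding" would have to come in.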

Now, I'm sure the reaction will be that this isn't a good idea.  Let's hear
the arguments; perhaps the problems can be solved.

--
ERROR 047: Keyboard not found.  Press RETURN to continue.

--/-/---- Bram Moolenaar ---- Bram(_at_)moolenaar(_dot_)net ---- Bram(_at_)vim(_dot_)org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /