perl-unicode

Re: Use of UTF-8 under Perl and Unix

1999-11-02 16:09:46

Markus Kuhn wrote:

[Sorry for letting this grow so long, but I didn't want to cut out too much of
Markus's original text and risk getting the context wrong]

Bram Moolenaar wrote on 1999-11-02 11:53 UTC:
The rule could be:
- ASCII only or 8 bit encoding: no BOM
- non-ASCII characters present: add a BOM

But BOMs get lost far too easily under Unix. How do you want to preserve
them across cut&paste, &c.?

Cut & Paste with X-windows includes an indication of the encoding.  You can
create your own when needed.  Vim already uses VIM_TEXT to pass on
information about line/character/blockwise selection.  So I should at least be
able to use VIM_ENCODED_TEXT between two Vims, and include an indication of
the encoding.  But there probably is a standard for several encodings, so that
this works between different programs.

Is there other Cut&Paste to worry about?  Other situations where the BOM could
be lost?  I couldn't think of one.  Internally Vim would use the 'fileencoding'
setting, so that you can edit several buffers in different encodings at the
same time.  When writing a file, a BOM might be added, based on the
'fileencoding' option and perhaps on non-ASCII characters being present in the
buffer.

This would work for Vim.  When reading a file, the BOM can be used to set
the 'fileencoding' option.  Without the BOM it can use a default encoding,
or use another method to detect the encoding.  When writing a file, Vim
can check if non-ASCII characters are present, and prepend a BOM only when
needed.
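
For illustration, a minimal sketch in Perl of the reading side: the three bytes
EF BB BF are the UTF-8 encoding of U+FEFF, and everything else here (how an
editor would map the result to a 'fileencoding'-like setting) is just an
assumption:

    use strict;

    # Peek at the first three bytes of a file; if they are the UTF-8
    # encoded BOM (U+FEFF), treat the file as UTF-8, otherwise fall back
    # to a default encoding or some other detection method.
    sub detect_bom {
        my ($path) = @_;
        open(my $fh, '<', $path) or die "cannot open $path: $!";
        binmode($fh);
        read($fh, my $head, 3);
        close($fh);
        return (defined $head && $head eq "\xEF\xBB\xBF") ? 'utf-8' : undef;
    }

    my $enc = detect_bom($ARGV[0]) || 'default';   # e.g. set 'fileencoding'
    print "use encoding: $enc\n";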

Then why use the very restricted Unicode BOMs, which can only signal the
various Unicode encodings, but nothing else?  ISO 2022 provides ESC
sequences that you can place at the start of a file to signal EVERY
encoding in the ECMA registry. Several hundred different ASCII
extensions have registered ISO 2022 codes to announce them. If you want
to have a stateful encoding with all its ugliness, then better say so by
admitting that what you really want is ISO 2022. ISO 2022 is in no way
worse than BOMs. It has exactly the same problems.
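
For concreteness, two of those registered escape sequences, written out as
byte strings in Perl (only a sketch, to show what the announcers look like):

    # ESC % G  -- switch the data stream to UTF-8 ("other coding system")
    my $switch_to_utf8   = "\x1B\x25\x47";
    # ESC - A  -- designate the ISO 8859-1 right-hand part (ISO-IR 100) as G1
    my $designate_latin1 = "\x1B\x2D\x41";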

I don't know ISO 2022.  The term "ESC sequences" worries me.  Does this mean
it is not a single Unicode character, but a sequence of Unicode characters?
How many programs would interpret this as being part of the actual text,
instead of ignoring it?  That would be bad.  You would in fact have created a
new file type, which causes more trouble than it solves.  Hopefully I'm wrong
here.

A "grep hello *.txt" still won't work with either BOMs or ISO 2022 unless
grep is converted into a very different piece of software.

I don't understand this one.  I see three possibilities:
1. If grep doesn't know anything about encoding, then it will work with or
   without BOM in the same way (except perhaps accidentally matching the BOM
   itself, but that is very unlikely to happen; "hello" will certainly not be
   a problem).
2. If it does know about encoding but not about BOMs, then it should handle
   the BOM as a non-printing zero-width character and thus ignore it (see the
   sketch below).
3. If it does know about encoding and BOMs, it will be able to do everything
   correctly, and perhaps do clever matching of some character sets, convert
   the regexp to the recognized encoding, etc.

So why would grep not work with BOMs?
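
To make case 2 above concrete, a minimal sketch of a grep-like loop that drops
a leading BOM before matching.  This assumes a Perl with PerlIO encoding
layers; it is only an illustration, not how grep itself works:

    use strict;

    # A toy grep: decode each line as UTF-8, drop a leading U+FEFF (the
    # BOM, a zero-width no-break space), then match as usual.
    binmode(STDIN,  ':encoding(UTF-8)');
    binmode(STDOUT, ':encoding(UTF-8)');
    my $pattern = qr/hello/;
    while (my $line = <STDIN>) {
        $line =~ s/\A\x{FEFF}//;        # ignore the BOM if present
        print $line if $line =~ $pattern;
    }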

Don't misunderstand me: I don't recommend the use of ISO 2022. All I say
is that ISO 2022 is a much better mechanism than BOMs for declaring the
encoding of the following text.

I'll look around for info on ISO 2022.  Note that I'm using BOM here as a way
to recognize the file type for UTF-8 files, nothing else (byte order isn't
relevant for UTF-8, right?).

If you receive files via transport mechanisms that have no well-defined
text encoding (e.g., tar, ftp, etc.), then we have to do what we always
have done under these circumstances: By manual intervention, we make
sure that the characters are displayed correctly. So far, this manual
intervention came in the form of selecting the right font for xterm, in
the future it will come in the form of selecting the right conversion
table to Unicode, assuming that more and more applications (xterm, perl,
etc.) will prefer to process text directly in Unicode. To some degree
you can do autodetection, but not to the degree where it makes manual
selection unnecessary. UTF-8 can be fairly easily autodetected, by
checking whether no malformed UTF-8 sequences are present. The various
ISO 8859-* sets on the other hand cannot be autodetected without
something just short of a full spell-checker. ISO 8859-* files
would need additional marking much more urgently than UTF-8, therefore I
have real problems seeing any practical advantages of using BOMs (an
ugly hack copied from a CJK standard into Unicode to make a committee
happy that couldn't agree about endian issues), which can only mark
Unicode files.
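
For reference, a sketch of such a malformed-sequence check in Perl, using the
well-known byte pattern for well-formed UTF-8 (overlong forms and the UTF-16
surrogate range rejected):

    # Return true if the byte string contains only well-formed UTF-8.
    sub looks_like_utf8 {
        my ($bytes) = @_;
        return $bytes =~ /\A(?:
              [\x00-\x7F]                       # ASCII
            | [\xC2-\xDF][\x80-\xBF]            # 2-byte sequences
            | \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
            | \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, excluding overlongs
            | [\xF1-\xF3][\x80-\xBF]{3}
            | \xF4[\x80-\x8F][\x80-\xBF]{3}     # up to U+10FFFF
        )*\z/x;
    }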

The practical purpose of the BOM is that I can unpack that tar file and edit
the files without doing any recoding.  This is an ideal, and it will only work
if all non-ASCII text files are properly marked for the encoding they contain.
In practice this will only work for some files.  I was hoping that UTF-8 files
could at least be recognized, even when there is only one UTF-8 character in
an otherwise ASCII text.  Autodetection won't work reliably then; it might
just as well be some ISO 8859-* text or MS-DOS text.  In fact, if I have
understood this well, you can only detect that a file is _not_ UTF-8.

Now, I'm sure the reaction is: Just use UTF-8 for those files, and forget
about the BOM.  Then the question remains: How do I know these _are_ UTF-8
files?  They might mostly contain plain ASCII, which can make autodetection
fail.  If there is anything I hate, it's something that breaks in rare
cases.

Think about this situation: You have three types of files: ASCII, some
ISO 8859 standard that you happen to encounter often, and UTF-8.  If we can
recognize UTF-8 files reliably, we could recognize files completely
automatically in this situation.  Wouldn't that be nice?  I believe this
situation is true for many of us: already using ISO 8859-1 for many files,
and making the transition to UTF-8 bit by bit.
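
A sketch of that three-way split, reusing the looks_like_utf8() check sketched
above (which 8-bit set to fall back on is of course an assumption):

    # Classify a byte string: pure ASCII, well-formed UTF-8, or "legacy"
    # (assume the local ISO 8859-* default in that case).
    sub guess_encoding {
        my ($bytes) = @_;
        return 'ascii'  if $bytes !~ /[\x80-\xFF]/;
        return 'utf-8'  if looks_like_utf8($bytes);
        return 'latin1';    # or whatever 8-bit set you use most
    }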

I am perfectly aware that we will live for some time in a mixed
environment. I have myself been using such a mixed environment for a
long time. Many of my files are ISO 8859-15 encoded, many others are
UTF-8 encoded, some are CP437 encoded, others ISO 8859-7. On my window
manager, I have two options to start up xterm: One is in 8-bit mode with
an ISO 8859-1 locale (that is the xterm that I use to work on 8859-15
files), the other is in -u8 mode with an UTF-8 locale (that is what I
use to work on my UTF-8 files). Note: With "locale", I mean *not* just
POSIX locales (LANG), but also all the other environment variables, shell
aliases, and xrdb settings that affect my applications! I have separated
8859-15 and UTF-8 files in different directories and manually take care
of not mixing them up too much. I am looking forward to the day when
emacs and vim can operate in an UTF-8 mode, because then I will convert
most of my files and will more and more rarely use the ISO 8859-15 xterm
environment.

I'm glad you are aware of this situation.  I'm sure many people are in a
situation like this, more or less.  I remember not having any problems, except
that one letter (\xE9) in the company name kept showing up wrong in reports...
The only solution for this is conversion to UTF-8, I can agree on that.
And it certainly helps if we can recognize a UTF-8 file reliably, so that it
stands out from the "old" formats by actually being user-friendly.

When I unpack tar files, I assume that they are only in ASCII (which
works 99% of the time). Only when I encounter non-ASCII characters do I
manually start deciding what the character encoding most likely is and
then either switch to the appropriate xterm environment or more likely
call iconv or recode to convert the file to a more convenient encoding.
It doesn't matter which environment I use initially, because as long as
the tar file contained only ASCII, I do not notice any difference
between the two environments, and very few tar files I encounter contain
non-ASCII characters, and if they do, it is usually clear from the
context what the most likely encoding is.

You must be downloading a limited kind of tar file.  Try unpacking some of
the Japanese rpms...  I also have a set of CDs of which I can read the 5%
that's English.  Of course, I don't use these files, because I can't read them
anyway.  But if I were Asian, I would want to read them.  The same
is true for anyone speaking a foreign language and getting applications or
text files over the internet.  I'm lucky to only be able to read languages
that fit into ISO 8859-1. :-)

People who think that I live in some UTF-8 only dreamworld severely
misunderstood me, because I practice the peaceful coexistence of ISO
8859-1, ISO 8859-15, ISO 8859-7, CP437/850, and ISO 10646/UTF-8 every
day.

I did misunderstand you, indeed.  We share the suffering...

I am not sure what a charset autodetection facility in vim would give
me, because this still doesn't provide autodetection for the many other
tools that I use. Vim is just a tiny part of the entire toolkit. If vim
autodetects UTF-8, this is of no use to me as long as I haven't manually
chosen an appropriate character set for xterm, started xterm in UTF-8
mode, told the applications that read vim's output what the encoding is, &c.

If you are using an xterm in UTF-8 mode, wouldn't it be possible to display
any encoding, by converting the text when displaying it?  That is, the file
and text buffer could be in some ISO 8859 encoding, and Vim would convert the
text to UTF-8 before sending it to the xterm.  I'm not sure if this will be
implemented soon (we first need _any_ UTF-8 implementation), but it should be
possible.
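
For the Latin-1 case the conversion step is trivial, since every ISO 8859-1
byte maps to the Unicode code point with the same value; a minimal sketch:

    # Convert a buffer of ISO 8859-1 bytes to UTF-8 before writing it to
    # a terminal running in UTF-8 mode.  Each byte >= 0x80 becomes two
    # UTF-8 bytes; ASCII passes through unchanged.
    sub latin1_to_utf8 {
        my ($text) = @_;
        $text =~ s/([\x80-\xFF])/
            chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F))
        /gex;
        return $text;
    }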

Based on my practical experience with switching between different
encodings every day (and supporting fellow international students with a
huge variety of email character requirements (Greek, Thai, Arabic,
Korean, Chinese, French, German, &c.)), I am extremely sceptical about
the usability of adding automatic character-set detection to the Unix
plain text semantics. I like to control file encoding issues manually
until we get to a state where using only UTF-8 becomes feasible.

It seems we agree at least on the part of automatic detection not being
reliable enough.  Which is exactly why it would be so nice if UTF-8 files
_can_ be detected reliably!  Sorry, I'm repeating myself...

As for Perl, my wish list would be:

  - allow the use of UTF-8 as the only internal string encoding
  - make sure the regex and string manipulation functions can deal with
    UTF-8 strings well
  - provide an easy facility to put highly-configurable converters onto
    every I/O path that Perl supports, including a library of good
    many-to-one conversion tables
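
A sketch of what such converters on an I/O path could look like, written
against the ':encoding()' PerlIO layers and the Encode module that Perl later
gained in 5.8 (the file names are just placeholders):

    use strict;
    use Encode qw(decode encode);

    # Read Latin-1, work with Perl's internal Unicode strings, write UTF-8.
    open(my $in,  '<:encoding(ISO-8859-1)', 'input.txt')  or die "input.txt: $!";
    open(my $out, '>:encoding(UTF-8)',      'output.txt') or die "output.txt: $!";
    while (my $line = <$in>) {
        print {$out} $line;            # conversion happens in the I/O layers
    }

    # The same conversion on an in-memory string:
    my $unicode = decode('ISO-8859-1', "caf\xE9");
    my $bytes   = encode('UTF-8', $unicode);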

This will fail if you don't know the encoding for one I/O path.  You can only
keep the original text in that case, which breaks the above rule to only use
UTF-8...  Perhaps you could assume some encoding and make sure it gets
converted back to the original encoding without loss of info?  Then you would
at least need to remember what conversion was used.  Complicated...

  - If you do some autodetection, make sure that the detection of an encoding
    and the activation of automatic conversion are two separate issues
    that are fully under the user's control. For instance, I could imagine
    a number of library functions that

      - Check for malformed UTF-8 sequences
      - Check for various types of BOMs
      - Check for various types of ISO 2022 announcers
      - Cut good example spots out of a long string of unknown encoding,
        convert them to UTF-8 under a list of candidate encodings, and
        present them to the user for manual selection of the most likely
        encoding.

Can't do that for Perl programs that run non-interactively.  Manual selection
can be _very_ inconvenient, unless there is a real user that has time,
knowledge and patience to answer the question of what encoding this is.  I
don't know many of those users...

For instance, if I receive a file that might be either ISO 8859-1 or ISO
8859-15, then there should be a Perl function that cuts out a few
example words of this file that contain characters where ISO 8859-1 and
ISO 8859-15 differ, such that the user can decide based on a display of
these example characters in various decodings what the most likely
encoding was. If the file does not contain any of the characters in
which say ISO 8859-1 and ISO 8859-15 differ, then the question whether
the file was encoded in ISO 8859-1 or ISO 8859-15 is obviously
irrelevant for converting it to something else.
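
A sketch of that "example spot" idea for this particular pair: ISO 8859-1 and
ISO 8859-15 differ in exactly eight code positions (A4, A6, A8, B4, B8, BC,
BD, BE hex), so only lines containing one of those bytes are worth showing to
the user.  Again this assumes the Encode module of later Perls:

    use strict;
    use Encode qw(decode);

    binmode(STDOUT, ':encoding(UTF-8)');

    # Bytes at which ISO 8859-1 and ISO 8859-15 disagree.
    my $differs = qr/[\xA4\xA6\xA8\xB4\xB8\xBC\xBD\xBE]/;

    # Show a few candidate lines under both decodings, so a user can pick
    # the one that looks right.
    my $shown = 0;
    while (my $line = <>) {
        next unless $line =~ $differs;
        print "as ISO 8859-1:  ", decode('ISO-8859-1',  $line);
        print "as ISO 8859-15: ", decode('ISO-8859-15', $line);
        last if ++$shown >= 5;      # a few example spots are enough
    }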

I've never seen it done this way.  Most programs have a setting for which
encoding to use.  The text is displayed without asking anything.  If it's
wrong, the user should have some way to change the setting.
Can you imagine Netscape presenting you with a list of encodings to select
from, each time you open a page that looks like some ISO 8859 text?
No, that would only annoy people.
Sorry Markus, I don't see this manual selection as a feasible solution.
Only when it's the user who initiates it would it be useful.
Thus, in Netscape you would have some "choose encoding" dialog, which shows
the result of various alternatives.  Yes, that would be nice.  But not as an
automatically appearing dialog, that is my point.

Sorry about the length...

--
hundred-and-one symptoms of being an internet addict:
64. The remote to the T.V. is missing...and you don't even care.

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /